
RAID: High-Performance, Reliable Secondary Storage

PETER M. CHEN

Department of Electrical Engineering and Computer Science, 1301 Beal Avenue,

University of Michigan, Ann Arbor, Michigan, 48109-2122

EDWARD K. LEE

DEC Systems Research Center, 130 Lytton Avenue, Palo Alto, California 94301-1044

GARTH A. GIBSON

School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue,

Pittsburgh, Pennsylvania 15213-3891

RANDY H. KATZ

Department of Electrical Engineering and Computer Science, 571 Evans Hall,

University of California, Berkeley, California 94720

DAVID A. PATTERSON

Department of Electrical Engineering and Computer Science, 571 Evans Hall,

University of California, Berkeley, California 94720

Disk arrays were proposed in the 1980s as a way to use parallelism between multiple

disks to improve aggregate I/O performance. Today they appear in the product lines of

most major computer manufacturers. This article gives a comprehensive overview of

disk arrays and provides a framework in which to organize current and future work.

First, the article introduces disk technology and reviews the driving forces that have

popularized disk arrays: performance and reliability. It discusses the two architectural

techniques used in disk arrays: striping across multiple disks to improve performance

and redundancy to improve reliability. Next, the article describes seven disk array

architectures, called RAID (Redundant Arrays of Inexpensive Disks) levels 0 through 6, and

compares their performance, cost, and reliability. It goes on to discuss advanced

research and implementation topics such as refining the basic RAID levels to improve

performance and designing algorithms to maintain data consistency. Last, the article

describes six disk array prototypes or products and discusses future opportunities for

research, with an annotated bibliography of disk array-related literature.

Categories and Subject Descriptors: B.4.2 [Input/Output and Data

Communications]: Input/Output Devices; B.4.5 [Input/Output and Data

Communications]: Reliability, Testing, and Fault-Tolerance; D.4.2 [Operating

Systems]: Storage Management; E.4 [Data]: Coding and Information Theory;

General Terms: Design, Performance, Reliability

Additional Key Words and Phrases: Disk array, parallel I/O, RAID, redundancy,

storage, striping

Permission to copy without fee all or part of this material is granted provided that the copies are not made

or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1994 ACM 0360-0300/94/0600-0145 $03.50

ACM Computing Surveys, Vol. 26, No. 2, June 1994


CONTENTS

1. INTRODUCTION
2. BACKGROUND
   2.1 Disk Terminology
   2.2 Data Paths
   2.3 Technology Trends
3. DISK ARRAY BASICS
   3.1 Data Striping and Redundancy
   3.2 Basic RAID Organizations
   3.3 Performance and Cost Comparisons
   3.4 Reliability
   3.5 Implementation Considerations
4. ADVANCED TOPICS
   4.1 Improving Small Write Performance for RAID Level 5
   4.2 Declustered Parity
   4.3 Exploiting On-Line Spare Disks
   4.4 Data Striping in Disk Arrays
   4.5 Performance and Reliability Modeling
5. CASE STUDIES
   5.1 Thinking Machines Corporation ScaleArray
   5.2 StorageTek Iceberg 9200 Disk Array Subsystem
   5.3 NCR 6298
   5.4 TickerTAIP/DataMesh
   5.5 The RAID-II Storage Server
   5.6 IBM Hagar Disk Array Controller
6. OPPORTUNITIES FOR FUTURE RESEARCH
   6.1 Experience with Disk Arrays
   6.2 Interaction among New Organizations
   6.3 Scalability, Massively Parallel Computers, and Small Disks
   6.4 Latency
7. CONCLUSIONS
ACKNOWLEDGMENTS
ANNOTATED BIBLIOGRAPHY

1. INTRODUCTION

In recent years, interest in RAID, Redundant Arrays of Inexpensive Disks,¹ has grown explosively. The driving force behind this phenomenon is the sustained exponential improvements in the performance and density of semiconductor technology. Improvements in semiconductor technology make possible faster microprocessors and larger primary memory systems, which in turn require

¹Because of the restrictiveness of "Inexpensive," RAID is sometimes said to stand for "Redundant Arrays of Independent Disks."

larger, higher-performance secondary storage systems. These improvements have both quantitative and qualitative consequences.

On the quantitative side, Amdahl's Law [Amdahl 1967] predicts that large improvements in microprocessors will result in only marginal improvements in overall system performance unless accompanied by corresponding improvements in secondary storage systems. Unfortunately, while RISC microprocessor performance has been improving 50% or more per year [Patterson and Hennessy 1994, p. 27], disk access times, which depend on improvements of mechanical systems, have been improving less than 10% per year. Disk transfer rates, which track improvements in both mechanical systems and magnetic-media densities, have improved at the faster rate of approximately 20% per year, but this is still far slower than the rate of processor improvement. Assuming that semiconductor and disk technologies continue their current trends, we must conclude that the performance gap between microprocessors and magnetic disks will continue to widen.
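To make the argument concrete, consider Amdahl's Law with an assumed workload split; the 10% I/O fraction below is an illustrative assumption, not a figure taken from this article.

```latex
% Amdahl's Law: overall speedup S when a fraction f of the work is sped up by a factor k.
S = \frac{1}{(1 - f) + f/k}
% Assumed example: 90% of the time is CPU work sped up 10x, 10% is unimproved I/O:
S = \frac{1}{0.1 + 0.9/10} \approx 5.3
```

Under these assumed numbers, a tenfold processor improvement buys only about a fivefold system speedup unless the storage system improves as well.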

In addition to the quantitative effect, a second, perhaps more important, qualitative effect is driving the need for higher-performance secondary storage systems. As microprocessors become faster, they make possible new applications and greatly expand the scope of existing applications. In particular, image-intensive applications such as video, hypertext, and multimedia are becoming common. Even in existing application areas such as computer-aided design and scientific computing, faster microprocessors make it possible to tackle new problems requiring faster access to larger data sets. This shift in applications, along with a trend toward large, shared, high-performance, network-based storage systems, is causing us to reevaluate the way we design and use secondary storage systems [Katz 1992].

Disk arrays, which organize multiple, independent disks into a large, high-performance logical disk, are a natural solution to the problem. Disk arrays stripe data across multiple disks and access them in parallel to achieve both higher data transfer rates on large data accesses and higher I/O rates on small data accesses [Salem and Garcia-Molina 1986; Livny et al. 1987]. Data striping also results in uniform load balancing across all of the disks, eliminating hot spots that otherwise saturate a small number of disks while the majority of disks sit idle.

Large disk arrays are, however, highly vulnerable to disk failures. A disk array with 100 disks is 100 times more likely to fail than a single-disk array. An MTTF (mean-time-to-failure) of 200,000 hours, or approximately 23 years, for a single disk implies an MTTF of 2,000 hours, or approximately three months, for a disk array with 100 disks. The obvious solution is to employ redundancy in the form of error-correcting codes to tolerate disk failures. This allows a redundant disk array to avoid losing data for much longer than an unprotected single disk. However, redundancy has negative consequences. Since all write operations must update the redundant information, the performance of writes in redundant disk arrays can be significantly worse than the performance of writes in nonredundant disk arrays. Also, keeping the redundant information consistent in the face of concurrent I/O operations and system crashes can be difficult.

A number of different data-striping and redundancy schemes have been developed. The combinations and arrangements of these schemes lead to a bewildering set of options for users and designers of disk arrays. Each option presents subtle tradeoffs among reliability, performance, and cost that are difficult to evaluate without understanding the alternatives. To address this problem, this article presents a systematic tutorial and survey of disk arrays. We describe seven basic disk array organizations along with their advantages and disadvantages and compare their reliability, performance, and cost. We draw attention to the general principles governing the design and configuration of disk

arrays, as well as practical issues that must be addressed in the implementation of disk arrays. A later section describes optimizations of and variations on the seven basic disk array organizations. Finally, we discuss existing research in the modeling of disk arrays and fruitful avenues for future research. This article should be of value to anyone interested in disk arrays, including students, researchers, designers, and users of disk arrays.

2. BACKGROUND

This section provides basic background material on disks, I/O data paths, and disk technology trends for readers who are unfamiliar with secondary storage systems.

2.1 Disk Terminology

Figure 1 illustrates the basic components of a simplified magnetic disk drive. A disk consists of a set of platters coated with a magnetic medium rotating at a constant angular velocity and a set of disk arms with magnetic read/write heads that are moved radially across the platters' surfaces by an actuator. Once the heads are correctly positioned, data is read and written in small arcs called sectors on the platters' surfaces as the platters rotate relative to the heads. Although all heads are moved collectively, in almost every disk drive, only a single head can read or write data at any given time. A complete circular swath of data is referred to as a track, and each platter's surface consists of concentric rings of tracks. A vertical collection of tracks at the same radial position is logically referred to as a cylinder. Sectors are numbered so that a sequential scan of all sectors traverses the entire disk in the minimal possible time.

Given the simplified disk described above, disk service times can be broken into three primary components: seek time, rotational latency, and data transfer time. Seek time is the amount of time needed to move a head to the correct radial position and typically ranges from



Figure 1. Disk terminology. Heads reside on arms which are positioned by actuators. Tracks are concentric rings on a platter. A sector is the basic unit of reads and writes. A cylinder is a stack of tracks at one actuator position. An HDA (head-disk assembly) is everything in the figure plus the air-tight casing. In some devices it is possible to transfer data from multiple surfaces simultaneously, but this is both rare and expensive. The collection of heads that participate in a single logical transfer that is spread over multiple surfaces is called a head group.

1 to 30 milliseconds depending on the seek distance and the particular disk. Rotational latency is the amount of time needed for the desired sector to rotate under the disk head. Full rotation times for disks currently vary from 8 to 28 milliseconds. The data transfer time is dependent on the rate at which data can be transferred to/from a platter's surface and is a function of the platter's rate of rotation, the density of the magnetic media, and the radial distance of the head from the center of the platter; some disks use a technique called zone-bit-recording to store more data on the longer outside tracks than the shorter inside tracks. Typical data transfer rates range from 1 to 5 MB per second. The seek time and rotational latency are sometimes collectively referred to as the head-positioning time. Table 1 tabulates the statistics for a typical high-end disk available in 1993.

The slow head-positioning time and fast data transfer rate of disks lead to very different performance for a sequence of accesses depending on the size and relative location of each access. Suppose we need to transfer 1 MB from the disk in Table 1, and the data is laid out in two ways: sequential within a single cylinder or randomly placed in 8 KB blocks. In either case the time for the

Table 1. Specifications for the Seagate ST43401N Elite-3 SCSI Disk Drive

Form Factor/Disk Diameter                   5.25 inch
Capacity                                    2.8 GB
Cylinders                                   2627
Tracks Per Cylinder                         21
Sectors Per Track                           ~99
Bytes Per Sector                            512
Full Rotation Time                          11.1 ms
Minimum Seek (single cylinder)              1.7 ms
Average Seek (random cylinder to cylinder)  11.0 ms

Average seek in this table is calculated assuming a uniform distribution of accesses. This is the standard way manufacturers report average seek times. In reality, measurements of production systems show that spatial locality significantly lowers the effective average seek distance [Hennessy and Patterson 1990, p. 559].

actual data transfer of 1 MB is about 200 ms. But the time for positioning the head goes from about 16 ms in the sequential layout to about 2000 ms in the random layout. This sensitivity to the workload is why I/O-intensive applications are categorized as high data rate, meaning minimal head positioning via large, sequential accesses, or high I/O rate, meaning lots of head positioning via small, more random accesses. For example, scientific programs that manipulate large arrays of data fall in the high data rate category, while transaction-processing programs fall in the high I/O rate category.
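The figures above can be reproduced with a rough service-time model; the transfer rate and per-access positioning time used below are rounded values consistent with the text and Table 1, not exact drive parameters.

```python
# Back-of-the-envelope service-time model for reading 1 MB (values are rough).
TRANSFER_RATE_MB_S = 5.0      # assumed sustained transfer rate (MB/s)
POSITIONING_MS = 16.0         # assumed seek + rotational latency per access (ms)

transfer_ms = 1.0 / TRANSFER_RATE_MB_S * 1000        # ~200 ms in either layout

sequential_ms = POSITIONING_MS + transfer_ms         # one positioning delay
random_blocks = 1024 // 8                            # 1 MB in 8 KB blocks = 128 accesses
random_ms = random_blocks * POSITIONING_MS + transfer_ms   # ~2000 ms of positioning

print(f"sequential ~{sequential_ms:.0f} ms, random ~{random_ms:.0f} ms")
```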

2.2 Data Paths

A hierarchy of industry-standard interfaces has been defined for transferring data recorded on a disk platter's surface to or from a host computer. In this section we review the complete data path, from a disk to a user's application (Figure 2). We assume a read operation for the purposes of this discussion.

On the disk platter's surface, information is represented as reversals in the direction of stored magnetic fields. These "flux reversals" are sensed, amplified, and digitized into pulses by the lowest-level read electronics. The protocol ST506/412 is one standard that defines an interface to disk systems at this lowest, most inflexible, and technology-dependent level. Above this level of the read electronics path, pulses are decoded to separate data bits from timing-related flux reversals. The bit-level ESDI (Enhanced Small Device Interface) and SMD (Storage Module Interface) standards define an interface at this more flexible, encoding-independent level. Then, to be transformed into the highest, most flexible, packet level, these bits are aligned into bytes, error-correcting codes are applied, and the extracted data is delivered to the host as data blocks over a peripheral bus interface such as SCSI (Small Computer Standard Interface) or IPI-3 (the third level of the Intelligent Peripheral Interface). These steps are performed today by intelligent on-disk controllers, which often include speed matching and caching "track buffers." SCSI and IPI-3 also include a level of data mapping: the computer specifies a logical block number, and the controller embedded on the disk maps that block number to a physical cylinder, track, and sector. This mapping allows the embedded disk controller to avoid bad areas of the disk by remapping logical blocks that are affected to new areas of the disk.
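The logical-to-physical mapping mentioned above can be sketched as follows. This assumes a fixed geometry (no zone-bit recording) and ignores the bad-block remapping that real embedded controllers also perform; the constants are the approximate values from Table 1.

```python
SECTORS_PER_TRACK = 99      # approximate, from Table 1
TRACKS_PER_CYLINDER = 21    # from Table 1

def logical_to_physical(block_number: int):
    """Map a logical block number to a (cylinder, track, sector) triple."""
    sectors_per_cylinder = SECTORS_PER_TRACK * TRACKS_PER_CYLINDER
    cylinder = block_number // sectors_per_cylinder
    track = (block_number % sectors_per_cylinder) // SECTORS_PER_TRACK
    sector = block_number % SECTORS_PER_TRACK
    return cylinder, track, sector

print(logical_to_physical(123456))   # -> (59, 8, 3)
```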

Topologies and devices on the data path between disk and host computer vary widely depending on the size and type of I/O system. Mainframes have the richest I/O systems, with many devices and complex interconnection schemes to access them. An IBM channel path, which encompasses the set of cables and associated electronics that transfer data and control information between an I/O device and main memory, consists of a channel, a storage director, and a head of string. The collection of disks that share the same pathway to the head of string is called a string. In the workstation/file server world, the channel processor is usually called an I/O controller or host-bus adaptor (HBA), and the functionality of the storage director and head of string is contained in an embedded controller on the disk drive. As in the mainframe world, the use of high-level peripheral interfaces such as SCSI and IPI-3 allows multiple disks to share a single peripheral bus or string.

From the HBA, data is transferred via direct memory access, over a system bus such as VME (Versa Module Eurocard), S-Bus, MicroChannel, EISA (Extended Industry Standard Architecture), or PCI (Peripheral Component Interconnect), to the host operating system's buffers. Then, in most operating systems, the CPU performs a memory-to-memory copy over a high-speed memory bus from the operating system buffers to buffers in the application's address space.

2.3 Technology Trends

Much of the motivation for disk arrays comes from the current trends in disk technology. As Table 2 shows, magnetic disk drives have been improving rapidly by some metrics and hardly at all by other metrics. Smaller distances between the magnetic read/write head and the disk surface, more accurate positioning




Figure 2. Host-to-device pathways. Data that is read from a magnetic disk must pass through many layers on its way to the requesting processor. Each dashed line marks a standard interface. Lower interfaces such as ST506 deal more closely with the raw magnetic fields and are highly technology dependent. Higher layers such as SCSI deal in packets or blocks of data and are more technology independent. A string connects multiple disks to a single I/O controller; control of the string is distributed between the I/O and disk controllers.

electronics, and more advanced magnetic media have increased the recording density on the disks dramatically. This increased density has improved disks in two ways. First, it has allowed disk capacities to stay constant or increase, even while disk sizes have decreased from 5.25 inches in 1983 to 1.3 inches in 1993. Second, the increased density, along with an increase in the rotational speed of the disk, has made possible a substantial increase in the transfer rate of disk drives. On the other hand, seek times have improved very little, only decreasing from approximately 20 ms in 1980 to 10 ms today. Rotational speeds have increased at a similarly slow rate, from 3600 revolutions per minute in 1980 to 5400-7200 today.

3. DISK ARRAY BASICS

This section examines basic issues in the design and implementation of disk


Table 2. Trends in Disk Technology

                              1993                       Historical Rate of Improvement
Areal Density                 50-150 Mbits/sq. inch      27% per year
Linear Density                40,000-60,000 bits/inch    13% per year
Inter-Track Density
Capacity (3.5" form factor)
Transfer Rate
Seek Time

Magnetic disks are improving rapidly in density and capacity, but more slowly in performance. Areal density is the recording density per square inch of magnetic media. In 1989, IBM demonstrated a 1 Gbit/sq.-inch density in a laboratory environment. Linear density is the number of bits written along a track. Inter-track density refers to the number of concentric tracks on a single platter.

arrays. In particular, we examine the concepts of data striping and redundancy; basic RAID organizations; performance and cost comparisons between the basic RAID organizations; reliability of RAID-based systems in the face of system crashes, uncorrectable bit errors, and correlated disk failures; and finally, issues in the implementation of block-interleaved, redundant disk arrays.

3.1 Data Striping and Redundancy

Redundant disk arrays employ two orthogonal concepts: data striping for improved performance and redundancy for improved reliability. Data striping distributes data transparently over multiple disks to make them appear as a single fast, large disk. Striping improves aggregate I/O performance by allowing multiple I/Os to be serviced in parallel. There are two aspects to this parallelism. First, multiple independent requests can be serviced in parallel by separate disks. This decreases the queuing time seen by I/O requests. Second, single multiple-block requests can be serviced by multiple disks acting in coordination. This increases the effective transfer rate seen by a single request. The more disks in the disk array, the larger the potential performance benefits. Unfortunately, a large number of disks lowers the overall reliability of the disk array, as mentioned before. Assuming independent failures, 100 disks collectively have only 1/100th the reliability of a single disk. Thus, redundancy is necessary to tolerate disk failures and allow continuous operation without data loss.

We will see that the majority of redundant disk array organizations can be distinguished based on two features: (1) the granularity of data interleaving and (2) the method and pattern in which the redundant information is computed and distributed across the disk array. Data interleaving can be characterized as either fine grained or coarse grained. Fine-grained disk arrays conceptually interleave data in relatively small units so that all I/O requests, regardless of their size, access all of the disks in the disk array. This results in very high data transfer rates for all I/O requests but has the disadvantages that (1) only one logical I/O request can be in service at any given time and (2) all disks must waste time positioning for every request.



Coarse-grained disk arrays interleave data in relatively large units so that small I/O requests need access only a small number of disks while large requests can access all the disks in the disk array. This allows multiple small requests to be serviced simultaneously while still allowing large requests to see the higher transfer rates afforded by using multiple disks.
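A minimal sketch of coarse-grained (block-interleaved) striping without redundancy, assuming a simple round-robin layout; the array size and stripe unit below are arbitrary illustrative values.

```python
NUM_DISKS = 4          # assumed number of disks
UNIT_BLOCKS = 4        # assumed striping unit, in logical blocks

def stripe_map(logical_block: int):
    """Return (disk, block offset on that disk) for a logical block."""
    unit, within = divmod(logical_block, UNIT_BLOCKS)
    disk = unit % NUM_DISKS
    offset = (unit // NUM_DISKS) * UNIT_BLOCKS + within
    return disk, offset

# A one-unit request touches a single disk; a request spanning
# NUM_DISKS * UNIT_BLOCKS blocks touches every disk once.
print([stripe_map(b) for b in range(0, 32, 4)])
```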

The incorporation of redundancy in disk arrays brings up two somewhat orthogonal problems. The first problem is to select the method for computing the redundant information. Most redundant disk arrays today use parity, though some use Hamming or Reed-Solomon codes. The second problem is that of selecting a method for distributing the redundant information across the disk array. Although there are an unlimited number of patterns in which redundant information can be distributed, we classify these patterns roughly into two different distribution schemes: those that concentrate redundant information on a small number of disks and those that distribute redundant information uniformly across all of the disks. Schemes that uniformly distribute redundant information are generally more desirable because they avoid hot spots and other load-balancing problems suffered by schemes that do not distribute redundant information uniformly. Although the basic concepts of data striping and redundancy are conceptually simple, selecting between the many possible data striping and redundancy schemes involves complex tradeoffs among reliability, performance, and cost.

3.2 Basic RAID Organizations

This section describes the basic RAID organizations that will be used as the basis for further examinations of the performance, cost, and reliability of disk arrays. In addition to presenting RAID levels 1 through 5, which first appeared in the landmark paper by Patterson, Gibson, and Katz [Patterson et al. 1988], we present two other RAID organizations, RAID levels 0 and 6, that have since become generally accepted.² For the benefit of those unfamiliar with the original numerical classification of RAID, we will use English phrases in preference to the numerical classifications. It should come as no surprise to the reader that even the original authors have sometimes been confused with regard to the disk array organization referred to by a particular RAID level! Figure 3 illustrates the seven RAID organizations schematically.

3.2.1 Nonredundant (RAID Level 0)

A nonredundant disk array, or RAID level 0, has the lowest cost of any RAID organization because it does not employ redundancy at all. This scheme offers the best write performance since it does not need to update redundant information. Surprisingly, it does not have the best read performance. Redundancy schemes that duplicate data, such as mirroring, can perform better on reads by selectively scheduling requests on the disk with the shortest expected seek and rotational delays [Bitton and Gray 1988]. Without redundancy, any single disk failure will result in data loss. Nonredundant disk arrays are widely used in supercomputing environments where performance and capacity, rather than reliability, are the primary concerns.

3.2.2 Mirrored (RAID Level 1)

The traditional solution, called mirroring or shadowing, uses twice as many disks as a nonredundant disk array [Bitton and Gray 1988]. Whenever data is written to a disk, the same data is also written to a redundant disk, so that there are always two copies of the information. When data is read, it can be retrieved from the disk with the shorter queuing, seek, and rotational delays [Chen et al. 1990]. If a disk fails, the other copy is used to service requests. Mirroring is frequently used in

²Strictly speaking, RAID level 0 is not a type of redundant array of inexpensive disks since it stores no error-correcting codes.



Figure 3. RAID levels 0 through 6 (panels: Non-Redundant, RAID Level 0; Mirrored, RAID Level 1; Memory-Style ECC, RAID Level 2; Bit-Interleaved Parity, RAID Level 3; Block-Interleaved Parity, RAID Level 4; Block-Interleaved Distributed-Parity, RAID Level 5; P+Q Redundancy, RAID Level 6). All RAID levels are illustrated at a user capacity of four disks. Disks with multiple platters indicate block-level striping while disks without multiple platters indicate bit-level striping. The shaded platters represent redundant information.

database applications where availability and transaction rate are more important than storage efficiency [Gray et al. 1990].

3.2.3 Memory-Style ECC (RAID Level 2)

Memory systems have provided recovery from failed components with much less cost than mirroring by using Hamming codes [Peterson and Weldon 1972]. Hamming codes contain parity for distinct overlapping subsets of components. In one version of this scheme, four data disks require three redundant disks, one less than mirroring. Since the number of redundant disks is proportional to the log



of the total number of disks in the system, storage efficiency increases as the number of data disks increases.

If a single component fails, several of the parity components will have inconsistent values, and the failed component is the one held in common by each incorrect subset. The lost information is recovered by reading the other components in a subset, including the parity component, and setting the missing bit to 0 or 1 to create the proper parity value for that subset. Thus, multiple redundant disks are needed to identify the failed disk, but only one is needed to recover the lost information.

Readers unfamiliar with parity can think of the redundant disk as having the sum of all the data in the other disks. When a disk fails, you can subtract all the data on the good disks from the parity disk; the remaining information must be the missing information. Parity is simply this sum modulo two.
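The "sum modulo two" view of parity can be written out in a few lines; this is a generic illustration of the idea rather than code from any particular array.

```python
from functools import reduce

def parity(blocks):
    """XOR (sum modulo two) of equal-sized blocks, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"disk0dat", b"disk1dat", b"disk2dat", b"disk3dat"]
p = parity(data)                      # the "sum" stored on the redundant disk

# If disk 2 fails, "subtracting" (XORing) the surviving data from the parity
# yields exactly the missing block.
recovered = parity([data[0], data[1], data[3], p])
assert recovered == data[2]
```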

3.2.4 Bit-Interleaved Parity (RAID Level 3)

One can improve upon memory-style ECC disk arrays by noting that, unlike memory component failures, disk controllers can easily identify which disk has failed. Thus, one can use a single parity disk rather than a set of parity disks to recover lost information.

In a bit-interleaved parity disk array, data is conceptually interleaved bit-wise over the data disks, and a single parity disk is added to tolerate any single disk failure. Each read request accesses all data disks, and each write request accesses all data disks and the parity disk. Thus, only one request can be serviced at a time. Because the parity disk contains only parity and no data, the parity disk cannot participate in reads, resulting in slightly lower read performance than for redundancy schemes that distribute the parity and data over all disks. Bit-interleaved parity disk arrays are frequently used in applications that require high bandwidth but not high I/O rates. They are also simpler to implement than RAID levels 4, 5, and 6.

3.2.5 Block-Interleaved Parity (RAID Level 4)

The block-interleaved parity disk array is similar to the bit-interleaved parity disk array except that data is interleaved across disks in blocks of arbitrary size rather than in bits. The size of these blocks is called the striping unit [Chen and Patterson 1990]. Read requests smaller than the striping unit access only a single data disk. Write requests must update the requested data blocks and must compute and update the parity block. For large writes that touch blocks on all disks, parity is easily computed by exclusive-oring the new data for each disk. For small write requests that update only one data disk, parity is computed by noting how the new data differs from the old data and applying those differences to the parity block. Small write requests thus require four disk I/Os: one to write the new data, two to read the old data and old parity for computing the new parity, and one to write the new parity. This is referred to as a read-modify-write procedure. Because a block-interleaved parity disk array has only one parity disk, which must be updated on all write operations, the parity disk can easily become a bottleneck. Because of this limitation, the block-interleaved distributed-parity disk array is universally preferred over the block-interleaved parity disk array.
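A sketch of the read-modify-write procedure for a small write; the in-memory lists standing in for disks and the function names are illustrative, but the four accesses and the parity update rule are as described above.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(data_disks, parity_disk, disk, block, new_data):
    """RAID level-4/5 small write: two reads plus two writes."""
    old_data = data_disks[disk][block]      # access 1: read old data
    old_parity = parity_disk[block]         # access 2: read old parity

    # Apply the difference between old and new data to the parity.
    new_parity = xor_blocks(old_parity, xor_blocks(old_data, new_data))

    data_disks[disk][block] = new_data      # access 3: write new data
    parity_disk[block] = new_parity         # access 4: write new parity
```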

3.2.6 Block-Interleaved Distributed-Parity (RAID Level 5)

The block-interleaved distributed-parity disk array eliminates the parity disk bottleneck present in the block-interleaved parity disk array by distributing the parity uniformly over all of the disks. An additional, frequently overlooked advantage to distributing the parity is that it also distributes data over all of the disks rather than over all but one. This allows all disks to participate in servicing read operations, in contrast to redundancy schemes with dedicated parity disks in which the parity disk cannot participate in servicing read requests. Block-interleaved distributed-parity disk arrays have the best small read, large read, and large write performance of any redundant disk array. Small write requests are somewhat inefficient compared with redundancy schemes such as mirroring, however, due to the need to perform read-modify-write operations to update parity. This is the major performance weakness of RAID level-5 disk arrays and has been the subject of intensive research [Menon et al. 1993; Stodolsky and Gibson 1993].

The exact method used to distribute parity in block-interleaved distributed-parity disk arrays can affect performance. Figure 4 illustrates the best parity distribution of those investigated in Lee and Katz [1991b], called the left-symmetric parity distribution. A useful property of the left-symmetric parity distribution is that whenever you traverse the striping units sequentially, you will access each disk once before accessing any disk twice. This property reduces disk conflicts when servicing large requests.
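One way to express the left-symmetric placement of Figure 4 in code, assuming the five-disk layout of the figure; the helper functions are illustrative, not taken from any specific implementation.

```python
N = 5   # disks in the array, as drawn in Figure 4

def parity_disk(stripe: int) -> int:
    """Disk holding the parity unit of a stripe (P0 on the rightmost disk)."""
    return (N - 1 - stripe) % N

def data_disk(unit: int) -> int:
    """Disk holding a given stripe unit, numbered as in Figure 4."""
    stripe, k = divmod(unit, N - 1)
    return (parity_disk(stripe) + 1 + k) % N

# Reproduce the minimum repeating pattern of Figure 4; note that any five
# consecutive stripe units fall on five distinct disks.
for stripe in range(N):
    row = {data_disk(u): str(u)
           for u in range(stripe * (N - 1), (stripe + 1) * (N - 1))}
    row[parity_disk(stripe)] = f"P{stripe}"
    print([row[d] for d in range(N)])
```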

3.2.7 P + Q Redundancy (RAID Level 6)

Parity is a redundancy code capable of correcting any single self-identifying failure. As larger disk arrays are considered, multiple failures are possible, and stronger codes are needed [Burkhard and Menon 1993]. Moreover, when a disk fails in a parity-protected disk array, recovering the contents of the failed disk requires a successful reading of the contents of all nonfailed disks. As we will see in Section 3.4, the probability of encountering an uncorrectable read error during recovery can be significant. Thus, applications with more stringent reliability requirements require stronger error-correcting codes.

One such scheme, called P + Q redundancy, uses Reed-Solomon codes to protect against up to two disk failures using the bare minimum of two redundant disks. The P + Q redundant disk arrays are structurally very similar to the block-interleaved distributed-parity disk arrays and operate in much the same

Figure 4. RAID level-5 left-symmetric parity placement. Each square corresponds to a stripe unit. Each column of squares corresponds to a disk. P0 computes the parity over stripe units 0, 1, 2, and 3; P1 computes parity over stripe units 4, 5, 6, and 7; etc. Lee and Katz [1991b] show that the left-symmetric parity distribution has the best performance. Only the minimum repeating pattern is shown.

manner. In particular, P + Q redundant disk arrays also perform small write operations using a read-modify-write procedure, except that instead of four disk accesses per write request, P + Q redundant disk arrays require six disk accesses due to the need to update both the "P" and "Q" information.
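The small-write pattern generalizes directly: both redundant units must be read and rewritten. The sketch below only counts the six accesses; the update_p and update_q callables stand in for the parity and Reed-Solomon computations, which are not shown here.

```python
def pq_small_write(read, write, block, new_data, update_p, update_q):
    """P + Q small write: three reads plus three writes = six disk accesses."""
    old_data = read("data", block)    # access 1
    old_p    = read("P", block)       # access 2
    old_q    = read("Q", block)       # access 3

    new_p = update_p(old_p, old_data, new_data)   # XOR-style parity update
    new_q = update_q(old_q, old_data, new_data)   # Reed-Solomon update (elided)

    write("data", block, new_data)    # access 4
    write("P", block, new_p)          # access 5
    write("Q", block, new_q)          # access 6
```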

3.3 Performance and Cost Comparisons

The three primary metrics in the evaluation of disk arrays are reliability, performance, and cost. RAID levels 0 through 6 cover a wide range of tradeoffs among these metrics. It is important to consider all three metrics to understand fully the value and cost of each disk array organization. In this section, we compare RAID levels 0 through 6 based on performance and cost. The following section examines reliability.

3.3.1 Ground Rules and Observations

While there are only three primary metrics in the evaluation of disk arrays



(reliability, performance, and cost), there are many different ways to measure each metric and an even larger number of ways of using them. For example, should performance be measured in I/Os per second, bytes per second, or response time? Would a hybrid metric such as I/Os per second per dollar be more appropriate? Once a metric is agreed upon, should we compare systems at the same cost, the same total user capacity, the same performance, or the same reliability? The method one uses depends largely on the purpose of the comparison and the intended use of the system. In time-sharing applications, the primary metric may be user capacity per dollar; in transaction-processing applications, the primary metric may be I/Os per second per dollar; and in scientific applications, the primary metric may be bytes per second per dollar. In certain heterogeneous systems, such as file servers, both I/Os per second and bytes per second may be important. In many cases, these metrics may all be conditioned on meeting a reliability threshold.

Most large secondary storage systems, and disk arrays in particular, are throughput oriented. That is, generally we are more concerned with the aggregate throughput of the system than with, for example, its response time on individual requests (as long as requests are satisfied within a specified time limit). Such a bias has a sound technical basis: as techniques such as asynchronous I/O, prefetching, read caching, and write buffering become more widely used, fast response time depends on sustaining a high throughput.

In throughput-oriented systems, performance can potentially increase linearly as additional components are added; if one disk provides 30 I/Os per second, two should provide 60 I/Os per second. Thus, in comparing the performance of disk arrays, we will normalize the performance of the system by its cost. In other words, we will use performance metrics such as I/Os per second per dollar rather than the absolute number of I/Os per second.

Even after the metrics are agreed upon, one must decide whether to compare systems of equivalent capacity, cost, or some other metric. We chose to compare systems of equivalent file capacity, where file capacity is the amount of information the file system can store on the device, excluding the storage used for redundancy. Comparing systems with the same file capacity makes it easy to choose equivalent workloads for two different redundancy schemes. Were we to compare systems with different file capacities, we would be confronted with tough choices such as how a workload on a system with user capacity X maps onto a system with user capacity 2X.

Finally, there is currently much confusion in comparisons of RAID levels 1 through 5. The confusion arises because a RAID level sometimes specifies not a specific implementation of a system but rather its configuration and use. For example, a RAID level-5 disk array (block-interleaved distributed parity) with a parity group size of two is comparable to RAID level 1 (mirroring), with the exception that in a mirrored disk array, certain disk-scheduling and data layout optimizations can be performed that, generally, are not implemented for RAID level-5 disk arrays [Hsiao and DeWitt 1990; Orji and Solworth 1993]. Analogously, a RAID level-5 disk array can be configured to operate equivalently to a RAID level-3 disk array by choosing a unit of data striping such that the smallest unit of array access always accesses a full parity stripe of data. In other words, RAID level-1 and RAID level-3 disk arrays can be viewed as a subclass of RAID level-5 disk arrays. Since RAID level-2 and RAID level-4 disk arrays are, practically speaking, in all ways inferior to RAID level-5 disk arrays, the problem of selecting among RAID levels 1 through 5 is a subset of the more general problem of choosing an appropriate parity group size and striping unit size for RAID level-5 disk arrays. A parity group size close to two may indicate the use of RAID level-1 disk arrays; a striping unit much smaller than the size of an average request may



Table 3. Throughput per Dollar Relative to RAID Level 0

              Small Read   Small Write     Large Read   Large Write   Storage Efficiency
RAID level 0  1            1               1            1             1
RAID level 1  1            1/2             1            1/2           1/2
RAID level 3  1/G          1/G             (G-1)/G      (G-1)/G       (G-1)/G
RAID level 5  1            max(1/G, 1/4)   1            (G-1)/G       (G-1)/G
RAID level 6  1            max(1/G, 1/6)   1            (G-2)/G       (G-2)/G

This table compares the throughputs of various redundancy schemes for four types of I/O requests. Small here refers to I/O requests of one striping unit; large refers to I/O requests of one full stripe (one stripe unit from each disk in an error correction group). G refers to the number of disks in an error correction group. In all cases, the higher the number the better. The entries in this table account for the major performance effects but not some of the second-order effects. For instance, since RAID level 1 stores two copies of the data, a common optimization is to read dynamically the disk whose positioning time to the data is smaller.

indicate the use of a RAID level-3 disk array.

3.3.2 Comparisons

Table 3 tabulates the maximum throughput per dollar relative to RAID level 0 for RAID levels 0, 1, 3, 5, and 6. The cost of each system is assumed to be proportional to the total number of disks in the disk array. Thus, the table illustrates that given equivalent-cost RAID level-0 and RAID level-1 systems, the RAID level-1 system can sustain half the number of small writes per second that a RAID level-0 system can sustain. Equivalently, we can say that the cost of small writes is twice as expensive in a RAID level-1 system as in a RAID level-0 system. In addition to performance, the table shows the storage efficiency of each disk array organization. The storage efficiency is approximately the inverse of the cost of each unit of user capacity relative to a RAID level-0 system. For the above disk array organizations, the storage efficiency is equal to the performance/cost metric for large writes.
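The entries of Table 3 are simple functions of the parity group size G; transcribing them, as below, is one way to reproduce the curves of Figures 5 and 6. The code is only such a transcription.

```python
def throughput_per_dollar(G: int):
    """Table 3, relative to RAID level 0, as a function of parity group size G."""
    return {
        "RAID 1": {"small read": 1, "small write": 1 / 2,
                   "large read": 1, "large write": 1 / 2, "storage eff": 1 / 2},
        "RAID 3": {"small read": 1 / G, "small write": 1 / G,
                   "large read": (G - 1) / G, "large write": (G - 1) / G,
                   "storage eff": (G - 1) / G},
        "RAID 5": {"small read": 1, "small write": max(1 / G, 1 / 4),
                   "large read": 1, "large write": (G - 1) / G,
                   "storage eff": (G - 1) / G},
        "RAID 6": {"small read": 1, "small write": max(1 / G, 1 / 6),
                   "large read": 1, "large write": (G - 2) / G,
                   "storage eff": (G - 2) / G},
    }

# With G = 2, RAID level 5 matches RAID level 1 on every metric.
print(throughput_per_dollar(2)["RAID 5"])
```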

Figure 5 graphs the performance/cost metrics from Table 3 for RAID levels 1, 3, 5, and 6 over a range of parity group sizes. The performance/cost of RAID level-1 systems is equivalent to the performance/cost of RAID level-5 systems when the parity group size is equal to two. The performance/cost of RAID level-3 systems is always less than or equal to the performance/cost of RAID level-5 systems. This is expected given that a RAID level-3 system is a subclass of RAID level-5 systems derived by restricting the striping unit size such that all requests access exactly a parity stripe of data. Since the configuration of RAID level-5 systems is not subject to such a restriction, the performance/cost of RAID level-5 systems can never be less than that of an equivalent RAID level-3 system. It is important to stress that these performance/cost observations apply only to the abstract models of disk arrays for which we have formulated performance/cost metrics. In reality, a specific implementation of a RAID level-3 system can have better performance/cost than a specific implementation of a RAID level-5 system.

As previously mentioned, the question of which RAID level to use is often better expressed as more general configuration questions concerning the size of the parity group and striping unit. If a parity group size of two is indicated, then mirroring is desirable. If a very small striping unit is indicated, then a RAID level-3 system may be sufficient. To aid the reader in evaluating such decisions, Figure 6 plots the four performance/cost



Figure 5. Throughput per dollar relative to RAID level 0 (panels: Small Reads, Small Writes, Large Reads, Large Writes, each plotted against parity group size). RAID level-1 performance is approximately equal to RAID level-5 performance with a group size of two. Note that for small writes, RAID levels 3, 5, and 6 are equally cost effective at small group sizes, but as group size increases, RAID levels 5 and 6 become better alternatives.

metrics from Table 3 on the same graph for each of RAID levels 3, 5, and 6. This makes the performance/cost tradeoffs explicit in choosing an appropriate parity group size. Section 4.4 addresses how to choose the unit of striping.

3.4 Reliability

Reliability is as important a metric to many I/O systems as performance and cost, and it is perhaps the main reason for the popularity of redundant disk arrays. This section starts by reviewing the basic reliability provided by a block-interleaved parity disk array and then lists three factors that can undermine the potential reliability of disk arrays.

3.4.1 Basic Reliability

Redundancy in disk arrays is motivated by the need to overcome disk failures. When only independent disk failures are considered, a simple parity scheme works admirably. Patterson et al. [1988] derive the mean time between failures for a RAID level 5 to be

MTTF(disk)^2 / (N × (G − 1) × MTTR(disk)),

where MTTF(disk) is the mean-time-to-failure (MTTF) of a single disk, MTTR(disk) is the mean-time-to-repair (MTTR) of a single disk, N is the total



Figure 6. Throughput per dollar relative to RAID level 0 (one panel for each of RAID levels 3, 5, and 6, plotting the four performance/cost metrics against group size). The graphs illustrate the tradeoff in performance/cost versus group size for each specified RAID level. Note that in this comparison, mirroring (RAID level 1) is the same as RAID level 5 with a group size of two.

number of disks in the disk array, and G is the parity group size. For illustration purposes, let us assume we have 100 disks that each have a mean time to failure of 200,000 hours and a mean time to repair of one hour. If we organized these 100 disks into parity groups of average size 16, then the mean time to failure of the system would be an astounding 3000 years. Mean times to failure of this magnitude lower the chances of failure over any given period of time.

For a disk array with two redundant disks per parity group, such as P + Q redundancy, the mean time to failure is

MTTF(disk)^3 / (N × (G − 1) × (G − 2) × MTTR(disk)^2).

Using the same values for our reliability parameters, this implies an astronomically large mean time to failure of 38 million years.
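A short calculation reproducing both estimates. Following the convention stated with Tables 4-6, N here is taken to be the total number of disks needed to supply 100 data disks of user capacity; holding N at exactly 100 instead shifts the results by only a few percent, within the rounding of the quoted figures.

```python
HOURS_PER_YEAR = 8766
MTTF_DISK, MTTR_DISK = 200_000, 1      # hours
G, DATA_DISKS = 16, 100

# RAID level 5: one redundant disk per group of G.
N5 = DATA_DISKS * G / (G - 1)
raid5_years = MTTF_DISK**2 / (N5 * (G - 1) * MTTR_DISK) / HOURS_PER_YEAR

# P + Q redundancy: two redundant disks per group of G.
N6 = DATA_DISKS * G / (G - 2)
pq_years = MTTF_DISK**3 / (N6 * (G - 1) * (G - 2) * MTTR_DISK**2) / HOURS_PER_YEAR

print(f"RAID 5: ~{raid5_years:,.0f} years")             # ~3,000 years
print(f"P + Q : ~{pq_years / 1e6:.0f} million years")   # ~38 million years
```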

This is an idealistic picture, but it gives us an idea of the potential reliability afforded by disk arrays. The rest of this section takes a more realistic look at the reliability of block-interleaved disk arrays by considering factors such as system crashes, uncorrectable bit errors, and correlated disk failures that can dramatically affect the reliability of disk arrays.

3.4.2 System Crashes and Parity Inconsistency

In this section, the term system crash refers to any event such as a power failure, operator error, hardware breakdown, or software crash that can interrupt an I/O operation to a disk array. Such crashes can interrupt write operations, resulting in states where the data is updated and the parity is not, or vice versa. In either case, the parity is inconsistent and cannot be used in the event of a disk failure. Techniques such as redundant hardware and power supplies can be applied to make such crashes less frequent [Menon and Courtney 1993], but no technique can prevent system crashes 100% of the time.

System crashes can cause parity inconsistencies in both bit-interleaved and block-interleaved disk arrays, but the problem is of practical concern only in block-interleaved disk arrays. This is because in bit-interleaved disk arrays the inconsistent parity can only affect the data that is currently being written. If writes do not have to be atomic, applications cannot assume either that the write during a system crash completed or did not complete, and thus it is generally permissible for the bit-interleaved disk array to store arbitrary data on the updated sectors. In a block-interleaved disk array, however, an interrupted write operation can affect the parity of other data blocks in that stripe that were not being written. Thus, for reliability purposes, system crashes in block-interleaved disk arrays are similar to disk failures in that they may result in the loss of the correct parity for stripes that were being modified during the crash.

In actuality, system crashes can be much worse than disk failures for two reasons. First, they may occur more frequently than disk failures. Second, a system crash in disk arrays using P + Q redundancy is analogous to a double disk failure because both the "P" and "Q" information is made inconsistent. To avoid the loss of parity on system crashes, information sufficient to recover the parity must be logged to nonvolatile storage before executing each write operation. The information need only be saved until the corresponding write completes. Hardware implementations of RAID systems can implement such logging efficiently using nonvolatile RAM. In software implementations that do not have access to fast nonvolatile storage, it is generally not possible to protect against system crashes without significantly sacrificing performance.

3.4.3 Uncorrectable Bit Errors

Although modern disks are highly reliable devices that can withstand significant amounts of abuse, they occasionally fail to read or write small bits of data. Currently, most disks cite uncorrectable bit error rates of one error in 10^14 bits read. Unfortunately, the exact interpretation of what is meant by an uncorrectable bit error is unclear. For example, does the act of reading disks actually generate errors, or do the errors occur during writes and become evident during reads?

Generally, disk manufacturers agree that reading a disk is very unlikely to cause permanent errors. Most uncorrectable errors are generated because data is incorrectly written or gradually damaged as the magnetic media ages. These errors are detected only when we attempt to read the data. Our interpretation of uncorrectable bit error rates is that they represent the rate at which errors are detected during reads from the disk during the normal operation of the disk drive. It is important to stress that there is no generally agreed-upon interpretation of bit error rates.

The primary ramification of an uncorrectable bit error is felt when a disk fails and the contents of the failed disk must be reconstructed by reading data from the nonfailed disks. For example, the reconstruction of a failed disk in a 100 GB disk array requires the successful reading of approximately 200 million sectors of information. A bit error rate of one in 10^14 bits implies that one 512-byte sector in 24 billion sectors cannot be correctly read. Thus, if we assume that the probability of reading sectors is independent of each other, the probability of reading all 200 million sectors successfully is approximately (1 − 1/(2.4 × 10^10))^(2.0 × 10^8) = 99.2%. This means that on average, 0.8% of disk failures would result in data loss due to an uncorrectable bit error.
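The arithmetic behind these figures, written out with the sector size and error rate given in the text and the same independence assumption:

```python
SECTOR_BYTES = 512
SECTORS_TO_READ = 100e9 / SECTOR_BYTES               # ~200 million sectors (100 GB)
BIT_ERROR_RATE = 1e-14                               # 1 error per 10^14 bits read

p_sector_bad = BIT_ERROR_RATE * SECTOR_BYTES * 8     # ~1 in 2.4 x 10^10 sectors
p_rebuild_ok = (1 - p_sector_bad) ** SECTORS_TO_READ

print(f"P(rebuild succeeds) ~ {p_rebuild_ok:.1%}")      # ~99.2%
print(f"P(data loss)        ~ {1 - p_rebuild_ok:.1%}")  # ~0.8%
```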

The above example indicates that unrecoverable bit errors can be a significant factor in designing large, highly reliable disk arrays. This conclusion is heavily dependent on our particular interpretation of what is meant by an unrecoverable bit error and on the guaranteed unrecoverable bit error rates supplied by the disk manufacturers; actual error rates may be much better.

One approach that can be used with or without redundancy is to try to protect against bit errors by predicting when a disk is about to fail. VAXsimPLUS, a product from Digital Equipment Corporation, monitors the warnings given by disks and notifies an operator when it feels the disk is about to fail. Such predictions can significantly lower the incidence of data loss [Emlich and Polich 1989; Malhotra and Trivedi 1993].

3.4.4 Correlated Disk Failures

The simplest model of reliability of disk arrays [Patterson et al. 1988] assumes that all disk failures are independent when calculating mean time to data loss. This results in very high mean-time-to-data-loss estimates, up to millions of years. In reality, common environmental and manufacturing factors can frequently cause correlated disk failures. For example, an earthquake might sharply increase the failure rate for all disks in a disk array for a short period of time. More commonly, power surges, power failures, and simply the act of powering disks on and off can place simultaneous stress on the electrical components of all affected disks. Disks also share common support hardware; when this hardware fails, it can lead to multiple, simultaneous disk failures.

Aside from environmental factors, the disks themselves have certain correlated failure modes built into them. For example, disks are generally more likely to fail either very early or very late in their lifetimes. Early failures are frequently caused by transient defects which may not have been detected during the manufacturer's burn-in process; late failures occur when a disk wears out. A systematic manufacturing defect can also produce bad batches of disks that fail close together in time. Correlated disk failures greatly reduce the reliability of disk arrays by making it much more likely that an initial disk failure will be closely followed by additional disk failures before the failed disk can be reconstructed.

3.4.5 Reliability Revisited

The previous sections have described how system crashes, uncorrectable bit errors, and correlated disk failures can decrease the reliability of redundant disk arrays. In this section, we will calculate mean-time-to-data-loss statistics after incorporating these factors.

The new failure modes imply that there are now three relatively common ways to lose data in a block-interleaved parity-protected disk array:

• double disk failure,

• system crash followed by a disk failure, and

• disk failure followed by an uncorrectable bit error during reconstruction.

As mentioned above, a system crash followed by a disk failure can be protected against in most hardware disk array implementations with the help of nonvolatile storage, but such protection is unlikely in software disk arrays. The above three failure modes are the hardest failure combinations, in that we are currently unaware of any techniques to protect against them without significantly degrading performance. To construct a simple model of correlated disk failures, we will assume that each successive disk failure is 10 times more likely than the previous failure (until the failed disk has been reconstructed). Table 4 tabulates values of the reliability



Table 4. Reliability Parameters

Total User Capacity     100 disks (500 GB)
Disk Size               5 GB
Sector Size             512 bytes
Bit Error Rate (BER)    1 in 10^14 bits (1 in 2.4 × 10^10 sectors)
p(disk)                 99.96%  (the probability of reading all sectors on a disk,
                        derived from disk size, sector size, and BER)
Parity Group Size       16 disks
MTTF(disk)              200,000 hours
MTTF(disk2)             20,000 hours
MTTF(disk3)             2,000 hours
MTTR(disk)              1 hour
MTTF(sys)               1 month
MTTR(sys)               1 hour

This table lists parameters used for reliability calculations in this section.

parameters we will use for calculating numeric reliability estimates in this section. Note that the reliability estimates will be given per a constant user capacity of 100 disks, consisting of independent, 16-disk parity groups.

Table 5, which tabulates reliability metrics for RAID level-5 disk arrays, shows that the frequencies of the three failure combinations are within an order of magnitude of each other. This means that none of the three failure modes can be ignored in determining reliability. This makes it difficult to improve the overall reliability of the system without improving the reliability of several components of the system; a more reliable disk will greatly reduce the frequency of double disk failures, but its protection against the other two failure combinations is less pronounced. Frequencies of both system crashes and bit error rates must also be reduced before significant improvements in overall system reliability can be achieved. Also, note the deceptively reassuring MTTDL numbers. Even with an MTTDL of 285 years, there is a 3.4% chance of losing data in the first 10 years.

Table 6 tabulates the reliability metrics for P + Q redundant disk arrays. As can be seen, system crashes are the Achilles' heel of P + Q redundancy schemes. Since system crashes invalidate both the P and Q information, their effect is similar to a double disk failure. Thus, unless the system provides protection against system crashes, as is assumed in the calculation of the reliability for hardware RAID systems, P + Q redundancy does not provide a significant advantage over parity-protected disk arrays. In general, P + Q redundancy is most useful for protecting against unrecoverable bit errors that occur during reconstruction and against multiple correlated disk failures.
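A sketch of how the bottom rows of Tables 5 and 6 follow from the individual failure modes: competing failure processes combine by harmonic sum, and the hardware (NVRAM) case simply drops the system-crash term. The RAID level-5 numbers of Table 5 are used here; converting the one-month MTTF(sys) at roughly 730 hours is an assumption and accounts for small rounding differences from the table.

```python
HOURS_PER_YEAR = 8766
G, DATA_DISKS = 16, 100
N = DATA_DISKS * G / (G - 1)                  # total disks, per the note in Table 5

MTTF_DISK, MTTF_DISK2, MTTR_DISK = 200_000, 20_000, 1   # hours (Table 4)
MTTF_SYS, MTTR_SYS = 730, 1                             # ~1 month, 1 hour
P_DISK = 0.9996                                         # P(read every sector of a disk)

double_disk = MTTF_DISK * MTTF_DISK2 / (N * (G - 1) * MTTR_DISK)
crash_disk = MTTF_SYS * MTTF_DISK / (N * MTTR_SYS)
disk_bit_error = MTTF_DISK / (N * (1 - P_DISK ** (G - 1)))

def harmonic_sum(terms):
    return 1 / sum(1 / t for t in terms)

software = harmonic_sum([double_disk, crash_disk, disk_bit_error])
hardware = harmonic_sum([double_disk, disk_bit_error])     # NVRAM masks crashes

for label, mttdl in [("software RAID", software), ("hardware RAID", hardware)]:
    print(f"{label}: MTTDL ~ {mttdl / HOURS_PER_YEAR:.0f} years")
```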

3.4.6 Summary and Conclusions

This section has examined the reliability of block-interleaved redundant disk arrays when factors other than independent disk failures are taken into account. We see that system crashes and unrecoverable bit errors can significantly reduce the reliability of block-interleaved parity-protected disk arrays. We have shown that P + Q redundant disk arrays are very effective in protecting against both double disk failures and unrecoverable bit errors but are susceptible to system crashes. In order to realize the full reliability advantages of P + Q redundant disk arrays, nonvolatile storage must be used to protect against system crashes.

Numeric reliability calculations serve as useful guidelines and bounds for the actual reliability of disk arrays. However, it is infeasible to compare the reliability of real systems based on such numbers. Frequently, reliability calculations ignore important implementation-specific factors that are difficult to quantify, such as the reliability of software components. What is useful to know, however, and what we have presented here, are the types of common failures that a disk array can tolerate, how they limit the reliability of the system, and


Table 5. Failure Characteristics for RAID Level-5 Disk Arrays

                           MTTDL                                                  MTTDL      Probability of Data Loss
                                                                                             over 10-Year Period
Double Disk Failure        MTTF(disk) x MTTF(disk2) / (N x (G-1) x MTTR(disk))    285 yrs.   3.4%
Sys Crash + Disk Failure   MTTF(sys) x MTTF(disk) / (N x MTTR(sys))               154 yrs.   6.3%
Disk Failure + Bit Error   MTTF(disk) / (N x (1 - p(disk)^(G-1)))                 36 yrs.    24.4%
Software RAID              (harmonic sum of above)                                26 yrs.    31.6%
Hardware RAID (NVRAM)      (harmonic sum excluding sys crash + disk failure)      32 yrs.    26.8%

MTTDL is the mean time to data loss. The 10-year failure rate is the percent chance of data loss in a 10-year period. For numeric calculations, the parity group size, G, is equal to 16, and the user data capacity is equal to 100 data disks. Note that the total number of disks in the system, N, is equal to the number of data disks times G/(G - 1).
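The entries in Tables 5 and 6 follow directly from the parameters of Table 4. The sketch below (ours, for illustration only; the variable names are not from the paper) evaluates the double-disk-failure row of Table 5 and converts the resulting MTTDL into a 10-year data-loss probability by assuming exponentially distributed times to data loss.

```python
import math

# Parameters from Table 4.
MTTF_DISK  = 200_000      # hours until a first disk failure
MTTF_DISK2 = 20_000       # hours until a second failure in the same group
MTTR_DISK  = 1            # hours to repair/reconstruct a disk
G          = 16           # parity group size
DATA_DISKS = 100
N          = DATA_DISKS * G / (G - 1)   # total disks, including parity

HOURS_PER_YEAR = 24 * 365

# Double-disk-failure row of Table 5:
# MTTDL = MTTF(disk) x MTTF(disk2) / (N x (G-1) x MTTR(disk))
mttdl_hours = MTTF_DISK * MTTF_DISK2 / (N * (G - 1) * MTTR_DISK)
mttdl_years = mttdl_hours / HOURS_PER_YEAR

# With exponentially distributed time to data loss, the chance of losing
# data within t years is 1 - exp(-t / MTTDL).
p_loss_10yr = 1 - math.exp(-10 / mttdl_years)

print(f"MTTDL = {mttdl_years:.0f} years")                 # roughly 285 years
print(f"P(data loss in 10 years) = {p_loss_10yr:.1%}")    # roughly 3.4%
```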

Table 6. Failure Characteristics for a P + Q Disk Array

                                  MTTDL                                                                        MTTDL        Probability of Data Loss
                                                                                                                            over 10-Year Period
Triple Disk Failure               MTTF(disk) x MTTF(disk2) x MTTF(disk3) / (N x (G-1) x (G-2) x MTTR^2(disk))  38052 yrs.   0.03%
Sys Crash + Disk Failure          MTTF(sys) x MTTF(disk) / (N x MTTR(sys))                                     144 yrs.     7.7%
Double Disk Failure + Bit Error   MTTF(disk) x MTTF(disk2) / (N x (G-1) x (1 - p(disk)^(G-2)) x MTTR(disk))    47697 yrs.   0.02%
Software RAID                     (harmonic sum of above)                                                      143 yrs.     6.8%
Hardware RAID (NVRAM)             (harmonic sum excluding sys crash + disk failure)                            21166 yrs.   0.05%

MTTDL is the mean time to data loss. The 10-year failure rate is the percent chance of data loss in a 10-year period. For numeric calculations, the parity group size, G, is equal to 16, and the user data capacity is equal to 100 data disks. Note that the total number of disks in the system, N, is equal to the number of data disks times G/(G - 2).


thus its approximate reliability in comparison to other disk array organizations of similar complexity.

3.5 Implementation Considerations

Although the operation of block-interleaved redundant disk arrays is conceptually simple, a disk array implementer must address many practical considerations for the system to function correctly and reliably at an acceptable level of performance. One problem is that the necessary state information for a disk array consists of more than just the data and parity stored on the disks. Information such as which disks are failed, how much of a failed disk has been reconstructed, and which sectors are currently being updated must be accurately maintained in the face of system crashes. We will refer to such state information that is neither user data nor parity as metastate information. Another problem, addressed in Section 3.5.4, is that multiple disks are usually connected to the host computer via a common bus or string.

3.5.1 Avoiding Stale Data

The only piece of metastate information that must be maintained in redundant disk arrays is the validity of each sector of data and parity in the disk array. The following restrictions must be observed in maintaining this information.

* When a disk fails, the logical sectors corresponding to the failed disk must be marked invalid before any request that would normally access the failed disk can be attempted. This invalid mark prevents users from reading corrupted data on the failed disk.

* When an invalid logical sector is reconstructed to a spare disk, the logical sector must be marked valid before any write request that would normally write to the failed disk can be serviced. This ensures that ensuing writes update the reconstructed data on the spare disk.

Both restrictions are needed to ensure that users do not receive stale data from the disk array. Without the first restriction, it would be possible for users to read stale data from a disk that is considered to have failed but works intermittently. Without the second restriction, successive write operations would fail to update the newly reconstructed sector, resulting in stale data. The valid/invalid state information can be maintained as a bit-vector either on a separate device or by reserving a small amount of storage on the disks currently configured into the disk array. If one keeps track of which disks are failed/operational, one needs only to keep a bit-vector for the failed disks. Generally, it is more convenient to maintain the valid/invalid state information on a per striping unit rather than a per sector basis since most implementations will tend to reconstruct an entire striping unit of data at a time rather than a single sector. Because disk failures are relatively rare events and large groups of striping units can be invalidated at a time, updating the valid/invalid metastate information to stable storage does not present a significant performance overhead.
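As a concrete illustration of how the two restrictions above might be enforced, the following sketch (our own simplified pseudostructure, not code from any actual array controller; `persist` is a hypothetical stable-storage writer) keeps the valid/invalid metastate as a per-striping-unit bit-vector that is forced to stable storage before the corresponding requests proceed.

```python
class StripingUnitState:
    """Per-striping-unit valid/invalid metastate for one failed disk.

    `persist` stands in for whatever stable storage the array uses
    (a reserved region on the remaining disks or a separate device).
    """

    def __init__(self, units_per_disk, persist):
        self.valid = [True] * units_per_disk
        self.persist = persist

    def on_disk_failure(self, units):
        # Restriction 1: mark the failed disk's striping units invalid on
        # stable storage *before* any request that would normally touch
        # the failed disk is attempted.
        for u in units:
            self.valid[u] = False
        self.persist(self.valid)

    def on_unit_reconstructed(self, unit):
        # Restriction 2: mark the striping unit valid *before* servicing
        # any write that targets it, so later writes update the
        # reconstructed copy on the spare disk.
        self.valid[unit] = True
        self.persist(self.valid)

    def may_read_directly(self, unit):
        # Reads of invalid units must be served by reconstructing from
        # the surviving disks, never from the failed disk itself.
        return self.valid[unit]
```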

3.5.2 Regenerating Parity after a System Crash

System crashes can result in inconsistent parity by interrupting write operations. Thus, unless it is known which parity sectors were being updated, all parity sectors must be regenerated whenever a disk array comes up from a system crash. This is an expensive operation that requires scanning the contents of the entire disk array. To avoid this overhead, information concerning the consistent/inconsistent state of each parity sector must be logged to stable storage. The following restrictions must be observed.

* Before servicing any write request, the corresponding parity sectors must be marked inconsistent.

* When bringing a system up from a system crash, all inconsistent parity sectors must be regenerated.


Note that because regenerating a consistent parity sector does no harm, it is not absolutely necessary to mark a parity sector as consistent. To avoid having to regenerate a large number of parity sectors after each crash, however, it is clearly desirable to mark parity sectors periodically as consistent.

Unlike updating valid/invalid information, updating the consistent/inconsistent state information is a potential performance problem in software RAID systems, which usually do not have access to fast, nonvolatile storage. A simplistic implementation would require a disk write to mark a parity sector as inconsistent before each write operation and a corresponding disk write to mark the parity sector as consistent after each write operation. A more palatable solution is to maintain a most recently used pool that keeps track of a fixed number of inconsistent parity sectors on stable storage. By keeping a copy of the pool in main memory, one can avoid accessing stable storage to mark parity sectors that are already marked as inconsistent. By varying the size of the pool, one can trade off the hit rate of the pool against the amount of parity information that needs to be regenerated when recovering from a system crash.

The above method should work efficiently for requests that exhibit good locality of reference. If the disk array must service a large number of random write requests, as in transaction-processing environments, we recommend incorporating a group commit mechanism so that a large number of parity sectors can be marked inconsistent with a single access to stable storage. This solves the throughput problem but results in higher latencies for random write requests since the parity sectors must be marked inconsistent before the writes can proceed.
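A sketch of the pool just described (our own simplification; the class name, the fixed capacity, and the `write_pool_to_stable_storage` callback are assumptions): a bounded, most-recently-used set of parity sectors marked inconsistent on stable storage, where one stable-storage access can admit the parity sectors of many queued writes.

```python
from collections import OrderedDict

class InconsistentParityPool:
    """Tracks a bounded set of parity sectors marked inconsistent.

    `write_pool_to_stable_storage` stands in for one stable-storage
    write (disk or NVRAM) recording the current pool. After a crash,
    only the parity sectors listed in the last recorded pool need to
    be regenerated.
    """

    def __init__(self, capacity, write_pool_to_stable_storage):
        self.capacity = capacity
        self.pool = OrderedDict()        # parity sector -> None, in MRU order
        self.flush = write_pool_to_stable_storage

    def mark_inconsistent(self, parity_sectors):
        """Call before servicing the corresponding write requests."""
        already_marked = all(s in self.pool for s in parity_sectors)
        for s in parity_sectors:
            self.pool[s] = None
            self.pool.move_to_end(s)       # keep most-recently-used order
        while len(self.pool) > self.capacity:
            self.pool.popitem(last=False)  # evict the least recently used
        if not already_marked:
            # Group commit: one stable-storage access covers every parity
            # sector queued by concurrent small writes.
            self.flush(list(self.pool))
        # The writes may proceed once the pool copy is stable.
```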

3.5.3 Operating with a Failed Disk

A system crash in a block-interleaved redundant disk array is similar to a disk failure in that it can result in the loss of parity information. This means that a disk array operating with a failed disk can potentially lose data in the event of a system crash. Because system crashes are significantly more common in most systems than disk failures, operating with a failed disk can be risky.

While operating with a failed disk, a user must perform some form of logging on every write operation to prevent the loss of information in the event of a system crash. We describe two elegant methods to perform this logging. The first method, called demand reconstruction, is the easiest and most efficient but requires stand-by spare disks. With demand reconstruction, accesses to a parity stripe with an invalid sector trigger reconstruction of the appropriate data immediately onto a spare disk. Write operations thus never deal with invalid sectors created by disk failures. A background process scans the entire disk array to ensure that all the contents of the failed disk are eventually reconstructed within an acceptable time period.

The second method, called parity sparing [Chandy and Reddy 1993], can be applied to systems without stand-by spares but requires additional metastate information. Before servicing a write request that would access a parity stripe with an invalid sector, the invalid sector is reconstructed and relocated to overwrite its corresponding parity sector. Then the sector is marked as relocated. Since the corresponding parity stripe no longer has parity, a system crash can only affect the data being written. When the failed disk is eventually replaced, (1) the relocated sector is copied to the spare disk, (2) the parity is regenerated, and (3) the sector is no longer marked as relocated.

3.5.4 Orthogonal RAID

To this point, we have ignored the issue of how to connect disks to the host computer. In fact, how one does this can drastically affect performance and reliability. Most computers connect multiple disks via some smaller number of strings. Thus, a string failure causes multiple, simultaneous disk failures. If not properly designed, these multiple failures can cause data to become inaccessible.

Figure 7. Orthogonal RAID. This figure presents two options of how to organize error correction groups in the presence of shared resources, such as a string controller. Option 1 groups four disks on the same string into an error correction group; Option 2 groups one disk from each string into a group. Option 2 is preferred over Option 1 because the failure of a string controller will only render one disk from each group inaccessible.

For example, consider the 16-disk array in Figure 7 and two options of how to organize multiple error correction groups. Option 1 combines each string of four disks into a single error correction group. Option 2 combines one disk on each string into a single error correction group. Unfortunately for Option 1, if a string fails, all four disks of an error correction group are inaccessible. On the other hand, Option 2 loses one disk from each of the four error correction groups and still allows access to all data. This technique of organizing error correction groups orthogonally to common hardware (such as a string) is called orthogonal RAID [Schulze et al. 1989; Ng 1994]. Orthogonal RAID has the added benefit of minimizing string conflicts when multiple disks from a group transfer data simultaneously.
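The two groupings of Figure 7 reduce to simple index arithmetic. The sketch below (ours, assuming a hypothetical 4 x 4 layout in which disk d sits on string d // 4) builds both options and shows that a string failure removes an entire group under Option 1 but only one disk per group under the orthogonal Option 2.

```python
NUM_STRINGS = 4
DISKS_PER_STRING = 4
disks = range(NUM_STRINGS * DISKS_PER_STRING)   # disk d sits on string d // DISKS_PER_STRING

# Option 1: each error correction group is one whole string.
option1 = [[d for d in disks if d // DISKS_PER_STRING == s]
           for s in range(NUM_STRINGS)]

# Option 2 (orthogonal RAID): group i takes the i-th disk of every string,
# so a string failure costs each group at most one disk.
option2 = [[d for d in disks if d % DISKS_PER_STRING == i]
           for i in range(DISKS_PER_STRING)]

failed_string = 0
def disks_lost(group):
    # Count group members that become inaccessible when failed_string dies.
    return sum(1 for d in group if d // DISKS_PER_STRING == failed_string)

print([disks_lost(g) for g in option1])   # [4, 0, 0, 0] -> one whole group is lost
print([disks_lost(g) for g in option2])   # [1, 1, 1, 1] -> every group can still reconstruct
```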

4. ADVANCED TOPICS

This section discusses advanced topics in the design of redundant disk arrays. Many of the techniques are independent of each other, allowing designers to mix and match techniques.

4.1 Improving Small Write Performance for RAID Level 5

The major performance problem with RAID level-5 disk arrays is the high overhead for small writes. As described in Section 3.2, each small write generates four separate disk I/Os, two to read the old data and old parity and two to write the new data and new parity. This increases the response time of writes by approximately a factor of two and decreases throughput by approximately a factor of four. In contrast, mirrored disk arrays, which generate only two disk I/Os per small write, experience very little increase in response time and only a factor-of-two decrease in throughput. These performance penalties of RAID level 5 relative to nonredundant and mirrored disk arrays are prohibitive in applications such as transaction processing that generate many small writes.
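For reference, the four I/Os of a RAID level-5 small write come from the read-modify-write of parity. A minimal sketch (ours; `read_block` and `write_block` are hypothetical device accessors) makes the accounting explicit: the new parity is the old parity XORed with the old and new data.

```python
def raid5_small_write(read_block, write_block, data_disk, parity_disk, addr, new_data):
    """Update one block and its parity with the classic four disk I/Os."""
    old_data   = read_block(data_disk, addr)       # I/O 1: read old data
    old_parity = read_block(parity_disk, addr)     # I/O 2: read old parity
    # new parity = old parity XOR old data XOR new data
    new_parity = bytes(p ^ od ^ nd
                       for p, od, nd in zip(old_parity, old_data, new_data))
    write_block(data_disk, addr, new_data)         # I/O 3: write new data
    write_block(parity_disk, addr, new_parity)     # I/O 4: write new parity
```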

This section describes three techniques for improving the performance of small writes in RAID level-5 disk arrays: buffering and caching, floating parity, and parity logging.

4.1.1 Buffering and Caching

Buffering and caching, two optimizations commonly used in I/O systems, can be particularly effective in disk arrays. This section describes how these optimizations can work to minimize the performance degradations of small writes in a RAID level 5.

Write buffering, also called asynchronous writes, acknowledges a user's write before the write goes to disk. This technique reduces the response time seen by the user under low and moderate load. Since the response time no longer depends on the disk system, RAID level 5 can deliver the same response time as


any other disk system. If system crashes are a significant problem, nonvolatile memory is necessary to prevent loss of data that are buffered but not yet committed. This technique may also improve throughput in two ways: (1) by giving future updates the opportunity to overwrite previous updates, thus eliminating the need to write the first update [Menon and Cortney 1993], and (2) by lengthening the queue of requests seen by a disk scheduler and allowing more efficient scheduling [Seltzer et al. 1990].

Barring these overwrites, however, this technique does nothing to improve throughput. So under high load, the write buffer space will fill more quickly than it empties, and response time of a RAID level 5 will still be four times worse than a RAID level 0.

An extension of write buffering is to group sequential writes together. This technique can make writes to all types of disk systems faster, but it has a particular appeal to RAID level-5 disk arrays. By writing larger units, small writes can be turned into full-stripe writes, thus eliminating altogether the Achilles' heel of RAID level-5 workloads [Rosenblum and Ousterhout 1991; Menon and Courtney 1993].

Read caching is normally used in disk systems to improve the response time and throughput when reading data. In a RAID level-5 disk array, however, it can serve a secondary purpose. If the old data required for computing the new parity is in the cache, read caching reduces the number of disk accesses required for small writes from four to three. This is very likely, for example, in transaction-processing systems where records are frequently updated by reading the old value, changing it, and writing back the new value to the same location.

Also, by caching recently written parity, the read of the old parity can sometimes be eliminated, further reducing the number of disk accesses for small writes from three to two. Because parity is computed over many logically consecutive disk sectors, the caching of parity exploits both temporal and spatial locality. This is in contrast to the caching of data which, for the purposes of reducing disk operations on small writes, relies on the assumption that recently read sectors are likely to be written rather than on the principle of spatial locality. Of course, caching parity blocks reduces the space available for caching data, which may increase the number of data misses.

4.1.2 Floating Parity

Menon et al. [1993] proposed a variation on the organization of parity in RAID level-5 disk arrays, called floating parity, that shortens the read-modify-write of parity updated by small, random writes to little more than a single disk access time on average. Floating parity clusters parity into cylinders, each containing a track of free blocks. Whenever a parity block needs to be updated, the new parity block can be written on the rotationally nearest unallocated block following the old parity block. Menon et al. show that for disks with 16 tracks per cylinder, the nearest unallocated block immediately follows the parity block being read 65% of the time, and the average number of blocks that must be skipped to get to the nearest unallocated block is small, between 0.7 and 0.8. Thus, the writing of the new parity block can usually occur immediately after the old parity block is read, making the entire read-modify-write access only about a millisecond longer than a read access.

To implement floating parity efficiently, directories for the locations of unallocated blocks and parity blocks must be stored in primary memory. These tables are about 1 MB in size for each disk array containing four to ten 500 MB disks. To exploit unallocated blocks immediately following the parity data being read, the data must be modified and a disk head switched to the track containing the unallocated block before the disk rotates through an inter-sector gap. Because of these constraints, and because only a disk controller can have exact knowledge of its geometry, floating parity is most likely to be implemented in the disk controller.
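The heart of floating parity is choosing, from an in-memory directory of unallocated blocks, the rotationally nearest free block that follows the old parity block within its cylinder. A sketch of that selection (ours; the cylinder is modeled as a circular sequence of block slots, which is a simplifying assumption):

```python
def nearest_free_block(old_parity_slot, free_slots, blocks_per_cylinder):
    """Return (slot, blocks_skipped) for the first unallocated slot at or
    after the slot following the old parity block, scanning in rotational
    order around the cylinder. Menon et al. report that 0.7 to 0.8 blocks
    are skipped on average for 16-track cylinders."""
    free = set(free_slots)
    for skipped in range(blocks_per_cylinder):
        slot = (old_parity_slot + 1 + skipped) % blocks_per_cylinder
        if slot in free:
            return slot, skipped
    return None, None   # no free block left in this cylinder

# Example: parity was read from slot 10; slots 11 and 12 are allocated, 13 is free.
print(nearest_free_block(10, {5, 13, 20}, blocks_per_cylinder=96))   # (13, 2)
```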


Menon et al. [1993] also propose floating data as well as parity. This makes the overhead for small writes in RAID level-5 disk arrays comparable to mirroring. The main disadvantage of floating data is that logically sequential data may end up discontinuous on disk. Also, floating data requires much more free disk space than floating only the parity since there are many more data blocks than parity blocks.

4.1.3 Parity Logging

Stodolsky and Gibson [1993] propose an approach, called parity logging, to reduce the penalty of small writes in RAID level-5 disk arrays [Bhide and Dias 1992]. Parity logging reduces the overhead for small writes by delaying the read of the old parity and the write of the new parity. Instead of updating the parity immediately, an update image, which is the difference between the old and new parity, is temporarily written to a log. Delaying the update allows the parity to be grouped together in large contiguous blocks that can be updated more efficiently.

This delay takes place in two parts. First, the parity update image is stored temporarily in nonvolatile memory. When this memory, which could be a few tens of KB, fills up, the parity update image is written to a log region on disk. When the log fills up, the parity update image is read into memory and added to the old parity. The resulting new parity is then written to disk. Although this scheme transfers more data to and from disk, the transfers are in much larger units and are hence more efficient; large sequential disk accesses are an order of magnitude more efficient than small random accesses (Section 2.1). Parity logging reduces the small write overhead from four disk accesses to a little more than two disk accesses, the same overhead incurred by mirrored disk arrays. The costs of parity logging are the memory used for temporarily storing update images, the extra disk space used for the log of update images, and the additional memory used when applying the parity update image to the old parity. This technique can be applied also to the second copy of data in mirrored disk arrays to reduce the cost of writes in mirrored disk arrays from two to a little more than one disk access [Orji and Solworth 1993].
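A sketch of the parity-logging flow just described (ours; the buffer sizes, the `log_disk`/`parity_disk` abstractions, and the XOR helper are all assumptions, not the authors' implementation): update images accumulate in nonvolatile memory, spill to an on-disk log in large sequential writes, and are applied to the on-disk parity only when the log fills.

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

class ParityLog:
    def __init__(self, nvram_capacity, log_capacity, log_disk, parity_disk):
        self.nvram = {}                  # parity address -> accumulated update image
        self.nvram_capacity = nvram_capacity
        self.log_capacity = log_capacity
        self.log_disk = log_disk         # hypothetical append-only log region on disk
        self.parity_disk = parity_disk   # hypothetical store holding the real parity blocks

    def small_write(self, parity_addr, old_data, new_data):
        # The update image is old_data XOR new_data; the read of the old
        # parity and the write of the new parity are both deferred.
        image = xor(old_data, new_data)
        if parity_addr in self.nvram:
            image = xor(image, self.nvram[parity_addr])
        self.nvram[parity_addr] = image
        if len(self.nvram) >= self.nvram_capacity:
            self._spill_nvram()

    def _spill_nvram(self):
        # One large sequential write of every buffered update image.
        self.log_disk.append(list(self.nvram.items()))
        self.nvram.clear()
        if self.log_disk.size() >= self.log_capacity:
            self._apply_log()

    def _apply_log(self):
        # Large sequential reads of the log and the old parity, then one
        # large write of the regenerated parity.
        for parity_addr, image in self.log_disk.read_all():
            old_parity = self.parity_disk.read(parity_addr)
            self.parity_disk.write(parity_addr, xor(old_parity, image))
        self.log_disk.truncate()
```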

4.2 Declustered Parity

Many applications, notably database and transaction processing, require both high throughput and high data availability from their storage systems. The most demanding of these applications require continuous operation—the ability to satisfy requests for data in the presence of disk failures while simultaneously reconstructing the contents of failed disks onto replacement disks. It is unacceptable to fulfill this requirement with arbitrarily degraded performance, especially in long-lived real-time applications such as video service; customers are unlikely to tolerate movies played at a slower speed or having their viewing terminated prematurely.

Unfortunately, disk failures cause large performance degradations in standard RAID level-5 disk arrays. In the worst case, a workload consisting entirely of small reads will double the effective load at nonfailed disks due to extra disk accesses needed to reconstruct data for reads to the failed disk. The additional disk accesses needed for complete reconstruction of the failed disk increase the load even further.

In storage systems that stripe data across several RAIDs, the average increase in load is significantly less than in RAIDs with one large parity group, but the RAID with the failed disk still experiences a 100% increase in load in the worst case. The failed RAID creates a hot spot that degrades the performance of the entire system. The basic problem in these large systems is that although inter-RAID striping distributes load uniformly when no disk is failed, it nonuniformly distributes the increased load that results from a failed disk; the small set of disks in the same parity group as the failed disk bear the entire weight of the increased load. The declustered-parity RAID organization solves this problem by distributing the increased load uniformly over all disks [Muntz and Lui 1990; Merchant and Yu 1992; Holland and Gibson 1992; Holland et al. 1993; Ng and Mattson 1992].


Figure 8. Standard versus declustered-parity RAID. This figure illustrates examples of standard and declustered-parity RAID with eight disks and a parity group size of four. Identically labeled blocks belong to the same parity group. In the standard RAID organization, parity groups are composed of disks from one of two nonoverlapping subsets of disks. In the declustered-parity RAID, parity groups span many overlapping subsets of disks.


Figure 8 illustrates examples of standard and declustered-parity RAIDs for systems with an array size of eight disks and a parity group size of four. In this case, a multiple-RAID system is constructed by striping data over two RAIDs of four disks each with non-overlapping parity groups. The declustered-parity RAID is constructed by overlapping parity groups. If Disk 2 fails, each read to Disk 2 in the standard, multiple RAID generates a single disk access to Disks 0, 1, and 3 and no disk access to Disks 4, 5, 6, and 7. In the declustered-parity RAID, a random read to Disk 2 generates an access to Disks 4, 5, and 7 one-quarter of the time; to Disks 0, 1, and 3 half of the time; and to Disk 6 three-quarters of the time. Although the increased load is not uniform, it is more balanced than in the standard RAID. Slightly more complex declustered-parity RAIDs exist that distribute the load uniformly such that each read to Disk 2 generates an average of 0.429 disk accesses to all nonfailed disks.

The simplest way to create a declustered-parity RAID that distributes load uniformly is to create a set of parity groups including every possible mapping of parity group members to disks. In our example, this would result in (8 choose 4) = 70 distinct mappings of parity groups to disks. For nearly all practical array and parity group sizes, declustered-parity RAID organizations are possible that distribute reconstruction load uniformly with much fewer than the combinatorial number of parity groups. Such organizations can be devised using the theory of


balanced incomplete block designs [Hall 1986]. In practice, the load does not need to be absolutely balanced, and a close approximation is sufficient.
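The brute-force construction mentioned above, with one parity group for every possible assignment of group members to disks, is easy to write down. The sketch below (ours) enumerates the (8 choose 4) = 70 mappings for the Figure 8 parameters and confirms that the reconstruction load of a failed disk spreads evenly over the survivors.

```python
from itertools import combinations

NUM_DISKS, GROUP_SIZE = 8, 4

# One parity group for every possible choice of GROUP_SIZE disks.
groups = list(combinations(range(NUM_DISKS), GROUP_SIZE))
print(len(groups))   # C(8, 4) = 70

# If disk `failed` dies, each group containing it adds one unit of
# reconstruction work to each surviving member; count that work per disk.
failed = 2
load = [0] * NUM_DISKS
for g in groups:
    if failed in g:
        for d in g:
            if d != failed:
                load[d] += 1
print(load)   # every surviving disk carries the same count (15); the failed disk shows 0
```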

To summarize, a declustered-parity RAID is often preferable to a standard, multiple RAID because it distributes load uniformly during both the normal and failed modes of operation. This allows a more graceful degradation in performance when a disk fails and allows the failed disk to be reconstructed more quickly since all disks in the disk array can participate in its reconstruction. Additionally, unlike the example in Figure 8, the disk array size in a declustered-parity RAID does not have to be a multiple of the parity group size. Any combination of array and parity group sizes such that the array size is greater than the parity group size is feasible. Declustered-parity RAID has two main disadvantages. First, it can be somewhat less reliable than standard, multiple RAID; any two disk failures will result in data loss since each pair of disks has a parity group in common. In a standard, multiple RAID, the parity groups are disjoint, so it is possible to have more than one disk failure without losing data as long as each failure is in a different parity group. Second, the more complex parity groups could disrupt the sequential placement of data across the disks. Thus, large requests are more likely to encounter disk contention in declustered-parity RAID than in standard, multiple RAID. In practice, it is difficult to construct workloads where this effect is significant.

4.3 Exploiting On-Line Spare Disks

On-line spare disks allow reconstruction of failed disks to start immediately, reducing the window of vulnerability during which an additional disk failure would result in data loss. Unfortunately, they are idle most of the time and do not contribute to the normal operation of the system. This section describes two techniques, distributed sparing and parity sparing, that exploit on-line spare disks to enhance performance during the normal operation of the system.

Figure 9. Distributed sparing. Distributed sparing distributes the capacity of the spare disk throughout the array. This allows all disks, including the disk that would otherwise have been a dedicated spare, to service requests. This figure illustrates a RAID level-5 disk array with distributed sparing. The 'P's denote parity blocks, and the 'S's denote spare blocks.

As Figure 9 illustrates, distributed sparing distributes the capacity of a spare disk across all the disks in the disk array [Menon et al. 1991]. The distribution of spare capacity is similar to the distribution of parity in RAID level-5 disk arrays. Instead of N data disks and one spare disk, distributed sparing uses N + 1 data disks that each have 1/(N + 1)th spare capacity. When a disk fails, the blocks on the failed disk are reconstructed onto the corresponding spare blocks. Distributed sparing obviates dedicated spare disks, allowing all disks to participate in servicing requests and thereby improving performance during the normal operation of the disk array. Additionally, because each disk is partially empty, each disk failure requires less work to reconstruct the contents of the failed disk. Distributed sparing has a few disadvantages. First, the reconstructed data must eventually be copied onto a permanent replacement for the failed disk. This creates extra work for the disk array, but, since the copying can be done leisurely, it does not significantly affect performance. Second, because the reconstructed data is distributed across many disks whereas it was formerly on a single disk, reconstruction disturbs the original data placement, which can be a concern for some I/O-intensive applications. In disk arrays with dedicated spares, the data placement after reconstruction is identical to the data placement before reconstruction.
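One way to picture a distributed-sparing layout in the spirit of Figure 9 is sketched below (ours; the particular rotation pattern is one plausible choice, not necessarily the one in the figure): in each stripe, one disk holds the parity block, another holds the reserved spare block, and the rest hold data.

```python
def distributed_sparing_layout(num_disks, num_stripes):
    """Return layout[stripe][disk] in {'D', 'P', 'S'} (data, parity, spare)."""
    layout = []
    for stripe in range(num_stripes):
        parity_disk = stripe % num_disks            # rotate parity, RAID level-5 style
        spare_disk  = (stripe + 1) % num_disks      # rotate the spare block as well
        row = ['D'] * num_disks
        row[parity_disk] = 'P'
        row[spare_disk]  = 'S'
        layout.append(row)
    return layout

for row in distributed_sparing_layout(num_disks=5, num_stripes=5):
    print(' '.join(row))
# Every disk services requests, and over a full rotation each disk
# reserves the same small fraction of its capacity as spare space.
```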


Figure 10. Parity sparing. Parity sparing is similar to distributed sparing except that the spare space is used to store a second set of parity information.

Parity sparing is similar to distributed sparing, except that it uses the spare capacity to store parity information [Chandy and Reddy 1993]. As with distributed sparing, this eliminates dedicated spare disks, improving performance during normal operation. The second set of parity blocks can be used in a variety of ways. First, they can be used to partition the disk array logically into two separate disk arrays, resulting in higher reliability. In Figure 10, for example, P0a might compute the parity over blocks 1 and 2 while P0b computes the parity over blocks 3 and 4. Second, the additional parity blocks can be used to augment the original parity groups. In Figure 10, if one assumes that the parity of blocks 1, 2, 3, 4, P0a, and P0b is always zero, write operations need to update only one of P0a or P0b. This has the benefit of improving small write performance by allowing each small write to choose the parity block it will update based on information such as the queue length and disk arm position at the two alternative disks. Third, the extra parity blocks can be used to implement P + Q redundancy. When a disk fails, the disk array converts to simple parity. By logical extension, a second disk failure would result in a RAID level-0 disk array.
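For the second use of the extra parity blocks, a small write may update whichever of the two alternative parity blocks is cheaper to reach at that moment. A tiny sketch of that choice (ours; the cost model combining queue length with a crude seek-distance estimate is an assumption):

```python
def choose_parity_block(candidates, queue_length, arm_position):
    """Pick whichever of the two alternative parity blocks (e.g. P0a or P0b)
    looks cheapest to update right now.

    candidates: list of (disk_id, cylinder) for the alternative parity blocks.
    queue_length[d]: outstanding requests at disk d.
    arm_position[d]: current cylinder of disk d's arm.
    """
    def cost(disk, cylinder):
        # Crude cost model: pending work plus estimated seek distance.
        return queue_length[disk] + abs(arm_position[disk] - cylinder)

    return min(candidates, key=lambda c: cost(*c))

# Example: P0a on disk 1 at cylinder 300, P0b on disk 4 at cylinder 120.
print(choose_parity_block([(1, 300), (4, 120)],
                          queue_length={1: 5, 4: 0},
                          arm_position={1: 100, 4: 110}))   # -> (4, 120)
```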

Both distributed sparing and parity sparing offer interesting ways to exploit on-line spares for improved performance. They are most effective for disk arrays with a small number of disks, where the fraction of spare disks to nonspare disks is likely to be large. As disk arrays become larger, a smaller fraction of spare disks is needed to achieve the same level of reliability [Gibson 1991].


4.4 Data Striping in Disk Arrays

Distributing data across the disk array speeds up I/Os by allowing a single I/O to transfer data in parallel from multiple disks or by allowing multiple I/Os to occur in parallel. The disk array designer must keep in mind several tradeoffs when deciding how to distribute data over the disks in the disk array to maximize performance, balancing two conflicting goals:

* Maximize the amount of useful data that each disk transfers with each logical I/O. Typically, a disk must spend some time seeking and rotating between each logical I/O that it services. This positioning time represents wasted work—no data is transferred during this time. Hence it is beneficial to maximize the amount of useful work done in between these positioning times.

* Utilize all disks. Idle times are similar to positioning times in that during idle times, no useful work is done. Idle times can arise in two different situations. First, hot spots can exist, where certain disks (the hot disks) are more heavily used than other disks (the cold disks) [Friedman 1993; Wilmot 1989]. Second, it is possible that all disks could be used evenly when viewed over a long period of time but not evenly at every instant. For example, if there is only one request to the disk array and that request only uses one disk, then all other disks will remain idle.

These goals are in conflict because the schemes that guarantee use of all disks spread data widely among more disks and hence cause each disk to transfer less data per logical I/O. On the other hand, schemes that maximize the amount of data transferred per logical I/O may leave some disks idle. Finding the right balance between these two goals is the main tradeoff in deciding how to distribute data among multiple disks and is heavily workload dependent.

Data striping, or interleaving, is the most common way to distribute data among multiple disks. In this scheme,


logically contiguous pieces of data are stored on each disk in turn. We refer to the size of each piece of data as the striping unit. The main design parameter in data striping is the size of this striping unit. Smaller striping units cause logical data to be spread over more disks; larger striping units cause logical data to be grouped, or clustered, together on fewer disks. Consequently, the size of the striping unit determines how many disks each logical I/O uses.

Because the interaction between workload and striping unit can have a substantial effect on the performance of a disk array with block-interleaved striping, Chen and Patterson [1990] developed rules of thumb for selecting a striping unit. Their simulation-based model evaluated a spindle-synchronized disk array of 16 disks. The stochastic workload applied to the disk array had two main parameters: average request size (varied from 4 to 1500 KB) and the number of concurrent, independent logical requests (varied from 1 to 20). Their goal was to find the size of a striping unit that gave the largest throughput for an incompletely specified workload. They found that the most important workload parameter was concurrency. When the concurrency of the workload was known, a simple formula specified a striping unit that provided 95% of the maximum throughput possible for any particular request distribution:

1 sector + 1/4 * average positioning time * data transfer rate * (concurrency - 1)

where the average positioning time is the disk's average seek time for the workload plus an average rotational delay. A striping unit selected by this expression is small when the concurrency is low, so that every access can utilize all disks, and larger when the concurrency is high, so that more different accesses can be serviced in parallel. Intuitively, the product of average positioning time and data transfer rate balances the benefits and the costs of striping data. The benefit of striping is the decreased transfer time of a single request, which saves approximately the transfer time of a stripe unit. The cost of striping is the increased disk utilization that arises from an additional disk positioning itself to access the data. The constant, 1/4, is sensitive to the number of disks in the array [Chen and Lee 1993].

If nothing is known about a workload's concurrency, Chen and Patterson [1990] found that a good compromise size for a striping unit is

2/3 * average positioning time * data transfer rate.

The constant, 2/3, is sensitive to the number of disks in the array; research needs to be done quantifying this relationship.

Lee and Katz [1991a] use an analytic model of nonredundant disk arrays to derive an equation for the optimal size of data striping. The disk array system they model is similar to that used by Chen and Patterson [1990] described above. They show that the optimal size of data striping is equal to

sqrt( P * X * (L - 1) * Z / N )

where P is the average disk positioning time, X the average disk transfer rate, L the concurrency, Z the request size, and N the array size in disks. Their results agree closely with those of Chen and Patterson. In particular, note that their equation also predicts that the optimal size of data striping depends on only the relative rate at which a disk positions and transfers data, PX, rather than on P or X individually. Lee and Katz show that the optimal striping unit depends on request size; Chen and Patterson show that this dependency can be ignored without significantly affecting performance.
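The rules of thumb and the analytic optimum can be compared numerically. The sketch below (ours; the disk parameters are illustrative, not taken from either paper) evaluates the Chen and Patterson expressions and the Lee and Katz expression for one hypothetical disk array.

```python
import math

# Illustrative disk and workload parameters (assumptions, not values from the papers).
avg_positioning_time = 0.015      # seconds: average seek plus rotational delay
transfer_rate        = 2_000_000  # bytes/second
sector               = 512        # bytes
concurrency          = 5          # L: concurrent, independent logical requests
request_size         = 64 * 1024  # Z, bytes
num_disks            = 16         # N

# Chen and Patterson rule of thumb when the concurrency is known:
# 1 sector + 1/4 * positioning time * transfer rate * (concurrency - 1)
cp_known = sector + 0.25 * avg_positioning_time * transfer_rate * (concurrency - 1)

# Chen and Patterson compromise when the concurrency is unknown:
cp_unknown = (2 / 3) * avg_positioning_time * transfer_rate

# Lee and Katz analytic optimum: sqrt(P * X * (L - 1) * Z / N)
lk_optimal = math.sqrt(avg_positioning_time * transfer_rate *
                       (concurrency - 1) * request_size / num_disks)

print(f"Chen/Patterson (known concurrency):   {cp_known / 1024:.0f} KB")
print(f"Chen/Patterson (unknown concurrency): {cp_unknown / 1024:.0f} KB")
print(f"Lee/Katz optimum:                     {lk_optimal / 1024:.0f} KB")
```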

Chen and Lee [1993] conducted a follow-up study to Chen and Patterson [1990] to determine the striping unit for


RAID level-5 disk arrays. Reads in a RAID level 5 are similar to reads (and writes) in a RAID level 0, causing the optimal striping unit for a read-intensive workload in a RAID level 5 to be identical to the optimal striping unit in a RAID level 0. For write-intensive workloads, however, the overhead of maintaining parity causes full-stripe writes (writes that span the entire parity group) to be more efficient than read-modify-write or reconstruct writes (writes that do not span an entire parity group). This additional factor causes the optimal striping unit for RAID level 5 to be smaller for write-intensive workloads than the striping unit for RAID level 0 by a factor of 4 for a 16-disk array. They also explored the relationship between the optimal striping unit and the number of disks and found that the optimal striping unit for reads varies inversely with the number of disks, but that the optimal striping unit for writes varies with the number of disks. Overall, they found that the optimal striping unit for workloads with an unspecified mix of reads and writes was independent of the number of disks and recommended (in the absence of specific workload information) that the striping unit for RAID level-5 disk arrays with any number of disks be set to

1/2 * average positioning time * data transfer rate.

Currently, researchers are investigating ways to distribute data other than a simple round-robin scheme. Some variations are: choosing a different striping unit for each file and distributing data by hashing or heat-balancing [Weikum and Zabback 1992; Scheuermann et al. 1991; Copeland et al. 1988].

4.5 Performance and Reliability Modeling

This section presents a brief summary of work that has been done in modeling the performance and reliability of disk arrays. General performance models for block-interleaved disk arrays are very difficult to formulate due to the presence of queuing and fork-join synchronization. That is, a disk array request consists of multiple component disk requests that must be queued and serviced independently, then joined together to satisfy the disk array request. Currently, exact solutions exist for certain two-server fork-join queues; however, the general k-server fork-join queue is an open research problem. Additionally, the bursty nature of most real I/O workloads is difficult to model using existing performance models, which generally deal only with the steady-state behavior of the system. Thus, most performance models of block-interleaved disk arrays place heavy restrictions on the configuration of the disk array or the types of workloads that can be modeled. So far, a satisfactory performance model for RAID level-5 disk arrays that models both reads and writes over a wide range of system and workload parameters has yet to be formulated.

Kim [1986] derives response time equations for synchronous byte-interleaved disk arrays by treating the entire disk array as an M/G/1 queuing system. That is, the entire disk array is modeled as an open queuing system with an exponential interarrival distribution, general service time distribution, and a single server consisting of all the disks in the disk array. The study compares the performance of an n-disk synchronous byte-interleaved disk array with n independent disks with uniform load and n independent disks with skewed load. She concludes that byte interleaving results in reduced transfer time due to increased parallelism in servicing requests and better load balancing but dramatically reduces the number of requests that can be serviced concurrently.

Kim and Tantawi [1991] derive approximate service time equations for asynchronous (disks rotate independently of one another) byte-interleaved disk arrays. Disk seeks are assumed to be distributed exponentially, and rotational latencies are assumed to be distributed uniformly. The results of the analytic equations are compared with the results of both synthetic and trace-driven


simulations. An important conclusion of the paper is that for a wide range of seek time distributions, the sum of the seek and rotational latency can be approximated by a normal distribution.

Chen and Towsley [1991] model RAID level-1 and RAID level-5 disk arrays analytically for the purpose of comparing their performance under workloads consisting of very small and large requests. Bounds are used for approximate modeling of the queuing and fork-join synchronization in RAID level-1 disk arrays. Small write requests in RAID level-5 disk arrays are handled by ignoring the fork-join synchronization overhead, resulting in a somewhat optimistic model. Large requests are modeled by using a single queue for all the disks in the disk array. The results of the model are compared against simulation.

Lee and Katz [1991a; 1993] derive approximate throughput and response time equations for block-interleaved disk arrays. Their model is the first analytic performance model for general block-interleaved disk arrays that takes into account both queuing and fork-join synchronization. Previous models have ignored either the queuing or the fork-join synchronization component of the system. Lee and Katz [1991a] also provide a simple application of the analytic model to determine an equation for the optimal unit of data striping in disk arrays.

In addition to analytic models specifically for disk arrays, work dealing with the modeling of fork-join queuing systems in general [Baccelli 1985; Flatto and Hahn 1984; Heidelberger and Trivedi 1982; Nelson and Tantawi 1988] is useful when modeling disk arrays. However, most of these papers model highly restrictive systems that are not easily applied to disk arrays.

The reliability of disk arrays is most frequently modeled using continuous-time Markov chains. The failure and recovery of components in the system cause transitions from one state to another. Generally, the most useful information derived from such models is the average time to system failure and the equilibrium state probabilities, from which one can determine the fraction of failures caused by each type of failure mode. A disadvantage of Markov reliability models is that the number of states necessary to model even simple disk arrays increases exponentially as new failure modes and system components are introduced. Fortunately, because the repair/replacement rates for components of most disk arrays are much higher than the failure rates, it is usually possible to simplify greatly the Markov models by eliminating states that very rarely occur. To date, Gibson [1991] presents the most complete reliability study of disk arrays.

5. CASE STUDIES

Since the first publication of the RAID taxonomy in 1987, the disk drive industry has been galvanized by the RAID concept. At least one market survey, prepared by Montgomery Securities [1991], predicted (optimistically) that the disk array market would reach $7.8 billion by 1994. Companies either shipping or having announced disk array products include: Array Technology Corporation (a subsidiary of Tandem), Ciprico, Compaq, Data General, Dell, EMC Corporation, Hewlett-Packard, IBM, MasPar, Maximum Strategies, Microtechnologies Corporation, Micropolis, NCR, StorageTek, and Thinking Machines. RAID technology has found application in all major computer system segments, including supercomputing, mainframes, minicomputers, workstation file servers, and PC file servers. We highlight some of these systems in the following subsections.

5.1 Thinking Machines Corporation ScaleArray

The TMC ScaleArray is a RAID level 3 for the CM-5, which is a massively parallel processor (MPP) from Thinking Machines Corporation (TMC). Announced in 1992, this disk array is designed for scientific applications characterized by high bandwidth for large files. Thinking Machines also provides a file system that


can deliver data from a single file to multiple processors from multiple disks [Lo Verso et al. 1993].

The base unit consists of eight IBM Model 0663E15 disks. These 3.5-inch disks contain 1.2 GB of data and can transfer up to 2 MB/second for reads and 1.8 MB/second for writes. A pair of disks is attached to each of four SCSI-2 strings, and these four strings are attached to an 8 MB disk buffer. Three of these base units are attached to the backplane, so the minimum configuration is 24 disks. TMC expects the 24 disks to be allocated as 22 data disks, one parity disk, and one spare, but these ratios are adjustable.

Perhaps the most interesting feature of the ScaleArray is that these base units are connected directly to the data-routing network of the CM-5. Normally, massively parallel processors reserve that network to send messages between processors, but TMC decided to use the same network to give them a scalable amount of disk I/O in addition to a scalable amount of processing. Each network link offers 20 MB/second, and there is a network link for each base unit. As a consequence of communicating with the data network and the small message size of the CM-5, the interleaving factor is only 16 bytes. Parity is calculated by an on-board processor and sent to the appropriate disk.

Using the scalable MPP network to connect disks means there is almost no practical limit to the number of disks that can be attached to the CM-5, since the machine was designed to be able to scale to over 16,000 nodes. At the time of announcement, TMC had tested systems with 120 disks. Using their file system and 120 disks (including a single parity disk), TMC was able to demonstrate up to 185 MB/second for reads and up to 135 MB/second for writes for 240 MB files. In another test, TMC demonstrated 1.5 to 1.6 MB/second per disk for reads and 1.0 to 1.1 MB/second per disk for writes as the number of disks scaled from 20 to 120. For this test, TMC sent 2 MB to each disk from a large file.

5.2 StorageTek Iceberg 9200 Disk Array Subsystem

StorageTek undertook the development of disk array-based mainframe storage products in the late 1980s. Their array, called Iceberg, is based on collections of 5.25-inch disk drives yet appears to the mainframe (and its IBM-written operating system) as more traditional IBM 3380 and 3390 disk drives. Iceberg implements an extended RAID level-5 and level-6 disk array. An array consists of 13 data drives, P and Q drives, and a hot spare. Data, parity, and Reed-Solomon coding are striped across the 15 active drives within the array. A single Iceberg controller can manage up to four such arrays, totalling 150 GB of storage.

Iceberg incorporates a number of innovative capabilities within its array controller, called Penguin. The controller itself is organized as an 8-processor system and executes its own real-time operating system. The controller can simultaneously execute eight channel programs and can independently transfer on four additional channels.

The controller manages a large, battery-backed semiconductor cache (from 64 MB up to 512 MB) in front of the disk array. This "extra level of indirection" makes possible several array optimizations. First, the cache is used as a staging area for compressing and decompressing data to and from disk. This compression can double the effective storage capacity of the disk array. Second, when written data is replaced in the cache, it is not written back to the same place on disk. In a manner much like Berkeley's Log-Structured File System [Rosenblum and Ousterhout 1991], data is written opportunistically to disk in large, track-sized transfer units, reducing random access latencies and performing adaptive load balancing. And third, the cache makes it possible to translate between the variable-length sectors used by most IBM mainframe applications and the fixed-size sectors of commodity small disk drives. StorageTek calls this process dynamic mapping.


Figure 11. NCR 6298 controller data path. The lock-step data path of the 6298 requires no memory for any operations except RAID level-5 writes. By placing the XOR and MUX directly in the data path, the controller can generate parity or reconstruct data on the fly.

The controller keeps track of free space within the array and must reclaim space that is no longer being used. The free-space data structures and the track tables mapping between logical IBM 3380 and 3390 disks and the actual physical blocks within the array are maintained in a separate, 8 MB, nonvolatile controller memory. Due to the complexity of the software for a system as ambitious as Iceberg, the product is over a year behind schedule, though at the time of this writing it is in beta test.

5.3 NCR 6298

The NCR 6298 Disk Array Subsystem, released in 1992, is a low-cost RAID subsystem supporting RAID levels 0, 1, 3, and 5. Designed for commercial environments, the system supports up to four controllers, redundant power supplies and fans, and up to 20 3.5-inch SCSI-2 drives. All components—power supplies, drives, and controllers—can be replaced while the system services requests. Though the system does not allow on-line spares, built-in diagnostics notify the host when a drive has failed, and reconstruction occurs automatically when a replacement drive is inserted.

The array controller architecture features a unique lock-step design (Figure 11) that requires almost no buffering. For all requests except RAID level-5 writes, data flows directly through the controller to the drives. The controller duplexes the data stream for mirroring configurations and generates parity for RAID level 3


synchronously with data transfer. On RAID level-3 reads, the system can optionally read the parity along with the data, providing an additional check of data integrity. This lock-step nature also means that RAID level-3 performance does not degrade when a single drive fails.

The RAID level-5 implementation does not support full-stripe writes. Instead, the write path uses an intermediate SRAM buffer. When a write occurs, the old data and parity are read (in lock-step) from disk, exclusive-ored together, and stored into a 64 KB SRAM parity buffer. As a side effect of data transfer from the host, the contents of the parity buffer are exclusive-ored with the data to generate the up-to-date parity, and the parity is written to the parity drive. While this design prohibits the overlap of data transfer for RAID level 5, the controller overlaps the drive-positioning operations. This parsimonious use of buffers, in contrast with architectures such as RAID-II, lowers the cost of the controller.

The lock-step data path is also used for reconstruction. Data and parity are read synchronously from the surviving drives, exclusive-ored together, and written to the replacement drive. Therefore, reconstruction is quite fast, approaching the minimum time of writing a single drive.

The host interface is fast, wide, differential SCSI-2 (20 MB/s), while the drive channels are fast, narrow SCSI-2 (10 MB/s). Because of the lock-step architecture, transfer bandwidth to the host is limited to 10 MB/s for RAID levels 0, 1, and 5. However, in RAID level-3 configurations, performance on large transfers has been measured at over 14 MB/s (limited by the host's memory system).

5.4 TickerTAIP / DataMesh

TickerTAIP/DataMesh is a research project at Hewlett-Packard Labs whose goal is to develop an array of "smart" disk nodes linked by a fast, reliable network [Cao et al. 1993] (Figure 12). Each node contains a disk, a CPU, and some local memory.

Figure 12. The TickerTAIP/DataMesh hardware architecture. A unique feature of the TickerTAIP architecture is the close association of a CPU to each disk drive in the array. This association allows each node to perform some of the processing needed to perform a disk array operation.

Disk array controller operations such as parity computation are distributed among these smart disk nodes, and the nodes communicate by message passing across the internal interconnect.

A unique feature of the TickerTAIP architecture is the close association of a CPU to each disk drive in the array (Figure 12). This association allows each node to perform some of the processing needed to perform a disk array operation. Additionally, a subset of nodes are connected to the host computers that are requesting data. Because more than one node can talk to the host computers, TickerTAIP can survive a number of node failures. In contrast, many other disk arrays have only one connection to host computers and hence cannot survive the failure of their disk array controller.

Currently, TickerTAIP exists as a small, 7-node prototype. Each node consists of a T800 transputer, 4 MB of local RAM, and one HP79560 SCSI disk drive. The TickerTAIP project is developing software to make the multiple, distributed processing nodes appear as a single, fast storage server. Early results show that, at least for computing parity, TickerTAIP achieves near-linear scaling [Cao et al. 1993].

5.5 The RAID-II Storage Server

RAID-II (Figure 13) is a high-bandwidth network file server designed and implemented at the University of California at Berkeley as part of a project to study high-performance, large-capacity, highly reliable storage systems [Chen et al. 1994; Drapeau et al. 1994; Katz et al. 1993].


Figure 13. RAID-II architecture. A high-bandwidth crossbar connects the network interface (HIPPI), disk controllers, multiported memory system, and parity computation engine (XOR). An internal control bus provides access to the crossbar ports, while external point-to-point VME links provide control paths to the surrounding SCSI and HIPPI interface boards. Up to two VME disk controllers can be attached to each of the four VME interfaces.

RAID-II interfaces a SCSI-based disk array to a HIPPI network. One of RAID-II's unique features is its ability to provide high-bandwidth access from the network to the disks without transferring data through the relatively slow file server (a Sun4/280 workstation) memory system. To do this, the RAID project designed a custom printed-circuit board called the XBUS card.

The XBUS card provides a high-bandwidth path among the major system components: the HIPPI network, four VME busses that connect to VME disk controllers, and an interleaved, multiported semiconductor memory. The XBUS card also contains a parity computation engine that generates parity for writes and reconstruction on the disk array. The data path between these system components is a 4 x 8 crossbar switch that can sustain approximately 160 MB/s. The entire system is controlled by an external Sun 4/280 file server through a memory-mapped control register interface. Figure 13 shows a block diagram for the controller.

To explore how the XBUS card enhances disk array performance, Chen et al. [1994] compare the performance of RAID-II to RAID-I (Table 7). RAID-I is basically RAID-II without the XBUS card [Chervenak and Katz 1991]. They find that adding a custom interconnect board with a parity engine improves performance by a factor of 8 to 15 over RAID-I. The maximum bandwidth of RAID-II is


Table 7. Performance Comparison between RAID-II and RAID-I

                   Disk Array Read    Disk Array Write    Write Performance
                   Performance        Performance         Degradation
RAID-I             2.4 MB/s           1.2 MB/s            50%
RAID-II            20.9 MB/s          18.2 MB/s           13%
RAID-II speedup    8.7                15.2

This table compares the performance of RAID-II to that of RAID-I. Because RAID-II has a special-purpose parity engine, disk array write performance is comparable to disk array read performance. All writes in this test are full-stripe writes [Lee and Katz 1991b]. For RAID-II reads, data is read from the disk array into XBUS memory, then sent over the HIPPI network back to XBUS memory. For RAID-I reads, data is read from the disk array into Sun4 memory, then copied again into Sun4 memory. This extra copy equalizes the number of memory accesses per data word. For RAID-II writes, data starts in XBUS memory, is sent over HIPPI back into XBUS memory, parity is computed, and the data and parity are written to the disk subsystem. For RAID-I writes, data starts in Sun4 memory, gets copied to another location in Sun4 memory, then is written to disk. Meanwhile, parity is computed on the Sun4 and later written to disk. RAID-I uses a 32 KB striping unit with 8 disks (and is performance-limited by the Sun4's VME bus); RAID-II uses a 64 KB striping unit with 24 disks.

between 20 and 30 MB/s, enough to support the full disk bandwidth of approximately 20 disk drives.

5.6 IBM Hagar Disk Array Controller

Hagar is a disk array controller pro-totype developed at the IBM AlmadenResearch Center [Menon and Courtney1993]. Hagar was designed for large ca-pacity (up to 1 TB), high bandwidth (upto 100 MB/s), and high 1/0 rate (up to5000 4 KB 1/0s per second). Addition-ally, Hagar provides high availabilitythrough the use of redundant hardwarecomponents, multiple power boundaries,and on-line reconstruction of data.

Two design features of Hagar are espe-cially noteworthy. First, Hagar uses bat-tery-backed memory to allow user writesto provide safe, asynchronous writes (asdiscussed in Section 4.1.1). The designersof Hagar require each write to be storedin two separate memory locations in twodifferent power regions to further in-crease reliability.

Second, Hagar incorporates a special-purpose parity computation engine inside the memory of the controller. This is in contrast to the RAID-II architecture, which places the parity engine as a port on the controller bus (Figure 13). The Hagar memory system supports a special store operation that performs an exclusive-or on the current contents of a memory location with the new data, then writes the result to that location. Incorporating the parity engine in the memory complicates the memory system, but it reduces the data traffic on the controller's internal data bus.
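In software terms, such an exclusive-or store behaves like a read-XOR-write on a buffer. The sketch below shows how a RAID level-5 small-write parity update could be expressed with that primitive; it illustrates only the XOR algebra, and the function and buffer names are hypothetical rather than taken from Hagar.

    #include <stddef.h>
    #include <stdint.h>

    /* Software analogue of the exclusive-or store: XOR new data into the
     * current contents of a buffer in place.  Hagar implements this inside
     * the controller memory; here it is an ordinary function for clarity. */
    static void xor_store(uint8_t *dst, const uint8_t *src, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            dst[i] ^= src[i];
    }

    /* A RAID level-5 small-write parity update built from XOR stores:
     * new parity = old parity XOR old data XOR new data. */
    void small_write_parity_update(uint8_t *parity_buf, const uint8_t *old_data,
                                   const uint8_t *new_data, size_t len)
    {
        xor_store(parity_buf, old_data, len);   /* remove old data's contribution */
        xor_store(parity_buf, new_data, len);   /* add new data's contribution */
    }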

Hagar was never fully operational; however, IBM is working on future disk array products that use ideas from Hagar.

6. OPPORTUNITIES FOR FUTURE RESEARCH

Redundant disk arrays have rejuvenated research into secondary storage systems over the past five to seven years. As this survey highlights, much has been proposed and examined, but much is left to do. This section discusses the classes of research that are not yet adequately understood, with particular attention to specific open problems.

6.1 Experience with Disk Arrays

As an over-five-year-old open research area that has sported products for at least six years, redundant disk arrays have remarkably few published measurement results and experience.



In addition to validating models and techniques found in the literature, such experience reports can play an important role in technology transfer [Buzen and Shum 1986]. Furthermore, measurements frequently form the basis for developing new optimizations.

6.2 Interaction among New Organizations

As this survey describes, there are manynew and different disk array organiza-tions. Most of these, including double

failure correction, declustered parity,parity logging, floating parity, distribut-ed sparing, log-structured file systems,and file-specific data striping, have been

studied only in isolation. Unquestion-ably, among these there will be signifi-

cant interactions, both serious newproblems and obvious simplifications oroptimizations.

As more is understood about the interactions among disk array technologies, designers and managers of disk arrays will be faced with the task of configuring and tuning arrays. As Section 4.5 discusses, redundant disk array performance and reliability modeling is largely incomplete and unsophisticated. Work needs to be done both in applying fundamental modeling to the problem of disk arrays and in developing that fundamental modeling, fork-join queuing models in particular. A good goal for this work is graphical, interactive analysis tools that exploit low-overhead monitoring data to guide configuration and tuning.
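As a small illustration of why fork-join behavior matters for such models, the following C sketch estimates by simulation how much longer a request that must wait for every disk of a stripe takes than a single-disk access. The exponential service-time distribution and the parameters are assumptions chosen only for the example, not measurements from any array discussed here.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Monte Carlo sketch (not a validated model): estimate the mean response
     * time of a fork-join request that splits into NDISKS disk accesses and
     * completes only when the slowest access finishes. */
    #define NDISKS      8
    #define MEAN_SVC_MS 15.0     /* assumed mean per-disk service time */
    #define TRIALS      100000

    static double exp_sample(double mean)
    {
        double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);  /* u in (0,1) */
        return -mean * log(u);
    }

    int main(void)
    {
        double sum_single = 0.0, sum_forkjoin = 0.0;

        for (int t = 0; t < TRIALS; t++) {
            double slowest = 0.0;
            for (int d = 0; d < NDISKS; d++) {
                double svc = exp_sample(MEAN_SVC_MS);
                if (svc > slowest)
                    slowest = svc;
            }
            sum_single   += exp_sample(MEAN_SVC_MS);  /* one disk, no fork-join */
            sum_forkjoin += slowest;                  /* wait for all NDISKS */
        }
        printf("mean single-disk service : %.1f ms\n", sum_single / TRIALS);
        printf("mean fork-join response  : %.1f ms\n", sum_forkjoin / TRIALS);
        return 0;
    }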

One objection lodged commonly against redundant disk arrays, particularly some of the newly proposed technologies, is their relatively high complexity. Storage systems are responsible for more than just the availability of our data; they are responsible for its integrity. As complexity goes up, the opportunity for disastrous latent bugs also rises. This is compounded by the desire to increase performance by continuing computation as soon as storage modifications are delivered to storage server memory, that is, before these modifications are committed to disk. Inexpensive and highly reliable mechanisms are needed to control the vulnerability of storage systems to this increased software complexity.

6.3 Scalability, Massively Parallel Computers, and Small Disks

One of the key motivations for redundant disk arrays is the opportunity to increase data parallelism in order to satisfy the data processing needs of future generations of high-performance computers. This means that arrays must scale up with the massively parallel computers that are being built and the even more massively parallel computers being planned. Massively parallel disk arrays introduce many problems: physical size, connectivity, delivery-system bottlenecks, and storage control processing requirements, to name a few. The most compelling approach to ever larger disk arrays is to embed storage based on the new generations of small-diameter disks into the fabric of massively parallel computers, use the computer's interconnection network for data distribution and redundancy maintenance, and distribute the storage control processing throughout the processors of the parallel computer.

Though compelling, this approach has substantial problems to be overcome. Primary among these are the impact on the interconnection network of distributing the redundancy computations [Cao et al. 1993], the impact on the processors of distributing storage control, and the viability of allocating data on storage devices near the processors that will use it.

6.4 Latency

Redundant disk arrays are fundamentally designed for throughput, either high transfer rates for large, parallel transfers or large numbers of concurrent small accesses. They are effective in reducing access latency only when that latency is limited by throughput. For lower-throughput workloads, disk arrays enhance storage performance only slightly over traditional storage systems.

Caching is the main mechanism for reducing access latency, but caching can be ineffective when data is too large, too infrequently accessed, or too frequently migrated among caches.



For these workloads, data prefetching is essential. Research into aggressive prefetching systems is beginning to examine opportunities to extract or predict future accesses and to provide mechanisms that efficiently utilize available resources in anticipation of these accesses [Korner 1990; Kotz and Ellis 1991; Gibson et al. 1992; Patterson et al. 1993; Tait and Duchamp 1991].

7. CONCLUSIONS

Disk arrays have moved from research ideas in the late 1980s to commercial products today. The advantages of using striping to improve performance and redundancy to improve reliability have proven so compelling that most major computer manufacturers are selling or intending to sell disk arrays. Much research and implementation have been accomplished, both in industry and universities, but many theoretical and practical issues remain unresolved. We look forward to many more fruitful years of disk array research.

ACKNOWLEDGMENTS

We thank Bill Courtright, Mark Holland, Jai Menon, and Daniel Stodolsky for reading an earlier draft of this article and for their many helpful comments. We are especially indebted to Bill Courtright and Daniel Stodolsky for writing the section of this article describing the NCR disk array.

ANNOTATED BIBLIOGRAPHY

AMDAHL, G. M. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the AFIPS 1967 Spring Joint Computer Conference. Vol. 30. AFIPS, Washington, D.C., 483-485. Three-page paper that eloquently gives the case for traditional computers by pointing out that performance improvement is limited by the portion of the computation that is not improved.

BACCELLI, F. 1985. Two parallel queues created by arrivals with two demands. Tech. Rep. 426, INRIA, Rocquencourt, France. Derives an exact solution for the two-server, M/G/1 fork-join queue.

BHIDE, A. AND DIAS, D. 1992. RAID architectures for OLTP. Tech. Rep. RC 17879 (#78489), IBM, Yorktown Heights, N.Y. Increases throughput for workloads emphasizing small, random write accesses in a redundant disk array by logging changes to parity for efficient application later. Parity changes are logged onto a separate disk which must be externally sorted before application to the disk array's parity.

BITTON, D. AND GRAY, J. 1988. Disk shadowing. In Very Large Database Conference XIV. Morgan Kaufmann, San Mateo, Calif., 331-338. Describes disk mirroring and derives an analytical equation for read and write seek distances as a function of the number of data copies.

BURKHARDT, W. AND MENON, J. 1993. Disk array storage system reliability. In the 23rd Annual International Symposium on Fault-Tolerant Computing. IEEE Computer Society, Washington, D.C., 432-441. Argues the need for multiple-error-correcting disk arrays; discusses how to use maximal-distance-separable codes to protect against multiple errors in a space-efficient manner.

BUZEN, J. AND SHUM, W. C. 1986. I/O architecture in MVS/370 and MVS/XA. CMG Trans. 54 (Fall), 19-26. Overview of the MVS/370 and MVS/XA I/O architecture. Describes channel paths, storage directors, string controllers, rotational position sensing, and static and dynamic reconnect.

CAO, P., LIM, S. B., VENKATARAMAN, S., AND WILKES, J. 1993. The TickerTAIP parallel RAID architecture. In Proceedings of the 1993 International Symposium on Computer Architecture. IEEE, New York. Describes the TickerTAIP architecture, software implementation issues, and the performance of different methods of distributing parity computation among multiple processors.

CHANDY, J. AND REDDY, A. L. N. 1993. Failure evaluation of disk array organizations. In Proceedings of the International Conference on Distributed Computing Systems. IEEE Computer Society, Washington, D.C. Contrasts four previously described schemes for minimizing data reconstruction time in small (7 and 16 disks) redundant disk arrays: RAID level 5 with a single spare disk, RAID level 5 with a single spare whose space is distributed across all disks, a special case of Muntz and Lui's parity-clustering organization, and a method of dynamically converting a redundant data disk to a spare disk by merging two redundancy groups into one larger group. The second, distributed sparing, is generally preferred because of its performance and simplicity, but the Muntz scheme is better for minimizing the impact on user performance during recovery.

CHEN, P. M., GIBSON, G., KATZ, R., AND PATTERSON, D. A. 1990. An evaluation of redundant arrays of disks using an Amdahl 5890. In Proceedings of the 1990 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. ACM, New York. The first experimental evaluation of RAID. Compares RAID levels 0, 1, and 5.

CHEN, P. M. AND PATTERSON, D. A. 1990. Maximizing performance in a striped disk array. In Proceedings of the 1990 International Symposium on Computer Architecture. IEEE, New York, 322-331. Discusses how to choose the striping unit for a RAID level-0 disk array.

CHEN, S. AND TOWSLEY, D. 1991. A queueing analysis of RAID architectures. Tech. Rep. COINS 91-71, Dept. of Computer and Information Science, Univ. of Massachusetts, Amherst, Mass. Analytically models RAID level-1 and RAID level-5 disk arrays to compare their performance on small and large requests. Bounds are used to model the queuing and fork-join synchronization in RAID level-1 disk arrays. Small write requests in RAID level-5 disk arrays are handled by ignoring the fork-join synchronization overhead. Large requests are modeled by using a single queue for all the disks in the disk array.

CHEN, P. M. AND LEE, E. K. 1993. Striping in a RAID level-5 disk array. Tech. Rep. CSE-TR-181-93, Univ. of Michigan, Ann Arbor, Mich. Discusses how to choose the striping unit for RAID level-5 disk arrays. Quantifies the effect of writes and varying number of disks.

CHEN, P. M., LEE, E. K., DRAPEAU, A. L., LUTZ, K., MILLER, E. L., SESHAN, S., SHIRRIFF, K., PATTERSON, D. A., AND KATZ, R. H. 1994. Performance and design evaluation of the RAID-II storage server. J. Distrib. Parall. Databases. To be published. Also in The 1993 International Parallel Processing Symposium Workshop on I/O in Parallel Computer Systems. Summarizes major architectural features of RAID-II and evaluates how they impact performance.

CHERVENAK, A. L. AND KATZ, R. H. 1991. Performance of a disk array prototype. In Proceedings of the 1991 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. Perf. Eval. Rev. 19, 5 (May), 188-197. Evaluates the performance of RAID-I, a U.C. Berkeley disk array prototype.

COPELAND, G., ALEXANDER, W., BOUGHTER, E., AND KELLER, T. 1988. Data placement in Bubba. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 99-108. Discusses data allocation in a large database.

DRAPEAU, A. L., SHIRRIFF, K., LEE, E. K., CHEN, P. M., GIBSON, G. A., HARTMAN, J. H., MILLER, E. L., SESHAN, S., KATZ, R. H., LUTZ, K., AND PATTERSON, D. A. 1994. RAID-II: A high-bandwidth network file server. In Proceedings of the 1994 International Symposium on Computer Architecture. IEEE, New York. Describes the architecture, file system, and performance of RAID-II, a disk array file server prototype.

EMLICH, L. W. AND POLICH, H. D. 1989. VAXsimPLUS, a fault manager implementation. Dig. Tech. J. 8 (Feb.). Describes Digital Equipment Corporation's tool for predicting and avoiding disk failures.

FLATTO, L. AND HAHN, S. 1984. Two parallel queues created by arrivals with two demands I. SIAM J. Comput. 44, 5 (Oct.), 1041-1053. Derives an exact solution for the two-server, M/M/1 fork-join queue.

FRIEDMAN, M. B. 1983. DASD access patterns. In the 14th International Conference on Management and Performance Evaluation of Computer Systems. Computer Measurement Group, 51-61. Looks at how much disk accesses are skewed toward particular disks in several transaction-processing sites.

GIBSON, G. A. 1992. Redundant disk arrays: Reliable, parallel secondary storage. Ph.D. thesis, Univ. of California, Berkeley, Calif. Also available from MIT Press, Cambridge, Mass. Award-winning dissertation that describes RAIDs in detail, with emphasis on reliability analysis of several alternatives.

GIBSON, G. A., PATTERSON, R. H., AND SATYANARAYANAN, M. 1992. Disk reads with DRAM latency. In the 3rd Workshop on Workstation Operating Systems. IEEE Computer Society, Washington, D.C. Proposes that applications give hints about their future file accesses so that the buffer cache can prefetch needed data and provide low-latency file access. The hints could also be exploited to improve cache management and disk scheduling.

GRAY, J., HORST, B., AND WALKER, M. 1990. Parity striping of disc arrays: Low-cost reliable storage with acceptable throughput. In Proceedings of the 16th Very Large Database Conference. Morgan Kaufmann, San Mateo, Calif., 148-160. Describes a data and parity layout for disk arrays called parity striping. Parity striping is essentially RAID level 5 with an infinite striping unit and manual load balancing.

HALL, M. 1986. Combinatorial Theory. 2nd ed. Wiley-Interscience, New York. Textbook on combinatorial theory. The section on balanced, incomplete block designs is most relevant for readers of this article.

HEIDELBERGER, P. AND TRIVEDI, K. S. 1982. Queueing network models for parallel processing with asynchronous tasks. IEEE Trans. Comput. C-31, 11 (Nov.), 1099-1109. Derives approximate solutions for queuing systems with forks but no joins.

HENNESSY, J. L. AND PATTERSON, D. A. 1990. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Mateo, Calif. Likely the most popular general book in computer architecture today; the discussions on technology trends, general I/O issues, and measurements of seek distances are most relevant to readers of this article.

HOLLAND, M. AND GIBSON, G. 1992. Parity declustering for continuous operation in redundant disk arrays. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-V). IEEE, New York, 23-35. Describes parity declustering, a technique for improving the performance of a redundant disk array in the presence of disk failure. Analyzes the proposed solution using detailed simulation and finds significant improvements (20-50%) in both user response time and reconstruction time. Also analyzes a set of previously proposed optimizations that can be applied to the reconstruction algorithm, concluding that they can actually slow the reconstruction process under certain conditions.

HOLLAND, M., GIBSON, G., AND SIEWIOREK, D. 1993. Fast, on-line failure recovery in redundant disk arrays. In Proceedings of the 23rd International Symposium on Fault-Tolerant Computing. IEEE Computer Society, Washington, D.C. Compares and contrasts two data reconstruction algorithms for disk arrays: "parallel stripe-oriented reconstruction" and "disk-oriented reconstruction." Presents an implementation of the disk-oriented algorithm and analyzes the reconstruction performance of these algorithms, concluding that the disk-oriented algorithm is superior. Investigates the sensitivity of the reconstruction process to the size of the reconstruction unit and the amount of memory available for reconstruction.

HSIAO, H. AND DEWITT, D. 1990. Chained declustering: A new availability strategy for multiprocessor database machines. In Proceedings of the 1990 IEEE International Conference on Data Engineering. IEEE, New York, 456-465. Introduces a variation of mirroring, where the secondary copy of data is distributed across the disks in a different manner than the primary copy of data.

KATZ, R. H. 1992. High performance network and channel-based storage. Proc. IEEE 80, 8 (Aug.), 1238-1261. Presents an overview of network-based storage systems. Reviews hardware and software trends in storage systems.

KATZ, R. H., CHEN, P. M., DRAPEAU, A. L., LEE, E. K., LUTZ, K., MILLER, E. L., SESHAN, S., AND PATTERSON, D. A. 1993. RAID-II: Design and implementation of a large scale disk array controller. In the 1993 Symposium on Integrated Systems. MIT Press, Cambridge, Mass. Describes the design decisions and implementation experiences from RAID-II.

KIM, M. Y. 1986. Synchronized disk interleaving. IEEE Trans. Comput. C-35, 11 (Nov.), 978-988. Simulates the performance of independent disks versus synchronized disk striping. Derives an equation for response time by treating the synchronized disk array as an M/G/1 queuing system.

KIM, M. Y. AND TANTAWI, A. N. 1991. Asynchronous disk interleaving: Approximating access delays. IEEE Trans. Comput. 40, 7 (July), 801-810. Derives an approximate equation for access time in unsynchronized disk arrays when seek times are exponentially distributed and rotational latency is uniformly distributed.

KORNER, K. 1990. Intelligent caching for remote file service. In Proceedings of the International Conference on Distributed Computing Systems. IEEE Computer Society, Washington, D.C., 220-226. Uses traces to generate hints based on the program running and the directory and name of files accessed. The file server uses the hints to pick a caching algorithm: LRU, MRU, none. Simulation showed significant benefits from intelligent caching but not from read-ahead, which delayed demand requests since it was not preemptable.

KOTZ, D. AND ELLIS, C. S. 1991. Practical prefetching techniques for parallel file systems. In Proceedings of the 1st International Conference on Parallel and Distributed Information Systems. ACM, New York, 182-189. File access predictors use past accesses to prefetch data in idle nodes of a parallel file system. Simulation studies show that practical predictors often can significantly reduce total execution time while the penalty for incorrect predictions is modest.

LEE, E. K. AND KATZ, R. H. 1991a. An analytic performance model of disk arrays and its applications. Tech. Rep. UCB/CSD 91/660, Univ. of California, Berkeley, Calif. Derives an analytic model for nonredundant disk arrays and uses the model to derive an equation for the optimal size of data striping.

LEE, E. K. AND KATZ, R. H. 1991b. Performance consequences of parity placement in disk arrays. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IV). IEEE, New York, 190-199. Investigates the performance of different methods of distributing parity in RAID level-5 disk arrays.

LEE, E. K. AND KATZ, R. H. 1993. An analytic performance model of disk arrays. In Proceedings of the 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. ACM, New York, 98-109. Similar to the earlier technical report with a similar name except with better empirical justifications and a more detailed study of the model's properties.

LIVNY, M., KHOSHAFIAN, S., AND BORAL, H. 1987. Multi-disk management algorithms. In Proceedings of the 1987 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. ACM, New York, 69-77. Compares performance of disk arrays with track-sized and infinite striping units. Concludes that striping can improve performance for many multidisk systems.

LOVERSO, S. J., ISMAN, M., AND NANOPOULOS, A. 1993. A parallel file system for the CM-5. In Proceedings of the USENIX Summer Conference. USENIX Association, Berkeley, Calif. A description of the I/O hardware and the file system of the massively parallel processor from Thinking Machines. Their RAID-3 disk array has excellent performance for large file accesses.

MALHOTRA, M. AND TRIVEDI, K. S. 1993. Reliability analysis of redundant arrays of inexpensive disks. J. Parall. Distrib. Comput. 17 (Jan.), 146-151. Uses Markov models to derive exact, closed-form reliability equations for redundant disk arrays. Analysis accounts for failure prediction and sparing.

MENON, J. AND CORTNEY, J. 1993. The architecture of a fault-tolerant cached RAID controller. In Proceedings of the 20th International Symposium on Computer Architecture. IEEE, New York, 76-86. Describes the architecture of Hagar and several algorithms for asynchronous writes that reduce susceptibility to data loss.

MENON, J., MATTSON, D., AND NG, S. 1991. Distributed sparing for improved performance of disk arrays. Tech. Rep. RJ 7943, IBM, Almaden Research Center. Explores the use of an on-line spare disk in a redundant disk array analytically. It examines multiple configurations, but fundamentally it distributes the spare's space over the whole array so that every disk is N/(N + 2) data, 1/(N + 2) parity, and 1/(N + 2) spare. This gives an extra 1/(N + 2) performance, but, more significantly, it distributes the recovery-write load (the reconstructed data) over all disks to shorten recovery time. The benefits, not surprisingly, are largest for small arrays.

MENON, J., ROCHE, J., AND KASSON, J. 1993. Floating parity and data disk arrays. J. Parall. Distrib. Comput. 17, 129-139. Introduces floating data and floating parity as an optimization for RAID level-5 disk arrays. Discusses performance and capacity overheads of the methods.

MERCHANT, A. AND YU, P. 1992. Design and modeling of clustered RAID. In Proceedings of the International Symposium on Fault-Tolerant Computing. IEEE Computer Society, Washington, D.C., 140-149. Presents an implementation of parity declustering, which the authors call "clustered RAID," based on random permutations. Its advantage is that it is able to derive a data mapping for any size disk array with any size parity stripe, and the corresponding disadvantage is that the computational requirements of the mapping algorithm are high compared to the block-design-based approaches. Analyzes response time and reconstruction time using this technique via an analytic model, and finds substantial benefits in both.

MONTGOMERY SECURITIES 1991. RAID: A technology poised for explosive growth. Tech. Rep. DJIA: 2902, Montgomery Securities, San Francisco, Calif. Industry projections of market growth for RAID systems from 1990 to 1995.

MUNTZ, R. R. AND LUI, C. S. 1990. Performance analysis of disk arrays under failure. In Proceedings of the 16th Conference on Very Large Data Bases. Morgan Kaufmann, San Mateo, Calif. Proposes and evaluates the "clustered-RAID" technique for improving the failure-recovery performance in redundant disk arrays. It leaves open the problem of implementation: no technique for efficiently mapping data units to physical disks is presented. Analyzes via an analytical model the technique and two potential "optimizations" to the reconstruction algorithm, and finds significant benefits to all three.

NELSON, R. AND TANTAWI, A. N. 1988. Approximate analysis of fork/join synchronization in parallel queues. IEEE Trans. Comput. 37, 6 (June), 739-743. Approximates response time in fork-join queuing systems with k >= 2 servers where each logical request always forks into k requests.

NG, S. 1994. Crossbar disk array for improved reliability and performance. In Proceedings of the 1994 International Symposium on Computer Architecture. IEEE, New York. Introduces schemes to protect against multiple failures of disk support hardware such as disk controllers and strings.

NG, S. AND MATTSON, D. 1991. Maintaining good performance in disk arrays during failure via uniform parity group distribution. In Proceedings of the 1st International Symposium on High Performance Distributed Computing, 260-269. Uses balanced, incomplete block designs to distribute the extra load from a failed disk equally among other disks in the array.

ORJI, C. U. AND SOLWORTH, J. A. 1993. Doubly distorted mirrors. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Describes a technique called distorted mirrors that partitions each of two mirrored disks into two halves, one of which lays out the data in a standard fashion, one of which "distorts" the data layout. This accelerates writes to the distorted copy while preserving the ability to read large files sequentially.

PATTERSON, D. A. AND HENNESSY, J. L. 1994. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann, San Mateo, Calif. A popular undergraduate book in computer architecture; the discussion on technology trends is most relevant to readers of this article.

PATTERSON, D. A., GIBSON, G., AND KATZ, R. H. 1988. A case for redundant arrays of inexpensive disks (RAID). In International Conference on Management of Data (SIGMOD). ACM, New York, 109-116. The first published Berkeley paper on RAIDs; it gives all the RAID nomenclature.

PATTERSON, R. H., GIBSON, G. A., AND SATYANARAYANAN, M. 1993. A status report on research in transparent informed prefetching. ACM Oper. Syst. Rev. 27, 2 (Apr.), 21-34. Expands on using application hints for file prefetching in Gibson et al. [1992]. Hints should disclose access patterns, not advise caching/prefetching actions. Greatest potential from converting serial accesses into concurrent accesses on a disk array. Presents preliminary results of user-level prefetching tests.

PETERSON, E. W. AND WELDON, E. J. 1972. Error-Correcting Codes. 2nd ed. MIT Press, Cambridge, Mass. A general textbook on the mathematics of error-correcting codes.

ROSENBLUM, M. AND OUSTERHOUT, J. K. 1991. The design and implementation of a log-structured file system. In Proceedings of the 13th ACM Symposium on Operating Systems Principles. ACM, New York. Describes a log-structured file system that makes all writes to disk sequential. Discusses efficient ways to clean the disk to prevent excessive fragmentation.

SALEM, K. AND GARCIA-MOLINA, H. 1986. Disk striping. In Proceedings of the 2nd International Conference on Data Engineering. IEEE Computer Society, Washington, D.C., 336-342. Early paper discussing disk striping.

SCHEUERMANN, P., WEIKUM, G., AND ZABBACK, P. 1991. Automatic tuning of data placement and load balancing in disk arrays. Database Systems for Next-Generation Applications: Principles and Practice. Describes heuristics for allocating files to disks to minimize disk skew.

SCHULZE, M., GIBSON, G., KATZ, R., AND PATTERSON, D. 1989. How reliable is a RAID? In Proceedings of the IEEE Computer Society International Conference (COMPCON). IEEE, New York. Gives a reliability calculation for the electronics as well as the disks for RAIDs.

SELTZER, M. I., CHEN, P. M., AND OUSTERHOUT, J. K. 1990. Disk scheduling revisited. In Proceedings of the Winter 1990 USENIX Technical Conference. USENIX Association, Berkeley, Calif., 313-324. Reexamines the problem of how to efficiently schedule a large number of disk accesses when accounting for both seek and rotational positioning delays.

STODOLSKY, D. AND GIBSON, G. A. 1993. Parity logging: Overcoming the small write problem in redundant disk arrays. In Proceedings of the 1993 International Symposium on Computer Architecture. Increases throughput for workloads emphasizing small, random write accesses in a redundant disk array by logging changes to the parity in a segmented log for efficient application later. Log segmentation allows log operations that are large enough to be efficient yet small enough to allow in-memory application of a log segment.

TAIT, C. D. AND DUCHAMP, D. 1991. Detection and exploitation of file working sets. In Proceedings of the International Conference on Distributed Computing Systems. IEEE Computer Society, Washington, D.C., 2-9. Dynamically builds and maintains program and data access trees to predict future file accesses. The current pattern is matched with previous trees to prefetch data and manage the local cache in a distributed file system. Trace-driven simulation shows reduced cache miss rates over a simple LRU algorithm.

WEIKUM, G. AND ZABBACK, P. 1992. Tuning of striping units in disk-array-based file systems. In Proceedings of the 2nd International Workshop on Research Issues on Data Engineering: Transaction and Query Processing. IEEE Computer Society, Washington, D.C., 80-87. Proposes file-specific striping units instead of a single, global one for all files.

WILMOT, R. B. 1989. File usage patterns from SMF data: Highly skewed usage. In the 20th International Conference on Management and Performance Evaluation of Computer Systems. Computer Measurement Group, 668-677. Reports on how files are accessed at four large data centers and finds that a small number of files account for most of all disk I/O.

Received November 1993; final revision accepted March 1994

