The Google File System 译文

时间:2021-01-29 03:37:09

The Google File System

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung



    We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
     我们设计和实现了Google File System,简称GFS,一个可扩展的分布式文件系统,用于大型分布式数据相关应用。它提供了基于普通商用硬件上的容错机制,同时对大量的客户端提供高性能的响应。
    While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment,
both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.
     GFS与 此前的分布式文件系统具有许多相同的目标,但我们的设计是基于对我们的应用负载和技术环境的观察而来,包含当前状况,也包含今后的发展,这与一些早期的文件系统的假定就有了分别。这驱使着我们去重新考虑传统的选择和探索新的设计点。
    The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.
     这 个文件系统成功的满足了我们的存储需求。在Google它被广泛的部署,我们的业务用其作为生成和处理数据的存储平台,同时也被用于节省在面对大量数据时 的研究和开发成本。当前最大的集群已经可以基于超过一千台机器上的数千个磁盘,来存储上万TB的数据,同时它也支持来自于上万个客户端的访问请求。
    In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both
micro-benchmarks and real world use.


    We have designed and implemented the Google File System (GFS) to meet the rapidly growing demands of Google’s data processing needs. GFS shares many of the same goals as previous distributed file systems such as performance, scalability, reliability, and availability. However, its design has been driven by key observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier
file system design assumptions. We have reexamined traditional choices and explored radically different points in the design space.
     我 们设计实现了GFS来应对来自Google快速增长的数据处理需求。GFS和此前的分布式文件系统具有某些相同的目标,如性能,可扩展型,可靠性和可用 性。然而,GFS的设计被Google的应用负载情况及技术环境所驱动,具有和以往的分布式文件系统不同的方面。我们从设计角度重新考虑了传统的选择,针 对这些不同点进行了探索。
First, component failures are the norm rather than the exception. The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts and is accessed by a comparable number of client machines. The quantity and quality of the  components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application
bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies. Therefore, constant monitoring, error detection, fault tolerance, and automatic recovery must be integral to the system.
     第 一,组件的失效比异常更加常见。文件系统包含了成百上千的基于普通硬件的存储机器,同时被大量的客户端机器访问,组件的数量和质量决定了在某个时刻一些组 件会失效而其中的一些无法从失效状态中恢复。我们曾经见到过由于下面的原因引发的实效:应用缺陷,OS缺陷,人为错误,磁盘/内存/连接器/网络/电源错 误等等,因此系统必须包含状态监视、错误检测、容错、自动恢复等能力。
    Second, files are huge by traditional standards. Multi-GB files are common. Each file typically contains many application objects such as web documents. When we are regularly working with fast growing data sets of many TBs comprising billions of objects, it is unwieldy to manage billions of approximately KB-sized files even when the file system could support it. As a result, design assumptions and parameters such as I/O operation and blocksizes have to be revisited.
     第 二,传统标准的文件量十分巨大,总量一般都会达到GB级别。文件通常包含许多应用对象,诸如Web文档等。当我们在工作中与日益增长的包含大量对象的TB 级的数据进行交互时,管理数以亿计的KB大小的文件是非常困难的。所以,设计假定和参数需要重新定义,如I/O操作和块大小等。
    Third, most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically non-existent. Once written, the files are only read, and often only sequentially. A variety of data share these characteristics. Some may constitute large repositories that data analysis programs scan through. Some may be data streams continuously generated by running applications. Some may be archival data. Some may be intermediate results produced on one machine and processed on another, whether simultaneously or later in time. Given this access pattern on huge files, appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.
     第 三,多数的文件变化是因为增加新的数据,而非重写原有数据。在一个文件中的随机写操作其实并不存在。一旦完成写入操作,文件就变成只读,通常也是顺序存 储。多种数据拥有这样的特征。构造大型存储区以供数据分析程序操作;运行应用产生的连续数据流;历史归档数据;一台机器产生的会被其他机器使用的中间数 据;对于巨大文件的访问模式,“增加”变成了性能优化的焦点,与此同时,在客户端进行数据块缓存逐渐失去了原有的意义。
    Fourth, co-designing the applications and the file system API benefits the overall system by increasing our flexibility. For example, we have relaxed GFS’s consistency model to vastly simplify the file system without imposing an onerous burden on the applications. We have also introduced an atomic append operation so that multiple clients can append concurrently to a file without extra synchronization between them. These will be discussed in more details later in the paper.
     第 四,统一设计应用和文件系统API对提升灵活性有着好处。例如,我们将GFS的一致性模型设计的尽量轻巧,使得文件系统得到极大的简化,应用系统也不会背 上沉重的包袱。我们还引入了一个原子Append操作,这样多个客户端可以同时向一个文件增加内容,而不会出现同步问题。这些将会在论文的后续章节进行讨 论。  
    Multiple GFS clusters are currently deployed for different purposes. The largest ones have over 1000 storage nodes, over 300 TB of disk storage, and are heavily accessed by hundreds of clients on distinct machines on a continuous basis.


2.1 Assumptions 假定

    In designing a file system for our needs, we have been guided by assumptions that offer both challenges and opportunities. We alluded to some key observations earlier and now lay out our assumptions in more details.
• The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
• The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.
• The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1 MB or more. Successive operations from the same client often read through a contiguous region of a file. A small random read typically reads a few KBs at some arbitrary
offset. Performance-conscious applications often batch and sort their small reads to advance steadily through the file rather than go back and forth.
系统 的负荷来自于两种读操作:大型顺序读,以及小型随机读。在大型顺序读的情况中,单个操作通常读取MB级别以上的数据。来自相同客户端的连续操作通常读取一 个文件的连续区间。小型随机读通常读取若干KB的数据据。关注性能的应用往往会将小型读操作进行打包和排序,从而使得在文件中平稳的读取,而非反复前后跳 转。
• The workloads also have many large, sequential writes that append data to files. Typical operation sizes are similar to those for reads. Once written, files are seldom modified again. Small writes at arbitrary positions in a file are supported but do not have to be efficient.
• The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Our files are often used as producer-consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently
append to a file. Atomicity with minimal synchronization overhead is essential. The file may be
read later, or a consumer may be reading through the file simultaneously.
对于 多个客户端并发向同一个文件进行Append操作的情况,系统必须有效的实现良好定义的语义。我们的文件常被用作“生产者-消费者队列“或者“多路合并 ”。数以百计的生产者,每个运行于单独的机器,并行向同一个文件添加数据。降低同步的困扰必不可少。文件可能后续被读取,也许一个消费者会同时读取。
• High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write.

2.2 Interface 接口

    GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by pathnames. We support the usual operations to create, delete, open, close, read, and write files.
    Moreover, GFS has snapshot and record append operations. Snapshot creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append. It is useful for implementing multi-way merge results and producerconsumer queues that many clients can simultaneously append to without additional locking. We have found these types of files to be invaluable in building large distributed applications. Snapshot and record append are discussed further in Sections 3.4 and 3.3 respectively.
     GFS 也拥有快照和Append记录操作。快照以最低成本创建一个文件或一个目录树的拷贝。Append记录允许多个客户端同时向一个文件进行Append操 作,同时确保每个单独客户端Append的原子性。这一点对于实现“多路合并”和“生产者-消费者队列”非常有意义,许多客户端可以同时进行Append 操作而不受额外的加锁限制。我们发现在构造大型分布式应用时,这种类型的文件非常有价值。快照和Append记录将在3.4和3.5章中详细讨论。

2.3 Architecture 架构

    A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, as shown in Figure 1. Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.
     一 个GFS集群由一个master和多个块服务器(Chunkserver)组成,被多个客户端所访问,如图1所示。每个机器都是廉价的Linux机器,运 行用户态服务进程。也可以将块服务器和客户端在同一台机器上运行,只要机器的资源允许,或者可以接受可能有问题的应用代码带来的低稳定性。  
The Google File System 译文
    Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers. By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace.
     文 件被分割成固定大小的块。每个块都使用一个不变的全局唯一的64位块句柄进行标识,这个句柄在master创建块时进行分配。块服务器在本地磁盘上像 Linux文件一样存储块,根据指定的块句柄和字节范围来读写块数据。为了可靠性,每个块被复制在多个块服务器上。缺省情况下,我们保存三分复制,用户也 可以为文件名称空间的不同地区指定不同的复制级别。
    The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
     Master 维护所有的文件系统元数据。它将包括名字空间,访问控制信息,文件与块的链接,以及块的当前位置。它还控制着系统层面的活动,诸如块租借管理,孤立块的垃 圾回收,块服务器之间的块迁移。master会定期的与块服务器使用心跳消息进行通信,发送指令给块服务器,以及收集块服务器的状态。
    GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application. Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers. We do not provide the POSIX API and therefore need not hook into the Linux vnode layer.
     嵌 入与应用中的GFS客户端代码实现了文件系统API,与master和块服务器进行通信,代为应用程序读写数据。客户端与master交互以进行元数据操 作,但是所有的数据通信都将直接访问块服务器。我们没有提供POSIX API,因此无需在Linux vnode层放置钩子。
    Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not having them simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.) Chunkservers need not cache file data because chunks are stored as local files and so Linux’s buffer cache already keeps frequently accessed data in memory.
     客 户端和块服务器都不会缓存文件数据。客户端进行缓存只有极少的益处,因为多数应用操作巨大的文件,而且工作输出的大小也超出的缓存的范围。没有缓存让客户 端和整个系统都变得简单,因为可以忘记缓存同步问题。(然后客户端还是会缓存元数据)块服务器也无需缓存文件数据,因为块在本地文件中存放,Linux的 缓冲区机制已经将频繁访问的数据放进了内存。

2.4 Single Master 单Master

    Having a single master vastly simplifies our design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. However, we must minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never read and write file data through the master. Instead, a client asks the master which chunkservers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.
     单 master极大的简化了我们的设计,同时也使得master可以给予全局知识进行复杂的块存储和复制策略。但是我们必须使得master在读写方面的占 用最小化,从而避免让它成为瓶颈。客户端从不直接从master读写数据。相反的,客户端会询问master该与哪个块服务器进行交互。而后它会将这个信 息缓存一段时间,接下来的操作会直接与这个块服务器进行交互。
    Let us explain the interactions for a simple read with reference to Figure 1. First, using the fixed chunk size, the client translates the file name and byte offset specified by the application
into a chunk index within the file. Then, it sends the master a request containing the file name and chunk index. The master replies with the corresponding chunk handle and locations of the replicas. The client caches this information using the file name and chunk index as the key.
     让 我们用图1来解释一下一个简单的读操作的交互过程。首先,使用固定的块大小,客户端将文件名和应用指定的偏移量转换成文件内部的块索引。然后,客户端向 master发送一个请求,包含文件名和块索引。master响应对应的块句柄和复本的位置。客户端将这些信息进行缓存,使用文件名和块索引作为Key。
    The client then sends a request to one of the replicas, most likely the closest one. The request specifies the chunk handle and a byte range within that chunk. Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened. In fact, the client typically asks for multiple chunks in the same request and the master can also include the information for chunks immediately following those requested. This
extra information sidesteps several future client-master interactions at practically no extra cost.
     客 户端向复本之一发送一个请求,通常是最近的一个。这个请求指定了块句柄和块内部的一个区间。接下来对于相同块的读取将不会再次进行客户端与master的 交互,直到缓存过期,或者文件被重新打开。事实上,客户端通常在一个请求中尝试读取多个块,master也会立即返回相应的块信息。这些额外的信息避免了 后续的一些客户端与master的交互,但又没有引入额外的成本。

2.5 Chunk Size 块大小

    Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed. Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.
    A large chunk size offers several important advantages. First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. Even for small random reads, the client can comfortably cache all the chunk location information for a multi-TB working set. Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata
stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.
     大 型的块有许多关键的好处。首先,它减少了客户端与master交互的需求,因为对于同一的块的读写,只需要向master发送一个获取块位置信息的初始请 求。这极大的降低系统的负荷,因为应用通常对大型文件进行顺序读写操作。即使对于小型随机读操作,客户端也可以轻松的对TB级别的工作集的块位置存储进行 缓存。第二,因为块足够大,客户端基本上是在一个给定的块上进行多次操作,这也可以降低网络方面的困难,因为可以在一个时间段内与块服务器之间保持一个持 久的TCP连接。第三,这使得可以减少在master上存储的元数据大小。这样的话,我们可以将元数据放入内存中,从而带来其他的将在2.6中讨论的好 处。
    On the other hand, a large chunks ize, even with lazy space allocation, has its  disadvantages. A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file. In practice, hot spots have not been a major issue because our applications mostly read large multi-chunk files sequentially.
     另 一方面,虽然可以进行“懒”空间分配,大型的块也有它的缺点。一个小文件包含较少的块,也可能只有一个。存储这些块的块服务器可能会变成“热点”,如果许 多客户端尝试访问相同的文件。在实践中,热点不会成为主要问题,因为我们的应用在大多数情况下,是顺序的对多个块的文件进行读操作。
    However, hot spots did develop when GFS was first used by a batch-queue system: an executable was written to GFS as a single-chunk file and then started on hundreds of machines
at the same time. The few chunkservers storing this executable were overloaded by hundreds of simultaneous requests. We fixed this problem by storing such executables with a higher replication factor and by making the batch queue system stagger application start times. A potential long-term solution is to allow clients to read data from other clients in such situations.
     但 是,当GFS被第一次用于一个批处理队列系统中试,热点还是出现了:一个可执行文件作为单块文件被写入GFS,然后在成百上千台机器上启动运行。保存这个 可执行文件的几台块服务器由于大量并发请求进入过载状态。我们采取了一些措施来解决这个问题,提高复本数量,以及让批处理队列系统错开应用的启动时间。一 个潜在的长期解决方案是:允许客户端在这种情况下从其他的客户端读取数据。

2.6 Metadata 元数据

    The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas. All metadata is kept in the master’s memory. The first two types (namespaces and file-to-chunkma pping) are also kept persistent by logging mutations to an operation log stored on the master’s local disk and replicated on remote machines. Using a log allows us to update the master state simply, reliably,
and without risking inconsistencies in the event of a master crash. The master does not store chunk location information persistently. Instead, it asks each chunkserver about its
chunks at master startup and whenever a chunkserver joins the cluster.
     Master 存储三种主要的元数据:文件和块的名字空间,文件和块的映射关系,每个块复本的位置。所有的元数据都保存在Master的内存中。前两类(名字空间和映射 关系)也作为操作日志被保存在Master的本地磁盘上,并且在远程机器上保存一个复本。使用日志使得我们更加简单、可靠的更新Master的状态,不用 担心由于Master死机造成的数据不一致。Master不会持久化块的位置信息,相反,Master启动时会向块服务器查询块的状态,并且一个块服务器 加入集群时也会进行相同的操作。

2.6.1 In-Memory Data Structures 内存中的数据结构

    Since metadata is stored in memory, master operations are fast. Furthermore, it is easy and efficient for the master to periodically scan through its entire state in the background. This periodic scanning is used to implement chunk garbage collection, re-replication in the presence of chunkserver failures, and chunkm igration to balance load and disk space usage across chunkservers. Sections 4.3 and 4.4 will discuss these activities further.
    One potential concern for this memory-only approach is that the number of chunks and hence the capacity of the whole system is limited by how much memory the master has. This is not a serious limitation in practice. The master maintains less than 64 bytes of metadata for each 64 MB chunk. Most chunks are full because most files contain many chunks, only the last of which may be partially filled. Similarly, the file namespace data typically requires less then 64 bytes per file because it stores file names compactly using prefix compression.
     对 于纯内存方式的潜在忧虑在于,块的数量、乃至于整个系统的容量受限于Master的内存大小。实践中这并不是一个严重的限制。对于每个64MB大小的 块,Master保存小于64字节的元数据。大多数的块是满的因为多数文件包含多个块,只有最后的一个是部分填充。相似的,每个文件的名字空间数据通常也 仅需要64字节,因为保存的文件名使用前缀压缩过。
    If necessary to support even larger file systems, the cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility we gain by storing the metadata in memory.

2.6.2 Chunk Locations 块位置

    The master does not keep a persistent record of which chunkservers have a replica of a given chunk. It simply polls chunkservers for that information at startup. The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages.
    We initially attempted to keep chunk location information persistently at the master, but we decided that it was much simpler to request the data from chunkservers at startup, and periodically thereafter. This eliminated the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on. In a cluster with hundreds of servers, these events happen all too often.
    Another way to understand this design decision is to realize that a chunkserver has the final word over what chunks it does or does not have on its own disks. There is no point in trying to maintain a consistent view of this information on the master because errors on a chunkserver may cause chunks to vanish spontaneously (e.g., a disk may go bad and be disabled) or an operator may rename a chunkserver.

2.6.3 Operation Log 操作日志

    The operation log contains a historical record of critical metadata changes. It is central to GFS. Not only is it the only persistent record of metadata, but it also serves as a logical time line that defines the order of concurrent operations. Files and chunks, as well as their versions (see Section 4.5), are all uniquely and eternally identified by the logical times at which they were created.
    Since the operation log is critical, we must store it reliably and not make changes visible to clients until metadata changes are made persistent. Otherwise, we effectively lose the whole file system or recent client operations even if the chunks themselves survive. Therefore, we replicate it on multiple remote machines and respond to a client operation only after flushing the corresponding log record to disk both locally and remotely. The master batches several log
records together before flushing thereby reducing the impact of flushing and replication on overall system throughput.
     由 于操作日志的重要性,我们必须以可靠的方式保存它,而且只有元数据的变动被持久化后,变动才会对客户端可见。否则,虽然块还存在,我们却可能丢失整个文件 系统或者最近的客户端操作。因此,我们将它的复本保存在多台远程机器上,并且只有在已经将日志输出到本地和远程的磁盘上后,才会对客户端的请求完成响应。 为了降低传输和备份对于整个系统的影响,在发送日志前,Master会将多个日志记录打包在一起。
    The master recovers its file system state by replaying the operation log. To minimize startup time, we must keep the log small. The master checkpoints its state whenever the log grows beyond a certain size so that it can recover by loading the latest checkpoint from local disk and replaying only the limited number of log records after that. The checkpoint is in a compact B-tree like form that can be directly mapped into memory and used for namespace lookup without extra parsing. This further speeds up recovery and improves availability.
     恢 复文件系统的状态时,Master重放操作日志。为了最小化启动时间,我们必须让日志尽量的小。每次日志超过一个指定的大小后,Master会对日志保存 检查点,这样系统可以先加载最新的检查点,而后只重放少数的日志就可以回退到最新状态。检查点是一个压缩B树的形式,可以直接被映射到内存中,并且使用名 称空间查询时无需额外的解析。这也将使恢复过程变得更快。
    Because building a checkpoint can take a while, the master’s internal state is structured in such a way that a new checkpoint can be created without delaying incoming mutations. The master switches to a new log file and creates the new checkpoint in a separate thread. The new checkpoint includes all mutations before the switch. It can be created in a minute or so for a cluster with a few million files. When completed, it is written to disk both locally and remotely.
     因 为创建一个检查点会花费一些时间,Master的内部状态被构造为一种形式,这种形式可以使得创建新检查点时不会对到来的变化产生延迟。Master会切 换到一个新的日志文件,并在另一个线程中创建一个新的检查点。新的检查点包含切换前所有的变动。一个百万级的集群的检查点可以在一分钟内完成创建。当结束 后,它将被写入到本地和远程的磁盘中。
    Recovery needs only the latest complete checkpoint and subsequent log files. Older checkpoints and log files can be freely deleted, though we keep a few around to guard against catastrophes. A failure during checkpointing does not affect correctness because the recovery code detects and skips incomplete checkpoints.

2.7 Consistency Model 一致性模型

    GFS has a relaxed consistency model that supports our highly distributed applications well but remains relatively simple and efficient to implement. We now discuss GFS’s guarantees and what they mean to applications. We also highlight how GFS maintains these guarantees but leave the
details to other parts of the paper.
The Google File System 译文
    The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations. Table 1 summarizes the result. A file region is consistent if all clients will always see the same data, regardless of which replicas they read from. A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety. When a mutation succeeds without interference
from concurrent writers, the affected region is defined (and by implication consistent): all clients will always see what the mutation has written. Concurrent successful mutations leave the region undefined but consistent: all clients see the same data, but it may not reflect what any one mutation has written. Typically, it consists of mingled fragments from multiple mutations. A failed mutation makes the region inconsistent (hence also undefined): different clients may see different data at different times. We describe below how our applications can distinguish defined regions from undefined regions. The applications do not need to further distinguish between different kinds of undefined regions.
     数 据变动后文件范围的状态取决于变动的类型,是否成功,是否是并发变动。表格1汇总了结果。如果所有的客户端不管从哪个复本读取,都一直能看见相同的数据, 则这个文件范围是一致的。在一个文件数据变动后,如果它是一致的,并且客户端可知变动的地方,则这个文件范围是已定义的。如果一个变动成功,则被影响的文 件范围是已定义的(隐含的一致性):所有的客户端一直都可见写入的变动。同步的成功变动使得范围是一致的但是未定义:所有的客户端看见相同的数据,但是它 也许不会表现出发生的变动。通常情况下,它包含多个变动的混合片段。一个失败的变动使得范围变成不一致(因此也是未定义):不同的客户端在不同时间可能看 见不同的数据。我们将会描述我们的程序如何能够分辨已定义和未定义的范围。应用不用去区分未定义的范围的种类。
    Data mutations may be writes or record appends. A write causes data to be written at an application-specified file offset. A record append causes data (the “record”) to be appended atomically at least once even in the presence of concurrent mutations, but at an offset of GFS’s choosing (Section 3.3). (In contrast, a “regular” append is merely a write at an offset that the client believes to be the current end of file.) The offset is returned to the client and marks the beginning of a defined region that contains the record. In addition, GFS may insert padding or record duplicates in between. They occupy regions considered to be inconsistent and are typically dwarfed by the amount of user data.
     数 据变动可以是写或记录追加。写会将数据写在应用指定的文件偏移位置。记录追加将把数据原子性的追加到文件中,但是GFS可以选择偏移位置(3.3)。(相 比而言,通常的追加仅指在文件的末尾)偏移量将会返回到客户端,并标识出包含记录的已定义范围的开始处。此外,GFS会在中间插入填充字符或者冗余记录。 它们占据被认为是不一致的范围,通常比用户数据的量少的多。
    After a sequence of successful mutations, the mutated file region is guaranteed to be defined and contain the data written by the last mutation. GFS achieves this by (a) applying mutations to a chunk in the same order on all its replicas (Section 3.1), and (b) using chunk version numbers to detect any replica that has become stale because it has missed mutations while its chunkserver was down (Section 4.5). Stale replicas will never be involved in a mutation or given to clients asking the master for chunk locations. They are garbage collected at the earliest opportunity.
     在 一系列成功变动后,变动的文件范围保证是已定义的,并且包含了最后变动所写入的数据。GFS通过下面的方法做到这一点:(a)将块的变动在所有的复本上按 相同的顺序进行记录(3.1),(b)使用块版本号来检测是否因为块服务器死机造成错过了某些变动,从而复本变成失效(4.5)。失效的复本将不再会涉及 后续的变动,Master向客户端响应块的位置时也不会返回此复本的信息。它们将尽早被垃圾回收。
    Since clients cache chunk locations, they may read from a stale replica before that information is refreshed. This window is limited by the cache entry’s timeout and the next
open of the file, which purges from the cache all chunk information for that file. Moreover, as most of our files are append-only, a stale replica usually returns a premature end of chunk rather than outdated data. When a reader retries and contacts the master, it will immediately get current chunk locations.
     因 为客户端缓存了块的位置,它们可能会在信息刷新前从一个失效的复本读取。时间窗口由缓存超时时间以及文件再次打开的时间而决定,文件打开后会清除缓存中所 有块的信息。而且,因为我们的大多数文件是仅追加的,一个失效的复本通常返回块末尾之前的数据,而不是无效的数据。当重新联系Master时,它可以立即 得到当前的块位置。
    Long after a successful mutation, component failures can of course still corrupt or destroy data. GFS identifies failed chunkservers by regular handshakes between master and all chunkservers and detects data corruption by checksumming (Section 5.2). Once a problem surfaces, the data is restored from valid replicas as soon as possible (Section 4.3). A chunk
is lost irreversibly only if all its replicas are lost before GFS can react, typically within minutes. Even in this case, it becomes unavailable, not corrupted: applications receive clear errors rather than corrupt data.
     成 功变动过后很久,部件错误可以会损坏或销毁数据。GFS使用Master和块服务器之间的握手和数据校验,可以识别失效的块服务器(5.2),一旦出现问 题,数据可以尽快的从有效的复本中恢复出来(4.3)。只有当一个块的所有复本在GFS应对之前全部丢失,这个块才会不可逆的丢失,通常GFS的反应时间 在几分钟之内,即使在此种情况下,块变成不可用,但并没有损坏:应用可以接收到明确的错误,而不是损坏的数据。

2.7.2 Implications for Applications 应用的影响

    GFS applications can accommodate the relaxed consistency model with a few simple techniques already needed for other purposes: relying on appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records.
    Practically all our applications mutate files by appending rather than overwriting. In one typical use, a writer generates a file from beginning to end. It atomically renames the file to a permanent name after writing all the data, or periodically checkpoints how much has been successfully written. Checkpoints may also include application-level checksums. Readers verify and process only the file region up to the last checkpoint, which is known to be in the defined
state. Regardless of consistency and concurrency issues, this approach has served us well. Appending is far more efficient and more resilient to application failures than random writes. Checkpointing allows writers to restart incrementally and keeps readers from processing successfully written file data that is still incomplete from the application’s perspective.
     实 际上,我们所有的应用程序使用追加进行文件变动多过覆写。一个典型用法,写入者从头到尾生成文件。它写完所有的数据后,将文件重命名为一个永久的名称,或 者周期性的为写入成功多少而建立检查点。检查点也包含应用性的校验和。读取者只验证和处理在最新检查点中的文件范围,也就是已定义的状态。不管发生一致性 和同步问题,这个方法工作的很好。追加比随机写有效的多,并且对应用失效更有弹性。检查点允许写入者渐进的重新开始,并避免读取者从应用的角度认为文件数 据已经成功处理,然而实际上是不完整的。
    In the other typical use, many writers concurrently append to a file for merged results or as a producer-consumer queue. Record append’s append-at-least-once semantics preserves each writer’s output. Readers deal with the occasional padding and duplicates as follows. Each record prepared by the writer contains extra information like checksums so that its validity can be verified. A reader can identify and discard extra padding and record fragments using the checksums. If it cannot tolerate the occasional duplicates (e.g., if they would trigger non-idempotent operations), it can filter them out using unique identifiers in the records, which are often needed anyway to name corresponding application entities such as web documents. These
functionalities for record I/O (except duplicate removal) are in library code shared by our applications and applicable to other file interface implementations at Google. With that, the same sequence of records, plus rare duplicates, is always delivered to the record reader.
     在 另一个常见的用法中,多个写入者并发的向一个文件进行追加,进行结果的合并或者作为生产者-消费者队列。记录追加的“最少一次追加”的语义保证了每个写入 者的输出。读取者按照下面的方法来应对偶尔的填充数据和冗余信息。写入者准备的每条记录都包含诸如校验和这样的额外信息,因此记录的有效性可以被判断。读 取者可以使用校验和来识别和消除额外的填充数据和记录片段。如果它不能容忍偶然的冗余(例如,如果他们出发非幂操作),它可以使用记录的唯一标识来过滤掉 它们,这些标识符通常用于名称对应的应用,例如Web文档。这些记录I/O的功能(除去移除冗余)都封装在库的代码中在应用*享,并且可以用于 google实现的其它文件接口。记录的相同序列,加上少有的冗余,总是被分发给记录的读取者。


    We designed the system to minimize the master’s involvement in all operations. With that background, we now describe how the client, master, and chunkservers interact to implement data mutations, atomic record append, and snapshot. 

3.1 Leases and Mutation Order 租约和变动顺序

    A mutation is an operation that changes the contents or metadata of a chunk such as a write or an append operation. Each mutation is performed at all the chunk’s replicas. We use leases to maintain a consistent mutation order across replicas. The master grants a chunk lease to one of the replicas, which we call the primary. The primary picks a serial order for all mutations to the chunk. All replicas follow this order when applying mutations. Thus, the global mutation order is defined first by the lease grant order chosen by the master, and within a lease by the serial numbers assigned by the primary.
     变 动就是一个会改变块的元数据的操作,就像写和追加操作。每个变动都在块的所有副本上运行。我们使用租约来保持复本间的变动一致顺序。Master为一个复 本签署块租约,我们称其为主块。主块为块的所有变动选择一个序列顺序。所有的复本在应用变动时都遵从这个顺序。因此全局的变动顺序首先被Master选择 的租约签署顺序所定义,在一个租约中,则由主块赋予的序列号决定。
    The lease mechanism is designed to minimize management overhead at the master. A lease has an initial timeout of 60 seconds. However, as long as the chunk is being mutated, the primary can request and typically receive extensions from the master indefinitely. These extension requests and grants are piggybacked on the HeartBeat messages regularly exchanged between the master and all chunkservers. The master may sometimes try to revoke a lease before it expires (e.g., when the master wants to disable mutations on a file that is being renamed). Even if the master loses communication with a primary, it can safely grant a new
lease to another replica after the old lease expires.
     租 约机制被设计用来使Master的管理难度最小化。一个租约有60秒的初始超时时间。然而只要块有变动,主块可以从Master请求租约延期。这种延期请 求和签署通常利用Master和所有块服务器之间的常规心跳消息进行传输。Master有时会在一个租约到期前解除它(例如当Master想要取消在一个 已经改名的文件上的变动)。即使Master与主块之间通信中断,它也可以在原有租约超时后,安全的与另一个复本签署新的租约。
    In Figure 2, we illustrate this process by following the control flow of a write through these numbered steps.
     在图2中,我们描述了写操作的分步骤 控制流程。
The Google File System 译文

    1. The client asks the master which chunkserver holds the current lease for the chunk and the locations of the other replicas. If no one has a lease, the master grants one to a replica it chooses (not shown).
    2. The master replies with the identity of the primary and the locations of the other  secondary) replicas. The client caches this data for future mutations. It needs to contact the master again only when the primary becomes unreachable or replies that it no longer holds
a lease.
    3. The client pushes the data to all the replicas. A client can do so in any order. Each chunkserver will store the data in an internal LRU buffer cache until the data is used or aged out. By decoupling the data flow from the control flow, we can improve performance by
scheduling the expensive data flow based on the network topology regardless of which chunkserver is the primary. Section 3.2 discusses this further.
     客 户端将数据推送到所有的复本上。客户端可以以任何顺序来进行。每个块服务器将在一个内部LRU缓冲区中保存数据,直到数据被使用或者过期。通过将数据流和 控制流解耦,我们可以根据网络拓扑结构对代价大的数据流,从而提升性能,而不管哪个块服务器是主块。3.2将会详细讨论。
    4. Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The request identifies the data pushed earlier to all of the replicas. The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which provides the necessary serialization. It applies the mutation to its own local state
in serial number order.
    5. The primary forwards the write request to all secondary replicas. Each secondary replica applies mutations in the same serial number order assigned by the primary.
    6. The secondaries all reply to the primary indicating that they have completed the operation.
    7. The primary replies to the client. Any errors encountered at any of the replicas are reported to the client. In case of errors, the write may have succeeded at the primary and an arbitrary subset of the secondary replicas. (If it had failed at the primary, it would not have been assigned a serial number and forwarded.) The client request is considered to have failed, and the modified region is left in an inconsistent state. Our client code handles such errors by retrying the failed mutation. It will make a few attempts at steps (3) through (7) before falling back to a retry from the beginning of the write.
     主 块向客户端进行回复。任何复本产生的错误都将报告给客户端。如果出现错误,写操作可能在主块和一些复本上成功进行。(如果主块失败,将不会分配序列号及向 复本的转发)客户端请求被认为是失败的,修改范围留在不一致的状态。我们的客户端代码通过重试失败的变动来处理这样的错误。可能会在第3步和第7步之间进 行一些重试。
    If a write by the application is large or straddles a chunk boundary, GFS client code breaks it down into multiple write operations. They all follow the control flow described above but may be interleaved with and overwritten by concurrent operations from other clients. Therefore, the shared file region may end up containing fragments from different clients, although the replicas will be identical because the individual operations are completed successfully in the same order on all replicas. This leaves the file region in consistent but undefined state as noted in Section 2.7.
     如 果应用的写操作量很大,或者超过了单块的范围,GFS客户端代码将它分割成多个写操作。它们都遵循上面描述的控制流,但是可能会和其他客户端的并发操作混 在一起,从而数据可能被覆盖。因此共享的文件范围的尾部可能会有来自不同客户端的片段,即使复本将是同一的,因为独立的操作在所有的复本上都以相同的顺序 成功完成。这使得文件范围处于2.7中描述的一致但未定义的状态。

3.2 Data Flow 数据流

    We decouple the flow of data from the flow of control to use the network efficiently. While control flows from the client to the primary and then to all secondaries, data is pushed linearly along a carefully picked chain of chunkservers in a pipelined fashion. Our goals are to fully utilize each machine’s networkb andwidth, avoid network bottlenecks and high-latency links, and minimize the latency to push through all the data.
    To fully utilize each machine’s network bandwidth, the data is pushed linearly along a chain of chunkservers rather than distributed in some other topology (e.g., tree). Thus, each machine’s full outbound bandwidth is used to transfer the data as fast as possible rather than divided among multiple recipients.
    To avoid network bottlenecks and high-latency links (e.g., inter-switch links are often both) as much as possible, each machine forwards the data to the “closest” machine in the network topology that has not received it. Suppose the client is pushing data to chunkservers S1 through S4. It sends the data to the closest chunkserver, say S1. S1 forwards it to the closest chunkserver S2 through S4 closest to S1, say S2. Similarly, S2 forwards it to S3 or S4, whichever is closer to S2, and so on. Our network topology is simple enough that “distances” can be accurately estimated from IP addresses.
     为 了尽量避免网络瓶颈和高延迟链接,每台机器将数据发送至网络拓扑中“最近的”机器。假定客户端正在向块服务器S1至S4推送数据。它向最近的块服务器发送 数据,比如S1。S1转发数据至S2-S4中与S1最近的机器,比如S2。相似的,S2转发数据至S3或S4中离S2最近的一个。我们的网络拓扑非常简 单,“距离”可以通过IP地址精确的估算。
    Finally, we minimize latency by pipelining the data transfer over TCP connections. Once a chunkserver receives some data, it starts forwarding immediately. Pipelining is especially helpful to us because we use a switched network with full-duplex links. Sending the data immediately does not reduce the receive rate. Without network congestion, the ideal elapsed time for transferring B bytes to R replicas is B/T + RL where T is the network throughput and L is latency
to transfer bytes between two machines. Our network links are typically 100 Mbps (T), and L is far below 1 ms. Therefore, 1 MB can ideally be distributed in about 80 ms.
     最 终,我们通过将TCP连接上的数据传输管道化来最小化延迟。当一个块服务器收到一些数据时,它立即启动转发。管道化对于我们非常有用,因为我们使用全双工 连接的交换网络。立即发送数据不会降低接收的速率。不考虑网络拥塞,向R个复本转送B字节的数据,理想的耗费时间为B/T+RL,T是网络吞吐量,L是在 两台机器之间传送数据的延迟。我们的网络连接通常是100Mbps(T),L远低于1ms。因此理想情况下,1MB数据可以在80ms内被分发。

3.3 Atomic Record Appends 原子性的记录追加

    GFS provides an atomic append operation called record append. In a traditional write, the client specifies the offset at which data is to be written. Concurrent writes to the same region are not serializable: the region may end up containing data fragments from multiple clients. In a record append, however, the client specifies only the data. GFS appends it to the file at least once atomically (i.e., as one continuous sequence of bytes) at an offset of GFS’s choosing
and returns that offset to the client. This is similar to writing to a file opened in O APPEND mode in Unix without the race conditions when multiple writers do so concurrently.
     GFS 提供了一个原子性的追加操作,称为记录追加。在一个传统的写中,客户端指定数据写入的偏移量。并发的写相同的范围是不可串行的:范围的尾部可能包含来自于 多个客户端的数据片段。在一个记录追加中,客户端仅仅指定数据。GFS至少一次自动将它追加在文件中的一个GFS选定的偏移位置,而后返回此位置给客户 端。这与Unix中,多个写入者在没有竞争条件下,对一个以O APPEND模式打开的文件进行写入是相似的。
    Record append is heavily used by our distributed applications in which many clients on different machines append to the same file concurrently. Clients would need additional
complicated and expensive synchronization, for example through a distributed lock manager, if they do so with traditional writes. In our workloads, such files often serve as multiple- producer/single-consumer queues or contain merged results from many different clients.
     记 录追加在我们的应用中使用的相当频繁,许多客户端并发的向同一个文件追加数据。如果使用传统方式进行写操作,客户端需要额外的复杂、昂贵的同步,例如通过 一个分布式的锁管理器。在我们的工作中,这种文件通常作为“多生产者/单消费者队列”或者包含从许多不同客户端的结果合并。
    Record append is a kind of mutation and follows the control flow in Section 3.1 with only a little extra logic at the primary. The client pushes the data to all replicas of the last chunk of the file Then, it sends its request to the primary. The primary checks to see if appending the record to the current chunk would cause the chunk to exceed the maximum size (64 MB). If so, it pads the chunk to the maximum size, tells secondaries to do the same, and replies to the client indicating that the operation should be retried on the next chunk. (Record append is restricted to be atmost one-fourth of the maximum chunk size to keep worstcase fragmentation at an acceptable level.) If the record fits within the maximum size, which is the common case, the primary appends the data to its replica, tells the secondaries to write the data at the exact offset where it has, and finally replies success to the client.
     记 录追加是一种遵循3.1中控制流程的变动,只在主块上有少许额外的逻辑。客户端向数据推送到文件的最后的块的所有复本上,然后它发送请求给Master。 Master来检查追加记录到当前的快上是否会导致块超过大小限度(64MB)。如果是,它将块填充到最大值,通知二级复本也这样做,回复客户端指示需要 对下一个块也进行操作。(记录追加被限制于最大块尺寸的四分之一,以使最坏情况下的文件碎片处于可接受的级别)如果记录在最大尺寸范围内,这也是最常见的 情况,主块将数据追加到它的复本,通知二级复本在自身相同的偏移位置写入数据,最终向客户端回复成功。
    If a record append fails at any replica, the client retries the operation. As a result, replicas of the same chunk may contain different data possibly including duplicates of the same record in whole or in part. GFS does not guarantee that all replicas are bytewise identical. It only guarantees that the data is written at least once as an atomic unit. This property follows readily from the simple observation that for the operation to report success, the data must have been written at the same offset on all replicas of some chunk. Furthermore, after this, all replicas are at least as long as the end of record and therefore any future record will be assigned a higher offset or a different chunk even if a different replica later becomes the primary. In terms of our consistency guarantees, the regions in which successful record append operations have written their data are defined (hence consistent), whereas intervening regions are inconsistent (hence undefined). Our applications can deal with inconsistent regions as we discussed in Section 2.7.2.
     如 果在任何一个复本上发生数据追加失败,客户端重试这个操作。作为结果,相同块的复本也许包含。GFS不保证所有的复本都是字节级别相同的。它只保证数据作 为原子单元将至少被写入一次。这个属性是由如下观察结果轻松推导而来,如果操作报告成功,数据必须被写入一些块的所有复本的相同偏移位置。而且,在此之 后,所有的复本至少达到记录的长度,因此任何后续的记录都会被分配一个更高的偏移位置或者一个不同的块,即使一个不同的复本变成了主块。按照我们的一致性 保证的说法,成功的被执行追加记录的范围是已定义的(所以也是一致的),反之则是不一致的(也就是未定义的)。我们的应用可以处理不一致的范围,正如我们 在2.7.2中讨论的。

3.4 Snapshot 快照

    The snapshot operation makes a copy of a file or a directory tree (the “source”) almost instantaneously, while minimizing any interruptions of ongoing mutations. Our users use it to quickly create branch copies of huge data sets (and often copies of those copies, recursively), or to checkpoint the current state before experimenting with changes that can later be committed or rolled back easily.
    Like AFS [5], we use standard copy-on-write techniques to implement snapshots. When the master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files it is about to snapshot. This ensures that any subsequent writes to these chunks will require an interaction with the master to find the lease holder. This will give the master an opportunity to create a new copy of the chunk first.
     正 如AFS,我们使用标准的copy-on-write技术来实现快照。当Master收到一个快照请求,它首先取消需要进行快照的文件的块的租约。这可以 保证后续所有的对这些块的写操作必须到Master上去获取租约持有者。这可以给Master一个创建块的新拷贝的机会。
    After the leases have been revoked or have expired, the master logs the operation to disk. It then applies this log record to its in-memory state by duplicating the metadata for the source file or directory tree. The newly created snapshot files point to the same chunks as the source files.
    The first time a client wants to write to a chunkC after the snapshot operation, it sends a request to the master to find the current lease holder. The master notices that the reference count for chunkC is greater than one. It defers replying to the client request and instead picks a new chunk handle C’. It then asks each chunkserver that has a current replica of C to create a new chunk called C’. By creating the new chunk on the same chunkservers as the original, we
ensure that the data can be copied locally, not over the network (our disks are about three times as fast as our 100 Mb Ethernet links). From this point, request handling is no different
from that for any chunk: the master grants one of the replicas a lease on the new chunkC’ and replies to the client, which can write the chunk normally, not knowing that it has just been created from an existing chunk.
     客 户端第一次想要在快照操作后写块C,它向Master发送一个请求来获取当前的租约持有者。Master意识到块C的引用数目超过1,它延迟回复客户端的 请求,取而代之的选择一个新的块句柄C'。然后它让每个有C的复本的块服务器创建一个新块C'。通过在相同的块服务器上创建新块,我们保证数据可以被本地 复制,不是从网络(我们的磁盘比100MB的以太网快3倍)。从这一点来说,请求处理和任何块没有不同:Master为一个复本签署新块C'的租约,并响 应给客户端,客户端可以正常的写块,不会知道它是刚刚从一个已存在的块中创建出来的。


    The master executes all namespace operations. In addition, it manages chunk replicas throughout the system: it makes placement decisions, creates new chunks and hence replicas, and coordinates various system-wide activities to keep chunks fully replicated, to balance load across all the chunkservers, and to reclaim unused storage. We now discuss each of these topics.

4.1 Namespace Management and Locking 名字空间管理和加锁

    Many master operations can take a long time: for example, a snapshot operation has to revoke chunkserver leases on all chunks covered by the snapshot. We do not want to delay other master operations while they are running. Therefore, we allow multiple operations to be active and use locks over regions of the namespace to ensure proper serialization.
    Unlike many traditional file systems, GFS does not have a per-directory data structure that lists all the files in that directory. Nor does it support aliases for the same file or directory (i.e, hard or symbolic links in Unix terms). GFS logically represents its namespace as a lookup table mapping full pathnames to metadata. With prefix compression, this table can be efficiently represented in memory. Each node in the namespace tree (either an absolute file name or an
absolute directory name) has an associated read-write lock. 
     和 许多传统文件系统不同,GFS没有一个基于目录、并在目录中列出所有文件的数据结构。它也不支持同样文件或目录的别名(例如Unix术语中的硬链接或符号 链接)。逻辑上讲GFS将名字空间展现成一个查找表,包含完整路径名称到元数据的映射。通过对前缀进行压缩,这个表格可以有效的存储于内存之中。名字空间 树中的每个节点(绝对文件名称或者绝对数据名称)都有一个关联的读写锁。
    Each master operation acquires a set of locks before it runs. Typically, if it involves /d1/d2/.../dn/leaf, it will acquire read-locks on the directory names /d1, /d1/d2, ...,
/d1/d2/.../dn, and either a read lock or a write lock on the full pathname /d1/d2/.../dn/leaf. Note that leaf may be a file or directory depending on the operation.
     Master 的每个操作之前都要求一系列的锁。通常,如果操作涉及/d1/d2/.../dn/leaf,它会要求对于目录/d1, /d1/d2, ..., /d1/d2/.../dn的读锁,以及对于完整路径名称/d1/d2/.../dn/leaf的一个读锁或写锁。请注意leaf可以是一个文件也可能是 一个目录,取决于操作本身。
    We now illustrate how this locking mechanism can prevent a file /home/user/foo from being created while /home/user is being snapshotted to /save/user. The snapshot operation acquires read locks on /home and /save, and write locks on /home/user and /save/user. The file creation acquires read locks on /home and /home/user, and a write lock on /home/user/foo. The two operations will be serialized properly because they try to obtain conflicting locks on /home/user. File creation does not require a write lock on the parent directory because there is no “directory”, or inode-like, data structure to be protected from modification. The read lock on the name is sufficient to protect the parent directory from deletion.
     我 们现在演示锁机制如何能够在对/home/user进行快照并保存至/save/user的同时,避免一个名为/home/user/foo的文件被创 建。快照操作要求对于/home和/save的读锁,以及对于/home/user和/save/user的写锁。文件创建要求/home和/home /user的读锁,以及对于/home/user/foo的写锁。这两个操作将按次序执行,因为它们尝试获取/home/user的锁时会发生冲突。文件 创建不会要求对于父目录的写锁,因为这里并没有类似目录或者inode的数据结构需要被保护不被修改。对于名称的读锁对于保护父目录不被删除已经足够。
    One nice property of this locking scheme is that it allows concurrent mutations in the same directory. For example, multiple file creations can be executed concurrently in the same directory: each acquires a read lock on the directory name and a write lock on the file name. The read lock on the directory name suffices to prevent the directory from being deleted, renamed, or snapshotted. The write locks on file names serialize attempts to create a file with the same name twice.
    Since the namespace can have many nodes, read-write lock objects are allocated lazily and deleted once they are not in use. Also, locks are acquired in a consistent total order to prevent deadlock: they are first ordered by level in the namespace tree and lexicographically within the same level.

4.2 Replica Placement 复本放置

    A GFS cluster is highly distributed at more levels than one. It typically has hundreds of chunkservers spread across many machine racks. These chunkservers in turn may be accessed from hundreds of clients from the same or different racks. Communication between two machines on different racks may cross one or more network switches. Additionally, bandwidth into or out of a rack may be less than the aggregate bandwidth of all the machines within the rack. Multi-level distribution presents a unique challenge to distribute data for scalability, reliability, and availability.
     GFS 集群是多层次高度分布的。它通常有部署在许多机架上的数百个块服务器。这些块服务器又被来自于相同或不同机架的数百个客户端所访问。来自不同机架上的两台 机器的通信可能会跨越一个或多个网络交换机。另外机架的出入带宽可能比机架内所有机器的带宽总和要小。多层级分布式为数据的可扩展性、可靠性以及可用性提 出了全新的挑战。
    The chunk replica placement policy serves two purposes: maximize data reliability and availability, and maximize network bandwidth utilization. For both, it is not enough to spread replicas across machines, which only guards against disk or machine failures and fully utilizes each machine’s network bandwidth. We must also spread chunk replicas across racks. This ensures that some replicas of a chunk will survive and remain available even if an entire rackis damaged or offline (for example, due to failure of a shared resource like a network switch or power circuit). It also means that traffic, especially reads, for a chunk can exploit the aggregate
bandwidth of multiple racks. On the other hand, write traffic has to flow through multiple racks, a tradeoff we make willingly.
     块 复本的放置策略有两个目标:最大化数据可靠性和可用性,最大化网络带宽利用率。为了这两个目的,仅在机器间分布复本是不够的,这只能防护磁盘和机器的实 效,也只能充分利用每台机器的带宽。我们必须在机架间分布复本。这保证了当整个机架损坏或下线(例如电源或网络交换机的失效)时,仍然有一些复本存活并提 供服务。这也意味着对于一个块的流量(特别是读)可以利用多机架的整体带宽。另一方面,写流量也需要通过多个机架,这是我们愿意付出的代价。

4.3 Creation, Re-replication, Rebalancing 创建,重新复制,重新负载均衡

    Chunk replicas are created for three reasons: chunk creation, re-replication, and rebalancing.
    When the master creates a chunk, it chooses where to place the initially empty replicas. It considers several factors. (1)We want to place new replicas on chunkservers with below-average disk space utilization. Over time this will equalize disk utilization across chunkservers. (2) We want to limit the number of “recent” creations on each chunkserver. Although creation itself is cheap, it reliably predicts imminent heavy write traffic because chunks are created when demanded by writes, and in our append-once-read-many workload they typically become practically read-only once they have been completely written. (3) As discussed above, we
want to spread replicas of a chunk across racks. The master re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal. This could happen for various reasons: a chunkserver becomes unavailable, it reports that its replica may be corrupted, one
of its disks is disabled because of errors, or the replication goal is increased. Each chunk that needs to be re-replicated is prioritized based on several factors. One is how far it is from its replication goal. For example, we give higher priority to a chunk that has lost two replicas than to a chunk that has lost only one. In addition, we prefer to first re-replicate chunks for live files as opposed to chunks that belong to recently deleted files (see Section 4.4). Finally, to minimize
the impact of failures on running applications, we boost the priority of any chunk that is blocking client progress.
     当 Master创建一个块的时候,它选择在哪里放置初始的空白复本。它会考虑几个因素。(1)我们希望将新的复本放在磁盘空间使用率低于平均水平的块服务器 上。一段时间之后这将使得块服务器的磁盘使用率趋于相同。(2)我们希望限制每台块服务器上的“最近”创建的数量。虽然创建操作本身代价不大,但是它总会 紧接着繁重的写操作,因为块总是在写请求时被创建,在我们的“追加一次读多次”的负载下,当它们被完全写入后,通常就变成只读的。(3)正如上述,我们想 要在机架之间分布块的复本。一旦可用的块的复本数量低于用户指定的指标,Master就将重新创建块的复本。这可能由于几个原因:一个块服务器变得不可 用,它报告自己的复本可能损坏了,它的磁盘之一由于错误而失效了,或者复制的目标提高了。需要被重新复制的每个块都被按照一定因素来排序。一是离它的复制 目标有多远。例如,相对与丢失一个复本的块,我们给丢失两个复本的块更高的优先级。另外,我们会先为活跃文件创建复本,而不是近期删除的文件(参阅 4.4)。最终,为了最小化失效对于正在运行的应用的影响,我们提高任何会阻塞客户端进程的块的优先级。
    The master picks the highest priority chunk and “clones” it by instructing some chunkserver to copy the chunk data directly from an existing valid replica. The new replica is placed with goals similar to those for creation: equalizing disk space utilization, limiting active clone operations on any single chunkserver, and spreading replicas across racks. To keep cloning traffic from overwhelming client traffic, the master limits the numbers of active clone operations both for the cluster and for each chunkserver. Additionally, each chunkserver limits the amount of bandwidth it spends on each clone operation by throttling its read requests to the source chunkserver.
     Master 挑选优先级最高的块,指示一些块服务器直接从一个现有有效的复本来复制块数据。新复本的放置策略和创建操作的目标相似:使磁盘空间利用率平均化,限制任何 单独块服务器上的活动复制操作,在机架之间分布复本。为了防止复制流量超过客户端流量,Master限制在集群和块服务器上的活动复制操作。另外每个块服 务器通过对源块服务器读取的限流,而控制它用于每个复制操作的带宽数量。
    Finally, the master rebalances replicas periodically: it examines the current replica distribution and moves replicas for better disk space and load balancing. Also through this process, the master gradually fills up a new chunkserver rather than instantly swamps it with new chunks and the heavy write traffic that comes with them. The placement criteria for the new replica are similar to those discussed above. In addition, the master must also choose which existing replica to remove. In general, it prefers to remove those on chunkservers with below-average free space so as to equalize disk space usage.
     最 终,Master周期性的重新调节复本的负载:它检查当前的复本分布,移动复本以得到更好的磁盘空间和负载均衡。同时,Master逐步的填满一个新的块 服务器,而不是立即使用新的块和沉重的写操作来淹没它。放置新复本的条件和上述相似。另外,Master必须选择那些现有的复本将被移走。总体上,它希望 移走那些空间低于平均水平的块服务器上的复本,以获得平均的磁盘空间使用率。

4.4 Garbage Collection 垃圾回收

    After a file is deleted, GFS does not immediately reclaim the available physical storage. It does so only lazily during regular garbage collection at both the file and chunk levels. We find that this approach makes the system much simpler and more reliable.

4.4.1 Mechanism 机制

    When a file is deleted by the application, the master logs the deletion immediately just like other changes. However instead of reclaiming resources immediately, the file is just renamed to a hidden name that includes the deletion timestamp. During the master’s regular scan of the file system namespace, it removes any such hidden files if they have existed for more than three days (the interval is configurable). Until then, the file can still be read under the new, special
name and can be undeleted by renaming it back to normal. When the hidden file is removed from the namespace, its inmemory metadata is erased. This effectively severs its links to all its chunks.
     当 一个文件被应用系统删除,像其他操作一样,Master立即记录删除的日志。但是它不会立刻回收资源,文件会被重命名为一个隐藏的名字,包含删除的时间 戳。当Master例行扫描文件系统的名字空间时,如果这些文件已经存在了三天(间隔可以配置),Master会将其进行移除。在此之前,文件仍然可以使 用新的特殊名字进行访问,也可以通过重命名至普通名称来恢复已删除的文件。当隐藏文件已经从名字空间中移除,保存在内存中的它的元数据就被删除了。这有效 的服务于所有块的连接。
    In a similar regular scan of the chunk namespace, the master identifies orphaned chunks (i.e., those not reachable from any file) and erases the metadata for those chunks. In a HeartBeat message regularly exchanged with the master, each chunkserver reports a subset of the chunks it has, and the master replies with the identity of all chunks that are no longer present in the master’s metadata. The chunkserver is free to delete its replicas of such chunks.
     在 对相似块的例行扫描中,Master确定孤立的块(例如对于任何文件都不可达的块),删除这些块的元数据。在与Master的例行心跳信息交互中,每个块 服务器报告自身拥有的块,Master响应返回所有已经不在Master元数据中存在的块标识,块服务器可以*的删除这些块的复本。

4.4.2 Discussion 讨论

    Although distributed garbage collection is a hard problem that demands complicated solutions in the context of programming languages, it is quite simple in our case. We can easily identify all references to chunks: they are in the file-to-chunk mappings maintained exclusively by the master. We can also easily identify all the chunk replicas: they are Linux files under designated directories on each chunkserver. Any such replica not known to the master is “garbage.”
     虽 然分布式的垃圾收集是一个严重问题,它在编程语言领域要求一个复杂的方案,这在我们这里非常简单。我们可以轻松的识别块的所有引用:Master专门的维 护了文件到块的映射。我们可以轻松识别所有的块复本:它们以Linux文件的方式存在在每个块服务器指定的目录中。所有Master不知道的复本都是“垃 圾”。
    The garbage collection approach to storage reclamation offers several advantages over eager deletion. First, it is simple and reliable in a large-scale distributed system where component failures are common. Chunk creation may succeed on some chunkservers but not others, leaving replicas that the master does not know exist. Replica deletion messages may be lost, and the master has to remember to resend them across failures, both its own and the chunkserver’s.
Garbage collection provides a uniform and dependable way to clean up any replicas not known to be useful. Second, it merges storage reclamation into the regular background activities of the master, such as the regular scans of namespaces and handshakes with chunkservers. Thus, it is done in batches and the cost is amortized. Moreover, it is done only when the master is relatively free. The master can respond more promptly to client requests that demand timely
attention. Third, the delay in reclaiming storage provides a safety net against accidental, irreversible deletion.
     垃 圾回收的方法比立即删除要有很多优点:首先,在一个组件失效非常普遍的大型分布式系统中,它非常的简单和可靠。块创建可能仅在一些机器上成功执行,留下了 Master不知道的复本。复本删除消息可能会丢失,Master必须记住重新发送消息,包括它自己的或块服务器的。垃圾回收提供了一个统一的可信赖的方 式来清除那些无用的复本。第二,它将存储回收合并进入Master的例行后台活动中,就像名字空间的例行扫描,以及和块服务器的握手一样。因此,垃圾回收 被批量处理,开销被摊薄。另外,只有当Master相对空闲的时候才进行操作。Master可以对关注时间的客户端请求更加及时的响应。第三,有延迟的空 间回收比偶发的删除更加安全。
    In our experience, the main disadvantage is that the delay sometimes hinders user effort to fine tune usage when storage is tight. Applications that repeatedly create and delete temporary files may not be able to reuse the storage right away. We address these issues by expediting storage reclamation if a deleted file is explicitly deleted again. We also allow users to apply different replication and reclamation policies to different parts of the namespace. For example,
users can specify that all the chunks in the files within some directory tree are to be stored without replication, and any deleted files are immediately and irrevocably removed from the file system state.
     在 我们的经验中,主要的不足在于当存储空间紧张时,这样的延迟通常会阻碍用户去更好的调节存储空间的使用。重复创建的删除临时文件的应用不能有效的重用存 储。我们以下面的方式处理这个问题,如果一个已被删除的文件被再次删除,将会加快存储回收的过程。我们也允许用户对名字空间的不同部分设置不同的复制和回 收策略。例如,用户可以指定一些目录树下的文件的所有块没有复本,任何删除的文件将被立即和不可逆的从文件系统中移除。

4.5 Stale Replica Detection 过期复本检测

    Chunk replicas may become stale if a chunkserver fails and misses mutations to the chunk while it is down. For each chunk, the master maintains a chunk version number to distinguish between up-to-date and stale replicas.
    Whenever the master grants a new lease on a chunk, it increases the chunk version number and informs the up-to-date replicas. The master and these replicas all record the new version number in their persistent state. This occurs before any client is notified and therefore before it can start writing to the chunk. If another replica is currently unavailable, its chunk version number will not be advanced. The master will detect that this chunkserver has a stale replica
when the chunkserver restarts and reports its set of chunks and their associated version numbers. If the master sees a version number greater than the one in its records, the master
assumes that it failed when granting the lease and so takes the higher version to be up-to-date.
     当 Master为一个块签署一个租约时,将会递增块的版本号,然后通知当前的复本。Master和这些复本都在它们的持久化状态中记录新的版本号。这发生在 任何客户端被通知之前,也在它开始向块写数据之前。如果另一个复本当前不可用,它的块版本号不会继续递增。当块服务器重启,并报告它的块信息以及版本 号,Master会发现这个块服务器有一个过期的复本。如果Master发现一个版本号大于自己记录的,Master假定它在签署租约时失效了,因此使用 更高的版本号。
    The master removes stale replicas in its regular garbage collection. Before that, it effectively considers a stale replica not to exist at all when it replies to client requests for chunk information. As another safeguard, the master includes the chunk version number when it informs clients which chunkserver holds a lease on a chunk or when it instructs a chunkserver to read the chunk from another chunkserver in a cloning operation. The client or the chunkserver verifies
the version number when it performs the operation so that it is always accessing up-to-date data.
     Master 在例行垃圾回收中移除过期的复本。在此之前,当Master响应客户端的块信息请求时,它高效的认为过期的复本根本不存在。作为另一个保护方式,当 Master告知客户端哪个块服务器持有某一个块的租约时,或当Master指示块服务器读取另外的块服务器进行复制操作时,消息中也包含块版本号。客户 端或者块服务器在进行操作时将验证版本号,因此它一直都可以访问最新的数据。


    One of our greatest challenges in designing the system is dealing with frequent component failures. The quality and quantity of components together make these problems more the norm than the exception: we cannot completely trust the machines, nor can we completely trust the disks. Component failures can result in an unavailable system or, worse, corrupted data. We discuss how we meet these challenges and the tools we have built into the system to diagnose problems when they inevitably occur.
     我 们在设计这个系统时的最大挑战之一是处理频繁的部件失效。部件的数量和质量使得这些问题比异常更加常见:我们不能完全的信任机器,也不能完全的信任磁盘。 部件的失效会导致一个不可用的系统或者,更坏的情况,损坏的数据。我们讨论怎样应对这些挑战,以及我们创造的内置在系统之中的工具,用于诊断系统故障。

5.1 High Availability 高可用性

    Among hundreds of servers in a GFS cluster, some are bound to be unavailable at any given time. We keep the overall system highly available with two simple yet effective strategies: fast recovery and replication.

5.1.1 Fast Recovery 快速恢复

    Both the master and the chunkserver are designed to restore their state and start in seconds no matter how they terminated. In fact, we do not distinguish between normal and abnormal termination; servers are routinely shut down just by killing the process. Clients and other servers experience a minor hiccup as they time out on their outstanding requests, reconnect to the restarted server, and retry. Section 6.2.2 reports observed startup times.
     Master 和块服务器都被设计为能够恢复它们的状态和在数秒内启动,不管他们怎样被中止。事实上,我们并不区分普通和异常的终止;通常使用杀死进程的方式来关闭服务 器。客户端和其他服务器将会遇到一个小问题,重新连接到重启后的服务器,重试请求。6.2.2报告了观察到的启动时间。

5.1.2 Chunk Replication 块复制

    As discussed earlier, each chunk is replicated on multiple chunkservers on different racks. Users can specify different replication levels for different parts of the file namespace. The default is three. The master clones existing replicas as needed to keep each chunk fully replicated as chunkservers go offline or detect corrupted replicas through checksum verification
(see Section 5.2). Although replication has served us well, we are exploring other forms of cross-server redundancy such as parity or erasure codes for our increasing read only storage requirements. We expect that it is challenging but manageable to implement these more complicated redundancy schemes in our very loosely coupled system because our traffic is dominated by appends and reads rather than small random writes.
     正 如先前讨论的,每个块都被在不同的机架上的多个块服务器上进行复制。用户可以为文件名字空间的不同部分指定不同的复制级别。缺省是三级。当块服务器失效 时,或者通过校验和检测到复本损坏时,Master将现有的复本进行复制,以使每个块留有充足的复本。虽然复本为我们工作良好,我们也在寻找其他跨服务器 冗余的方法,以应对我们日益增长的只读存储需求。我们期望它更有挑战性,更易于管理,在我们非常松散的系统下实现更加复杂的冗余机制。因为我们的流量主要 是由追加和读取构成,而不是随机写。

5.1.3 Master Replication Master复制

    The master state is replicated for reliability. Its operation log and checkpoints are replicated on multiple machines. A mutation to the state is considered committed only after its log record has been flushed to disk locally and on all master replicas. For simplicity, one master process remains in charge of all mutations as well as background activities such as garbage collection that change the system internally. When it fails, it can restart almost instantly. If its machine
or diskf ails, monitoring infrastructure outside GFS starts a new master process elsewhere with the replicated operation log. Clients use only the canonical name of the master (e.g. gfs-test), which is a DNS alias that can be changed if the master is relocated to another machine.
    Moreover, “shadow” masters provide read-only access to the file system even when the primary master is down. They are shadows, not mirrors, in that they may lag the primary slightly, typically fractions of a second. They enhance read availability for files that are not being actively mutated or applications that do not mind getting slightly stale results. In fact, since file content is read from chunkservers, applications do not observe stale file content. What could be
stale within short windows is file metadata, like directory contents or access control information.
    To keep itself informed, a shadow master reads a replica of the growing operation log and applies the same sequence of changes to its data structures exactly as the primary does. Like the primary, it polls chunkservers at startup (and infrequently thereafter) to locate chunk replicas and exchanges frequent handshake messages with them to monitor their status. It depends on the primary master only for replica location updates resulting from the primary’s decisions to create and delete replicas.

5.2 Data Integrity 数据完整性


Each chunkserver uses checksumming to detect corruption
of stored data. Given that a GFS cluster often has thousands
of disks on hundreds of machines, it regularly experiences
diskf ailures that cause data corruption or loss on both the
read and write paths. (See Section 7 for one cause.) We
can recover from corruption using other chunkre plicas, but
it would be impractical to detect corruption by comparing
replicas across chunkservers. Moreover, divergent replicas
may be legal: the semantics of GFS mutations, in particular
atomic record append as discussed earlier, does not guarantee
identical replicas. Therefore, each chunkserver must
independently verify the integrity of its own copy by maintaining
A chunki s broken up into 64 KB blocks. Each has a corresponding
32 bit checksum. Like other metadata, checksums
are kept in memory and stored persistently with logging,
separate from user data.
For reads, the chunkserver verifies the checksum of data
blocks that overlap the read range before returning any data
to the requester, whether a client or another chunkserver.
Therefore chunkservers will not propagate corruptions to
other machines. If a blockdo es not match the recorded
checksum, the chunkserver returns an error to the requestor
and reports the mismatch to the master. In response, the
requestor will read from other replicas, while the master
will clone the chunkfrom another replica. After a valid new
replica is in place, the master instructs the chunkserver that
reported the mismatch to delete its replica.
Checksumming has little effect on read performance for
several reasons. Since most of our reads span at least a
few blocks, we need to read and checksum only a relatively
small amount of extra data for verification. GFS client code
further reduces this overhead by trying to align reads at
checksum block boundaries. Moreover, checksum lookups
and comparison on the chunkserver are done without any
I/O, and checksum calculation can often be overlapped with
Checksum computation is heavily optimized for writes
that append to the end of a chunk(a s opposed to writes
that overwrite existing data) because they are dominant in
our workloads. We just incrementally update the checksum
for the last partial checksum block, and compute new
checksums for any brand new checksum blocks filled by the
append. Even if the last partial checksum block is already
corrupted and we fail to detect it now, the new checksum
value will not match the stored data, and the corruption will
be detected as usual when the blocki s next read.
In contrast, if a write overwrites an existing range of the
chunk, we must read and verify the first and last blocks of
the range being overwritten, then perform the write, and
finally compute and record the new checksums. If we do
not verify the first and last blocks before overwriting them
partially, the new checksums may hide corruption that exists
in the regions not being overwritten.
During idle periods, chunkservers can scan and verify the
contents of inactive chunks. This allows us to detect corruption
in chunks that are rarely read. Once the corruption is
detected, the master can create a new uncorrupted replica
and delete the corrupted replica. This prevents an inactive
but corrupted chunkre plica from fooling the master into
thinking that it has enough valid replicas of a chunk.
5.3 Diagnostic Tools
Extensive and detailed diagnostic logging has helped immeasurably
in problem isolation, debugging, and performance
analysis, while incurring only a minimal cost. Without
logs, it is hard to understand transient, non-repeatable
interactions between machines. GFS servers generate diagnostic
logs that record many significant events (such as
chunkservers going up and down) and all RPC requests and
replies. These diagnostic logs can be freely deleted without
affecting the correctness of the system. However, we try to
keep these logs around as far as space permits.
The RPC logs include the exact requests and responses
sent on the wire, except for the file data being read or written.
By matching requests with replies and collating RPC
records on different machines, we can reconstruct the entire
interaction history to diagnose a problem. The logs also
serve as traces for load testing and performance analysis.
The performance impact of logging is minimal (and far
outweighed by the benefits) because these logs are written
sequentially and asynchronously. The most recent events
are also kept in memory and available for continuous online