Bigtable: A Distributed Storage System for Structured Data, Part 6: Refinements

Date: 2022-05-12 17:16:41
6 Refinements
The implementation described in the previous section required a number of refinements to achieve the high performance, availability, and reliability required by our users. 
This section describes portions of the implementation in more detail in order to highlight these refinements.
Locality groups
Clients can group multiple column families together into a locality group.
A separate SSTable is generated for each locality group in each tablet. 
Segregating column families that are not typically accessed together into separate locality groups enables more efficient reads. 
For example, page metadata in Webtable (such as language and checksums) can be in one locality group, and the contents of the page can be in a different group: 
an application that wants to read the metadata does not need to read through all of the page contents.

In addition, some useful tuning parameters can be specified on a per-locality group basis. 
For example, a locality group can be declared to be in-memory. 
SSTables for in-memory locality groups are loaded lazily into the memory of the tablet server. 
Once loaded, column families that belong to such locality groups can be read without accessing the disk. 
This feature is useful for small pieces of data that are accessed frequently: 
we use it internally for the location column family in the METADATA table.
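
To make the read path concrete, here is a minimal Python sketch of the idea (the Tablet class and its methods are hypothetical illustrations, not Bigtable's API): each locality group owns its own store, so reading page metadata never opens the much larger page-contents group.

    class Tablet:
        def __init__(self, locality_groups):
            # locality_groups: {group name: set of column families in the group}
            self.groups = {name: {} for name in locality_groups}
            self.family_to_group = {fam: name
                                    for name, fams in locality_groups.items()
                                    for fam in fams}

        def write(self, row, family, value):
            self.groups[self.family_to_group[family]][(row, family)] = value

        def read(self, row, family):
            # Only one locality group's store is consulted; a metadata read
            # never touches the contents group's SSTables.
            return self.groups[self.family_to_group[family]].get((row, family))

    t = Tablet({"meta": {"language", "checksum"}, "contents": {"html"}})
    t.write("com.cnn.www", "language", "EN")
    t.write("com.cnn.www", "html", "<html>...</html>")
    print(t.read("com.cnn.www", "language"))  # reads the "meta" group only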

Compression
Clients can control whether or not the SSTables for a locality group are compressed, and if so, which compression format is used. 
The user-specified compression format is applied to each SSTable block (whose size is controllable via a locality-group-specific tuning parameter). 
Although we lose some space by compressing each block separately, we benefit in that small portions of an SSTable can be read without decompressing the entire file. 
Many clients use a two-pass custom compression scheme. 
The first pass uses Bentley and McIlroy's scheme, which compresses long common strings across a large window. 
The second pass uses a fast compression algorithm that looks for repetitions in a small 16 KB window of the data. 
Both compression passes are very fast—they encode at 100–200 MB/s, and decode at 400–1000 MB/s on modern machines.
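
The block-at-a-time layout is easy to sketch. In the following Python illustration, zlib stands in for the two custom passes (whose implementations are not public); the point is the per-block index that lets a reader decompress a single block without touching the rest of the file.

    import zlib

    BLOCK_SIZE = 64 * 1024  # controllable per locality group in Bigtable

    def compress_blocks(data, block_size=BLOCK_SIZE):
        # Compress each block separately; record (uncompressed offset,
        # compressed length) so individual blocks can be located later.
        blocks, index = [], []
        for off in range(0, len(data), block_size):
            comp = zlib.compress(data[off:off + block_size])
            index.append((off, len(comp)))
            blocks.append(comp)
        return b"".join(blocks), index

    def read_block(compressed, index, block_no):
        # Decompress just one block, not the whole file.
        start = sum(length for _, length in index[:block_no])
        _, length = index[block_no]
        return zlib.decompress(compressed[start:start + length])

    data = b"<html>boilerplate</html>" * 10000
    blob, idx = compress_blocks(data)
    assert read_block(blob, idx, 1) == data[BLOCK_SIZE:2 * BLOCK_SIZE]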

Even though we emphasized speed instead of space reduction when choosing our compression algorithms, this two-pass compression scheme does surprisingly well.
For example, in Webtable, we use this compression scheme to store Web page contents. 
In one experiment, we stored a large number of documents in a compressed locality group. 
For the purposes of the experiment, we limited ourselves to one version of each document instead of storing all versions available to us. 
The scheme achieved a 10-to-1 reduction in space. 
This is much better than typical Gzip reductions of 3-to-1 or 4-to-1 on HTML pages because of the way Webtable rows are laid out: all pages from a single host are stored close to each other. 
This allows the Bentley-McIlroy algorithm to identify large amounts of shared boilerplate in pages from the same host. 
Many applications, not just Webtable, choose their row names so that similar data ends up clustered, and therefore achieve very good compression ratios. 

Compression ratios get even better when we store multiple versions of the same value in Bigtable.

Caching for read performance
To improve read performance, tablet servers use two levels of caching. 
The Scan Cache is a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code. 
The Block Cache is a lower-level cache that caches SSTable blocks that were read from GFS. 
The Scan Cache is most useful for applications that tend to read the same data repeatedly. 
The Block Cache is useful for applications that tend to read data that is close to the data they recently read (e.g., sequential reads, or random reads of different columns in the same locality group within a hot row).
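
A minimal sketch of the two levels (the LRU policy, the capacities, and the load_block/block_for helpers are assumptions for illustration; the paper does not specify the eviction policy):

    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity):
            self.capacity, self.d = capacity, OrderedDict()
        def get(self, key):
            if key not in self.d:
                return None
            self.d.move_to_end(key)
            return self.d[key]
        def put(self, key, value):
            self.d[key] = value
            self.d.move_to_end(key)
            if len(self.d) > self.capacity:
                self.d.popitem(last=False)

    scan_cache = LRUCache(10_000)   # key-value pairs from the SSTable interface
    block_cache = LRUCache(1_000)   # raw SSTable blocks read from GFS

    def read(key, load_block, block_for):
        value = scan_cache.get(key)        # 1) exact repeat reads hit here
        if value is not None:
            return value
        block_id = block_for(key)
        block = block_cache.get(block_id)  # 2) reads near recent reads hit here
        if block is None:
            block = load_block(block_id)   # 3) otherwise fetch the block from GFS
            block_cache.put(block_id, block)
        value = block[key]
        scan_cache.put(key, value)
        return value

    # Hypothetical demo: blocks are dicts keyed by row key.
    DISK = {0: {"rowA": "x", "rowB": "y"}}
    print(read("rowA", load_block=lambda b: DISK[b], block_for=lambda k: 0))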

Bloom filters
As described in Section 5.3, a read operation has to read from all SSTables that make up the state of a tablet.
If these SSTables are not in memory, we may end up doing many disk accesses. We reduce the number of accesses by allowing clients to specify that Bloom filters should be created for SSTables in a particular locality group. 
A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair. 
For certain applications, a small amount of tablet server memory used for storing Bloom filters drastically reduces the number of disk seeks required for read operations. 

Our use of Bloom filters also implies that most lookups for non-existent rows or columns do not need to touch disk.
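
A small Python sketch of the idea (the sizing parameters and hash construction are arbitrary choices for illustration): the filter is consulted before any disk access, and a negative answer is always definitive.

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=1 << 20, num_hashes=5):
            self.num_bits, self.num_hashes = num_bits, num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, key):
            for i in range(self.num_hashes):
                h = hashlib.sha256(b"%d:" % i + key).digest()
                yield int.from_bytes(h[:8], "big") % self.num_bits

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, key):
            # False -> definitely absent (skip the disk seek).
            # True  -> possibly present (must read the SSTable).
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(key))

    bf = BloomFilter()
    bf.add(b"com.cnn.www:language")        # one filter per SSTable, keyed by
    assert bf.might_contain(b"com.cnn.www:language")  # row/column pair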

Commit-log implementation
If we kept the commit log for each tablet in a separate log file, a very large number of files would be written concurrently in GFS. 
Depending on the underlying file system implementation on each GFS server, these writes could cause a large number of disk seeks to write to the different physical log files. 
In addition, having separate log files per tablet also reduces the effectiveness of the group commit optimization, since groups would tend to be smaller. 
To fix these issues, we append mutations to a single commit log per tablet server, co-mingling mutations for different tablets in the same physical log file.
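
A sketch of the shared-log layout (JSON lines and the file path stand in for the real record format, which the paper does not describe): each record carries its table, row, and a sequence number so the log can be demultiplexed later.

    import json

    class CommitLog:
        def __init__(self, path):
            self.f = open(path, "a")   # one log file per tablet server
            self.seq = 0

        def append(self, table, row, mutation):
            self.seq += 1
            record = {"table": table, "row": row,
                      "seq": self.seq, "mutation": mutation}
            self.f.write(json.dumps(record) + "\n")
            self.f.flush()  # group commit would batch several records per sync

    log = CommitLog("/tmp/tabletserver.log")   # illustrative path
    log.append("Webtable", "com.cnn.www", {"language": "EN"})
    log.append("Users", "alice", {"email": "a@example.com"})  # different tablet,
                                                              # same physical file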

Using one log provides significant performance benefits during normal operation, but it complicates recovery. 
When a tablet server dies, the tablets that it served will be moved to a large number of other tablet servers:
each server typically loads a small number of the original server’s tablets. 
To recover the state for a tablet, the new tablet server needs to reapply the mutations for that tablet from the commit log written by the original tablet server. 
However, the mutations for these tablets were co-mingled in the same physical log file. 
One approach would be for each new tablet server to read this full commit log file and apply just the entries needed for the tablets it needs to recover. 
However, under such a scheme, if 100 machines were each assigned a single tablet from a failed tablet server, then the log file would be read 100 times (once by each server).
We avoid duplicating log reads by first sorting the commit log entries in order of the keys ⟨table, row name, log sequence number⟩.
In the sorted output, all mutations for a particular tablet are contiguous and can therefore be read efficiently with one disk seek followed by a sequential read. 
To parallelize the sorting, we partition the log file into 64 MB segments, and sort each segment in parallel on different tablet servers. 
This sorting process is coordinated by the master and is initiated when a tablet server indicates that it needs to recover mutations from some commit log file.
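
Concretely, recovery can sort the co-mingled records by that three-part key; a sketch under the record format assumed above:

    # Sorting by (table, row name, sequence number) makes each tablet's
    # mutations contiguous, so one seek plus a sequential read recovers them.
    def sort_log(entries):
        return sorted(entries, key=lambda e: (e["table"], e["row"], e["seq"]))

    entries = [
        {"table": "Users",    "row": "alice",       "seq": 2, "mutation": {}},
        {"table": "Webtable", "row": "com.cnn.www", "seq": 1, "mutation": {}},
        {"table": "Users",    "row": "alice",       "seq": 3, "mutation": {}},
    ]
    for e in sort_log(entries):
        print(e["table"], e["row"], e["seq"])
    # In Bigtable the log is split into 64 MB segments that are sorted in
    # parallel on different tablet servers, coordinated by the master.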

Writing commit logs to GFS sometimes causes performance hiccups for a variety of reasons (e.g., a GFS server machine involved in the write crashes, or the network paths traversed to reach the particular set of three GFS servers are suffering network congestion or are heavily loaded). 
To protect mutations from GFS latency spikes, each tablet server actually has two log writing threads, each writing to its own log file; only one of these two threads is actively in use at a time. 
If writes to the active log file are performing poorly, the log file writing is switched to the other thread, and mutations that are in the commit log queue are written by the newly active log writing thread. 
Log entries contain sequence numbers to allow the recovery process to elide duplicated entries resulting from this log switching process.
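
On the recovery side, eliding the duplicates that log switching can produce is a simple scan over the sorted entries; a sketch reusing the record format assumed earlier:

    def elide_duplicates(sorted_entries):
        # After sorting by (table, row, seq), a record duplicated by a log
        # switch sits next to its twin; keep the first occurrence only.
        last = None
        for e in sorted_entries:
            key = (e["table"], e["row"], e["seq"])
            if key != last:
                last = key
                yield e

    dup = [{"table": "T", "row": "r", "seq": 1}] * 2
    print(list(elide_duplicates(dup)))  # only one entry survives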

Speeding up tablet recovery
If the master moves a tablet from one tablet server to another, the source tablet server first does a minor compaction on that tablet. 
This compaction reduces recovery time by reducing the amount of uncompacted state in the tablet server’s commit log. 
After finishing this compaction, the tablet server stops serving the tablet. 
Before it actually unloads the tablet, the tablet server does another (usually very fast) minor compaction to eliminate any remaining uncompacted state in the tablet server’s log that arrived while the first minor compaction was being performed. 
After this second minor compaction is complete, the tablet can be loaded on another tablet server without requiring any recovery of log entries.
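
The handoff amounts to a compact-stop-compact sequence; a sketch with a hypothetical TabletServer stub (the real protocol involves the master and Chubby, omitted here):

    class TabletServer:
        def __init__(self):
            self.uncompacted = []   # memtable state not yet in an SSTable
            self.serving = True

        def minor_compaction(self):
            # Write memtable contents to an SSTable in GFS; the matching
            # commit-log entries then never need to be replayed.
            self.uncompacted = []

        def unload(self, target):
            self.minor_compaction()   # 1) flush the bulk of uncompacted state
            self.serving = False      # 2) stop serving the tablet
            self.minor_compaction()   # 3) usually tiny: only writes that
                                      #    arrived during step 1
            target.serving = True     # 4) SSTables already live in GFS, so the
                                      #    target loads with no log recovery

    src, dst = TabletServer(), TabletServer()
    src.uncompacted = ["put(r1)", "put(r2)"]
    src.unload(dst)
    assert src.uncompacted == [] and not src.serving and dst.serving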

Exploiting immutability
Besides the SSTable caches, various other parts of the Bigtable system have been simplified by the fact that all of the SSTables that we generate are immutable. 
For example, we do not need any synchronization of accesses to the file system when reading from SSTables. 
As a result, concurrency control over rows can be implemented very efficiently. 
The only mutable data structure that is accessed by both reads and writes is the memtable. 
To reduce contention during reads of the memtable, we make each memtable row copy-on-write and allow reads and writes to proceed in parallel.
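
A simplified Python illustration of the copy-on-write idea (the real memtable is a concurrent in-memory structure whose implementation the paper does not publish; this sketch leans on a single atomic reference swap to keep readers lock-free):

    import threading

    class Memtable:
        def __init__(self):
            self.rows = {}            # row name -> dict of cells, never
            self.write_lock = threading.Lock()   # mutated in place

        def read(self, row):
            # Readers take no lock: they see either the old or the new
            # version of the row, both internally consistent.
            return self.rows.get(row, {})

        def write(self, row, column, value):
            with self.write_lock:                        # writers serialize
                new_cells = dict(self.rows.get(row, {})) # copy on write
                new_cells[column] = value
                self.rows[row] = new_cells               # atomic reference swap
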
Since SSTables are immutable, the problem of permanently removing deleted data is transformed to garbage collecting obsolete SSTables. 
Each tablet’s SSTables are registered in the METADATA table. 
The master removes obsolete SSTables via mark-and-sweep garbage collection over the set of SSTables, where the METADATA table contains the set of roots.
Finally, the immutability of SSTables enables us to split tablets quickly. Instead of generating a new set of SSTables for each child tablet, we let the child tablets share the SSTables of the parent tablet.
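
The garbage collection is a textbook mark-and-sweep with the METADATA table as the root set; a sketch (the record layout and file names are hypothetical stand-ins), which also shows why a child tablet sharing its parent's SSTables keeps them alive:

    def collect_obsolete_sstables(metadata_table, files_on_gfs):
        live = set()
        for tablet_entry in metadata_table:      # mark: every SSTable
            live.update(tablet_entry["sstables"])  # registered in METADATA
        return [f for f in files_on_gfs if f not in live]   # sweep the rest

    metadata = [
        {"tablet": "Webtable,com.cnn.www", "sstables": {"sst-001", "sst-002"}},
        {"tablet": "Webtable,com.dd.www",  "sstables": {"sst-002"}},  # child of a
    ]                                       # split, sharing a parent SSTable
    on_disk = {"sst-001", "sst-002", "sst-old"}
    print(collect_obsolete_sstables(metadata, on_disk))   # ['sst-old']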
