Bigtable: Google 的结构化数据分布式存储系统 (二)翻译

时间:2022-02-18 16:46:55

作者:

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach
Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber

{fay,jeff,sanjay,wilsonh,kerr,m3b,tushar,fikes,gruberg}@google.com

原文:http://labs.google.com/papers/bigtable.html

翻译:eaglet

 

2 Data Model


A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes. (row:string, column:string, time:int64) ! string

 

2 数据模型

BigTable是一个稀疏的,分布式的,持久化的多维有序图。这个图可以通过行的键,列的键和时间戳进行索引;图中的每一个值都是一个未解释(uninterpreted)字节数组。

(行:string,列:string,时间:Int64)->string

Bigtable: Google 的结构化数据分布式存储系统 (二)翻译

Figure 1: A slice of an example table that stores Web pages. The row name is a reversed URL. The contents column family contains
the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN's home page
is referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchor:cnnsi.com
and anchor:my.look.ca. Each anchor cell has one version; the contents column has three versions, at timestamps t3, t5, and t6.

图1:存储Web页面的示例表的一部分。行名是URL 的反转。contents 这一列存储页面的内容,anchor 列存储和这个页面被引用的一些标记(锚点)。CNN 的主页被Sports Illustrated 和 My-look 的首页引用,所以这一行包含了两个列 anchor:cnnsi.com 和 anchor:my.look.ca。每个标记单元有一个版本。contents 列存在3个版本,这三个版本分别是时间t3, t5, t6 时保存的内容。

译者:

我理解这里每个版本如 t3 的结构都是上面说的那个为解释字节数组,数据结构应该为

string row;

string col;

Int64 timestamp;

We settled on this data model after examining a variety of potential uses of a Bigtable-like system. As one concrete example that drove some of our design decisions, suppose we want to keep a copy of a large collection of web pages and related information that could be used by many different projects; let us call this particular table the Webtable. In Webtable, we would use URLs as row keys, various aspects of web pages as column names, and store the contents of the web pages in the contents: column under the timestamps when they were fetched, as illustrated in Figure 1.

 

我们是在对很多类似BigTalbe系统这样的潜在用户进行评测后确定下这个数据模型的。作为一个驱动我们设计决策的具体例子,假设我们想保留大量可能被用于多个不同项目的网页和相关信息的拷贝(让我们暂且叫它为Webtable)。在Webtable中,我们将URLs用作行的键,将网页的各方面的信息作为列名,并且在Contents列中存储网页的内容(按图1所示,当他们被取出时,就是时间戳下面的列)。我理解就是我上面说的那个数据结构(译者)。


Rows
The row keys in a table are arbitrary strings (currently up to 64KB in size, although 10-100 bytes is a typical size for most of our users). Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written in the row), a design decision that makes it easier for clients to reason about the system's behavior in the presence of concurrent updates to the same row. Bigtable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing. As a result, reads of short row ranges are efficient and typically require communication with only a small number of machines. Clients can exploit this property by selecting their row keys so that they get good locality for their data accesses. For
example, in Webtable, pages in the same domain are grouped together into contiguous rows by reversing the hostname components of the URLs. For example, we store data for maps.google.com/index.html under the key com.google.maps/index.html. Storing pages from the same domain near each other makes some host and domain analyses more efficient.

 

表中行的键可以是任意字符串(这里是小于 64KB的字符串。尽管通常情况下大部分用户的键的典型长度都在10-100字节以内). 对单个行的数据读写都是原子级别的(不管这一行中有多少不同的列正在被写入或读取),这个设计的目的是客户可以在系统对同一行数据进行并发修改时方便的去推测系统的行为。(可维护性的考虑,译者)Bigtable 维护一个以行的键值排序的数据字典。表的行的范围是动态分区的,每一个行范围称作 tablet,这个 tablet 是分布式负载均衡时的访问单元。 当结果只需要读取较少范围的行时,这种设计是高效的且只需要和很少的机器通信就可以获取结果。客户可以使用这种特性来选择行的键以便在他们访问数据时获得更好的定位。比如,在 Webtable 中,同一个域名下的页面被通过近似的行名聚合在一起。如我们存储maps.google.com/index.html 的数据到以com.google.maps/index.html 为名的行中。将同一域名的页面存储到相对靠近的位置,对于某些基于host 和 域名的分析非常有效。

 


Column Families
Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type (we compress data in the same column family together). A column family must be created before data can be stored under any column key in that family; after a family has been created, any column key within the family can be used. It is our intent that the number of distinct column families in a table be small (in the hundreds at most), and that families rarely change during operation. In contrast,
a table may have an unbounded number of columns. A column key is named using the following syntax: family:qualifier. Column family names must be printable, but qualifiers may be arbitrary strings. An example column family for the Webtable is language, which stores the language in which a web page was written. We use only one column key in the language family, and it stores each web page's language ID. Another useful column family for this table is anchor; each column key in this family represents a single anchor, as shown in Figure 1. The qualifier is the name of the referring site; the cell contents is the link text. Access control and both disk and memory accounting are performed at the column-family level. In our Webtable example, these controls allow us to manage several different types of applications: some that add new base data, some that read the base data and create derived column families, and some that are only allowed to view existing data (and possibly not even to view all of the existing families for privacy reasons).

 

列家庭

列的键被组合到一个叫列家庭的集合中,这个集合可以构建基本的访问控制单元。存储在列家庭中的所有数据通常具有相同的类型(我们把同一个列家庭的数据压缩在一起)。列家庭必须在列可以用于存储前被创建。列家庭建好后,其中的列成员就可以使用了。表中列家庭的数量往往不是特别多(最多100个左右)而且列家庭创建后很少被改动,所以我们才这样设计。相对来说一个表可能存在无限数量的列。列的键名采用如下语法命名:family:qualifier. 列家庭名必须为可见字符,但qualifier 可以是任意的字符。一个列家庭名的例子为 lanuage,这个列家庭存储web 页面的语言。我们仅仅使用这个列家庭中的一个列,这个列存储web 页面的语言ID。另外一个有用的列家庭是表的 anchor(锚点),如图1所示这个列家庭中每个列负责一个单独的锚点。qualifier 就是被引用完整的名字。


Timestamps
Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp. Bigtable timestamps are 64-bit integers. They can be assigned by Bigtable, in which case they represent “real time” in microseconds, or be explicitly assigned by client

时间戳

BigTable 的每个单元都可以包含相同数据的不同版本。这些版本被用时间戳来索引。Bigtable 的时间戳是一个 64位的整形。他们可以被BigTable 分配用于毫秒级别的实时数据表现,或者有客户来准确的指定。

 

// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");
// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);


Figure 2: Writing to Bigtable.

上面那个代码是向Bigtable 写数据的操作代码。


applications. Applications that need to avoid collisions must generate unique timestamps themselves. Different versions of a cell are stored in decreasing timestamp order, so that the most recent versions can be read rst.To make the management of versioned data less onerous,
we support two per-column-family settings that tell Bigtable to garbage-collect cell versions automatically. The client can specify either that only the last n versions of a cell be kept, or that only new-enough versions be kept (e.g., only keep values that were written in the last
seven days). In our Webtable example, we set the timestamps of the crawled pages stored in the contents: column to the times at which these page versions were actually crawled. The garbage-collection mechanism described above lets us keep only the most recent three versions of
every page.

应用

为了避免冲突,必须为应用生成唯一的时间戳。在一个单元中的不同版本按时间戳的倒序排列,这样最近的时间可以在第一个位置读到。为了让版本数据的管理不是太麻烦,我们支持两个 per-column-family 的设置用于告知Bigtable 对版本信息进行自动垃圾回收。客户可以指定下面两种设置中的任何一种:

1. 仅仅保留最新的n个版本

2. 近保留足够新的版本(比如仅仅保留那些7天内写入的版本)

在我们的Webtable  示例中,我们在contents 列中存储抓取的页面的时间戳,这个时间戳记录页面抓取时的时间。垃圾回收机制被指定为仅仅保留最近3个版本。

 

Bigtable: Google 的结构化数据分布式存储系统 (一)翻译