UTF-8 tuple storage using lowest-common-denominator technology, append-only

Time: 2022-02-14 17:18:18

EDIT: Note that due to the way hard drives actually write data, none of the schemes in this list work reliably. Do not use them. Just use a database. SQLite is a good simple one.

What's the most low-tech but reliable way of storing tuples of UTF-8 strings on disk? Storage should be append-only for reliability.

As part of a document storage system I'm experimenting with I have to store UTF-8 tuple data on disk. Obviously, for a full-blown implementation, I want to use something like Amazon S3, Project Voldemort, or CouchDB.

However, at the moment, I'm experimenting and haven't even firmly settled on a programming language yet. I have been using CSV, but CSV tends to become brittle when you try to store outlandish Unicode and unexpected whitespace (e.g. vertical tabs).

I could use XML or JSON for storage, but they don't play nice with append-only files. My best guess so far is a rather idiosyncratic format where each string is preceded by a 4-byte signed integer indicating the number of bytes it contains, and an integer value of -1 indicates that this tuple is complete - the equivalent of a CSV newline. The main source of headaches there is having to decide on the endianness of the integer on disk.
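
A minimal sketch of the length-prefixed layout described above, assuming big-endian byte order (an arbitrary choice made here for illustration; the question leaves endianness open). The names `write_tuple` and `read_tuples` are hypothetical. The reader only detects a truncated tail; it does not attempt to regain alignment if more data is appended after a partial write.

```python
import io
import struct

END_OF_TUPLE = -1  # sentinel "length" that closes a tuple


def write_tuple(out, fields):
    """Append one tuple: each field as <int32 length><UTF-8 bytes>, then -1."""
    for field in fields:
        data = field.encode("utf-8")
        out.write(struct.pack(">i", len(data)))  # ">i" = big-endian signed 32-bit
        out.write(data)
    out.write(struct.pack(">i", END_OF_TUPLE))


def read_tuples(inp):
    """Yield complete tuples; stop silently at a truncated tail."""
    fields = []
    while True:
        header = inp.read(4)
        if len(header) < 4:
            return  # clean end of file, or a cut-off length header
        (length,) = struct.unpack(">i", header)
        if length == END_OF_TUPLE:
            yield tuple(fields)
            fields = []
            continue
        if length < 0:
            raise ValueError("corrupt length header")
        data = inp.read(length)
        if len(data) < length:
            return  # the string itself was cut off mid-write
        fields.append(data.decode("utf-8"))


buf = io.BytesIO()
write_tuple(buf, ["a", "héllo\tworld"])
write_tuple(buf, ["", "x"])
buf.seek(0)
print(list(read_tuples(buf)))
```

Dropping the last few bytes of the buffer makes the reader return only the first tuple, which is the best this format can do without some extra marker for resynchronising.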

Edit: actually, this won't work. If the program exits while writing a string, the data becomes irrevocably misaligned. Some sort of out-of-band signalling is needed to ensure alignment can be regained after an aborted tuple.

Edit 2: Turns out that guaranteeing atomicity when appending to text files is possible, but the parser is quite non-trivial. Writing said parser now.

Edit 3: You can view the end result at http://github.com/MetalBeetle/Fruitbat/tree/master/src/com/metalbeetle/fruitbat/atrio/ .

2 answers

#1


2  

I would recommend tab delimiting each field and carriage-return delimiting each record.

Within each string, replace all characters that would affect the field and record interpretation and rendering. This would include control characters (U+0000–U+001F, U+007F–U+009F), non-graphical line and paragraph separators (U+2028, U+2029), directional control characters (U+202A–U+202E), and the byte order mark (U+FEFF).

They should be replaced with escape sequences of constant length. The escape sequences should begin with a rare (for your application) character. The escape character itself should also be escaped.

This would allow you to append new records easily. It has the additional advantage of being able to load the file for visual inspection and modification into any spreadsheet or word processing program, which could be useful for debugging purposes.

This would also be easy to code, since the file will be a valid UTF-8 document, so standard text reading and writing routines may be used. This also allows you to convert easily to UTF-16BE or UTF-16LE if desired, without complications.

Example:

U+0009 CHARACTER TABULATION becomes ~TB
U+000A LINE FEED            becomes ~LF
U+000D CARRIAGE RETURN      becomes ~CR
U+007E TILDE                becomes ~~~
etc.
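
The escaping scheme above could be sketched roughly as follows. This is only an illustration: it covers just the four mappings listed (the remaining control characters behind "etc." are left out), and the function names are hypothetical. Every escape sequence is exactly three characters long, so the decoder can always consume a fixed-width chunk after a tilde.

```python
ESCAPES = {"\t": "~TB", "\n": "~LF", "\r": "~CR", "~": "~~~"}
UNESCAPES = {seq: ch for ch, seq in ESCAPES.items()}

FIELD_SEP = "\t"   # tab between fields
RECORD_END = "\r"  # carriage return after each record


def escape(field):
    """Replace every special character with its 3-character escape."""
    return "".join(ESCAPES.get(ch, ch) for ch in field)


def unescape(text):
    """Reverse escape(); every escape is exactly 3 characters long."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "~":
            seq = text[i:i + 3]
            if seq not in UNESCAPES:
                raise ValueError("bad escape sequence: %r" % seq)
            out.append(UNESCAPES[seq])
            i += 3
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def encode_record(fields):
    """One record: escaped fields joined by tabs, closed by a carriage return."""
    return FIELD_SEP.join(escape(f) for f in fields) + RECORD_END


def decode_records(text):
    """Decode all complete records; an unterminated trailing record is ignored."""
    complete = text.split(RECORD_END)[:-1]
    return [tuple(unescape(f) for f in rec.split(FIELD_SEP)) for rec in complete]


record = ("a\tb", "line1\nline2\r", "~TB", "")
encoded = encode_record(record)
print(repr(encoded))
print(decode_records(encoded) == [record])
```

Because no raw tab or carriage return can appear inside an escaped field, splitting on the delimiters is safe, and a record that lacks its closing carriage return can be recognised as incomplete.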

There are a couple of reasons why tabs would be better than commas as field delimiters. Commas appear more commonly within normal text strings (such as English text), and would have to be replaced more frequently. And spreadsheet programs (such as Microsoft Excel) tend to handle tab-delimited files much more naturally.

#2


1  

Mostly thinking out loud here...

Really low tech would be to use (for example) null bytes as separators, and just "quote" all null bytes appearing in the output with an additional null.

Perhaps one could use SCSU along with that.

Or it might be worth looking at the gzip format, and maybe aping it, if not using it:

A gzip file consists of a series of "members" (compressed data sets).

[...]

The members simply appear one after another in the file, with no additional information before, between, or after them.

Each of these members can have an optional "filename", comment, or the like, and I believe you can just keep appending members.
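
As a rough illustration of appending members, Python's standard `gzip` module can be used: `gzip.compress` produces one complete member, and `gzip.decompress` accepts several members concatenated together. The helper name `append_member` is hypothetical, and the sketch works on in-memory bytes rather than a real file.

```python
import gzip


def append_member(existing, record):
    """Return the log with one more gzip member holding `record`."""
    return existing + gzip.compress(record.encode("utf-8"))


log = b""
log = append_member(log, "first\n")
log = append_member(log, "second\n")

# Decompressing the concatenation yields the payloads back to back.
print(gzip.decompress(log).decode("utf-8"))
```

Note that decompression returns the payloads joined together, so the records still need their own delimiter or escaping inside each member.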

Or you could use bencode, which is used in torrent files. Or BSON.

See also Wikipedia's Comparison of data serialization formats.

Otherwise I think your idea of preceding each string with its length is probably the simplest one.
