Writing binary files using C++: does the default locale matter?

I have code that manipulates binary files using fstream with the binary flag set and using the unformatted I/O functions read and write. This works correctly on all systems I've ever used (the bits in the file are exactly as expected), but those are basically all U.S. English. I have been wondering about the potential for these bytes to be modified by a codecvt on a different system.


It sounds like the standard says that unformatted I/O behaves the same as putting characters into the streambuf using sputc/sgetc. These lead to the streambuf's overflow or underflow functions being called, which in turn appear to pass the data through a codecvt (e.g., see 27.8.1.4.3 in the C++ standard). For basic_filebuf, the creation of this codecvt is specified in 27.8.1.1.5. This makes it look like the results will depend on what basic_filebuf.getloc() returns.
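One way to see what getloc() returns in practice is to ask a freshly constructed file buffer directly. The sketch below assumes the program has not changed the global locale; the helper names are hypothetical and exist only for illustration.

```cpp
#include <fstream>
#include <locale>
#include <string>

// Name of the locale that a newly constructed file buffer picks up.
// A stream buffer copies the global C++ locale when it is constructed.
std::string filebuf_locale_name() {
    std::filebuf buf;
    return buf.getloc().name();
}

// True if the file buffer's locale compares equal to the classic "C" locale.
bool filebuf_uses_classic_locale() {
    std::filebuf buf;
    return buf.getloc() == std::locale::classic();
}
```

If the program never touches the locale, both helpers should report the "C" locale regardless of how the operating system is configured.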


So, my question is, can I assume that a character array written out using ofstream.write on one system can be recovered verbatim using ifstream.read on another system, no matter what locale configuration either person might be using on their system? I would make the following assumptions:


The program is using the default locale (i.e., the program is not changing the locale settings itself at all).
The systems both have CHAR_BIT equal to 8, have the same bit order within each byte, store files as octets, etc.
The stream objects have the binary flag set.
We don't need to worry about any endianness differences at this stage. If any bytes in the array are to be interpreted as a multi-byte value, endianness conversions will be handled as required at a later stage.

If the default locale isn't guaranteed to pass through this stuff unmodified on some system configuration (I don't know, Arabic or something), then what is the best way to write binary files using C++?


3 Answers

#1

On Windows it should be fine, but on other operating systems you should also check the line endings, just to be safe. The default C/C++ locale is "C", which does not depend on the system's locale.


This is not a guarantee. As you know, C/C++ compilers and their target machines vary greatly, so you are asking for trouble if you rely on all those assumptions. The overhead of changing the locale is negligible unless you do it hundreds of times per second.


#2

If you have the binary flag set, everything you write will be written to the file verbatim. No conversions. How you interpret the bytes is up to you (and possibly the locale).
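A round trip is an easy way to check this on a given system. The following is a minimal sketch, assuming the default locale and a writable working directory; the function names and the file path passed in are hypothetical.

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Writes the bytes to `path` using unformatted output on a binary stream.
bool write_bytes(const std::string& path, const std::vector<char>& data) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
    return static_cast<bool>(out);
}

// Reads the whole file back through the stream buffer, byte by byte.
bool read_bytes(const std::string& path, std::vector<char>& data) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    data.assign(std::istreambuf_iterator<char>(in),
                std::istreambuf_iterator<char>());
    return true;
}
```

Writing every possible byte value, including 0x0A, 0x0D and 0x1A, and comparing the result after reading it back is a reasonable smoke test for unwanted conversions.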


One more thing: breakage across locales is still possible. If, for example, your data source creates binary data whose format depends on the locale (which is a bad idea, by the way), loading that data on a machine with a different locale will cause trouble. This is a design error, though.


If you just use standard data types/structures that have the same format/layout no matter what locale they were created in, everything should be OK.
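One way to keep the layout fixed is to serialize each value into an explicit byte sequence instead of writing the in-memory representation. The sketch below assumes a little-endian file format for a 32-bit unsigned value; that choice and the function names are illustrative assumptions, not something the answer prescribes.

```cpp
#include <array>
#include <cstdint>

// Encodes a 32-bit value as four bytes, least significant byte first.
std::array<unsigned char, 4> encode_u32_le(std::uint32_t v) {
    return {{ static_cast<unsigned char>(v & 0xFFu),
              static_cast<unsigned char>((v >> 8) & 0xFFu),
              static_cast<unsigned char>((v >> 16) & 0xFFu),
              static_cast<unsigned char>((v >> 24) & 0xFFu) }};
}

// Rebuilds the value from four bytes stored least significant byte first.
std::uint32_t decode_u32_le(const std::array<unsigned char, 4>& b) {
    return  static_cast<std::uint32_t>(b[0])
         | (static_cast<std::uint32_t>(b[1]) << 8)
         | (static_cast<std::uint32_t>(b[2]) << 16)
         | (static_cast<std::uint32_t>(b[3]) << 24);
}
```

Because the byte order is fixed by the shifts rather than by the host, the file contents do not depend on the machine's endianness or on any locale setting.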


#3

Thanks for the help. I just thought it might be helpful to post some additional information about this that wouldn't fit in a comment.


The default locale for C++ programs is always the "C" locale (http://www.cplusplus.com/reference/clibrary/clocale/setlocale/). If this is the only locale used in your program, it means the behaviour doesn't depend on the particular locale configuration of the machine that it's running on. It also means that unformatted I/O for a char does not undergo any code conversion (wchar_t might be a different story though). This means that (given the assumptions in the question) read and write should allow binary data to be recovered unmodified.
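The "no code conversion for char" claim can be checked by querying the codecvt facet itself. This is a small sketch with hypothetical helper names; always_noconv() is the standard facet member that reports whether the conversion is a no-op.

```cpp
#include <cwchar>   // std::mbstate_t
#include <locale>

// True if the char-to-char conversion facet of `loc` never converts.
bool char_codecvt_is_noop(const std::locale& loc) {
    using facet_t = std::codecvt<char, char, std::mbstate_t>;
    return std::use_facet<facet_t>(loc).always_noconv();
}

// Same query for wchar_t streams, where the answer may differ.
bool wchar_codecvt_is_noop(const std::locale& loc) {
    using facet_t = std::codecvt<wchar_t, char, std::mbstate_t>;
    return std::use_facet<facet_t>(loc).always_noconv();
}
```

Calling char_codecvt_is_noop(std::locale::classic()) should return true, since the standard char facet is specified as a degenerate conversion. The wchar_t result is implementation-dependent, which matches the caveat above.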


(From reading the documentation.) You can globally set the application's locale to match the system default by calling setlocale(LC_ALL, ""). Note that setlocale changes the C library locale; the locale a C++ stream picks up at construction is the global C++ locale, which is set with std::locale::global(std::locale("")). Streams constructed from that point on will use the system default locale. To set it back to the "C" locale, call setlocale(LC_ALL, "C") (or std::locale::global(std::locale::classic()) for C++ streams), so that streams constructed afterwards use it. You can also specify that the "C" locale should be used for a stream that is already constructed by calling stream.imbue(locale::classic()).
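If other code in the program might change the global locale, a stream can be pinned to the "C" locale explicitly. The following is a minimal sketch of that idea; the function name is hypothetical, and imbuing before opening the file is a cautious choice rather than a requirement stated in the answer.

```cpp
#include <fstream>
#include <locale>
#include <string>

// Writes `n` bytes to `path` through a stream pinned to the classic locale,
// so the result does not depend on the global locale at the time of the call.
bool write_with_classic_locale(const std::string& path,
                               const char* data,
                               std::streamsize n) {
    std::ofstream out;
    out.imbue(std::locale::classic());   // set the locale before opening
    out.open(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(data, n);
    return static_cast<bool>(out);
}
```

The same imbue call works on an ifstream, so the reading side can be pinned in the same way.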

