What is the best format for saving data frames to disc in R for storage?

Date: 2022-05-14 13:43:27

What is the best format to persist simple data frames to disc in R for storage while limiting semantic loss?

I ask because I'm archiving a data set. In an ideal world, my data format would have the following characteristics:

  1. Stability - the storage format will be compatible with future versions of R

  2. Semantic compatibility - the storage format will understand the semantics of R's primitive data types. For example, it will be able to store ordered factors with labels in a sensible manner.

  3. Open standard - ideally, the format will be an open standard so other statistics packages (now or in the future) will be able to understand it

My first thought was to use CSV, which is very stable but lacks the semantic richness required. On the other hand, R's built-in RData format completely captures R's semantics, but seems likely to change between releases (correct me if I'm wrong).
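
To make the CSV concern concrete, here is a minimal sketch of the round trip. The data frame, its column names, and the temporary file path are made up for illustration; only `write.csv`, `read.csv`, and `is.ordered` from base R are used.

```r
# Hypothetical example data: an ordered factor plus a numeric column.
df <- data.frame(
  dose = factor(c("low", "high", "medium"),
                levels = c("low", "medium", "high"),
                ordered = TRUE),
  response = c(1.5, 3.25, 2)
)

csv_path <- tempfile(fileext = ".csv")  # hypothetical file location
write.csv(df, csv_path, row.names = FALSE)
back <- read.csv(csv_path)

# The CSV stores only the label text, so the level order is not recorded.
was_ordered <- is.ordered(df$dose)
still_ordered <- is.ordered(back$dose)
```

After the round trip the `dose` column comes back as plain text (or as an unordered factor in older versions of R), so the ordering would have to be documented and restored separately.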

Is there another format that finds a balance between these three imperatives?

1 Answer

#1


4  

Dump it to a text file with dput. That way you get all the structure of R's objects, and it's in a text-based form that, should R stop existing, can be parsed fairly easily.

It probably doesn't pass (3), your 'open standard' test.
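
As a rough sketch of the dput approach, assuming a small made-up data frame and a temporary file path (both hypothetical), the object can be written with `dput` and read back with `dget`:

```r
# Hypothetical example data: an ordered factor plus a numeric column.
df <- data.frame(
  dose = factor(c("low", "high", "medium"),
                levels = c("low", "medium", "high"),
                ordered = TRUE),
  response = c(1.5, 3.25, 2)
)

txt_path <- tempfile(fileext = ".R")  # hypothetical file location
dput(df, file = txt_path)             # write the object as R source text
restored <- dget(txt_path)            # parse the text back into an object

# The factor levels and the "ordered" class are part of the text, so they
# are available again after reading.
restored_levels <- levels(restored$dose)
restored_is_ordered <- is.ordered(restored$dose)
```

The file is ordinary text (a `structure(list(...), ...)` call), which is what makes it readable without R, at the cost of being an R-specific notation rather than an open standard.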

R is pretty good at backward compatibility with its .RData format, so even if the files written by the latest R aren't the same as older ones, the latest R will still read old files. However, if R should stop existing, reverse-engineering the binary format is orders of magnitude harder than grokking the output from dput.
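
For comparison, a minimal sketch of the binary route, again with made-up data and a temporary path. Passing `version = 2` to `save` is an assumption about what an archive might want: it requests the older workspace format that R releases before 3.5.0 can also read.

```r
# Hypothetical example data with an ordered factor column.
df <- data.frame(
  id = 1:3,
  grade = factor(c("b", "a", "c"), levels = c("a", "b", "c"), ordered = TRUE)
)

rdata_path <- tempfile(fileext = ".RData")  # hypothetical file location
save(df, file = rdata_path, version = 2)    # older, widely readable format

# Load into a separate environment so the original object is not overwritten.
env <- new.env()
loaded_names <- load(rdata_path, envir = env)
restored <- env$df
```

This keeps the full R semantics, but the file is only practical to read with R itself, which is the trade-off described above.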
