What is a good storage candidate for soft real-time data acquisition under Linux?

Time: 2021-06-13 16:54:11

I'm building a system for data acquisition. Acquired data typically consists of 15 signals, each sampled at (say) 500 Hz. That is, each second approximately 15 x 500 x 4 bytes (signed float) will arrive and have to be persisted.

The previous version was built on .NET (C#) using a DB4O db for data storage. This was fairly efficient and performed well.


The new version will be Linux-based, using Python (or maybe Erlang) and ... Yes! What is a suitable storage candidate?

I'm thinking MongoDB, storing each sample (or actually a bunch of them) as BSON objects. Each sample (block) will have a sample counter as a key (indexed) field, as well as a signal source identification.


The catch is that I have to be able to retrieve samples pretty quickly. When requested, up to 30 seconds of data have to be retrieved in much less than a second, using a sample counter range and requested signal sources. The current (C#/DB4O) version manages this OK, retrieving data in much less than 100 ms.

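Roughly, the document layout and range query I have in mind would look like the sketch below (using pymongo; database, collection and field names are just placeholders I made up):

    import pymongo

    client = pymongo.MongoClient("mongodb://localhost:27017")
    coll = client["acq"]["samples"]   # placeholder database/collection names

    # One document per block of samples from one signal source; 'counter' is the
    # sample counter of the first sample in the block.
    coll.create_index([("source", pymongo.ASCENDING), ("counter", pymongo.ASCENDING)])

    def store_block(source_id, counter, samples):
        # Persist one block of float samples for a given signal source.
        coll.insert_one({"source": source_id, "counter": counter, "samples": samples})

    def fetch_range(source_ids, first_counter, last_counter):
        # Fetch all blocks for the requested sources within a sample-counter range.
        return list(coll.find(
            {"source": {"$in": source_ids},
             "counter": {"$gte": first_counter, "$lte": last_counter}},
            sort=[("counter", pymongo.ASCENDING)]))

    # 30 s at 500 Hz is a 15000-sample counter range per source
    blocks = fetch_range([1, 2, 3], 100000, 100000 + 30 * 500)

With a compound index on (source, counter), a 30-second range over a handful of sources should stay well within the latency budget, but that is something I would have to measure.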

I know that Python might not be ideal performance-wise, but we'll see about that later on.


The system ("server") will have multiple acquisition clients connected, so the architecture must scale well.


Edit: After further research I will probably go with HDF5 for sample data and either Couch or Mongo for more document-like information. I'll keep you posted.


Edit: The final solution was based on HDF5 and CouchDB. It performed just fine, implemented in Python, running on a Raspberry Pi.


3 Solutions

#1


4  

You could have a look into using HDF5. It is designed for streamed data, allows time-indexed seeking, and (as far as I know) is pretty well supported in Python.
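
For example, a minimal sketch with h5py (file name, dataset layout and chunk size are just illustrative), appending incoming blocks and slicing out an arbitrary sample range:

    import h5py
    import numpy as np

    # One 2-D dataset: rows are samples, columns are the 15 signals.
    # maxshape=(None, 15) makes it extendable along the time axis.
    f = h5py.File("acquisition.h5", "a")
    if "samples" in f:
        ds = f["samples"]
    else:
        ds = f.create_dataset("samples", shape=(0, 15), maxshape=(None, 15),
                              dtype="float32", chunks=(500, 15))

    def append_block(block):
        # Append an (n x 15) float32 block to the end of the dataset.
        n = ds.shape[0]
        ds.resize(n + block.shape[0], axis=0)
        ds[n:] = block
        f.flush()

    def read_window(first_sample, last_sample, signal_indices):
        # Slice out a sample-counter range for a subset of the signals.
        return ds[first_sample:last_sample, signal_indices]

    append_block(np.zeros((500, 15), dtype="float32"))    # one second of data
    window = read_window(0, 30 * 500, [0, 3, 7])          # 30 s of three signals

Chunking along the time axis should keep appends cheap and keep a 30-second slice mostly contiguous on disk.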

#2


2  

Using the keys you described, you should be able to scale via sharding if necessary. 120 kB / 30 sec is not that much, so I think you do not need to shard too early.

If you compare that to just using files, you'll get more sophisticated queries and built-in replication for high availability, DS or offline processing (MapReduce etc.).
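
If it ever becomes necessary, sharding on those same keys could look roughly like this (assuming the hypothetical acq.samples collection and source/counter fields from the question, and a connection to a mongos router of a sharded cluster):

    from pymongo import MongoClient

    # Must be connected to a mongos router of a sharded cluster.
    client = MongoClient("mongodb://localhost:27017")
    client.admin.command("enableSharding", "acq")
    client.admin.command("shardCollection", "acq.samples",
                         key={"source": 1, "counter": 1})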

#3


-1  

In your case, you could just create 15 files and save each sample sequentially into the corresponding file. This will make sure the requested samples are stored contiguously on disk and hence reduce the number of disk seeks while reading.
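
A minimal sketch of that idea (file naming and helper functions are just placeholders; numpy is only used for convenience):

    import numpy as np

    SAMPLE_SIZE = 4  # bytes per float32 sample

    def append_samples(signal_id, samples):
        # Append samples for one signal to its own flat binary file.
        with open(f"signal_{signal_id:02d}.dat", "ab") as fh:
            np.asarray(samples, dtype="float32").tofile(fh)

    def read_range(signal_id, first_sample, n_samples):
        # Seek straight to the first requested sample and read n_samples from there.
        with open(f"signal_{signal_id:02d}.dat", "rb") as fh:
            fh.seek(first_sample * SAMPLE_SIZE)
            return np.fromfile(fh, dtype="float32", count=n_samples)

    append_samples(0, [0.1, 0.2, 0.3])
    chunk = read_range(0, 0, 30 * 500)   # 30 s at 500 Hz from signal 0

Because samples are fixed-size float32 values, the byte offset of any sample counter is just counter * 4, so a range read is a single seek plus one sequential read per signal.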
