What is the best open source solution for storing time series data?

Time: 2022-10-17 21:59:16

I am interested in monitoring some objects. I expect to get about 10000 data points every 15 minutes. (Maybe not at first, but this is the 'general ballpark'.) I would also like to be able to get daily, weekly, monthly and yearly statistics. It is not critical to keep the data at the highest resolution (15 minutes) for more than two months.

I am considering various ways to store this data, and have been looking at a classic relational database, or at a schemaless database (such as SimpleDB).

My question is, what is the best way to go about doing this? I would very much prefer an open-source (and free) solution to a proprietary, costly one.

Small note: I am writing this application in Python.

5 Answers

#1


11  

HDF5, which can be accessed through h5py or PyTables, is designed for dealing with very large data sets. Both interfaces work well. For example, both h5py and PyTables offer automatic compression and support Numpy.

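For instance, here is a minimal h5py sketch (the file name, dataset layout, chunk size and sample values are illustrative assumptions, not part of this answer) that appends one 15-minute batch of 10000 samples to a resizable, gzip-compressed dataset:

import h5py
import numpy as np

# One resizable, gzip-compressed dataset; each row is (unix_timestamp, object_id, value).
with h5py.File("metrics.h5", "w") as f:
    dset = f.create_dataset(
        "samples",
        shape=(0, 3),
        maxshape=(None, 3),          # unlimited number of rows
        dtype="float64",
        chunks=(10000, 3),           # roughly one 15-minute batch per chunk
        compression="gzip",
    )

    # Build one batch: 10000 objects sampled at the same poll time.
    batch = np.column_stack([
        np.full(10000, 1666000000.0),          # poll timestamp
        np.arange(10000, dtype=np.float64),    # object ids
        np.random.rand(10000),                 # measured values
    ])

    # Append by resizing the dataset and writing into the new rows.
    dset.resize(dset.shape[0] + len(batch), axis=0)
    dset[-len(batch):] = batch

PyTables exposes the same HDF5 files through a higher-level Table interface, which some people find more convenient for record-style time series.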

#2


8  

RRDTool by Tobi Oetiker, definitely! It's open source, and it was designed for exactly this kind of use case.

EDIT:

To provide a few highlights: RRDTool stores time-series data in a round-robin database. It keeps raw data for a given period of time, then condenses it in a configurable way, so you have, say, fine-grained data for a month, data averaged over a week for the last 6 months, and data averaged over a month for the last 2 years. As a side effect, your database stays the same size all of the time (so no sweating that your disk may run full). That was the storage side. On the retrieval side, RRDTool offers data queries that are immediately turned into graphs (e.g. PNG) that you can readily include in documents and web pages. It's a rock-solid, proven solution that is a much more general form of its predecessor, MRTG (some might have heard of this). And once you get into it, you will find yourself re-using it over and over again.

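To make that concrete, here is a rough sketch with the python-rrdtool bindings (the file names and retention figures are assumptions chosen to roughly match the 15-minute, two-month scenario from the question; they are not dictated by RRDTool):

import rrdtool

# One round-robin database per monitored object: 15-minute step,
# raw data kept for about two months, coarser averages kept for longer.
rrdtool.create(
    "object_42.rrd",
    "--step", "900",                # one primary data point every 15 minutes
    "DS:value:GAUGE:1800:U:U",      # accept samples arriving up to 30 minutes apart
    "RRA:AVERAGE:0.5:1:5760",       # raw 15-minute data for ~60 days
    "RRA:AVERAGE:0.5:96:366",       # daily averages for ~1 year
    "RRA:AVERAGE:0.5:672:520",      # weekly averages for ~10 years
)

# Feed in a new sample ("N" means "now") ...
rrdtool.update("object_42.rrd", "N:123.4")

# ... and render a PNG of the last month directly from the database.
rrdtool.graph(
    "object_42.png",
    "--start", "-1month",
    "DEF:v=object_42.rrd:value:AVERAGE",
    "LINE1:v#0000ff:value",
)

With on the order of 10000 objects you could create one small file per object or group several data sources into one file; both are common RRDTool setups.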

For a quick overview of RRDTool and who uses it, see also here. If you want to see what kinds of graphs you can produce, make sure you have a look at the gallery.

#3


1  

Plain text files? It's not clear what your 10k data points per 15 minutes translates to in terms of bytes, but in any case text files are easier to store/archive/transfer/manipulate, and you can inspect them directly, just by looking at them. They're fairly easy to work with from Python, too.

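For what it's worth, a minimal Python sketch of that idea (the one-file-per-day layout and the whitespace-separated "timestamp object_id value" format are assumptions made up for illustration):

import time

# Append one 15-minute batch of samples; one file per day keeps
# archiving or expiring old data as simple as moving or deleting files.
def append_batch(samples, path):
    now = int(time.time())
    with open(path, "a") as f:
        for object_id, value in samples:
            f.write(f"{now} {object_id} {value}\n")

append_batch([(1, 0.25), (2, 0.75)], "2022-10-17.txt")

# Reading it back needs nothing beyond the standard library (or a text editor).
with open("2022-10-17.txt") as f:
    for line in f:
        timestamp, object_id, value = line.split()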

#4


1  

This is pretty standard data-warehousing stuff.

Lots of "facts", organized by a number of dimensions, one of which is time. Lots of aggregation.

In many cases, simple flat files that you process with simple aggregation algorithms based on defaultdict will work wonders -- fast and simple.

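As a sketch of that approach (the whitespace-separated "timestamp object_id value" file format is an assumption made up for illustration), rolling 15-minute samples up to per-object daily averages takes only a few lines:

from collections import defaultdict

sums = defaultdict(float)
counts = defaultdict(int)

# One "timestamp object_id value" record per line of the flat file.
with open("samples.txt") as f:
    for line in f:
        timestamp, object_id, value = line.split()
        day = int(timestamp) // 86400          # bucket by UTC day
        sums[(object_id, day)] += float(value)
        counts[(object_id, day)] += 1

daily_average = {key: sums[key] / counts[key] for key in sums}

Weekly, monthly and yearly rollups are the same loop with a different bucketing key.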

Look at Efficiently storing 7.300.000.000 rows

Database choice for large data volume?

#5


0  

There is an open source time-series database under active development (.NET only for now) that I wrote. It can store massive amounts (terabytes) of uniform data in a "binary flat file" fashion. All usage is stream-oriented (forward or reverse). We actively use it for stock tick storage and analysis at our company.

https://code.google.com/p/timeseriesdb/

// Create a new file for MyStruct data.
// Use BinCompressedFile<,> for compressed storage of deltas
using (var file = new BinSeriesFile<UtcDateTime, MyStruct>("data.bts"))
{
   file.UniqueIndexes = true; // enforces index uniqueness
   file.InitializeNewFile(); // create file and write header
   file.AppendData(data); // append data (stream of ArraySegment<>)
}

// Read needed data.
using (var file = (IEnumerableFeed<UtcDateTime, MyStruct>) BinaryFile.Open("data.bts", false))
{
    // Enumerate one item at a time, maximum 10 items, starting at 2011-1-1
    // (can also get one segment at a time with StreamSegments)
    foreach (var val in file.Stream(new UtcDateTime(2011, 1, 1), maxItemCount: 10))
        Console.WriteLine(val);
}
