Storing real-time data into 1000 files

Time: 2021-09-03 16:58:12

I have a program that receives real time data on 1000 topics. It receives -- on average -- 5000 messages per second. Each message consists of two strings: a topic and a message value. I'd like to save these strings along with a timestamp indicating the message arrival time.

I'm using 32 bit Windows XP on 'Core 2' hardware and programming in C#.

I'd like to save this data into 1000 files -- one for each topic. I know many people will want to tell me to save the data into a database, but I don't want to go down that road.

I've considered a few approaches:

1) Open up 1000 files and write into each one as the data arrives. I have two concerns about this. I don't know if it is possible to open up 1000 files simultaneously, and I don't know what effect this will have on disk fragmentation.

2) Write into one file and -- somehow -- process it later to produce 1000 files.

3) Keep it all in RAM until the end of the day and then write one file at a time. I think this would work well if I have enough RAM, although I might need to move to 64 bit to get over the 2 GB limit.

How would you approach this problem?

11 Answers

#1


I agree with Oliver, but I'd suggest a modification: have 1000 queues, one for each topic/file. One thread receives the messages, timestamps them, then sticks them in the appropriate queue. The other simply rotates through the queues, checking whether they have data. If so, it reads the messages, then opens the corresponding file and writes the messages to it. After it closes the file, it moves to the next queue. One advantage of this is that you can add additional file-writing threads if one can't keep up with the traffic. I'd probably first try setting a write threshold, though (deferring processing of a queue until it has N messages), to batch your writes. That way you don't get bogged down opening and closing a file just to write one or two messages.

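A rough C# sketch of that design might look like the following (the `TopicLogger` class, the threshold of 100 messages, and the `topic + ".log"` file naming are illustrative assumptions, not part of the original answer; the .NET concurrent collections are used for brevity, and plain queues guarded by locks would serve the same purpose on an older runtime):

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;

class TopicLogger
{
    // One queue per topic; the receiving thread enqueues, writer threads drain.
    private readonly ConcurrentDictionary<string, ConcurrentQueue<string>> _queues =
        new ConcurrentDictionary<string, ConcurrentQueue<string>>();
    private const int WriteThreshold = 100;   // defer writing until a queue holds this many messages

    // Called by the receiving thread for every incoming message.
    public void OnMessage(string topic, string value)
    {
        string line = DateTime.UtcNow.ToString("o") + "\t" + value;
        _queues.GetOrAdd(topic, _ => new ConcurrentQueue<string>()).Enqueue(line);
    }

    // Run on a background writer thread.
    public void WriterLoop(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            foreach (var pair in _queues)
            {
                if (pair.Value.Count < WriteThreshold)
                    continue;                              // batch the writes

                using (var writer = File.AppendText(pair.Key + ".log"))
                {
                    string line;
                    while (pair.Value.TryDequeue(out line))
                        writer.WriteLine(line);
                }
            }
            Thread.Sleep(50);   // avoid spinning when the queues are quiet
        }
    }
}
```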

#2


I can't imagine why you wouldn't want to use a database for this. This is what they were built for. They're pretty good at it.

If you're not willing to go that route, storing them in RAM and rotating them out to disk every hour might be an option but remember that if you trip over the power cable, you've lost a lot of data.

Seriously. Database it.

Edit: I should add that getting a robust, replicated and complete database-backed solution would take you less than a day if you had the hardware ready to go.

Doing this level of transaction protection in any other environment is going to take you weeks longer to set up and test.

#3


Like n8wrl, I would also recommend a DB. But if you really dislike that option ...

Let's find another solution ;-)

As a minimal first step, I would use two threads. The first is a worker thread that receives all the data and puts each object (timestamp plus the two strings) into a queue.

Another thread will check this queue (perhaps notified by an event, or by polling the Count property). This thread will dequeue each object, open the specific file, write the entry, close the file, and proceed to the next item.

I would start with this first approach and take a look at the performance. If it isn't good enough, do some measuring to find where the problem is and address it (for example, keep the open files in a dictionary mapping name to StreamWriter, etc.).

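If opening and closing a file per message turns out to be the bottleneck, the dictionary tweak mentioned above might look roughly like this (a sketch only; the `Entry` type and the file naming are assumptions made up for the example):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class Entry
{
    public DateTime Stamp;
    public string Topic;
    public string Message;
}

class QueueWriter
{
    // Cache of open writers so each topic's file is opened only once.
    private readonly Dictionary<string, StreamWriter> _writers =
        new Dictionary<string, StreamWriter>();

    // Called by the writer thread for each dequeued object.
    public void Write(Entry e)
    {
        StreamWriter w;
        if (!_writers.TryGetValue(e.Topic, out w))
        {
            w = new StreamWriter(e.Topic + ".log", true);   // append mode
            _writers[e.Topic] = w;
        }
        w.WriteLine("{0:o}\t{1}", e.Stamp, e.Message);
    }

    public void CloseAll()
    {
        foreach (var w in _writers.Values)
            w.Dispose();
    }
}
```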

But on the other hand, a DB would be so nice for this problem... One table, four columns (id, timestamp, topic, message), one additional index on topic, and you're ready to go.

#4


I'd like to explore a bit more why you don't want to use a DB - they're GREAT at things like this! But on to your options...

  1. 1000 open file handles doesn't sound good. Forget disk fragmentation - O/S resources will suck.

  2. This is close to db-ish-ness! Also sounds like more trouble than it's worth.

  3. RAM = volatile. You spend all day accumulating data and have a power outage at 5pm.

How would I approach this? DB! Because then I can query, index, analyze, etc. etc.

:)

#5


First calculate the bandwidth! 5000 messages/sec at 2 KB each = 10 MB/sec. Each minute - 600 MB. Well, you could drop that in RAM. Then flush each hour.

Edit: corrected mistake. Sorry, my bad.

#6


I would agree with Kyle and go with a packaged product like PI. Be aware that PI is quite expensive.

If you're looking for a custom solution, I'd go with Stephen's approach with some modifications. Have one server receive the messages and drop them into a queue. You can't use a file to hand off the messages to the other process, though, because you're going to have locking issues constantly. You could probably use something like MSMQ (MS Message Queuing), but I'm not sure about its speed.

I would also recommend using a DB to store your data. You'll want to do bulk inserts of data into the DB, though, as I think you would need some hefty hardware to allow SQL Server to do 5000 transactions a second. You're better off doing a bulk insert every, say, 10000 messages that accumulate in the queue.

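A sketch of what that batching could look like with SqlBulkCopy (the `Msg` type, the `Messages` table, and its column names are assumptions made up for the example):

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

class Msg
{
    public DateTime Stamp;
    public string Topic;
    public string Text;
}

static class BulkInserter
{
    // Flush an accumulated batch (e.g. every ~10,000 queued messages) in one round trip.
    public static void FlushBatch(string connectionString, List<Msg> batch)
    {
        var table = new DataTable();
        table.Columns.Add("Stamp", typeof(DateTime));
        table.Columns.Add("Topic", typeof(string));
        table.Columns.Add("Message", typeof(string));
        foreach (var m in batch)
            table.Rows.Add(m.Stamp, m.Topic, m.Text);

        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "Messages";   // assumed table name
            bulk.WriteToServer(table);
        }
    }
}
```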

DATA SIZES:

Average message ~50 bytes -> small datetime = 4 bytes + topic (~10 characters, non-Unicode) = 10 bytes + message (~31 characters, non-Unicode) = 31 bytes.

50 * 5000 = 244 KB/sec -> 14 MB/min -> 858 MB/hour

#7


Perhaps you don't want the overhead of a DB install?

In that case, you could try a filesystem-based database like SQLite:

SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed SQL database engine in the world. The source code for SQLite is in the public domain.

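For illustration, batched inserts through the System.Data.SQLite provider might look like this (the provider choice, the `messages` table, and the batch shape are assumptions; the key point is wrapping each batch in one transaction, since one transaction per row would be far too slow):

```csharp
using System;
using System.Collections.Generic;
using System.Data.SQLite;

class SqliteLogger : IDisposable
{
    private readonly SQLiteConnection _conn;

    public SqliteLogger(string path)
    {
        _conn = new SQLiteConnection("Data Source=" + path);
        _conn.Open();
        using (var cmd = new SQLiteCommand(
            "CREATE TABLE IF NOT EXISTS messages (stamp TEXT, topic TEXT, message TEXT)", _conn))
        {
            cmd.ExecuteNonQuery();
        }
    }

    // Insert a whole batch of (timestamp, topic, message) rows inside a single transaction.
    public void InsertBatch(IEnumerable<Tuple<DateTime, string, string>> batch)
    {
        using (var tx = _conn.BeginTransaction())
        using (var cmd = new SQLiteCommand(
            "INSERT INTO messages (stamp, topic, message) VALUES (@s, @t, @m)", _conn, tx))
        {
            foreach (var row in batch)
            {
                cmd.Parameters.Clear();
                cmd.Parameters.AddWithValue("@s", row.Item1.ToString("o"));
                cmd.Parameters.AddWithValue("@t", row.Item2);
                cmd.Parameters.AddWithValue("@m", row.Item3);
                cmd.ExecuteNonQuery();
            }
            tx.Commit();
        }
    }

    public void Dispose()
    {
        _conn.Dispose();
    }
}
```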

#8


I would make 2 separate programs: one to take the incoming requests, format them, and write them out to one single file, and another to read from that file and write the requests out to the individual files. Doing things this way allows you to minimize the number of file handles open while still handling the incoming requests in real time. If you make the first program format its output correctly, then processing it into the individual files should be simple.

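A rough sketch of the two programs (the tab-separated line format and the file names are assumptions; the splitter uses a dictionary of writers so each topic's file is only opened once):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class SingleFileLog
{
    // Program 1: append one tab-separated line per incoming message.
    public static void Append(StreamWriter combinedLog, string topic, string message)
    {
        // Assumes tabs never occur inside topic or message.
        combinedLog.WriteLine("{0:o}\t{1}\t{2}", DateTime.UtcNow, topic, message);
    }

    // Program 2: read the combined log and fan the lines out into per-topic files.
    public static void Split(string combinedPath)
    {
        var writers = new Dictionary<string, StreamWriter>();
        foreach (string line in File.ReadLines(combinedPath))
        {
            string[] parts = line.Split(new[] { '\t' }, 3);   // stamp, topic, message
            if (parts.Length != 3)
                continue;                                     // skip malformed lines

            StreamWriter w;
            if (!writers.TryGetValue(parts[1], out w))
            {
                w = new StreamWriter(parts[1] + ".log", true);
                writers[parts[1]] = w;
            }
            w.WriteLine(line);
        }
        foreach (var w in writers.Values)
            w.Dispose();
    }
}
```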

#9


I'd keep a buffer of the incoming messages, and periodically write the 1000 files sequentially on a separate thread.

#10


I would look into purchasing a real-time data historian package, something like a PI System or Wonderware Data Historian. I have tried to do things like this in files and an MS SQL database before and it didn't turn out well (it was a customer requirement and I wouldn't suggest it). These products have APIs, and they even have packages where you can make queries against the data just as if it were SQL.

It wouldn't allow me to post hyperlinks, so just Google those 2 products and you will find information on them.

EDIT

If you do use a database, as most people are suggesting, I would recommend a table for each topic for the historical data, and consider table partitioning, indexes, and how long you are going to store the data.

For example, if you are going to store a day's worth of data with one table for each topic, you are looking at 5 updates a second x 60 seconds in a minute x 60 minutes in an hour x 24 hours = 432,000 records per table per day. After exporting the data, I would imagine that you would have to clear the data for the next day, which will cause a lock, so you will have to queue your writes to the database. Then, if you are going to rebuild the index so that you can do any querying on it, that will cause a schema modification lock (and you need MS SQL Enterprise Edition for online index rebuilding). If you don't clear the data every day, you will have to make sure you have plenty of disk space to throw at it.

Basically, what I'm saying is: weigh the cost of purchasing a reliable product against the cost of building your own.

#11


If you don't want to use a database (and I would, but let's assume you don't), I'd write the records to a single file -- append operations are as fast as they can be -- and use a separate process/service to split the file up into the 1000 files. You could even roll the file over every X minutes, so that, for example, every 15 minutes you start a new file and the other process starts splitting the previous one up into the 1000 separate files.

All this does beg the question of why not a DB, and why you need 1000 different files - you may have a very good reason - but then again, perhaps you should re-think your strategy and make sure your reasoning is sound before you go too far down this path.
