Optimal storage of a data structure for fast lookup and persistence

Date: 2021-07-23 05:00:16

Scenario

I have the following methods:

    public void AddItemSecurity(int itemId, int[] userIds)
    public int[] GetValidItemIds(int userId)

Initially I'm thinking of storage of the form:

itemId -> userId, userId, userId

and

userId -> itemId, itemId, itemId
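
In memory this maps naturally onto two dictionaries kept in sync, one per direction. A minimal sketch of the idea (class and member names here are illustrative, not from the post):

    using System.Collections.Generic;

    class SecurityIndex
    {
        // Forward and reverse maps; every (itemId, userId) pair lives in both,
        // so lookups are O(1) in either direction.
        readonly Dictionary<int, HashSet<int>> itemToUsers = new Dictionary<int, HashSet<int>>();
        readonly Dictionary<int, HashSet<int>> userToItems = new Dictionary<int, HashSet<int>>();

        public void Add(int itemId, int userId)
        {
            HashSet<int> users, items;
            if (!itemToUsers.TryGetValue(itemId, out users))
                itemToUsers[itemId] = users = new HashSet<int>();
            if (!userToItems.TryGetValue(userId, out items))
                userToItems[userId] = items = new HashSet<int>();
            users.Add(userId);
            items.Add(itemId);
        }
    }

The catch, of course, is that with the numbers below the userId -> itemId side is likely too large to hold as plain hash sets in memory, which is what the rest of the question is about.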

AddItemSecurity is based on how I get data from a third-party API; GetValidItemIds is how I want to use it at runtime.

There are potentially 2000 users and 10 million items. Item ids are of the form 2007123456, 2010001234 (10 digits, where the first four represent the year).

AddItemSecurity does not have to perform super fast, but GetValidItemIds needs to be subsecond. Also, if there is an update on an existing itemId, I need to remove that itemId for users who are no longer in the list.

I'm trying to think about how I should store this in an optimal fashion. Preferably on disk (with caching), but I want the code maintainable and clean.

If the item ids had started at 0, I thought about creating a byte array of length MaxItemId / 8 for each user, and setting a bit depending on whether the item is present or not. That would limit the array length to a little over 1 MB per user and give fast lookups, as well as an easy way to update the list per user. By persisting this as memory-mapped files with the .NET 4 framework, I think I would get decent caching as well (if the machine has enough RAM) without implementing caching logic myself. Parsing the id, stripping out the year, and storing an array per year could be a solution.
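
A hedged sketch of that idea, splitting the 10-digit id into its year prefix and six-digit serial and keeping one memory-mapped bitmap per user per year (all names are assumptions, and a real implementation would cache the open maps rather than reopening them on every call):

    using System.IO;
    using System.IO.MemoryMappedFiles;

    static class UserBitmapStore
    {
        const int IdsPerYear = 1000000;          // a 10-digit id is yyyy + a 6-digit serial
        const int BytesPerYear = IdsPerYear / 8; // 125,000 bytes per user per year

        static string FileFor(int userId, int year)
        {
            return Path.Combine("acl", userId + "-" + year + ".bin");
        }

        public static void SetItem(int userId, int itemId, bool granted)
        {
            int year = itemId / IdsPerYear;      // 2007123456 -> 2007
            int serial = itemId % IdsPerYear;    // 2007123456 -> 123456
            Directory.CreateDirectory("acl");
            using (var mmf = MemoryMappedFile.CreateFromFile(
                FileFor(userId, year), FileMode.OpenOrCreate, null, BytesPerYear))
            using (var view = mmf.CreateViewAccessor())
            {
                byte b = view.ReadByte(serial / 8);
                byte mask = (byte)(1 << (serial % 8));
                view.Write(serial / 8, granted ? (byte)(b | mask) : (byte)(b & ~mask));
            }
        }

        public static bool HasItem(int userId, int itemId)
        {
            int year = itemId / IdsPerYear;
            int serial = itemId % IdsPerYear;
            if (!File.Exists(FileFor(userId, year)))
                return false;
            using (var mmf = MemoryMappedFile.CreateFromFile(
                FileFor(userId, year), FileMode.Open, null, BytesPerYear))
            using (var view = mmf.CreateViewAccessor())
            {
                return (view.ReadByte(serial / 8) & (1 << (serial % 8))) != 0;
            }
        }
    }

Each (user, year) file is a fixed 125,000 bytes, so a handful of years stays within the "little over 1 MB per user" estimate above.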

The ItemId -> UserId[] list can be serialized directly to disk and read/written with a normal FileStream in order to persist the list and diff it when there are changes.
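
One possible shape of that side (a sketch only; the file layout is an assumption, and UserBitmapStore refers to the hypothetical bitmap class sketched above):

    using System.IO;
    using System.Linq;

    static class ItemAclStore
    {
        static string FileFor(int itemId)
        {
            return Path.Combine("items", itemId + ".bin");
        }

        // Reads the previously persisted user list for an item (empty if none).
        public static int[] Load(int itemId)
        {
            string path = FileFor(itemId);
            if (!File.Exists(path))
                return new int[0];
            using (var r = new BinaryReader(File.OpenRead(path)))
            {
                var ids = new int[r.ReadInt32()];
                for (int i = 0; i < ids.Length; i++)
                    ids[i] = r.ReadInt32();
                return ids;
            }
        }

        public static void AddItemSecurity(int itemId, int[] userIds)
        {
            // Diff against the old list: users no longer present lose the item.
            foreach (int removed in Load(itemId).Except(userIds))
                UserBitmapStore.SetItem(removed, itemId, false);
            foreach (int granted in userIds)
                UserBitmapStore.SetItem(granted, itemId, true);

            Directory.CreateDirectory("items");
            using (var w = new BinaryWriter(File.Create(FileFor(itemId))))
            {
                w.Write(userIds.Length);
                foreach (int id in userIds)
                    w.Write(id);
            }
        }
    }

Persisting the full list and diffing on update keeps AddItemSecurity simple, which fits the requirement that it doesn't have to be fast.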

Each time a new user is added, all the lists have to be updated as well, but this can be done nightly.

Question

Should I continue to try out this approach, or are there other paths which should be explored as well? I'm thinking SQL Server will not perform fast enough, and it would add overhead (at least if it's hosted on a different server), but my assumptions might be wrong. Any thoughts or insights on the matter are appreciated. And I want to try to solve it without adding too much hardware :)

[Update 2010-03-31]

I have now tested with SQL Server 2008 under the following conditions.

  • Table with two columns (userid, itemid), both int
  • Clustered index on the two columns
  • Added ~800,000 items for 180 users - a total of 144 million rows
  • Allocated 4 GB RAM for SQL Server
  • Dual-core 2.66 GHz laptop
  • SSD disk
  • Use a SqlDataReader to read all itemids into a List, roughly as in the sketch below
  • Loop over all users
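
The post doesn't show the actual test code; roughly, the read described in the last two bullets might look like the following (the table name ItemSecurity is an assumption, only the columns come from the bullets above):

    using System.Collections.Generic;
    using System.Data.SqlClient;

    static class SqlAclTest
    {
        public static List<int> GetValidItemIds(string connectionString, int userId)
        {
            var ids = new List<int>();
            using (var cn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(
                "SELECT itemid FROM ItemSecurity WHERE userid = @userid", cn))
            {
                cmd.Parameters.AddWithValue("@userid", userId);
                cn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    // Stream all item ids for the user into a list.
                    while (reader.Read())
                        ids.Add(reader.GetInt32(0));
                }
            }
            return ids;
        }
    }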

If I run one thread it averages 0.2 seconds. When I add a second thread it goes up to 0.4 seconds, which is still OK. From there on the results degrade: adding a third thread brings a lot of the queries up to 2 seconds, a fourth up to 4 seconds, and a fifth spikes some of the queries up to 50 seconds.

The CPU is pegged while this is going on, even on one thread. My test app consumes some of it due to the tight loop, and SQL Server the rest.

Which leads me to the conclusion that it won't scale very well, at least not on the hardware I tested. Are there ways to optimize the database, say by storing an array of ints per user instead of one row per item? But that makes it harder to remove items.

[Update 2010-03-31 #2]

I did a quick test with the same data, storing it as bits in memory-mapped files. It performs much better: six threads yield access times between 0.02 s and 0.06 s, purely memory bound. The mapped files were mapped by one process and accessed by six others simultaneously. And where the SQL database took 4 GB, the files on disk took 23 MB.

3 Answers

#1

After much testing I ended up using Memory Mapped Files, marking them with the sparse bit (NTFS), using code from NTFS Sparse Files with C#.

Wikipedia has an explanation of what a sparse file is.

The benefit of using a sparse file is that I don't have to care about what range my ids are in. If I only write ids between 2006000000 and 2010999999, the file will only allocate 625,000 bytes from offset 250,750,000 in the file. All space up to that offset is unallocated in the file system. Each id is stored as a set bit in the file, so it's essentially treated as a bit array. And if the id sequence suddenly changes, allocation will simply happen in another part of the file.

In order to retrieve which ids are set, I can perform an OS call to get the allocated parts of the sparse file, and then check each bit in those sequences. Also, checking whether a particular id is set is very fast: if it falls outside the allocated blocks, it's not there; if it falls within, it's merely one byte read and a bit-mask check to see if the correct bit is set.
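
A hedged sketch of that OS call on Windows, querying the allocated ranges with FSCTL_QUERY_ALLOCATED_RANGES via P/Invoke (error handling, and the case of more ranges than the buffer holds, are omitted):

    using System;
    using System.IO;
    using System.Runtime.InteropServices;
    using Microsoft.Win32.SafeHandles;

    static class SparseRanges
    {
        [StructLayout(LayoutKind.Sequential)]
        public struct FILE_ALLOCATED_RANGE_BUFFER
        {
            public long FileOffset;
            public long Length;
        }

        const uint FSCTL_QUERY_ALLOCATED_RANGES = 0x000940CF;

        [DllImport("kernel32.dll", SetLastError = true)]
        static extern bool DeviceIoControl(
            SafeFileHandle hDevice, uint dwIoControlCode,
            ref FILE_ALLOCATED_RANGE_BUFFER inBuffer, int inBufferSize,
            [Out] FILE_ALLOCATED_RANGE_BUFFER[] outBuffer, int outBufferSize,
            out int bytesReturned, IntPtr overlapped);

        // Returns the (offset, length) pairs actually backed by disk, so only
        // those regions of the bitmap need their bits scanned.
        public static FILE_ALLOCATED_RANGE_BUFFER[] GetAllocatedRanges(FileStream fs)
        {
            var query = new FILE_ALLOCATED_RANGE_BUFFER { FileOffset = 0, Length = fs.Length };
            var ranges = new FILE_ALLOCATED_RANGE_BUFFER[64];
            int size = Marshal.SizeOf(typeof(FILE_ALLOCATED_RANGE_BUFFER));
            int bytesReturned;
            DeviceIoControl(fs.SafeFileHandle, FSCTL_QUERY_ALLOCATED_RANGES,
                ref query, size, ranges, ranges.Length * size,
                out bytesReturned, IntPtr.Zero);
            var result = new FILE_ALLOCATED_RANGE_BUFFER[bytesReturned / size];
            Array.Copy(ranges, result, result.Length);
            return result;
        }
    }

The single-id check described above is then just a one-byte read at id / 8 and a mask test against 1 << (id % 8).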

So for the particular scenario where you have many ids which you want to check with as much speed as possible, this is the most efficient way I've found so far.

And the good part is that the memory-mapped files can be shared with Java as well (which turned out to be needed). Java also has support for memory-mapped files on Windows, and implementing the read/write logic is fairly trivial.

#2

I really think you should try a nice database before you make your decision. Something like this will be a challenge to maintain in the long run. Your user-base is actually quite small. SQL Server should be able to handle what you need without any problems.

#3

2000 users isn't too bad, but with 10 million related items you really should consider putting this into a database. DBs do all the storage, persistence, indexing, caching etc. that you need, and they perform very well.

They also allow for better scalability in the future. If you suddenly need to deal with two million users and billions of settings, having a good db in place will make scaling a non-issue.
