Random access to a large binary file

Time: 2022-11-20 20:20:23

I have a large binary file (12 GB) from which I want to assemble a smaller binary file (16 KB) on the fly. Assume the file is on disk, and that the bytes for the smaller file are somewhat randomly distributed in the large binary file. What's the best and fastest way to do this? So far I've not been able to do better than about three minutes.

Things I've tried, which have more or less the same performance:

  1. Converting the file to the HDF5 format and using the C interface (slow).

  2. Writing a small C program to fseek() through the file (slow).

How can I randomly access this data really fast?

I want to get to less than a couple of seconds for the query.

7 Answers

#1 (11 votes)

The answer is basically "no".

A single mechanical disk drive is going to take 10 ms or so to perform a seek, because it has to move the disk head. 16000 seeks times 10 milliseconds per seek equals 160 seconds. It makes absolutely no difference how you write your code; e.g. mmap() is going to make no difference.

Welcome to the physical world, software person :-). You must improve the locality of your operations.

First, sort the locations you are accessing. Nearby locations in the file are likely to be nearby on disk, and seeking between nearby locations is faster than seeking randomly.

Next, your disk can probably read sequential data at around 100 megabytes/second; that is, it can read 1 megabyte sequentially in around the same time it takes to perform a seek. So if two of your values are less than 1 megabyte apart, you are better off reading all of the data between them than performing the seek between them. (But benchmark this to find the optimal trade-off on your hardware.)

Finally, a RAID can help with throughput (but not seek time). It can also provide multiple disk heads that can seek concurrently if you want to multi-thread your read code.

But in general, accessing random data is about the worst thing you can ask your computer to do, whether in memory or on disk. And the relative difference between sequential access and random access increases every year because physics is local. (Well, the physics we depend on here, anyway.)

[edit]

@JeremyP's suggestion to use SSDs is a good one. If they are an option, they have an effective seek time of 0.1 ms or so. Meaning you could expect your code to run 50-100 times faster on such hardware. (I did not think of this because I usually work with files in the 1 TB range where SSDs would be too expensive.)

[edit 2]

As @FrankH mentions in a comment, some of my suggestions assume that the file is contiguous on disk, which of course is not guaranteed. You can help to improve this by using a good file system (e.g. XFS) and by giving "hints" at file creation time (e.g. use posix_fallocate to inform the kernel you intend to populate a large file).

#2 (4 votes)

Well, the speed you can achieve for this largely depends on the total number of read operations you perform in order to extract the 96 kB which make up the payload for your new file.

Why is that so? Because random reads from (spinning) disks are seek-limited; the read as such is (almost) infinitely fast compared to the time it takes to re-position the magnetic heads.

Since you're saying the access pattern is random, you're also not likely to benefit from any readahead that the operating system may decide to use; if you so choose, you can switch that off via posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM) on the file descriptor for the big file, or via madvise() if you've chosen to mmap() it. But that will only gain you anything if you're performing big reads (and you know a big readahead would be nonsense). For small reads, it's exclusively the seek time that determines the total.

Assuming you need N random reads and you've got an M msec seek time, it'll take at least N × M milliseconds to perform the data extraction (if you've got the disk to yourself ...). There is no way to break this barrier.

Edit: A few things on mitigating strategies:

As several people have mentioned, the key to approaching this problem is to minimize seeks. There are several strategies for this:

  1. Issue asynchronous reads if you can (i.e. if read operation N+1 doesn't depend on what read operation N did, then you can issue both concurrently). This allows the operating system / device driver to queue them up and possibly re-order them (or merge them with reads done by other concurrently running processes) for the most efficient seeking.

  2. If you know all the positions in advance, then perform scatter-gather I/O (the UN*X preadv() comes to mind), to the same effect.

  3. Query your filesystem and/or block device for the best / minimum blocksize; how to do this is system-dependent, see e.g. statvfs() or even ioctl_list. If you know that, you might use the technique mentioned by Nemo (merge two small reads within the "optimal" block size into a single large read, needing no seek).

  4. Possibly even use query interfaces like FIEMAP / FIBMAP (the Windows equivalent would roughly be FSCTL_GET_RETRIEVAL_POINTERS) to determine where the physical blocks for your file data are, and decide on read merging based on that (there's no point issuing a large "nonseeking" read if it actually crosses a physical block boundary and the filesystem turns it into two).

  5. If you build up the positions to read from over a comparatively long time, then reading (asynchronously) while you are still computing future read offsets will also help to hide seek latency, as you're putting compute cycles / wait time to good use.

In general, if none of the above applies, you'll have to bite the bullet and accept the seek latency. Buy a solid state disk and/or use a RAM-backed filesystem if you can justify the costs (and/or the volatility of RAM).

#3 (1 vote)

Have you tried mmap()ing the file? (In your case, mmap64.) This will lazily read the data from disk as you access it.

If you have to seek through the entire file to find the data you're looking for, you'll be able to speed it up with an SSD, but it's always going to be slow. Are the locations of the data you're looking for known ahead of time?

Is the file a text file, or a binary file?

#4 (1 vote)

If you have to read the whole file and you are using a mechanical hard disk, you are screwed. Assume the transfer rate is about 1 gigabit/second; that means you physically can't get all the bits across the bus in less than 12 × 8 = 96 seconds. That assumes there is no seek time and that the processor can deal with the data as it comes in.

Since the transfer rate is limited by the speed of the drive as much as anything, even if you know exactly where every byte of data you want to read is, if those bytes are spread out randomly across the file it'll still take about as long, because you have to wait for the disk to rotate until the next byte you want is under the head.

If you have an SSD you can probably improve on this dramatically, since there's no waiting for the bytes to come round under the head...

#5 (0 votes)

Some hints to speed up reading the file a little (besides what has already been said):

  1. Read chunks that are a multiple of the block size.

  2. On POSIX-compliant systems, use posix_fadvise() to advise the OS about your access pattern.

#6 (0 votes)

I suppose it depends on how many seeks you need to do. 16 thousand, or a smaller number? Can you store the 12 GB file on a solid state drive? That would cut down on the seek latencies.

Can you break up the file and store the pieces on separate hard drives? That would enable asynchronous seeking in parallel.

#7 (0 votes)

Use parallel or asynchronous reads. Issue them from multiple threads, processes, etc. as necessary, or use preadv, just like FrankH said.

This means that you won't have to wait for one I/O request to complete before the next one can come along, which is going to improve performance if you have a clever RAID controller and lots of spindles.

On the other hand, if you have a really stupid I/O subsystem, it may make only a minor difference. Consider which I/O scheduler to use (you can change them on the fly, without a reboot, which is really cool). Anecdotal evidence suggests "noop" is best if you have "smart" hardware, and cfq or deadline if you have stupid hardware.
