C在二进制文件的中间写入而不覆盖任何现有内容

时间:2021-09-07 20:27:10

Today's problem is that I need to write an array of numbers in a binary file at a starting position. I have the position where it should start, and I don't want to overwrite values after that, just want to insert the array at the starting position in the file. E.g:

今天的问题是我需要在起始位置的二进制文件中编写一个数字数组。我有它应该开始的位置,我不想在此之后覆盖值,只是想将数组插入文件的起始位置。例如:

12345

Let's push 456 at position 2:

让我们在第2位推456:

12456345

I know that probably I'll have to implement it by myself, but I want to know what's your opinion on how to implement that as efficiently as possible.

我知道可能我必须自己实现它,但我想知道你对如何尽可能有效地实现它有什么看法。

4 个解决方案

#1


9  

Here's a function extend_file_and_insert() that does the job, more or less.

这是一个函数extend_file_and_insert(),它或多或少地完成了这项工作。

#include <sys/stat.h>
#include <unistd.h>

enum { BUFFERSIZE = 64 * 1024 };

#define MIN(x, y) (((x) < (y)) ? (x) : (y))

/*
off_t   is signed
ssize_t is signed
size_t  is unsigned

off_t   for lseek() offset and return
size_t  for read()/write() length
ssize_t for read()/write() return
off_t   for st_size
*/

static int extend_file_and_insert(int fd, off_t offset, char const *insert, size_t inslen)
{
    char buffer[BUFFERSIZE];
    struct stat sb;
    int rc = -1;

    if (fstat(fd, &sb) == 0)
    {
        if (sb.st_size > offset)
        {
            /* Move data after offset up by inslen bytes */
            size_t bytes_to_move = sb.st_size - offset;
            off_t read_end_offset = sb.st_size; 
            while (bytes_to_move != 0)
            {
                ssize_t bytes_this_time = MIN(BUFFERSIZE, bytes_to_move);
                ssize_t rd_off = read_end_offset - bytes_this_time;
                ssize_t wr_off = rd_off + inslen;
                lseek(fd, rd_off, SEEK_SET);
                if (read(fd, buffer, bytes_this_time) != bytes_this_time)
                    return -1;
                lseek(fd, wr_off, SEEK_SET);
                if (write(fd, buffer, bytes_this_time) != bytes_this_time)
                    return -1;
                bytes_to_move -= bytes_this_time;
                read_end_offset -= bytes_this_time; /* Added 2013-07-19 */
            }   
        }   
        lseek(fd, offset, SEEK_SET);
        write(fd, insert, inslen);
        rc = 0;
    }   
    return rc;
}

(Note the additional line added 2013-07-19; it was a bug that only shows when the buffer size is smaller than the amount of data to be copied up the file. Thanks to malat for pointing out the error. Code now tested with BUFFERSIZE = 4.)

(注意添加的新行2013-07-19;这是一个只显示缓冲区大小小于要复制文件的数据量的错误。感谢malat指出错误。代码现在测试了BUFFERSIZE = 4.)

This is some small-scale test code:

这是一些小规模的测试代码:

#include <fcntl.h>
#include <string.h>

static const char base_data[] = "12345";
typedef struct Data
{
    off_t       posn;
    const char *data;
} Data;
static const Data insert[] =
{
    {  2, "456"                       },
    {  4, "XxxxxxX"                   },
    { 12, "ZzzzzzzzzzzzzzzzzzzzzzzzX" },
    { 22, "YyyyyyyyyyyyyyyY"          },
};  
enum { NUM_INSERT = sizeof(insert) / sizeof(insert[0]) };

int main(void)
{
    int fd = open("test.dat", O_RDWR | O_TRUNC | O_CREAT, 0644);
    if (fd > 0)
    {
        ssize_t base_len = sizeof(base_data) - 1;
        if (write(fd, base_data, base_len) == base_len)
        {
            for (int i = 0; i < NUM_INSERT; i++)
            {
                off_t length = strlen(insert[i].data);
                if (extend_file_and_insert(fd, insert[i].posn, insert[i].data, length) != 0)
                    break;
                lseek(fd, 0, SEEK_SET);
                char buffer[BUFFERSIZE];
                ssize_t nbytes;
                while ((nbytes = read(fd, buffer, sizeof(buffer))) > 0)
                    write(1, buffer, nbytes);
                write(1, "\n", 1);
            }
        }
        close(fd);
    }
    return(0);
}

It produces the output:

它产生输出:

12456345
1245XxxxxxX6345
1245XxxxxxX6ZzzzzzzzzzzzzzzzzzzzzzzzZ345
1245XxxxxxX6ZzzzzzzzzzYyyyyyyyyyyyyyyYzzzzzzzzzzzzzzZ345

It should be tested on some larger files (ones bigger than BUFFERSIZE, but it would be sensible to test with a BUFFERSIZE a lot smaller than 64 KiB; I used 32 bytes and it seemed to be OK). I've only eyeballed the results but the patterns are designed to make it easy to see that they are correct. The code does not check any of the lseek() calls; that's a minor risk.

它应该在一些较大的文件上测试(比BUFFERSIZE更大的文件,但是使用比64 KiB小得多的BUFFERSIZE进行测试是明智的;我使用了32个字节并且似乎没问题)。我只关注结果,但模式的设计使其很容易看出它们是正确的。代码不会检查任何lseek()调用;这是一个小风险。

#2


5  

First, use ftruncate() to enlarge the file to the final size. Then copy everything from the old end over to the new end, working your way back to the insertion point. Then overwrite the middle contents with the data you want to insert. This is as efficient as it gets, I think, because filesystems don't generally offer true "insertion" in the middle of files.

首先,使用ftruncate()将文件放大到最终大小。然后复制从旧端到新端的所有内容,返回插入点。然后用要插入的数据覆盖中间内容。我认为这是有效的,因为文件系统通常不会在文件中间提供真正的“插入”。

#3


0  

I'm going to interpret your question broadly to be "how can I implement efficiently a persistent store of an object that supports random-access lookup by index and insertion with expansion." As noted, you could use a simple linear array in the file, but this would only be efficient for lookup (O(1)) and quite inefficient for insertion (O(n)). You could achieve O(log n) for both lookup and insertion by using a tree data structure instead. Maintain one file which acts as an index, and another which acts as the data store and is a series of chunks. Each chunk can be partially full. The index file contains a tree (binary tree or B-tree) where each node corresponds to some contiguous chunk of the array and contains the size of that chunk (so that the root node contains the size of the whole array). For a binary tree, the left and right child nodes contain the size of the left and right halves (approximately) of the array. Finally leaf nodes contain a pointer to a chunk in the data store file that contains the actual data. Insertion now involves changing the 'size' property of 'k' nodes, where 'k' is the height of the tree. When a data store chunk gets too full, split it (allocate a new one by growing the file, or, if you support deletion as well, perhaps from a free list of empty chunks) and rebalance the tree (lots of standard ways of doing this.)

我将广泛地解释你的问题是“如何有效地实现一个对象的持久存储,该对象支持通过索引进行随机访问查找,并通过扩展进行插入。”如上所述,您可以在文件中使用简单的线性数组,但这只能用于查找(O(1)),并且插入效率非常低(O(n))。您可以使用树数据结构来实现查找和插入的O(log n)。维护一个充当索引的文件,另一个充当数据存储并且是一系列块。每个块可以部分填满。索引文件包含一个树(二叉树或B树),其中每个节点对应于数组的某个连续块,并包含该块的大小(以便根节点包含整个数组的大小)。对于二叉树,左子节点和右子节点包含数组左右两半(大约)的大小。最后,叶节点包含指向数据存储文件中包含实际数据的块的指针。插入现在涉及更改'k'节点的'size'属性,其中'k'是树的高度。当数据存储块太满时,将其拆分(通过增长文件来分配新的,或者,如果您也支持删除,可能来自空闲块的空闲列表)并重新平衡树(许多标准方法)这个。)

Does this sound complicated? Definitely! Efficient mid-file insertion is more complicated to achieve than is appending.

这听起来很复杂吗?非也!高效的中间文件插入比附加更复杂。

#4


0  

I'm agreeing with the others, but let me state the solution a bit differently:

我同意其他人的观点,但让我说一下解决方案有点不同:

  1. Get a temp filename (there are OS-specific calls for this)

    获取临时文件名(有特定于操作系统的调用)

  2. Copy your original file to the temp file (there are now two copies of the same file)

    将原始文件复制到临时文件(现在有两个相同文件的副本)

  3. Open the original file for "append".

    打开原始文件“append”。

  4. "Truncate" it to your insertion point

    “截断”它到你的插入点

  5. Write your new data

    写下你的新数据

  6. Open your temp file for "read"

    打开临时文件以“读取”

  7. "Seek" to the insertion point (again, the call is OS-specific)

    “寻找”到插入点(同样,调用是特定于操作系统的)

  8. Read to end-of-file in temp file; inserting into your original file (still open for "append").

    读取临时文件中的文件结尾;插入原始文件(仍然打开“追加”)。

  9. Close both files

    关闭这两个文件

  10. Delete temp file

    删除临时文件

#1


9  

Here's a function extend_file_and_insert() that does the job, more or less.

这是一个函数extend_file_and_insert(),它或多或少地完成了这项工作。

#include <sys/stat.h>
#include <unistd.h>

enum { BUFFERSIZE = 64 * 1024 };

#define MIN(x, y) (((x) < (y)) ? (x) : (y))

/*
off_t   is signed
ssize_t is signed
size_t  is unsigned

off_t   for lseek() offset and return
size_t  for read()/write() length
ssize_t for read()/write() return
off_t   for st_size
*/

static int extend_file_and_insert(int fd, off_t offset, char const *insert, size_t inslen)
{
    char buffer[BUFFERSIZE];
    struct stat sb;
    int rc = -1;

    if (fstat(fd, &sb) == 0)
    {
        if (sb.st_size > offset)
        {
            /* Move data after offset up by inslen bytes */
            size_t bytes_to_move = sb.st_size - offset;
            off_t read_end_offset = sb.st_size; 
            while (bytes_to_move != 0)
            {
                ssize_t bytes_this_time = MIN(BUFFERSIZE, bytes_to_move);
                ssize_t rd_off = read_end_offset - bytes_this_time;
                ssize_t wr_off = rd_off + inslen;
                lseek(fd, rd_off, SEEK_SET);
                if (read(fd, buffer, bytes_this_time) != bytes_this_time)
                    return -1;
                lseek(fd, wr_off, SEEK_SET);
                if (write(fd, buffer, bytes_this_time) != bytes_this_time)
                    return -1;
                bytes_to_move -= bytes_this_time;
                read_end_offset -= bytes_this_time; /* Added 2013-07-19 */
            }   
        }   
        lseek(fd, offset, SEEK_SET);
        write(fd, insert, inslen);
        rc = 0;
    }   
    return rc;
}

(Note the additional line added 2013-07-19; it was a bug that only shows when the buffer size is smaller than the amount of data to be copied up the file. Thanks to malat for pointing out the error. Code now tested with BUFFERSIZE = 4.)

(注意添加的新行2013-07-19;这是一个只显示缓冲区大小小于要复制文件的数据量的错误。感谢malat指出错误。代码现在测试了BUFFERSIZE = 4.)

This is some small-scale test code:

这是一些小规模的测试代码:

#include <fcntl.h>
#include <string.h>

static const char base_data[] = "12345";
typedef struct Data
{
    off_t       posn;
    const char *data;
} Data;
static const Data insert[] =
{
    {  2, "456"                       },
    {  4, "XxxxxxX"                   },
    { 12, "ZzzzzzzzzzzzzzzzzzzzzzzzX" },
    { 22, "YyyyyyyyyyyyyyyY"          },
};  
enum { NUM_INSERT = sizeof(insert) / sizeof(insert[0]) };

int main(void)
{
    int fd = open("test.dat", O_RDWR | O_TRUNC | O_CREAT, 0644);
    if (fd > 0)
    {
        ssize_t base_len = sizeof(base_data) - 1;
        if (write(fd, base_data, base_len) == base_len)
        {
            for (int i = 0; i < NUM_INSERT; i++)
            {
                off_t length = strlen(insert[i].data);
                if (extend_file_and_insert(fd, insert[i].posn, insert[i].data, length) != 0)
                    break;
                lseek(fd, 0, SEEK_SET);
                char buffer[BUFFERSIZE];
                ssize_t nbytes;
                while ((nbytes = read(fd, buffer, sizeof(buffer))) > 0)
                    write(1, buffer, nbytes);
                write(1, "\n", 1);
            }
        }
        close(fd);
    }
    return(0);
}

It produces the output:

它产生输出:

12456345
1245XxxxxxX6345
1245XxxxxxX6ZzzzzzzzzzzzzzzzzzzzzzzzZ345
1245XxxxxxX6ZzzzzzzzzzYyyyyyyyyyyyyyyYzzzzzzzzzzzzzzZ345

It should be tested on some larger files (ones bigger than BUFFERSIZE, but it would be sensible to test with a BUFFERSIZE a lot smaller than 64 KiB; I used 32 bytes and it seemed to be OK). I've only eyeballed the results but the patterns are designed to make it easy to see that they are correct. The code does not check any of the lseek() calls; that's a minor risk.

它应该在一些较大的文件上测试(比BUFFERSIZE更大的文件,但是使用比64 KiB小得多的BUFFERSIZE进行测试是明智的;我使用了32个字节并且似乎没问题)。我只关注结果,但模式的设计使其很容易看出它们是正确的。代码不会检查任何lseek()调用;这是一个小风险。

#2


5  

First, use ftruncate() to enlarge the file to the final size. Then copy everything from the old end over to the new end, working your way back to the insertion point. Then overwrite the middle contents with the data you want to insert. This is as efficient as it gets, I think, because filesystems don't generally offer true "insertion" in the middle of files.

首先,使用ftruncate()将文件放大到最终大小。然后复制从旧端到新端的所有内容,返回插入点。然后用要插入的数据覆盖中间内容。我认为这是有效的,因为文件系统通常不会在文件中间提供真正的“插入”。

#3


0  

I'm going to interpret your question broadly to be "how can I implement efficiently a persistent store of an object that supports random-access lookup by index and insertion with expansion." As noted, you could use a simple linear array in the file, but this would only be efficient for lookup (O(1)) and quite inefficient for insertion (O(n)). You could achieve O(log n) for both lookup and insertion by using a tree data structure instead. Maintain one file which acts as an index, and another which acts as the data store and is a series of chunks. Each chunk can be partially full. The index file contains a tree (binary tree or B-tree) where each node corresponds to some contiguous chunk of the array and contains the size of that chunk (so that the root node contains the size of the whole array). For a binary tree, the left and right child nodes contain the size of the left and right halves (approximately) of the array. Finally leaf nodes contain a pointer to a chunk in the data store file that contains the actual data. Insertion now involves changing the 'size' property of 'k' nodes, where 'k' is the height of the tree. When a data store chunk gets too full, split it (allocate a new one by growing the file, or, if you support deletion as well, perhaps from a free list of empty chunks) and rebalance the tree (lots of standard ways of doing this.)

我将广泛地解释你的问题是“如何有效地实现一个对象的持久存储,该对象支持通过索引进行随机访问查找,并通过扩展进行插入。”如上所述,您可以在文件中使用简单的线性数组,但这只能用于查找(O(1)),并且插入效率非常低(O(n))。您可以使用树数据结构来实现查找和插入的O(log n)。维护一个充当索引的文件,另一个充当数据存储并且是一系列块。每个块可以部分填满。索引文件包含一个树(二叉树或B树),其中每个节点对应于数组的某个连续块,并包含该块的大小(以便根节点包含整个数组的大小)。对于二叉树,左子节点和右子节点包含数组左右两半(大约)的大小。最后,叶节点包含指向数据存储文件中包含实际数据的块的指针。插入现在涉及更改'k'节点的'size'属性,其中'k'是树的高度。当数据存储块太满时,将其拆分(通过增长文件来分配新的,或者,如果您也支持删除,可能来自空闲块的空闲列表)并重新平衡树(许多标准方法)这个。)

Does this sound complicated? Definitely! Efficient mid-file insertion is more complicated to achieve than is appending.

这听起来很复杂吗?非也!高效的中间文件插入比附加更复杂。

#4


0  

I'm agreeing with the others, but let me state the solution a bit differently:

我同意其他人的观点,但让我说一下解决方案有点不同:

  1. Get a temp filename (there are OS-specific calls for this)

    获取临时文件名(有特定于操作系统的调用)

  2. Copy your original file to the temp file (there are now two copies of the same file)

    将原始文件复制到临时文件(现在有两个相同文件的副本)

  3. Open the original file for "append".

    打开原始文件“append”。

  4. "Truncate" it to your insertion point

    “截断”它到你的插入点

  5. Write your new data

    写下你的新数据

  6. Open your temp file for "read"

    打开临时文件以“读取”

  7. "Seek" to the insertion point (again, the call is OS-specific)

    “寻找”到插入点(同样,调用是特定于操作系统的)

  8. Read to end-of-file in temp file; inserting into your original file (still open for "append").

    读取临时文件中的文件结尾;插入原始文件(仍然打开“追加”)。

  9. Close both files

    关闭这两个文件

  10. Delete temp file

    删除临时文件