fgetpos() behavior depends on the newline character

Time: 2023-01-31 16:13:05

Consider these two files:

file1.txt (Windows newline)

abc\r\n
def\r\n

file2.txt (Unix newline)

abc\n
def\n

I've noticed that for file2.txt, the position obtained with fgetpos is not incremented correctly. I'm working on Windows.

Let me show you an example. The following code:

#include <cstdio>

void read(FILE *file)
{
    int c = fgetc(file);
    printf("%c (%d)\n", (char)c, c);

    fpos_t pos;
    fgetpos(file, &pos); // save the position
    c = fgetc(file);
    printf("%c (%d)\n", (char)c, c);

    fsetpos(file, &pos); // restore the position - should point to previous
    c = fgetc(file);     // character, which is not the case for file2.txt
    printf("%c (%d)\n", (char)c, c);
    c = fgetc(file);
    printf("%c (%d)\n", (char)c, c);
}

int main()
{
    FILE *file = fopen("file1.txt", "r");
    printf("file1:\n");
    read(file);
    fclose(file);

    file = fopen("file2.txt", "r");
    printf("\n\nfile2:\n");
    read(file);
    fclose(file);

    return 0;
}

produces this output:

file1:
a (97)
b (98)
b (98)
c (99)


file2:
a (97)
b (98)
  (-1)
  (-1)

file1.txt works as expected, while file2.txt behaves strangely. To find out what is going wrong, I tried the following code:

#include <cstdio>

void read(FILE *file)
{
    int c;
    fpos_t pos;
    while (1)
    {
        fgetpos(file, &pos);
        printf("pos: %d ", (int)pos);
        c = fgetc(file);
        if (c == EOF) break;
        printf("c: %c (%d)\n", (char)c, c);
    }
}

int main()
{
    FILE *file = fopen("file1.txt", "r");
    printf("file1:\n");
    read(file);
    fclose(file);

    file = fopen("file2.txt", "r");
    printf("\n\nfile2:\n");
    read(file);
    fclose(file);

    return 0;
}

I got this output:

file1:
pos: 0 c: a (97)
pos: 1 c: b (98)
pos: 2 c: c (99)
pos: 3 c:
 (10)
pos: 5 c: d (100)
pos: 6 c: e (101)
pos: 7 c: f (102)
pos: 8 c:
 (10)
pos: 10

file2:
pos: 0 c: a (97) // something is going wrong here...
pos: -1 c: b (98)
pos: 0 c: c (99)
pos: 1 c:
 (10)
pos: 3 c: d (100)
pos: 4 c: e (101)
pos: 5 c: f (102)
pos: 6 c:
 (10)
pos: 8

I know that fpos_t is not meant to be interpreted by the programmer, because its representation depends on the implementation. However, the example above illustrates the problem with fgetpos/fsetpos.

How is it possible that the newline sequence affects the internal position of the file, even before those characters are encountered?

2 answers

#1


3  

I would say the problem is probably caused by the second file confusing the implementation: it is being opened in text mode, but it doesn't follow the requirements for a text file.

In the standard,

A text stream is an ordered sequence of characters composed into lines, each line consisting of zero or more characters plus a terminating new-line character

Your second file contains no valid newline sequences as far as the implementation is concerned (in text mode it looks for \r\n to convert to the newline character internally). As a result, the implementation may not work out the line lengths properly, and can get hopelessly confused when you try to move about in the file.

Additionally,

Characters may have to be added, altered, or deleted on input and output to conform to differing conventions for representing text in the host environment.

Bear in mind that the library will not just read each byte from the file as you call fgetc - it will read the entire file (for one so small) into the stream's buffer and operate on that.

#2


2  

I'm adding this as supporting information for teppic's answer:

When dealing with a FILE* that has been opened as text instead of binary, the fgetpos() function in VC++ 11 (VS 2012) may (and does for your file2.txt example) end up in this stretch of code:

// ...

if (_osfile(fd) & FTEXT) {
        /* (1) If we're not at eof, simply copy _bufsiz
           onto rdcnt to get the # of untranslated
           chars read. (2) If we're at eof, we must
           look through the buffer expanding the '\n'
           chars one at a time. */

        // ...

        if (_lseeki64(fd, 0i64, SEEK_END) == filepos) {

            max = stream->_base + rdcnt;
            for (p = stream->_base; p < max; p++)
                if (*p == '\n')                     // <---
                    /* adjust for '\r' */           // <---
                    rdcnt++;                        // <---

// ...

It assumes that any \n character in the buffer was originally a \r\n sequence that was normalized when the data was read into the buffer. So there are times when it tries to account for that (now missing) \r character, which it believes earlier processing of the file removed from the buffer. This particular adjustment happens when you're near the end of the file; however, there are other similar adjustments in the fgetpos() handling to account for the removed \r bytes.
