Parsing a tab-delimited text file

Time: 2021-08-09 22:55:14

I'm trying to write code that parses a tab-delimited text file by assigning each string between tabs to a given element of a sample struct that I've defined. In the input file, the first row holds all the class identifiers (c_name), the second row holds all the sample identifiers (s_name), and the remaining rows contain data.

I know it's going to be a bit more complicated because the first column will actually just contain labels, but I figured I'd start with trying to figure out the general parsing scheme.

I gather that, for the class identifiers for example, I should probably be using fscanf in a for loop to add each identifier to the class field of a given sample, but I'm getting lost in the actual implementation. Based on one post I saw, I thought I could use something like %[^\t]\t in fscanf to read everything that's not a tab, up to the next tab, into an array, but I don't think I have this quite right.

Any suggestions would be greatly appreciated.

#define LENGTH 30
#define MAX_OBS 80000

typedef struct
{
    char c_name[LENGTH];
    char s_name[LENGTH];
    double value[MAX_OBS];
}
sample;

// I've already calculated the number of columns in the file
sample sample[total_columns];
for (int i = 0; i < total_columns; i++)
   {
      fscanf(input, "%[^\t]\t", sample[i].s_name);
   }

Edit: I've tried several different variations of the code below ("%[^\t\n\r]\t\n\r", or "%[^\t\n\r]%*1[\t\n\r]", or " %[^\t\n\r]") and they all seem to be generally working except that, depending on the size I'm allocating to data and how long I'm iterating, it gives a segmentation fault at some point. The code below gives a segmentation fault immediately, but if I arbitrarily change total_columns in both places to 3, it will print Class Case Case. This seems to work up until 14, at which point the whole program segmentation faults. I'm fairly confused about the issue here. I've also tried mallocing memory to the sample data array to see if it was an issue of stack vs heap, but that doesn't seem to be helping either. Thanks so much for your help!

sample data[total_columns];
fseek(input, 0, SEEK_SET);
for (int i = 0; i < total_columns; i++)
{
    fscanf(input, "%[^\t\n\r]\t\n\r", data[i].s_name);
    printf("%s\n", data[i].s_name);
}
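
A side note on the format strings tried above: in a scanf format, any whitespace character (such as \t, \n or \r) matches any run of whitespace in the input, so "%[^\t\n\r]\t\n\r" and "%[^\t\n\r]\t" behave the same way. The scanset itself has no length limit unless a width is given, so a field longer than LENGTH - 1 characters would write past the end of s_name. The sketch below is only an illustration, not the code from the question: split_fields is a hypothetical helper that applies a width-limited scanset to one line already held in memory, using sscanf and %n to step through it.

```c
#include <assert.h>
#include <stdio.h>

#define LENGTH 30

/* Hypothetical helper (not part of the original code): copies up to
 * max_fields tab-separated fields from `line` into `fields` and returns
 * how many were stored. It stops at an empty field, at a field longer
 * than LENGTH - 1 characters, or at the end of the line. */
int split_fields(const char *line, char fields[][LENGTH], int max_fields)
{
    assert(line != NULL && fields != NULL);

    int count = 0;
    while (count < max_fields) {
        int used = 0;
        /* %29[^\t\n\r] stores at most LENGTH - 1 characters that are not
         * a tab, newline, or carriage return, plus a terminating '\0'.
         * %n records how many characters were consumed so far. */
        if (sscanf(line, "%29[^\t\n\r]%n", fields[count], &used) != 1)
            break;              /* empty field or end of input */
        count++;
        line += used;
        if (*line != '\t')
            break;              /* no separator follows: last field */
        line++;                 /* step over exactly one tab */
    }
    return count;
}
```

Called on the first row of the example file, split_fields would store "Class" in fields[0] and "Case" in fields[1]. Stepping over exactly one tab at a time, rather than letting a whitespace directive swallow a whole run, is what lets the helper notice an empty field instead of silently shifting the columns.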

An example input file would look like:

Class   Case    Case    Case    Case    Case    Case    Case    Case    Case    Case    Case    Case    Case    Case    Control Control Control Control Control Control Control Control Control Control Control Control Control Control Control Control
Subject G038    G144    G135    G161    G116    G165    G133    G069    G002    G059    G039    G026    G125    G149    G108    G121    G060    G140    G127    G113    G023    G147    G011    G019    G148    G132    G010    G142    G020    G021
Data1   0.000741628 0.00308607  0.000267431 0.001418697 0.001237904 0.000761145 0.0008281   0.002426075 0.000236698 0.004924871 0.000722752 0.003758006 0.000104813 0.000986619 0.000121803 0.000666854 0   0.000171394 0.000877993 0.002717391 0.001336501 0.000812089 0.001448743 5.28E-05    0.001944298 0.000292529 0.000469631 0.001674047 0.000651526 0.000336615
Data2   0.102002396 0.108035127 0.015052531 0.079923731 0.020643362 0.086480609 0.017907667 0.016279315 0.076263965 0.034876124 0.187481931 0.090615572 0.037460171 0.143326961 0.029628502 0.049487575 0.020175439 0.122975405 0.019754837 0.006702899 0.014033264 0.040024363 0.076610375 0.069287599 0.098896479 0.011813681 0.293331246 0.037558052 0.303052867 0.137591517
Data2   0.218495065 0.242891829 0.23747851  0.101306336 0.309040188 0.237477347 0.293837554 0.34351816  0.217572429 0.168651691 0.179387106 0.166516699 0.099970652 0.181003474 0.076126675 0.10244981  0.449561404 0.139257863 0.127579104 0.355797101 0.354544105 0.262855651 0.10167146  0.186068602 0.316763006 0.187466247 0.05701315  0.123825467 0.064780343 0.069847682
Data4   0.141137543 0.090948286 0.102502388 0.013063365 0.162060849 0.166292135 0.070215996 0.063535037 0.333743609 0.131011609 0.140936687 0.150108506 0.07812762  0.230704405 0.069792935 0.120770743 0.164473684 0.448110378 0.42599534  0.074094203 0.096525097 0.157661185 0.036737518 0.213931398 0.091119285 0.438073807 0.224921728 0.187034237 0.06611442  0.086005218
Data5   0.003594044 0.003948354 0.008137536 0.001327901 0.002161974 0.003552012 0.002760334 0.001898667 0.001420186 0.003165988 0.001011853 0.001217382 0.000314439 0.004254794 0.000213155 0.003650147 0   0.002742309 0.002633978 0   0.002524503 0.002146234 0.001751465 0.006543536 0.003941146 0.00049505  0.00435191  0.001944054 0.001303053 0.004207692
Data6   0.000285242 2.27E-05    0   1.13E-05    0.0002964   3.62E-05    0.000138017 0.000210963 0.000662753 0   0   0   0   4.11E-05    0   0   0   0   0.000101307 0   0   0   0   5.28E-05    0.00152391  0   0   0   0   0
Data7   0.002624223 0.001134584 0.00095511  0.000419934 0.000401011 0.001739761 0.00272583  0.002566717 0.000520735 0.002311674 0.006287944 0   6.29E-05    0.000143882 3.05E-05    0.000491366 0   0   3.38E-05    0   0.001782002 0.000957104 0.002594763 0.000527704 0.000105097 0.001192619 3.13E-05    0   0.000744602 0.000252461
Data8   0.392777683 0.383875286 0.451499522 0.684663315 0.387394299 0.357992026 0.488406597 0.423473155 0.27267563  0.47454646  0.331020526 0.484041709 0.735955056 0.338841956 0.781699147 0.625403622 0.313596491 0.270545891 0.379259109 0.498913043 0.372438372 0.446271644 0.606698813 0.305593668 0.360535996 0.29889739  0.328710081 0.521222594 0.419924299 0.584111756

Edit: I seem to have fixed it by changing the MAX_OBS definition - pretty sure I have a fundamental misunderstanding of what that actually means. I'll have to look into that. Thanks again for the help!

1 answer

#1


2  

Try this:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define LENGTH 30
#define MAX_OBS 80000

typedef struct{
    char c_name[LENGTH];
    char s_name[LENGTH];
    double value[MAX_OBS];
} Sample;//named Sample so the type and the variable do not share a name (pointed out by Jonathan Leffler)

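int main(void){
    char line[1024];//assumes the header row fits in 1023 characters
    FILE *input = fopen("data.txt", "r");
    if(input == NULL){//stop early if the file cannot be opened
        perror("data.txt");
        return EXIT_FAILURE;
    }

    if(fgets(line, sizeof(line), input) == NULL){//read the header row to count columns
        fclose(input);
        return EXIT_FAILURE;
    }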

    int total_columns = 0;
    char *p = strtok(line, "\t\n");

    while(p){
        ++total_columns;
        p = strtok(NULL, "\t\n");
    }
    --total_columns;//the first column holds row labels, not a sample
    rewind(input);
 //*******************************************************************************
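    Sample *sample = malloc(total_columns * sizeof(*sample));//each Sample holds MAX_OBS doubles, which is too large for the stack, so allocate on the heap
    if(sample == NULL){//allocation can fail for large column counts
        fclose(input);
        return EXIT_FAILURE;
    }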

    fscanf(input, "%*s\t");//skip first column
    for (int i = 0; i < total_columns; i++){
        fscanf(input, "%29[^\t\n]\t", sample[i].c_name);//width 29 = LENGTH-1 prevents overflow; \n ends the last column
    }
    fscanf(input, "%*s\t");//skip first column
    for (int i = 0; i < total_columns; i++){
        fscanf(input, "%29[^\t\n]\t", sample[i].s_name);//same width limit for the sample identifiers
    }
    int r;
    for(r = 0; r < MAX_OBS; ++r){
        if(EOF==fscanf(input, "%*s")) break;
        for (int i = 0; i < total_columns; i++){
            fscanf(input, "%lf", &sample[i].value[r]);
        }
    }
    fclose(input);

    //test print
    printf("%s\n", sample[0].c_name);
    printf("%s\n", sample[0].s_name);
    for(int i = 0; i < r; ++i)
        printf("%f\n", sample[0].value[i]);
    printf("\n%s\n", sample[total_columns-1].c_name);
    printf("%s\n", sample[total_columns-1].s_name);
    for(int i = 0; i < r; ++i)
        printf("%f\n", sample[total_columns-1].value[i]);
    free(sample);
}
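
Why malloc helps here can be made concrete with a little arithmetic. Each Sample embeds MAX_OBS doubles, so with 8-byte doubles a single struct takes at least 2 * 30 + 80000 * 8 = 640,060 bytes, and an array of them declared inside a function lives on the stack. The sketch below is only an illustration: samples_bytes and fits_on_stack are hypothetical helpers, and the stack limit passed in is an assumption, since the real limit depends on the platform and its settings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LENGTH 30
#define MAX_OBS 80000

typedef struct {
    char c_name[LENGTH];
    char s_name[LENGTH];
    double value[MAX_OBS];   /* MAX_OBS doubles stored inside every Sample */
} Sample;

/* Hypothetical helper: bytes needed for an array of `columns` Samples. */
size_t samples_bytes(size_t columns)
{
    assert(columns <= SIZE_MAX / sizeof(Sample));  /* guard against overflow */
    return columns * sizeof(Sample);
}

/* Hypothetical helper: returns 1 if that array would fit within
 * `stack_limit` bytes, ignoring everything else already on the stack. */
int fits_on_stack(size_t columns, size_t stack_limit)
{
    return samples_bytes(columns) <= stack_limit;
}
```

Assuming 8-byte doubles and an 8 MiB stack limit (a common default on Linux, but an assumption here), fits_on_stack(13, 8u * 1024 * 1024) is true while fits_on_stack(14, 8u * 1024 * 1024) is false, which is consistent with the failure the question describes at around 14 columns. Lowering MAX_OBS shrinks every Sample, which would explain why changing that definition made the crash go away; allocating with malloc avoids the stack limit without changing the struct.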
