Get the average of a column's values in one file, grouped by a factor in another file (AWK)

Time: 2022-12-09 16:03:40

This problem is hard to describe in one sentence, so forgive me if the title doesn't capture what I write below.

I have two files, the first file (file1.txt) contains:

Chr1    1   0
Chr1    2   0
Chr1    3   3
Chr1    4   0
Chr1    5   5
Chr1    6   0
Chr1    7   0
Chr1    8   0
Chr1    9   0
Chr1    10  7
Chr1    11  0
Chr1    12  0
Chr1    13  0
Chr1    14  9
Chr1    15  0
Chr1    16  0
Chr1    17  0
Chr1    18  0
Chr1    19  0
Chr1    20  0
Chr2    1   0
Chr2    2   0
Chr2    3   0
Chr2    4   9
Chr2    5   10
Chr2    6   1
Chr2    7   0
Chr2    8   0
Chr2    9   0
Chr2    10  0

Chr1 and Chr2 stand for chromosomes (column 1), and column 2 contains the positions on the chromosome. Notice how the position always starts from 1 and goes up to a larger but unknown number (for Chr1, it ends at 20). The third column contains a count at that chromosome and position.

The second file (file2.txt) looks like this:

Chr1    1   10
Chr1    5   15
Chr1    10  20
Chr1    15  25
Chr5    1   10

It specifies windows about 10 positions wide (the file is sorted: the starting position goes up in increments of 5, but the window size is 10).

I need to average the count within each window.

The window for Chr1 position 1 to Chr1 position 10 has combined count 0+0+3+0+5+0+0+0+0+7=15, so the average is 15/10 (size of the window) = 1.5

The window for Chr1 position 5 to Chr1 position 15 has combined count 5+0+0+0+0+7+0+0+0+9+0=21, so the average is 21/11 (size of the window) = 1.909

The window for Chr1 position 10 to Chr1 position 20 has combined count 7+0+0+0+9+0+0+0+0+0+0=16, so the average is 16/11 (size of the window) = 1.454

The window for Chr1 position 15 to Chr1 position 25 (the last 5 positions are out of range) has combined count 0+0+0+0+0+0=0, so the average is 0

The window for Chr5 position 1 to Chr5 position 10 has no records in file1.txt, so the average is 0

The output should be:

Chr1    1   10  1.5
Chr1    5   15  1.909
Chr1    10  20  1.454
Chr1    15  25  0
Chr5    1   10  0

Notice how Chr2 isn't in the output file because there weren't any windows specified for Chr2 in file2.txt.

I've coded something in Perl to solve the problem; however, it is rather slow due to the large size of file1.txt. Is this problem solvable using awk? I'm hoping it might offer a faster (and shorter) solution.

I'm guessing the solution would involve an associative array, but so far all I've figured out is how to join on columns 1 and 2, which is not even close to solving the problem.

awk 'FNR==NR{a[$1,$2]=$3;next}{ print a[$1,$2]}' file1.txt file2.txt

Or is this problem not suited for awk?

1 Solution

#1


You're loading the data from the first file into the array properly. Then when you read the second file, you need to loop through the values selected by the range, calculating the average.

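awk 'FNR==NR{a[$1,$2]=$3;next}   # first file: store the count by (chromosome, position)
     { total = 0;
       # sum the counts in the window; positions missing from file1.txt count as 0
       for(i = $2; i <= $3; i++) total += a[$1,i];
       # divide by the window size: end - start + 1
       print $1, $2, $3, total/($3-$2+1);
     }' file1.txt file2.txt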
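
To try the approach without touching the real data, here is a minimal sketch that wraps the same logic in a shell function. The function name `window_avg` and the array name `count` are made up for illustration and are not from the original post. It assumes both files are whitespace-separated, that file1.txt is not empty, and that the window size is end - start + 1.

```shell
#!/bin/sh
# Hypothetical helper (the name window_avg is an assumption, not from the post).
# Usage: window_avg COUNTS_FILE WINDOWS_FILE
window_avg() {
    awk 'FNR == NR {
             # First file: remember the count for each (chromosome, position).
             count[$1, $2] = $3
             next
         }
         {
             # Second file: add up the counts for positions start..end.
             # Positions that never appeared in the first file contribute 0.
             total = 0
             for (i = $2; i <= $3; i++) total += count[$1, i]
             # Average over the window size (end - start + 1).
             print $1, $2, $3, total / ($3 - $2 + 1)
         }' "$1" "$2"
}
```

Calling `window_avg file1.txt file2.txt` prints one line per window. With awk's default number formatting, non-integer averages come out with six significant digits (1.90909 rather than 1.909), so a `printf` with an explicit format would be needed if a fixed number of decimals matters. Note that the whole first file is held in memory, which may or may not be acceptable for a very large file1.txt.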
