How do I merge two FASTA files in Perl (one of them with line breaks)?

Time: 2023-01-15 12:41:12

I have the following two FASTA files:

file1.fasta

>0
GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT
>1
GTTAAGTTATATCAAACTAAATATACATACTATAAA
>2
GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC

file2.qual

>0
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 15 40 40
>1
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 40 40 40
40 40 40 40 40 40 40 40 40 40 40
>2
40 40 40 40 7 40 40 5 40 40 40 40 40 40 40 40 37 13 31 20 15 40 10 11 4
40 8 3 29 10 19 18 40 19 15 5

Note that in the "qual" file the scores for each record (each header marked with ">") wrap onto a second line. Both files contain the same number of headers ('>'), and the number of quality scores in each record equals the length of the corresponding sequence.

What I want to do is merge these two files, yielding:

GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT  40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 15 40 40
GTTAAGTTATATCAAACTAAATATACATACTATAAA  40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40
GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC  40 40 40 40 7 40 40 5 40 40 40 40 40 40 40 40 37 13 31 20 15 40 10 11 4 40 8 3 29 10 19 18 40 19 15 5

But somehow my code below fails to do it correctly. In particular, the second line of each entry in the 'qual' file doesn't get printed.

use strict;
use Data::Dumper;        
use Carp;
use File::Basename;      

my $fastafile = $ARGV[0] || "reads/2039F.2.fasta"; 
my $base      = basename( $fastafile, ".fasta" );
my $qualfile  = "reads/" . $base . ".qual";
print "$qualfile\n";

open SEQ, '<', $fastafile or die $!; #Seq
open PRB, '<', $qualfile or die $!; #quality


while (my $seq = <SEQ>) {
     my $qual = <PRB>;
     chomp($seq);
     chomp($qual);

     if ($seq =~ /^>/ || $qual =~ /^>/) {
         next;
     }
     else {
         print "$seq\t$qual\n";      
     }

}

What's the correct way to do it?

3 Answers

#1


3  

You're missing the 2nd (and every subsequent) line of the quality scores, and you would also miss additional sequence lines if a sequence wrapped. For this reason, and for code reuse, the way to handle FASTA sequences is as whole entries/records:

local $/ = "\n>";
while (my $seq = <SEQ>) {
     my $qual = <PRB>;
     chomp($seq);  $seq =~ s/^>*.+\n//;  $seq =~ s/\n//g;
     chomp($qual);  $qual =~ s/^>*.+\n//;  $qual =~ s/\n/ /g;

     print "$seq\t$qual\n";      

}

You could also easily capture the FASTA header in the first substitution.
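
As a rough sketch of that idea (not the answerer's own code), the record-wise loop can be wrapped in two small subs that keep the header and print it as an extra first column. The names `parse_record` and `merge_records` are hypothetical, and the input here is a shortened in-memory stand-in for file1.fasta and file2.qual rather than the real files:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): split one record that was
# read with $/ = "\n>" into its header and its body lines.
sub parse_record {
    my ($record) = @_;
    $record =~ s/\n>\z//;             # drop the separator left at the end of the record
    $record =~ s/^>?([^\n]*)\n?//;    # the first line is the header; capture and remove it
    my $header = $1;
    return ( $header, split /\n/, $record );
}

# Hypothetical driver: pair up records from two handles, one record at a time.
sub merge_records {
    my ( $seq_fh, $qual_fh ) = @_;
    local $/ = "\n>";                 # read whole records instead of single lines
    my @merged;
    while ( defined( my $seq_rec = <$seq_fh> ) ) {
        my $qual_rec = <$qual_fh>;
        last if not defined $qual_rec;
        my ( $seq_id,  @seq_lines )  = parse_record($seq_rec);
        my ( $qual_id, @qual_lines ) = parse_record($qual_rec);
        warn "header mismatch: '$seq_id' vs '$qual_id'\n" if $seq_id ne $qual_id;
        push @merged, join "\t", $seq_id, join( '', @seq_lines ), join( ' ', @qual_lines );
    }
    return @merged;
}

# In-memory stand-ins for file1.fasta and file2.qual (shortened sample data).
my $fasta_text = ">0\nGAATAG\n>1\nGTTAAG\n";
my $qual_text  = ">0\n40 40 40\n40 15 40\n>1\n40 20 40\n40 40 40\n";
open my $seq_fh,  '<', \$fasta_text or die "Cannot open in-memory FASTA: $!";
open my $qual_fh, '<', \$qual_text  or die "Cannot open in-memory qual: $!";

print "$_\n" for merge_records( $seq_fh, $qual_fh );
```

Keeping the header also makes it possible to warn when the two files drift out of step, which the positional pairing alone would not reveal.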

#2


8  

The problem is that you are advancing through both files in parallel, one line at a time, so when the line from one file is a ">" header, the line from the other file might not be.

You are reading the data in pairs, like so:

1: >0 
2: >0
1: GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT
2: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
1: >1
2: 40 40 40 40 40 40 40 40 15 40 40
1: GTTAAGTTATATCAAACTAAATATACATACTATAAA
2: >1
1: >2
2: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 40 40 40
1: GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC
2: 40 40 40 40 40 40 40 40 40 40 40
1: EOF
2: >2
1: EOF
2: 40 40 40 40 7 40 40 5 40 40 40 40 40 40 40 40 37 13 31 20 15 40 10 11 4
1: EOF
2: 40 8 3 29 10 19 18 40 19 15 5

Applying your loop's rules to the same set of data would print this:

1: GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT
2: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
1: GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC
2: 40 40 40 40 40 40 40 40 40 40 40

So you need to either separate the looping logic out or find a way to make the files match.
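
One way to "make the files match", sketched here under the assumptions that both files fit in memory and that headers are unique within a file, is to unwrap each file into a hash keyed by header and then join on the header. The subs `read_records` and `merge_by_header` are hypothetical names, and the data is a shortened in-memory stand-in for the two files:

```perl
use strict;
use warnings;

# Hypothetical helper: read every record from a handle and return
# (hash ref of header => joined body, array ref of headers in file order).
sub read_records {
    my ( $fh, $joiner ) = @_;
    my ( %lines_for, @order, $id );
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^>(.*)/ ) {        # a header line starts a new record
            $id = $1;
            push @order, $id if not exists $lines_for{$id};
            $lines_for{$id} = [];
            next;
        }
        next if not defined $id;          # skip anything before the first header
        push @{ $lines_for{$id} }, $line;
    }
    my %body_for;
    $body_for{$_} = join $joiner, @{ $lines_for{$_} } for @order;
    return ( \%body_for, \@order );
}

# Hypothetical driver: sequences are joined with nothing, scores with a space.
sub merge_by_header {
    my ( $seq_fh, $qual_fh ) = @_;
    my ( $seq_for, $order ) = read_records( $seq_fh, '' );
    my ($qual_for) = read_records( $qual_fh, ' ' );
    my @merged;
    for my $id (@$order) {
        if ( not exists $qual_for->{$id} ) {
            warn "no qualities for record '$id'\n";
            next;
        }
        push @merged, "$seq_for->{$id}\t$qual_for->{$id}";
    }
    return @merged;
}

# In-memory stand-ins for file1.fasta and file2.qual (shortened sample data).
my $fasta_text = ">0\nGAATAG\n>1\nGTTAAG\n";
my $qual_text  = ">0\n40 40 40\n40 15 40\n>1\n40 20 40\n40 40 40\n";
open my $seq_fh,  '<', \$fasta_text or die "Cannot open in-memory FASTA: $!";
open my $qual_fh, '<', \$qual_text  or die "Cannot open in-memory qual: $!";

print "$_\n" for merge_by_header( $seq_fh, $qual_fh );
```

Because records are matched by header rather than by position, this variant does not depend on the two files listing their records in the same order.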

Here is an attempt at separating the two read loops, but I haven't tested it.

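fileIO: {
  while (1) {
    my $seq;
    my $qual = q{};
    while (1) {
      $seq = <SEQ>;
      last fileIO if not defined $seq;           # stop at end of file
      last if $seq !~ /^>/;                      # skip header lines
    }
    chomp $seq;
    while (1) {
      my $qual_in = <PRB>;
      last if not defined $qual_in;              # end of file: still print the last record
      last if $qual_in =~ /^>/ and $qual ne q{}; # next header: this record is complete
      next if $qual_in =~ /^>/ and $qual eq q{}; # skip this record's own header
      chomp $qual_in;
      $qual .= ( $qual eq q{} ? q{} : q{ } ) . $qual_in;  # join wrapped lines with a space
    }
    print "$seq\t$qual\n";
  }
}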

Update

I refactored the above code into a single function that reads a chunk from an arbitrary file handle, and it seems to work as needed. Note that I also experimented a little here with a trick I've been meaning to use for something practical.

use strict;
use warnings;

# 
#  readUntilNext( $fileHandle, \$scalar_ref ); 
#
#  returns 0 when nothing could be read from the fileHandle. 
#  otherwise returns 1; 
#

sub readUntilNext {
    my ($fh)            = shift;
    my ($output)        = shift;
    my ($output_buffer) = '';
    while (1) {
        my $line = <$fh>;
        if ( !defined $line ) { # No more data
            # No data to flush to user, return false.
            return 0 if $output_buffer eq q{};
            last;  # data to  flush to user, loop exit. 
        }
        if ( $line =~ /^>/ ) {
            # Didn't get anything, keep looking. 
            next if $output_buffer eq q{};
            # Got something, flush data to user. 
            last;
        }
        chomp($line);
        $output_buffer .= $line;
    }
    # Data to flush to user 
    # Write to the scalar-reference 
    $$output .= $output_buffer;
    return 1;
}

open my $m, '<', 'a.txt' or die "Cannot open a.txt: $!";
open my $n, '<', 'b.txt' or die "Cannot open b.txt: $!";
# Creates 2 scalar references every loop, and only loops as long 
# as both files have data. 
while ( readUntilNext( $m, \my $seq ) && readUntilNext( $n, \my $qual ) ) {
    print "$seq\t$qual\n";
}

The above code has been tested and does exactly what you want to do.
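
A quick way to sanity-check merged output, relying on the invariant from the question that each sequence has exactly one quality score per base, might look like the following. This is only a sketch: `scores_match_length` is a hypothetical helper, and the two sample lines are made up for illustration (the second is deliberately one score short):

```perl
use strict;
use warnings;

# Hypothetical checker: true when a merged "SEQUENCE<TAB>SCORES" line
# carries exactly one quality score per base.
sub scores_match_length {
    my ($merged_line) = @_;
    chomp $merged_line;
    my ( $seq, $scores ) = split /\t/, $merged_line, 2;
    return 0 if not defined $scores;     # no tab, so nothing to compare
    my @scores = split ' ', $scores;     # split on runs of whitespace
    return length($seq) == @scores ? 1 : 0;
}

# Made-up sample lines; the second is deliberately one score short.
my @merged = (
    "GAAT\t40 40 15 40",
    "GTTA\t40 40 40",
);
for my $line (@merged) {
    printf "%s\t%s\n", ( scores_match_length($line) ? 'ok' : 'MISMATCH' ), $line;
}
```

A mismatch usually means wrapped lines were glued together without a separator or a record was paired with the wrong header.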

A note on the \my construct

while( readUntilNext( $m, \my $seq ) ) { 
}

is fundamentally the same as

my $seq; 
while( readUntilNext( $m, \$seq ) ) { 
}

Except that the former creates a new scalar every time, guaranteeing that the same value won't be visible to a successive iteration of the loop;

so it becomes more like:

while( 1 ){ 
 my $seq; 
 last if not readUntilNext($m, \$seq);
 do { 
    # loop body here
 }
}

#3


4  

Here is a solution that uses plain shell commands instead of Perl:

prompt>grep -v '^>[0-9]' file1.fasta > tmp1
prompt>(tr '\012' ' ' < file2.qual; echo) | sed 's/>[0-9]* /\n/g' | sed 1d > tmp2
prompt>paste tmp1 tmp2
GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT    40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 15 40 40
GTTAAGTTATATCAAACTAAATATACATACTATAAA    40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40
GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC    40 40 40 40 7 40 40 5 40 40 40 40 40 40 40 40 37 13 31 20 15 40 10 11 4 40 8 3 29 10 19 18 40 19 15 5
prompt>

I searched for years for the paste command (knowing that "this is a super basic operation, someone must already have implemented something to solve this problem").

The second command line first translates all newlines to spaces, joining all the input lines into one single line, which the sed command then splits up again at each header. The echo command adds a final newline to the input, because sed will ignore a line lacking an EOL. Portability note: not all sed programs work with arbitrary line lengths, but GNU sed does.
