序列比较功能无法按预期工作

时间:2022-09-01 21:00:09

Amateur Python coder trying to learn here, so I wanted to ask whats going on with my script. I cant really work out where its going wrong. (think around line 24 or difference = "%s-%s [%d]" %( seq1[i], seq2[i], i)).

业余Python编码器试图在这里学习,所以我想问一下我的脚本是怎么回事。我真的无法解决它出错的地方。 (考虑第24行或差异=“%s-%s [%d]”%(seq1 [i],seq2 [i],i))。

The function is to take a load of sequences, rename them (this bit is working) and then compare each sequence to a reference sequence (here just the first sequence in the file) and if a letter in the sequence does not match the reference print the difference and location. However as you can see below this is not working

该函数是对一系列序列进行重命名,重命名它们(此位有效),然后将每个序列与参考序列(此处只是文件中的第一个序列)进行比较,如果序列中的字母与参考打印不匹配差异和位置。但是,正如您在下面看到的那样,这是行不通的

Here is a mock up input file - http://pastebin.com/AH2zxdBn

这是一个模拟输入文件 - http://pastebin.com/AH2zxdBn

import re
from Bio.Alphabet import generic_dna, generic_protein
from Bio import SeqIO

def compare_seqs( seq1, seq2 ):

  similar = 0
  diff    = 0

  diff_positions = []

  for i in range(0, len( seq1 )):
    if ( seq1[ i ] != seq2[ i ]):
      difference = "%s-%s [%d]" %( seq1[i], seq2[i], i)
      diff_positions.append( difference )
#    else:
#       similar += 1


  return ",".join( diff_positions )


new_seq = []

reference_sequence = ""
reference_name     = ""

outfile = open("test_out.csv", 'w')

for record in SeqIO.parse(open('test.fa', 'ru'), 'fasta', generic_protein):
    record_id = re.sub(r'\d+_(\d+_\d\#\d+)_\d+', r'\1', record.id)


    if ( not reference_sequence ):
      reference_sequence = record.seq
      reference_name     = record_id
      #continue
    print "\t".join([reference_name, record_id, compare_seqs(reference_sequence, record.seq)])

Here is the output I am getting, which is incorrect as pos 454 in 7065_8#4 actually = P

这是我得到的输出,这是不正确的,因为7065_8#4中的pos 454实际上= P.

7065_8#1    7065_8#1    
7065_8#1    7065_8#2    
7065_8#1    7065_8#3    
7065_8#1    7065_8#4    E-G [245]
7065_8#1    7065_8#5

1 个解决方案

#1


2  

The best way to troubleshoot this is definitely breaking it down into smaller pieces and verifying each one.

解决这个问题的最佳方法是将它分解成更小的部分并验证每个部分。

Here's a minimal difference implementation:

这是一个最小差异实现:

def compare_sequences(seq1, seq2):
    for index, (a, b) in enumerate(zip(seq1, seq2)):
        if a != b:
            yield index, a, b

Here it is working:

这是工作:

print list(compare_sequences('abcdef', 'abddef'))

Which gives me

这给了我

[(2, 'c', 'd')]

You can use this as a simple proof that it's working. What I'd recommend doing is isloating the input into your function and verifying that it works as expected.

你可以使用它作为它工作的简单证据。我建议做的是将输入调用到您的函数中并验证它是否按预期工作。

Maybe there's an issue with the input having whitespace or a newline where you do not expect it which is throwing everything off?

也许有一个问题,输入有空格或换行符,你不指望它抛出一切?

#1


2  

The best way to troubleshoot this is definitely breaking it down into smaller pieces and verifying each one.

解决这个问题的最佳方法是将它分解成更小的部分并验证每个部分。

Here's a minimal difference implementation:

这是一个最小差异实现:

def compare_sequences(seq1, seq2):
    for index, (a, b) in enumerate(zip(seq1, seq2)):
        if a != b:
            yield index, a, b

Here it is working:

这是工作:

print list(compare_sequences('abcdef', 'abddef'))

Which gives me

这给了我

[(2, 'c', 'd')]

You can use this as a simple proof that it's working. What I'd recommend doing is isloating the input into your function and verifying that it works as expected.

你可以使用它作为它工作的简单证据。我建议做的是将输入调用到您的函数中并验证它是否按预期工作。

Maybe there's an issue with the input having whitespace or a newline where you do not expect it which is throwing everything off?

也许有一个问题,输入有空格或换行符,你不指望它抛出一切?