Python - Counting the characters between two specific strings

Time: 2021-09-27 22:16:09

I made a text file containing random sequences of bases (ATCG) and want to find the longest and shortest "reading frame" within those sequences.

I was able to identify the start and stop codons (the two "specific strings" mentioned) with "searchfile" and a for-loop, and I also know the basics of counting (example code at the end), but I can't find any way to set those two as "boundaries" between which I can count.

Can anybody give me a hint, or tell me what such a function/operation is called so I can at least find it in the documentation, or show me what it could look like? I found many ways to count various things, but none for counting between "x" and "y".

Example of how I looked up the strings between which I want to count:

searchfile = open('dna.txt', 'r')
for line in searchfile:
    if "ATG" in line: print (line)
searchfile.close()

Whole code:

import numpy as np

BASES = ('A', 'C', 'T', 'G')
P = (0.25, 0.25, 0.25, 0.25)

def random_dna_sequence(length):
    return ''.join(np.random.choice(BASES, p=P) for _ in range(length))

with open('dna.txt', 'w+') as txtout:
    for _ in range(10):
        dna = random_dna_sequence(50)
        txtout.write(dna)
        txtout.write("\n")


searchfile = open('dna.txt', 'r')
for line in searchfile:
    if "ATG" in line: print (line)
searchfile.close()

searchfile = open('dna.txt', 'r')
for line in searchfile:
    if "ATG" in line: print (line)
    elif "TAG" in line: print (line)
    elif "TAA" in line: print (line)
    elif "TGA" in line: print (line)
    else: print ("no stop-codon detected")
searchfile.close()

Sidenote: The print instruction is only a temporary placeholder for testing. In the end I would like to use the found strings as the "boundaries" mentioned above (I can't find a better name for them) at that point.

Some example lines from the dna.txt file:

GAAGACGCAATAGGTTCACGGCGCTCATAGGCTTGCCCTCATAGGGCTTG
TCTGAGGTAGAAGGAGCTACTGCCGTTGCAGGTGACGCCCACAGTCCTGA
GTTATTACTCCCTGACTGTCATCTGTTCGGATACCGTGCAGCGCATCGAG
AGGAGATAACGCGATCCTGAGACAGTTTACCTATATGTTCACTACGCATG
CCGAGCTGATCCGACTACTGAAGGTGAATTCTGAAGCTAATCTGCAGTTC

This is a small example (I use 10 and 50 for testing), but in the end the file will contain 10000 sequences of 1000 characters each.

1 solution

#1


What I would do is something like this:

with open("dna.txt", 'r') as searchfile:
    all_dna = searchfile.read()

# Position of the first "ATG"
start = all_dna.index("ATG")
# Position of the next "ATG", searching from just past the first one
# (pass a stop codon such as "TAG" here instead if that is the boundary you need)
end = all_dna.index("ATG", start + 3)
# Everything from the first "ATG" through the second one, inclusive
needed_dna = all_dna[start:end + 3]
print(len(needed_dna))

index finds the position at which the substring passed as its argument first occurs in a string, and raises ValueError if the substring is not found. Its optional second argument is the position at which to start searching, which is how the second call skips past the first "ATG". with is a keyword that is useful as a safety precaution for file I/O: it ensures that the file is properly closed even if the code inside the block raises an error. If you don't want to include the starting and ending "ATG" in needed_dna, you can set it to all_dna[(start + 3):end]. The brackets, by the way, mean "take the substring of the specified string beginning at the index before the colon (inclusive, zero-indexed) and ending at the index after the colon (exclusive, also zero-indexed)". This also works for lists, and the brackets can be used without the colon to get the character at a specific index. Hope this helps!
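To get from a single pair of boundaries to the longest and shortest "reading frame" the question asks about, the same idea can be applied to every start codon on every line. The following is only a rough sketch under some assumptions that are not stated in the question: a frame is taken to run from an "ATG" to the first stop codon ("TAG", "TAA" or "TGA") that lies a multiple of three bases after it, each line of the file is treated as a separate sequence, and the names reading_frames and longest_and_shortest are made up for illustration. It uses str.find, which works like index but returns -1 instead of raising ValueError.

```python
START = "ATG"
STOPS = {"TAG", "TAA", "TGA"}


def reading_frames(seq):
    """Return every stretch of seq that runs from a start codon to the
    first in-frame stop codon after it (both codons included)."""
    frames = []
    pos = seq.find(START)
    while pos != -1:
        # Step forward three bases at a time so we stay in the frame
        # defined by this start codon.
        for i in range(pos + 3, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                frames.append(seq[pos:i + 3])
                break
        # Look for the next start codon, which may overlap this frame.
        pos = seq.find(START, pos + 1)
    return frames


def longest_and_shortest(lines):
    """Return (longest, shortest) frame over all lines, or None if no
    line contains a complete frame."""
    frames = [f for line in lines for f in reading_frames(line.strip())]
    if not frames:
        return None
    return max(frames, key=len), min(frames, key=len)
```

Since an open file can be iterated line by line, passing the file object from with open("dna.txt") as searchfile: straight to longest_and_shortest should be enough; len() of each returned string then gives the count between the two boundaries. If stop codons in any position should end a frame, not just in-frame ones, the inner loop would need a step of 1 instead of 3.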
