Finding the file offset for a character index, ignoring newlines

Date: 2021-01-22 15:45:51

I have a 3 GB text file (a FASTA file with DNA sequences). It contains about 50 million lines of varying length, though most lines are 70 characters wide. I want to extract a string from this file, given two character indices. The difficult part is that newlines must not be counted as characters.

For speed, I want to use seek() to jump to the beginning of the string and start reading from there, but for that I need the offset in bytes.

My current approach is to write a new file with all the newlines removed, but that takes another 3 GB on disk. I am looking for a solution that requires less disk space.

Using a dictionary that maps each character index to a file offset is not practical either, because there would be one key for every character, requiring at least 16 bytes × 3 billion characters = 48 GB.

I think I need a data structure that can return the number of newline characters preceding the character at a given index. Adding that number to the character index would then give the file offset in bytes.
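One way to sketch such a structure is a sorted array holding, for each line, the character index (newlines excluded) at which that line starts; a binary search then gives the number of newlines before any character. This is only an illustration under stated assumptions: line endings are a single `\n` byte, every other byte (including FASTA header lines) counts as a character, and the function names `build_line_starts`, `char_to_offset`, and `extract` are made up for this example. With one 8-byte entry per line, 50 million lines would need roughly 400 MB of memory.

```python
import bisect
from array import array


def build_line_starts(path):
    """Return, for each line, the character index (newlines excluded) where it starts."""
    starts = array("Q")  # unsigned 64-bit integers, 8 bytes per line
    chars = 0
    with open(path, "rb") as fh:
        for line in fh:
            starts.append(chars)
            chars += len(line.rstrip(b"\n"))
    return starts


def char_to_offset(starts, index):
    """Convert a newline-free character index into a byte offset in the file."""
    # Lines starting at or before `index`, minus one, equals the number of
    # newline bytes that precede the character (assumes "\n" line endings).
    newlines_before = bisect.bisect_right(starts, index) - 1
    return index + newlines_before


def extract(path, starts, begin, end):
    """Return characters [begin, end) of the file with newlines removed."""
    if end <= begin:
        return b""
    first = char_to_offset(starts, begin)
    last = char_to_offset(starts, end - 1)
    with open(path, "rb") as fh:
        fh.seek(first)
        raw = fh.read(last - first + 1)
    return raw.replace(b"\n", b"")
```

The index has to be built with one full pass over the file, but after that each lookup is a binary search plus a single seek and read.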

1 answer

#1


The SamTools fai index was designed for exactly this purpose. It is a small, compact index file with enough information to seek quickly to any position within any record of the FASTA file, as long as the file is properly formatted.
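To illustrate why so little information is enough: each line of a `.fai` file describes one sequence with five tab-separated fields (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH), where OFFSET is the byte position of the record's first base, LINEBASES is the number of bases per line, and LINEWIDTH is the number of bytes per line including the line terminator. The sketch below derives a byte offset from those fields in plain Python. It is not the samtools implementation; the names `FaiRecord`, `parse_fai_line`, `base_offset`, and `fetch` are hypothetical, and positions are assumed to be 0-based.

```python
from typing import NamedTuple


class FaiRecord(NamedTuple):
    name: str
    length: int     # total number of bases in the sequence
    offset: int     # byte offset of the first base in the FASTA file
    linebases: int  # bases per line
    linewidth: int  # bytes per line, including the line terminator


def parse_fai_line(line):
    """Parse one tab-separated line of a .fai index file."""
    name, length, offset, linebases, linewidth = line.rstrip("\n").split("\t")[:5]
    return FaiRecord(name, int(length), int(offset), int(linebases), int(linewidth))


def base_offset(rec, pos):
    """Byte offset of the 0-based base `pos` within the record."""
    if not 0 <= pos < rec.length:
        raise IndexError("position outside the sequence")
    full_lines, column = divmod(pos, rec.linebases)
    return rec.offset + full_lines * rec.linewidth + column


def fetch(fasta_path, rec, start, end):
    """Return bases [start, end) of the record, with line breaks removed."""
    if not 0 <= start <= end <= rec.length:
        raise IndexError("range outside the sequence")
    if start == end:
        return ""
    first = base_offset(rec, start)
    last = base_offset(rec, end - 1)
    with open(fasta_path, "rb") as fh:
        fh.seek(first)
        raw = fh.read(last - first + 1)
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")
```

The arithmetic only works when every line of a record except the last has the same length, which is what "properly formatted" amounts to here.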

You can create a SamTools index using the samtools faidx command.

You can then use other programs in the SamTools package to pull out subsequences or alignments very quickly using the index.
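As a rough sketch of how this could be driven from a script: `samtools faidx ref.fa` writes the index next to the FASTA file as `ref.fa.fai`, and `samtools faidx ref.fa chr1:5-13` prints the requested subsequence using 1-based, inclusive coordinates. The wrapper functions `faidx_command` and `run_faidx` and the file and sequence names are assumptions for illustration, and samtools is assumed to be installed and on the PATH.

```python
import shutil
import subprocess


def faidx_command(fasta_path, region=None):
    """Build the samtools faidx command line.

    Without a region the command creates the .fai index; with a
    (name, start, end) tuple it extracts that 1-based, inclusive range.
    """
    cmd = ["samtools", "faidx", fasta_path]
    if region is not None:
        name, start, end = region
        cmd.append(f"{name}:{start}-{end}")
    return cmd


def run_faidx(fasta_path, region=None):
    """Run samtools faidx and return its standard output as text."""
    if shutil.which("samtools") is None:
        raise RuntimeError("samtools was not found on PATH")
    result = subprocess.run(
        faidx_command(fasta_path, region),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling `run_faidx("ref.fa")` once would build the index, and later calls such as `run_faidx("ref.fa", ("chr1", 5, 13))` would return the subsequence in FASTA format.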

See http://www.htslib.org/doc/samtools.html for usage.
