jellyfish K-mer analysis and genome size estimate

时间:2023-03-09 17:12:35
jellyfish  K-mer analysis and genome size estimate
http://www.cbcb.umd.edu/software/jellyfish/
http://www.genome.umd.edu/jellyfish.html
https://github.com/gmarcais/Jellyfish/releases
wget https://github.com/gmarcais/Jellyfish/releases/download/v2.2.3/jellyfish-2.2.3-CentOS6.tar.gz
tar -zxvf jellyfish-2.2.3-CentOS6.tar.gz
jellyfish就在bin里面,直接将这个可执行程序复制到你的环境变量目录里就可以用了
服务器安装包地址:
/home/cmiao/jellyfish-2.2.3
jellyfish不能用fq.gz 要先转为fq才行
用gunzip -c *.fq.gz > *.fq 
$ jellyfish count -t 30 -C -m 21 -s 150G  --min-quality=20 --quality-start=33 ./*.fastq
Assume a haploid genome, for simplicity. In the picture provided, the first peak at depth ~31 indicates amount of 1-copy content (in other words, the genome has exactly 1 copy of that kmer, so it is unique). The weak peak at ~62x indicates the amount of 2-copy content. Everything under ~11x can be assumed to be error kmers, unrelated to genome size.

So, to estimate manually, take the sum of the counts of unique kmers under the first peak and multiply by 1; add the sum of the counts of unique kmers under the peak at 2x the depth of the first peak and multiply by 2; etc, for all peaks. This will give you the haploid genome size. So if your genome is tetraploid, the actual size will be 1/4 of your result, since the first peak will correspond to mutations present on only 1 ploidy (1/0/0/0 genotype).

You can make this more accurate by modelling the peaks as a sum of Gaussian curves, but that probably won't change the result much. Of course, this method is subjective because calling peaks is subjective.

Please note - I think 17-mers are too short for this kind of analysis. I prefer 31-mers because they are the longest computationally-efficient kmers. Also, FYI, BBNorm is faster than Jellyfish and can also generate kmer-frequency histograms:

khist.sh in=reads.fq hist=khist.txt

Also, it makes more sense to plot these things as log-log rather than linear-linear; and the Y-axis should be count, not frequency, which is useless for the purpose of genome-size estimation.

Outline

  1. count k-mer occurence using Jellyfish (jellyfish count)
  2. summarize as histogram (jellyfish histo)
  3. plot graph with R
  4. determine the total number of k-mer analyzed and the peak position
  5. compare the peak shape with poisson distribution

Count k-mer occurence

In this example we have 5 pair of fastq files in three different subdirectories. The file to process can be specified with "*/*.qf.fastq" and veriied with ls.

$ ls */*.qf.fastq
run1/s_1_1_sequence.qf.fastq  run2/s_2_2_sequence.qf.fastq
run1/s_1_2_sequence.qf.fastq  run3/s_1_1_sequence.qf.fastq
run2/s_1_1_sequence.qf.fastq  run3/s_1_2_sequence.qf.fastq
run2/s_1_2_sequence.qf.fastq  run3/s_2_1_sequence.qf.fastq
run2/s_2_1_sequence.qf.fastq  run3/s_2_2_sequence.qf.fastq

Next, we issue the jellyfish count command

jellyfish count -t 8 -C -m 25 -s 5G -o spec1_25mer --min-quality=20 --quality-start=33 */*.qf.fastq
-t 8
specifies the number of threads to be used. This value should be equal to the number of cores on the machine or the number of slots you reserved through job management system ($NSLOTS in SGE or UGE).
-C
specifies the both strands are considered. If you do not specify this, the apparent depth would be half, --- that is undesirable
-m 25
specified that now you are counting for 25 mer (i.e., k=25)
-s 5G
is some kind of magical number specification of hash size. This should be as high as the physical memory allows. The higher the faster, but exceeding the available memory leads to failure or extremely slow counting.
-o spec1_25mer
specifies the prefix of output file names.
--quality-start=33
specified that your fastq file have 33 based quality value string. Be careful on the dataformat. There are cases that your data are 64 based depending on the sequending system and software versions. This is relevant only when you specify --min-quality
--min-quality=20
specifies that nucleotide having qv lower than 20 should not included in the count. This selection reduces the k-mers derived from sequence errors and make the peak clearer.
*/*.qf.fastq
will be expanded to the ten filenames explained above by the shell and passed to jellyfish as input files

summarize as histogram (jellyfish histo)

First confirm that you got the output file

$ ls spec1_25mer*
spec1_25mer_0

now that there is a single file spec1_25mer_0

$ jellyfish histo -o spec1_25mer.histo spec1_25mer_0

Confirm that you got the output

$ ls spec1_25mer*
spec1_25mer_0  spec1_25mer.histo

Examine the numbers by your eyes

$ head -25 spec1_25mer.histo
1 461938583
2 95606044
3 19280477
4 13836754
5 11018480
6 9555090
7 8557935
8 7863244
9 7319505
10 6920880
11 6589723
12 6321923
13 6148638
14 6036120
15 5972264
16 5962234
17 5987696
18 6051171
19 6154429
20 6297373
21 6485135
22 6700579
23 6932570
24 7217627
25 7533211
运行出错:
terminate called after throwing an instance of 'jellyfish::invertible_hash::ErrorAllocation'
  what():  Failed to allocate 628292358736 bytes of memory
Aborted (core dumped)
解决:
50G should be more than enough. That is the amount of memory I usually am using and I have never had any problems. This also means that you probably do not need the high memory nodes.
freemao
FAFU