Binning And Mapping In Gingko Fixed

3 minute read

Making bins

In general terms, ginkgo’s approach is to construct bins with the property that an equal number of sequences map uniquely to points in the bin. These bins are constructed by first, fixing three parameters:

N, the expected number of reads per bin
k, the length of the sequences produced by the sequencing machine (150, for example)
align, the name of the alignment program (bwa or bowtie)

Next, within each chromosome with sequence C, consider in succession the sequences C[i:i+k] and map them to the reference genome using the chosen alignment program. Keep only those reads that:

have a quality score >= 30
map uniquely to the reference genome (indicated by a ‘NM:i:0’ in the sam/bam file output by the chosen aligner)

The first bin then runs from 0 to the end of the Nth uniquely mapped sequence; the next from the beginning of the N+1-st uniquely mapped sequence to the end of sequence 2N, and so on.

In fact, ginkgo is a little bit cleverer than this:

it locates the centromere of each chromosome by finding the longest string of ‘N’ in the sequence
it divides the portions of the chromosome on either side of the centromere into bins
since the length of these portions isn’t a multiple of N, it distributes the remainder over the bins

Finally, ginkgo works out the GC content of each bin and saves that as well. This is done simply by counting the GgCc characters as a fraction of the length of the sequence (excluding N characters). This information is stored in a separate file and used later for bias correction.

See:

genome/scripts where the code for this is located
buildGenome, the particular script that controls the process.

Mapping reads to bins in Ginkgo (compare `bedtools coverage`)

Ginkgo maps the reads to bins by assigning the read to the bin where the read begins. This is in contrast to the bedtools coverage command, which maps a read to a bin if it overlaps that bin. In practice, the differences are slight. In one experiment, 90% of the bin counts computed by bedtools agreed exactly with the counts from ginkgo, and the remaining 10% of the counts differed by less than 1%.

More specifically, we can illustrate this with some specific calculations. We work with the following files to begin:

SCW-11_Bone-marrow.bed is the initial data file computed by running samToBed on a bam file from the alignment output.
variable_500000_76_bwa is the file computed by ginkgo identifying bins with 500kb unique reads of length 76 bp based on bwa alignment. This file can be downloaded along with the many others from the ginkgo home site.

The small C++ program binUnsorted.cpp in the ginkgo distribution maps reads to the bins. It takes as arguments:

the file of bins,
the length of that file,
the bed input file,
the name of the sample to be put as the first line of the output file,
and the name of the output file.

$ $GINKGO_PATH/scripts/binUnsorted variable_500000_76_bwa `wc -l < variable_500000_76_bwa` \
SCW-11_Bone-marrow.bed SCW-11_Bone-marrow SCW-11_Bone-marrow_mapped

!~/GitHub/ginkgo/scripts/binUnsorted variable_500000_76_bwa `wc -l < variable_500000_76_bwa` SCW-11_Bone-marrow.bed SCW-11_Bone-marrow SCW-11_Bone-marrow_mapped

import pandas as pd
import numpy as np
import seaborn as sns

binCounts = pd.read_table('SCW-11_Bone-marrow_mapped',names=['counts'],skiprows=1)
binCounts.head()

	counts
0	3811
1	1342
2	2080
3	1553
4	2351

Excluding read counts greater than 5000 (of which there are 28), we obtain a nice poisson/negative binomial appearing distribution of counts.

sns.distplot(binCounts[binCounts['counts']<5000],kde=False)

Let’s retry this using bedtools coverage. First, the bin file variable_500000_76_bwa needs to be put in bed format. Currently it is only a list of the consecutive ends of the bins. This can be done pretty easily with awk.

!cat variable_500000_76_bwa | awk 'BEGIN {x=0} ; /chr[0-9XY]+/ {if ($2>=x) { print($1"\t"x"\t"$2); x=$2 } else {x=0 ; print($1"\t"x"\t"$2); x=$2}}' > variable_500000_76_bwa.bed

Now we use bedtools coverage to compute counts. We use the -F .75 flag so that a read must overlap a bin by at least 75% of its length to be counted. Given the large bin sizes and the short reads this will make sure each read gets assigned to only one bin.

!bedtools coverage -counts -F .75 -a variable_500000_76_bwa.bed -b SCW-11_Bone-marrow.bed > SCW-11_Bone-marrow.counts
Counts = pd.read_table('SCW-11_Bone-marrow.counts',names=['chr','start','end','bedtools_count'])

Counts['ginkgo_counts']=binCounts['counts']

Counts.head()

	chr	start	end	bedtools_count	ginkgo_counts
0	chr1	0	978962	3811	3811
1	chr1	978962	1483420	1339	1342
2	chr1	1483420	1989002	2080	2080
3	chr1	1989002	2493106	1553	1553
4	chr1	2493106	3045688	2351	2351

The difference between the two counts is very slight – less than 1%. The two counts are identical 89% of the time, and never differ by even 1%.

Counts['diff']=Counts['bedtools_count']-Counts['ginkgo_counts']
print('Percent of bins where the counts differ:',100*(Counts['diff']!=0).sum()/Counts.shape[0],'percent')
print('Largest percent difference (in absolute value):',100* np.abs((Counts['diff']/Counts['bedtools_count'])).max(),"percent")

Percent of bins where the counts differ: 11.509501613481534 percent Largest percent difference (in absolute value): 0.7905138339920948 percent

Jeremy Teitelbaum

Binning And Mapping In Gingko Fixed

Making bins

Mapping reads to bins in Ginkgo (compare `bedtools coverage`)

You May Also Enjoy

Updates

Polya’s Urn

deFinetti’s Theorem Part II

deFinetti’s Theorem

Jeremy Teitelbaum

Making bins

Mapping reads to bins in Ginkgo (compare bedtools coverage)

You May Also Enjoy

Updates

Polya’s Urn

deFinetti’s Theorem Part II

deFinetti’s Theorem

Mapping reads to bins in Ginkgo (compare `bedtools coverage`)