Notes on sequencing files
Paired End Reads
In paired end sequencing, each “piece” of DNA gets sequenced from both ends. So, for example, if you have 100 bp reads, and a 300 bp fragment, then you will get the sequence of the first and last 100 bp of the fragment. The middle part is missing. (It can also happen that the two parts overlap). The aligner matches these two pieces up using the RNEXT/PNEXT fields and the SAM flags.
If both ends of the read are mapped to the same chromosome, then RNEXT is ‘=’ and PNEXT is the start position of the paired read. The TLEN field gives the distance between the left most and right most bases in the pair provided both are on the same chromosome, so TLEN is the length of the entire fragment. TLEN is positive on the left segment, and negative on the right. The flags can tell you if both fragments are on the forward or reverse strand.
PCR Duplicates
Hypothesis is that cuts in the genome for sequencing are random and independent so the chance of getting two identical reads is essentially zero. To get 100x coverage using ~100 bp reads means you need 30 billion reads of the 30 billion read genome. So the expected number of times each base pair is the starting pont of a read is one. The chance of picking the same starting point twice is 1 in 30 billion.
PCR duplicates are flagged in the SAM/BAM file by setting the 0x400 bit in the SAM flags.
When looking at duplicates in paired end sequencing, the aligner tests if the entire fragment is a duplicate by looking at both of the pieces – it’s a duplicate if both pieces are identical.
Comparing the sequence to the reference
The SAM/BAM file contains the sequences together with a variety of information that allows at least a partial comparison of the sequence to the reference. The key fields for this are the CIGAR string, the NM field, and the MD field.
The CIGAR string describes how the sequence was mapped against the reference – which positions were matched, which skipped, and which added. But a ‘match’ could mean that the corresponding base pairs are different. For example, for 100 bp reads, a CIGAR score of 100M means that the aligner’s optimal alignment lines up 100 bases from the read against 100 base pairs of the reference – but you can’t know from this how many of those bases might be different.
The NM field gives the edit distance between the reference and the sequence. So NM:i:0 means an edit distance of zero, so CIGAR of 100M and NM:i:0 means an exact match.
The MD field gives you information about the actual substitutions compared against the reference. For example, a CIGAR of 100M and an MD:Z: field of 2C76A4G4A5C4 means 2 matches, then the reference has a C, then 76 matches, then the reference has an A, then 4 matches, then the reference has an A, then 5 matches, then the reference has a C, then 4 matches. (Total of 2+1+76+1+4+1+4+1+5+1+4=100).
One thing to remember: the MD field won’t show insertions into the sequence because it only provides information about the reference. However, insertions are marked in the CIGAR string. So for example, suppose a 100 base read matches against a 99 base region in the reference, with one insertion. The CIGAR will look something like 40M1I59M but the MD field would be MD:Z:99 – this tells us that there are 40 identical bases, then an insertion of 1 base in the read, then another 59 matches. The inserted base is in position 41 of the read.