CBS on real data
The file SCW-11_Bone-marrow.bedtools.counts contains real count data from a low-coverage whole-genome sequencing run.
The counts were computed using ‘bedtools coverage’, and the bins are those described in the discussion of ginkgo binning.
The resulting segmentation plot looks like this.
[Figure: segmented data]
The code to produce this was:
# load the libraries and get set up
import cbs
import pandas as pd
import numpy as np
import seaborn as sns
sns.set_style('darkgrid')
# read in the data, focus on chromosome 1, and drop outliers
# (bins whose count is at or above the 95th percentile)
df = pd.read_table('SCW-11_Bone-marrow.bedtools.counts', header=None,
                   names=['chr', 'start', 'end', 'counts'])
df1 = df[df['chr'] == 'chr1']
threshold = np.percentile(df1['counts'].values, 95)
df1a = df1[df1['counts'] < threshold]
data = df1a['counts'].values
# segment, validate, and draw the figure
L = cbs.segment(data)
S = cbs.validate(data, L)
ax = cbs.draw_segmented_data(data, S, title='Segmentation of counts from chromosome 1')
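
If you do not have the counts file to hand, the preprocessing step can be tried on a toy table first. The following is a minimal sketch, not part of the original analysis: the six rows in toy_counts are made up, and the layout (tab-separated, no header, columns chr/start/end/counts) is assumed from the read_table call above. It only covers the filtering and does not call cbs.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the counts file. All values are invented;
# the column layout is assumed from the read_table call in this section.
toy_counts = io.StringIO(
    "chr1\t0\t1000\t10\n"
    "chr1\t1000\t2000\t12\n"
    "chr1\t2000\t3000\t11\n"
    "chr1\t3000\t4000\t500\n"  # one extreme bin that should be dropped
    "chr1\t4000\t5000\t9\n"
    "chr2\t0\t1000\t10\n"
)

df = pd.read_table(toy_counts, header=None,
                   names=['chr', 'start', 'end', 'counts'])

# Keep chromosome 1 only.
chr1 = df[df['chr'] == 'chr1']

# 95th percentile of the chromosome 1 counts (linear interpolation).
threshold = np.percentile(chr1['counts'].values, 95)

# Strict '<' keeps only bins below the threshold.
kept = chr1[chr1['counts'] < threshold]
data = kept['counts'].values

print(threshold)
print(data)
```

Note that once bins are dropped, positions in data no longer line up one-to-one with the original bin order, so kept['start'] and kept['end'] are what you would use to map a segment boundary back to genomic coordinates.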