filtering
Performs data filtering for mapped reads based on their alignment quality, and file format conversion before high-level data analysis,before visulizing reproducibility among biological replicates.
Arguments
Required arguments:
-s, --sam: Input file folder of SAM files exported from alignment for filtering.-cs, --chromosome_sizes: Input file of sorted chromosome sizes information.-bl, --blacklist: Input file in BED format with genomic region that should be excluded.
Optional arguments:
-o, out_dir: Output directory, default = current working directory.-sn, --spike_in_normInput must be either “True” or “False”. If the files are being spike-in normalized after filtering or not. If set to “True” the function will not create BedGraph files, as these are created by thespike_in_calibrationfunction. Default=False-atac, --atac_seq_shitInput must be either “True” or “False”. Shift start point in ATAC-seq data +4 on the forward (+) strand and -5 on the backwards (-) strand. Default=False
Example Usage
Example usage for samples that should not be shifted:
$ epimapper filtering -s /Users/me/results/Epimapper/alignment/removeDuplicate/sam_duplicates_removed -cs /Users/me/in_folder/hg38_chromosome_sizes.txt -bl /Users/me/in_folder/hg38_blacklist.bed -o /Users/me/results
Example usage for ATAC-seq data that should be shiftet:
$ epimapper filtering -s /Users/me/results/Epimapper/alignment/removeDuplicate/sam_duplicates_removed -atac True -cs /Users/me/in_folder/hg38_chromosome_sizes.txt -bl /Users/me/in_folder/hg38_blacklist.bed -o /Users/me/results
Example usage for data that should be spike-in normalized after filtering:
$ epimapper filtering -s /Users/me/results/Epimapper/alignment/removeDuplicate/sam_duplicates_removed -sn True -cs /Users/me/in_folder/hg38_chromosome_sizes.txt -bl /Users/me/in_folder/hg38_blacklist.bed -o /Users/me/results
Output
Like all the other functions in EpiMapper Python package, the function will create a main Epimapper output directiry, if it is not already present in the chosen output directory. Further, this function will create multiple folders for each file conversion.
Natrually, the “bam” folder will contain BAM files, both unedited as well as a mapped and sorted version. The “bed” folder will contain BED files, which are sorted and filtered on alignment quality score framgent length (< 1000 bp) and blacklist removed. Here, the blacklist removed files are in 500 pb bin form, meaning that the BED file has this form [chromosome start end count], where each “start” and “end” are a 500bp intervall and “counts” contains the number of fragments which were present in this intervall.
The “bedgraph” folder contains the BedGraph version of the sorted bed files. Lastly, the function will also create two JPG files contraining correlation heatmap plots. These plots are based on the log-transformed counts in blacklist removed 500 bin files, and the “corrcoef_heatmap4logCount_all.jpg” includes every bin entry, while “corrcoef_heatmap4logCount_filtered_gt8.jpg” only accepts bin entries with a total count of 8 across all samples.
Epimapper
|- alignment
| |- bam
| | |- "sample-name".bam
| | |- "sample-name".mapped_sorted.bam
| |- bed
| | |- "sample-name".fragments.bed
| | |- "sample-name".fragments_sorted.bed
| | |- "sample-name".fragments_sorted.500b.windows.BlackListFiltered.bed
| |- bedgraph
| | |- "sample-name".fragments.bedgraph
|- summary_tables
| |- corrcoef_heatmap4logCount_all.jpg
| |- corrcoef_heatmap4logCount_filtered_gt8.jpg