BWA-DoAll

Comai Lab, UC Davis Genome Center 
Meric Lieberman, Isabelle Henry, 2012 
This work is the property of UC Davis Genome Center – Comai Lab 

Use at your own risk. 
We cannot provide support. 
All information obtained/inferred with this script is without any implied warranty of fitness for any purpose or use whatsoever. 

BWA-DoAll, a package for the batch alignment and processing of FASTQ libraries.

Current version can be downloaded here

This program processes library FASTQ (.fq) files and produces .sorted.bam files as well as a series of intermediate files (see Figure).

INPUT:
This program must be given a folder containing a series of fastq files (.fq). The script will look at the contents of the folder and run on all files within that folder that start with “lib” and end with “.fq”.
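
The file selection rule above can be sketched in Python as follows. This is only an illustration of the "starts with lib, ends with .fq" rule; the function name find_library_fastqs is hypothetical and is not taken from the script itself.

```python
import os

def find_library_fastqs(folder):
    """Return the sorted names of files in `folder` that start with
    'lib' and end with '.fq' (hypothetical helper, for illustration)."""
    return sorted(
        name for name in os.listdir(folder)
        if name.startswith("lib")
        and name.endswith(".fq")
        and os.path.isfile(os.path.join(folder, name))
    )
```

With this rule, a file named lib1.fq would be picked up, while sample1.fq or lib1.fastq would be ignored.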

OUTPUT:
Folders for sam, sai, sorted bam, bai and original fq files will be created, and the files for each library placed in their respective folders. Folders and files for non-overamp sams and uncut fq files will also be created if the corresponding options are selected (see below).

This process includes the following:

  • If the appropriate option is selected, the reads in the fq file are first scanned for chimeric reads. This is only relevant for low-complexity libraries, for which a restriction enzyme was used during library preparation. In this case, the script searches for the restriction site within each sequence read (other than at the very beginning or at the very end). If a site is found, the read is cut immediately after the restriction site. New .fq files are generated with nc appended to the filename (which stands for no chimeric). The original fastq files are retained, and a file called “rescan-cut-log.txt” is generated, indicating how many reads were cut.
  • If a paired-end alignment is specified, the INTERLEAVED paired-end FASTQ file is split into forward/reverse files for processing. These files are deleted after processing.
  • Specified alignment (single or paired) is performed with bwa.
  • Overamp: this part outputs the number of “unique” reads and the total number of aligned reads to a file called “master-OverAmp.txt”. A read is considered clonal if it maps to the exact same position as another read. If reads are mapped as pairs, two pairs are considered clonal if both the F and R reads map to the same positions as those of another pair of reads.
    • If the -o/--overamp option is used, a new “unique.sam” file is created. Clonal reads are discarded and only unique reads are retained. The “unique” file will be used for further processing, instead of the original sam file.
  • The .bam files are generated from the sam files and the .sorted.bam files are generated subsequently. Finally, the original unsorted bams are deleted.
  • Here is a tool that explains the samtools flags.
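
The chimeric-read step in the list above can be illustrated with a minimal sketch. It assumes the read sequence and its quality string are already in memory, and that "cut immediately after the restriction site" means keeping everything up to and including the site. The function name cut_chimeric and the example site are hypothetical; the actual script may handle edge cases differently.

```python
def cut_chimeric(seq, qual, site):
    """Cut a read right after the first internal occurrence of `site`.

    Returns (seq, qual, was_cut). A site at the very beginning or at the
    very end of the read is ignored, as described in the text.
    """
    idx = seq.find(site, 1)          # start at 1 to skip a site at position 0
    if idx == -1:
        return seq, qual, False
    end = idx + len(site)
    if end >= len(seq):              # the site sits at the very end of the read
        return seq, qual, False
    # Keep the sequence (and matching qualities) up to and including the site.
    return seq[:end], qual[:end], True
```

For example, with the site AAGCTT (used here purely as an illustration), the read CCAAGCTTGGGG would be cut to CCAAGCTT, while AAGCTTCCGGTT and CCGGTTAAGCTT would be left unchanged.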

PARAMETERS, default value in []:

REQUIRED:
-d or --database, the reference database for mapping (.fa file)
OPTIONAL:
-c or --chimeric, remove chimeric reads, generates new nc.fq files [Off]
-o or --overamp, remove duplicate reads, use unique .sam files [Off]
-p or --paired, align and process in paired-end mode; requires the input .fq files to be interleaved, i.e. files in which the forward and reverse reads from one cluster follow each other directly. [single-ended]
-t or --thread, use additional threads for alignment [1]
-q or --trimqual, quality threshold for read trimming, passed to bwa as bwa aln -q X during alignment (see bwa) [20]
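
To make the interleaved format required by -p/--paired concrete, here is a minimal sketch of splitting interleaved FASTQ lines into forward and reverse sets. It assumes standard four-line FASTQ records that strictly alternate forward, reverse; the function name split_interleaved is hypothetical and this is not the script's own code.

```python
def split_interleaved(lines):
    """Split interleaved FASTQ lines into (forward, reverse) lists of lines.

    Assumes 4 lines per record and records alternating forward, reverse,
    so every read pair occupies 8 consecutive lines.
    """
    if len(lines) % 8 != 0:
        raise ValueError("expected whole forward/reverse pairs (8 lines each)")
    forward, reverse = [], []
    for i in range(0, len(lines), 8):
        forward.extend(lines[i:i + 4])       # first record of the pair
        reverse.extend(lines[i + 4:i + 8])   # second record of the pair
    return forward, reverse
```

A file that is not interleaved in this way (for example, all forward reads followed by all reverse reads) would be split incorrectly, which is why the option requires interleaved input.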

NOTE: The default parameters assume that overamp-3.py is in the same directory as this program, and that bwa and samtools have been installed to /usr/bin/. If this is not true, the paths of all three (overamp, bwa, samtools) must be specified with the following command line parameters:

--bwa or -b, path for bwa
--samtools or -a, path for samtools
--scripts or -s, path to overamp-3

The sorted.bam files that are generated here can then be used to create an mpileup file using our mpileup script package, which is then used for mutation detection / genotyping using our MAPS package.

OverAmp-3.py

DESCRIPTION: This program looks at a sam file, identifies the unique reads and, optionally, retains only the unique reads in a new unique.sam file. It can be used on a single-end or paired-end alignment. A read is considered clonal if it maps to the exact same position as another read. If reads are mapped as pairs, two pairs are considered clonal if both the F and R reads map to the same positions as those of another pair of reads.
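
The clonal-read rule can be sketched for the single-end case as follows. This is a simplified illustration, not the code of OverAmp-3.py: it assumes that "same position" means the same reference name and start position (SAM columns 3 and 4), that unmapped reads (flag 0x4) are skipped, and that the first read seen at a position is the one retained. The function name count_unique is hypothetical.

```python
def count_unique(sam_lines):
    """Return (n_unique, n_aligned, kept_lines) for single-end SAM lines."""
    seen = set()
    kept = []
    aligned = 0
    for line in sam_lines:
        if line.startswith("@"):            # header lines are passed through
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 4:                        # 0x4 = read unmapped, not counted
            continue
        aligned += 1
        key = (fields[2], int(fields[3]))   # (reference name, start position)
        if key not in seen:                 # first read at this position is kept
            seen.add(key)
            kept.append(line)
    return len(seen), aligned, kept
```

For paired reads, the same idea would apply with a key built from the positions of both the F and R reads, so that a pair is only discarded when both mates coincide with those of an earlier pair.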

INPUT: This program takes a samfile as input

OUTPUT: This program outputs a unique read sam file, and prints alignment statistics to the screen. The output columns are [Filename, #unique, #aligned, #unique, total reads].

NOTE: This program is designed to be used in conjunction with bwa-samtools-do-all.py, with its printed output redirected to a single file that combines all libraries; that is why there are no headers for the printed output. Also, overamp generates the unique files while doing the counts, not independently.

PARAMETERS, default value in []:

REQUIRED:
-f or --samfile, the input .sam file
-o or --outfile, the output unique .sam file name
OPTIONAL:
-s or --sort, sort sam by chromosome and start position
-p or --paired, switches to paired-end sam file input. Default is single-ended