Mpileup Generation and Parsing Tools

Comai Lab, UC Davis Genome Center 
Meric Lieberman, Isabelle Henry, 2012 
This work is the property of UC Davis Genome Center – Comai Lab 

Use at your own risk. 
We cannot provide support. 
All information obtained/inferred with this script is without any implied warranty of fitness for any purpose or use whatsoever.


[ Current version can be downloaded here.]

This package works in two steps. The first takes sorted.bam files generated by Samtools and uses the built-in mpileup function to produce a table of coverage by position across all libraries. The documentation pages for Samtools, sorted.bam files, and mpileup file format and usage can be found here. In part 2, this table is then used as input to produce a parsed mpileup file, a format developed for use with our MAPS mutation and polymorphism survey. It is basically a simplified version of the mpileup file, in which the information is summarized and compacted.

Please note that this is intended to be used in conjunction with our MAPS and bwa-doall packages, so the sorted.bam files, if produced by bwa-doall, will end in “sorted.bam” and either have ”lib” or ”Lib” in the file name. If the sorted.bam files are not generated by our bwa-doall scripts, you will have to make sure that all sorted.bam file names follow these criteria as well. For example, ulibE263.sorted.bam corresponds to library”uE263″ which will appear in all result tables filenames.

For multicore/threading purposes, the second program reads the entire file into memory before running, so it is recommended not to run it on systems with limited memory. It typically requires > 1.5 times the size of the parsed mpileup file in RAM to run. For example, if the mpileup.txt file is 20 gbps, > 30 Gbps of RAM are necessary to run the program on the whole file at once. Depending on the number of samples and the coverage, mpileup files can be very large and the required RAM will not be available. In this case, it is recommended to break the mpileup file into smaller chunks to be processed separately, and leave them in chunks if the result parsed mpileups are to be subsequently used in the MAPS pipeline. Both programs use command line parameters to set options, a list of which can be found i) by using a simple -h flag after the program name while executing (./program.py -h), ii) here in the program descriptions, and iii) at the top of the programs themselves.

Note: To see run parameters:

./beta-run-mpileup.py.py -h OR ./beta-run-mpileup.py --help
AND
./beta-mpileup-parser_vMEM-O.py -h OR ./ beta-mpileup-parser_vMEM-O.py --help

Run example, for a folder of sorted.bams and a mpileup file resultfilename.txt

First place all sorted.bam files in a directory, the first program runs on all the files in the current working directory.

Step 1: To run using the reference genome genome.fa, with mapping quality cutoff and base quality cutoff of 21, and max per-BAM depth of 4000:

./beta-run-mpileup.py  --reference_file /path/to/reference/used/in/mapping/genome.fa --output_file resultfilename.txt --mapqual 21 --basequal 21 --maxdepth 4000
OR
./beta-run-mpileup.py  -r/path/to/reference/used/in/mapping/genome.fa -o resultfilename.txt -q 21 -Q 21 -d 4000

Step 2: Continue to parse the file from above, with 2 threads.

./beta-mpileup-parser_vMEM-O.py --thread 2 --mpileup_file resultfilename.txt
OR
./beta-mpileup-parser_vMEM-O.py -t 2 -f resultfilename.txt 

Note that a result parsed mpileup file with the name‚ “parsed_” resultfilename.txt will be generated.

Below is the individual documentation for the two programs of the mpileup package.

Part 1: beta-run-mpileup.py 

This program is meant to be run on a directory of sorted.bam files. It will generate a mpileup file with columns for each library.

INPUT: This program is run in a folder full of .sorted.bam files as input

OUTPUT: This program outputs a mpileup file.

NOTE: If the program samtools is not in /usr/bin, then the path to samtools must be specified using the command line parameters

PARAMETERS, default value in []:

REQUIRED:
-r or reference_file, The alignment reference (fasta format) [required]
-o or--output_file, The output mpileup.txt filename [required]
OPTIONAL:
-q or --mapqual, Minimum mapping quality for an alignment to be used [20]
-Q or --basequal, Minimum base quality for a base to be considered [20]
-d or --maxdepth, Max per-BAM depth coverage to avoid excessive memory usage [8000]
-s or --samtools, File path to Samtools [/usr/bin/samtools]
Part 2: beta-mpileup-parser_vMEM-O.py 

This program parses a mpileup file to a simplified format to be used with our MAPS mutation and genotyping package.

INPUT: This program takes an mpileup.txt file as input

OUTPUT: This program outputs a parsed mpileup.txt file.

NOTE: This program reads the entire file into memory before parsing, so it is recommended not to be run on systems with limited memory. It typically requires 1.5 times the size of the mpileup file in RAM to run. Many machines will not have this, and in this case it is recommended to break the mpileup file into smaller chunks to be processed separately. (i. e. by chromosome or scaffold) If being used with the MAPS package: the first step in MAPS is also threaded so it may be best to leave the chunks separate and process them individually in MAPS as well. Results can be combined at the end without compromising the results. This program is threaded, so it can be used with the -t flag to specify the number of cores to be used

PARAMETERS, default value in []:

REQUIRED:
-f or --mpileup_file, The input mpileup file.
OPTIONAL:
-t or --thread, Number of cores to be used while processing. [1]
%d bloggers like this: