Tuesday, December 18, 2012

SNPs with SamTools

SNPs with SamTools

These are kind of my messy notes on SNP bioinformatics. Its best to consult the manuals and/or documentation first. SeqAnswers also is a great source to find a Q&A forum of other bioinformaticians doing similar work. A screenshot is attached to show to expected file output.

Bowtie and Tophat are run on a shared resource HPC computer. You can install them on your home folder or you can use module load when you write your script. To see what programs are already on Saguaro, you can type module avail. On your local machine, you can run http://samtools.sourceforge.net/cns0.shtml SamTools and BCFTools (part of Samtools). You should also probably have the IGV Genome browser, as well. SamTools can be used to find variations. It is a suite for storing, manipulating and analyzing alignments, like Bowtie outputs Samtools has two formats- user friendly SAM and binary BAM output. Sam tools can be used to find SNPs in a Bowtie output file. Run bowtie to align the reads, -S for sam output.


Here is the Saguaro script: 

------------------------------------------------------------------
#!/bin/bash

#PBS -l nodes=12
#PBS -l walltime=96:00:00
#PBS -j oe
#PBS -o /scratch/username/aapl49

cd /scratch/username/
module load bowtie
tophat -r 139 -o aapl49 -p 12 /scratch/username/kmer56_q20 Sample_DNA.fastq  Sample_DNA.fastq

exit
-------------------------------------------------------------------


I got all the files back from Tophat and downloaded them to my local drive from Saguaro.

Don't forget to add SamTools to your path.
cd /home/username 
ls -a
open .bash_profile
PATH=$PATH\:/Users/username/progs; export PATH
PATH=$PATH\:/Users/username/progs/samtools118/bcftools ; export PATH

We want accepted_hits.BAM file
(.bed will be a tab-delineated file, with insertions and deletions)
Put Sam Tools in your path, otherwise go in folder and run ./samtools

General Pattern:
samtools index accepted_hits.bam
samtools view -b accepted_hits.bam chr2 > accepted_hits.chr2.bam
samtools sort accepted_hits.chr2.bam file.bam

Convert SAM into BAM to sort
samtools view -bS -o ec_snp.bam ec_snp.sam
Then the BAM file is sorted (preparing for SNP calling)
samtools sort ec_snp.bam ec_snp.sorted

Sort the BAM files
samtools sort [-no] [-m maxMem] <in.bam> <out.prefix>
./samtools sort /Volumes/scratch/aapl49/accepted_hits49.bam /Volumes/scratch/aapl49/accepted_hits49_sorted
./samtools sort /Volumes/scratch/aapl75/accepted_hits75.bam /Volumes/scratch/aapl75/accepted_hits75_sorted

My colleague says that I don't need to sort or index the files, as they come out of Tophat already sorted. 

What I do need is the .fasta file to be indexed into a .fai file


SamTools SNP calling procedure:
From a sorted BAM alignment, raw SNP and indel calls are acquired by:
samtools mpileup /Volumes/scratch/aapl49/accepted_hits49_sorted.bam > raw.49.pileup
samtools  mpileup /Volumes/scratch/aapl75/accepted_hits75_sorted.bam > raw.75.pileup

To extract a certain sequence, faidx will index the file and create <ref.fasta>.fai on the disk
samtools faidx /Volumes/scratch/kmer56_q20.fa [region1 [...]]
you can rename .fa files to .fasta (its just the same)

samtools faidx ref.fasta

If you get this error message: [fai_build_core] different line length in sequence 'scaffold50'. Segmentation fault. Use:
sed '/^$/d' myFile > tt
to reformat the linebreaks.

Learn how to reformat text files in the BASH using the SED command! (cut/sort/replace/find)
samtools faidx /Volumes/scratch/kmer56_q20.fasta
The > denotes outputting the file (instead of displaying it in bash).
On the MacOSX, the general text editor will default to saving files in a .rtf format. This is a pain. You can change the setting so the default is a .txt file. I highly recommend doing this to prevent create symbols/problems in running things later on.


My colleague Dr. Walter Eckalbar emailed me the following information:

samtools mpileup -uf <genome> <bam1> <bam2> | bcftools view -bvcg - > <var.raw.bcf>

#To make vcf file:
&& bcftools view <var.raw.bcf> | vcfutils.pl varFilter -D10 > <var.flt.vcf>

#To make index file
bgzip var.flt.vcf
tabix -p vcf var.flt.vcf.gz


samtools mpileup -uf <genome> <bam1> <bam2> | bcftools view -bvcg - > <var.raw.bcf>
samtools mpileup -uf /Volumes/scratch/kmer56_q20.fasta  /Volumes/scratch/aapl49/49snps.in.bam /Volumes/scratch/aaplgg70/gg70snps.in.bam /Volumes/scratch/aapl75/75snps.in.bam /Volumes/scratch/aapl20a/20asnps.in.bam  | bcftools view -bvcg - > /Volumes/scratch/49_75_gg70_20a_snps.var.raw.bcf
This has all 4 individuals for comparison.

#To make vcf file:
&& bcftools view <var.raw.bcf> | vcfutils.pl varFilter -D10 > <var.flt.vcf>
&& bcftools view kmer56_q20.49to75.var.raw.bcf | vcfutils.pl varFilter -D10 > kmer56_q20.49to75.var.flt.vcf
&& bcftools view /Volumes/scratch/kmer56_q20.49to75.var.raw.bcf  | vcfutils.pl varFilter -D10 > /Volumes/scratch/kmer56_q20.49to75.var.flt.vcf

#To make index file
bgzip var.flt.vcf
bgzip /Volumes/scratch/kmer56_q20.49to75.var.flt.vcf
tabix -p vcf var.flt.vcf.gz
tabix -p vcf /Volumes/scratch/kmer56_q20.49to75.var.flt.vcf.gz
then I get a kmer56_q20.49to75.var.flt.vcf.gz.tbi file


You will want the program IGV to browse the files then. This is a Java program made by the Broad Institute. 


Additional Notes:
Quality index is logarithmic:
Q10 means that there is a 10% error rate
Q20 0.01 1% error cutoff
Q30 is 0.001 chance of error

Fasta is a reference file.
Raw data to align are fastq files.


Running Bowtie will give .ebwt, this is the indexed reference file.
Then use .ebwt as an input for alignment in Topcoat.
Create new directories for each run in topcoat, because it will generate non-unique file names.

Tophat better for picking up indels between the species.

Reading the output file: Where are the SNPS? 

If this is one of the lines in your pileup file:
chr2 218208 G 30 ....................,......... +(+++)++++!#+++!+)+++!*****(#(

The 5th column with the dots and commas is the one of interest. Those are the "reads" that map to that position.
What you are looking for is letters instead of dots and commas, those are "mismatches" to the reference and in theory your SNPs (of course, you have to have plenty of mismatches to the reference for it to be considered a variant).

First, sort the BAM files

To Index or Not to Index?
samtools index /Volumes/scratch/aapl49/accepted_hits49_sorted.bam
samtools index /Volumes/scratch/aapl75/accepted_hits75_sorted.bam 

Index file <aln.bam>.bai will be created.

From this I got the file outputs:
/Volumes/scratch/aapl49/accepted_hits49_sorted.indexed.bam.bai
/Volumes/scratch/aapl75/accepted_hits75_sorted.indexed.bam.bai

To check help: samtools mpileup
samtools view /Volumes/scratch/aapl49/accepted_hits49_sorted.bam chr2:20,100,000-20,200,000
samtools view /Volumes/scratch/aapl75/accepted_hits75_sorted.bam chr2:20,100,000-20,200,000
samtools tview accepted_hits49_sorted.bam /Volumes/scratch/kmer56_q20.fa

samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
samtools tview aln.sorted.bam ref.fasta

It might be better just to call for a certain chromosome at a time. 


Retrieve and print stats in the index file. The output is a TAB delimited file with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads.
samtools idxstats /Volumes/scratch/aapl49/accepted_hits49_sorted.bam
samtools idxstats /Volumes/scratch/aapl75/accepted_hits75_sorted.bam

Even though i specified an output directory, the raw.pileup and raw49.pileup files were placed in the samtools directory

The resultant output should be further filtered by:
samtools.pl varFilter raw.pileup | awk '$6>=20' > final.pileup
perl /Users/username/progs/samtools118/samtools.pl varFilter raw.pileup | awk '$6>=20' > final.pileup
to rule out error-prone variant calls caused by factors not considered in the statistical model.

No comments:

Post a Comment