Friday, January 31, 2014

Phylogenetics and Repeat Analysis

BACKGROUND ON THE PROJECT
For my master's thesis, I evaluated LINES and SINES in 4 Anole lizard species which derive from a single lineage of an adaptive radiation. LINES are long interspersed nuclear elements about 6 kb in length. SINES are shorter than LINES and contain a tRNA-like region, as well as a 3' sequence that is identical to a corresponding LINE partner. Autonomous repeats, like LINES, have the ability to cut and paste themselves throughout the genome, whereas nonautonomous elements, like SINES, rely on the transcriptional machinery generated by LINES for their transposition. How these elements evolve over time is of great interest, particularly the functional consequences of repeat-driven genomic rearrangements. Through the use of molecular phylogenetic tools, the evolution of repeat elements in four Anolis species may be better understood. While my thesis is embargoed until Dec 2015, I am reporting on methods not included in my final version as an educational resource.

BACKGROUND ON PHYLOGENETICS

Several methods were used to infer phylogenetic relationships. Detailed descriptions of these methods are provided in the methods section. Two distance-based methods and one character-based method were used to construct phylogenetic trees for the 4 Anolis lizard species. Distance-based methods for constructing phylogenetic trees first calculate pairwise distances between sequences, generate a matrix of those values, then construct the tree based on those values (Pevsner, 2011). With distance-based methods for predicting molecular phylogeny, the branch length corresponds to the divergence of sequences, which is based on the number of differences present.

The most frequently used distance-based method is the Neighbor-Joining (NJ) method. The NJ method begins with one cluster of sequences, unrooted and in a star-like structure. Then, through many iterations, connected neighbors are progressively joined together and a hierarchical topology is generated. The NJ algorithm creates a tree with minimal branch lengths and does not assume equal rates of substitution among branches.

The other most frequently used distance-based method is UPGMA, or unweighted pair group method of arithmetic averages. This method assumes a molecular clock, uniform rates of amino acid substitutions, and clusters sequences based on the generated distance matrix. The underlying assumptions can be problematic if the rate of substitution varies among branches and subfamilies, necessitating correction algorithms like the Poisson correction or the Gamma correction.

In contrast to distance-based methods, character based methods like maximum likelihood methods generate phylogenetic analyses by first estimating and optimizing parameters for a chosen model, accounting for stochasticity and calculating the most likely tree topology under the given model . Maximum likelihood trees are considered highly trustworthy evidence of phylogenetic relatedness.

After the trees were constructed, a method to gauge their accuracy was needed. Bootstrapping provides a way to “replicate” the findings over multiple trials. The bootstrap algorithm generates a randomized artificial data set that is very similar to the original one and through running many replicates, the accuracy of the original data set can be better determined. A greater “sample size” statistically produces more accurate results because the effects of randomness and bias are minimized. Therefore, a minimum of 1,000 bootstraps were used for all trees constructed in this thesis. When indicated, 500 bootstraps were used for the maximum likelihood trees, which were the most computationally intensive to produce.

BIOINFORMATICS
High-throughput next-gen sequencing creates a tremendous amount of data, that must be navigated effectively in order for analysis to occur. In order to conduct my research, I had to incorporate a wide range of bioinformatic and computational tools.

  LINE ANALYSIS
Several clades of LINES were analyzed, including the L1, CR1, RTE1 and RTE-BovB repeat element clades. Since the full length elements range from 5 to 10 kbp, only the reverse transcriptase (RT) domains were used to differentially select LINE sequences. The RT is roughly 300 base-pairs, but in some cases, as with Anolis apletophallus, sequences were truncated due to contig length and alignments of proteins 150 base pairs or more were used.

LINE/L1 CLADE ANALYSIS

Using Repbase, I pulled a consensus sequence for the Anolis carolinensis LINE element called L1_AC_1. I uploaded the DNA sequence into NCBI’s ORF Finder program. Next, I used NCBI’s web browser blastp program to find matching protein sequences for each L1_AC_1 – L1_AC_20 element.


NCBI’s ORF Finder user interface

Next, the protein alignments were visualized the protein sequence for just the L1-encoded reverse transcriptase was extracted.   


ORF Finder displays all 6 open reading frames



Next, I used NCBI’s in web blastp program to retrieve protein matches.


NCBI’s ORF Finder identified Anolis reverse transcriptase domains


Using ORF Finder, the coding regions were visualized.


NCBI’s ORF Finder alignment of homologous reverse transcriptase sequence



The alignment of the acar to the reverse transcriptase non-LTR-like protein is depicted above.

I was able to obtain the DNA sequence for Anolis carolinensis through Repbase. I then determined the region that codes for L1-encoded reverse transcriptase (RT). Using acar’s RT, I conducted a tblastn to find the transcriptase in other species.

Next, I collected all the L1_AC_1  –  L1_AC_20 elements for each species. An example is given below for Anolis apletophallus, for the L1_AC_20 elements.

Because LINES can be 5 - 10kb on average, with high divergences, it is very difficult to conduct a sequence analysis on them. The ability to locate LINES is highly dependent on the quality of the genome assembly, in particular the presence of long scaffolds, over 6 kb. The 1000 and 3000 bp libraries used allowed for the assembly of longer scaffolds from smaller contigs. Only afre and aaur contained average scaffold lengths larger than 6 kb. Subsequently, protein sequences of the highly conserved reverse transcriptase domains were used for a 4 way anole comparisons of LINES.

TREE-BUILDING ANALYSIS
Before I compared different anole species, I first made sure that the L1_AC proteins within one species aligned to each other.
Anolis auratus alignment of a LINE L1_AC_1 protein


L1_AC elements were identified through a tblastn search using Geneious (Biomatters Ltd.). Then, a Neighbor-joining tree was created using the Jukes-Cantor genetic distance model, and a neighbor-joining tree building method. 1,000 bootstraps were executed and set the threshold to 10%.

In order to prevent a biased result, each sequence screened for unique identities. If sequences are replicated, the tree construction will be biased. Likewise, the under-representation of repeats in an alignment file will give results biased towards greater divergences than is biologically observed. Thus, vigilant efforts were taken to reduce bias as much as possible.

The protein sequence for acar’s L1_AC_1 reverse transcriptase is listed below.

>acar_L1_AC_1_2p RT 2nd translation ORFinder
EATIIMIPKEDADLSETKNYRPISLLNTDYKIFTKILANRLKKFLNGWIGEDQTGFLPSRTIKDNVRVILDAMEFYEQHNQREVGFLSVDAEKAFDNLNWNFFKLLFQELDLGTQIQNGINSIYEDQWAKIQINGQETDKIVISKGTRQGCPLSPLIFIMALEI
LLKAIKKDINLQGIKIDNQDYKYRAFADDVICIIENPIQNIHKWIAKIDE
FGKVAGLKINKKKTMMLTKNMSKRKQKELQDIAGLETPLKLKYLGIWI

After I collected the DNA sequences for twenty L1_AC elements for each species, I then concatenated them into files to form a FASTA database with only species-specific L1_AC elements. Below is an example for Anolis auratus. See appendix for more on bioinformatic methods

Next I made a tblastn database in each folder with the L1_cat.fasta file for each species. An example using Anolis auratus is below.

makeblastdb -in /Users/cmay/30sep_aaur_L1_AC_1-20_cat.fasta -dbtype nucl -out aaur_L1

Then I did a protein blast using the acar reverse transcriptase from L1_AC_1 as a query

tblastn -query /Users/cmay/30Sep_acar_L1_AC_1_ORF_prot.txt -db aaur_L1 -out aaurdb_acarRTq_L1_AC_1_out.txt

After the protein sequences were collected, they were imported into Geneious for analysis. An alignment was created with MUSCLE, using 8 iterations. The following options were chosen to execute the protein alignments.
 
Geneious MUSCLE alignment options


 After aligning the proteins, I generated a consensus sequence using Geneious. This consensus will be the basis of comparison, used in MEGA to determine the percent divergence between different sequences. 

Once the proteins were aligned, a consensus sequence was generated and a tree was then constructed. The following options were used to construct trees. A minimum of 1,000 bootstraps were run for every tree featured, unless otherwise noted. While the support threshold varied per repeat element and species, values ranged from 10% to 25%. 


Geneious Tree building options


PHYLOGENETIC ANALYSIS
The phylogenetic analysis of protein sequences is based on amino acid substitution rates and statistical models that infer relatedness during the pairwise comparison of sequences (Pevsner 2009). In constructing phylogenetic trees, several methods were used. Using Geneious, two types of trees were constructed that rely on a distance, or similarity matrix. My phylogenetic analyses were conducted primarily on protein sequences. 

Due to the fragmented nature of repeats and rates of divergence between organisms, building trees with DNA sequences proved slightly problematic. While I constructed trees using both DNA and protein, the trees for proteins were focused on most heavily. The DNA trees provided additional support for the phylogenetic relationships I identified using proteins, in particular the identification of new repeat element families for the other species.

Phylogenetic trees were constructed from protein sequences. It is generally accepted that homologous protein sequences that share 85% similarity are conserved. (Tamura et al., 2011). Several criteria were used for designating a family status within the LINE/L1. The most important criteria was evidence of phylogenetic relatedness, based on bootstrap values. Values over 80 were considered strong evidence of relatedness, while values over 60 were considered weak and only used circumstantially to infer relatedness of very distal branches and in conjunction with other modes of evidence such as distance matrix data or maximum likelihood trees.

Multiple methods were used to construct trees including 2 distance-based methods, NJ and UPGMA, and 1 character-based method, Maximum Likelihood (ML) trees. Pairwise deletion was used to evaluate all of the gaps. Each model was tested in MEGA for reliability via the maximum likelihood statistical method.

MEGA Model Analysis menu options


For one species, it was determined that the best model to use is the JTT+G+F, or the Jones-Taylor-Thornton model with a gamma distribution rate among sites.

MEGA Model testing using Maximum likelihood for  species.

As these were the options initially chosen based on recommendations from the MEGA tutorial on the evolutionary analysis of amino acids, little revision needed to be done, except for changing the rates among sites from a uniform distribution to a gamma distribution.  

MEGA Pairwise Distances menu options for a species

To model continuous variables with a skewed distribution, the gamma distribution is often used, which is particularly useful for dealing with unequal substitution rates across sites (Pevsner, 2011). I used the MEGA defaults for the gamma parameter, with 1 as a gamma parameter for the distance estimation, and 5 for constructing maximum likelihood trees.

First, the individual L1 clade was characterized for each species. The L1 clade, depicted below, for Anolis carolinensis was previously identified to have 20 distinct families (Novick et al., 2011). In order to compare acar with the 3 new species, protein trees were constructed.

PHYLOGENETIC TREES
In conducting phylogenetic analyses, the choice to use either protein or DNA sequences needed to be determined. Because prior publications on Anolis carolinensis LINE/L1s were based on nucleotide sequences, I reproduced a phylogenetic DNA tree using only the protein ORF for the reverse transcriptase. Indeed, the protein tree is nearly identical to the DNA tree published in Novick et al., 2011. A DNA and protein tree are depicted below.





Neighbor-joining phylogenetic tree for the LINE/L1 clade in Anolis carolinensis, based on consensus DNA sequences. 500 bootstraps were used and the support threshold was set to 25%.





A Neighbor-Joining tree of Anolis carolinensis LINE/L1 clade based on consensus protein sequences. 1,000 bootstraps were used and the support threshold was set to 10%.

Well supported branching structures were observed on both trees, which were highly similar. Due to the close similarity of the DNA and protein NJ trees, it may be reasonable to use solely protein sequences in the molecular phylogenetic analysis. DNA trees were also produced, but are not pictured here.

Next, NJ and UPGMA trees were generated. Using the NJ tree, family designations were created. In cases of ambiguous tree placement, several factors were taken into account. In addition to generating NJ and UPGMA trees, I also generated a maximum likelihood tree. 

I then imported each alignment (including the consensus) into MEGA and I conducted a pair-wise divergence, with 1,000 bootstraps. MEGA produced a distance matrix with divergences and standard errors for those divergences.




 Distance Matrix from MEGA

This tells me what percent divergence from the family consensus was. I used this value to compare rates of sequence evolution among my various species.