Thursday, March 27, 2014

Finding Genes

How do you go about finding a gene? There are many resources online that are free to the public. 

The first step is to get the actual gene sequence. Ensembl is one of the best sites to look. Choose the species and then type in the gene of interest. In this case, I am going to compare the DSCAM gene in rat and in chicken.

What you'll see next is a summary of the gene with a description, where it's located and near the bottom is a cartoon visualization of the gene structure. The red boxes indicate an exon, where as the kinked "V" shape line between them are introns. You can get a general idea of how similar a gene is by visually comparing them, in this case between rat and chicken. Do they have the same number of exons and introns? Do they differ in size? 

To compare how similar gene sequences are between these two species, the sequences (either DNA or protein) should be downloaded. This will allow you to align them side by side to see what percent similarity they have.

DSCAM gene sequence of Rat

DSCAM sequence of chicken

In order to get the actual sequence, or the text, click on the export data button. 

Next, you will be given the choice of getting the DNA sequence (coding) or the protein sequence (peptide). Recall that introns are spliced out during mRNA processing, and that the final mRNA product has exons that are transcribed. This is known as the coding sequence

Below is the protein sequence. 

Proteins (which are made from units of amino acids assembled together) also come from this coding sequence, in particular groups of 3 nucleotides called the triplet codon. Since there is redundancy, meaning that one amino acid can have several ways of being made, it is often easier to compare a gene by its protein sequence if there is a lot of evolutionary distance between 2 organisms (ie. worms and turtles). If you try to compare the same gene in two distantly related species using nucleotide sequences, there may be too many changes between them to easily compare them. Changes to single nucleotides don't matter as much when you compare proteins, because even a single base change may end up producing the exact same amino acid, so it is more or less irrelevant to the functional consequence, or how the protein will act in the body of the animal.

SOC-DBR. (2014)

So, after you copy and paste the two sequences you want to compare (in this case DSCAM in chicken and in rat), you can align them using a free public tool called ClustalW. However, there is a large variety of alignment tools, some which have visualizations, and predictions about what the components of the gene are. I personally used Geneious a lot, but its not free (however you can download and use during the free trial period).  

 ClustalW will show you a side by side comparison of the protein sequence. L stands for leucine. S stands for serine. V stands for valine and so on. (You can google amino acid abbreviations to find a chart with all of them).  

You can visually see where there are differences in the amino acids and where there are similarities. The sequence will wrap around to the next line, with the numbers to the right indicating what amino acid you are at when the row ends. In this case, there are 450 total amino acids (or 1350 nucleotide bases) total. The first row is the rat sequence and the second row is the chicken sequence.

An important thing to look for are changes in amino acids that are different in charge. An amino acid can have polar, non-polar, negative, positive and neutral properties. If a mutation causes an polar amino acid to change to a non-polar amino acid, this may affect the final protein structure by changing its overall biochemical properties. This type of change is called a nonsynoymous change. In evolution, these types of changes have resulted in countless gain of function mutations, where all the sudden a new protein acts in the organism and may help drive evolution (if it offers some evolutionary advantage). One fascinating example of this is the FOXP2 gene in humans which helped human language to develop. 

As an exercise, you can try to look for some of these nonsynonymous changes above. For instance, a lysine (+) to a glutamate (-) change might alter how the final protein acts in the body.

UC-Davis Chemwiki. (2014).

You can also access statistics on how many amino acids match. In this case, there is 95% similarity between these two genes. This appears to be a highly conserved gene between rat and chicken. However, that 5% of difference may result in noticeable differences in the final protein structure. For instance, if there are many more negatively charged amino acids added, it can affect the way the overall protein folds or is shaped.

In the figure above, rat is the first line and chicken is the second line. The | lines inbetween denote where there is a perfect match between them. With 95% of the amino acids identical, the inference is that this gene is important, and so a selective pressure exists to keep it in the gene pool. 

For instance, if a mutation occurs that alters the protein, it may have huge consequences that affect fitness, ie if an truncated protein is produced that happens to create a disease (like in Cystic fibrosis), or perhaps this gene is necessary for proper development and embryos that don't have a normal copy die before they're born. The fact that a gene is highly conserved suggests that it has an important function.
There is another tool I like to use when comparing genes between species: The USCS Genome Browser. If you do a Blat search for one sequence, you can see if it shares homology with genes in other species.

This gives a visual of the alignment and allows you to see what the gene structure looks like in other species. In this case, the gene is highly similar in human, rat and chicken. 

Friday, January 31, 2014

Phylogenetics and Repeat Analysis

For my master's thesis, I evaluated LINES and SINES in 4 Anole lizard species which derive from a single lineage of an adaptive radiation. LINES are long interspersed nuclear elements about 6 kb in length. SINES are shorter than LINES and contain a tRNA-like region, as well as a 3' sequence that is identical to a corresponding LINE partner. Autonomous repeats, like LINES, have the ability to cut and paste themselves throughout the genome, whereas nonautonomous elements, like SINES, rely on the transcriptional machinery generated by LINES for their transposition. How these elements evolve over time is of great interest, particularly the functional consequences of repeat-driven genomic rearrangements. Through the use of molecular phylogenetic tools, the evolution of repeat elements in four Anolis species may be better understood. While my thesis is embargoed until Dec 2015, I am reporting on methods not included in my final version as an educational resource.


Several methods were used to infer phylogenetic relationships. Detailed descriptions of these methods are provided in the methods section. Two distance-based methods and one character-based method were used to construct phylogenetic trees for the 4 Anolis lizard species. Distance-based methods for constructing phylogenetic trees first calculate pairwise distances between sequences, generate a matrix of those values, then construct the tree based on those values (Pevsner, 2011). With distance-based methods for predicting molecular phylogeny, the branch length corresponds to the divergence of sequences, which is based on the number of differences present.

The most frequently used distance-based method is the Neighbor-Joining (NJ) method. The NJ method begins with one cluster of sequences, unrooted and in a star-like structure. Then, through many iterations, connected neighbors are progressively joined together and a hierarchical topology is generated. The NJ algorithm creates a tree with minimal branch lengths and does not assume equal rates of substitution among branches.

The other most frequently used distance-based method is UPGMA, or unweighted pair group method of arithmetic averages. This method assumes a molecular clock, uniform rates of amino acid substitutions, and clusters sequences based on the generated distance matrix. The underlying assumptions can be problematic if the rate of substitution varies among branches and subfamilies, necessitating correction algorithms like the Poisson correction or the Gamma correction.

In contrast to distance-based methods, character based methods like maximum likelihood methods generate phylogenetic analyses by first estimating and optimizing parameters for a chosen model, accounting for stochasticity and calculating the most likely tree topology under the given model . Maximum likelihood trees are considered highly trustworthy evidence of phylogenetic relatedness.

After the trees were constructed, a method to gauge their accuracy was needed. Bootstrapping provides a way to “replicate” the findings over multiple trials. The bootstrap algorithm generates a randomized artificial data set that is very similar to the original one and through running many replicates, the accuracy of the original data set can be better determined. A greater “sample size” statistically produces more accurate results because the effects of randomness and bias are minimized. Therefore, a minimum of 1,000 bootstraps were used for all trees constructed in this thesis. When indicated, 500 bootstraps were used for the maximum likelihood trees, which were the most computationally intensive to produce.

High-throughput next-gen sequencing creates a tremendous amount of data, that must be navigated effectively in order for analysis to occur. In order to conduct my research, I had to incorporate a wide range of bioinformatic and computational tools.

Several clades of LINES were analyzed, including the L1, CR1, RTE1 and RTE-BovB repeat element clades. Since the full length elements range from 5 to 10 kbp, only the reverse transcriptase (RT) domains were used to differentially select LINE sequences. The RT is roughly 300 base-pairs, but in some cases, as with Anolis apletophallus, sequences were truncated due to contig length and alignments of proteins 150 base pairs or more were used.


Using Repbase, I pulled a consensus sequence for the Anolis carolinensis LINE element called L1_AC_1. I uploaded the DNA sequence into NCBI’s ORF Finder program. Next, I used NCBI’s web browser blastp program to find matching protein sequences for each L1_AC_1 – L1_AC_20 element.

NCBI’s ORF Finder user interface

Next, the protein alignments were visualized the protein sequence for just the L1-encoded reverse transcriptase was extracted.   

ORF Finder displays all 6 open reading frames

Next, I used NCBI’s in web blastp program to retrieve protein matches.

NCBI’s ORF Finder identified Anolis reverse transcriptase domains

Using ORF Finder, the coding regions were visualized.

NCBI’s ORF Finder alignment of homologous reverse transcriptase sequence

The alignment of the acar to the reverse transcriptase non-LTR-like protein is depicted above.

I was able to obtain the DNA sequence for Anolis carolinensis through Repbase. I then determined the region that codes for L1-encoded reverse transcriptase (RT). Using acar’s RT, I conducted a tblastn to find the transcriptase in other species.

Next, I collected all the L1_AC_1  –  L1_AC_20 elements for each species. An example is given below for Anolis apletophallus, for the L1_AC_20 elements.

Because LINES can be 5 - 10kb on average, with high divergences, it is very difficult to conduct a sequence analysis on them. The ability to locate LINES is highly dependent on the quality of the genome assembly, in particular the presence of long scaffolds, over 6 kb. The 1000 and 3000 bp libraries used allowed for the assembly of longer scaffolds from smaller contigs. Only afre and aaur contained average scaffold lengths larger than 6 kb. Subsequently, protein sequences of the highly conserved reverse transcriptase domains were used for a 4 way anole comparisons of LINES.

Before I compared different anole species, I first made sure that the L1_AC proteins within one species aligned to each other.
Anolis auratus alignment of a LINE L1_AC_1 protein

L1_AC elements were identified through a tblastn search using Geneious (Biomatters Ltd.). Then, a Neighbor-joining tree was created using the Jukes-Cantor genetic distance model, and a neighbor-joining tree building method. 1,000 bootstraps were executed and set the threshold to 10%.

In order to prevent a biased result, each sequence screened for unique identities. If sequences are replicated, the tree construction will be biased. Likewise, the under-representation of repeats in an alignment file will give results biased towards greater divergences than is biologically observed. Thus, vigilant efforts were taken to reduce bias as much as possible.

The protein sequence for acar’s L1_AC_1 reverse transcriptase is listed below.

>acar_L1_AC_1_2p RT 2nd translation ORFinder

After I collected the DNA sequences for twenty L1_AC elements for each species, I then concatenated them into files to form a FASTA database with only species-specific L1_AC elements. Below is an example for Anolis auratus. See appendix for more on bioinformatic methods

Next I made a tblastn database in each folder with the L1_cat.fasta file for each species. An example using Anolis auratus is below.

makeblastdb -in /Users/cmay/30sep_aaur_L1_AC_1-20_cat.fasta -dbtype nucl -out aaur_L1

Then I did a protein blast using the acar reverse transcriptase from L1_AC_1 as a query

tblastn -query /Users/cmay/30Sep_acar_L1_AC_1_ORF_prot.txt -db aaur_L1 -out aaurdb_acarRTq_L1_AC_1_out.txt

After the protein sequences were collected, they were imported into Geneious for analysis. An alignment was created with MUSCLE, using 8 iterations. The following options were chosen to execute the protein alignments.
Geneious MUSCLE alignment options

 After aligning the proteins, I generated a consensus sequence using Geneious. This consensus will be the basis of comparison, used in MEGA to determine the percent divergence between different sequences. 

Once the proteins were aligned, a consensus sequence was generated and a tree was then constructed. The following options were used to construct trees. A minimum of 1,000 bootstraps were run for every tree featured, unless otherwise noted. While the support threshold varied per repeat element and species, values ranged from 10% to 25%. 

Geneious Tree building options

The phylogenetic analysis of protein sequences is based on amino acid substitution rates and statistical models that infer relatedness during the pairwise comparison of sequences (Pevsner 2009). In constructing phylogenetic trees, several methods were used. Using Geneious, two types of trees were constructed that rely on a distance, or similarity matrix. My phylogenetic analyses were conducted primarily on protein sequences. 

Due to the fragmented nature of repeats and rates of divergence between organisms, building trees with DNA sequences proved slightly problematic. While I constructed trees using both DNA and protein, the trees for proteins were focused on most heavily. The DNA trees provided additional support for the phylogenetic relationships I identified using proteins, in particular the identification of new repeat element families for the other species.

Phylogenetic trees were constructed from protein sequences. It is generally accepted that homologous protein sequences that share 85% similarity are conserved. (Tamura et al., 2011). Several criteria were used for designating a family status within the LINE/L1. The most important criteria was evidence of phylogenetic relatedness, based on bootstrap values. Values over 80 were considered strong evidence of relatedness, while values over 60 were considered weak and only used circumstantially to infer relatedness of very distal branches and in conjunction with other modes of evidence such as distance matrix data or maximum likelihood trees.

Multiple methods were used to construct trees including 2 distance-based methods, NJ and UPGMA, and 1 character-based method, Maximum Likelihood (ML) trees. Pairwise deletion was used to evaluate all of the gaps. Each model was tested in MEGA for reliability via the maximum likelihood statistical method.

MEGA Model Analysis menu options

For one species, it was determined that the best model to use is the JTT+G+F, or the Jones-Taylor-Thornton model with a gamma distribution rate among sites.

MEGA Model testing using Maximum likelihood for  species.

As these were the options initially chosen based on recommendations from the MEGA tutorial on the evolutionary analysis of amino acids, little revision needed to be done, except for changing the rates among sites from a uniform distribution to a gamma distribution.  

MEGA Pairwise Distances menu options for a species

To model continuous variables with a skewed distribution, the gamma distribution is often used, which is particularly useful for dealing with unequal substitution rates across sites (Pevsner, 2011). I used the MEGA defaults for the gamma parameter, with 1 as a gamma parameter for the distance estimation, and 5 for constructing maximum likelihood trees.

First, the individual L1 clade was characterized for each species. The L1 clade, depicted below, for Anolis carolinensis was previously identified to have 20 distinct families (Novick et al., 2011). In order to compare acar with the 3 new species, protein trees were constructed.

In conducting phylogenetic analyses, the choice to use either protein or DNA sequences needed to be determined. Because prior publications on Anolis carolinensis LINE/L1s were based on nucleotide sequences, I reproduced a phylogenetic DNA tree using only the protein ORF for the reverse transcriptase. Indeed, the protein tree is nearly identical to the DNA tree published in Novick et al., 2011. A DNA and protein tree are depicted below.

Neighbor-joining phylogenetic tree for the LINE/L1 clade in Anolis carolinensis, based on consensus DNA sequences. 500 bootstraps were used and the support threshold was set to 25%.

A Neighbor-Joining tree of Anolis carolinensis LINE/L1 clade based on consensus protein sequences. 1,000 bootstraps were used and the support threshold was set to 10%.

Well supported branching structures were observed on both trees, which were highly similar. Due to the close similarity of the DNA and protein NJ trees, it may be reasonable to use solely protein sequences in the molecular phylogenetic analysis. DNA trees were also produced, but are not pictured here.

Next, NJ and UPGMA trees were generated. Using the NJ tree, family designations were created. In cases of ambiguous tree placement, several factors were taken into account. In addition to generating NJ and UPGMA trees, I also generated a maximum likelihood tree. 

I then imported each alignment (including the consensus) into MEGA and I conducted a pair-wise divergence, with 1,000 bootstraps. MEGA produced a distance matrix with divergences and standard errors for those divergences.

 Distance Matrix from MEGA

This tells me what percent divergence from the family consensus was. I used this value to compare rates of sequence evolution among my various species.