In the past, two kinds of Markov models have been considered to describe protein sequence evolution. Codon-level models have been mechanistic, with a small number of parameters designed to take into account features such as transition-transversion bias, codon frequency bias and synonymous-nonsynonymous amino acid substitution bias. Amino acid models have been empirical, attempting to summarize the replacement patterns observed in large quantities of data and not explicitly considering the distinct factors that shape protein evolution.
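To make the term "mechanistic" concrete, the following is a minimal sketch of how a few parameters can define every entry of a codon rate matrix. It assumes a common formulation in which only single-nucleotide changes have non-zero rates. The names `kappa` (transition-transversion bias), `omega` (nonsynonymous-synonymous bias) and `freq` (codon frequencies) are illustrative assumptions, not notation taken from this work.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code listed in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
SENSE = [c for c in CODE if CODE[c] != "*"]  # the 61 sense codons

PURINES = {"A", "G"}


def is_transition(x, y):
    """True if x -> y stays within purines or within pyrimidines."""
    return (x in PURINES) == (y in PURINES)


def rate(i, j, kappa, omega, freq):
    """Unnormalised rate of change from codon i to codon j (hypothetical helper)."""
    diffs = [(a, b) for a, b in zip(i, j) if a != b]
    if len(diffs) != 1:
        return 0.0  # assumption: multi-nucleotide changes are not instantaneous
    a, b = diffs[0]
    r = freq[j]                 # codon frequency bias
    if is_transition(a, b):
        r *= kappa              # transition-transversion bias
    if CODE[i] != CODE[j]:
        r *= omega              # nonsynonymous-synonymous bias
    return r


# Illustrative uniform codon frequencies; real analyses estimate these from data.
uniform = {c: 1.0 / len(SENSE) for c in SENSE}
```

For example, `rate("TTT", "TTC", 2.0, 0.5, uniform)` describes a synonymous transition and is scaled by `kappa` only, whereas `rate("TTT", "TTA", 2.0, 0.5, uniform)` describes a nonsynonymous transversion and is scaled by `omega` only.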
We have estimated the first empirical codon model using maximum likelihood techniques. Our results show that modelling of the evolutionary process is improved by allowing for single, double and triple nucleotide changes; that the correspondence between DNA triplets and the amino acids they encode is a major factor driving evolution; and that the nonsynonymous-synonymous rate ratio is a suitable measure for classifying the substitution patterns observed in different proteins. It remains an open question, however, whether the double and triple changes are truly instantaneous. We aim to take advantage of newly available re-sequencing data to further improve our models and our understanding of the evolution of protein-coding sequences. Specifically, we estimate empirical codon substitution models from re-sequencing data of several Drosophila species.
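The distinction between single, double and triple nucleotide changes, and between synonymous and nonsynonymous ones, can be illustrated with a short sketch. It classifies every ordered pair of sense codons under the standard genetic code. The helper names `classify` and `tally` are hypothetical, and this is only a bookkeeping illustration, not the estimation procedure used in this work.

```python
from collections import Counter
from itertools import permutations, product

BASES = "TCAG"
# Standard genetic code listed in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
SENSE = [c for c in CODE if CODE[c] != "*"]  # the 61 sense codons


def classify(i, j):
    """Return (number of differing nucleotides, True if synonymous)."""
    n_diff = sum(a != b for a, b in zip(i, j))
    return n_diff, CODE[i] == CODE[j]


# Tally all ordered pairs of distinct sense codons by change type.
tally = Counter(classify(i, j) for i, j in permutations(SENSE, 2))

for (n_diff, synonymous), count in sorted(tally.items()):
    kind = "synonymous" if synonymous else "nonsynonymous"
    print(f"{n_diff}-nucleotide {kind}: {count} codon pairs")
```

An empirical codon model assigns a rate to each of these pairs, including the double and triple changes that a single-change mechanistic model sets to zero.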