Using RNA 3D Structure Data in SCFG/MRF Models to do Sequence Alignment and Motif Inference
Presenter
October 31, 2007
Abstract
RNA 3D structure files contain essentially complete information about the interactions that form the 3D structure of an RNA molecule from a given organism. Homologous molecules in other organisms will have very similar 3D structures, but we expect to see sequence variability due to, among other things, structurally neutral base substitutions, insertions, and deletions. RNA databases have far more RNA sequences than RNA 3D structures, and this will always be the case. We wish to use the 3D structure data to make inferences about the 3D structures of homologous molecules on the basis of their sequences.
We think of homologous RNA sequences as being random variants of the molecule for which we have a 3D structure. If the probabilistic model of this variation is simple enough, we can use it to align the sequences to the 3D structure, and thus infer the structural role of each base in each sequence. Stochastic context-free grammars (SCFGs) can account for the nested Watson-Crick basepairs prevalent in RNA, and by choosing appropriate basepair substitution probabilities, they can be used to model structurally neutral basepair substitutions for non-Watson-Crick basepairs as well. We use an SCFG formalism enhanced by production rules based on Markov Random Fields (MRFs). This allows us to model base triples, such as those found in sarcin motifs, and local crossing interactions, such as those found in kink-turn internal loops. However, as is usual with SCFGs, it does not allow us to model longer-range pseudoknots.
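To make the role of basepair substitution probabilities concrete, here is a minimal sketch in Python of how a sequence can be scored against a fixed set of nested basepairs. It is not the SCFG/MRF model described in this abstract: it assumes the sequence is already aligned to the structure with no insertions or deletions, it handles only nested pairs written in dot-bracket notation, and the probability tables (PAIR_PROB, BASE_PROB) and function names are hypothetical values and names chosen only for illustration.

```python
import math

BASES = "ACGU"

# Hypothetical pair emission probabilities (illustrative values only):
# Watson-Crick pairs are most likely, GU wobble pairs less so, and all
# other pairs are rare. The 16 values sum to 1.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}
PAIR_PROB = {}
for x in BASES:
    for y in BASES:
        if (x, y) in WATSON_CRICK:
            PAIR_PROB[(x, y)] = 0.2
        elif (x, y) in WOBBLE:
            PAIR_PROB[(x, y)] = 0.05
        else:
            PAIR_PROB[(x, y)] = 0.01

# Hypothetical emission probabilities for unpaired positions (uniform).
BASE_PROB = {b: 0.25 for b in BASES}


def parse_pairs(structure):
    """Return the nested (i, j) basepairs encoded by a dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return pairs


def log_prob(sequence, structure):
    """Log probability of a gapless sequence given the nested structure."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    pairs = parse_pairs(structure)
    paired = {k for pair in pairs for k in pair}
    total = 0.0
    # Each basepair is emitted jointly, as in an SCFG pair production.
    for i, j in pairs:
        total += math.log(PAIR_PROB[(sequence[i], sequence[j])])
    # Unpaired positions are emitted independently.
    for k, base in enumerate(sequence):
        if k not in paired:
            total += math.log(BASE_PROB[base])
    return total


structure = "(((...)))"
reference = "GGGAAACCC"    # three GC pairs closing an AAA loop
compensated = "GGUAAAACC"  # innermost GC pair replaced by a UA pair
disrupted = "GGAAAAACC"    # innermost pair replaced by an AA mismatch

for name, seq in [("reference", reference),
                  ("compensated", compensated),
                  ("disrupted", disrupted)]:
    print(name, round(log_prob(seq, structure), 3))
```

Under these assumed tables, a compensating substitution that preserves Watson-Crick pairing scores the same as the reference, while a substitution that breaks the pair is penalized. Handling insertions and deletions would require a dynamic programming parser over the grammar, which this sketch deliberately omits.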
The SCFG/MRF model can be used for two purposes: first, to make RNA multiple sequence alignments based on the 3D structure of one molecule, without reference to a hand-curated seed alignment; second, to infer the 3D structure of small motifs, such as internal loops, from their sequences.