Abstract
Patricia Babbitt
University of California, San Francisco (UCSF)
While much attention has been directed at mathematical and statistical issues in creating accurate multiple alignments, the question of which sequences (or parts of sequences) and structures to align is less well explored. This issue is especially important for investigating structure-function relationships in large sets of highly diverse homologs, in which proteins of unknown function far outnumber those that have been biochemically or structurally characterized. How can these choices be tuned to address different types of biological questions and improve functional inference? We discuss what we have learned about choosing representative sequences for creating multiple sequence alignments (MSAs) from studies of several large and functionally diverse enzyme superfamilies, and provide examples of how biologically informed questions can be framed using this context. Sequence similarity networks, built to summarize relationships among members of several of these superfamilies on a large scale, are used to illustrate new challenges for creating MSAs as the volume of sequence data continues to increase.