Using external sources of information in the multiple sequence alignment process, rather than relying on primary-sequence information alone, can give valuable hints to possible sequence homologies that may not be obvious from sequence comparison. Given the huge amount of sequence annotation that is produced on a daily basis, integrating such external information into the alignment process can help to produce biologically more meaningful alignments. DIALIGN-PFAM identifies possible domains in protein sequences by using PFAM; it then uses this information to align the protein sequences with DIALIGN and with a recently developed graph-theoretical approach to multiple alignment.
DIALIGN-PFAM takes a file in FASTA format containing a set of protein sequences as input. HMMER is used to scan these sequences against PFAM. After scanning is finished, segments of sequences matching the same domain in PFAM are assembled into what we call a domain block. Thus, we get a set of local alignments, each related to one protein domain. As a final step, anchor points are extracted from these blocks. They are used in the final multiple alignment process, which is done by DIALIGN. The anchoring option of DIALIGN is used to integrate the anchor points produced in the previous step into the alignment process.
Below is a list of the steps the user will go through when using DIALIGN-PFAM:
Each protein family in PFAM is represented by a model consisting of one or several multiple sequence alignments of domains and Hidden Markov Models (HMMs) derived from these alignments. Thus, the first step in our approach is to detect common domains in a set of sequences and then to align these domains. To scan the input sequences against PFAM, we use HMMER. More precisely, we use the program hmmscan, which searches sequences against a given profile HMM database.
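The scanning step can be sketched as a small helper that assembles an hmmscan command line. This is only an illustration, not the command used by DIALIGN-PFAM itself: the function name and all file names are hypothetical, and the Pfam HMM database is assumed to have been prepared with hmmpress beforehand.

```python
def build_hmmscan_command(pfam_db: str, fasta: str, domtbl_out: str) -> list[str]:
    """Return an argv list that scans the sequences in `fasta` against `pfam_db`.

    All three arguments are hypothetical file names chosen by the caller.
    """
    return [
        "hmmscan",
        "--domtblout", domtbl_out,  # write a per-domain table of hits to this file
        pfam_db,                    # profile HMM database (assumed pressed with hmmpress)
        fasta,                      # input protein sequences in FASTA format
    ]


command = build_hmmscan_command("Pfam-A.hmm", "input.fasta", "hits.domtbl")
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where HMMER is installed.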
HMMER assigns quality scores to matches between sequences and models of proteins and domains in a database. To control which hits are used by our algorithm, we use two threshold values for the E-values of HMMER hits. The first threshold, Em, concerns the E-value of the matched models and ensures that only models with an E-value less than Em are taken into consideration. The models that satisfy the first threshold condition are further filtered with a second threshold, Ed, for domains. Only those domains that have an E-value less than Ed are considered in our procedure. As default values, we used 5 × 10⁻³ for Em and 10⁻⁴ for Ed. (Note that a model in Pfam can comprise more than one protein domain; our first threshold applies to full models, the second one to single domains.)
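The two-stage filtering can be illustrated with a minimal sketch. The `DomainHit` record and its field names are hypothetical; which HMMER output columns supply the model-level and domain-level E-values is an assumption left to the caller. Only the default thresholds are taken from the text.

```python
from dataclasses import dataclass

E_MODEL = 5e-3   # default Em from the text
E_DOMAIN = 1e-4  # default Ed from the text


@dataclass(frozen=True)
class DomainHit:
    """Hypothetical record for one domain match reported by HMMER."""
    sequence: str
    model: str
    model_evalue: float   # E-value of the full model match
    domain_evalue: float  # E-value of this single domain


def filter_hits(hits, e_model=E_MODEL, e_domain=E_DOMAIN):
    """Keep hits whose model E-value is below Em and whose domain E-value is below Ed."""
    return [
        h for h in hits
        if h.model_evalue < e_model and h.domain_evalue < e_domain
    ]


hits = [
    DomainHit("seq1", "DomA", 1e-10, 1e-8),  # passes both thresholds
    DomainHit("seq2", "DomA", 1e-2, 1e-8),   # model E-value too high
    DomainHit("seq3", "DomA", 1e-4, 1e-3),   # domain E-value too high
]
kept = filter_hits(hits)
```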
After the block extraction process is finished, the user gets the set of constructed blocks as output. Each block is related to one protein domain. The user has the option to view these blocks either locally or globally. The local view shows only the single segments constituting a given block; note that these segments may contain gaps. The global view of a given block, on the other hand, shows the full sequences that have matches to a certain domain. The matches are colored so that the user can clearly see where the block's segments are located on each of the involved sequences. The user can choose either to include all the constructed blocks in the multiple alignment process or to discard some of them.
To integrate the domain blocks derived from PFAM hits with similarities at the primary-sequence level, we use the MSA program DIALIGN. We use the domain blocks as anchor points for DIALIGN. DIALIGN has an anchoring option, where users can specify local alignments as anchor points that should be preferentially aligned. Since, in general, not all selected anchor points can be included in one single output alignment, the program greedily selects a consistent subset of the proposed anchor points, i.e. a subset fitting into one single multiple alignment.
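The greedy selection idea can be sketched as follows. This is a deliberately simplified illustration, not DIALIGN's actual procedure: the `Anchor` record is hypothetical, and consistency is only checked between anchors on the same pair of sequences (no overlap and no crossing), whereas a full consistency check also has to account for constraints that arise transitively across all sequences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    """Hypothetical gap-free anchor between two sequences (seq_a < seq_b)."""
    seq_a: int
    seq_b: int
    start_a: int
    start_b: int
    length: int
    score: float


def compatible(x: Anchor, y: Anchor) -> bool:
    """Simplified pairwise check: anchors must neither overlap nor cross."""
    if (x.seq_a, x.seq_b) != (y.seq_a, y.seq_b):
        return True  # simplification: other sequence pairs are not checked
    x_before_y = (x.start_a + x.length <= y.start_a
                  and x.start_b + x.length <= y.start_b)
    y_before_x = (y.start_a + y.length <= x.start_a
                  and y.start_b + y.length <= x.start_b)
    return x_before_y or y_before_x


def greedy_select(anchors):
    """Accept anchors in order of decreasing score if they fit the chosen set."""
    chosen = []
    for anchor in sorted(anchors, key=lambda a: a.score, reverse=True):
        if all(compatible(anchor, c) for c in chosen):
            chosen.append(anchor)
    return chosen


a = Anchor(1, 2, 10, 10, 5, 9.0)
b = Anchor(1, 2, 20, 5, 5, 7.0)   # crosses a, so it is rejected
c = Anchor(1, 2, 30, 30, 5, 3.0)  # lies after a in both sequences
selected = greedy_select([c, b, a])
```

Because higher-scoring anchors are considered first, a low-scoring anchor can never displace one that has already been accepted.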
The score of the anchor points derived from a given block is defined as the sum of the scores of the segments that are part of the block. (The scores of anchor points determine their priority in the greedy selection of a consistent set of anchor points.) As a result, in this approach, the segments that are part of the constructed domain blocks are aligned first. The rest of the sequences is then aligned by DIALIGN under the constraints defined by the selected anchor points.
Consider the following set of seven sequences as input to Dialign-Pfam:
After these sequences are submitted, Dialign-Pfam scans each of the input sequences against Pfam. The block building process then starts: the overlapping parts of the matches of input sequences to the same protein domain in Pfam are considered as a domain block.
For our example, the output for this step is the following:
The first column shows the domains that were matched in Pfam. In our example, five domains were matched: Thioredoxin, Glutaredoxin, SH3BGR, AhpC-TSA and Redoxin.
The second column shows how many of the input sequences matched the protein domain specified in the same row. In this example, five sequences matched the Thioredoxin domain, three sequences matched the Glutaredoxin domain, and so on.
The checkbox column next to the domain names allows the user to select or deselect the domain matches that are to be used as anchor points for the final multiple alignment by DIALIGN (by default, all domain matches are selected).
To view the block for a specific domain, click on view in the third column. The following image shows the Glutaredoxin block:
The first column displays the sequences that were found to match the Glutaredoxin domain. The second column specifies the position where the match of the sequence to Glutaredoxin starts. The third column shows the alignment of the matches to the Glutaredoxin domain.
To view the position of the Pfam matches within the input sequences, click on view in the fourth column. For Glutaredoxin, the output looks like this:
Columns one and two show the sequence names and starting positions. The third column shows the Pfam matches (in red) within the input sequences. For long sequences, the zoom bar can be used to see the whole sequences and to locate the red matches more clearly.
Our alignment approach is based on blocks of segments of the input sequences
that match the same positions of some PFAM domain. We call such a
block a domain block. More precisely, a domain block consists of segments that
are matched with gaps to the same segment of a PFAM domain. For each column in
a domain block, the corresponding positions in the involved segments are
required to match the same position in the PFAM domain that is associated
with this block.
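The column rule can be illustrated with a minimal sketch. It assumes that, for each sequence, the residues matched to the domain are already available as a mapping from HMM position to residue; this input layout and the function name are hypothetical. Residues inserted relative to the model are ignored here, and the block is restricted to the range of model positions shared by all matches.

```python
def build_domain_block(matches: dict[str, dict[int, str]]) -> dict[str, str]:
    """Arrange matched residues so that each column is one position of the domain model.

    `matches` maps a sequence name to {model position: residue}.
    Returns a gapped row per sequence, restricted to the overlapping model range.
    """
    first = max(min(positions) for positions in matches.values())
    last = min(max(positions) for positions in matches.values())
    if first > last:
        return {}  # the matches do not overlap on the model
    return {
        name: "".join(positions.get(p, "-") for p in range(first, last + 1))
        for name, positions in matches.items()
    }


block = build_domain_block({
    "seqA": {1: "M", 2: "K", 3: "V", 4: "L"},
    "seqB": {2: "K", 4: "I", 5: "G"},
})
```

In this toy input the shared model range is positions 2 to 4, and seqB has no residue at model position 3, so its row receives a gap in that column.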
The minimum number of segments in any given block is two, i.e. we cannot build a block from a single segment. Thus, no domain block is built for a protein domain that is matched by only one input sequence.
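This rule can be sketched as a grouping step that discards domains matched by fewer than two sequences. The input layout (pairs of sequence name and domain name) and the placeholder names are hypothetical, and counting distinct sequences rather than individual segments is an assumption of this sketch.

```python
from collections import defaultdict


def group_by_domain(hits):
    """Group (sequence, domain) pairs by domain; keep domains with at least two sequences."""
    groups = defaultdict(set)
    for sequence, domain in hits:
        groups[domain].add(sequence)
    return {
        domain: sorted(sequences)
        for domain, sequences in groups.items()
        if len(sequences) >= 2
    }


candidates = group_by_domain([
    ("seq1", "DomA"),
    ("seq2", "DomA"),
    ("seq3", "DomB"),  # only one sequence matches DomB, so no block is built
])
```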
Anchor points in DIALIGN are pairs of residues that are to be aligned (or, more generally, pairwise alignments that are to be included into a multiple alignment). In our approach, we use pairs of segments contained in our domain blocks as anchor points for DIALIGN. To ensure that all segments of a domain block are connected directly or indirectly by anchor points, we define anchor points connecting segment 1 with segment 2, segment 2 with segment 3, segment 3 with segment 4, etc.
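The chaining of consecutive segments can be sketched as follows. This is an illustration under stated assumptions rather than the program's actual code: the block layout (name, 1-based start position, gapped row, segment score) and the output tuple are hypothetical, and each pair of gapped rows is cut into gap-free runs so that every anchor covers only columns where both segments carry a residue. Every anchor receives the sum of the segment scores of its block.

```python
def gap_free_runs(row_a, start_a, row_b, start_b):
    """Return (pos_a, pos_b, length) for each maximal run of columns without gaps."""
    runs = []
    pos_a, pos_b = start_a, start_b
    run = None  # current run as [pos_a, pos_b, length]
    for char_a, char_b in zip(row_a, row_b):
        if char_a != "-" and char_b != "-":
            if run is None:
                run = [pos_a, pos_b, 0]
            run[2] += 1
        elif run is not None:
            runs.append(tuple(run))
            run = None
        pos_a += char_a != "-"  # advance only when a residue was consumed
        pos_b += char_b != "-"
    if run is not None:
        runs.append(tuple(run))
    return runs


def chain_anchors(block):
    """Connect segment 1 with 2, 2 with 3, ... using the block score for each anchor."""
    block_score = sum(score for _, _, _, score in block)
    anchors = []
    for (name_a, start_a, row_a, _), (name_b, start_b, row_b, _) in zip(block, block[1:]):
        for pos_a, pos_b, length in gap_free_runs(row_a, start_a, row_b, start_b):
            anchors.append((name_a, name_b, pos_a, pos_b, length, block_score))
    return anchors


anchors = chain_anchors([
    ("s1", 10, "ACD-F", 1.0),
    ("s2", 3, "AC-EF", 2.0),
    ("s3", 7, "ACDEF", 3.0),
])
```

With n segments in a block, this produces anchors for only n - 1 segment pairs, which is enough to connect all segments directly or indirectly.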
Layal Al Ait, Eduardo Corel, Burkhard Morgenstern: Using protein-domain information for multiple sequence alignment. IEEE 12th International Conference on Bioinformatics and Bioengineering, 2012, Larnaca, Cyprus.
Layal Al-Ait: firstname.lastname@example.org