Pairwise Alignment algorithms

When two sequences are aligned, the algorithm identifies the optimal relationship between them by comparing every letter in one sequence with every letter in the other. It scores matches and mismatches and computes the best-scoring path through them. It must also account for insertions or deletions in either sequence. For example, the following two sequences can be aligned by inserting gaps to bring identical residues in line with each other:

Before alignment
AFGIVHKLIVS
AFGIHKIVS

After alignment
AFGIVHKLIVS
AFGI-HK-IVS
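
The aligned rows above can be summarised by counting identical columns and gap columns. The following is a minimal Python sketch of that bookkeeping; the function name alignment_identity is hypothetical and is not part of Geneious.

```python
def alignment_identity(row_a, row_b, gap="-"):
    """Count identical columns, gap columns and total columns of two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # A column is identical when both rows hold the same residue (not a gap).
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    # A column is a gap column when either row holds the gap character.
    gaps = sum(1 for a, b in zip(row_a, row_b) if gap in (a, b))
    return matches, gaps, len(row_a)


matches, gaps, columns = alignment_identity("AFGIVHKLIVS", "AFGI-HK-IVS")
print(matches, gaps, columns)  # 9 2 11
```

Nine of the eleven columns are identical, and the two remaining columns are the inserted gaps.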

The pairwise alignment methods used in Geneious are based on dynamic programming, using the Needleman & Wunsch (1970) algorithm for global alignment and the Smith & Waterman (1981) algorithm for local alignment. A global alignment ensures that every part of both sequences is aligned. A local alignment aligns only the regions of best similarity, which is useful when only part of the two sequences is related, for example in multi-domain protein sequences. Forcing a global alignment on multi-domain sequences would not be sensible, because a global alignment implies similarity along the entire sequence length, whereas parts of the sequences in this case would be unrelated.
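
To make the difference between global and local alignment concrete, here is a minimal Python sketch of the dynamic programming score calculation. It is an illustration only and not the Geneious implementation: the function name alignment_score, the scores (match 1, mismatch -1) and the linear gap cost of -2 per gap character are all assumptions chosen for the example, and only the score is computed, not the alignment itself.

```python
def alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score: global (Needleman-Wunsch style) or local (Smith-Waterman style)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: gaps at the start of either sequence are charged.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score = max(
                table[i - 1][j - 1] + pair,  # align the two residues
                table[i - 1][j] + gap,       # gap in seq_b
                table[i][j - 1] + gap,       # gap in seq_a
            )
            if local:
                # Local: a negative score is reset to zero, so the
                # alignment can start afresh anywhere in the table.
                score = max(score, 0)
                best = max(best, score)
            table[i][j] = score
    # Global reads the bottom-right cell; local takes the best cell anywhere.
    return best if local else table[-1][-1]


# The example pair from this section: 9 matches and 2 gaps.
print(alignment_score("AFGIVHKLIVS", "AFGIHKIVS"))  # 5

# A made-up pair that shares only the region ACGTACG.
print(alignment_score("TTTTACGTACGT", "GGACGTACGG", local=True))  # 7
```

For the second pair, the local score comes entirely from the shared region, whereas a global alignment would also have to pay for the unrelated flanking residues.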

Note: An alignment is mathematically optimal but not necessarily biologically optimal, as you will see in the following exercises.

In addition to the choice of algorithm, you need to be aware of the scoring scheme and gap penalty settings, as both affect the quality and sensitivity of the alignment. Scoring matrices control the sensitivity to mismatches and substitutions during alignment, and gap penalties determine how easily gaps are opened and extended.

Scoring schemes - For DNA these tend to be quite simple and based on perfect matches and mismatches. Proteins have more complicated substitution tables in which similar residue types score positively even if they do not match exactly, for example, Isoleucine and Leucine. Scoring tables are typically designed to favour a certain level of identity. If your sequences are going to be very similar to each other you can use a strict scoring table, but if you want more sensitivity you can use a more relaxed scoring table. With DNA alignments, you could choose a scoring scheme weighted towards 93% identity, in which mismatches are scored very negatively, to favour alignments with a high level of identity. Other scoring schemes with less severe mismatch penalties will be more forgiving of lower identity alignments. Similarly, for protein alignments, you can choose Blosum90, which penalises mismatches and conservative substitutions more than Blosum45 would.
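
The effect of a strict versus a relaxed DNA scoring scheme can be sketched in a few lines of Python. The match and mismatch values below are hypothetical numbers chosen for illustration and should not be read as the values of any particular Geneious scheme. The function ungapped_score and the two aligned rows are also invented for the example.

```python
def ungapped_score(row_a, row_b, match, mismatch):
    """Sum the column scores of two equal-length rows that contain no gaps."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same length")
    return sum(match if a == b else mismatch for a, b in zip(row_a, row_b))


# Two made-up DNA rows: 8 matching columns and 2 mismatching columns.
row_a = "ACGTACGTAC"
row_b = "ACGTTCGTAA"

# Hypothetical strict scheme: mismatches are punished heavily.
print(ungapped_score(row_a, row_b, match=5, mismatch=-9))  # 22

# Hypothetical relaxed scheme: mismatches are tolerated more readily.
print(ungapped_score(row_a, row_b, match=5, mismatch=-4))  # 32
```

The same 80% identical pair scores much lower under the strict scheme, so an aligner using it is less willing to accept mismatching columns.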

Gap penalties - In addition to the scoring scheme, the gap penalties you select will also affect your alignment quality. The gap open penalty is the cost of starting a gap. Typically this is quite high, to discourage gaps. The gap extension penalty is usually much lower, which allows gaps to propagate once they have started. This gap open and gap extension scheme is referred to as an affine gap penalty and is intended to clump the gaps together and allow for long runs of gaps. If the gap open penalty is too low, the alignment algorithm will favour opening gaps over accepting mismatches and you can end up with a 'spaghetti' alignment, which is nonsense. Affine gaps are sometimes thought to produce more biologically realistic alignments, but you need to balance the cost of opening and extending such gaps, and it may be better to use a lower open penalty and a higher extension penalty, or even to make them equal. Where the two penalties are the same you are using linear gap penalties, where every '-' costs the same. For certain alignments you might find that linear gaps work better, because they allow the program to place single gaps more cheaply where appropriate (e.g. a beta turn in a protein), in cases where the clumping that affine penalties encourage would not produce a true alignment.
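
The arithmetic behind affine and linear gap penalties can be sketched as follows. This assumes the convention that the first '-' of a gap costs the open penalty and each further '-' costs the extension penalty, which is consistent with equal penalties making every '-' cost the same. The function names, the penalty values and the example rows are hypothetical.

```python
from itertools import groupby


def gap_cost(length, gap_open, gap_extend):
    """Cost of one gap: the first '-' pays gap_open, each further '-' pays gap_extend."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


def total_gap_cost(row, gap_open, gap_extend, gap="-"):
    """Sum the cost of every run of consecutive gap characters in an aligned row."""
    return sum(
        gap_cost(len(list(run)), gap_open, gap_extend)
        for char, run in groupby(row)
        if char == gap
    )


clumped = "ACG---TACG"    # one gap of length 3
scattered = "A-CG-TA-CG"  # three gaps of length 1

# Affine penalties (open 12, extend 3) favour the single long gap.
print(total_gap_cost(clumped, 12, 3), total_gap_cost(scattered, 12, 3))  # 18 36

# Linear penalties (open = extend = 5) charge every '-' the same.
print(total_gap_cost(clumped, 5, 5), total_gap_cost(scattered, 5, 5))  # 15 15
```

Under the affine values the scattered gaps cost twice as much as the clumped gap, whereas the linear values make no distinction between the two layouts.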

Balancing gap penalties, scoring schemes and the choice of algorithm is the skill you will learn during this module as we explore the effects of these parameters.

Exercise 1: Using dotplots to explore relationships
Exercise 2: Aligning DNA sequences
Exercise 3: Aligning protein sequences