Alignment Quality Analysis
QUALITY SCORES
Clustal X indicates the quality of an alignment by plotting a 'conservation score' for each column of the alignment. A high score indicates a well-conserved column; a low score indicates low conservation. The quality curve is drawn below the alignment.
Two methods are also provided to mark single residues or sequence segments that score badly in the alignment.
Low-scoring residues are expected to occur at a moderate frequency in all the sequences, because of their steady divergence due to the natural processes of evolution. The most divergent sequences are likely to have the most outliers. However, the highlighted residues are especially useful for pointing to sequence misalignments. A cluster of highlighted residues is a strong indication of misalignment. This can arise for various reasons, for example:
1. Partial or total misalignments caused by a failure of the alignment algorithm. These usually arise only in difficult alignment cases.
2. Partial or total misalignments because at least one of the sequences in the given set is partly or completely unrelated to the other sequences. It is up to the user to check that the set of sequences can be aligned.
3. Frameshift translation errors in a protein sequence, which cause locally mismatched regions to be highlighted. These are surprisingly common in database entries. If a frameshift is suspected, a 3-frame translation of the source DNA should be examined.
Occasionally, highlighted residues may point to regions of some biological significance. This might happen, for example, if a protein alignment contains a sequence which has acquired new functions relative to the main sequence set. It is important to exclude other explanations, such as sequence errors or the natural divergence of sequences, before invoking a biological explanation.
LOW-SCORING SEGMENTS
Unreliable regions in the alignment can be highlighted using the Low-Scoring Segments option. A sequence-weighted profile is used to indicate any segments in the sequences which score badly. Because the profile calculation may take some time, a separate option is provided to calculate the LOW-SCORING SEGMENTS. The segment display can then be toggled on or off without repeating the time-consuming calculation.
For details of the low-scoring segment calculation, see the CALCULATION OF LOW-SCORING SEGMENTS section below.
LOW-SCORING SEGMENT PARAMETERS
MINIMUM LENGTH OF SEGMENTS: short segments (or even single residues) can be hidden by increasing the minimum length of the segments that will be displayed.
DNA MARKING SCALE is used to remove less significant segments from the highlighted display.
Increase the scale to display more segments; decrease the scale to remove the least significant ones.
PROTEIN WEIGHT MATRIX: the scoring table which describes the similarity of each amino acid to each other. The matrix is used to calculate the sequence-weighted profile scores. Four 'in-built' log-odds matrices are offered: the Gonnet PAM 80, 120, 250 and 350 matrices. When the sequences are closely related, a more stringent matrix, which gives a high score only to identities and the most favoured conservative substitutions, may be more suitable. For more divergent sequences, it is appropriate to use "softer" matrices which give a high score to many other frequent substitutions. Changing this option automatically recalculates the low-scoring segments.
DNA WEIGHT MATRIX: two hard-coded matrices are available:
1) IUB. This is the default scoring matrix used by BESTFIT for the comparison of nucleic acid sequences. X's and N's are treated as matches to any IUB ambiguity symbol. All matches score 1.0; all mismatches for IUB symbols score 0.9.
2) CLUSTALW(1.6). The previous system used by ClustalW, in which matches score
1.0 and mismatches score 0. All matches for IUB symbols also score 0.
A new matrix can be read from a file on disk, provided the filename consists only of lower case characters.
The values in the new weight matrix should be similarities and should be NEGATIVE for infrequent substitutions.
INPUT FORMAT. The format used for a new matrix is the same as that of the BLAST program. Any line beginning with a # character is assumed to be a comment. The first non-comment line should list the amino acids in any order, using the 1 letter code, followed by a * character. This must be followed by a square matrix of scores, with one row and one column for each amino acid. The last row and column of the matrix (corresponding to the * character) contain the minimum score over the whole matrix.
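As an illustration of the format described above, the following Python sketch reads such a matrix from text. The function name, the toy two-residue matrix, the whitespace-separated header and the optional leading residue letter on each row are assumptions made for this example; they are not taken from the Clustal X source.

```python
def parse_blast_matrix(text):
    """Parse a BLAST-style matrix into (symbols, {(row, col): score})."""
    symbols = None
    scores = {}
    rows_read = 0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if symbols is None:
            symbols = fields  # first non-comment line: residue letters, then '*'
            continue
        if rows_read >= len(symbols):
            raise ValueError("more rows than symbols")
        # Assumption: a row may start with its own residue letter.
        if len(fields) == len(symbols) + 1:
            row_symbol, values = fields[0], fields[1:]
        else:
            row_symbol, values = symbols[rows_read], fields
        if len(values) != len(symbols):
            raise ValueError("row %d has the wrong number of scores" % (rows_read + 1))
        for col_symbol, value in zip(symbols, values):
            scores[(row_symbol, col_symbol)] = float(value)
        rows_read += 1
    if symbols is None or rows_read != len(symbols):
        raise ValueError("matrix is not square")
    return symbols, scores


# Hypothetical two-residue matrix, for illustration only.
EXAMPLE = """\
# toy similarity matrix
A  C  *
A  2 -1 -3
C -1  3 -3
* -3 -3 -3
"""

symbols, scores = parse_blast_matrix(EXAMPLE)
lowest = min(scores.values())
```

In the toy matrix the infrequent substitution A/C is negative and the * row and column hold the lowest score, as the text requires.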
QUALITY SCORE PARAMETERS
The column 'quality scores' plotted underneath the alignment display can be customised using the following options.
SCORE PLOT SCALE: this is a scalar value from 1 to 10, which can be used to change the scale of the quality score plot.
RESIDUE EXCEPTION CUTOFF: this is a scalar value from 1 to 10, which can be used to change the number of residue exceptions that are highlighted in the alignment display. (For an explanation of this cutoff, see the CALCULATION OF RESIDUE EXCEPTIONS section below.)
PROTEIN WEIGHT MATRIX: the scoring table which describes the similarity of each amino acid to each other.
DNA WEIGHT MATRIX: two hard-coded matrices are available: IUB and CLUSTALW(1.6).
For more information about the weight matrices, see the weight matrix descriptions under LOW-SCORING SEGMENT PARAMETERS above.
For details of the quality score calculations, see the CALCULATION OF QUALITY SCORES section below.
SHOW LOW-SCORING SEGMENTS
The low-scoring segment display can be toggled on or off. This option does not recalculate the profile scores.
SHOW EXCEPTIONAL RESIDUES
This option highlights individual residues which score badly in the alignment quality calculations. Residues which score exceptionally low are shown as a white character on a grey background.
SAVE QUALITY SCORES TO FILE
The quality scores plotted underneath the alignment display can also be saved to a text file. Each column in the alignment is written on one line of the output file, with the value of the quality score at the end of the line. Only the sequences currently selected in the display are written to the file. One use for quality scores is to colour the residues in a protein structure by sequence conservation. In this way conserved surface residues can be highlighted to locate functional regions such as ligand-binding sites.
CALCULATION OF QUALITY SCORES
Suppose we have an alignment of m sequences of length n. Then, the alignment can be written as:
A11 A12 A13 .......... A1n
A21 A22 A23 .......... A2n
.
.
Am1 Am2 Am3 .......... Amn
We also have a residue comparison matrix of size R where C(i,j) is the score for aligning residue i with residue j. We want to calculate a score for the conservation of the jth position in the alignment.
To do this, we define an R-dimensional sequence space. For the jth position in the alignment, each sequence consists of a single residue which is assigned a point S in the space. S has R dimensions, and for sequence i, the rth dimension is defined as:
Sr = C(r,Aij)
We then calculate a consensus value for the jth position in the alignment. This value X also has R dimensions, and the rth dimension is defined as:
Xr = ( SUM (Fij * C(i,r)) ) / m
1<=i<=R
where Fij is the count of residues i at position j in the alignment.
Now we can calculate the distance Di between each sequence i and the consensus position X in the R-dimensional space.
Di = SQRT ( SUM (Xr - Sr)(Xr - Sr) )
1<=r<=R
The quality score for the jth position in the alignment is defined as the mean of the sequence distances Di.
The score is normalised by multiplying by the percentage of sequences which have residues (and not gaps) at this position.
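The calculation above can be sketched in Python for a single column. The function name, the dict-of-dicts comparison matrix, the toy two-letter alphabet and the treatment of gaps (skipped when counting residues, with the occupancy applied as a fraction rather than a percentage) are assumptions made for illustration; this is not Clustal X's own code.

```python
import math


def column_quality(column, alphabet, C, gap="-"):
    """Return (score, distances) for one alignment column.

    column   -- string with one character per sequence
    alphabet -- the R residue symbols
    C        -- C[a][b] is the score for aligning residue a with residue b
    """
    m = len(column)
    residues = [a for a in column if a != gap]
    if not residues:
        return 0.0, []

    # F_ij: count of each residue type at this position.
    counts = {r: 0 for r in alphabet}
    for a in residues:
        counts[a] += 1

    # Consensus point: X_r = SUM_i(F_ij * C(i, r)) / m
    X = [sum(counts[i] * C[i][r] for i in alphabet) / m for r in alphabet]

    # Distance of each (non-gap) sequence from the consensus point.
    distances = []
    for a in residues:
        S = [C[r][a] for r in alphabet]  # S_r = C(r, A_ij)
        distances.append(math.sqrt(sum((x - s) ** 2 for x, s in zip(X, S))))

    mean_distance = sum(distances) / len(distances)
    occupancy = len(residues) / m  # fraction of sequences without a gap
    return mean_distance * occupancy, distances


# Toy comparison matrix (hypothetical values).
TOY_C = {"A": {"A": 2, "C": -1}, "C": {"A": -1, "C": 2}}

conserved_score, _ = column_quality("AAAA", "AC", TOY_C)
mixed_score, mixed_distances = column_quality("AACC", "AC", TOY_C)
```

Taken literally, the value is a mean distance, so a fully conserved column gives 0 in this sketch. How the program maps this value onto the plotted curve is not described in this section.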
CALCULATION OF RESIDUE EXCEPTIONS
The jth residue of the ith sequence is considered as an exception if the distance Di of the sequence from the consensus value X is greater than (Upper Quartile + Inter Quartile Range * Cutoff), where the quartiles are taken over the distances Di at that position. The value used as a cutoff for displaying exceptions can be set with RESIDUE EXCEPTION CUTOFF under QUALITY SCORE PARAMETERS. A high cutoff value will only display very significant exceptions; a low value will allow more, less significant, exceptions to be highlighted.
(NB. Sequences which contain gaps at this position are not included in the
exception calculation.)
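The exception rule can be sketched in Python as follows. The function name and the sample distances are hypothetical. The text does not say how the quartiles are estimated, so the use of `statistics.quantiles` with the "inclusive" method is an assumption. Distances for sequences with a gap at the position are expected to have been left out by the caller.

```python
import statistics


def residue_exceptions(distances, cutoff):
    """Return indices whose distance exceeds Q3 + IQR * cutoff.

    distances -- the Di values of the non-gap sequences at one position
                 (at least two values are needed to estimate quartiles)
    cutoff    -- the residue exception cutoff (1 to 10 in the program)
    """
    q1, _median, q3 = statistics.quantiles(distances, n=4, method="inclusive")
    threshold = q3 + (q3 - q1) * cutoff
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical distances for six sequences at one position.
sample = [1.0, 1.1, 0.9, 1.0, 1.2, 2.0]
low_cutoff = residue_exceptions(sample, cutoff=1)
high_cutoff = residue_exceptions(sample, cutoff=10)
```

With these values the last sequence is flagged at a cutoff of 1 but not at a cutoff of 10, which matches the behaviour described above: a higher cutoff leaves only the more significant exceptions.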
CALCULATION OF LOW-SCORING SEGMENTS
Suppose we have an alignment of m sequences of length n. Then, the alignment can be written as:
A11 A12 A13 .......... A1n
A21 A22 A23 .......... A2n
.
.
Am1 Am2 Am3 .......... Amn
We also have a residue comparison matrix of size R where C(i,j) is the score for aligning residue i with residue j.
We calculate sequence weights by building a neighbour-joining tree, in which branch lengths are proportional to divergence. Summing the branches by branch ownership provides the weights. See Thompson et al., CABIOS, 10, 19 (1994) and Henikoff et al., JMB, 243, 574 (1994).
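One possible reading of "summing the branches by branch ownership" is sketched below in Python: each branch length is shared equally among the sequences beneath it, and a sequence's weight is the sum of its shares from the root to its leaf. The rooted nested-list tree, the function names and the branch lengths are assumptions for illustration only; building the neighbour-joining tree itself is not shown.

```python
def count_leaves(node):
    """A node is a leaf name (str) or a list of (child, branch_length) pairs."""
    if isinstance(node, str):
        return 1
    return sum(count_leaves(child) for child, _length in node)


def leaf_weights(node, inherited=0.0):
    """Share each branch equally among the leaves below it and sum the shares."""
    if isinstance(node, str):
        return {node: inherited}
    weights = {}
    for child, length in node:
        share = length / count_leaves(child)
        weights.update(leaf_weights(child, inherited + share))
    return weights


# Hypothetical rooted tree: A and B are close relatives, C is more divergent.
TREE = [([("A", 0.1), ("B", 0.1)], 0.2), ("C", 0.5)]
weights = leaf_weights(TREE)
```

In this toy tree the two similar sequences A and B share their common branch and so each receive a lower weight than the divergent sequence C.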
To find the low-scoring segments in a sequence Si, we build a weighted profile of the remaining sequences in the alignment. Suppose we find residue r at position j in the sequence; then the score for the jth position in the sequence is defined as
Score(Si,j) = Profile(j,r)
where Profile(j,r) is the profile score for residue r at position j in the alignment.
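The text does not give the exact form of Profile(j,r). The Python sketch below assumes a simple one: the weighted average of C(r, x) over the residues x found at position j in the remaining sequences. The function name, the zero score for gaps in the target sequence and the toy data are likewise assumptions.

```python
def profile_scores(alignment, weights, C, i, gap="-"):
    """Score each position of sequence i against a weighted profile of the rest."""
    target = alignment[i]
    scores = []
    for j, r in enumerate(target):
        if r == gap:
            scores.append(0.0)  # assumption: a gap in the target scores zero
            continue
        total = 0.0
        weight_sum = 0.0
        for k, seq in enumerate(alignment):
            if k == i or seq[j] == gap:
                continue  # leave out the target itself and gapped sequences
            total += weights[k] * C[r][seq[j]]
            weight_sum += weights[k]
        scores.append(total / weight_sum if weight_sum else 0.0)
    return scores


# Toy data (hypothetical values).
TOY_C = {"A": {"A": 2, "C": -1}, "C": {"A": -1, "C": 2}}
ALIGNMENT = ["AC", "AC", "CC"]
WEIGHTS = [1.0, 1.0, 2.0]

scores_first = profile_scores(ALIGNMENT, WEIGHTS, TOY_C, i=0)
scores_last = profile_scores(ALIGNMENT, WEIGHTS, TOY_C, i=2)
```

Here the first position of the last sequence scores negatively because its residue disagrees with both of the other sequences.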
These residue scores are summed along the sequence in both forward and backward directions. If the sum of the scores is positive, then it is reset to zero. Segments which score negatively in both directions are considered as 'low-scoring' and will be highlighted in the alignment display.
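The two-directional summation can be sketched in Python as below. The function names and sample scores are hypothetical, the `min_length` argument mirrors the MINIMUM LENGTH OF SEGMENTS parameter, and the marking scale is not modelled.

```python
def low_scoring_mask(scores):
    """Mark positions whose running sum is negative in both directions."""
    n = len(scores)

    forward = [0.0] * n
    running = 0.0
    for j in range(n):
        running = min(0.0, running + scores[j])  # reset to zero if positive
        forward[j] = running

    backward = [0.0] * n
    running = 0.0
    for j in reversed(range(n)):
        running = min(0.0, running + scores[j])
        backward[j] = running

    return [f < 0 and b < 0 for f, b in zip(forward, backward)]


def segments(mask, min_length=1):
    """Return (start, end) pairs, end exclusive, for runs of marked positions."""
    found = []
    start = None
    for j, flag in enumerate(list(mask) + [False]):
        if flag and start is None:
            start = j
        elif not flag and start is not None:
            if j - start >= min_length:
                found.append((start, j))
            start = None
    return found


# Hypothetical per-position profile scores for one sequence.
sample_scores = [2, 1, -3, -2, -1, 4, 1]
mask = low_scoring_mask(sample_scores)
all_segments = segments(mask)
long_segments = segments(mask, min_length=4)
```

For the sample scores only positions 2 to 4 (counting from 0) are negative in both directions, so a single segment is reported; raising the minimum length to 4 hides it.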