Alignment Quality Analysis

QUALITY SCORES

Clustal X indicates the quality of an alignment by assigning a 'conservation score' to each column of the alignment. A high score indicates a well-conserved column; a low score indicates low conservation. The quality curve is drawn below the alignment.

Two methods are also provided for marking single residues or sequence segments that score badly in the alignment.

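Low-scoring residues are expected to occur at a moderate frequency in all the sequences, because of their steady divergence through the natural processes of evolution. The most divergent sequences are likely to have the most outliers. However, the highlighted residues are especially useful for pointing to misaligned sequences. Clustering of highlighted residues is a strong indication of misalignment. This can arise for various reasons, for example: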

1. Partial or total misalignments caused by a failure of the alignment algorithm, which occur when the sequences are difficult to align.

2. Partial or total misalignments that arise because at least one of the sequences in the given set is partly or completely unrelated to the others. Users should check that the set of sequences can be aligned at all.

3. Frameshift translation errors in a protein sequence, which cause locally mismatched regions to be highlighted. These are surprisingly common in database entries. If a frameshift is suspected, the 3-frame translation of the source DNA should be examined.
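As an illustration of the check suggested in point 3, the following minimal sketch translates a DNA string in all three forward reading frames. It is not part of Clustal X: the function names are hypothetical, it assumes the standard genetic code, and it marks any codon it cannot interpret as 'X'.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame):
    """Translate dna from offset frame (0, 1 or 2); unknown codons become 'X'."""
    dna = dna.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(dna):
    """Return the translations of the three forward reading frames."""
    return [translate(dna, frame) for frame in range(3)]


if __name__ == "__main__":
    for frame, protein in enumerate(three_frame_translation("ATGGCCTAA")):
        print(frame, protein)
```

If a highlighted region of the protein matches the translation of a different frame of the source DNA, a frameshift in the database entry is a plausible explanation.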

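Occasionally, highlighted residues may point to regions of some biological significance. This might happen, for example, if a protein alignment contains a sequence that has acquired new functions relative to the main sequence set. It is important to exclude other explanations, such as errors or the natural divergence of sequences, before invoking a biological explanation.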

LOW-SCORING SEGMENTS

Unreliable regions in the alignment can be highlighted using the Low-Scoring Segments option. A sequence-weighted profile is used to indicate any segments in the sequences that score badly. Because the profile calculation may take some time, an option is provided to calculate the low-scoring segments. The segment display can then be toggled on or off without repeating the time-consuming calculation.

For details of the low-scoring segment calculation, see the CALCULATION OF LOW-SCORING SEGMENTS section below.

LOW-SCORING SEGMENT PARAMETERS

MINIMUM LENGTH OF SEGMENTS: short segments (or even single residues) can be hidden by increasing the minimum length of the segments that will be displayed.

DNA MARKING SCALE: used to remove less significant segments from the highlighted display. Increase the scale to display more segments; decrease the scale to remove the least significant ones.

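PROTEIN WEIGHT MATRIX: the scoring table that describes the similarity of each amino acid to every other. The matrix is used to calculate the sequence-weighted profile scores. Four 'in-built' Log-Odds matrices are offered: the Gonnet PAM 80, 120, 250 and 350 matrices. A more stringent matrix, which gives a high score only to identities and the most favoured conservative substitutions, may be more suitable when the sequences are closely related. For more divergent sequences, "softer" matrices, which give a high score to many other frequent substitutions, are appropriate. This option automatically recalculates the low-scoring segments.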

DNA WEIGHT MATRIX: two hard-coded matrices are available:

1) IUB. This is the default scoring matrix used by BESTFIT for the comparison of nucleic acid sequences. X's and N's are treated as matches to any IUB ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.

2) CLUSTALW(1.6). The previous system used by ClustalW, in which matches score 1.0 and mismatches score 0. All matches for IUB symbols also score 0.

A new matrix can be read from a file on disk, provided the filename consists only of lower case characters.

The values in a new weight matrix should be similarities, and should be NEGATIVE for infrequent substitutions.

INPUT FORMAT. The format used for a new matrix is the same as that of the BLAST program. Any line beginning with a # character is assumed to be a comment. The first non-comment line should list the amino acids in any order, using the 1 letter code, followed by a * character. This should be followed by a square matrix of scores, with one row and one column for each amino acid. The last row and column of the matrix (corresponding to the * character) contain the minimum score over the whole matrix.
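The input format described above can be illustrated with a small reader. This is a hedged sketch rather than the parser Clustal X uses: the function name and the toy three-symbol matrix are hypothetical, and the code assumes that each row may optionally begin with its own residue letter, as BLAST matrix files commonly do.

```python
def parse_weight_matrix(text):
    """Parse a BLAST-style matrix; return (symbols, scores keyed by (row, column))."""
    rows = [
        line.split()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")  # skip blanks, comments
    ]
    symbols, body = rows[0], rows[1:]
    if symbols[-1] != "*":
        raise ValueError("the residue list must end with '*'")
    if len(body) != len(symbols):
        raise ValueError("the matrix must have one row per listed symbol")
    scores = {}
    for row_symbol, row in zip(symbols, body):
        # Assumption: a row may start with its residue letter; accept both layouts.
        values = row[1:] if len(row) == len(symbols) + 1 else row
        if len(values) != len(symbols):
            raise ValueError("the matrix must be square")
        for column_symbol, value in zip(symbols, values):
            scores[(row_symbol, column_symbol)] = float(value)
    return symbols, scores


# Toy example with two residues only, for illustration.
EXAMPLE = """\
# toy similarity matrix
   A  C  *
A  2 -1 -3
C -1  3 -3
* -3 -3 -3
"""

if __name__ == "__main__":
    symbols, scores = parse_weight_matrix(EXAMPLE)
    print(symbols, scores[("A", "C")])
```

In the toy matrix the infrequent substitution A/C is negative and the * row and column hold the lowest score in the matrix, as the format requires.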

QUALITY SCORE PARAMETERS

The column 'quality scores' plotted below the alignment display can be customised using the following options.

SCORE PLOT SCALE: a scalar value from 1 to 10, used to change the scale of the quality score plot.

RESIDUE EXCEPTION CUTOFF: a scalar value from 1 to 10, used to change the number of residue exceptions highlighted in the alignment display. (For an explanation, see the CALCULATION OF RESIDUE EXCEPTIONS section below.)

PROTEIN WEIGHT MATRIX: the scoring table that describes the similarity of each amino acid to every other.

DNA WEIGHT MATRIX: two hard-coded matrices are available: IUB and CLUSTALW(1.6).

For more information about the weight matrices, see the weight matrix descriptions under LOW-SCORING SEGMENT PARAMETERS above.

For details of the quality score calculation, see the CALCULATION OF QUALITY SCORES section below.

SHOW LOW-SCORING SEGMENTS

The low-scoring segment display can be toggled on or off. This option does not recalculate the profile scores.

SHOW EXCEPTIONAL RESIDUES

This option highlights the individual residues that score worst in the alignment quality calculations. Residues with exceptionally low scores are shown as white characters on a grey background.

SAVE QUALITY SCORES TO FILE

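The quality scores plotted below the alignment display can also be saved to a text file. Each column of the alignment is written on one line of the output file, with the value of the quality score at the end of the line. Only the sequences currently selected in the display are written to the file. One use for the quality scores is to colour the residues of a protein according to sequence conservation. In this way, conserved surface residues can be highlighted to locate functional regions such as ligand-binding sites.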

CALCULATION OF QUALITY SCORES

Suppose we have an alignment of m sequences of length n. Then, the alignment can be written as:

A11 A12 A13 .......... A1n

A21 A22 A23 .......... A2n

Am1 Am2 Am3 .......... Amn

We also have a residue comparison matrix of size R, where C(i,j) is the score for aligning residue i with residue j. We want to calculate a score for the conservation of the jth position in the alignment.

To do this, we define an R-dimensional sequence space. For the jth position in the alignment, each sequence consists of a single residue which is assigned a point S in the space. S has R dimensions, and for sequence i, the rth dimension is defined as:

Sr = C(r,Aij)

We then calculate a consensus value for the jth position in the alignment. This value X also has R dimensions, and the rth dimension is defined as:

Xr = ( SUM over 1<=i<=R of (Fij * C(i,r)) ) / m

where Fij is the count of residues i at position j in the alignment.

Now we can calculate the distance Di between each sequence i and the consensus position X in the R-dimensional space.

Di = SQRT ( SUM over 1<=r<=R of (Xr - Sr)(Xr - Sr) )

The quality score for the jth position in the alignment is defined as the mean of the sequence distances Di.

The score is normalised by multiplying by the percentage of sequences which have residues (and not gaps) at this position.
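The calculation above can be sketched for a single column as follows. This is an illustration under stated assumptions, not the Clustal X source: the function names are hypothetical, the comparison matrix is passed as a nested dictionary C[a][b], gaps are written as '-', sequences with a gap are left out of the mean distance, and the normalisation uses the fraction (rather than the percentage) of sequences with a residue. Note that the value as literally defined is a mean distance, so it is smaller for better-conserved columns; any rescaling applied before plotting is not described in this section and is not modelled here.

```python
import math
from collections import Counter

GAP = "-"


def column_distances(column, alphabet, C):
    """Return the distance Di from the consensus X for each non-gap residue."""
    m = len(column)
    counts = Counter(res for res in column if res != GAP)  # Fij
    # Xr = (SUM over residues i of Fij * C(i, r)) / m
    X = [sum(counts[i] * C[i][r] for i in alphabet) / m for r in alphabet]
    distances = []
    for res in column:
        if res == GAP:
            continue
        S = [C[r][res] for r in alphabet]  # Sr = C(r, Aij)
        distances.append(math.sqrt(sum((x - s) ** 2 for x, s in zip(X, S))))
    return distances


def column_score(column, alphabet, C):
    """Mean distance for the column, scaled by the fraction of non-gap sequences."""
    distances = column_distances(column, alphabet, C)
    if not distances:
        return 0.0
    occupancy = len(distances) / len(column)
    return (sum(distances) / len(distances)) * occupancy


if __name__ == "__main__":
    # Toy identity matrix over a two-letter alphabet, for illustration only.
    identity = {"A": {"A": 1, "C": 0}, "C": {"A": 0, "C": 1}}
    print(column_score("AAAA", "AC", identity))
    print(column_score("AACC", "AC", identity))
```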

CALCULATION OF RESIDUE EXCEPTIONS

The jth residue of the ith sequence is considered an exception if the distance Di of the sequence from the consensus value X is greater than (Upper Quartile + Inter Quartile Range * Cutoff). The value used as a cutoff for displaying exceptions can be set from the SCORE PARAMETERS menu. A high cutoff value will display only very significant exceptions; a low value will allow more, less significant, exceptions to be highlighted.

(NB. Sequences which contain gaps at this position are not included in the exception calculation.)
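The exception rule can be sketched as below, given the distances Di of the non-gap sequences at one position. The function name is hypothetical, and the quartile convention is an assumption: this sketch uses the default method of Python's statistics.quantiles, which may differ from the convention Clustal X applies.

```python
from statistics import quantiles


def residue_exceptions(distances, cutoff):
    """Return the indices whose distance exceeds upper quartile + IQR * cutoff.

    distances holds Di for the sequences with a residue at this position;
    sequences with a gap are assumed to have been removed already.
    """
    if len(distances) < 2:
        return []  # quartiles are undefined for fewer than two values
    lower_quartile, _, upper_quartile = quantiles(distances, n=4)
    threshold = upper_quartile + (upper_quartile - lower_quartile) * cutoff
    return [i for i, d in enumerate(distances) if d > threshold]


if __name__ == "__main__":
    example = [1.0, 1.1, 0.9, 1.0, 1.2, 5.0]
    print(residue_exceptions(example, cutoff=1))
    print(residue_exceptions(example, cutoff=10))
```

Raising the cutoff raises the threshold, so fewer residues are flagged, which matches the behaviour described for the RESIDUE EXCEPTION CUTOFF option.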

CALCULATION OF LOW-SCORING SEGMENTS

Suppose we have an alignment of m sequences of length n. Then, the alignment can be written as:

A11 A12 A13 .......... A1n

A21 A22 A23 .......... A2n

Am1 Am2 Am3 .......... Amn

We also have a residue comparison matrix of size R where C(i,j) is the score for aligning residue i with residue j.

We calculate sequence weights by building a neighbour-joining tree, in which branch lengths are proportional to divergence. Summing the branches by branch ownership provides the weights. See Thompson et al., CABIOS, 10, 19 (1994) and Henikoff et al., JMB, 243, 574 (1994).

To find the low-scoring segments in a sequence Si, we build a weighted profile of the remaining sequences in the alignment. Suppose we find residue r at position j in the sequence; then the score for the jth position in the sequence is defined as

Score(Si,j) = Profile(j,r)

where Profile(j,r) is the profile score for residue r at position j in the alignment.

These residue scores are summed along the sequence in both forward and backward directions. If the sum of the scores is positive, then it is reset to zero. Segments which score negatively in both directions are considered as 'low-scoring' and will be highlighted in the alignment display.
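The forward and backward summation can be sketched as follows, starting from the per-position scores Score(Si,j) of one sequence. Building the weighted profile is not shown; the scores are taken as given. The function names and the min_length parameter are hypothetical, and marking a position as low-scoring when both running sums are negative at that position is one reading of the description above.

```python
def clipped_running_sum(scores):
    """Running sum along the sequence, reset to zero whenever it becomes positive."""
    total, sums = 0.0, []
    for score in scores:
        total = min(total + score, 0.0)
        sums.append(total)
    return sums


def low_scoring_segments(scores, min_length=1):
    """Return (start, end) index pairs, inclusive, of the low-scoring segments."""
    forward = clipped_running_sum(scores)
    backward = clipped_running_sum(scores[::-1])[::-1]
    low = [f < 0 and b < 0 for f, b in zip(forward, backward)]
    segments, start = [], None
    for pos, flag in enumerate(low + [False]):  # the sentinel closes a final segment
        if flag and start is None:
            start = pos
        elif not flag and start is not None:
            if pos - start >= min_length:
                segments.append((start, pos - 1))
            start = None
    return segments


if __name__ == "__main__":
    profile_scores = [2, 1, -3, -2, -1, 4, 3, 1]
    print(low_scoring_segments(profile_scores))
```

Requiring a negative sum in both directions trims the tail that a one-directional sum would carry past the end of a poorly scoring stretch.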