Sensitivity and selectivity of Sencel's search software
The sensitivity and selectivity of the Smith-Waterman and ParAlign
algorithms have been extensively evaluated and compared with other
commonly used programs. See [4] for details.
The evaluation was carried out by testing how well the different
programs were able to correctly identify which protein domain sequences
belonged to the same superfamily in the SCOP
(Structural Classification Of Proteins) database [8].
The PDB40D-B database provided by Brenner et al. [10]
contains 1323 protein domains that have been extracted from the PDB
database of proteins with known 3D structure. The sequences in the
PDB40D-B database have little sequence similarity to each other (less
than 40% identical amino acids). Each of these proteins has been
assigned to a superfamily by the classification used in the SCOP
database. Each of these 1323 sequences was used as a query in
a search against the rest of the protein domains.
The sensitivity of each program was measured by the coverage, which
is the fraction of correctly identified homologues (true positives).
The coverage indicates what fraction of structurally similar proteins
one may expect to identify based on sequence alone using the different
programs.
The selectivity of each program was measured by the number of errors,
that is, incorrect homology assignments (false positives). The number
of errors per query (EPQ) is in general equal to the expect value
used as the threshold in the searches.
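The two measures defined above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used for the test. It assumes the hits from all queries are pooled and ranked by expect value, and the names coverage_vs_epq, hits, total_true_pairs and num_queries are hypothetical.

```python
def coverage_vs_epq(hits, total_true_pairs, num_queries):
    """Return (EPQ, coverage) points for a pooled, ranked list of hits.

    hits: iterable of (e_value, is_homolog) tuples from all queries
          (assumed input format, chosen for this sketch).
    total_true_pairs: number of true homologous relationships that
          could be found in the database.
    num_queries: number of query sequences searched.
    """
    curve = []
    true_pos = 0
    false_pos = 0
    # Walk through the hits from most to least significant.
    for e_value, is_homolog in sorted(hits, key=lambda h: h[0]):
        if is_homolog:
            true_pos += 1
        else:
            false_pos += 1
        epq = false_pos / num_queries          # errors per query
        coverage = true_pos / total_true_pairs  # fraction of homologues found
        curve.append((epq, coverage))
    return curve


# Toy data: four hits pooled from two queries, four true relationships.
example_hits = [(0.001, True), (0.01, True), (0.5, False), (1.0, True)]
curve = coverage_vs_epq(example_hits, total_true_pairs=4, num_queries=2)
```

Each point in the returned list corresponds to one possible expect-value threshold, so plotting the second element against the first gives a coverage versus EPQ curve of the kind discussed here.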
In the graph below, the coverage is plotted against the number
of errors per query (EPQ).

The lines representing ParAlign (blue) and Sencel's Smith-Waterman
implementation (SWMMX) (green) more or less overlap on the
graph, indicating that their results are nearly identical. Together
with SSEARCH (orange) they perform best. FASTA with ktup 1 (pink)
follows close behind. The performances of BLAST (red) and FASTA with
ktup 2 (black) are clearly inferior. The difference between SSEARCH
and Sencel's Smith-Waterman implementation is due to the different
statistical models used.
Test conditions: The test was carried out as described by
Brenner et al. [10]. The sensitivity and selectivity test was performed
using the 1323 protein domain sequences in the PDB40D-B database
provided by Brenner et al., derived from the PDB and SCOP [8] databases.
The BLOSUM62 score matrix [7] was used in combination with a gap penalty
of 11+k, where k is the gap length. Default statistics and no
prefiltering were used. Program versions: ParAlign version 3.3.6 [3, 4],
NCBI BLAST version 2.2.6 [6, 9], FASTA [5] and SSEARCH [11] version
3.4t23b2.
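To make the gap penalty of 11+k concrete, the following minimal sketch computes the cost of a single gap as a function of its length. The function name gap_penalty and its parameter names are hypothetical, and the defaults simply restate the 11+k formula given in the test conditions.

```python
def gap_penalty(k, base=11, per_residue=1):
    """Penalty for one gap of length k under the 11+k scheme.

    base and per_residue are illustrative parameter names; with the
    defaults the result is 11 + k.
    """
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return base + per_residue * k


# A gap of one residue costs 12, a gap of five residues costs 16.
penalties = [gap_penalty(k) for k in (1, 2, 5)]
```

The penalty is subtracted from the alignment score, so under this scheme one long gap is much cheaper than several short gaps of the same total length.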