Sensitivity and selectivity of Sencel's search software

The sensitivity and selectivity of the Smith-Waterman and ParAlign algorithms have been extensively evaluated and compared with other commonly used programs. See [4] for details.

The evaluation tested how well the different programs were able to correctly identify which protein domain sequences belong to the same superfamily in the SCOP (Structural Classification Of Proteins) database [8]. The PDB40D-B database provided by Brenner et al. [10] contains 1323 protein domains that have been extracted from the PDB database of proteins with known 3D structure. The sequences in the PDB40D-B database have little sequence similarity to each other (less than 40% identical amino acids). Each of these proteins has been assigned to a superfamily by the classification used in the SCOP database. Each of the 1323 sequences was used as a query in a search against the rest of the protein domains.

The sensitivity of each program was measured by the coverage, which is the fraction of correctly identified homologues (true positives). The coverage indicates what fraction of structurally similar proteins one may expect to identify based on sequence alone using the different programs.

The selectivity of each program was measured by the number of errors, that is, incorrect homology assignments (false positives). The number of errors per query (EPQ) is in general equal to the expect value used as the threshold in the searches.
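
The two measures defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the evaluation code used in [4] or [10]: the `Hit` record, the function name `coverage_and_epq` and the dictionary of superfamily labels are hypothetical, each reported query/subject pair is assumed to appear at most once, and coverage is taken over ordered query/subject pairs.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Tuple


class Hit(NamedTuple):
    """One reported match (hypothetical record layout)."""
    query: str
    subject: str
    evalue: float


def coverage_and_epq(hits: Iterable[Hit],
                     superfamily: Dict[str, str],
                     evalue_threshold: float) -> Tuple[float, float]:
    """Return (coverage, errors per query) at a given expect-value threshold.

    `superfamily` maps every sequence id to its superfamily label.
    """
    # Number of true homologue pairs that could be found: for each query,
    # every other sequence in the same superfamily.
    sizes = Counter(superfamily.values())
    total_true = sum(sizes[label] - 1 for label in superfamily.values())

    true_pos = 0
    false_pos = 0
    for hit in hits:
        # Skip self-matches and hits worse than the threshold.
        if hit.query == hit.subject or hit.evalue > evalue_threshold:
            continue
        if superfamily[hit.query] == superfamily[hit.subject]:
            true_pos += 1
        else:
            false_pos += 1

    coverage = true_pos / total_true if total_true else 0.0
    epq = false_pos / len(superfamily)  # every sequence is used as a query
    return coverage, epq


# Toy data: two superfamilies with two members each.
labels = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
hits = [
    Hit("a", "b", 1e-5),   # true positive
    Hit("b", "a", 1e-4),   # true positive
    Hit("a", "c", 0.5),    # false positive
    Hit("c", "d", 5.0),    # true homologue, but above the threshold
]
result = coverage_and_epq(hits, labels, evalue_threshold=1.0)
```

With this toy data, two of the four possible homologue pairs are found and one error is made across four queries.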

In the graph below, the coverage is plotted against the number of errors per query (EPQ).
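
A curve of this kind can be traced by sorting all reported hits from most to least significant and accumulating true and false positives along the way. The following is a minimal sketch under assumptions: the function name `coverage_vs_epq` and the `(evalue, is_true_homologue)` input layout are hypothetical, and ties in expect value are not treated specially.

```python
from typing import Iterable, List, Tuple


def coverage_vs_epq(scored_pairs: Iterable[Tuple[float, bool]],
                    total_true: int,
                    n_queries: int) -> List[Tuple[float, float]]:
    """Return (EPQ, coverage) points, one per hit, in order of significance.

    scored_pairs: (expect value, is_true_homologue) for every reported hit.
    total_true:   number of true homologue pairs that could be found.
    n_queries:    number of query sequences.
    """
    points = []
    true_pos = 0
    false_pos = 0
    # Lower expect value means a more significant hit, so sort ascending.
    for _evalue, is_true in sorted(scored_pairs, key=lambda pair: pair[0]):
        if is_true:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / n_queries, true_pos / total_true))
    return points


# Toy data: three true homologue hits and one error.
pairs = [(1e-5, True), (0.5, False), (1e-4, True), (5.0, True)]
curve = coverage_vs_epq(pairs, total_true=4, n_queries=4)
```

Each point gives the coverage reached once a given number of errors per query has been accepted; a program whose curve lies higher finds more true homologues at the same error rate.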

The lines representing ParAlign (blue) and Sencel's Smith-Waterman implementation (SWMMX) (green) more or less overlap on the graph, indicating that their results are nearly identical. Together with SSEARCH (orange), they perform best. FASTA with ktup 1 (pink) follows closely behind. The performance of BLAST (red) and of FASTA with ktup 2 (black) is clearly inferior. The difference between SSEARCH and Sencel's Smith-Waterman implementation is due to the different statistical models used.

Test conditions: The test was carried out as described by Brenner et al. [10]. The sensitivity and selectivity test was performed using the 1323 protein domain sequences in the PDB40D-B database provided by Brenner et al., derived from the PDB and SCOP [8] databases. The BLOSUM62 score matrix [7] was used in combination with a gap penalty of 11+k, where k is the gap length. Default statistics and no prefiltering were used. Program versions: ParAlign version 3.3.6 [3, 4], NCBI BLAST version 2.2.6 [6, 9], FASTA [5] and SSEARCH [11] version 3.4t23b2.
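
As a small worked example of the gap penalty 11+k stated above, the sketch below computes the cost of a gap of length k. The function name `gap_penalty` and its parameter names are illustrative only and are not options of any of the programs tested.

```python
def gap_penalty(k: int, gap_open: int = 11, gap_extend: int = 1) -> int:
    """Penalty for a gap of length k under the scheme 11+k.

    A fixed cost of 11 is charged for the gap itself, plus 1 for each
    gapped position, so the penalty grows linearly with the gap length.
    """
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * k


penalties = [gap_penalty(k) for k in (1, 2, 5)]
```

A single-residue gap therefore costs 12, a two-residue gap 13, and a five-residue gap 16.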