PredSL

To test the accuracy and specificity of the method, we used 5-fold cross-validation. A predicted cleavage site is counted as correct if it lies within 5 positions of the annotated cleavage site. The results were:

PLANT DATASET

PredSL

Sequence Type Sensitivity Specificity MCC Cl.site pred.accuracy

Chloroplast 89.6 79.1 81.6 34.2

Mitochondrial 89.1 87.1 84.2 23.2

Secreted 95.5 91.8 91.2 91.4

Other 70.3 85.7 74.0 -

Total 86.7 - - 50.3

TargetP

Sequence Type Sensitivity Specificity MCC Cl.site pred.accuracy

Chloroplast 95.3 90.4 89.5 45.8

Mitochondrial 82.0 90.1 82.9 46.2

Secreted 94.3 93.4 92.2 66.7

Other 89.8 89.5 87.9 -

Total 90.9 - - 54.0

NONPLANT DATASET

PredSL

Sequence Type Sensitivity Specificity MCC Cl.site pred.accuracy

Mitochondrial 87.6 83.9 80.4 18.3

Secreted 93.8 91.3 89.1 93.6

Other 79.5 85.9 76.6 -

Total 87.1 - - 53.5

TargetP

Sequence Type Sensitivity Specificity MCC Cl.site pred.accuracy

Mitochondrial 88.5 85.2 81.7 45.0

Secreted 95.1 95.6 93.2 68.4

Other 83.8 86.8 79.9 -

Total 89.2 - - 56.5
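
The evaluation criteria behind the tables above can be illustrated with a minimal Python sketch. It assumes that predicted and annotated cleavage sites are given as integer sequence positions and that per-class measures are computed one-vs-rest. The page does not state its formulas, so the definitions below are assumptions: sensitivity as TP/(TP+FN), specificity taken here as TP/(TP+FP), and the standard Matthews correlation coefficient (MCC). The function names are hypothetical and are not part of PredSL.

```python
from math import sqrt


def cleavage_site_accuracy(predicted, annotated, tolerance=5):
    """Fraction of sequences whose predicted cleavage site lies within
    `tolerance` positions of the annotated site."""
    if len(predicted) != len(annotated):
        raise ValueError("predicted and annotated must have the same length")
    if not annotated:
        return 0.0
    hits = sum(abs(p - a) <= tolerance for p, a in zip(predicted, annotated))
    return hits / len(annotated)


def one_vs_rest_metrics(true_labels, predicted_labels, target):
    """Sensitivity, specificity and MCC for one class against all others.

    Assumption: specificity is computed as TP / (TP + FP); the source
    page does not define it.
    """
    tp = fp = fn = tn = 0
    for t, p in zip(true_labels, predicted_labels):
        if t == target and p == target:
            tp += 1
        elif t != target and p == target:
            fp += 1
        elif t == target and p != target:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tp / (tp + fp) if (tp + fp) else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc
```

All values are returned as fractions; multiplying by 100 gives percentages as in the tables. Calling `cleavage_site_accuracy` with `tolerance=2` corresponds to the stricter 2-residue window used for the lTP cleavage sites.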

Note that sequences included in the training dataset used by TargetP were not removed from the datasets above. We therefore also tested PredSL against a test set built from the sequences that the redundancy reduction had removed from the initial dataset, excluding those included in the training dataset used by TargetP. In addition, we trained PredSL with the training datasets used by TargetP and tested it against the same test set.
The results are:

PLANT DATASET

PredSL

Sequence Type Sensitivity Specificity MCC Cl.site pred.accuracy

Chloroplast 92.0 88.7 88.5 34.2

Mitochondrial 93.7 84.4 86.9 23.2

Secreted 97.7 90.7 91.7 91.4

Other 81.5 95.3 83.6 -

Total 90.4 - - 50.3

PredSL trained with TargetP training datasets

Sequence Type Sensitivity Specificity MCC Cl.site pred.accuracy

Chloroplast 78.1 77.5 75.0 48.3

Mitochondrial 96.1 75.3 82.1 40.8

Secreted 97.3 88.8 90.0 90.8

Other 70.6 91.6 74.6 -

Total 84.4 - - 63.2

TargetP

Sequence Type Sensitivity Specificity MCC Cl.site pred.accuracy

Chloroplast 89.8 82.0 83.3 42.0

Mitochondrial 89.8 81.4 83.1 49.3

Secreted 92.3 93.5 90.3 65.4

Other 84.7 93.3 84.7 -

Total 88.8 - - 54.0

NONPLANT DATASET

PredSL

Sequence Type Sensitivity Specificity MCC Cl.site pred.accuracy

Mitochondrial 94.3 90.3 88.9 45.2

Secreted 95.2 94.4 92.5 93.9

Other 90.4 95.4 89.9 -

Total 93.3 - - 66.0

PredSL trained with TargetP training datasets

Sequence Type Sensitivity Specificity MCC Cl.site pred.accuracy

Mitochondrial 93.8 86.2 85.6 42.1

Secreted 94.8 91.6 90.1 93.8

Other 82.1 93.5 83.2 -

Total 90.2 - - 67.0

TargetP

Sequence Type Sensitivity Specificity MCC Cl.site pred.accuracy

Mitochondrial 93.0 90.2 88.0 50.0

Secreted 97.4 95.7 94.9 91.1

Other 90.0 94.5 88.8 -

Total 93.5 - - 70.6
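
The construction of the independent test set used for the tables above amounts to two set differences, which can be sketched as follows. This is only an illustration: it assumes each sequence is identified by a unique ID, and the function name and the IDs in the usage example are hypothetical.

```python
def build_independent_test_set(initial_ids, nonredundant_ids, targetp_training_ids):
    """Return the IDs removed by redundancy reduction, minus any that
    also appear in the TargetP training data."""
    # Sequences that the redundancy reduction dropped from the initial dataset.
    removed_by_reduction = set(initial_ids) - set(nonredundant_ids)
    # Exclude sequences that TargetP was trained on.
    return sorted(removed_by_reduction - set(targetp_training_ids))


# Hypothetical IDs, for illustration only.
initial = ["P1", "P2", "P3", "P4", "P5"]
nonredundant = ["P1", "P3"]
targetp_training = ["P4"]
test_set = build_independent_test_set(initial, nonredundant, targetp_training)
```

With these inputs, `P2`, `P4` and `P5` are the sequences removed by the reduction, and `P4` is then dropped because it belongs to the TargetP training data.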

lTP prediction results

The lumen prediction was tested on the whole dataset, resulting in 91.1% accuracy, compared to 88.8% for the LumenP predictor. The cleavage site prediction is also improved: 88.7% of the sequences have their lTP cleavage site predicted within ±2 residues, compared to 75.1% for the LumenP predictor. These results are based on the complete dataset of 259 sequences.

On a redundancy-reduced dataset of 109 sequences (40% threshold, cd-hit (Li et al, 2001)), we obtain 85.3% accuracy in predicting the existence of an lTP and 82.4% accuracy in predicting the cleavage site, compared to 82.4% and 70.1%, respectively, for the LumenP predictor. On a redundancy-reduced dataset of 163 sequences (30% threshold, non-red (Bagos et al,.....)), we obtain 88.9% accuracy in predicting the existence of an lTP and 85.2% accuracy in predicting the cleavage site, compared to 87.7% and 75.6%, respectively, for the LumenP predictor. When tested by cross-validation, we obtain 78.7% accuracy in predicting the existence of an lTP and 66.1% accuracy in predicting the cleavage site, compared to 87.0% and 54.8%, respectively, for the LumenP predictor.

Using a negative test set of 2400 sequences of proteins located in the chloroplast but not in the thylakoid lumen or membranes, only 1.5% were identified as having an lTP.

Dataset PredSL Prediction of lTP LumenP Prediction of lTP PredSL Cl.site pred.accuracy LumenP Cl.site pred.accuracy

Complete (259 seq.) 91.9 88.8 88.7 75.1

Reduced with non-red 30% (163 seq.) 88.9 87.7 85.2 75.6

Reduced with cd-hit 40% (109 seq.) 85.3 82.4 82.4 70.1

Cross-validation 78.7 87.0 66.1 54.8