To test the accuracy and specificity of the method, we used 5-fold cross-validation. A cleavage site prediction is considered correct
if it falls within 5 positions of the annotated cleavage site. The results were:
PLANT DATASET
PredSL
Sequence Type | Sensitivity | Specificity | MCC | Cl.site pred.accuracy |
Chloroplast | 89.6 | 79.1 | 81.6 | 34.2 |
Mitochondrial | 89.1 | 87.1 | 84.2 | 23.2 |
Secreted | 95.5 | 91.8 | 91.2 | 91.4 |
Other | 70.3 | 85.7 | 74.0 | - |
Total | 86.7 | - | - | 50.3 |
TargetP
Sequence Type | Sensitivity | Specificity | MCC | Cl.site pred.accuracy |
Chloroplast | 95.3 | 90.4 | 89.5 | 45.8 |
Mitochondrial | 82.0 | 90.1 | 82.9 | 46.2 |
Secreted | 94.3 | 93.4 | 92.2 | 66.7 |
Other | 89.8 | 89.5 | 87.9 | - |
Total | 90.9 | - | - | 54.0 |
NONPLANT DATASET
PredSL
Sequence Type | Sensitivity | Specificity | MCC | Cl.site pred.accuracy |
Mitochondrial | 87.6 | 83.9 | 80.4 | 18.3 |
Secreted | 93.8 | 91.3 | 89.1 | 93.6 |
Other | 79.5 | 85.9 | 76.6 | - |
Total | 87.1 | - | - | 53.5 |
TargetP
Sequence Type | Sensitivity | Specificity | MCC | Cl.site pred.accuracy |
Mitochondrial | 88.5 | 85.2 | 81.7 | 45.0 |
Secreted | 95.1 | 95.6 | 93.2 | 68.4 |
Other | 83.8 | 86.8 | 79.9 | - |
Total | 89.2 | - | - | 56.5 |
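The per-class figures in the tables above can be derived from one-vs-rest confusion counts. The following is a minimal sketch, not the evaluation code behind these results. The function names and class labels are hypothetical. It assumes specificity is TP / (TP + FP); the source does not state its definition, and some authors use TN / (TN + FP) instead. Values are scaled to percentages to match the tables.

```python
import math


def one_vs_rest_counts(true_labels, predicted_labels, target):
    """Count TP, FP, TN, FN treating `target` as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == target and pred == target:
            tp += 1
        elif truth != target and pred == target:
            fp += 1
        elif truth == target:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def class_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) as percentages.

    Assumption: specificity = TP / (TP + FP). The source does not
    define it, so adjust this line if another definition applies.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return 100 * sensitivity, 100 * specificity, 100 * mcc


# Toy example with made-up labels (not data from the study).
truth = ["cTP", "cTP", "mTP", "SP", "other", "other"]
preds = ["cTP", "mTP", "mTP", "SP", "other", "cTP"]
counts = one_vs_rest_counts(truth, preds, "cTP")
print(class_metrics(*counts))
```

The "Total" sensitivity row would correspond to the overall fraction of correctly classified sequences, which is why no specificity or MCC is listed there.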
Note that sequences used in the TargetP training dataset have not been removed from the dataset above. We therefore also tested PredSL against a test set built from the sequences that redundancy reduction removed from the initial dataset, excluding those included in the TargetP training dataset. In addition, we trained PredSL with the TargetP training datasets and tested it against the same test set.
The results are:
PLANT DATASET
PredSL
Sequence Type | Sensitivity | Specificity | MCC | Cl.site pred.accuracy |
Chloroplast | 92.0 | 88.7 | 88.5 | 34.2 |
Mitochondrial | 93.7 | 84.4 | 86.9 | 23.2 |
Secreted | 97.7 | 90.7 | 91.7 | 91.4 |
Other | 81.5 | 95.3 | 83.6 | - |
Total | 90.4 | - | - | 50.3 |
PredSL trained with TargetP training datasets
Sequence Type | Sensitivity | Specificity | MCC | Cl.site pred.accuracy |
Chloroplast | 78.1 | 77.5 | 75.0 | 48.3 |
Mitochondrial | 96.1 | 75.3 | 82.1 | 40.8 |
Secreted | 97.3 | 88.8 | 90.0 | 90.8 |
Other | 70.6 | 91.6 | 74.6 | - |
Total | 84.4 | - | - | 63.2 |
TargetP
Sequence Type | Sensitivity | Specificity | MCC | Cl.site pred.accuracy |
Chloroplast | 89.8 | 82.0 | 83.3 | 42.0 |
Mitochondrial | 89.8 | 81.4 | 83.1 | 49.3 |
Secreted | 92.3 | 93.5 | 90.3 | 65.4 |
Other | 84.7 | 93.3 | 84.7 | - |
Total | 88.8 | - | - | 54.0 |
NONPLANT DATASET
PredSL
Sequence Type | Sensitivity | Specificity | MCC | Cl.site pred.accuracy |
Mitochondrial | 94.3 | 90.3 | 88.9 | 45.2 |
Secreted | 95.2 | 94.4 | 92.5 | 93.9 |
Other | 90.4 | 95.4 | 89.9 | - |
Total | 93.3 | - | - | 66.0 |
PredSL trained with TargetP training datasets
Sequence Type | Sensitivity | Specificity | MCC | Cl.site pred.accuracy |
Mitochondrial | 93.8 | 86.2 | 85.6 | 42.1 |
Secreted | 94.8 | 91.6 | 90.1 | 93.8 |
Other | 82.1 | 93.5 | 83.2 | - |
Total | 90.2 | - | - | 67.0 |
TargetP
Sequence Type | Sensitivity | Specificity | MCC | Cl.site pred.accuracy |
Mitochondrial | 93.0 | 90.2 | 88.0 | 50.0 |
Secreted | 97.4 | 95.7 | 94.9 | 91.1 |
Other | 90.0 | 94.5 | 88.8 | - |
Total | 93.5 | - | - | 70.6 |
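The independent test set used for the comparison above amounts to a set difference over sequence identifiers. Here is a minimal sketch under stated assumptions: the function name and the identifiers are hypothetical, and sequences are assumed to be tracked by unique IDs.

```python
def build_independent_test_set(initial_ids, nonredundant_ids, targetp_training_ids):
    """Return IDs removed by redundancy reduction that TargetP was not trained on."""
    # Sequences discarded by redundancy reduction.
    removed = set(initial_ids) - set(nonredundant_ids)
    # Exclude anything present in the TargetP training dataset.
    return removed - set(targetp_training_ids)


# Toy example with made-up identifiers (not data from the study).
initial = ["P1", "P2", "P3", "P4", "P5"]
nonredundant = ["P1", "P2"]
targetp_training = ["P3"]
test_ids = build_independent_test_set(initial, nonredundant, targetp_training)
print(sorted(test_ids))
```

Excluding the TargetP training sequences keeps the comparison from favouring TargetP on proteins it has already seen.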
lTP prediction results
The lumen prediction was tested on the whole dataset, resulting in 91.1% accuracy, compared to 88.8% for the LumenP predictor. The cleavage site prediction is also improved: 88.7% of the sequences have their lTP cleavage site predicted within ±2 residues, compared to 75.1% for the LumenP predictor. These results are based on the complete dataset of 259 sequences.
On a redundancy-reduced dataset of 109 sequences (40% threshold, cd-hit (Li et al, 2001)), the accuracy is 85.3% for predicting the existence of an lTP and 82.4% for predicting the cleavage site, compared to 82.4% and 70.1%, respectively, for the LumenP predictor.
On a redundancy-reduced dataset of 163 sequences (30% threshold, non-red (Bagos et al,.....)), the accuracy is 88.9% for predicting the existence of an lTP and 85.2% for predicting the cleavage site, compared to 87.7% and 75.6%, respectively, for the LumenP predictor.
When tested by cross-validation, the accuracy is 78.7% for predicting the existence of an lTP and 66.1% for predicting the cleavage site, compared to 87.0% and 54.8%, respectively, for the LumenP predictor.
Using a negative test set of 2400 sequences of proteins located in the chloroplast but not in the thylakoid lumen or membranes, only 1.5% were identified as having an lTP.
Dataset | PredSL Prediction of lTP | LumenP Prediction of lTP | PredSL Cl.site pred.accuracy | LumenP Cl.site pred.accuracy |
Complete (259 seq.) | 91.9 | 88.8 | 88.7 | 75.1 |
Reduced with non-red 30% (163 seq.) | 88.9 | 87.7 | 85.2 | 75.6 |
Reduced with cd-hit 40% (109 seq.) | 85.3 | 82.4 | 82.4 | 70.1 |
Cross-validation | 78.7 | 87.0 | 66.1 | 54.8 |
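The cleavage site accuracy columns count a prediction as correct when it lies within a fixed window of the annotated site (±2 residues for the lTP results, 5 positions for the signal and transit peptide results). This is a minimal sketch of that criterion. The function name is hypothetical, and the window is assumed to be inclusive, which the source does not state.

```python
def cleavage_site_accuracy(annotated, predicted, tolerance):
    """Percentage of sequences whose predicted site is within `tolerance` residues.

    `predicted` may contain None where no cleavage site was predicted;
    such entries count as incorrect. The window is assumed inclusive.
    """
    if len(annotated) != len(predicted):
        raise ValueError("annotated and predicted must have the same length")
    if not annotated:
        return 0.0
    hits = sum(
        1
        for a, p in zip(annotated, predicted)
        if p is not None and abs(a - p) <= tolerance
    )
    return 100.0 * hits / len(annotated)


# Toy example with made-up positions (not data from the study).
annotated_sites = [60, 75, 82, 58]
predicted_sites = [61, 78, 82, None]
print(cleavage_site_accuracy(annotated_sites, predicted_sites, tolerance=2))
print(cleavage_site_accuracy(annotated_sites, predicted_sites, tolerance=5))
```

A wider window counts more predictions as correct, so accuracies reported under different tolerances are not directly comparable.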