--> ^OC Eukaryota
--> for the dataset of mitochondrial proteins: FT TRANSIT.*Mitochondrion
--> for the dataset of chloroplast proteins: FT TRANSIT.*Chloroplast
--> for the dataset of secreted proteins: FT SIGNAL
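
The patterns above can be applied to Swiss-Prot flat-file entries with ordinary regular expressions. The following is a minimal sketch, assuming each entry is available as one block of text. The names `EUKARYOTE`, `LOCATION_PATTERNS` and `classify_entry` are illustrative and not part of the original pipeline, and the whitespace after the line codes is matched flexibly (`\s+`) as an assumption about the flat-file layout.

```python
import re

# Patterns from the list above. The names are illustrative, and the
# flexible whitespace (\s+) after "OC"/"FT" is an assumption.
EUKARYOTE = re.compile(r"^OC\s+Eukaryota", re.MULTILINE)
LOCATION_PATTERNS = {
    "mitochondrion": re.compile(r"^FT\s+TRANSIT.*Mitochondrion", re.MULTILINE),
    "chloroplast": re.compile(r"^FT\s+TRANSIT.*Chloroplast", re.MULTILINE),
    "secreted": re.compile(r"^FT\s+SIGNAL", re.MULTILINE),
}


def classify_entry(entry):
    """Return the dataset names whose pattern matches a eukaryotic entry."""
    if not EUKARYOTE.search(entry):
        return []  # non-eukaryotic entries are not considered at all
    return [name for name, pattern in LOCATION_PATTERNS.items()
            if pattern.search(entry)]
```

An entry for which `classify_entry` returns more than one name would correspond to a protein with more than one candidate location.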
We excluded from the datasets the sequences of proteins that had more than one subcellular location assigned to them.
We also excluded sequences that did not begin with M and/or included the ambiguity symbols B, Z, or X.
The sequences of the mitochondrial, chloroplast and secreted proteins that had > or ? in their FT TRANSIT/SIGNAL entry were also excluded.
Finally, the sequences of proteins whose entries contained any of the taxon names 'Cryptophyta, Glaucocystophyceae, Haptophyceae, Rhodophyta, Stramenopiles, Viridiplantae'
were assigned to the plant datasets, whereas the rest formed the non-plant datasets.
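
The exclusion rules and the plant/non-plant split can be sketched as small predicates. This is a hedged illustration, not the original code: the function names are hypothetical, the set of excluded sequence symbols is supplied by the caller, the feature positions are assumed to be the two fields following the feature key, and the taxon names are assumed to be searched for in the OC (taxonomy) lines.

```python
import re

# Taxon names listed in the text; searching only OC lines is an assumption.
PLANT_TAXA = ("Cryptophyta", "Glaucocystophyceae", "Haptophyceae",
              "Rhodophyta", "Stramenopiles", "Viridiplantae")

# Captures the two position fields of FT TRANSIT / FT SIGNAL lines
# (assumed layout: feature key, start position, end position).
FEATURE = re.compile(r"^FT\s+(?:TRANSIT|SIGNAL)\s+(\S+)\s+(\S+)", re.MULTILINE)


def has_uncertain_boundary(entry):
    """True if a TRANSIT/SIGNAL position field contains '>' or '?'."""
    return any(ch in start + end
               for start, end in FEATURE.findall(entry)
               for ch in ">?")


def keep_sequence(sequence, excluded_symbols):
    """True if the sequence starts with M and has no excluded symbol."""
    return sequence.startswith("M") and not set(sequence) & set(excluded_symbols)


def is_plant(entry):
    """True if any listed taxon name occurs in the entry's OC lines."""
    oc_text = " ".join(line[2:] for line in entry.splitlines()
                       if line.startswith("OC"))
    return any(taxon in oc_text for taxon in PLANT_TAXA)
```

An entry would be kept only if `keep_sequence` is true and `has_uncertain_boundary` is false; `is_plant` then decides which of the two datasets receives it.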
After using the program cd-hit (Li et al., 2001) to remove sequences with over 40% sequence similarity, the final training datasets were as follows:
Location            Plant   Non-plant
Chloroplast           249           -
Mitochondrion          62         366
Secretory Pathway     422        5247
Cytoplasm             171        1458
Nucleus               405        3488
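
The redundancy-reduction step that precedes these counts could be reproduced along the following lines. This is only a sketch: the exact cd-hit options used by the authors are not stated, so the 0.4 threshold passed to `-c` (which cd-hit interprets as sequence identity) and the file names are assumptions, and `cd_hit_command` is a hypothetical helper.

```python
def cd_hit_command(input_fasta, output_fasta):
    """Build a cd-hit argument list for a 40% clustering threshold.

    The file names are placeholders supplied by the caller. cd-hit
    expects a word size (-n) of 2 for thresholds between 0.4 and 0.5.
    """
    return ["cd-hit", "-i", input_fasta, "-o", output_fasta,
            "-c", "0.4", "-n", "2"]
```

The returned list can be passed to `subprocess.run(..., check=True)` on a machine where cd-hit is installed.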
For the prediction of thylakoid proteins we used the datasets provided by LumenP (Emanuelsson et al., 2003).