Training dataset collection

From release 3.5 of UniProt (Apweiler et al., 2005), we collected the sequences of all proteins whose entries had the following features:
--> ^OC Eukaryota
--> for the dataset of mitochondrial proteins: FT TRANSIT.*Mitochondrion
--> for the dataset of chloroplast proteins: FT TRANSIT.*Chloroplast
--> for the dataset of secreted proteins: FT SIGNAL
--> for the dataset of cytoplasmic proteins: CC -!- SUBCELLULAR LOCATION.*Cytoplasmic.*
--> for the dataset of nuclear proteins: CC -!- SUBCELLULAR LOCATION.*Nuclear.*
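The selection rules above can be sketched as regular expressions applied to the text of a single flat-file entry. This is a minimal sketch, not the original pipeline: the dataset labels and the function name `matching_datasets` are hypothetical, and the patterns allow a run of spaces after the line code (an assumption about the flat-file layout, since the patterns as listed show a single space).

```python
import re

# Entry must belong to Eukaryota (pattern "^OC Eukaryota" from the list above).
EUKARYOTE = re.compile(r"^OC +Eukaryota", re.M)

# One pattern per dataset; the keys are labels chosen for this sketch.
LOCATION_PATTERNS = {
    "mitochondrion": re.compile(r"^FT +TRANSIT.*Mitochondrion", re.M),
    "chloroplast": re.compile(r"^FT +TRANSIT.*Chloroplast", re.M),
    "secreted": re.compile(r"^FT +SIGNAL", re.M),
    "cytoplasm": re.compile(r"^CC +-!- +SUBCELLULAR LOCATION.*Cytoplasmic", re.M),
    "nucleus": re.compile(r"^CC +-!- +SUBCELLULAR LOCATION.*Nuclear", re.M),
}


def matching_datasets(entry: str) -> list:
    """Return the labels of every dataset whose pattern matches the entry."""
    if not EUKARYOTE.search(entry):
        return []
    return [name for name, pattern in LOCATION_PATTERNS.items()
            if pattern.search(entry)]
```

An entry that matches more than one pattern returns several labels, which makes it easy to spot proteins with more than one assigned location.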

We excluded from the datasets the sequences of proteins that had more than one subcellular location assigned to them. We also excluded sequences that did not begin with M or that included the ambiguous residue symbols B, Z, or X.

The sequences of mitochondrial, chloroplast and secreted proteins that had > or ? in their FT TRANSIT/SIGNAL entry were also excluded.
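The sequence-level exclusions can be sketched as two small predicates. This is only an illustration under stated assumptions: the function names are hypothetical, the symbols to exclude are passed as a parameter (the default used here, the ambiguity codes B, Z and X, is an assumption of the sketch), and a > or ? anywhere on an FT TRANSIT or FT SIGNAL line counts as a hit.

```python
import re

# An FT TRANSIT or FT SIGNAL line that contains '>' or '?' anywhere on the line.
UNCERTAIN_FEATURE = re.compile(r"^FT +(?:TRANSIT|SIGNAL)\b.*[>?]", re.M)


def passes_sequence_filters(sequence: str, excluded_symbols: str = "BZX") -> bool:
    """True if the sequence starts with M and has none of the excluded symbols."""
    if not sequence.startswith("M"):
        return False
    return not any(symbol in sequence for symbol in excluded_symbols)


def has_uncertain_feature(entry: str) -> bool:
    """True if a TRANSIT or SIGNAL feature line carries '>' or '?'."""
    return UNCERTAIN_FEATURE.search(entry) is not None
```

A sequence would be kept only when `passes_sequence_filters` returns True and, for the mitochondrial, chloroplast and secreted datasets, `has_uncertain_feature` returns False.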

Finally, the sequences of proteins whose entries contained any of the terms 'Cryptophyta, Glaucocystophyceae, Haptophyceae, Rhodophyta, Stramenopiles, Viridiplantae' were assigned to the plant datasets, whereas the rest formed the non-plant datasets.
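The plant/non-plant split amounts to a keyword test on each entry. A minimal sketch, assuming the keywords are searched in the full entry text (restricting the search to the OC taxonomy lines would be a natural variation, but the text does not say which was used); `is_plant_entry` is a hypothetical name.

```python
# Taxonomic terms listed above that send an entry to the plant datasets.
PLANT_TERMS = (
    "Cryptophyta",
    "Glaucocystophyceae",
    "Haptophyceae",
    "Rhodophyta",
    "Stramenopiles",
    "Viridiplantae",
)


def is_plant_entry(entry: str) -> bool:
    """True if the entry text mentions any of the plant-group terms."""
    return any(term in entry for term in PLANT_TERMS)
```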

After using the program cd-hit (Li et al., 2001) to remove sequences with over 40% sequence similarity, the final training datasets were as follows:
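The redundancy-reduction step could be driven from Python along the following lines. The exact options used for the datasets are not stated here, so this is only a sketch: the file names are placeholders, `cdhit_command` and `run_cdhit` are hypothetical helpers, and the 40% cut-off is mapped to cd-hit's sequence identity threshold (`-c 0.4`) with word length `-n 2`, the value the cd-hit documentation recommends for thresholds between 0.4 and 0.5.

```python
import subprocess


def cdhit_command(fasta_in: str, fasta_out: str, identity: float = 0.4) -> list:
    """Build the argument list for a cd-hit run at the given identity threshold."""
    return [
        "cd-hit",
        "-i", fasta_in,       # input FASTA file
        "-o", fasta_out,      # output FASTA file with representative sequences
        "-c", str(identity),  # sequence identity threshold
        "-n", "2",            # word length suited to thresholds of 0.4 to 0.5
    ]


def run_cdhit(fasta_in: str, fasta_out: str, identity: float = 0.4) -> None:
    """Run cd-hit; requires the cd-hit executable to be on the PATH."""
    subprocess.run(cdhit_command(fasta_in, fasta_out, identity), check=True)
```

Building the argument list separately from running it keeps the command easy to inspect or log before any file is processed.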


Location             Plant    Non-plant
Chloroplast            249            -
Mitochondrion           62          366
Secretory Pathway      422         5247
Cytoplasm              171         1458
Nucleus                405         3488



For the prediction of thylakoid proteins, we used the datasets provided by LumenP (Emanuelsson et al., 2003).