We trained several neural networks using NevProp4 (Goodman et al., 1996), a C implementation of the quickprop algorithm.
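For reference, the core of the quickprop update rule (Fahlman, 1988), which NevProp4 implements, can be sketched as below. This is an illustrative sketch only, not NevProp4's code; the variable names and the 1.75 maximum-growth-factor default are our additions.

```python
# Hedged sketch of one quickprop weight update. S(t) = dE/dw at step t.
# Quickprop fits a parabola through the current and previous gradients
# and jumps toward its estimated minimum.
def quickprop_step(w, grad, prev_grad, prev_step, lr=0.01, mu=1.75):
    """Return (new_weight, step_taken).

    mu caps how much a step may exceed the previous one
    (1.75 is Fahlman's suggested default).
    """
    if prev_step == 0.0:
        step = -lr * grad                     # first step: plain gradient descent
    else:
        denom = prev_grad - grad
        if denom != 0.0:
            step = prev_step * grad / denom   # parabola jump
        else:
            step = -lr * grad                 # flat gradient change: fall back
        if abs(step) > mu * abs(prev_step):   # clip runaway steps
            step = mu * abs(prev_step) * (1 if step > 0 else -1)
    return w + step, step
```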


First set of neural networks, A and B:
(to decide whether a residue belongs to a chloroplast transit peptide (cTP) or a mitochondrial transit peptide (mTP), respectively)


For the training set we used equal numbers of positive and negative sequences. The positive data consisted of the chloroplast and mitochondrial protein sequences for networks A and B respectively; the negative data consisted of equal numbers of sequences from each of the other categories. Note that for the plant mitochondrial networks we used all the mitochondrial sequences (the non-plant ones as well) as positives, since the plant mitochondrial dataset alone is too small. The architecture of the neural networks was:

Input units
Over the N-terminal 100 amino-acid residues of each sequence, we used sliding windows of 55 and 35 residues for networks A and B respectively. Each residue was represented by 20 nodes, of which only the one corresponding to the appropriate amino acid was switched on; the rest remained off.
Example: Ala: 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Hidden units : 4

Epochs of training : 200

Learning Rate : 0.01

Output units : 1 for the positive sets of inputs, 0 for the negative.
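The sparse residue encoding and window extraction described above can be sketched as follows. This is an illustrative sketch, not the actual NevProp4 input code; the amino-acid ordering in AMINO_ACIDS is our assumption (the text only specifies one node per amino acid).

```python
# One node per amino acid; which node maps to which residue is assumed here.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def encode_window(window):
    """Map a string of residues to a flat 0/1 vector, 20 nodes per residue."""
    vec = []
    for aa in window:
        nodes = [0] * 20
        nodes[AMINO_ACIDS.index(aa)] = 1   # switch on only the matching node
        vec.extend(nodes)
    return vec

def windows(seq, size, n_terminal=100):
    """Yield every window of `size` residues from the N-terminal region."""
    region = seq[:n_terminal]
    for i in range(len(region) - size + 1):
        yield region[i:i + size]
```

For network A (window size 55), a 100-residue N-terminus yields 46 windows, each encoded as a 55 x 20 = 1100-node input vector.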


Second set of neural networks, A1 and B1:
(to decide whether the sequence has a cTP or an mTP)


We used the same training dataset as before. The architecture of the neural networks was:

Input units
The scores from the first neural networks for the N-terminal 100 amino-acid residues of each sequence.

Hidden units : 6

Epochs of training : 200

Learning Rate : 0.01

Output units : 1 for the positive sets of inputs, 0 for the negative.
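Assembling the input to networks A1/B1, i.e. the per-residue scores produced by the first-stage network over the N-terminal 100 residues, can be sketched as below. `score_residue` is a hypothetical stand-in for the trained network A (or B), and padding short sequences with 0.0 is our assumption; the text does not say how sequences shorter than 100 residues are handled.

```python
def stage_two_input(seq, score_residue, n_terminal=100):
    """Return one first-stage score per N-terminal residue.

    score_residue(seq, i) is a placeholder for the first-stage network's
    score at position i. Sequences shorter than n_terminal are padded
    with 0.0 (an assumption, not stated in the text).
    """
    scores = [score_residue(seq, i) for i in range(min(len(seq), n_terminal))]
    scores += [0.0] * (n_terminal - len(scores))
    return scores
```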

Third set of neural networks, a1-5 and b1-5:
(for the prediction of the cleavage site of the cTP and mTP respectively)


The dataset used to build the training sets was created by removing from the initial dataset all sequences whose 'FT TRANSIT' entry contained the words 'Probable', 'By similarity' or 'Potential', and then using cd-hit (Li et al., 2003) to reduce redundancy. This left the following datasets:

Location        Plant   Non-plant
Chloroplast      122        -
Mitochondrion     91      241

The training dataset, as before, consisted of equal numbers of positive and negative sequences. To make the predictions less dependent on any single training run, we trained five different networks for each case (cTP and mTP) and used the average prediction score.
The architecture of the neural networks is:

Input units
For the positive sequences we used a window of 27 residues (-20 to +6) around the annotated cleavage site for the a1-5 networks, and 21 residues (-12 to +8) for the b1-5 networks. For the negative sequences we used windows of the same size around random positions in the sequences. As in the first set of networks, each residue was represented by 20 nodes.

Hidden units : 9

Epochs of training : 200

Learning Rate : 0.01

Output units : 1 for the positive sets of inputs, 0 for the negative.
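The window extraction around the cleavage site and the averaging over the five trained networks can be sketched as below. The helper names are ours, and representing the networks as callables is an assumption for illustration.

```python
import random

def cleavage_window(seq, site, before=20, after=6):
    """Residues at positions -before..+after (inclusive) around the site.

    Defaults give the 27-residue (-20/+6) window of networks a1-5;
    before=12, after=8 gives the 21-residue window of networks b1-5.
    """
    return seq[site - before:site + after + 1]

def random_negative_window(seq, before=20, after=6, rng=random):
    """A same-sized window around a random position (negative example)."""
    pos = rng.randrange(before, len(seq) - after)
    return cleavage_window(seq, pos, before, after)

def ensemble_score(nets, x):
    """Average the prediction scores of the trained networks."""
    return sum(net(x) for net in nets) / len(nets)
```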

Final neural network:
(to produce the final localization prediction)


As training set we used equal numbers of sequences from each category. The architecture of the neural network was:

Input units
The scores from all the methods used: 13 scores for plant sequences and 7 scores for non-plant sequences.

Hidden units : 6 for the plant sequences, 5 for the non-plant sequences.

Epochs of training : 200

Learning Rate : 0.01

Output units : for the plant sequences:
  1 0 0 for the inputs of chloroplast proteins,
  0 1 0 for the inputs of mitochondrial proteins,
  0 0 1 for the inputs of secreted proteins and
  0 0 0 for the inputs of the cytoplasmic and nuclear sets (other).

for the non-plant sequences:
  1 0 for the inputs of mitochondrial proteins,
  0 1 for the inputs of secreted proteins and
  0 0 for the inputs of the cytoplasmic and nuclear sets (other).
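Turning the final network's output activations into a localization label can be sketched as below. The 0.5 threshold and picking the maximally activated unit are our assumptions; the text only specifies the target encodings listed above.

```python
PLANT_CLASSES = ["chloroplast", "mitochondrion", "secreted"]

def decode_plant(outputs, threshold=0.5):
    """Map the three plant output-unit activations to a class label.

    If no unit exceeds the (assumed) threshold, the all-zero encoding
    is interpreted as 'other' (cytoplasmic/nuclear), as in the targets.
    """
    best = max(range(3), key=lambda i: outputs[i])
    return PLANT_CLASSES[best] if outputs[best] >= threshold else "other"
```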