Statistics
The following tables show statistical information about the SUS-BAR database.
Table 1 breaks down the total number of pig proteins, as retrieved from the current releases of the databases, by the number of sequences annotated with Gene Ontology (GO) terms from each of the three main roots (Molecular Function (MF), Biological Process (BP), Cellular Component (CC)), with GO terms of any root (All-GO), with Pfam domains (Pfam), with Pfam domains and/or GO terms (Pfam & All-GO), and with a structure in the Protein Data Bank (PDB). Sequences are also listed by the UniProtKB section from which they were retrieved (SwissProt, manually annotated and reviewed, or TrEMBL, automatically annotated).
Table 1. Annotation of the PIG proteome in UniProtKB and Ensembl
| Source | | MF | BP | CC | All-GO | Pfam° | Pfam & All-GO | PDB* |
|---|---|---|---|---|---|---|---|---|
| SwissProt [1,406] | Sequences | 1,159 | 966 | 1,259 | 1,377 | 1,288 | 1,392 | 112 |
| | Terms | 765 | 1,234 | 262 | 2,261 | 961 | 3,222 | - |
| TrEMBL [18,170] | Sequences | 9,514 | 4,556 | 5,817 | 11,230 | 13,092 | 14,934 | 0 |
| | Terms | 895 | 983 | 247 | 2,125 | 3,712 | 5,837 | - |
| Ensembl [15,805] | Sequences | 12,369 | 11,295 | 10,500 | 13,583 | 12,981 | 13,832 | 77 |
| | Terms | 2,632 | 6,867 | 947 | 10,446 | 4,150 | 14,596 | - |
| Total [35,381] | Sequences | 23,042 | 16,817 | 17,576 | 26,190 | 27,361 | 30,158 | 189 |
| | Terms | 2,657 | 6,890 | 949 | 10,496 | 4,324 | 14,820 | - |
With our method, all pig protein sequences are aligned against the BAR+ database, and each may fall into a cluster that carries statistically validated annotation (P value < 0.01) for a specific GO term or Pfam domain. This is the case for 26,320 pig protein sequences, while 9,061 remain singletons and retain their UniProtKB or Ensembl annotation (when present). 83% of the sequences that fall into clusters align to clusters endowed with statistically validated annotation, and they inherit all the statistically validated GO terms and/or Pfam domains of the cluster. In Table 2, the symbol ° marks terms that are statistically validated and have an experimental evidence code.
Table 2. Statistically validated annotation of the PIG proteome in SUS-BAR
| | | MF | MF° | BP | BP° | CC | CC° | All-GO | All-GO° | Pfam | Pfam & All-GO | *PDB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cluster [26,320] | Sequences | 17,152 | 12,380 | 16,567 | 13,359 | 16,571 | 13,187 | 19,820 | 15,785 | 20,690 | 21,793 | 9,383 |
| | Clusters | 6,929 | 3,973 | 6,482 | 4,599 | 6,442 | 4,497 | 8,578 | 5,974 | 9,212 | 9,941 | 3,528 |
| | Terms | 3,668 | 3,069 | 10,325 | 9,896 | 1,369 | 1,234 | 15,362 | 14,199 | 3,941 | 19,303 | - |
| §Singleton [9,061] | Sequences | 4,596 | 11 | 3,095 | 8 | 3,084 | 10 | 5,280 | 16 | 5,697 | 6,406 | 30 |
| | Terms | 1,090 | 33 | 2,966 | 184 | 552 | 43 | 4,608 | 260 | 2,058 | 6,666 | - |
| Total [35,381] | Sequences | 21,748 | 12,391 | 19,662 | 13,367 | 19,655 | 13,197 | 25,100 | 15,801 | 26,387 | 28,199 | 9,413 |
| | Terms | 3,730 | 3,070 | 10,533 | 9,900 | 1,393 | 1,235 | 15,656 | 14,205 | 4,220 | 19,876 | - |
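The figures quoted for Table 2 can be cross-checked with a few lines of arithmetic. The sketch below copies the "Sequences" rows of Table 2 and verifies that the Cluster and Singleton rows add up to the Total row in every column. It also recomputes the share of clustered sequences that carry Pfam and/or GO annotation. Reading the 83% figure as the "Pfam & All-GO" column divided by the 26,320 clustered sequences is an assumption made here, not something the page states; the variable names are illustrative only.

```python
# Counts copied from the text and from the "Sequences" rows of Table 2.
TOTAL_PIG = 35_381   # all pig protein sequences
IN_CLUSTER = 26_320  # sequences that fall into a BAR+ cluster
SINGLETON = 9_061    # sequences that remain singletons

# Column order: MF, MF°, BP, BP°, CC, CC°, All-GO, All-GO°, Pfam, Pfam & All-GO, PDB
cluster_seqs = [17_152, 12_380, 16_567, 13_359, 16_571, 13_187,
                19_820, 15_785, 20_690, 21_793, 9_383]
singleton_seqs = [4_596, 11, 3_095, 8, 3_084, 10, 5_280, 16, 5_697, 6_406, 30]
total_seqs = [21_748, 12_391, 19_662, 13_367, 19_655, 13_197,
              25_100, 15_801, 26_387, 28_199, 9_413]

# Clustered and singleton sequences should partition the whole proteome.
partition_ok = IN_CLUSTER + SINGLETON == TOTAL_PIG

# Each Total cell should equal Cluster + Singleton in the same column.
rows_ok = all(c + s == t
              for c, s, t in zip(cluster_seqs, singleton_seqs, total_seqs))

# ASSUMPTION: the 83% in the text is "Pfam & All-GO" sequences in clusters
# (index 9) divided by all clustered sequences.
annotated_share = cluster_seqs[9] / IN_CLUSTER

print(partition_ok, rows_ok, f"{annotated_share:.1%}")
```

Under that assumption the computed share rounds to 83%, which is consistent with the figure given in the text.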
In Table 3 the effect of our annotation procedure is shown for sequences without any annotation in UniProtKB and Ensembl.
Table 3. SUS-BAR annotation of pig protein sequences not annotated in UniProtKB and Ensembl
| Source | | MF | BP | CC | All-GO | Pfam | Pfam & All-GO | *PDB |
|---|---|---|---|---|---|---|---|---|
| UniProtKB | Sequences | 285 | 418 | 456 | 607 | 234 | 666 | 124 |
| | Clusters | 240 | 358 | 396 | 526 | 204 | 580 | 101 |
| | Terms | 515 | 2,232 | 426 | 3,173 | 154 | 3,327 | - |
| Ensembl [2,370] | Sequences | 90 | 104 | 131 | 175 | 77 | 202 | 31 |
| | Clusters | 73 | 83 | 104 | 142 | 58 | 159 | 23 |
| | Terms | 189 | 656 | 262 | 1,107 | 65 | 1,172 | - |
| Total [5,620] | Sequences | 375 | 522 | 587 | 782 | 311 | 868 | 155 |
| | Clusters | 282 | 402 | 453 | 607 | 247 | 674 | 113 |
| | Terms | 545 | 2,311 | 467 | 3,323 | 195 | 3,518 | - |
In Table 4, searching by Homo sapiens, Mus musculus and Bos taurus retrieves all the clusters in which sequences from each of these organisms share some annotation with pig sequences, including, when available, a structural template. Interestingly, a large fraction of the pig protein sequences inherit statistically validated annotation from their clusters despite low sequence identity (SI < 30%) with the cluster sequences that carry the annotation.
Table 4. PIG sequences in clusters with other organisms
| Organism | #Clusters | #Pig Sequences | #Pig Sequences (SI<30%) | #Clusters with PDB | #Pig |
|---|---|---|---|---|---|
| Homo sapiens | 10,475 | 22,581 | 3,958 | 3,487 | 9,314 |
| Mus musculus | 9,778 | 21,648 | 4,525 | 3,430 | 9,222 |
| Bos taurus | 9,303 | 21,044 | 4,238 | 3,305 | 9,050 |
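To put the low-identity counts of Table 4 in proportion, the following minimal sketch divides, for each organism, the number of pig sequences with SI < 30% by the number of pig sequences in the shared clusters. The values are copied from Table 4; the dictionary layout and names are illustrative and not part of SUS-BAR.

```python
# (pig sequences in shared clusters, pig sequences with SI < 30%), from Table 4
table4 = {
    "Homo sapiens": (22_581, 3_958),
    "Mus musculus": (21_648, 4_525),
    "Bos taurus": (21_044, 4_238),
}

# Share of pig sequences in shared clusters that have SI < 30%.
low_si_share = {org: low / total for org, (total, low) in table4.items()}

for org, share in low_si_share.items():
    print(f"{org}: {share:.1%}")
```

For the three organisms this gives shares between roughly 17% and 21% of the pig sequences found in the shared clusters.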