Method

How SUS-BAR was generated

SUS-BAR is a database from which you can retrieve a pig protein sequence, whether or not it has been annotated. The added value of our annotation system is that it merges all the existing information on protein features, such as structure, function, and functional and structural domains, and transfers annotation only after a cluster-specific statistical validation. SUS-BAR methodologically derives from BAR+, the Bologna Annotation Resource that we recently implemented (available at http://bar.biocomp.unibo.it/bar2.0).

BAR+

BAR+ allows the transfer of statistically validated annotation [1-2]. Briefly, the method is based on the concept that sequences can inherit function(s) and structure from their well-annotated counterparts, provided that they fall into the same cluster and that the cluster is characterized by cluster-specific, statistically validated annotations. To generate the BAR+ clusters we analyzed more than 13 million protein sequences from 988 genomes and UniProtKB release 2010_05. The BAR+ cluster-building pipeline starts with an all-against-all sequence comparison with BLAST in a GRID environment [2]. Here we complement our previous BAR+ with some 30,000 additional sequences downloaded from UniProtKB/SwissProt release 2012_01, including human protein variants.

The non-hierarchical clustering procedure of BAR+ constrains sequence identity (SI) to be ≥40% over at least 90% of the global pairwise sequence alignment length (Coverage, Cov). The alignment results are then regarded as an undirected graph in which nodes are proteins and links are allowed only between chains that are at least 40% identical over at least 90% of the alignment length. A cluster comprises all the connected protein nodes [1]. Depending on the annotation types of the sequences within the cluster, all new targets that fall into a cluster can inherit all the cluster-specific and statistically validated annotations by transfer. When a cluster incorporates a UniProtKB entry, it inherits that entry's annotations (GO and Pfam terms, PDB structures, SCOP classifications). Within a cluster, GO and Pfam terms are then statistically validated as previously described [1], and validated terms are those with P-values < 0.01 [2]. Some 481,940 clusters contain validated terms, and about 10 million protein sequences fall into these validated clusters. Clusters can contain distantly related proteins, which can therefore be annotated with high confidence by virtue of falling into a specific cluster. When the cluster contains one or more structural templates, all the sequences in the cluster can be folded on the template(s), largely irrespective of their sequence identity to the template(s) [1-2]. Structural alignments within each cluster containing templates are provided by a cluster Hidden Markov Model (HMM) and are available for download [2]. In the present BAR+ implementation we count 1,431,294 clusters, containing 92% of all the protein sequences (available at http://bar.biocomp.unibo.it/bar2.0).
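The graph-based clustering described above can be illustrated with a minimal sketch. It is not the BAR+ pipeline itself: the function name build_clusters, the tuple layout of the pairwise hits, and the use of a union-find structure to collect connected components are all assumptions made for illustration. Only the two thresholds (SI ≥ 40%, Cov ≥ 90%) come from the text.

```python
from collections import defaultdict

SI_MIN = 40.0   # minimum sequence identity (%), as stated in the text
COV_MIN = 90.0  # minimum alignment coverage (%), as stated in the text


def build_clusters(protein_ids, hits):
    """Group proteins into connected components of the similarity graph.

    protein_ids: iterable of protein identifiers (hypothetical input format).
    hits: iterable of (id_a, id_b, identity_pct, coverage_pct) tuples, e.g.
          parsed from an all-against-all comparison (hypothetical layout).
          Every id in hits is assumed to appear in protein_ids.
    """
    protein_ids = list(protein_ids)
    parent = {p: p for p in protein_ids}

    def find(x):
        # Follow parent links to the component root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, si, cov in hits:
        # A link exists only when both thresholds are satisfied.
        if si >= SI_MIN and cov >= COV_MIN:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(set)
    for p in protein_ids:
        groups[find(p)].add(p)
    return list(groups.values())
```

Because a cluster is a connected component, two proteins can share a cluster without being directly linked: if A links to B and B links to C, all three are grouped even when A and C fall below the thresholds. This is how distantly related proteins can end up in the same cluster.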

SUS-BAR

SUS-BAR is obtained by downloading 19,577 pig protein entries, corresponding to 19,576 different sequences, from UniProtKB (release 2012_05), selected from the complete proteome set. Another 15,805 sequences were collected from the Ensembl 67 genebuild, based on the Sus scrofa 10.2 pig genome assembly (http://www.ensembl.org), for a total of 35,381 different sequences. These sequences were aligned against BAR+. When a sequence found a counterpart in BAR+ with sequence identity (SI) ≥40% over at least 90% of the global pairwise sequence alignment length (Coverage, Cov), the sequence was assigned to the corresponding cluster. Pig protein sequences not included in clusters remain singletons (see Statistics). SUS-BAR will be updated whenever a new release of the pig genome becomes available or a new version of BAR+ is released.
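The assignment step can be sketched as follows, under stated assumptions. The function name assign_to_cluster, the input layout, and the cluster_of mapping are hypothetical. The text does not say how a sequence with qualifying hits in more than one cluster is handled, so choosing the hit with the highest sequence identity is an assumption of this sketch, not the documented SUS-BAR behaviour.

```python
def assign_to_cluster(query_hits, cluster_of, si_min=40.0, cov_min=90.0):
    """Return the cluster id for one pig sequence, or None for a singleton.

    query_hits: list of (bar_id, identity_pct, coverage_pct) tuples for the
                query against BAR+ sequences (hypothetical layout).
    cluster_of: dict mapping a BAR+ sequence id to its cluster id
                (hypothetical lookup table).
    """
    # Keep only hits that meet both thresholds given in the text.
    passing = [
        (si, cov, bar_id)
        for bar_id, si, cov in query_hits
        if si >= si_min and cov >= cov_min and bar_id in cluster_of
    ]
    if not passing:
        return None  # no counterpart in BAR+: the sequence stays a singleton

    # Assumption: prefer the hit with the highest identity, then coverage.
    _, _, best_id = max(passing)
    return cluster_of[best_id]
```

A sequence whose best hit passes only one of the two thresholds (for example 90% identity over 80% of the alignment) is not assigned and remains a singleton.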
