The UniProt NREF (UniProt Reference Clusters, UniRef) database.
The two major objectives of UniRef are:
(i) to facilitate sequence merging in UniProt, and
(ii) to allow faster and more informative sequence similarity searches.
Although the UniProt Knowledgebase is much less redundant than UniParc, it still
contains a certain level of redundancy because it is not possible to use fully
automatic merging without risking merging of similar sequences from different proteins.
However, such automatic procedures are extremely
useful in compiling the UniRef databases to obtain complete coverage of
sequence space while hiding redundant sequences (but not their
descriptions) from view.
A high
level of redundancy results in several problems, including slow
database searches and long lists of similar or identical alignments
that can obscure novel matches in the output. Thus, a more even
sampling of sequence space is advantageous. This can be addressed by
clustering closely similar sequences to yield a representative subset
of sequences. Therefore, we have created various non-redundant
databases with different sequence identity cut-offs. In the UniRef90
and UniRef50 databases, no pair of sequences in the representative set
has >90% or >50% mutual sequence identity, respectively. The UniRef100 database
presents identical sequences and sub-fragments as a single entry with
protein IDs, sequences, bibliography, and links to protein databases.
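
The idea of a representative set under an identity cut-off can be illustrated
with a minimal sketch. The description above does not say which clustering
algorithm UniRef uses, so the greedy procedure below is an assumption chosen
for brevity. The identity measure is a toy ungapped comparison rather than a
real alignment, and the sequences and IDs are made up for illustration.

```python
from __future__ import annotations


def identity(a: str, b: str) -> float:
    """Toy ungapped identity: matching positions / length of the longer sequence.

    Assumption: a real pipeline would compute identity from an alignment.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs: dict[str, str], cutoff: float) -> dict[str, list[str]]:
    """Greedy clustering (hypothetical helper, not UniRef's documented method).

    Sequences are visited longest first. Each one joins the first
    representative it matches at > cutoff identity; otherwise it becomes a
    new representative. No two representatives therefore exceed the cutoff.
    """
    clusters: dict[str, list[str]] = {}
    for sid in sorted(seqs, key=lambda s: (-len(seqs[s]), s)):
        for rep in clusters:
            if identity(seqs[sid], seqs[rep]) > cutoff:
                clusters[rep].append(sid)
                break
        else:
            clusters[sid] = [sid]
    return clusters


# Made-up example sequences, all 20 residues long.
seqs = {
    "A": "MKTAYIAKQRQISFVKSHFS",
    "B": "MKTAYIAKQRQISFVKSHFT",  # 1 mismatch vs A  -> 95% identity
    "C": "GGGGGGAKQRQISFVKSHFS",  # 6 mismatches vs A -> 70% identity
    "D": "W" * 20,                # unrelated to the others
}

clusters90 = greedy_cluster(seqs, 0.90)  # analogous to a 90% cut-off
clusters50 = greedy_cluster(seqs, 0.50)  # analogous to a 50% cut-off
print(clusters90)
print(clusters50)
```

With the lower cut-off more sequences collapse into one cluster, so the
representative set is smaller and samples sequence space more evenly, while
the member IDs of each cluster remain available.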