  • The UniProt NREF (UniProt Reference Clusters, UniRef) databases. The two major objectives of UniRef are (i) to facilitate sequence merging in UniProt and (ii) to allow faster and more informative sequence similarity searches. Although the UniProt Knowledgebase is much less redundant than UniParc, it still contains some redundancy, because fully automatic merging cannot be used without risking the merging of similar sequences from different proteins. Such automatic procedures are nevertheless extremely useful in compiling the UniRef databases, which provide complete coverage of sequence space while hiding redundant sequences (but not their descriptions) from view. A high level of redundancy causes several problems, including slow database searches and long lists of similar or identical alignments that can obscure novel matches in the output. A more even sampling of sequence space is therefore advantageous, and it can be achieved by clustering closely similar sequences to yield a representative subset of sequences. We have accordingly created several non-redundant databases with different sequence identity cut-offs. In the UniRef90 and UniRef50 databases, no pair of sequences in the representative set has more than 90% or 50% mutual sequence identity, respectively. The UniRef100 database presents identical sequences and sub-fragments as a single entry with protein IDs, sequences, bibliography, and links to protein databases.
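
The idea of clustering closely similar sequences into a representative subset can be illustrated with a minimal greedy sketch. This is not the procedure used to build UniRef, which the text does not specify. The function names `identity` and `cluster_representatives`, the example identifiers, and the toy identity measure (matched residues divided by the length of the shorter sequence, computed with Python's `difflib`) are all assumptions made for illustration; a real pipeline would use a proper alignment-based identity.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Toy stand-in for sequence identity (an assumption, not an alignment):
    residues in common matching blocks / length of the shorter sequence."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def cluster_representatives(seqs: Dict[str, str], cutoff: float) -> Dict[str, List[str]]:
    """Greedy clustering: map each representative ID to its member IDs.

    Sequences are visited longest first, so each representative is the
    longest member of its cluster. A sequence joins the first representative
    it matches above the cutoff; otherwise it becomes a new representative.
    """
    clusters: Dict[str, List[str]] = {}
    for seq_id in sorted(seqs, key=lambda k: (-len(seqs[k]), k)):
        for rep in clusters:
            if identity(seqs[rep], seqs[seq_id]) > cutoff:
                clusters[rep].append(seq_id)
                break
        else:
            # No existing representative is similar enough.
            clusters[seq_id] = [seq_id]
    return clusters


# Hypothetical sequences, invented for this example.
example = {
    "P1": "MKTAYIAKQRQISFVKSHFSRQ",
    "P2": "MKTAYIAKQRQISFVKSHFSRQ",  # identical to P1
    "P3": "MKTAYIAKQRQISFVKSHFSRA",  # one residue differs from P1
    "P4": "GGGGGGGGGGWWWWWWWWWWPP",  # unrelated
}
clusters_90 = cluster_representatives(example, cutoff=0.90)
```

Because a sequence becomes a representative only when it falls at or below the cutoff against every representative chosen before it, no pair of representatives exceeds the cutoff under this toy measure, which mirrors the property stated for the UniRef90 and UniRef50 representative sets. The members of each cluster remain listed under their representative, so redundant sequences are hidden from the representative set without being discarded.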

