RNAcentral provides a variety of quality checks for all sequences. Many of these checks are based on Rfam, a database of functional non-coding RNA families represented by multiple sequence alignments and consensus secondary structures. The sequence and structural information is used to build Infernal covariance models, which can be used to find new instances of RNA families and annotate genomes with non-coding RNAs.
In every release, RNAcentral annotates all sequences with Rfam models. Rfam classification provides additional context for sequences with few annotations and helps identify potential problems, for example sequences that are likely contamination.
In addition to Rfam-based checks, we also use CPAT to analyze sequences. This tool detects possible open reading frames in sequences. We analyze all human, fly, mouse and zebrafish sequences this way. We use CPAT version 3.0.4 with the default options.
Current types of quality control
1. Incomplete sequences
When an RNAcentral sequence matches only a part of the Rfam covariance model, the sequence is flagged as incomplete.
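The idea of a partial model match can be sketched as a coverage calculation. This is only an illustration: the `RfamHit` fields and the `min_coverage` threshold of 0.9 are assumptions made for the example and are not RNAcentral's actual criteria.

```python
from dataclasses import dataclass


@dataclass
class RfamHit:
    """Hypothetical, simplified record of one sequence-to-model match."""
    model_start: int   # first model position covered by the hit (1-based)
    model_end: int     # last model position covered by the hit (inclusive)
    model_length: int  # total length of the covariance model


def model_coverage(hit: RfamHit) -> float:
    """Fraction of the covariance model spanned by the hit."""
    return (hit.model_end - hit.model_start + 1) / hit.model_length


def is_incomplete(hit: RfamHit, min_coverage: float = 0.9) -> bool:
    """Flag the hit when it spans less than min_coverage of the model.

    The default threshold is an assumption chosen for illustration.
    """
    return model_coverage(hit) < min_coverage
```

For instance, a hit covering model positions 1 to 50 of a 100-position model has a coverage of 0.5 and would be flagged under this assumed threshold.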
These sequences can be browsed by searching for
2. Potential contamination
For example, when a eukaryotic sequence matches an Rfam family that is found only in Bacteria, this could indicate bacterial contamination or taxonomic misclassification.
- Bacterial GcvB RNA detected in mouse
- Mouse rRNA matching Bacterial LSU model
- Archaeal RNA matching Bacterial SRP
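A check of this kind can be sketched as a comparison between the taxonomic domain of the sequence and the domains in which the matched family is known to occur. The `FAMILY_DOMAINS` table and the function name below are hypothetical and hold only a couple of illustrative entries; the real check would draw on Rfam's full taxonomic information.

```python
# Hypothetical lookup table: domains in which a family is known to occur.
# Only two illustrative entries are included here.
FAMILY_DOMAINS = {
    "GcvB": {"Bacteria"},
    "tRNA": {"Bacteria", "Archaea", "Eukaryota"},
}


def possible_contamination(sequence_domain, family, family_domains=FAMILY_DOMAINS):
    """Return True when the sequence's domain is outside the family's known domains."""
    allowed = family_domains.get(family)
    if allowed is None:
        # No taxonomic information for this family, so nothing can be flagged.
        return False
    return sequence_domain not in allowed
```

With this table, a mouse (Eukaryota) sequence matching GcvB is flagged, while a bacterial sequence matching the same family is not. A flag only signals that the match deserves a closer look, since misclassified taxonomy produces the same pattern.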
Sequences of this type can be browsed with
3. Missing Rfam hits
The majority of RNAcentral sequences annotated as rRNA or tRNA match the corresponding Rfam families. However, some sequences do not match the expected Rfam families, which could mean either that the sequence has an incorrect RNA type or that the Rfam model needs to be updated.
- Sequence annotated as tRNA that matches Group II intron
- Sequence annotated as rRNA that matches a riboswitch
- Sequence annotated as tRNA that matches SSU rRNA
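The cases above can be sketched as a lookup from the annotated RNA type to the families it is expected to match. The `EXPECTED_FAMILIES` mapping below is a deliberately incomplete, illustrative assumption and not the list RNAcentral uses.

```python
# Illustrative and incomplete mapping from RNA type to expected Rfam family IDs.
EXPECTED_FAMILIES = {
    "tRNA": {"tRNA", "tRNA-Sec"},
    "rRNA": {"5S_rRNA", "5_8S_rRNA", "SSU_rRNA_bacteria", "LSU_rRNA_bacteria"},
}


def has_missing_rfam_hit(rna_type, matched_families, expected=EXPECTED_FAMILIES):
    """Return True when none of the matched families is expected for the RNA type."""
    wanted = expected.get(rna_type)
    if wanted is None:
        # No expectation is defined for this RNA type, so nothing is flagged.
        return False
    return not (wanted & set(matched_families))
```

Under these assumptions, a sequence annotated as tRNA whose only match is `Intron_gpII` is flagged, as is a tRNA with no matches at all.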
Browse all sequences with missing matches by searching for
4. Possible ORFs
Very few sequences contain open reading frames. However, it is worth noting which ones do, as this may be relevant to the function of the ncRNA.
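To make the notion of an open reading frame concrete, here is a minimal forward-strand ORF scan. It is not CPAT's method and does not reproduce its output; the function name and the `min_codons` cutoff are assumptions chosen for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=30):
    """Return (start, end) pairs of forward-strand ORFs, 0-based and half-open.

    An ORF here runs from an ATG to the first in-frame stop codon.
    min_codons counts the codons before the stop, including the ATG,
    and is an arbitrary cutoff chosen for illustration.
    """
    seq = sequence.upper().replace("U", "T")  # accept RNA or DNA alphabets
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # resume searching for the next ATG in this frame
    return orfs
```

A call such as `find_orfs(sequence, min_codons=100)` would report only longer candidate ORFs. The reverse strand is not scanned in this sketch.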
Browse all sequences with possible ORFs by searching for
Why some sequences do not match any Rfam families
There are several possible reasons:
- not all RNA types are represented in Rfam
For example, piRNAs or mature miRNAs are too short to be accurately modelled in Rfam. Conversely, lncRNAs tend to be too long and poorly conserved, although Rfam includes several conserved lncRNA domains.
- this family may not yet exist in Rfam
- Rfam model needs to be updated to include this sequence
If you would like to suggest a new Rfam family or report an error, please get in touch using the Feedback button found at the top right of every RNAcentral page.