Quality Control

RNAcentral provides a variety of quality checks for all sequences. Many of these checks are based on Rfam, a database of functional non-coding RNA families represented by multiple sequence alignments and consensus secondary structures. The sequence and structural information is used to build Infernal covariance models, which can be used to find new instances of RNA families and annotate genomes with non-coding RNAs.

In every release, RNAcentral annotates all sequences with Rfam models. Rfam classification provides additional context for sequences with few annotations and helps identify potential problems, for example, sequences that are likely to be contamination.

In addition to Rfam-based checks, we also use CPAT to analyze sequences. This tool detects possible open reading frames in sequences. We analyze all human, fly, mouse and zebrafish sequences this way. We use CPAT version 3.0.4 with the default options.
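To give a feel for what "possible open reading frame" means, here is a minimal sketch of a naive forward-strand ORF scan. It is an illustration only and is not how CPAT works internally; CPAT's own scoring is more involved. The function name find_orfs and the min_codons cutoff are hypothetical choices made for this example.

```python
# Illustrative only: a naive ORF scan, NOT the method CPAT uses.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=30):
    """Return (start, end, frame) tuples for ATG...stop ORFs on the forward strand.

    start is the 0-based index of the ATG, end is the index just past the stop
    codon, and min_codons (a hypothetical cutoff) counts codons before the stop.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs
```

For instance, find_orfs("CCAUGGCUGCUGCUUAAG", min_codons=3) reports a single ORF starting at the AUG in frame 2.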

Current types of quality control

1. Incomplete sequences

When an RNAcentral sequence matches only a part of the Rfam covariance model, the sequence is flagged as incomplete.
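As a rough sketch of this check, one can compare the span of the model covered by a hit with the full model length. The 0.9 cutoff and the function names below are assumptions made for illustration; the exact criteria RNAcentral applies are not stated here.

```python
def model_coverage(model_start, model_end, model_length):
    """Fraction of the covariance model covered by a hit (1-based, inclusive)."""
    return (model_end - model_start + 1) / model_length


def is_incomplete(model_start, model_end, model_length, min_coverage=0.9):
    """Flag a hit as incomplete when it covers less than min_coverage of the model.

    min_coverage is a hypothetical threshold, not RNAcentral's actual value.
    """
    return model_coverage(model_start, model_end, model_length) < min_coverage
```

A hit spanning model positions 20 to 60 of a 120-position model covers about a third of the model and would be flagged under this assumed cutoff.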

Examples

These sequences can be browsed by searching for qc_warning:"incomplete_sequence".

2. Potential contamination

A sequence is flagged when it matches an Rfam family that is not expected in its taxonomic group. For example, when a eukaryotic sequence matches an Rfam family that is only found in Bacteria, this could indicate bacterial contamination or taxonomic misclassification.
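The logic of this check can be sketched as a comparison between the taxonomic domain of the sequence and the domains in which the matched family is known to occur. The function name and the way domains are represented below are hypothetical; this is not RNAcentral's implementation.

```python
def possible_contamination(sequence_domain, family_domains):
    """Return True when a sequence's domain is outside the family's known domains.

    sequence_domain: e.g. "Eukaryota" (hypothetical representation)
    family_domains: set of domains where the Rfam family is found;
                    an empty set means "unknown" and is never flagged.
    """
    return bool(family_domains) and sequence_domain not in family_domains
```

For example, possible_contamination("Eukaryota", {"Bacteria"}) is True, while a bacterial sequence matching the same family is not flagged.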

Examples

Sequences of this type can be browsed with qc_warning:"possible_contamination".

3. Missing Rfam hits

The majority of RNAcentral sequences annotated as rRNA or tRNA match the corresponding Rfam families. However, some sequences do not match the expected Rfam families, which could mean either that the sequence has an incorrect RNA type or that the Rfam model needs to be updated.

Examples

Browse all sequences with missing matches by searching for qc_warning:"missing_rfam_match".

4. Possible ORFs

Very few sequences contain open reading frames. However, it is worth noting which ones do, as this may be relevant to the function of the ncRNA.

Examples

Browse all sequences with possible ORFs by searching for qc_warning:"possible_orf".
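The four qc_warning queries listed above can also be turned into links programmatically. The sketch below assumes that the text search is reachable at https://rnacentral.org/search with the query passed in a q parameter; that URL pattern and the helper name qc_search_url are assumptions for this example, so check them against the live site before relying on them.

```python
from urllib.parse import urlencode

# Assumed search endpoint; verify against the RNAcentral website.
SEARCH_URL = "https://rnacentral.org/search"

# The qc_warning values named on this page.
QC_WARNINGS = (
    "incomplete_sequence",
    "possible_contamination",
    "missing_rfam_match",
    "possible_orf",
)


def qc_search_url(warning):
    """Build a search link for one qc_warning value (hypothetical helper)."""
    if warning not in QC_WARNINGS:
        raise ValueError(f"unknown qc_warning: {warning!r}")
    # The value is quoted exactly as in the search examples, then URL-encoded.
    return SEARCH_URL + "?" + urlencode({"q": f'qc_warning:"{warning}"'})


if __name__ == "__main__":
    for name in QC_WARNINGS:
        print(qc_search_url(name))
```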

Why some sequences do not match any Rfam families

There are several possible reasons:

  • not all RNA types are represented in Rfam

    For example, piRNAs or mature miRNAs are too short to be accurately modelled in Rfam. Conversely, lncRNAs tend to be too long and poorly conserved, although Rfam includes several conserved lncRNA domains.

  • the relevant family may not yet exist in Rfam
  • the Rfam model may need to be updated to include this sequence

If you would like to suggest a new Rfam family or report an error, please get in touch using the Feedback button found at the top right of every RNAcentral page.

