The RNAcentral sequence similarity search enables searches against a comprehensive collection of non-coding RNA sequences from a consortium of RNA databases.
The search is powered by the nhmmer software, which is more sensitive than blastn but comparable in speed. The sequence search executes the following command:
nhmmer --qfasta --tformat fasta -o <output_file> -T 0 --F3 0.02 --rna --watson --cpu 4 -Z <number> <query> <database.fasta>
Each parameter is described in the software man page (manual page).
Over 50% of RNAcentral sequences are ribosomal RNAs (rRNAs). The abundance and high conservation of rRNA sequences make it difficult to perform sequence similarity searches, as such searches are expected to match a large number of sequences and can take a long time to complete.
To get around this, the sequence similarity searches are performed against a subset of ~100,000 rRNA sequences from Ensembl, FlyBase, HGNC, MGI, PDBe, PomBase, RDP, RefSeq, RGD, SGD, TAIR, and WormBase.
Whenever a sequence is entered in the search input box, the query is compared with all RNAcentral sequences and if there is an exact match, the links to the entries matching the query are displayed in a green box. This is very quick because only identical matches are considered. To see all similar sequences, just click Submit.
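The exact-match check is fast because it only needs an equality lookup rather than an alignment. The source does not describe how the lookup is implemented, so the following is a sketch under stated assumptions: sequences are normalised (whitespace removed, uppercased, U converted to T) and indexed by a SHA-256 digest. The function names and the example identifiers are hypothetical.

```python
import hashlib


def normalise(sequence):
    """Remove whitespace, uppercase, and convert U to T (assumed normalisation)."""
    return "".join(sequence.split()).upper().replace("U", "T")


def sequence_key(sequence):
    """Digest of the normalised sequence, used as the lookup key."""
    return hashlib.sha256(normalise(sequence).encode("ascii")).hexdigest()


def build_index(records):
    """Map each sequence digest to the identifiers that share that exact sequence."""
    index = {}
    for identifier, sequence in records:
        index.setdefault(sequence_key(sequence), []).append(identifier)
    return index


def exact_matches(index, query):
    """Return identifiers whose sequence is identical to the query, else an empty list."""
    return index.get(sequence_key(query), [])


# Hypothetical records for illustration only.
records = [
    ("seq_1", "ACGUACGU"),
    ("seq_2", "GGGAAACCC"),
    ("seq_3", "acgtacgt"),
]
index = build_index(records)
```

With this index, `exact_matches(index, "ACGT ACGT")` finds both `seq_1` and `seq_3`, while a query differing by a single nucleotide finds nothing, which is why similar sequences still require the full search.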
Batch queries can be submitted either through the website or through the API. However, the web interface currently accepts at most 50 sequences per submission.
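When a query file holds more sequences than the web interface accepts, one option is to split it into smaller batches first. The sketch below assumes FASTA input and uses the 50-sequence limit stated above as the default batch size; `read_fasta` and `split_into_batches` are hypothetical helper names, and the example sequences are made up.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_into_batches(records, batch_size=50):
    """Split records into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        records[start:start + batch_size]
        for start in range(0, len(records), batch_size)
    ]


# 120 made-up query sequences, for illustration only.
fasta_text = "".join(f">query_{i}\nACGU\n" for i in range(120))
batches = split_into_batches(read_fasta(fasta_text))
```

Each batch can then be submitted separately through the website, or the whole set can be sent through the API instead.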
In addition to nhmmer searches against RNAcentral, every query is automatically compared with the Rfam library of RNA families. The searches are done using the Infernal cmscan program coupled with a post-processing step. The post-processing removes any hits that overlap Rfam families from the same clan (a clan is a set of homologous families, for example LSU_rRNA_archaea, LSU_rRNA_bacteria and LSU_rRNA_eukarya). This functionality is not available on the Rfam website or the EBI cmscan service, both of which report all matching families, including the redundant overlapping hits from the same clan.
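The post-processing idea can be sketched as a greedy filter over the cmscan hits. This is an illustration under stated assumptions rather than the RNAcentral implementation: it assumes that, among overlapping hits from the same clan, the highest-scoring hit is kept, and that any shared position counts as an overlap. The `Hit` class, the clan label `clan_LSU`, the coordinates, and the scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hit:
    family: str
    clan: Optional[str]  # None when the family does not belong to a clan
    start: int           # 1-based, inclusive
    end: int             # 1-based, inclusive
    score: float


def overlaps(a, b):
    """True when the two hits share at least one query position."""
    return a.start <= b.end and b.start <= a.end


def remove_clan_overlaps(hits):
    """Keep the best-scoring hit among overlapping hits from the same clan."""
    kept = []
    # Visit hits from highest to lowest score so the best hit is kept first.
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        redundant = any(
            hit.clan is not None
            and hit.clan == other.clan
            and overlaps(hit, other)
            for other in kept
        )
        if not redundant:
            kept.append(hit)
    # Report the surviving hits in query order.
    return sorted(kept, key=lambda h: h.start)


# Made-up hits for illustration only.
hits = [
    Hit("LSU_rRNA_archaea", "clan_LSU", 1, 2900, 1500.0),
    Hit("LSU_rRNA_bacteria", "clan_LSU", 1, 2900, 2100.0),
    Hit("LSU_rRNA_eukarya", "clan_LSU", 5, 2890, 1200.0),
    Hit("tRNA", None, 3000, 3072, 60.0),
]
filtered = remove_clan_overlaps(hits)
```

In this example only the LSU_rRNA_bacteria hit survives from the three overlapping same-clan hits, and the tRNA hit is untouched because it has no clan and lies elsewhere on the query.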
The RNAcentral sequence similarity search also generates secondary structure (2D) diagrams using the R2DT software, which visualises RNA structure using standard layouts or templates. Learn more about this feature in the R2DT paper.
Although the number of similar sequences can reach tens of thousands, for performance reasons, only the top 1000 results will be shown in each search.
The search results are available for download as a compressed folder. This folder contains the datapackage.json file with the search metadata and a directory with three files. The similar-sequence.json and similar-sequence.txt files contain the search results. The similar-sequences.fasta file contains the matching sequences in FASTA format.
See the API documentation to learn how to start an asynchronous job using our REST-based API.
The results will be available at the same URL for up to 7 days. After this period, all data and metadata are deleted.
Please feel free to contact us or raise a GitHub issue.