FTP Archive

The FTP archive lets you download the large data files that RNAcentral produces and uses. Downloading these files is useful when you want to run your own processing on the data.

The objects in the FTP archive are generated as part of each RNAcentral release, so they are updated with every release. The archive keeps data going back to version 1.0beta; see the releases folder.

Most objects stored here are gzip-compressed and have a .gz suffix. To decompress one, run a command such as gzip -d <file path>. The file formats are documented at the links in the table below.
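If you would rather not decompress a file on disk, it can also be read as a stream. The following is a minimal sketch using Python's standard gzip and csv modules to iterate over a gzip-compressed TSV file row by row. The function name read_tsv_gz is illustrative, and the sketch makes no assumption about the columns in any particular file.

```python
import csv
import gzip


def read_tsv_gz(path):
    """Yield each row of a gzip-compressed TSV file as a list of strings."""
    # "rt" opens the compressed stream in text mode, so the file is
    # decompressed lazily and nothing extra is written to disk.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE treats quote characters as ordinary data, so only
        # tabs separate fields.
        yield from csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
```

For example, next(read_tsv_gz("md5.tsv.gz")) would return the first row of a locally downloaded md5.tsv.gz as a list of fields.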

Objects available

| Name | Description | Format | Link |
|---|---|---|---|
| Database files | A dump of the postgres database. For old releases this will be removed. | pg_dump | pg_dump |
| Genome Coordinates | Coordinates of RNAs in RNAcentral in each model organism, as annotated by expert databases | BED, GFF3 | BED, GFF3 |
| GO annotations | Mappings of RNAcentral entries to Gene Ontology terms | TSV | rnacentral_rfam_annotations.tsv.gz |
| GPI | Gene product information for selected rRNAs | GPI | rnacentral.gpi.gz |
| ID mapping | Mapping of RNAcentral IDs to expert database IDs. Also available per database | TSV | id_mapping.tsv.gz, per database |
| json | JSON files containing RNAcentral IDs and their cross-references to Ensembl. Each file contains 10,000 sequences | JSON | json |
| md5 | RNAcentral ID mapped to MD5 sum of each sequence | TSV | md5.tsv.gz |
| rfam | RNAcentral IDs with their associated Rfam annotations | TSV | rfam_annotations.tsv.gz |
| sequences - active | RNAcentral IDs with their corresponding sequences. Active sequences are present in at least one expert database | FASTA | rnacentral_active.fasta.gz |
| sequences - inactive | RNAcentral IDs with their corresponding sequences. Inactive sequences are not currently present within any expert database | FASTA | rnacentral_inactive.fasta.gz |
| sequences - species specific | RNAcentral species-specific URS mapped to sequence | FASTA | rnacentral_species_specific_ids.fasta.gz |
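The FASTA files listed above can be large, so it may help to read them one record at a time. Below is a minimal sketch of a streaming reader for a gzip-compressed FASTA file, using only the Python standard library. The function name read_fasta_gz is illustrative. The header is returned whole, because the exact header layout is not described on this page.

```python
import gzip


def read_fasta_gz(path):
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file."""
    header, chunks = None, []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                # Sequence data may be wrapped over several lines.
                chunks.append(line)
    # Emit the final record once the file is exhausted.
    if header is not None:
        yield header, "".join(chunks)
```

Because the function is a generator, only one record is held in memory at a time, which suits files such as rnacentral_active.fasta.gz.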

Most directories contain readme files that explain their contents further.

Previous releases are also available under releases and contain largely the same objects, although the database has evolved over time.

Directory structure

The structure of the FTP archive is shown below.

rnacentral
|
+- current_release
|   |
|   +- database_files
|   |   |
|   |   +- pg_dump.sql.gz
|   |
|   +- genome_coordinates
|   |   |
|   |   +- bed
|   |   |   |
|   |   |   +- one gzip compressed BED file per model organism
|   |   |
|   |   +- gff3
|   |   |   |
|   |   |   +- one gzip compressed GFF3 file per model organism
|   |   |
|   |   +- readme.txt
|   |
|   +- go_annotations
|   |   |
|   |   +- rnacentral_rfam_annotations.tsv.gz
|   |
|   +- gpi
|   |   |
|   |   +- rnacentral.gpi
|   |   |
|   |   +- rnacentral.gpi.gz
|   |
|   +- id_mapping
|   |   |
|   |   +- database_mappings
|   |   |   |
|   |   |   +- one uncompressed tsv file per expert database
|   |   |
|   |   +- id_mapping.tsv.gz
|   |
|   +- json
|   |   |
|   |   +- JSON files
|   |
|   +- md5
|   |   |
|   |   +- md5.tsv.gz
|   |
|   +- rfam
|   |   |
|   |   +- rfam_annotations.tsv.gz
|   |
|   +- sequences
|       |
|       +- by_database
|       |   |
|       |   +- one uncompressed fasta file per expert database
|       |
|       +- rnacentral_active.fasta.gz
|       |
|       +- rnacentral_inactive.fasta.gz
|       |
|       +- rnacentral_species_specific_ids.fasta.gz
|
+- releases
    |
    +- Archive of releases back to 1.0beta (2014)
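The layout above maps directly onto file paths, which can be sketched as a small helper that builds download URLs. Several things here are assumptions rather than facts from this page: the BASE_URL value, the helper name archive_url, and the idea that each past release lives in its own folder under releases. Check these against the archive before relying on them.

```python
from urllib.parse import quote

# Assumption: this page does not state the archive root, so replace
# this value with the actual location of the rnacentral directory.
BASE_URL = "https://ftp.ebi.ac.uk/pub/databases/RNAcentral"


def archive_url(*parts, release=None, base=BASE_URL):
    """Build a URL for a file in the archive, following the tree above.

    With release=None the path starts at current_release. Otherwise it
    starts at releases/<release>, which assumes one folder per release.
    """
    root = "current_release" if release is None else f"releases/{release}"
    path = "/".join(quote(part) for part in parts)
    return f"{base.rstrip('/')}/{root}/{path}"
```

For instance, archive_url("md5", "md5.tsv.gz") points at the md5 mapping in the current release, and passing release="1.0beta" would target the oldest archived release under the folder-per-release assumption.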
