The FTP archive provides bulk downloads of the data produced or used by RNAcentral. Downloading these files is useful when you want to run your own processing on the data.
The objects in the FTP archive are produced during each RNAcentral release, so they are updated with every release. The archive keeps data back to version 1.0-beta (look in the releases folder).
Most objects stored here are compressed with gzip; these files end with a `.gz` suffix. To decompress them, you can use a command like `gzip -d <file path>`. The file formats used are documented at the links provided in the table below.
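Large files do not have to be decompressed on disk before use. As a minimal sketch, the Python standard-library `gzip` module can read a `.gz` file as a stream; the function name `iter_gzip_lines` is an illustrative assumption and not part of RNAcentral.

```python
import gzip
from typing import Iterator


def iter_gzip_lines(path: str) -> Iterator[str]:
    """Yield decoded lines from a gzip-compressed text file, one at a time."""
    # "rt" opens the archive in text mode, so it is decompressed on the fly
    # and never has to be unpacked on disk.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

For example, `for line in iter_gzip_lines("md5.tsv.gz"): ...` would walk through a downloaded copy of the md5 file line by line, assuming it sits in the working directory.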
Name | Description | Format | Link |
---|---|---|---|
Database files | A dump of the PostgreSQL database. The dump is removed from old releases. | pg_dump | pg_dump |
Genome Coordinates | Coordinates of RNAs in RNAcentral in each model organism, as annotated by expert databases | BED, GFF3 | BED, GFF3 |
GO annotations | Mappings of RNAcentral entries to Gene Ontology terms | TSV | rnacentral_rfam_annotations.tsv.gz |
GPI | Gene product information for selected rRNAs | GPI | rnacentral.gpi.gz |
ID mapping | Mapping of RNAcentral IDs to expert database IDs. Also available per database | TSV | id_mapping.tsv.gz, per database |
json | JSON files containing RNAcentral IDs and their cross-references to Ensembl. Each file contains 10,000 sequences | JSON | json |
md5 | RNAcentral ID mapped to MD5 sum of each sequence | TSV | md5.tsv.gz |
rfam | RNAcentral IDs with their associated Rfam annotations. | TSV | rfam_annotations.tsv.gz |
sequences - active | RNAcentral IDs with their corresponding sequences. Active sequences are present in at least one expert database. | FASTA | rnacentral_active.fasta.gz |
sequences - inactive | RNAcentral IDs with their corresponding sequences. Inactive sequences are not currently present within any expert database | FASTA | rnacentral_inactive.fasta.gz |
sequences - species specific | Species-specific RNAcentral IDs (URS) mapped to their sequences | FASTA | rnacentral_species_specific_ids.fasta.gz |
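
The FASTA files listed above can be read without unpacking them first. The following is a minimal sketch in standard-library Python, not an official RNAcentral tool: the function name `iter_fasta` is hypothetical, and taking the first whitespace-delimited token of each header as the RNAcentral ID is an assumption that should be checked against the actual files.

```python
import gzip
from typing import Iterator, Tuple


def iter_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file."""
    header = None
    chunks = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new record starts; emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header = line[1:]
                chunks = []
            elif header is not None:
                # Sequence lines may be wrapped, so collect and join them.
                chunks.append(line)
    # Emit the final record once the file is exhausted.
    if header is not None:
        yield header, "".join(chunks)
```

With a downloaded copy of `rnacentral_active.fasta.gz`, `header.split()[0]` would then give the leading token of each header, which is assumed here to be the RNAcentral ID.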
Most directories contain readme files that explain their contents further.
Previous releases are also available in the releases folder. They largely contain the same objects, though the contents have changed as the database has evolved.
The structure of the FTP archive is shown below.
rnacentral
|
+- current_release
|  |
|  +- database_files
|  |  |
|  |  +- pg_dump.sql.gz
|  |
|  +- genome_coordinates
|  |  |
|  |  +- bed
|  |  |  |
|  |  |  +- one gzip compressed BED file per model organism
|  |  |
|  |  +- gff3
|  |  |  |
|  |  |  +- one gzip compressed GFF3 file per model organism
|  |  |
|  |  +- readme.txt
|  |
|  +- go_annotations
|  |  |
|  |  +- rnacentral_rfam_annotations.tsv.gz
|  |
|  +- gpi
|  |  |
|  |  +- rnacentral.gpi
|  |  |
|  |  +- rnacentral.gpi.gz
|  |
|  +- id_mapping
|  |  |
|  |  +- database_mappings
|  |  |  |
|  |  |  +- one uncompressed tsv file per expert database
|  |  |
|  |  +- id_mapping.tsv.gz
|  |
|  +- json
|  |  |
|  |  +- JSON files
|  |
|  +- md5
|  |  |
|  |  +- md5.tsv.gz
|  |
|  +- rfam
|  |  |
|  |  +- rfam_annotations.tsv.gz
|  |
|  +- sequences
|     |
|     +- by_database
|     |  |
|     |  +- one uncompressed fasta file per expert database
|     |
|     +- rnacentral_active.fasta.gz
|     |
|     +- rnacentral_inactive.fasta.gz
|     |
|     +- rnacentral_species_specific_ids.fasta.gz
|
+- releases
   |
   +- Archive of releases back to 1.0beta (2014)
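
The directory layout above maps directly onto download paths. As a rough sketch, a small Python helper can assemble such a path from its components. The function name `archive_url` is hypothetical, and the base URL in the usage line is a placeholder, not the real location of the archive.

```python
from urllib.parse import quote


def archive_url(base_url: str, *parts: str, release: str = "current_release") -> str:
    """Join a base URL, a release folder, and path components into one URL."""
    # base_url is supplied by the caller; this sketch does not know the
    # real address of the archive.
    segments = [release, *parts]
    # Strip stray slashes and percent-encode each component before joining.
    path = "/".join(quote(segment.strip("/")) for segment in segments)
    return f"{base_url.rstrip('/')}/{path}"
```

For instance, `archive_url("https://example.org/rnacentral", "md5", "md5.tsv.gz")` returns `https://example.org/rnacentral/current_release/md5/md5.tsv.gz`. The `release` argument could point at a folder under releases instead, but the naming of those folders is not described here, so check the releases folder before relying on it.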