RNAcentral LitSumm

Overview

RNAcentral LitSumm is designed to produce automated summaries for any ncRNA gene using natural language processing techniques. Currently, it is built upon OpenAI's GPT-3.5 model. These summaries are generated by selecting relevant sentences extracted from scientific literature, building a summary, and then running a series of automated checks on the summaries. This approach is described in LitSumm: Large language models for literature summarisation of non-coding RNAs. LitSumm enables researchers to quickly grasp the key information about these genes.

As an example, the RNAcentral entry for Homo sapiens (human) microRNA hsa-mir-944 precursor lists over 200 articles about MIR944. Using sentences from those articles, LitSumm created the following summary:

MIR944 is a microRNA that has been studied in various contexts. It has been found to be overexpressed in patients with non-small cell lung cancer (NSCLC) [PMC5627048]. MIR944 has also been shown to modulate sensitivity to chemotherapy in other solid tumors [PMC8602334]. The MIR944 promoter is regulated by the transcription factor ΔNp63, which binds to the promoter region and activates transcription [PMC4551945]. The binding of ΔNp63 to the MIR944 promoter is reinforced by the co-regulator AP-2 [PMC4551945]. The chromatin structure of the MIR944 promoter region is relatively open in keratinocytes, which allows for transcription initiation [PMC4551945]. Conservation analyses have shown that MIR944 is a relatively young gene in evolutionary terms, with conservation limited to primates [PMC4551945]. TFAP2A and TFAP2C have been identified as co-regulators that bind to the MIR944 promoter and enhance ΔNp63-mediated activation of transcription [PMC4551945]. The activity of the MIR944 promoter is dependent on ΔNp63 expression, but not linearly dependent on its quantity [PMC4551945]. Overall, these findings suggest that MIR944 has its own independent promoter and its expression is regulated by ΔNp63 and co-regulators such as AP-2, TFAP2A, and TFAP2C.

Note that this summary is generated automatically using artificial intelligence and may therefore not be completely accurate or comprehensive. We have observed the following issues with LitSumm summaries:

  • Missing references
  • Incorrectly formatted references
  • Incorrectly merging similar sentences
  • Inaccurate statements

LitSumm is under active development and more summaries will be available in the future.

Browse all sequences with LitSumm summaries

How LitSumm works

LitScan Pipeline

The foundation of LitSumm is the LitScan pipeline. This text mining pipeline connects RNA sequences to the latest open access scientific literature. It does this using a collection of identifiers (IDs), gene names, and synonyms provided to RNAcentral by Expert Databases. LitScan scans papers available in Europe PMC, so that the publications linked to RNAcentral entries remain up to date.
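The matching idea can be illustrated with a minimal Python sketch. This is not the LitScan implementation: the function name `find_mentions`, the naive sentence splitting, and the case-insensitive whole-word matching are all assumptions made for illustration.

```python
import re


def find_mentions(text, names):
    """Return the sentences in `text` that mention any of `names`.

    Hypothetical helper for illustration only; not part of LitScan.
    """
    if not names:
        return []
    # Try longer names first so the alternation prefers the longest match.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # The lookarounds stop "MIR944" from matching inside "MIR9441".
    pattern = re.compile(
        rf"(?<![\w-])(?:{alternatives})(?![\w-])", re.IGNORECASE
    )
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if pattern.search(s)]


text = (
    "MIR944 is overexpressed in NSCLC. MIR9441 is unrelated. "
    "We also studied hsa-mir-944 in keratinocytes."
)
print(find_mentions(text, ["MIR944", "hsa-mir-944"]))
# ['MIR944 is overexpressed in NSCLC.', 'We also studied hsa-mir-944 in keratinocytes.']
```

Matching on whole identifiers rather than substrings is the design choice worth noting here: short gene names are easily contained in longer, unrelated names.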

LitSumm Summarization Process

LitSumm produces a summary for an ncRNA gene in four steps:

  1. Selection: Relevant snippets are selected from the literature using the LitScan pipeline.
  2. LLM-driven Summarization: The selected snippets are fed into the GPT-3.5 language model, which generates coherent and concise summaries.
  3. Automated Checking: The generated summaries undergo an automated checking process to ensure accuracy and coherence.
  4. Automated Refinement: When necessary, summaries are refined automatically to improve their quality and accuracy.
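
The four steps above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the LitSumm code: `summarise_gene`, the prompt wording, the `max_rounds` limit, and the `generate` and `check` callables are all hypothetical, and the language model call is passed in as a plain function so that the sketch runs without any external service.

```python
from typing import Callable, List


def summarise_gene(
    snippets: List[str],
    generate: Callable[[str], str],
    check: Callable[[str, List[str]], List[str]],
    max_rounds: int = 2,
) -> str:
    """Generate a summary, check it, and refine it while problems remain.

    `snippets` are the sentences chosen in the selection step.
    `generate` stands in for the language model (prompt in, text out).
    `check` returns a list of problem descriptions (empty means OK).
    All names and the prompt text are assumptions for illustration.
    """
    prompt = (
        "Summarise these sentences, keeping their references:\n"
        + "\n".join(snippets)
    )
    summary = generate(prompt)

    for _ in range(max_rounds):
        problems = check(summary, snippets)
        if not problems:
            break  # the summary passed the automated checks
        # Refinement: ask the model to fix the specific problems found.
        prompt = (
            "Revise this summary to fix the listed problems.\n"
            f"Summary: {summary}\n"
            "Problems:\n" + "\n".join(problems) + "\n"
            "Source sentences:\n" + "\n".join(snippets)
        )
        summary = generate(prompt)
    return summary
```

Note that in this sketch a summary that still fails after `max_rounds` refinements is returned as is; a real pipeline would need to decide whether to publish, flag, or discard such a summary.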

Automated Checking and Refinement

LitSumm incorporates an automated checking mechanism to validate the generated summaries. This ensures that the summaries accurately reflect the content of the selected literature passages. Additionally, automated refinement processes are applied as needed to enhance the clarity and coherence of the summaries.
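
One kind of check can be illustrated with references, since missing and incorrectly formatted references are among the issues listed earlier. The sketch below is an assumption about what such a check might look like, not the checks LitSumm actually runs: the function name `check_references` is hypothetical, and it assumes that each source snippet carries its reference in the `[PMC…]` style seen in the example summary.

```python
import re

# A well-formed reference looks like [PMC5627048] (assumed format).
PMC_REF = re.compile(r"\[(PMC\d+)\]")


def check_references(summary, snippets):
    """Return a list of reference problems found in `summary`.

    Hypothetical check for illustration; an empty list means no
    problems were found.
    """
    problems = []

    # References that the summary is allowed to cite.
    allowed = set()
    for snippet in snippets:
        allowed.update(PMC_REF.findall(snippet))

    cited = PMC_REF.findall(summary)
    if not cited:
        problems.append("no references found")

    # Well-formed references that do not come from the source snippets.
    for ref in sorted(set(cited) - allowed):
        problems.append(f"{ref} is not in the source snippets")

    # Bracketed text that is not a well-formed PMC reference.
    for content in re.findall(r"\[([^\]]*)\]", summary):
        if not re.fullmatch(r"PMC\d+", content):
            problems.append(f"badly formatted reference: [{content}]")

    return problems
```

A check like this only verifies that references exist and are well formed. It cannot tell whether a statement is actually supported by the cited article, which is one reason inaccurate statements can still appear in summaries.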

