In a new study, led by SciLifeLab group leader Kristoffer Sahlin (SU), researchers from SciLifeLab, Oxford Nanopore Technologies (ONT) and Penn State University have developed a new computational tool for error-correcting Oxford Nanopore transcriptomic sequence data.
Oxford Nanopore Technologies (ONT) is a company that develops DNA and RNA sequencing technology; the resulting data is used to study the genomes and transcriptomes of organisms. The new approach, published in Nature Communications, reduces the median error rate in ONT transcript sequencing data from approximately 7 percent to between 0.5 and 1.1 percent.
A sequencing error is a substitution, deletion, or insertion of a nucleotide in the read that does not occur in the biological sequence. The error rate of a read (a long sequence readout) is how frequently, on average, a sequencing error occurs in that read. Error rates vary between reads; an error rate of 7 percent means that a read has about one error in every 14 nucleotides. The median read in ONT data currently has an error rate of around 7 percent.
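The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and defining the error rate as the number of substitutions, insertions, and deletions divided by the alignment length is an assumption of this sketch, not a definition taken from the study.

```python
def error_rate_percent(substitutions: int, insertions: int,
                       deletions: int, alignment_length: int) -> float:
    """Error rate of one read, in percent.

    Assumed definition (hypothetical, for illustration): all three error
    types are counted and divided by the length of the read's alignment.
    """
    if alignment_length <= 0:
        raise ValueError("alignment_length must be positive")
    errors = substitutions + insertions + deletions
    return 100.0 * errors / alignment_length


def nucleotides_per_error(rate_percent: float) -> float:
    """Average number of nucleotides per sequencing error."""
    if rate_percent <= 0:
        raise ValueError("rate_percent must be positive")
    return 100.0 / rate_percent


if __name__ == "__main__":
    # 7 percent is the uncorrected median; 1.1 and 0.5 percent span the
    # corrected range reported in the article.
    for rate in (7.0, 1.1, 0.5):
        spacing = nucleotides_per_error(rate)
        print(f"{rate} percent: about one error every {spacing:.0f} nucleotides")
```

Under these assumptions, lowering the error rate from 7 percent to the 0.5 to 1.1 percent range moves the typical spacing between errors from about 14 nucleotides to roughly 91 to 200 nucleotides.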
“This hinders many downstream analyses such as ‘inferring’ which reads came from the same transcript, as an error can be confused with a biological mutation or exon difference”, says SciLifeLab researcher and first author Kristoffer Sahlin (SciLifeLab/SU).
In bringing down the error rate, Kristoffer Sahlin and his team demonstrate the feasibility of cost-effective cDNA full transcript length sequencing for transcriptome analysis.
“Through their ability to produce long sequence readouts, they [long-read sequencing technologies] have, for example, provided insights on human DNA variability in previously unsequenced regions, and produced more detailed maps of the transcriptome landscape of cells used for the study of various diseases”, says Kristoffer Sahlin.
Previous methods to reduce ONT error rates have either employed experimental approaches, at the cost of decreased throughput and experimental overhead (the extra time spent in the lab preparing customized sequencing protocols), or relied on alignment to a reference genome, which may not be available or may not represent the sample well.
Reference-based error correction is not suitable for organisms without high-quality reference genomes, for reads from genes with particularly small (and unalignable) exons, or for transcripts from gene families that are not well represented in the reference genome due to high sequence diversity. The new reference-agnostic method can be applied to both cDNA and direct RNA (dRNA) ONT sequencing data, regardless of the organism.
ONT transcriptome sequencing has the ability to sequence through the entire transcript, as opposed to short-read sequencing. In addition, compared to other long-read technologies, ONT is a more cost-effective alternative. The only obstacle so far has been the error rates compared to accurate short-read sequencing.
“An important aspect of error correction is the ability to preserve the variation in the original sample, such as single-nucleotide variants (SNVs). A correction approach may be tricked into thinking that low-abundance mutations or short exons are sequencing errors. We studied this in detail from many aspects and found that our algorithm produces very little overcorrection,” says Kristoffer Sahlin.
There are many applications of ONT transcriptome sequencing methodology. Common transcriptome analyses include isoform detection and quantification. Detecting novel transcripts or quantifying transcript abundance in the cell is used to study differences between cell types, how cells develop over time, and mechanisms for the development of tumor cells or defective proteins causing disease.
The team also found that their algorithm is robust when it comes to preserving biological variation in samples.
“I also want to highlight the algorithmic design we came up with to error-correct the reads in this study. Our new approach uses well-studied concepts in computer science, such as minimizers and weighted interval scheduling, but joins them together and applies them in the new context of sequence error correction. I am excited to see if we can apply these computational techniques to related bioinformatic areas where sequence error correction, consensus forming, and sequence variation analysis are of importance. We will continue to explore such avenues,” says Kristoffer Sahlin.
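For readers unfamiliar with the two concepts named in the quote, the following Python sketch shows the textbook form of each. It is not the authors' implementation, and the article does not describe how the study combines the two. The function names, the choice of lexicographic ordering for k-mers, and the toy inputs are all assumptions made for illustration.

```python
from bisect import bisect_right


def minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Textbook minimizers: in every window of w consecutive k-mers, keep the
    lexicographically smallest k-mer (leftmost on ties).

    Returns the distinct (position, k-mer) pairs sorted by position.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)


def weighted_interval_scheduling(intervals):
    """Textbook weighted interval scheduling by dynamic programming.

    intervals: list of (start, end, weight), treated as half-open [start, end).
    Returns (best_total_weight, chosen_intervals), where the chosen intervals
    do not overlap and their total weight is maximal.
    """
    ivs = sorted(intervals, key=lambda iv: iv[1])  # sort by end coordinate
    ends = [iv[1] for iv in ivs]
    n = len(ivs)
    # compatible[j]: how many of the first j sorted intervals end at or
    # before the start of interval j, so they can coexist with it.
    compatible = [bisect_right(ends, ivs[j][0], 0, j) for j in range(n)]

    # best[j]: maximum total weight using only the first j sorted intervals.
    best = [0] * (n + 1)
    for j in range(1, n + 1):
        weight = ivs[j - 1][2]
        best[j] = max(best[j - 1], weight + best[compatible[j - 1]])

    # Trace back through the table to recover one optimal selection.
    chosen = []
    j = n
    while j > 0:
        weight = ivs[j - 1][2]
        if weight + best[compatible[j - 1]] > best[j - 1]:
            chosen.append(ivs[j - 1])
            j = compatible[j - 1]
        else:
            j -= 1
    chosen.reverse()
    return best[n], chosen


if __name__ == "__main__":
    # Toy inputs, invented for illustration only.
    print(minimizers("ACGTACGT", k=3, w=2))
    toy = [(0, 3, 2), (2, 5, 4), (5, 7, 4), (6, 9, 7), (1, 4, 1)]
    print(weighted_interval_scheduling(toy))
```

Minimizers give a compact, reproducible sample of k-mers from a sequence, and weighted interval scheduling selects the most valuable set of non-overlapping intervals; both are standard building blocks that can be reused in contexts other than the one sketched here.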