SciLifeLab Fellow Kristoffer Sahlin (Stockholm University) has developed a new method, strobemers, to improve sequence similarity detection in many types of bioinformatic analysis.
Sequence similarity is computed or approximated in nearly all bioinformatic applications that are based on comparing sequences. Applications include read mapping and read clustering, genome and transcriptome assembly, SNP and structural variation detection, and more.
Many of the most popular and successful applications rely on finding shared k-mers (substrings of length k) between sequences. However, k-mers are sensitive to mutations. For example, when two sequences are compared, a single mutated nucleotide destroys up to k consecutive shared k-mers, namely every k-mer that overlaps the mutated position. Densely mutated regions may therefore not share any k-mers, which in turn may cause unaligned reads, missed structural variants, or gaps in genome assemblies.
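To make the k-mer sensitivity concrete, here is a minimal sketch that counts how many k-mers are lost after a single substitution. The sequence, the value of k, and the helper names (kmers, shared_positions) are illustrative assumptions and do not come from the study.

```python
def kmers(seq, k):
    """Return all substrings of length k, in order of start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def shared_positions(a, b, k):
    """Start positions where two equal-length sequences have the same k-mer."""
    return [i for i, (x, y) in enumerate(zip(kmers(a, k), kmers(b, k))) if x == y]


k = 5
reference = "ACGTTGCATGCCGATAGCTA"  # made-up 20 nt example sequence
# Substitute the nucleotide at index 10 (C -> A); everything else is unchanged.
mutated = reference[:10] + "A" + reference[11:]

total = len(kmers(reference, k))             # 16 k-mers in total
shared = shared_positions(reference, mutated, k)
lost = total - len(shared)                   # k-mers overlapping the mutation

print(f"{lost} of {total} {k}-mers destroyed by one substitution")
```

Every k-mer that starts at positions 6 through 10 covers the mutated nucleotide, so exactly k = 5 of the 16 k-mers no longer match. A second substitution fewer than k positions away would extend that unmatched stretch, which is how densely mutated regions can end up with no shared k-mers at all.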
The new study, published in Genome Research, shows that strobemers can overcome this issue to some extent because they are able to match across substitutions and small indels. As a result, similar regions that contain many mutations or sequencing errors, and therefore have no k-mer matches, may still produce strobemer matches.
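The idea of a seed that can match across a mutation can be illustrated with a simplified sketch. A strobemer links several shorter k-mers (strobes), with each later strobe picked from a window downstream of the first. The code below is loosely modelled on the order-2 "minstrobe" variant and is not the published implementation: the CRC32 hash, the parameter values, the handling of the sequence end, and the function names are all assumptions made for illustration.

```python
import zlib


def strobe_hash(s):
    """Deterministic stand-in hash (CRC32); the choice of hash is an assumption."""
    return zlib.crc32(s.encode("ascii"))


def minstrobes(seq, k, w_min, w_max):
    """Simplified order-2 minstrobe-style seeds.

    For each start i, the first strobe is seq[i:i+k]. The second strobe is the
    k-mer with the smallest hash among starts i+w_min .. i+w_max (clipped to
    the sequence end). Assumes w_min >= k so the two strobes do not overlap.
    Returns a list of (first_start, second_start, strobemer_string).
    """
    seeds = []
    last_start = len(seq) - k
    for i in range(last_start + 1):
        lo, hi = i + w_min, min(i + w_max, last_start)
        if lo > hi:
            break  # no room left for a second strobe (a simplification)
        j = min(range(lo, hi + 1), key=lambda p: strobe_hash(seq[p:p + k]))
        seeds.append((i, j, seq[i:i + k] + seq[j:j + k]))
    return seeds


seq = "ACGTTGCATGCCGATAGCTA"  # made-up example sequence
seeds = minstrobes(seq, k=3, w_min=3, w_max=8)
for first, second, strobemer in seeds[:3]:
    print(first, second, strobemer)
```

Because the two strobes are separated by a gap, a substitution that falls between them leaves the seed unchanged. Because the second strobe is chosen by hash value rather than at a fixed offset, a small insertion or deletion in the gap can also leave the same pair of strobes selected, although this is not guaranteed for every mutation.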
“The study is still at a proof-of-concept level. Both the theoretical properties to optimize the method as well as tailoring the method to various applied problems remain”, says the author and SciLifeLab Fellow, Kristoffer Sahlin (SU).
Several potential avenues and applications for future study and development of strobemers are discussed.
“The study suggests that strobemers can provide sensitive overlap detection of reads, a fundamental step in genome assembly and read clustering, as well as mapping of sequences to genomes”, says Kristoffer.
Strobemers can also be used in an algorithm for very fast Illumina short-read mapping, which is described in a follow-up publication from Kristoffer Sahlin that is available as a pre-print.