A new bioinformatics pipeline, called GenErode, has been created by a group of researchers from the SciLifeLab National Bioinformatics Infrastructure (NBIS) and the Swedish Museum of Natural History (NRM). The new pipeline may become very important for conservation experts and addresses a fundamental limitation within the field of genomic studies on endangered species.
One of the most common limitations when performing genomic studies on endangered species is that they have seldom been studied in depth, unlike “model species” that are easy to maintain and breed in a laboratory setting. As a result, reference genomes are rarely available for these species.
“A reference genome is like the picture of a puzzle you look at to be able to reconstruct in which place to put each piece. If you don’t have the reference picture, reconstructing the puzzle is impossible. It’s the same with the reference genomes (picture) and the sequencing data (puzzle pieces)”, says last author David Díez-del-Molino from the Swedish Museum of Natural History.
In order to tackle this problem, a group of researchers from the SciLifeLab National Bioinformatics Infrastructure, NBIS, and the Swedish Museum of Natural History developed a bioinformatics pipeline named GenErode. Their report is now published in BMC Bioinformatics.
“GenErode is a bioinformatics pipeline designed to process and analyze ancient, historical, and modern genomic data from endangered species in order to produce comparable estimates of genomic erosion indices”, he continues.
These indices estimate different genomic patterns, such as genomic diversity (heterozygosity, the proportion of positions in the genome at which the two homologous chromosomes carry different nucleotides), inbreeding (long stretches of the genome that are homozygous, in which the two homologous chromosomes carry the same nucleotides, as a consequence of mating between close relatives), and the number of damaging mutations in each genome.
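To make the first two indices concrete, here is a minimal Python sketch on toy data. It is not GenErode's implementation: the function names, the toy genotypes, and the `min_length` threshold are all hypothetical, and real analyses work on genotype calls across millions of positions and measure homozygous stretches in physical length rather than in number of positions.

```python
from typing import List, Tuple

# One genotype = the two alleles observed at a single genome position
# (one allele per homologous chromosome). Toy representation, assumed here.
Genotype = Tuple[str, str]


def heterozygosity(genotypes: List[Genotype]) -> float:
    """Proportion of positions at which the two alleles differ."""
    if not genotypes:
        return 0.0
    het = sum(1 for a, b in genotypes if a != b)
    return het / len(genotypes)


def homozygous_runs(genotypes: List[Genotype], min_length: int = 3) -> List[Tuple[int, int]]:
    """Find runs of consecutive homozygous positions.

    Returns (start, end) index pairs, end exclusive, for every run that
    spans at least `min_length` positions (a hypothetical threshold).
    """
    runs = []
    start = None
    for i, (a, b) in enumerate(genotypes):
        if a == b:
            if start is None:
                start = i  # a new homozygous run begins here
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i))
            start = None  # a heterozygous position breaks the run
    # Close a run that extends to the end of the data.
    if start is not None and len(genotypes) - start >= min_length:
        runs.append((start, len(genotypes)))
    return runs


def inbreeding_fraction(genotypes: List[Genotype], min_length: int = 3) -> float:
    """Fraction of positions that fall inside long homozygous runs."""
    if not genotypes:
        return 0.0
    covered = sum(end - start for start, end in homozygous_runs(genotypes, min_length))
    return covered / len(genotypes)


# Toy data: eight positions, two of them heterozygous.
toy = [
    ("A", "A"), ("C", "T"), ("G", "G"), ("G", "G"),
    ("T", "T"), ("A", "A"), ("C", "G"), ("T", "T"),
]

print(heterozygosity(toy))       # 0.25
print(homozygous_runs(toy))      # [(2, 6)]
print(inbreeding_fraction(toy))  # 0.5
```

In this toy genome, a quarter of the positions are heterozygous, and half of the positions sit inside one long homozygous stretch. Comparing such values between a historical and a modern sample of the same species is the kind of contrast the genomic erosion indices are meant to capture.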
A treasure-trove for conservation efforts
Having these estimates can be hugely relevant for conservation experts, since they impact the opportunities of threatened species to adapt to present or future environmental changes (diversity and inbreeding) and their survivability in short to medium time spans (damaging mutations).
“At its minimum expression, the pipeline only requires sequencing data from two specimens from the same species, one ancient/historical and one modern, and a reference genome from the same species or a closely related one”, says David Díez-del-Molino.
With minimal command-line usage, the pipeline can then map the sequencing data to the reference genome to reconstruct the genomes from the two samples, performing quality filtering and quality control along the way. It also prepares the data for downstream analyses, which include multiple reports and comparable estimates of the genomic erosion indices per sample.
“Depending on the storage and computational power available to the user, as well as the size of the genome of the target species, GenErode can be run on dozens of samples at a time. Importantly, GenErode does a great job at keeping reports and logs of settings used in past runs, which should help with reproducibility”, he says.
There are multiple pipelines available that can be used to process genomic data from modern samples, as well as pipelines for ancient DNA data.
“To our knowledge, this is the only pipeline aimed at specifically processing and analyzing genomic data from endangered species. The uniqueness of GenErode comes from the fact that you can process both kinds of data, historical and modern, in the same pipeline. Since the aim is to make the downstream analyses comparable between samples from different periods, our pipeline processes the data accordingly”, says David Díez-del-Molino.
Today, there are multiple large-scale projects, such as the Darwin Tree of Life (DTOL), the Vertebrate Genomes Project (VGP), and the European Reference Genome Atlas (ERGA), that are set to generate reference genomes for all eukaryotes.
“This has the potential to transform genomics research for endangered species. Therefore, we predict that genomic projects on endangered species will make use of these resources and will multiply in the coming years as well. I think bioinformatics pipelines such as GenErode can be at the center of such a movement, helping researchers and practitioners alike to process and analyze their genomic data”, David says.
“There is definitely scope for GenErode 2, but we will see. For now, we will focus on solving any issues or bugs the users might find and improving this pipeline as much as we can to make it as useful as it can be”, he concludes.
What was the hardest part of developing this pipeline?
“The paper is only the face of the pipeline for the research community, but there is a LOT of work behind it that is not so obvious. For example, when we started developing this pipeline we had some ideas of what we wanted, but such ideas changed over time, so we had to adapt the pipeline. We did a lot of testing with different kinds of ancient/historical data that led to big changes to the pipeline a few times. Also, at some point during the development we decided to make the pipeline public, so other people would also benefit from it, which also led to many integral changes to how the pipeline works. One of the hardest things was to decide when to stop developing and start working on a deliverable package instead of continuing to add new features. In the end, I think everyone involved was happy to wrap the project up with a publication, so we (sort of) could move on”, says David Díez-del-Molino.
Do you have some funny or inspiring anecdotes from working with the team behind your paper?
“In the Centre for Palaeogenetics we usually bring a cake for everyone when a paper is published. We call it ‘paper cake’. For GenErode we already had it twice, once when the pipeline was made public on GitHub and another one when the manuscript was published as a preprint on bioRxiv. We are currently planning a third paper cake for GenErode now that it has been published in BMC Bioinformatics. I think that shows how important this pipeline is for us”, says David Díez-del-Molino.