During 2021, WASP and the SciLifeLab and Wallenberg National Program on Data-Driven Life Science (DDLS), the two largest research programs in Sweden, launched a joint call with the aim of solving ground-breaking research questions across their different scientific disciplines. In total, 15 applications were awarded grants for two-year projects.
The call closed on September 1, 2021, but you can read about the call here.
Joakim Andén (KTH), Erik Lindahl (SU)
Proteins are essential biomolecules, acting in basically all cellular functions. Single-particle cryogenic electron microscopy (cryo-EM) has traditionally been the main method for studying the morphology of large protein complexes. Advances in this technique in recent years now allow researchers to use cryo-EM to solve macromolecular structures at near-atomic resolution. However, available methods only work for rigid biomolecules, and further knowledge about the inherent flexibility and dynamics of biomolecules is essential, not least in drug design.
To construct data-driven priors of atomic models trained using simulations obtained from molecular dynamics.
Data-driven priors in the form of deep neural networks (DNNs) will be used to construct atomic models. The priors will then be used to estimate the posterior distribution of atomic model trajectories given cryo-EM data.
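In Bayesian terms, the estimation step described above can be sketched as follows. The notation is an assumption introduced here for illustration and does not come from the project description: $x$ denotes an atomic model trajectory, $y$ the cryo-EM data, $p_\theta(x)$ the DNN prior trained on molecular dynamics simulations, and $p(y \mid x)$ a likelihood describing how the images arise from the model.

```latex
p(x \mid y)
  = \frac{p(y \mid x)\, p_\theta(x)}{\int p(y \mid x')\, p_\theta(x')\, \mathrm{d}x'}
  \propto p(y \mid x)\, p_\theta(x)
```

The prior concentrates probability on physically plausible conformations, so the data only need to discriminate among those.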
The project will propose a framework to recover molecular motions and guide multiscale simulations of protein dynamics using cryo-EM data. This would serve as a catalyst for research both in machine learning and structural biology. An important application will be a better prediction of the effects of ligand binding to biomolecules, which is essential in drug design.
Arne Elofsson (SU), Hossein Azizpour (KTH)
Basically all cellular functions are regulated by interacting proteins. The essential role of proteins in all living organisms is a result of their complex and variable three-dimensional structures, which in turn enable their different functions. Increased knowledge about protein structures, functions and interactions is therefore fundamental to understanding basic cellular functions as well as the alterations behind diseases.
In recent years, the AI/deep-learning system AlphaFold2 has radically improved the possibilities to retrieve highly reliable structures of single proteins from their amino acid sequences alone. However, predicting protein binding and mechanisms of interaction remains a challenge.
To develop new deep learning tools to improve predictions of protein binding and interactions.
A new fold and dock protocol based on improved multiple sequence alignment and the AlphaFold2 tool has been developed, and will be developed further within this project. These tools, along with novel deep learning techniques, will be used to generate models of protein-protein interactions. In addition, the computational power in the AI/deep learning focused Berzelius SuperPOD will be used.
The tool developed in this project will be used to predict the structures of all known protein-protein interactions in open source databases, and thereby provide the biological community with a large library of protein interaction models. In the long term, this will increase the understanding of cellular functions as well as disease mechanisms.
Mika Gustafsson (LiU), Rebecka Jörnsten (GU/CTH)
Many common drugs work ineffectively for sub-groups of patients, as a result of the interplay between a multitude of small-effect genetic and epigenetic factors in complex diseases. New biotechnology methods, called omics, have made it possible to measure molecular imprints of a whole cell, which could be useful for the development of more individualized therapies. Deep auto-encoders (DAEs), a type of artificial neural network, are flexible non-linear dimension reduction methods that have recently emerged as effective tools for summarizing high-dimensional complex genomics data.
To create a flexible multi-omic data integration tool that captures disease-specific structure across multiple levels of biological data to help identify processes related to disease outcome, severity, and response to treatment.
DAEs will be developed with latent spaces constrained by biological side information, such as cellular pathways. These DAEs can be combined into deep translational networks (DTNs). The DTNs are further constrained with biological information on the samples, e.g., disease subtypes or clinical outcome.
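One common way to constrain a latent space with pathway information is to mask the encoder weights so that each latent unit only sees the genes in one pathway. The sketch below illustrates that idea and is not the project's actual architecture. The gene names, pathway names, weights and expression values are all made up.

```python
import math

# Hypothetical toy setup: 4 genes and 2 pathways (illustrative names only).
GENES = ["g1", "g2", "g3", "g4"]
PATHWAYS = {"pathway_A": {"g1", "g2"}, "pathway_B": {"g3", "g4"}}


def build_mask(genes, pathways):
    """Return a (pathways x genes) 0/1 mask: 1 where the gene is in the pathway."""
    return [[1.0 if g in members else 0.0 for g in genes]
            for members in pathways.values()]


def encode(x, weights, mask):
    """Masked linear encoder with tanh: one latent unit per pathway."""
    latent = []
    for w_row, m_row in zip(weights, mask):
        s = sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        latent.append(math.tanh(s))
    return latent


mask = build_mask(GENES, PATHWAYS)
weights = [[0.5, -0.2, 0.9, 0.1],   # arbitrary example weights
           [0.3, 0.8, -0.4, 0.6]]
x = [1.0, 2.0, 3.0, 4.0]            # made-up expression values
z = encode(x, weights, mask)
```

Because of the mask, changing the expression of `g3` or `g4` cannot affect the latent unit for `pathway_A`, which is what makes each latent dimension interpretable as a pathway activity.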
More advanced data-driven data-integrative methods will be essential in biology as multi-omics single-cell data sets are rapidly emerging. This project will illustrate how flexible integration of multiple data sources can lead to new insights into disease processes, which is the key to individualized treatment strategies.
Joakim Jaldén (KTH), Emma Lundberg (KTH)
Cellular stress responses have been shown to be important in normal human physiology as well as in disease mechanisms. Live cell fluorescence microscopy allows visualization of morphological changes, revealing cellular states. Convolutional Neural Networks (CNNs) enable protein pattern recognition and cell segmentation in microscopy images. However, successful training of deep learning models requires capturing not only the major population of cells, but also the minority of rare cells.
To perform early identification and active tracking of cells with rare cell fates including apoptosis, senescence, and drug resistance, by using reinforcement learning in the live cell imaging setting.
Two computational elements will be developed:
1) A method that is sensitive enough to identify cell fate groups based on their morphological changes and stress response pattern.
2) A strategy for a self-driving microscope to reduce phototoxicity and maximize the chance to acquire data points for less represented rare cell populations.
The self-driving system will enable rapid on-the-fly cell fate prediction, automatic microscopy control, and active tracking and acquisition for rare cell populations. The work will provide a framework for modeling the entire cell in a purely data-driven manner. Such digital cell models will further allow in-silico cell experiments which can directly contribute to accelerated drug discovery for a wide range of diseases.
Sven Nelander (UU), Rebecka Jörnsten (GU/CTH)
Each year, approximately 1300 new cases of brain tumors are diagnosed in Sweden. A better understanding of the mechanisms behind cellular growth of different tumors is essential to develop new, targeted therapies against the diseases. New biotechnological methods such as single cell sequencing and gene editing show promise as tools to explore brain tumor biology. The data generated by these methods are complex and require new analytical methods.
To develop a new method to further uncover key genes and cellular pathways behind brain tumor growth.
CRISPR sequencing will be used for large-scale screening of brain tumor xenografts. Thereafter, AI models based on neural networks that integrate and structure the CRISPR-Seq results in relation to other databases will be utilized to reveal therapeutic targets and mechanisms.
The project will address several challenges including the formulation of a new class of AI models, based on structured and adaptive regularization. The project can increase our understanding of invasive brain tumor growth and the role of AI methods for genomic data interpretation. This will improve our capacity to identify in vivo relevant disease genes, thereby enabling development of new therapies specifically targeting certain tumor diseases.
Martin Rosvall (UmU), Beatrice Melin (UmU)
The biobanks and the extensive medical registers in Sweden comprise a goldmine for precision medicine research. However, exploiting the benefits of these resources is limited by the strict regulation of sensitive personal data (GDPR) and the complexity of growing amounts of data.
Making biobank data accessible by developing new solutions to overcome the obstacles of handling sensitive personal information and analyzing complex data.
The new biobank infrastructure PREDICT built from the Västerbotten Intervention project will provide data and implement the results coming out from our planned analyses. We will develop a machine-learning approach for data de-identification and evaluation of the de-identified biomedical data based on risk for re-identification and data quality, allowing data sharing and recycling unrestricted by GDPR.
In addition, we will develop methods for fast, interactive biobank data visualization and dimensionality reduction for longitudinal data from repeated samples donated by the same individual over time, enabling a broad research community to access the data.
Making the biobank data accessible will propel precision medicine research, advance its clinical use, and ultimately improve population health and survival in major endemic diseases.
Kevin Smith (KTH), Theodoros Foukakis (KI)
Breast cancer is the most common cancer for women and around 1,500 patients die each year in Sweden due to the disease. Regular mammographic screening has been demonstrated to decrease mortality by 20% to 40%, but this method has limited sensitivity for some tumor types. ScreenTrust, an AI-based approach, has been developed for estimating breast cancer risk from screening images with some success. However, all models have been based on Convolutional Neural Networks (CNNs) that only accept a single image as input data.
To develop a new class of neural networks for breast cancer risk assessment that considers information from other views and historic images, to improve early detection and select patients for more sensitive screening methods.
Vision transformers (ViTs) will be used, which can outperform CNNs on standard vision tasks. ViTs provide a powerful way to combine multiple mammographic views from a patient. In addition, the ViT can treat an individual image or a patient’s entire clinical history as a sentence and learn which words (or regions) to pay attention to for the task at hand.
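The "image as a sentence" idea can be illustrated with a minimal sketch in plain Python. It assumes non-overlapping square patches and uses raw pixel values as tokens; a real ViT would add learned linear projections, position embeddings and many attention layers. All function names and the toy 4x4 "images" are hypothetical.

```python
import math


def image_to_patches(image, patch):
    """Split a 2D list (H x W) into flattened, non-overlapping patch tokens."""
    h, w = len(image), len(image[0])
    tokens = []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            tokens.append([image[r + i][c + j]
                           for i in range(patch) for j in range(patch)])
    return tokens


def build_sequence(views, patch):
    """Concatenate tokens from several views, tagging each with its view index."""
    return [(view_id, tok)
            for view_id, image in enumerate(views)
            for tok in image_to_patches(image, patch)]


def attention_weights(query, keys):
    """Scaled dot-product attention weights (softmax over all keys)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Two toy 4x4 "views" of the same patient (made-up pixel values).
view_a = [[r * 4 + c for c in range(4)] for r in range(4)]
view_b = [[0 for _ in range(4)] for _ in range(4)]
sequence = build_sequence([view_a, view_b], patch=2)
tokens = [tok for _, tok in sequence]
weights = attention_weights(tokens[0], tokens)
```

Because every token attends over the whole sequence, a region in one view can draw on regions in another view or an earlier exam, which a single-image CNN cannot do.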
Reliable AI-based methods to estimate breast cancer risk offer a potential solution, which could allow hospitals to offer effective personalized screening regimens and care to women, thereby increasing breast cancer survival.
Petter Brodin (KI), Dimos Dimarogonas (KTH)
Immune systems in humans vary a lot between individuals. At the same time, the composition of cells in the blood is stable within an individual over time in the absence of perturbations such as an infection. To understand higher order functions in this complex system and its regulatory mechanisms, simultaneous analyses of all cell populations are required. Technological advances now allow such analyses.
To develop a network model for the immune system responses.
A novel single-cell genomics approach established in the Brodin lab will be used, in which all immune cell populations are stimulated and analyzed in whole blood cultures. Cell composition will be modulated by depleting individual cell types, and the consequent functional responses of the remaining cells will be analyzed. Using state-space techniques, a network model of different immune system responses will be constructed. This can be used to design strategies to improve immunomodulatory therapies, by utilizing and expanding leader-selection and pinning control methodologies.
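As a rough illustration of a state-space network model, the sketch below simulates a discrete-time linear system x[t+1] = A x[t] + B u[t]. Each state stands for the activity of one cell population, and the input acts on a single "pinned" node. The linear form, the matrices and the three-node chain are assumptions made for this example, not the project's model.

```python
def step(A, B, x, u):
    """One update of the linear state-space model x_next = A x + B u."""
    n = len(x)
    return [sum(A[i][j] * x[j] for j in range(n))
            + sum(B[i][k] * u[k] for k in range(len(u)))
            for i in range(n)]


def simulate(A, B, x0, inputs):
    """Return the state trajectory [x0, x1, ...] for a sequence of inputs."""
    trajectory = [x0]
    for u in inputs:
        trajectory.append(step(A, B, trajectory[-1], u))
    return trajectory


# Hypothetical 3-population chain: node 0 influences node 1, node 1 influences node 2.
A = [[0.5, 0.0, 0.0],
     [0.2, 0.5, 0.0],
     [0.0, 0.2, 0.5]]
B = [[1.0], [0.0], [0.0]]  # the input is applied only to node 0 (the pinned node)

trajectory = simulate(A, B, [0.0, 0.0, 0.0], [[1.0]] * 3)
```

Although only node 0 receives the input, its effect propagates along the network to nodes 1 and 2 over successive steps, which is the intuition behind choosing a few influential nodes to target.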
Understanding cell-cell dependencies and the regulatory network of immune cells will allow human immune responses to become more predictable. Precise immunomodulatory treatments may be devised using data-driven precision interventions, targeting the most important nodes in the immune cell network of a given patient.
Tino Ebbers (LiU), Ingrid Hotz (LiU)
Personalized medicine is at the center of all discussions of future medicine. The promise is to enable earlier diagnoses, better risk assessments, and optimal treatments by providing patient-specific care and treatment. The foundation for this is the generation of realistic patient models, which are largely based on high-quality imaging techniques.
Consequently, medical imaging data is continuously growing in size and complexity, where every generation has resulted in more data per exam. However, it also gives rise to many challenges in terms of handling and accessibility of a huge amount of complex imaging data.
To provide a pipeline for the next generation of personalized heart models.
Data will be collected using the image acquisition technology of the future. Methods that support efficient and effective analysis of the data for diagnosis and model building will be developed and integrated in a visual data browser with high-quality rendering.
Expertise from two postdocs, one with a medical engineering background and the other with experience in data analysis and visualization, will be combined in this project. Jointly they will generate detailed heart models and extract flow and muscle strain tensor data from the PCD-CT data. From the generated data, a database according to FAIR principles will be built. The visual data browser will support interactive exploration of the database with high-quality rendering of anatomy, blood flow and heart muscle strains, and an advanced search function.
Mats Karlsson (UU), Bo Bernhardsson (LU)
In the context of precision medicine, the purpose of pharmacometric modeling is to make predictions on an individual basis, based on known possible covariates. Traditionally, covariates have been restricted to demographic parameters (age, weight, etc.), but lifestyle (smoking, exercise, etc.) and omics data (SNPs, etc.) are vastly expanding the possible parameter space, presenting both challenge and opportunity.
Ultimately, predictive performance is limited by the amount of available training data, how informative the training data is, and how much of the inter-patient variability can be explained by the covariates present in the data set. One key challenge lies in understanding or learning causal dependencies and using them to establish a sound covariate model.
To develop methods to simultaneously learn the structure and parameters of covariate models and to improve pharmacometric predictors suited for individualized precision medical therapies.
Two machine-learning methods need to be developed: knowledge-based regularization (artificial neural network (ANN) and kernel models) and learning of low-dimensional mechanistic and causal relations. Datasets from Propofol, the Diabetes Registry, and the Multiple Sclerosis Registry will be considered for this project.
Pharmacometric predictions constitute a cornerstone in precision medicine. The existence of well-established national quality registries puts us in an internationally unique position to lead the development of data-driven pharmacometrics modeling, with the prospect of strengthening Sweden’s position within research in the area, and to inspire further collaboration in line with the objectives of the KAW DDLS initiative.
Tuuli Lappalainen (KTH), Stefan Bauer (KTH)
Understanding causal regulatory networks of the cell is one of the most fundamental gaps in our current biological knowledge. While we know that genomic information is transmitted to cellular functions via complex regulatory networks, we have limited knowledge of how genes regulate each other.
Previous approaches to tackle this question have been unsatisfactory: gene co-expression studies and equivalent correlation-based approaches lack directionality of causation, genetic data is sparse, and classical molecular biology of specific pathways lacks the necessary scale.
To develop novel computational methods to infer causal regulatory networks in human cells from existing and upcoming CRISPR gene knockdown data sets.
Modern causal learning models will be adapted to the high-dimensional perturbation data from single-cell genomics experiments to model gene regulatory interactions. The methods will be extended to handle perturbations of thousands of genes, and will include uncertainty estimates to distinguish between no causal effect and insufficient data. Single-cell gene perturbation experiments with CRISPR will be used to validate inferred causal networks.
The combination of novel statistical methods and a modern functional genomics toolkit (scalable CRISPR perturbations coupled with single-cell molecular readouts) provides exciting new opportunities to infer causal regulatory networks of the cell. Application of these methods to diverse data sets will allow us to characterize basic biology of cellular function, empower design of maximally informative experiments, and discover specific regulatory pathways underlying genetic risk for complex disease.
Fredrik Lindsten (LiU), Sebastian Westenhoff (UU)
Determination of biomolecular structures has led to scientific breakthroughs and innovations in molecular biology and new treatments for diseases. The recently developed deep-learning-based system AlphaFold 2 has improved the possibilities for structure prediction of single proteins. However, several challenges remain, including prediction of conformational heterogeneity.
Machine learning models are commonly trained from large datasets, but once trained they are unable to adapt to constraints or auxiliary, instance-specific data during inference. This is particularly problematic when reliable uncertainty quantification regarding their predictions is required.
To develop novel algorithms to include instance-specific experimental constraints in machine learning models, to bridge the gap between AI predictions and experimental observations, thereby providing new tools for producing reliable, hybrid “predicted/experimental” protein structural ensembles.
Initially, AI-generated structure predictions for complexes of phytochrome proteins will be scored against low-resolution cryo-electron-microscopy (EM) data. Hybrid structures will be obtained, which are expected to significantly increase the resolution compared to the pure data. The solution obtained with experimental constraints will be re-fed into the predictive model. In a second step, probabilistic modeling and uncertainty quantification (UQ) will be performed to predict conformational heterogeneity.
The novel hybrid predicted/experimental protein structures will combine the best of cryo-EM (high reliability, single particle technique, measures conformational heterogeneity) and predicted protein structures (high resolution, readily available). The novel AI methods are widely applicable; examples include materials design, drug discovery, and personalized medicine.
Alexander Schliep (GU/Chalmers), Pär Matsson (Sahlgrenska Academy)
Machine learning (ML) and Artificial Intelligence (AI) have been a remarkable success in improving small molecule drug discovery, from generating novel molecular structures to suggesting synthesis pathways. While small molecules interacting with proteins are the most frequently used drug modality today, alternatives are increasingly explored to address unmet clinical needs. Particularly, oligonucleotide therapeutics – i.e., drugs based on chemically modified short RNA/DNA sequences – are opening new opportunities in disease areas where traditional drugs have failed.
To predict thermodynamic effects of novel chemical modifications to oligonucleotides, predict the impact of chemical conjugation on the thermodynamics of therapeutic oligonucleotide binding, and enable federated, privacy-preserving learning of thermodynamics prediction.
Driven by sample data for the thermodynamics of DNA-DNA hybridization, ML models will be developed that can serve as the basis for transfer to chemically modified oligonucleotides.
Available data for specific chemical modifications will provide input for transfer learning. Extensions to novel modifications, and to combinations of different modification types, will be additionally based on in silico simulations where experimental data is sparse. By using existing data, one of the goals of the project is to predict the impact of conjugation and select suitable conjugates dependent on oligonucleotide sequence and targeted cell type. Finally, competing entities will be allowed to learn ML models for tasks such as off-target binding prediction from pooled data of oligonucleotide binding without compromising privacy of drug candidate data.
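The idea of competing parties learning a shared model without exchanging raw data can be sketched with federated averaging. This is an assumed technique chosen for illustration; the project description does not name a specific algorithm. The one-parameter linear model and the data points are made up. Sharing model parameters instead of data is not a formal privacy guarantee on its own.

```python
def local_update(w, data, lr=0.05, epochs=20):
    """Gradient descent on mean squared error for y ~ w * x, using local data only."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w_global, parties):
    """Each party trains locally; the server averages weights by sample count."""
    updates = [(local_update(w_global, data), len(data)) for data in parties]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


# Hypothetical private datasets held by two parties (true slope is 2.0).
party_a = [(1.0, 2.0), (2.0, 4.0)]
party_b = [(3.0, 6.0)]

w = 0.0
for _ in range(5):
    w = federated_round(w, [party_a, party_b])
```

Only the scalar `w` crosses the boundary between a party and the server in each round; the `(x, y)` pairs never leave the party that owns them.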
The proposal will greatly expand the speed of exploration of novel individual and combination strategies of chemical modification and conjugations in the design of novel oligonucleotide drugs. The proposal will also suggest a way for pharmaceutical companies and other parties to pool data without sacrificing privacy.
Mathias Uhlén (KTH), Andreas Kerren (LiU)
There is a need for a functional genome-wide annotation of the protein-coding genes to get a deeper understanding of mammalian biology. Present genome-wide annotation tools are useful, but require arbitrary cut-offs, commonly obtained via black-box computational models. This may hinder the ability of the analyst to make an informed decision regarding what are relevant fold-changes and detection limits for the underlying transcriptomics data.
To develop a new data-driven strategy for exploring whole-body co-expression patterns, using interpretable machine learning together with interactive visualization techniques that support informed decisions, leading to better predictions and improved trustworthiness of the results.
The new data-driven strategy will be based on interpretable unsupervised learning of whole-body co-expression patterns, supported by state-of-the-art visual analytics. Interactive guiding of the clustering process will be used to explore the gene expression landscape in humans and other mammalian species, to create a whole-body map of all protein-coding genes in all major cell types, tissues and organs.
The new interactive clustering strategy will improve the quality of the classification of all protein-coding genes according to their expression “landscape”, allowing distinct clustering of genes related to tissues and/or functions, such as testis or muscle contraction. Since the data will be publicly available and visualized in the Human Protein Atlas, it will be of world-wide use for the research community.
Björn Wallner (LiU), Alexey Amunts (SU)
Protein-protein interactions underlie the dynamic processes of all living cells. Until very recently the main tool to directly visualize those interactions has been cryo-EM. However, this technique is limited to the most stable complexes.
At the moment, the protein structure field is experiencing a revolution with the development of powerful computational structure prediction tools such as AlphaFold, which is powered by advanced deep neural networks and can produce models that compete with experimentally obtained crystal structures.
Remarkably, AlphaFold works well for protein complexes even when it is trained on individual proteins. The reason for this is somewhat unclear but protein-protein interactions are not that different from protein interactions within a single chain. However, folding large protein complexes, with multiple linkers, involving transient interactions will still be a challenge.
To develop new tools to investigate large dynamic multi-component systems.
AlphaFold will be used to systematically predict protein-protein interactions along the ribosomal assembly pathway, by analyzing previously unpublished high-resolution structures of assembly intermediates of human ribosome assembly in mitochondria. This is done to optimize the AlphaFold pipeline and to provide a proof of principle.
Using previous results, transient interactions and structural dynamics will be explored by retraining the Evoformer and the Structure Module. Finally, a system that is optimized to use raw experimental data or electron densities, in addition to the evolutionary and structural information, will be developed.
Due to the highly flexible nature of many other cellular assembly systems, and the fundamental importance of the biological problem addressed here, the methods pioneered in this project are expected to gain visibility, to be applied beyond the ribosome assembly problem, and to become a widely used tool for multi-protein macromolecules.