Artificial intelligence and deep learning hold great promise, but how are they used in life science, and what do we need to overcome to get there?
The amount of data acquired in life science is increasing dramatically by the day. So how can these massive amounts of data be used efficiently, or even at all? One way, which comes with great opportunities, is to let machines do the job through artificial intelligence and machine learning. There are, however, big challenges with using these techniques. To learn more about them, and about what these technologies actually are, we let Wei Ouyang, postdoc in the Emma Lundberg lab (SciLifeLab/KTH), straighten things out.
How would you define artificial intelligence and machine learning in the simplest possible way?
AI and machine learning are algorithms and computational tools that can learn from data, solve complex tasks and reason about outcomes, much as we humans do with our brains.
How are artificial intelligence and machine learning used in life science today?
Currently, AI/ML is mostly used for mining information from data, finding patterns in data, building predictive models, accelerating simulations, performing causal analysis and performing generative modeling. Examples include using AI models for microscopy image segmentation and classification, mining DNA/RNA motifs, predicting protein structure and generating drug molecules.
What are the challenges with artificial intelligence and machine learning in life science?
The challenges are severalfold, and include data-level, algorithmic-level, application-level and computational-level challenges.
At the data level, new AI models are more data hungry than conventional methods. They require more labelled data, which can be costly to obtain, especially when annotations are done manually. At the algorithmic level, while AI models are applicable across different tasks, design effort is still required to deal with complex data types. Meanwhile, there are more fundamental challenges for AI researchers to resolve, for example making AI models explainable and interpretable, and producing uncertainty scores to warn users when a model is applied to the wrong dataset. Generalization capability is also a limiting factor for applying AI models. Applications that aim for new discoveries in particular require careful design to avoid problems such as hallucination.
Since many data analysis problems in life science are less well defined and lack reference datasets, they are even harder for AI experts to work on. At the application level, researchers who apply AI methods to their research need to understand the requirements, advantages and drawbacks of AI tools and, more importantly, be able to validate the results generated with AI models.
For example, we conventionally work with small amounts of data using handcrafted features. In deep learning, we tend to work with large amounts of raw data with little selection or processing. It is fine for images to contain some noise, since we can use quantity to compensate for quality. In microscopy imaging, for example, researchers would generally acquire dozens of images manually from selected fields of view and perform quantification via handcrafted image analysis workflows. With deep learning, we tend to do automatic whole-slide scanning, acquiring large numbers of images that can then be processed by deep learning models.
In practice, AI models require more compute power, which can be challenging for non-experts. More specifically, running models requires a sophisticated software setup and dedicated hardware such as GPUs and TPUs. As a result, the number of user-friendly AI tools for non-experts is currently limited. More recent AI models such as transformers, which have great potential for analyzing DNA sequences, require enormous amounts of compute power and energy to train. For many current and future tasks, it is inevitable that many AI tools will need to be deployed to cloud or on-premise computing infrastructure that is more scalable. This poses further challenges for infrastructure building and software development.
What are the opportunities?
There are many opportunities for data-driven life science. The most important one is that we are generating more and more data: new imaging and sequencing techniques are becoming much more affordable, and it is becoming much easier to produce massive datasets. Beyond raw data, we can combine different techniques to generate labels, which removes costly manual annotation and makes the results more reproducible and less biased. As an example, AI-powered label-free imaging allows predicting fluorescence labels from brightfield images.
When trained on large amounts of data, AI models not only perform better but also become more robust and more generalizable to unseen data. Research in weakly supervised and unsupervised learning allows training with little or no annotation, which opens the door to applying AI to much larger unlabeled datasets. For example, AlphaFold 2 is a model from the DeepMind team that was successfully applied to predicting protein folding from protein sequence. Apart from the AI algorithm itself, a big part of its success is owed to the ~170,000 proteins with known structure (together with a large dataset of proteins with unknown structure).
Recent work in Bayesian deep learning, symbolic AI and explainable AI is making AI more applicable and suitable for scientific discovery.
Overall, the AI research community is open, and the vast majority of papers are published with source code, which makes it easier to reproduce results and make further improvements. The community has been generating open-source software libraries, online courses, resources and tools.
For example, our group has been working on ImJoy (https://imjoy.io) for facilitating the deployment of AI tools to non-experts, and we have been working with the community on the Bioimage Model Zoo (https://bioimage.io) project to enable sharing of AI models for bioimage analysis. The SciLifeLab Data Center has been actively building data and AI infrastructure to support AI applications.
Professor Emma Lundberg adds that she is “convinced that artificial intelligence and machine learning will underpin lots of developments in life sciences and accelerate discoveries across the field. Such methods will serve as a key tool in the recently started SciLifeLab & Wallenberg National Program for Data-Driven Life Science.”
Are artificial intelligence and machine learning changing life science that dramatically in your view, Wei?
Yes. They give us a set of powerful and revolutionary tools for handling the complexity of biological systems, and they help us mine large amounts of data and integrate different data sources.
If so, is it already changing life science?
Yes, I think so, and there are many good examples. AI models have, for example, been successfully applied to predicting protein folding with AlphaFold 2, and deep learning models are applied to processing DNA sequences and microscopy images.
What could be achieved/discovered in life science in the long term using AI and machine learning?
In the near future, we will be able to build large-scale data-driven models of cells, tissues and more complex biological systems. Models trained on large amounts of data will allow in-silico experiments and enable much faster iteration in hypothesis validation, or purely data-driven discovery. In the long run, AI can assist scientists with literature study using NLP models, guide experimental design, generate hypotheses and help make decisions during experiments.
Any other thoughts on what could be important to highlight?
Maybe the issue of generalization: AI models can easily produce artefacts when applied to data that are “out-of-distribution”.
When working with sensitive data in life science, techniques such as differential privacy and federated learning are promising directions to pursue.
It is important for us to think about AI at the beginning of research planning. Instead of acquiring data first and thinking later about the data analysis, early planning will be beneficial for designing experiments to take advantage of AI models.
You mention “the issue of generalization” and AI models producing artefacts. Could you tell me more about that?
Generalization capability is about how well a model is likely to perform when given a slightly different dataset that was not seen during training. It is the key issue for applying models. Especially when the training dataset is small and biased, the resulting models can have poor generalization capability. As a result, they tend to overfit: they perform well during training but very badly when applied to an unseen dataset. The real risk is that they generate false results or artefacts, and the users may not notice. As an extreme example, when an image of cells is fed into a model trained to recognize cats or dogs, the model may say it is 99% sure that this is a dog. In practice, there are more subtle cases that are hard to detect. This is why we would like the model not only to produce output but also to generate a confidence or uncertainty score, so that users are warned when the results are less reliable.
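The cats-and-dogs example can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not a method described in the interview: the logits are made-up numbers, and the uncertainty score is the entropy of predictions averaged over a small ensemble of models, which is one common approach among several. The names `ensemble_predict`, `CAT_PHOTO` and `CELL_IMAGE` are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; 0 means fully certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def ensemble_predict(logits_per_model):
    """Average class probabilities over several models and score the uncertainty."""
    all_probs = [softmax(logits) for logits in logits_per_model]
    n_classes = len(all_probs[0])
    mean_probs = [
        sum(p[c] for p in all_probs) / len(all_probs) for c in range(n_classes)
    ]
    return mean_probs, entropy(mean_probs)

# Made-up logits [cat, dog] from three independently trained models (assumption).
CAT_PHOTO = [[4.0, -1.0], [3.5, -0.5], [4.2, -1.2]]   # in-distribution: models agree
CELL_IMAGE = [[-2.0, 5.0], [4.5, -1.5], [-3.0, 3.0]]  # out-of-distribution: models disagree

single_model = softmax(CELL_IMAGE[0])  # what one model alone would report
cat_probs, cat_uncertainty = ensemble_predict(CAT_PHOTO)
cell_probs, cell_uncertainty = ensemble_predict(CELL_IMAGE)

print(f"one model on the cell image: {single_model[1]:.1%} dog")
print(f"ensemble uncertainty, cat photo:  {cat_uncertainty:.2f}")
print(f"ensemble uncertainty, cell image: {cell_uncertainty:.2f}")
```

In this toy setup a single model is more than 99% sure that the cell image shows a dog, while the ensemble's uncertainty for that image is far higher than for the cat photo. A tool could use such a score, with a threshold chosen on validation data, to warn users that a result is unreliable.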
You also mention “differential privacy” and “federated learning”. Could you briefly explain those?
Sure. When we want to publicly share statistical information derived from sensitive datasets, such as patient data, we can use differential privacy techniques to make sure information about particular individuals is protected. Federated learning is a technique that allows training powerful aggregated models on many datasets held by isolated parties without exchanging the individual datasets. For example, we can use federated learning to train drug screening models independently in research institutions or pharma companies and simultaneously aggregate the models into a single, more powerful model without the need to exchange or share the underlying datasets.
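Both ideas can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than how any particular project implements them: the data are synthetic, `private_mean` uses the Laplace mechanism (one standard way to achieve differential privacy), and `federated_average` follows the basic federated averaging idea with a one-parameter model. All function names are hypothetical.

```python
import random

def private_mean(values, lower, upper, epsilon, rng):
    """Release a mean with Laplace noise calibrated to the privacy budget epsilon."""
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Changing one record moves the clipped mean by at most this much.
    sensitivity = (upper - lower) / len(clipped)
    scale = sensitivity / epsilon
    # The difference of two exponential draws is Laplace-distributed.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_mean + noise

def local_update(w, data, lr=0.05, steps=20):
    """Run a few gradient steps for y ~ w * x on one site's private data."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_average(site_datasets, rounds=10):
    """Train a shared weight; only model parameters leave each site, never data."""
    w = 0.0
    total = sum(len(d) for d in site_datasets)
    for _ in range(rounds):
        local_ws = [local_update(w, d) for d in site_datasets]
        # Aggregate local models, weighted by how much data each site holds.
        w = sum(len(d) * lw for d, lw in zip(site_datasets, local_ws)) / total
    return w

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Differential privacy: release the mean age of 1,000 synthetic patients.
ages = [rng.uniform(20, 90) for _ in range(1000)]
true_mean = sum(ages) / len(ages)
noisy_mean = private_mean(ages, lower=0, upper=100, epsilon=1.0, rng=rng)

# Federated learning: three sites hold private (x, y) pairs with y close to 3x.
def make_site(n):
    xs = [rng.uniform(0, 2) for _ in range(n)]
    return [(x, 3.0 * x + rng.gauss(0, 0.1)) for x in xs]

sites = [make_site(20), make_site(30), make_site(50)]
shared_w = federated_average(sites)

print(f"true mean {true_mean:.2f}, released mean {noisy_mean:.2f}")
print(f"federated estimate of the slope: {shared_w:.3f}")
```

A smaller `epsilon` adds more noise and gives stronger privacy at the cost of accuracy. In the federated part, the server only ever sees the weights returned by `local_update`, which mirrors the point that the underlying datasets are never exchanged.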
Read more about SciLifeLab's efforts in data-driven life science at the DDLS program site.