An image is a window into the world
Abstract
To develop AI for image analysis, we essentially need four components: (1) patient data, (2) reference standards, (3) neural networks, and (4) computational resources. For the patient data and reference standards, the most important aspects are representation, quality, and scale. For the neural networks and computational resources, they are expressiveness, parallelism, and scale. This is true for every organization developing AI for image analysis. In the following, we briefly explore how AI has transformed the field of medical imaging and provide the rationale for our vision.
Who is this for?
This page is written for our collaborators seeking to understand our perspective on medical imaging and AI, as well as for those interested in joining XDMD. It outlines our vision of the field, emphasizing what we find important and the principles that guide our approach.
Author: Dr. Rashindra Manniesing
Last modified: December 21, 2024
To understand what we see in an image, we must first understand the world it represents. An image is never acquired in isolation; it is always part of a larger context. From a physics perspective, it is equally important to consider how the image was acquired, as the image acts as a filter through which we observe. In medical imaging, this translates to understanding both the patient workflow (the world the image represents) and the acquisition process (the scanner that generates the images). In the following, we focus on the patient workflow.
Example workflow
A patient presents with neurological symptoms and is evaluated by a neurologist, who refers the patient for an MRI scan. The radiologist detects a brain tumor. While the tumor’s size and location are assessed, its aggressiveness remains unclear. The patient is scheduled for a brain biopsy, and the sample is sent to a pathologist for histological examination. The combined findings from imaging and pathology are discussed by a multidisciplinary team, including an oncologist and neurosurgeon. Surgery is recommended and the neurosurgeon uses pre-operative images to plan the procedure. Following surgery, the patient begins a course of radiotherapy and/or chemotherapy as part of the treatment plan. The patient will receive regular follow-up imaging to monitor for recurrence.
Although the above is a simplification, it provides valuable context to explain the role of imaging and AI.
Role of imaging
First, AI will not magically replace medical professionals, at least not any time soon. The example illustrates highly complex and specialized care, where human expertise remains essential. However, certain tasks in this workflow, such as tumor segmentation, can be automated using AI.
Second, imaging serves distinct purposes, each imposing specific requirements for image analysis. For example, tumor removal requires accurate segmentation, while follow-up imaging must ensure no tumor tissue is missed. In acute stroke imaging, by contrast, fast detection is crucial because time is brain.
Third, image interpretation is typically performed by a radiologist or pathologist, who are also part of a multidisciplinary team where findings and implications are discussed. They are trained to interpret medical images and understand the patient and workflow. Thus when it comes to understanding medical images, we must rely on their expertise.
An AI moment
In September 2012, a small team from the University of Toronto submitted their solution to the ImageNet classification challenge. First organized in 2010, this competition focused on classifying natural images into a thousand object categories, using more than a million training images. The Toronto team won by an unprecedented margin. Their accompanying paper, ImageNet Classification with Deep Convolutional Neural Networks, co-authored by Geoff Hinton – not yet a Nobel laureate – is widely regarded as marking the birth of deep learning.
Deep learning, a type of AI, constructs neural networks using multiple consecutive processing layers that can be efficiently trained on large computing clusters. While neither the architecture nor the training algorithm (backpropagation) was new, this moment was transformative because two key factors converged: the availability of large-scale datasets and the ability to train neural networks effectively on powerful GPUs. Suddenly, there was a working blueprint for automating image analysis.
Another AI moment
Though this watershed moment is over a decade old – a lifetime in the fast-moving field of AI – the core principles remain unchanged. What has evolved is complexity and scale. The neural networks of 2012 have been succeeded by more expressive networks, culminating in the transformer architecture proposed by Google in 2017 – another defining moment in AI. This architecture acts as a differentiable universal computer, capable of learning complex algorithms, and can be trained efficiently in parallel on massive computer clusters. It forms the foundation of generative AI, a class of AI models capable of generating text, images, audio, and video. ChatGPT, one of the first large language models based on this architecture to be made publicly available, took the world by storm and marked a significant milestone in AI’s public adoption.
What is less widely known, but equally transformative, is that this architecture has also contributed to AlphaFold – an AI system developed by Google DeepMind that solved a 50-year-old protein folding problem in 2020. AlphaFold predicts protein structures from their amino acid sequences, accelerating breakthroughs in medicine and drug discovery. Its significance cannot be overstated. Demis Hassabis, co-founder of DeepMind, shared the 2024 Nobel Prize in Chemistry for his contributions to AlphaFold.
The transformer architecture thus has truly ignited the current AI revolution.
One more ingredient
Large-scale data, advanced neural networks, and massive computer clusters are not enough on their own. As the example patient workflow already highlighted, medical imaging is a specialized domain that requires expert knowledge to understand. We need one more ingredient to build our AI models, and that is domain expertise. Before AI can automatically interpret unseen images, it must first learn from examples – a process that typically involves annotating images with expert input. These annotated images are referred to as the reference standard, and this method of training is known as supervised learning.
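To make the idea of supervised learning concrete, here is a minimal sketch in plain Python. It is not a neural network: a one-feature logistic regression is swapped in to keep the example short. The feature (a normalized intensity per image) and the expert labels are invented for illustration only.

```python
import math

def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit a one-feature logistic regression with plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(features, labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted probability
            grad_w += (p - y) * x                     # gradient of the log loss
            grad_b += (p - y)
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return 1 (finding present) or 0 (finding absent)."""
    return 1 if w * x + b > 0 else 0

# Hypothetical data: one normalized intensity value per image,
# with a yes/no reference standard provided by an expert.
features = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(features, labels)
predictions = [predict(w, b, x) for x in features]
```

The model only ever sees the images through the labels it is given, which is why the quality of the reference standard matters as much as the model itself.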
Of note, voxel-wise annotations are not the only type of reference standard. Any measure related to the image can serve as a reference, such as a simple yes-or-no answer indicating the presence of a bleed or occlusion, or ground truth derived from corresponding histopathology images. These examples emphasize the importance of causality in medical imaging.
Automated image analysis thus has four main components.
Patient data
Reference standards
Neural networks
Computational resources
These are the basic building blocks. Despite all the amazing developments in AI and the brilliant new architectures, we still need high-quality, representative data and a clear understanding of what we are looking at before AI can potentially take over the task. There is no shortcut. This is true for every organization developing AI models for automated image analysis. Also, behind every block there is a complete world of scientific progress and technical innovation. The field of medical imaging is constantly changing, making an image essentially a snapshot in space and time.
Snapshot in spacetime
A few simple examples illustrate why this matters. Suppose a new segmentation method (a new AI model) has been developed on CT images with a certain resolution. If technological advancements enable scanning at much higher resolution (for example with photon-counting CT), then this model cannot be used as is on the new imaging data. Similarly, if an AI model has been trained on an adult population, then it cannot be used as is on a pediatric population. All factors contributing to image creation, as well as who is looking, influence the AI model: observers may disagree on what they see, which introduces uncertainty in the reference standard, and the field of medicine is advancing too.
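Observer disagreement can be quantified. One common choice, used here as an illustration rather than as a prescribed method, is the Dice overlap between two binary segmentations. The two masks below are made-up, flattened voxel labels from two hypothetical observers.

```python
def dice(mask_a, mask_b):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    # Count voxels that both observers marked as foreground.
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(mask_a) + sum(mask_b)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Hypothetical annotations of the same eight voxels by two observers.
observer_1 = [0, 1, 1, 1, 0, 0, 1, 0]
observer_2 = [0, 1, 1, 0, 0, 1, 1, 0]

agreement = dice(observer_1, observer_2)  # 3 shared voxels out of 4 + 4 -> 0.75
```

A model trained on one observer's annotations cannot be expected to agree with another observer more closely than the observers agree with each other.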
Not only is the image a snapshot in spacetime; the AI model itself is as well: both are static. As local environments change, AI model performance declines over time, which requires monitoring when the model is deployed in clinical practice. Segmentation is thus an engineering problem and cannot be solved once and for all, the way a mathematical problem can.
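Monitoring a deployed model can be sketched as follows. This is a simplified illustration: the class name, the baseline score, the tolerance, and the window size are all assumptions made for the example, and a real deployment would need a periodically audited reference standard to compute the scores in the first place.

```python
from collections import deque

class PerformanceMonitor:
    """Flag a model for review when its recent mean score drops below baseline."""

    def __init__(self, baseline, tolerance=0.05, window=5):
        self.baseline = baseline              # score measured at validation time
        self.tolerance = tolerance            # accepted drop before raising a flag
        self.scores = deque(maxlen=window)    # keeps only the most recent scores

    def add(self, score):
        """Record a new score and report whether the model needs review."""
        self.scores.append(score)
        return self.needs_review()

    def rolling_mean(self):
        return sum(self.scores) / len(self.scores)

    def needs_review(self):
        # Only judge once the window is full, to avoid reacting to a single case.
        full = len(self.scores) == self.scores.maxlen
        return full and self.rolling_mean() < self.baseline - self.tolerance

# Hypothetical audit scores (for example Dice) collected over time.
monitor = PerformanceMonitor(baseline=0.90, tolerance=0.05, window=3)
flags = [monitor.add(s) for s in [0.91, 0.89, 0.90, 0.84, 0.80, 0.78]]
```

In this made-up sequence the flag is raised from the fifth score onward, once the rolling mean over three scores has fallen more than 0.05 below the baseline.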
But AI solves everything
No, AI does not solve everything, at least not until AI can discover new physics, which would be a sign of truly understanding our world and universe. We too are impressed and excited by AI developments. The first time interacting with ChatGPT, for example, can feel like magic. The potential of generative AI in medicine is large, but superb eloquence and convincing realism can be misleading. First, it is not always obvious when results are wrong. Ensuring correctness is a really hard problem, which makes patient safety a concern. Second, only big tech companies have the resources to build these models, raising privacy concerns. Finally, it may obscure the fact that, under the hood, quality data and manual annotation (now called ‘reinforcement learning from human feedback’) are still crucial for training.
Data is crucial. A large language model trained on blog posts about alternative medicine will generate different answers than one trained on medical scientific literature. But almost certainly they have been trained on both, so how does that work? It is the sheer scale of things. Training of any serious generative model requires massive compute and massive data. At some point the model switches from fast memorization to slow generalization, a phenomenon called grokking. That is when the magic happens. The model, also called a foundation model, can then be fine-tuned for a specific application.
Foundation models
Foundation models play an important role in medical imaging. Examples of foundation models are TotalSegmentator and SAM. They work because images have more commonalities than not. For example, scanners operate on the same physical principles regardless of manufacturer, anatomy is similar regardless of age or race, and segmentation uses similar image gradients regardless of anatomy. Earlier models did not exhibit grokking; later models do, and they enable zero-shot learning, which allows segmenting unseen images without additional training. Foundation models are excellent starting points for segmentation, but they require fine-tuning for your application since your patient data is unique. Do you trust SAM trained on images scraped from the internet to segment images of your patients? Probably not. With foundation models, too, there is no free lunch.
The perfect blend
We have approached the field of medical imaging from a physics and computer science perspective, building our vision from first principles. If there is one key take-home message, it is this: for most medical imaging problems we know what to do and how to do it, which does not imply that it is an easy task.
Other important perspectives can also be taken, such as the aforementioned causality perspective, clinical perspectives (from both the patient and the medical professional), health economics (understanding cost-effectiveness), safety and privacy, and many more. Regardless of which perspective, if you have questions or wonder how we can help solve your medical imaging problem, feel free to reach out.