🔬 Training Transformers to solve 95% failure rate of Cancer Trials - Ron Alfa & Daniel Bear, Noetik
Latent Space: The AI Engineer Podcast
The failure of 90-95% of cancer drugs in clinical trials stems primarily from poor patient selection rather than inadequate pharmacology. Noetik addresses this by building foundation models that analyze multimodal patient data (H&E pathology stains, protein markers, and spatial transcriptomics) to identify therapeutically relevant cancer subtypes. Trained on massive, proprietary datasets generated in-house, these models act as "world models" that simulate how specific patient populations respond to drug perturbations. This approach moves beyond simple, biased biomarkers, allowing for the discovery of novel targets and the optimization of clinical trial design. Using self-supervised learning on high-density spatial data, the models aim to provide a scalable, interpretable framework for matching the right molecule to the right patient, with the goal of transforming oncology from a trial-and-error process into a data-driven, predictive science.
00:00 Addressing High Failure Rates in Oncology Drug Development
Between 90% and 95% of cancer drugs fail in clinical trials, primarily due to poor patient selection rather than flaws in pharmacology or target identification. Current preclinical models, such as immortalized cell lines and animal models, fail to accurately represent human patient biology, leading to "Frankensteinian" data that does not translate to clinical success. Noetik aims to solve this by building models that understand patient biology from the outset, allowing for the identification of therapeutically relevant cancer subtypes. By moving beyond simple biomarkers, which often fail to capture the complexity of disease, these models can better position molecules in the correct patient populations, potentially rescuing failed trials and improving clinical outcomes.
11:16 Multimodal Data Generation and Foundation Model Training
Building foundation models for biology requires intentional, high-quality data generation rather than relying on public repositories. Noetik generates multimodal data by sampling patient tumors dozens of times and distributing them across randomized arrays to control for batch effects. The data stack includes H&E staining for tissue-level structure, multiplex immunofluorescence for cell-level immune profiling, and spatial transcriptomics for molecular information. This approach creates a dense, information-rich dataset where each sample acts as an image with 20,000 channels. This scale, currently exceeding 100 million spatially resolved cells, is essential for training models that can generalize across diverse cancer indications and simulate complex biological responses.
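The randomized-array idea can be illustrated with a minimal sketch. The summary does not describe Noetik's actual layout procedure, so everything here is an assumption for illustration: the function name `assign_cores_to_arrays`, the patient IDs, the core counts, and the choice of a simple shuffle followed by round-robin dealing. The point is only that each patient's cores end up scattered across arrays rather than grouped on one, so array-level batch effects are less likely to be confounded with patient identity.

```python
import random
from collections import defaultdict

def assign_cores_to_arrays(patient_ids, cores_per_patient, n_arrays, seed=0):
    """Shuffle every (patient, core) pair, then deal them round-robin across arrays.

    Hypothetical helper: a simple randomization scheme, not Noetik's pipeline.
    """
    rng = random.Random(seed)
    all_cores = [(p, c) for p in patient_ids for c in range(cores_per_patient)]
    rng.shuffle(all_cores)
    layout = defaultdict(list)
    for i, core in enumerate(all_cores):
        layout[i % n_arrays].append(core)
    return dict(layout)

# Toy example: 3 patients, 12 cores each, spread over 4 arrays.
layout = assign_cores_to_arrays(["P1", "P2", "P3"], cores_per_patient=12, n_arrays=4)
for array_id, array_cores in sorted(layout.items()):
    patients_on_array = sorted({p for p, _ in array_cores})
    print(array_id, len(array_cores), patients_on_array)
```

A real design would likely add constraints (for example, guaranteeing every patient appears on several arrays), which plain shuffling does not enforce.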
26:33 Virtual Cell Simulations for Therapeutic Discovery
Virtual cell models function as practical heuristics for drug discovery by simulating how cells behave in specific contexts, such as predicting transcriptomic changes following genetic or chemical perturbations. These models prioritize patient-derived data over in vitro cell culture to ensure clinical relevance. By using H&E images as a "lingua franca," the models can perform inference on historical clinical trial data to identify patient clusters that respond to specific drugs. This allows for the discovery of novel targets and the design of more effective phase two and three trials. The models provide interpretability by linking patient-level embeddings to specific gene expression patterns, revealing why simple biomarkers often fail to predict therapeutic success.
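As a rough sketch of the last analysis step described above, once patients have been grouped into clusters (for example, by clustering model embeddings), the response rate can be compared per cluster. The cluster labels, outcomes, and the function name `response_rate_by_cluster` below are made-up toy values for illustration, not data or code from the episode.

```python
from collections import defaultdict

def response_rate_by_cluster(cluster_labels, responded):
    """Return the fraction of responders in each cluster.

    Hypothetical helper: cluster labels are assumed to come from an
    upstream step such as clustering patient-level embeddings.
    """
    counts = defaultdict(lambda: [0, 0])  # cluster -> [responders, total]
    for cluster, did_respond in zip(cluster_labels, responded):
        counts[cluster][0] += int(did_respond)
        counts[cluster][1] += 1
    return {c: resp / total for c, (resp, total) in counts.items()}

# Toy values only: 8 patients in 3 clusters.
clusters = ["A", "A", "B", "B", "B", "A", "C", "C"]
responded = [True, True, False, False, True, True, False, False]
rates = response_rate_by_cluster(clusters, responded)
for cluster in sorted(rates):
    print(cluster, round(rates[cluster], 2))
```

Clusters with markedly different response rates would be candidates for enrichment in a later trial, subject to proper statistical testing on real cohort sizes.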
42:02 Validating Human Biology Through In Vivo Mouse Perturbations
To bridge the gap between human patient data and preclinical validation, a "PerturbMap" platform utilizes barcoded CRISPR knockouts in mice. This system allows for the simultaneous study of hundreds of genetic perturbations within a single mouse lung, creating diverse tumor biologies that mirror human conditions. By "in silico humanizing" the mouse data, the models can infer human biological responses directly from mouse experiments. This approach validates the models by confirming they recognize known biological pathways and immune-cold versus immune-hot phenotypes. These experiments demonstrate that models trained on human data can accurately interpret mouse histology, confirming that the learned representations capture fundamental, conserved biological principles.
53:42 Scaling Autoregressive Architectures for Biological Foundation Models
The Tario model architecture represents a shift toward autoregressive, next-token prediction tasks, which have demonstrated superior scaling behavior compared to traditional masked auto-encoding. By increasing context lengths and model size, these architectures better capture non-linear patterns in spatial transcriptomics and low-expression genes. This scaling is critical for understanding the holistic structure of patient biology. The industry is increasingly recognizing the value of these broad, model-based licensing deals, as evidenced by a $50 million agreement with GSK. Such deals allow pharmaceutical companies to access foundation models that can be fine-tuned on their own siloed, historical clinical data, unlocking insights that were previously inaccessible through traditional project-driven collaborations.
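The difference between the two training objectives mentioned above can be sketched by looking at how each one builds its training examples. This is a generic illustration, not the Tario tokenization or training code: treating gene names as tokens, the `<mask>` symbol, the mask fraction, and both function names are assumptions made for the example.

```python
import random

MASK = "<mask>"

def next_token_examples(tokens):
    """Autoregressive objective: predict token t from all tokens before it."""
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

def masked_examples(tokens, mask_fraction=0.25, seed=0):
    """Masked objective: hide a random subset and predict it from the rest."""
    rng = random.Random(seed)
    n_masked = max(1, round(mask_fraction * len(tokens)))
    hidden = set(rng.sample(range(len(tokens)), n_masked))
    corrupted = [MASK if i in hidden else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(hidden)}
    return corrupted, targets

# Hypothetical token sequence; real models would use a learned tokenization.
tokens = ["CD8A", "PDCD1", "EPCAM", "KRT18", "CD3E", "MKI67", "VIM", "COL1A1"]

ar_examples = next_token_examples(tokens)
corrupted, targets = masked_examples(tokens)

print(len(ar_examples), "autoregressive examples from one sequence")
print(len(targets), "masked positions from the same sequence")
```

One sequence of length n yields n - 1 prediction targets under the autoregressive objective but only the masked subset under masking, which is one commonly cited reason next-token training extracts more signal per sequence.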
1:05:51 The Future of AI-Driven Therapeutic Development
Developing foundation models for biology requires the conviction to generate massive, proprietary datasets before a clear model signal emerges. This process involves years of trial and error to establish robust processing pipelines and data standards. While the field is in the early stages, akin to the pre-ChatGPT era for biology, the potential for AI to transform drug development is immense. Success in this space depends on moving beyond simple literature-reading agents and instead focusing on top-down, functional tissue modeling that abstracts away subcellular biophysical details. The ultimate goal is to create predictive tools that determine which patients should receive which drugs, fundamentally changing how oncology trials are designed and executed.