The Virtual Cell Challenge
The vision of a "virtual cell" — a computational model that simulates biological function across modalities and scales — has become a defining goal in computational biology. Powerful unimodal foundation models already exist for gene expression (scGPT, Geneformer, scFoundation) and for histology images (UNI, CONCH, Virchow), so a key challenge remains: how should they be fused?
This paper introduces a principled framework based on Partial Information Decomposition (PID) to understand when and how multimodal fusion actually helps — and proves that standard alignment-based approaches fundamentally cannot capture the most valuable cross-modal information.
The Problem
Paired spatial transcriptomics data (gene expression + histology images for the same tissue) is scarce. This favors Compositional Foundation Models (CFMs): architectures that fuse frozen unimodal experts via a learned interface.
But when does this fusion actually add information? And when does it merely aggregate redundant signal from the strongest modality?
Current benchmarks can't distinguish these cases — they only measure overall accuracy, not information structure.
The Solution
The Synergistic Information Score (SIS) — grounded in PID — quantifies the information gain achievable only through cross-modal interactions. It decomposes task information into:
- **Redundancy**: shared signal that both modalities carry
- **Unique₁**: information available only from expression
- **Unique₂**: information available only from morphology
- **Synergy**: information that requires both modalities together
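The decomposition can be made concrete with the original Williams–Beer redundancy measure on a toy discrete distribution. The sketch below (function name and the toy XOR example are illustrative, not taken from the paper) computes R, U₁, U₂, and S for two sources and a target; for Y = X₁ XOR X₂ with uniform inputs, every bit of task information is synergistic.

```python
import math

def pid_williams_beer(joint):
    """Williams-Beer PID for two sources X1, X2 and a target Y.
    `joint` maps (x1, x2, y) -> probability. Returns R, U1, U2, S
    and the total joint mutual information I((X1,X2); Y)."""
    def marg(keys):
        out = {}
        for (x1, x2, y), p in joint.items():
            k = tuple({'x1': x1, 'x2': x2, 'y': y}[v] for v in keys)
            out[k] = out.get(k, 0.0) + p
        return out

    px1y, px2y = marg(['x1', 'y']), marg(['x2', 'y'])
    px1, px2, py = marg(['x1']), marg(['x2']), marg(['y'])

    def mi(pxy, px):
        # Mutual information I(X; Y) from a pairwise joint distribution
        return sum(p * math.log2(p / (px[(x,)] * py[(y,)]))
                   for (x, y), p in pxy.items() if p > 0)

    i1, i2 = mi(px1y, px1), mi(px2y, px2)
    px12 = marg(['x1', 'x2'])
    i12 = sum(p * math.log2(p / (px12[(x1, x2)] * py[(y,)]))
              for (x1, x2, y), p in joint.items() if p > 0)

    def spec(pxy, px, y):
        # Specific information a source provides about the outcome Y = y
        return sum(p / py[(y,)] * math.log2(p / (px[(x,)] * py[(y,)]))
                   for (x, yy), p in pxy.items() if yy == y and p > 0)

    # Redundancy = expected minimum specific information over sources
    redundancy = sum(py[(y,)] * min(spec(px1y, px1, y), spec(px2y, px2, y))
                     for (y,) in py)
    u1, u2 = i1 - redundancy, i2 - redundancy
    synergy = i12 - u1 - u2 - redundancy
    return {'R': redundancy, 'U1': u1, 'U2': u2, 'S': synergy, 'I': i12}

# Pure-synergy example: Y = X1 XOR X2 with uniform, independent inputs
xor = {(x1, x2, x1 ^ x2): 0.25 for x1 in (0, 1) for x2 in (0, 1)}
print(pid_williams_beer(xor))  # all information is synergistic: S = 1 bit
```

Either modality alone is useless here (I(X₁;Y) = I(X₂;Y) = 0), yet together they determine Y perfectly — the information structure the paper argues alignment objectives cannot capture.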
Compositional Foundation Model Architecture
Figure 1. CFM architecture: frozen unimodal foundation models feed embeddings into a learned fusion interface. The choice of fusion objective determines which information types are captured.
Three Key Findings
Finding 1
Tasks dominated by linear redundancies — where both modalities carry the same signal — are sufficiently served by unimodal baselines. Multimodal fusion adds no benefit.
Finding 2
Complex niche definitions — requiring nonlinear relationships between expression and morphology — benefit from synergy-aware integration objectives that enable cross-modal interactions beyond linear alignment.
Finding 3
Fine-tuning a dominant unimodal expert is the most sample-efficient path for standard tasks. Multimodal benefits only emerge when tasks depend on information distributed across modalities.
Partial Information Decomposition (PID)
PID (Williams & Beer, 2010) decomposes the total information two variables provide about a target into four non-negative, additive components. This framework reveals the structure of multimodal information, not just its quantity.
Interactive PID Decomposition
Adjust the sliders to explore how different information structures affect the value of multimodal integration.
PID Component Definitions
| Component | Symbol | Definition | Biological Example | Captured By |
|---|---|---|---|---|
| Redundancy | R(X₁,X₂→Y) | Information provided by either modality alone | Cell type signal visible in both expression markers and morphology | Unimodal baselines, any fusion |
| Unique₁ | U₁(X₁→Y) | Info in expression but not morphology | Subtle transcriptional states not reflected in cell shape | Expression FM alone |
| Unique₂ | U₂(X₂→Y) | Info in morphology but not expression | Tissue architecture, collagen patterns, vascular structure | Histology FM alone |
| Synergy | S(X₁,X₂→Y) | Info only from combining both modalities | Tumor-immune niche: expression state + spatial context → prognosis | Only synergy-aware integration |
Why PID Matters for Biology
In single-cell biology, different data modalities capture overlapping but distinct aspects of cellular state. PID provides the first principled way to quantify how much of a task's signal is redundant across modalities, unique to one modality, or genuinely synergistic.
10-Method Fusion Benchmark
The paper benchmarks 10 multimodal fusion methods on spatial transcriptomics datasets, categorized by their ability to capture different PID components.
Benchmark Performance by Task Type
Select a task to see how fusion methods compare, organized by the task's SIS (synergy requirement).
Cell Type Classification
Assigning broad cell type labels (epithelial, fibroblast, immune). High redundancy — morphology and expression both encode this clearly.
Method Capability Comparison
Synergistic Information Score (SIS)
SIS quantifies how much of a task's information requires both modalities together. A high SIS means alignment-based fusion will systematically miss critical signal.
SIS = S(X₁,X₂→Y) / I(X₁,X₂; Y)

where S is the synergistic information from PID and I(X₁,X₂; Y) = R + U₁ + U₂ + S is the total task information.
SIS ranges from 0 (all information is redundant or unique — fusion optional) to 1 (all information is synergistic — fusion essential).
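Given estimated PID components, the score is a simple normalization. A minimal sketch (function name and the example numbers are illustrative, not from the paper):

```python
def sis(R: float, U1: float, U2: float, S: float) -> float:
    """Synergistic Information Score: the fraction of total task
    information that is only available from both modalities jointly."""
    total = R + U1 + U2 + S
    return S / total if total > 0 else 0.0

# Hypothetical niche-prediction task where most signal is cross-modal:
print(sis(R=0.10, U1=0.05, U2=0.05, S=0.40))  # ~0.67: high synergy
```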
SIS Across Spatial Transcriptomics Tasks
Higher SIS → greater benefit from synergy-aware integration. Low SIS → unimodal baselines suffice.
SIS-Guided Decision Framework
| SIS Range | Interpretation | Recommended Strategy |
|---|---|---|
| 0.00 – 0.10 | Negligible synergy | Use best unimodal FM alone |
| 0.10 – 0.25 | Low synergy | Simple fusion (concat) may help marginally |
| 0.25 – 0.50 | Moderate synergy | Interaction-based fusion (cross-attention, tensor) |
| 0.50+ | High synergy | Synergy-maximizing objectives required |
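The decision framework translates directly into a lookup. A sketch (function name and return strings are mine; the thresholds come from the table above):

```python
def fusion_strategy(sis: float) -> str:
    """Map a task's SIS to the recommended fusion strategy,
    following the SIS-guided decision table."""
    if sis < 0.10:
        return "best unimodal foundation model alone"
    if sis < 0.25:
        return "simple fusion (e.g. concatenation)"
    if sis < 0.50:
        return "interaction-based fusion (cross-attention, tensor)"
    return "synergy-maximizing objective"
```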
SIS vs. Performance Gain
Tasks with higher SIS show greater performance improvements from synergy-aware fusion.
Task Information Decomposition
Compare the full PID breakdown across all benchmarked tasks. Tasks are ordered by increasing SIS (synergy fraction).
The Alignment Collapse Theorem
The paper's central theoretical contribution: proving that standard alignment-based objectives on frozen encoders are mathematically limited to detecting linear redundancies.
Why Alignment Collapses
Figure 2. Left: Alignment collapses to linear subspaces, missing synergistic patterns (✗). Right: Synergy-aware integration captures nonlinear cross-modal signal (✓).
Formal Statement
Let L_align = −𝔼[sim(W f₁(x₁), W f₂(x₂))] for a linear projection W.

Then argmin_W L_align converges to the principal subspace of Cov(f₁(X₁), f₂(X₂)), which captures only linear redundancies.
The key insight: even though the data may contain rich nonlinear relationships between expression and morphology, alignment objectives with frozen encoders can only discover linear shared structure.
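A numerical toy makes the collapse concrete. Under an assumed XOR-style synergistic target (this construction is illustrative, not the paper's experiment), the cross-covariance that a linear alignment objective optimizes is essentially zero — there is nothing to align — while a simple cross-modal interaction feature recovers the target exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
# Frozen "embeddings" for two modalities: independent +/-1 features
f1 = rng.choice([-1.0, 1.0], size=n)
f2 = rng.choice([-1.0, 1.0], size=n)
y = f1 * f2  # purely synergistic target: neither modality alone predicts it

# The only structure a linear alignment objective can exploit:
cross_cov = np.cov(f1, f2)[0, 1]   # near zero: nothing to align
# Linear signal carried by each modality alone:
r1 = np.corrcoef(f1, y)[0, 1]      # near zero
r2 = np.corrcoef(f2, y)[0, 1]      # near zero
# A cross-modal interaction feature recovers the target exactly:
exact = np.allclose(f1 * f2, y)    # True by construction
print(cross_cov, r1, r2, exact)
```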
Practical Implications
- CLIP-style alignment is provably limited
- More data doesn't fix it
- Fine-tuning encoders helps but is expensive
- Interaction-based objectives are needed
Redundancy vs. Synergy Capture by Method Type
Sample Efficiency & Scaling
The paper performs a scaling analysis revealing a critical practical insight: the optimal strategy depends on how much paired data you have.
Scaling Curves
Performance vs. paired training samples for low-synergy (cell type) and high-synergy (niche) tasks.
Practical Decision Tree
Step 1: Compute SIS for your task
Use the SIS metric on a subset of paired data to measure synergy requirements.
Step 2: Assess data availability
Count available paired samples (expression + histology for same tissue).
Step 3: Choose strategy
| SIS | Data | Strategy |
|---|---|---|
| Low | Any | Unimodal expert (fine-tune) |
| High | Low (<1K) | Unimodal expert (still wins) |
| High | Med (1K–10K) | Simple interaction fusion (LMF) |
| High | High (>10K) | Full cross-attention / bottleneck |
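The three-step decision tree can be encoded as a small helper. A sketch (function name and return strings are illustrative; the SIS/data thresholds come from the strategy table above):

```python
def choose_strategy(high_sis: bool, n_paired: int) -> str:
    """Decision tree: pick a modeling strategy from the task's
    synergy requirement and the number of paired samples."""
    if not high_sis:
        return "fine-tune the dominant unimodal expert"
    if n_paired < 1_000:
        return "unimodal expert (still wins at this scale)"
    if n_paired <= 10_000:
        return "simple interaction fusion (e.g. low-rank LMF)"
    return "full cross-attention / bottleneck fusion"
```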
Crossover Points
The number of paired samples at which multimodal fusion begins to outperform the best unimodal baseline, by task type.
Implications for the Virtual Cell
Architecture Shift
Building toward a virtual cell requires a fundamental shift from alignment objectives to synergy-maximizing integration that preserves complementary cross-modal signal.
Data Strategy
Generating more paired spatial transcriptomics data is crucial. The value of multimodal models only emerges above critical data thresholds for synergy-rich tasks.
Task-Aware Design
Not all tasks need multimodal models. SIS should guide resource allocation — investing in fusion only where synergy justifies the complexity and data costs.
References
- "Beyond alignment: synergistic integration is required for multimodal cell foundation models." bioRxiv 2026.02.23.707420 (2026).
- "From modality-specific to compositional foundation models for cell biology." Cell Systems (2026).
- Williams, P.L. & Beer, R.D. "Nonnegative decomposition of multivariate information." arXiv:1004.2515 (2010).
- Bertschinger, N. et al. "Quantifying unique information." Entropy 16(4): 2161–2183 (2014).
- Liang, P.P. et al. "Quantifying & modeling multimodal interactions: an information decomposition framework." NeurIPS (2023).
- Daunhawer, I. et al. "What to align in multimodal contrastive learning?" ICLR (2025).
- Cui, H. et al. "scGPT: toward building a foundation model for single-cell multi-omics using generative AI." Nature Methods 21: 1470–1480 (2024).
- Theodoris, C.V. et al. "Transfer learning enables predictions in network biology." Nature 618: 616–624 (2023).
- Chen, R.J. et al. "Towards a general-purpose foundation model for computational pathology." Nature Medicine 30: 850–862 (2024).
- Lu, M.Y. et al. "A visual-language foundation model for computational pathology." Nature Medicine 30: 863–874 (2024).
- Nagrani, A. et al. "Attention bottlenecks for multimodal fusion." NeurIPS (2021).
- Zadeh, A. et al. "Tensor fusion network for multimodal sentiment analysis." EMNLP (2017).
- Liu, Z. et al. "Efficient low-rank multimodal fusion with modality-specific factors." ACL (2018).
- Arevalo, J. et al. "Gated multimodal units for information fusion." ICLR Workshop (2017).
- Hao, M. et al. "Nicheformer: a foundation model for single-cell and spatial omics." Nature Methods (2025).
- Radford, A. et al. "Learning transferable visual models from natural language supervision." ICML (2021).
- Andrew, G. et al. "Deep canonical correlation analysis." ICML (2013).
- "How to build the virtual cell with artificial intelligence." Cell 187(25): 7045–7063 (2024).
- Stärz, K. et al. "Benchmarking clustering, alignment, and integration methods for spatial transcriptomics." Genome Biology 25: 212 (2024).
- Lance, C. et al. "Multitask benchmarking of single-cell multimodal omics integration methods." Nature Methods (2025).