At a glance: 10 fusion methods benchmarked · SIS (Synergistic Information Score) · PID (Partial Information Decomposition, the theoretical framework) · CFM (Compositional Foundation Model)
🧬

The Virtual Cell Challenge

The vision of a "virtual cell" — a computational model that simulates biological function across modalities and scales — has become a defining goal in computational biology. While powerful unimodal foundation models exist for gene expression (scGPT, Geneformer, scFoundation) and histology images (UNI, CONCH, Virchow), the key challenge is: how do we fuse them?

This paper introduces a principled framework based on Partial Information Decomposition (PID) to understand when and how multimodal fusion actually helps — and proves that standard alignment-based approaches fundamentally cannot capture the most valuable cross-modal information.

Key Insight: Standard contrastive alignment (CLIP-style) on frozen encoders can only detect linear redundancies between modalities. The most valuable biological signal — nonlinear synergistic information — requires fundamentally different integration objectives.

The Problem

Paired spatial transcriptomics data (gene expression + histology images for the same tissue) is scarce. This favors Compositional Foundation Models (CFMs): architectures that fuse frozen unimodal experts via a learned interface.

But when does this fusion actually add information? And when does it merely aggregate redundant signal from the strongest modality?

Current benchmarks can't distinguish these cases — they only measure overall accuracy, not information structure.

💡

The Solution

The Synergistic Information Score (SIS) — grounded in PID — quantifies the information gain achievable only through cross-modal interactions. It decomposes task information into:

Redundancy Shared signal (both modalities carry it)

Unique₁ Expression-only info

Unique₂ Morphology-only info

Synergy Info requiring both together

📐

Compositional Foundation Model Architecture

[Diagram: gene expression (scRNA-seq / Visium, ~20K genes per spot, X₁ ∈ ℝᵈ¹) and histology (H&E-stained tissue, morphological features, X₂ ∈ ℝᵈ²) pass through frozen expression FMs (scGPT, Geneformer, scFoundation, scPRINT) and frozen histology FMs (UNI, CONCH, Virchow, CTransPath, Phikon) into a learned fusion interface (alignment/CLIP, cross-attention, tensor fusion/LMF, attention bottleneck, gated/bilinear, ...), which serves the downstream tasks: cell type classification, spatial domain ID, niche definition, gene imputation.]

Figure 1. CFM architecture: frozen unimodal foundation models feed embeddings into a learned fusion interface. The choice of fusion objective determines which information types are captured.
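The compositional pattern in Figure 1 can be sketched in a few lines: two frozen encoder callables plus a trainable fusion interface. The encoder stand-ins, dimensions, and concatenation fusion below are illustrative assumptions, not the paper's actual models.

```python
import numpy as np

class CompositionalFM:
    """Minimal CFM sketch: two frozen unimodal encoders plus a learned fusion
    interface. The encoders here are hypothetical stand-ins, not real FMs."""

    def __init__(self, encode_expr, encode_hist, fuse):
        self.encode_expr = encode_expr   # frozen: never updated during training
        self.encode_hist = encode_hist   # frozen: never updated during training
        self.fuse = fuse                 # the only trainable component

    def forward(self, x_expr, x_hist):
        z1 = self.encode_expr(x_expr)    # expression embedding
        z2 = self.encode_hist(x_hist)    # morphology embedding
        return self.fuse(z1, z2)

# Stand-in encoders and the simplest fusion interface (concatenation):
cfm = CompositionalFM(
    encode_expr=lambda x: x[:16],        # fake 16-dim expression embedding
    encode_hist=lambda x: x[:32],        # fake 32-dim morphology embedding
    fuse=lambda z1, z2: np.concatenate([z1, z2]),
)
out = cfm.forward(np.ones(20_000), np.ones(4_096))
print(out.shape)  # (48,)
```

Swapping `fuse` for cross-attention or a tensor product changes which PID components the model can capture, without touching the frozen experts.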

🎯

Three Key Findings

Finding 1

Tasks dominated by linear redundancies — where both modalities carry the same signal — are sufficiently served by unimodal baselines. Multimodal fusion adds no benefit.

Finding 2

Complex niche definitions — requiring nonlinear relationships between expression and morphology — benefit from synergy-aware integration objectives that enable cross-modal interactions beyond linear alignment.

Finding 3

Fine-tuning a dominant unimodal expert is the most sample-efficient path for standard tasks. Multimodal benefits only emerge when tasks depend on information distributed across modalities.

🧮

Partial Information Decomposition (PID)

PID (Williams & Beer, 2010) decomposes the total information two variables provide about a target into four non-negative, additive components. This framework reveals the structure of multimodal information, not just its quantity.

I(X₁, X₂ ; Y) = R(X₁, X₂ → Y) + U₁(X₁ → Y) + U₂(X₂ → Y) + S(X₁, X₂ → Y)
R: redundancy, shared across modalities · U₁: unique to expression · U₂: unique to morphology · S: synergy, requires both together
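A minimal numeric illustration of pure synergy, using only discrete mutual information (no PID estimator needed): for Y = XOR(X₁, X₂) with independent uniform bits, each modality alone carries zero information about Y, yet together they determine it completely. The helper below is a generic sketch, not code from the paper.

```python
import numpy as np
from itertools import product

def mutual_info(joint):
    """I(A;B) in bits from a joint probability table p(a, b)."""
    pa = joint.sum(axis=1, keepdims=True)   # marginal p(a)
    pb = joint.sum(axis=0, keepdims=True)   # marginal p(b)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])).sum())

# Y = XOR(X1, X2) with X1, X2 independent uniform bits.
p = np.zeros((4, 2))   # rows: (x1, x2) in {00, 01, 10, 11}; cols: y
for i, (x1, x2) in enumerate(product([0, 1], repeat=2)):
    p[i, x1 ^ x2] = 0.25

I_joint = mutual_info(p)                            # I(X1, X2 ; Y)
I_x1 = mutual_info(p.reshape(2, 2, 2).sum(axis=1))  # I(X1 ; Y)
I_x2 = mutual_info(p.reshape(2, 2, 2).sum(axis=0))  # I(X2 ; Y)
print(I_joint, I_x1, I_x2)  # 1.0 0.0 0.0 -> all task information is synergistic
```

In PID terms this is the extreme case R = U₁ = U₂ = 0 and S = 1 bit; no unimodal model, and no fusion restricted to unimodal signals, can do better than chance here.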
🎛️

Example PID Decomposition

Different information structures change the value of multimodal integration. Consider a task whose information decomposes as:

Redundancy: 30% · Unique₁ (Expr): 25% · Unique₂ (Morph): 20% · Synergy: 25%

Analysis: with 25% synergy, alignment-based fusion will miss about a quarter of the available information. Synergy-aware integration objectives are needed to fully exploit this task.
📋

PID Component Definitions

Component | Symbol | Definition | Biological Example | Captured By
Redundancy | R(X₁,X₂→Y) | Information provided by either modality alone | Cell type signal visible in both expression markers and morphology | Unimodal baselines, any fusion
Unique₁ | U₁(X₁→Y) | Information in expression but not morphology | Subtle transcriptional states not reflected in cell shape | Expression FM alone
Unique₂ | U₂(X₂→Y) | Information in morphology but not expression | Tissue architecture, collagen patterns, vascular structure | Histology FM alone
Synergy | S(X₁,X₂→Y) | Information only from combining both modalities | Tumor-immune niche: expression state + spatial context → prognosis | Only synergy-aware integration
🔬

Why PID Matters for Biology

In single-cell biology, different data modalities capture overlapping but distinct aspects of cellular state. PID provides the first principled way to quantify:

R: when fusion is unnecessary, because both modalities already carry the answer
U: when one modality dominates, so resources should focus there
S: when fusion is essential, because neither modality alone suffices
⚔️

10-Method Fusion Benchmark

The paper benchmarks 10 multimodal fusion methods on spatial transcriptomics datasets, categorized by their ability to capture different PID components.

Early Concatenation · BASELINE
Concatenate the raw embeddings from both encoders and feed them to a linear classifier. The simplest fusion.
Captures: redundancy + unique · Synergy: Low · Complexity: ●○○

Late Fusion (Average) · BASELINE
Train separate classifiers per modality and average their predictions. No cross-modal interaction.
Captures: redundancy only · Synergy: None · Complexity: ●○○

Contrastive Alignment (CLIP) · ALIGNMENT-BASED
Maximize the cosine similarity of paired embeddings in a shared space. The standard multimodal alignment recipe.
Captures: linear redundancy · Synergy: Low · Complexity: ●●○

Deep CCA · ALIGNMENT-BASED
Deep Canonical Correlation Analysis: maximize linear correlations between modality embeddings.
Captures: linear redundancy · Synergy: Low · Complexity: ●●○

Projection Alignment · ALIGNMENT-BASED
Learn linear projections mapping both modalities into a shared subspace via an MSE loss on paired data.
Captures: linear redundancy · Synergy: Low · Complexity: ●○○

Tensor Fusion Network · INTERACTION-BASED
The outer product of modality embeddings captures all pairwise feature interactions, at the cost of a parameter count that grows exponentially with the number of modalities.
Captures: synergy-capable · Synergy: High · Complexity: ●●●

Low-Rank Fusion (LMF) · INTERACTION-BASED
Efficient tensor decomposition capturing multiplicative interactions with a controlled parameter count.
Captures: synergy-capable · Synergy: High · Complexity: ●●○

Cross-Attention Fusion · INTERACTION-BASED
Each modality attends to the other via transformer-style queries, keys, and values. Dynamic and context-dependent.
Captures: synergy-capable · Synergy: High · Complexity: ●●●

Attention Bottleneck (MBT) · INTERACTION-BASED
Shared bottleneck tokens mediate cross-modal information flow. Parameter-efficient cross-modal interaction.
Captures: synergy-capable · Synergy: High · Complexity: ●●●

Gated Multimodal Unit · INTERACTION-BASED
Learned gates control each modality's contribution; a sigmoid-weighted combination enables nonlinear mixing.
Captures: partial synergy · Synergy: Medium · Complexity: ●●○
📊

Benchmark Performance by Task Type

The benchmark compares five tasks, ordered by increasing SIS (synergy requirement): Cell Type, Spatial Domain, Tissue Region, Niche Definition, and Microenvironment.

Cell Type Classification

Assigning broad cell type labels (epithelial, fibroblast, immune). High redundancy — morphology and expression both encode this clearly.

SIS: 0.08 · Best approach: Unimodal
Verdict: Unimodal expression FM alone matches or beats all fusion methods. Multimodal fusion is unnecessary.
🏆

Method Capability Comparison

🎯

Synergistic Information Score (SIS)

SIS quantifies how much of a task's information requires both modalities together. A high SIS means alignment-based fusion will systematically miss critical signal.

SIS(Y; X₁, X₂) = S(X₁, X₂ → Y) / I(X₁, X₂ ; Y)

where S is the synergistic information from PID

SIS ranges from 0 (all information is redundant or unique — fusion optional) to 1 (all information is synergistic — fusion essential).
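Given a PID decomposition, SIS is a one-liner; the function below simply restates the definition, writing the total information as the sum of the four atoms:

```python
def sis(redundancy: float, unique1: float, unique2: float, synergy: float) -> float:
    """SIS = S / I(X1, X2; Y), with total information as the sum of PID atoms."""
    total = redundancy + unique1 + unique2 + synergy
    return synergy / total if total > 0 else 0.0

# The 30/25/20/25 decomposition from the PID example above:
print(sis(0.30, 0.25, 0.20, 0.25))  # 0.25
```

In practice the hard part is estimating the PID atoms themselves (e.g. via a Bertschinger-style or Williams-Beer estimator); once they are available, SIS follows immediately.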

📈

SIS Across Spatial Transcriptomics Tasks

Higher SIS → greater benefit from synergy-aware integration. Low SIS → unimodal baselines suffice.

📋

SIS-Guided Decision Framework

SIS Range | Interpretation | Recommended Strategy
0.00 – 0.10 | Negligible synergy | Use best unimodal FM alone
0.10 – 0.25 | Low synergy | Simple fusion (concat) may help marginally
0.25 – 0.50 | Moderate synergy | Interaction-based fusion (cross-attention, tensor)
0.50+ | High synergy | Synergy-maximizing objectives required
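As a usage sketch, the table's thresholds translate directly into a lookup; the threshold values come from the table, while the boundary handling at exactly 0.10 / 0.25 / 0.50 is an arbitrary choice:

```python
def fusion_strategy(sis: float) -> str:
    """Map a task's SIS to a recommended strategy (thresholds from the
    SIS-guided decision framework; boundary handling is a modeling choice)."""
    if sis < 0.10:
        return "best unimodal FM alone"
    if sis < 0.25:
        return "simple fusion (concatenation)"
    if sis < 0.50:
        return "interaction-based fusion (cross-attention / tensor)"
    return "synergy-maximizing objectives"

print(fusion_strategy(0.08))  # best unimodal FM alone (e.g. cell type classification)
```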
🔮

SIS vs. Performance Gain

Tasks with higher SIS show greater performance improvements from synergy-aware fusion.

🧪

Task Information Decomposition

Compare the full PID breakdown across all benchmarked tasks. Tasks are ordered by increasing SIS (synergy fraction).

🔒

The Alignment Collapse Theorem

The paper's central theoretical contribution: proving that standard alignment-based objectives on frozen encoders are mathematically limited to detecting linear redundancies.

Theorem (informal): For frozen encoder representations f₁(X₁) and f₂(X₂), contrastive alignment objectives (InfoNCE, CLIP-loss) converge to linear subspaces of the encoder outputs. Any synergistic information requiring nonlinear cross-modal interaction is provably inaccessible.
📐

Why Alignment Collapses

[Diagram: left, alignment-based fusion projects the expression and morphology spaces onto linear subspaces and maximizes cos(z₁, z₂), capturing only redundant (shared) signal; right, synergy-aware integration forms a joint nonlinear interaction space, capturing redundant + synergistic signal.]

Figure 2. Left: Alignment collapses to linear subspaces, missing synergistic patterns (✗). Right: Synergy-aware integration captures nonlinear cross-modal signal (✓).

📝

Formal Statement

Prop 1: Let f₁, f₂ be frozen encoder maps.
Let L_align = -𝔼[sim(Wf₁(x₁), Wf₂(x₂))]
for linear projection W.

Then the minimizer of L_align over W converges to the
principal subspace of Cov(f₁(X₁), f₂(X₂)),
which captures only linear redundancies.

The key insight: even though the data may contain rich nonlinear relationships between expression and morphology, alignment objectives with frozen encoders can only discover linear shared structure.
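This limitation is easy to reproduce numerically. Below, two frozen scalar features are marginally independent and their product defines the target: every cross-correlation an alignment objective could exploit is near zero, while a single multiplicative interaction term recovers the target exactly. A toy sketch, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z1 = rng.choice([-1.0, 1.0], size=n)   # frozen "expression" feature
z2 = rng.choice([-1.0, 1.0], size=n)   # frozen "morphology" feature
y = z1 * z2                            # purely synergistic target

# Linear alignment works off cross-correlations, which all vanish here:
print(abs(np.corrcoef(z1, z2)[0, 1]))  # ~0: nothing for alignment to find
print(abs(np.corrcoef(z1, y)[0, 1]))   # ~0: no unimodal linear signal either
# A multiplicative interaction term recovers the target exactly:
print(np.corrcoef(z1 * z2, y)[0, 1])   # 1.0
```

This is the empirical face of Prop 1: the cross-covariance, and hence the aligned subspace, is empty, yet the joint signal is perfect.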

Practical Implications

CLIP-style alignment is provably limited
InfoNCE and contrastive losses maximize mutual information between projected representations. With frozen encoders, this reduces to finding linear correlations — the same solution as CCA.
More data doesn't fix it
The limitation is architectural, not statistical. Even with infinite paired data, alignment on frozen features converges to the same linear subspace.
Fine-tuning encoders helps but is expensive
Unfreezing encoders allows learning nonlinear cross-modal features, but requires paired data (scarce in biology) and destroys pretrained representations.
Interaction-based objectives are needed
Cross-attention, tensor products, and learned bottlenecks create nonlinear interaction terms that can capture synergistic information unavailable to alignment.
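To make the contrast concrete, here is a minimal low-rank (LMF-style) fusion step: each modality is projected into a shared rank-r space and combined elementwise, so the output contains products of features from both modalities, exactly the nonlinear interaction terms a linear alignment cannot represent. Dimensions are illustrative assumptions.

```python
import numpy as np

def low_rank_fusion(h1, h2, W1, W2):
    """LMF-style fusion sketch: project each modality into a shared rank-r
    space, then combine elementwise so the output contains products of
    features from both modalities (cross-modal interaction terms)."""
    return (h1 @ W1) * (h2 @ W2)

rng = np.random.default_rng(0)
d1, d2, r = 512, 1024, 64              # illustrative embedding dims and rank
W1 = rng.normal(size=(d1, r)) / np.sqrt(d1)
W2 = rng.normal(size=(d2, r)) / np.sqrt(d2)
h1, h2 = rng.normal(size=d1), rng.normal(size=d2)
z = low_rank_fusion(h1, h2, W1, W2)
print(z.shape)  # (64,)
```

The full tensor fusion equivalent would materialize the d₁×d₂ outer product; the low-rank form keeps the multiplicative interactions while bounding the parameter count at r(d₁ + d₂).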
📊

Redundancy vs. Synergy Capture by Method Type

📈

Sample Efficiency & Scaling

The paper performs a scaling analysis revealing a critical practical insight: the optimal strategy depends on how much paired data you have.

Key Finding: Fine-tuning a dominant unimodal expert is the most sample-efficient path for standard tasks. The benefits of multimodal fusion only emerge at higher data regimes and when tasks require cross-modal synergy.
📊

Scaling Curves

Performance vs. paired training samples for low-synergy (cell type) and high-synergy (niche) tasks.

💡

Practical Decision Tree

Step 1: Compute SIS for your task

Use the SIS metric on a subset of paired data to measure synergy requirements.

Step 2: Assess data availability

Count available paired samples (expression + histology for same tissue).

Step 3: Choose strategy

SIS | Data | Strategy
Low | Any | Unimodal expert (fine-tune)
High | Low (<1K) | Unimodal expert (still wins)
High | Med (1K–10K) | Simple interaction fusion (LMF)
High | High (>10K) | Full cross-attention / bottleneck
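The high-SIS rows of the table reduce to a data-budget lookup; a sketch using the table's 1K / 10K breakpoints, with boundary handling as my own choice:

```python
def high_sis_strategy(n_paired: int) -> str:
    """For a high-SIS task, choose fusion by paired-sample budget
    (regimes from the decision table; exact boundaries are a choice)."""
    if n_paired < 1_000:
        return "unimodal expert (still wins)"
    if n_paired <= 10_000:
        return "simple interaction fusion (LMF)"
    return "full cross-attention / bottleneck"

print(high_sis_strategy(500))     # unimodal expert (still wins)
print(high_sis_strategy(50_000))  # full cross-attention / bottleneck
```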
🎯

Crossover Points

The number of paired samples at which multimodal fusion begins to outperform the best unimodal baseline, by task type.

🔭

Implications for the Virtual Cell

Architecture Shift

Building toward a virtual cell requires a fundamental shift from alignment objectives to synergy-maximizing integration that preserves complementary cross-modal signal.

Data Strategy

Generating more paired spatial transcriptomics data is crucial. The value of multimodal models only emerges above critical data thresholds for synergy-rich tasks.

Task-Aware Design

Not all tasks need multimodal models. SIS should guide resource allocation — investing in fusion only where synergy justifies the complexity and data costs.

📚

References

  1. "Beyond alignment: synergistic integration is required for multimodal cell foundation models." bioRxiv 2026.02.23.707420 (2026).
  2. "From modality-specific to compositional foundation models for cell biology." Cell Systems (2026).
  3. Williams, P.L. & Beer, R.D. "Nonnegative decomposition of multivariate information." arXiv:1004.2515 (2010).
  4. Bertschinger, N. et al. "Quantifying unique information." Entropy 16(4): 2161–2183 (2014).
  5. Liang, P.P. et al. "Quantifying & modeling multimodal interactions: an information decomposition framework." NeurIPS (2023).
  6. Daunhawer, I. et al. "What to align in multimodal contrastive learning?" ICLR (2025).
  7. Cui, H. et al. "scGPT: toward building a foundation model for single-cell multi-omics using generative AI." Nature Methods 21: 1470–1480 (2024).
  8. Theodoris, C.V. et al. "Transfer learning enables predictions in network biology." Nature 618: 616–624 (2023).
  9. Chen, R.J. et al. "Towards a general-purpose foundation model for computational pathology." Nature Medicine 30: 850–862 (2024).
  10. Lu, M.Y. et al. "A visual-language foundation model for computational pathology." Nature Medicine 30: 863–874 (2024).
  11. Nagrani, A. et al. "Attention bottlenecks for multimodal fusion." NeurIPS (2021).
  12. Zadeh, A. et al. "Tensor fusion network for multimodal sentiment analysis." EMNLP (2017).
  13. Liu, Z. et al. "Efficient low-rank multimodal fusion with modality-specific factors." ACL (2018).
  14. Arevalo, J. et al. "Gated multimodal units for information fusion." ICLR Workshop (2017).
  15. Hao, M. et al. "Nicheformer: a foundation model for single-cell and spatial omics." Nature Methods (2025).
  16. Radford, A. et al. "Learning transferable visual models from natural language supervision." ICML (2021).
  17. Andrew, G. et al. "Deep canonical correlation analysis." ICML (2013).
  18. "How to build the virtual cell with artificial intelligence." Cell 187(25): 7045–7063 (2024).
  19. Stärz, K. et al. "Benchmarking clustering, alignment, and integration methods for spatial transcriptomics." Genome Biology 25: 212 (2024).
  20. Lance, C. et al. "Multitask benchmarking of single-cell multimodal omics integration methods." Nature Methods (2025).