At a glance: 10 fusion methods benchmarked · SIS (Synergistic Information Score) · PID (Partial Information Decomposition, the theoretical framework) · CFM (Compositional Foundation Model)
🧬

The Virtual Cell Challenge

The vision of a "virtual cell" — a computational model that simulates biological function across modalities and scales — has become a defining goal in computational biology. While powerful unimodal foundation models exist for gene expression (scGPT, Geneformer, scFoundation) and histology images (UNI, CONCH, Virchow), the key challenge is: how do we fuse them?

This paper introduces a principled framework based on Partial Information Decomposition (PID) to understand when and how multimodal fusion actually helps — and proves that standard alignment-based approaches fundamentally cannot capture the most valuable cross-modal information.

Key Insight: Standard contrastive alignment (CLIP-style) on frozen encoders can only detect linear redundancies between modalities. The most valuable biological signal — nonlinear synergistic information — requires fundamentally different integration objectives.

The Problem

Paired spatial transcriptomics data (gene expression + histology images for the same tissue) is scarce. This favors Compositional Foundation Models (CFMs): architectures that fuse frozen unimodal experts via a learned interface.

But when does this fusion actually add information? And when does it merely aggregate redundant signal from the strongest modality?

Current benchmarks can't distinguish these cases — they only measure overall accuracy, not information structure.

💡

The Solution

The Synergistic Information Score (SIS) — grounded in PID — quantifies the information gain achievable only through cross-modal interactions. It decomposes task information into:

Redundancy Shared signal (both modalities carry it)

Unique₁ Expression-only info

Unique₂ Morphology-only info

Synergy Info requiring both together

📐

Compositional Foundation Model Architecture

[Diagram: gene expression (scRNA-seq / Visium, ~20K genes per spot, X₁ ∈ ℝᵈ¹) and histology (H&E-stained tissue, morphological features, X₂ ∈ ℝᵈ²) pass through frozen expression FMs (scGPT, Geneformer, scFoundation, scPRINT) and frozen histology FMs (UNI, CONCH, Virchow, CTransPath, Phikon) into a learned fusion interface (alignment/CLIP, cross-attention, tensor fusion/LMF, attention bottleneck, gated/bilinear, ...), which serves the downstream tasks: cell type classification, spatial domain ID, niche definition, gene imputation.]

Figure 1. CFM architecture: frozen unimodal foundation models feed embeddings into a learned fusion interface. The choice of fusion objective determines which information types are captured.
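The compositional pattern in Figure 1 can be sketched in a few lines: two frozen encoder callables plus a trainable fusion interface. The encoder stand-ins, dimensions, and concatenation fusion below are illustrative assumptions, not the paper's actual models.

```python
import numpy as np

class CompositionalFM:
    """Minimal CFM sketch: two frozen unimodal encoders plus a learned fusion
    interface. The encoders here are hypothetical stand-ins, not real FMs."""

    def __init__(self, encode_expr, encode_hist, fuse):
        self.encode_expr = encode_expr   # frozen: never updated during training
        self.encode_hist = encode_hist   # frozen: never updated during training
        self.fuse = fuse                 # the only trainable component

    def forward(self, x_expr, x_hist):
        z1 = self.encode_expr(x_expr)    # expression embedding
        z2 = self.encode_hist(x_hist)    # morphology embedding
        return self.fuse(z1, z2)

# Stand-in encoders and the simplest fusion interface (concatenation):
cfm = CompositionalFM(
    encode_expr=lambda x: x[:16],        # fake 16-dim expression embedding
    encode_hist=lambda x: x[:32],        # fake 32-dim morphology embedding
    fuse=lambda z1, z2: np.concatenate([z1, z2]),
)
out = cfm.forward(np.ones(20_000), np.ones(4_096))
print(out.shape)  # (48,)
```

Swapping `fuse` for cross-attention or a tensor product changes which PID components the model can capture, without touching the frozen experts.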

🎯

Three Key Findings

Finding 1

Tasks dominated by linear redundancies — where both modalities carry the same signal — are sufficiently served by unimodal baselines. Multimodal fusion adds no benefit.

Finding 2

Complex niche definitions — requiring nonlinear relationships between expression and morphology — benefit from synergy-aware integration objectives that enable cross-modal interactions beyond linear alignment.

Finding 3

Fine-tuning a dominant unimodal expert is the most sample-efficient path for standard tasks. Multimodal benefits only emerge when tasks depend on information distributed across modalities.

🧮

Partial Information Decomposition (PID)

PID (Williams & Beer, 2010) decomposes the total information two variables provide about a target into four non-negative, additive components. This framework reveals the structure of multimodal information, not just its quantity.

I(X₁, X₂ ; Y) = R(X₁, X₂ → Y) + U₁(X₁ → Y) + U₂(X₂ → Y) + S(X₁, X₂ → Y)
R: redundancy, shared across modalities · U₁: unique to expression · U₂: unique to morphology · S: synergy, requires both together
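A minimal numeric illustration of pure synergy, using only discrete mutual information (no PID estimator needed): for Y = XOR(X₁, X₂) with independent uniform bits, each modality alone carries zero information about Y, yet together they determine it completely. The helper below is a generic sketch, not code from the paper.

```python
import numpy as np
from itertools import product

def mutual_info(joint):
    """I(A;B) in bits from a joint probability table p(a, b)."""
    pa = joint.sum(axis=1, keepdims=True)   # marginal p(a)
    pb = joint.sum(axis=0, keepdims=True)   # marginal p(b)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])).sum())

# Y = XOR(X1, X2) with X1, X2 independent uniform bits.
p = np.zeros((4, 2))   # rows: (x1, x2) in {00, 01, 10, 11}; cols: y
for i, (x1, x2) in enumerate(product([0, 1], repeat=2)):
    p[i, x1 ^ x2] = 0.25

I_joint = mutual_info(p)                            # I(X1, X2 ; Y)
I_x1 = mutual_info(p.reshape(2, 2, 2).sum(axis=1))  # I(X1 ; Y)
I_x2 = mutual_info(p.reshape(2, 2, 2).sum(axis=0))  # I(X2 ; Y)
print(I_joint, I_x1, I_x2)  # 1.0 0.0 0.0 -> all task information is synergistic
```

In PID terms this is the extreme case R = U₁ = U₂ = 0 and S = 1 bit; no unimodal model, and no fusion restricted to unimodal signals, can do better than chance here.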
🎛️

Example PID Decomposition

Different information structures change the value of multimodal integration. Consider a task whose information decomposes as:

Redundancy: 30% · Unique₁ (Expr): 25% · Unique₂ (Morph): 20% · Synergy: 25%

Analysis: with 25% synergy, alignment-based fusion will miss about a quarter of the available information. Synergy-aware integration objectives are needed to fully exploit this task.
📋

PID Component Definitions

Component | Symbol | Definition | Biological Example | Captured By
Redundancy | R(X₁,X₂→Y) | Information provided by either modality alone | Cell type signal visible in both expression markers and morphology | Unimodal baselines, any fusion
Unique₁ | U₁(X₁→Y) | Information in expression but not morphology | Subtle transcriptional states not reflected in cell shape | Expression FM alone
Unique₂ | U₂(X₂→Y) | Information in morphology but not expression | Tissue architecture, collagen patterns, vascular structure | Histology FM alone
Synergy | S(X₁,X₂→Y) | Information only from combining both modalities | Tumor-immune niche: expression state + spatial context → prognosis | Only synergy-aware integration
🔬

Why PID Matters for Biology

In single-cell biology, different data modalities capture overlapping but distinct aspects of cellular state. PID provides the first principled way to quantify:

R: when fusion is unnecessary, because both modalities already carry the answer
U: when one modality dominates, so resources should focus there
S: when fusion is essential, because neither modality alone suffices
⚔️

10-Method Fusion Benchmark

The paper benchmarks 10 multimodal fusion methods on spatial transcriptomics datasets, categorized by their ability to capture different PID components.

Early Concatenation · BASELINE
Concatenate the raw embeddings from both encoders and feed them to a linear classifier. The simplest fusion.
Captures: redundancy + unique · Synergy: Low · Complexity: ●○○

Late Fusion (Average) · BASELINE
Train separate classifiers per modality and average their predictions. No cross-modal interaction.
Captures: redundancy only · Synergy: None · Complexity: ●○○

Contrastive Alignment (CLIP) · ALIGNMENT-BASED
Maximize the cosine similarity of paired embeddings in a shared space. The standard multimodal alignment recipe.
Captures: linear redundancy · Synergy: Low · Complexity: ●●○

Deep CCA · ALIGNMENT-BASED
Deep Canonical Correlation Analysis: maximize linear correlations between modality embeddings.
Captures: linear redundancy · Synergy: Low · Complexity: ●●○

Projection Alignment · ALIGNMENT-BASED
Learn linear projections mapping both modalities into a shared subspace via an MSE loss on paired data.
Captures: linear redundancy · Synergy: Low · Complexity: ●○○

Tensor Fusion Network · INTERACTION-BASED
The outer product of modality embeddings captures all pairwise feature interactions, at the cost of a parameter count that grows exponentially with the number of modalities.
Captures: synergy-capable · Synergy: High · Complexity: ●●●

Low-Rank Fusion (LMF) · INTERACTION-BASED
Efficient tensor decomposition capturing multiplicative interactions with a controlled parameter count.
Captures: synergy-capable · Synergy: High · Complexity: ●●○

Cross-Attention Fusion · INTERACTION-BASED
Each modality attends to the other via transformer-style queries, keys, and values. Dynamic and context-dependent.
Captures: synergy-capable · Synergy: High · Complexity: ●●●

Attention Bottleneck (MBT) · INTERACTION-BASED
Shared bottleneck tokens mediate cross-modal information flow. Parameter-efficient cross-modal interaction.
Captures: synergy-capable · Synergy: High · Complexity: ●●●

Gated Multimodal Unit · INTERACTION-BASED
Learned gates control each modality's contribution; a sigmoid-weighted combination enables nonlinear mixing.
Captures: partial synergy · Synergy: Medium · Complexity: ●●○
📊

Benchmark Performance by Task Type

The benchmark compares five tasks, ordered by increasing SIS (synergy requirement): Cell Type, Spatial Domain, Tissue Region, Niche Definition, and Microenvironment.

Cell Type Classification

Assigning broad cell type labels (epithelial, fibroblast, immune). High redundancy — morphology and expression both encode this clearly.

SIS: 0.08 · Best approach: Unimodal
Verdict: Unimodal expression FM alone matches or beats all fusion methods. Multimodal fusion is unnecessary.
🏆

Method Capability Comparison

🎯

Synergistic Information Score (SIS)

SIS quantifies how much of a task's information requires both modalities together. A high SIS means alignment-based fusion will systematically miss critical signal.

SIS(Y; X₁, X₂) = S(X₁, X₂ → Y) / I(X₁, X₂ ; Y)

where S is the synergistic information from PID

SIS ranges from 0 (all information is redundant or unique — fusion optional) to 1 (all information is synergistic — fusion essential).
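Given a PID decomposition, SIS is a one-liner; the function below simply restates the definition, writing the total information as the sum of the four atoms:

```python
def sis(redundancy: float, unique1: float, unique2: float, synergy: float) -> float:
    """SIS = S / I(X1, X2; Y), with total information as the sum of PID atoms."""
    total = redundancy + unique1 + unique2 + synergy
    return synergy / total if total > 0 else 0.0

# The 30/25/20/25 decomposition from the PID example above:
print(sis(0.30, 0.25, 0.20, 0.25))  # 0.25
```

In practice the hard part is estimating the PID atoms themselves (e.g. via a Bertschinger-style or Williams-Beer estimator); once they are available, SIS follows immediately.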

📈

SIS Across Spatial Transcriptomics Tasks

Higher SIS → greater benefit from synergy-aware integration. Low SIS → unimodal baselines suffice.

📋

SIS-Guided Decision Framework

SIS Range | Interpretation | Recommended Strategy
0.00 – 0.10 | Negligible synergy | Use best unimodal FM alone
0.10 – 0.25 | Low synergy | Simple fusion (concat) may help marginally
0.25 – 0.50 | Moderate synergy | Interaction-based fusion (cross-attention, tensor)
0.50+ | High synergy | Synergy-maximizing objectives required
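As a usage sketch, the table's thresholds translate directly into a lookup; the threshold values come from the table, while the boundary handling at exactly 0.10 / 0.25 / 0.50 is an arbitrary choice:

```python
def fusion_strategy(sis: float) -> str:
    """Map a task's SIS to a recommended strategy (thresholds from the
    SIS-guided decision framework; boundary handling is a modeling choice)."""
    if sis < 0.10:
        return "best unimodal FM alone"
    if sis < 0.25:
        return "simple fusion (concatenation)"
    if sis < 0.50:
        return "interaction-based fusion (cross-attention / tensor)"
    return "synergy-maximizing objectives"

print(fusion_strategy(0.08))  # best unimodal FM alone (e.g. cell type classification)
```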
🔮

SIS vs. Performance Gain

Tasks with higher SIS show greater performance improvements from synergy-aware fusion.

🧪

Task Information Decomposition

Compare the full PID breakdown across all benchmarked tasks. Tasks are ordered by increasing SIS (synergy fraction).

🔒

The Alignment Collapse Theorem

The paper's central theoretical contribution: proving that standard alignment-based objectives on frozen encoders are mathematically limited to detecting linear redundancies.

Theorem (informal): For frozen encoder representations f₁(X₁) and f₂(X₂), contrastive alignment objectives (InfoNCE, CLIP-loss) converge to linear subspaces of the encoder outputs. Any synergistic information requiring nonlinear cross-modal interaction is provably inaccessible.
📐

Why Alignment Collapses

[Diagram: left, alignment-based fusion projects the expression and morphology spaces onto linear subspaces and maximizes cos(z₁, z₂), capturing only redundant (shared) signal; right, synergy-aware integration forms a joint nonlinear interaction space, capturing redundant + synergistic signal.]

Figure 2. Left: Alignment collapses to linear subspaces, missing synergistic patterns (✗). Right: Synergy-aware integration captures nonlinear cross-modal signal (✓).

📝

Formal Statement

Prop 1: Let f₁, f₂ be frozen encoder maps.
Let L_align = -𝔼[sim(Wf₁(x₁), Wf₂(x₂))]
for linear projection W.

Then the minimizer of L_align over W converges to the
principal subspace of Cov(f₁(X₁), f₂(X₂)),
which captures only linear redundancies.

The key insight: even though the data may contain rich nonlinear relationships between expression and morphology, alignment objectives with frozen encoders can only discover linear shared structure.
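This limitation is easy to reproduce numerically. Below, two frozen scalar features are marginally independent and their product defines the target: every cross-correlation an alignment objective could exploit is near zero, while a single multiplicative interaction term recovers the target exactly. A toy sketch, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z1 = rng.choice([-1.0, 1.0], size=n)   # frozen "expression" feature
z2 = rng.choice([-1.0, 1.0], size=n)   # frozen "morphology" feature
y = z1 * z2                            # purely synergistic target

# Linear alignment works off cross-correlations, which all vanish here:
print(abs(np.corrcoef(z1, z2)[0, 1]))  # ~0: nothing for alignment to find
print(abs(np.corrcoef(z1, y)[0, 1]))   # ~0: no unimodal linear signal either
# A multiplicative interaction term recovers the target exactly:
print(np.corrcoef(z1 * z2, y)[0, 1])   # 1.0
```

This is the empirical face of Prop 1: the cross-covariance, and hence the aligned subspace, is empty, yet the joint signal is perfect.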

Practical Implications

CLIP-style alignment is provably limited
InfoNCE and contrastive losses maximize mutual information between projected representations. With frozen encoders, this reduces to finding linear correlations — the same solution as CCA.
More data doesn't fix it
The limitation is architectural, not statistical. Even with infinite paired data, alignment on frozen features converges to the same linear subspace.
Fine-tuning encoders helps but is expensive
Unfreezing encoders allows learning nonlinear cross-modal features, but requires paired data (scarce in biology) and destroys pretrained representations.
Interaction-based objectives are needed
Cross-attention, tensor products, and learned bottlenecks create nonlinear interaction terms that can capture synergistic information unavailable to alignment.
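To make the contrast concrete, here is a minimal low-rank (LMF-style) fusion step: each modality is projected into a shared rank-r space and combined elementwise, so the output contains products of features from both modalities, exactly the nonlinear interaction terms a linear alignment cannot represent. Dimensions are illustrative assumptions.

```python
import numpy as np

def low_rank_fusion(h1, h2, W1, W2):
    """LMF-style fusion sketch: project each modality into a shared rank-r
    space, then combine elementwise so the output contains products of
    features from both modalities (cross-modal interaction terms)."""
    return (h1 @ W1) * (h2 @ W2)

rng = np.random.default_rng(0)
d1, d2, r = 512, 1024, 64              # illustrative embedding dims and rank
W1 = rng.normal(size=(d1, r)) / np.sqrt(d1)
W2 = rng.normal(size=(d2, r)) / np.sqrt(d2)
h1, h2 = rng.normal(size=d1), rng.normal(size=d2)
z = low_rank_fusion(h1, h2, W1, W2)
print(z.shape)  # (64,)
```

The full tensor fusion equivalent would materialize the d₁×d₂ outer product; the low-rank form keeps the multiplicative interactions while bounding the parameter count at r(d₁ + d₂).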
📊

Redundancy vs. Synergy Capture by Method Type

📈

Sample Efficiency & Scaling

The paper performs a scaling analysis revealing a critical practical insight: the optimal strategy depends on how much paired data you have.

Key Finding: Fine-tuning a dominant unimodal expert is the most sample-efficient path for standard tasks. The benefits of multimodal fusion only emerge at higher data regimes and when tasks require cross-modal synergy.
📊

Scaling Curves

Performance vs. paired training samples for low-synergy (cell type) and high-synergy (niche) tasks.

💡

Practical Decision Tree

Step 1: Compute SIS for your task

Use the SIS metric on a subset of paired data to measure synergy requirements.

Step 2: Assess data availability

Count available paired samples (expression + histology for same tissue).

Step 3: Choose strategy

SIS | Data | Strategy
Low | Any | Unimodal expert (fine-tune)
High | Low (<1K) | Unimodal expert (still wins)
High | Med (1K–10K) | Simple interaction fusion (LMF)
High | High (>10K) | Full cross-attention / bottleneck
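The high-SIS rows of the table reduce to a data-budget lookup; a sketch using the table's 1K / 10K breakpoints, with boundary handling as my own choice:

```python
def high_sis_strategy(n_paired: int) -> str:
    """For a high-SIS task, choose fusion by paired-sample budget
    (regimes from the decision table; exact boundaries are a choice)."""
    if n_paired < 1_000:
        return "unimodal expert (still wins)"
    if n_paired <= 10_000:
        return "simple interaction fusion (LMF)"
    return "full cross-attention / bottleneck"

print(high_sis_strategy(500))     # unimodal expert (still wins)
print(high_sis_strategy(50_000))  # full cross-attention / bottleneck
```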
🎯

Crossover Points

The number of paired samples at which multimodal fusion begins to outperform the best unimodal baseline, by task type.

🔭

Implications for the Virtual Cell

Architecture Shift

Building toward a virtual cell requires a fundamental shift from alignment objectives to synergy-maximizing integration that preserves complementary cross-modal signal.

Data Strategy

Generating more paired spatial transcriptomics data is crucial. The value of multimodal models only emerges above critical data thresholds for synergy-rich tasks.

Task-Aware Design

Not all tasks need multimodal models. SIS should guide resource allocation — investing in fusion only where synergy justifies the complexity and data costs.

📚

References

  1. "Beyond alignment: synergistic integration is required for multimodal cell foundation models." bioRxiv 2026.02.23.707420 (2026).
  2. "From modality-specific to compositional foundation models for cell biology." Cell Systems (2026).
  3. Williams, P.L. & Beer, R.D. "Nonnegative decomposition of multivariate information." arXiv:1004.2515 (2010).
  4. Bertschinger, N. et al. "Quantifying unique information." Entropy 16(4): 2161–2183 (2014).
  5. Liang, P.P. et al. "Quantifying & modeling multimodal interactions: an information decomposition framework." NeurIPS (2023).
  6. Daunhawer, I. et al. "What to align in multimodal contrastive learning?" ICLR (2025).
  7. Cui, H. et al. "scGPT: toward building a foundation model for single-cell multi-omics using generative AI." Nature Methods 21: 1470–1480 (2024).
  8. Theodoris, C.V. et al. "Transfer learning enables predictions in network biology." Nature 618: 616–624 (2023).
  9. Chen, R.J. et al. "Towards a general-purpose foundation model for computational pathology." Nature Medicine 30: 850–862 (2024).
  10. Lu, M.Y. et al. "A visual-language foundation model for computational pathology." Nature Medicine 30: 863–874 (2024).
  11. Nagrani, A. et al. "Attention bottlenecks for multimodal fusion." NeurIPS (2021).
  12. Zadeh, A. et al. "Tensor fusion network for multimodal sentiment analysis." EMNLP (2017).
  13. Liu, Z. et al. "Efficient low-rank multimodal fusion with modality-specific factors." ACL (2018).
  14. Arevalo, J. et al. "Gated multimodal units for information fusion." ICLR Workshop (2017).
  15. Hao, M. et al. "Nicheformer: a foundation model for single-cell and spatial omics." Nature Methods (2025).
  16. Radford, A. et al. "Learning transferable visual models from natural language supervision." ICML (2021).
  17. Andrew, G. et al. "Deep canonical correlation analysis." ICML (2013).
  18. "How to build the virtual cell with artificial intelligence." Cell 187(25): 7045–7063 (2024).
  19. Stärz, K. et al. "Benchmarking clustering, alignment, and integration methods for spatial transcriptomics." Genome Biology 25: 212 (2024).
  20. Lance, C. et al. "Multitask benchmarking of single-cell multimodal omics integration methods." Nature Methods (2025).