ArXiv Preprint 2025

PRISM: Demystifying Retention and
Interaction in Mid-Training

A comprehensive empirical study of mid-training design choices for LLMs across 4 model families, 2 architecture types, 7 models, and scales from 3B to 24B parameters.

Bharat Runwal  ·  Ashish Agrawal  ·  Anurag Roy  ·  Rameswar Panda

IBM Research  ·  MIT-IBM Watson AI Lab

Links: ArXiv Paper · Code · Data Mixtures (coming soon) · Models (coming soon)
PRISM animated overview showing mid-training pipeline and results across model families
PRISM overview. Mid-training decisions are decomposed into their principal design axes: data composition, timing, domain interaction, benchmark selection, RL compatibility, and scaling behavior. PRISM enables holistic evaluation of mid-training choices across model families at scale.
Experimental Scope

Scale of the Study

PRISM systematically evaluates mid-training across a diverse set of modern LLMs, spanning multiple families, architectures, and parameter scales.

7
Base Models
4
Model Families
2
Architecture Types
3B-24B
Parameter Scale
~27B
Mid-Train Tokens
10+
Benchmarks
| Model Family | Models | Architecture | Parameters |
|---|---|---|---|
| Granite | Granite-4 Micro (3B), Granite-3.3-8B | Dense Transformer | 3B, 8B |
| Granite | Granite-4-H Micro (3B) | Attention-Mamba Hybrid | 3B |
| LLaMA | LLaMA-3.1-8B | Dense Transformer | 8B |
| Mistral | Mistral-7B-v0.1, Mistral-Small-24B | Dense Transformer | 7B, 24B |
| Nemotron-H | Nemotron-H-8B | Attention-Mamba Hybrid | 8B |
Abstract

Paper Summary

We present PRISM (Demystifying Retention and Interaction in Mid-Training), a comprehensive empirical study of mid-training design choices for large language models. Through controlled experiments across seven open-source base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters, we systematically investigate what data to use, when to apply mid-training, how it interacts with reinforcement learning (RL), and whether findings generalize across architectures.


Using targeted mid-training mixtures of only ~27B high-quality tokens, PRISM yields +15 to +40 point math gains, +5 to +12 point code gains, and +6 to +13 point science gains across all tested models, while preserving general-purpose performance. Crucially, data composition choices matter most at mid-training, not at RL: including science data during mid-training unlocks +17 to +28 point GPQA-Diamond gains during RL, while changing the RL mix produces less than 2 point differences. The full PRISM-to-RL pipeline transforms base models scoring under 12 AVG into models scoring 29-42 AVG, a 3-4x improvement. Mechanistically, mid-training densely restructures >90% of weights while RL makes sparse, front-loaded refinements to ~5% of parameters. Representational geometry is largely preserved through RL (CKA >0.998 across models and input distributions), while different mid-training data mixtures produce same-magnitude but differently-directed weight updates (cosine similarity 0.52 for Granite-3.3, 0.62 for Nemotron-H). Pass rate landscape interpolation shows a generally increasing pass rate from Base (17%) to Mid-Training (76%) to RL (80%) for Granite-3.3, consistent with mid-training progressively improving the model's configuration for RL. Benefits hold for both dense Transformers and attention-Mamba hybrids, from 3B to 24B parameters.

At a Glance

Key Results

PRISM mid-training produces consistent, large gains across domains and model families using only ~27B tokens.

+15-40
Math Improvement
points across all models
+5-12
Code Improvement
points across all models
+6-13
Science Improvement
GPQA-Diamond
3-4x
Pipeline Boost
base <12 to 29-42 AVG
Key Findings

What We Discovered

Six principal findings that provide practical guidance for designing mid-training pipelines.

Finding 1

Mid-training is a powerful reasoning catalyst

Across all tested models, PRISM yields +15 to +40 pt math gains, +5 to +12 pt code gains, and +6 to +13 pt science gains, while preserving general-purpose performance.

Finding 2

Mid-training is essential for effective RL

The full PRISM → RL pipeline transforms base models scoring under 12 AVG into models scoring 29-42 AVG, a 3-4x improvement. RL applied directly to base models is substantially less effective, with AIME scores remaining near zero.

Finding 3

Data composition matters at MT, not RL

Changing the mid-training mix shifts AVG by +4 to +6 pt, while changing the RL mix produces <2 point differences. Science data at mid-training unlocks +17 to +28 pt GPQA-Diamond gains during RL.

Finding 4

Benefits generalize across architectures

Both dense Transformers and attention-Mamba hybrids benefit consistently from PRISM, from 3B to 24B parameters. Mid-training gains are consistent across all four model families tested.

Finding 5

RL expands the solvability frontier

For Granite-3.3, RL on PRISM-mid-trained models progressively solves prompts that were initially unsolvable, with training curves that remain non-saturating across hundreds of steps.

Finding 6

Mid-training and RL operate at different representational scales, confirmed mechanistically

CKA analysis shows RL preserves mid-training representations (CKA >0.998). Pass rate landscape on held-out MATH500 shows a generally increasing path from Base (17%) to MT (76%) to RL (80%) for Granite-3.3, consistent with mid-training improving the model's configuration for subsequent RL.

Method

The PRISM Pipeline

A three-stage recipe: targeted mid-training with retention-aware data mixtures, optional long-context restoration, then reinforcement learning.

Base Model: pre-trained LLM (3B-24B params)
→ PRISM Mid-Training: ~27B tokens, Math + Code + Science
→ LC Restoration: linear merging + 128k extension
→ RL (GRPO): Math + Code + Science with verifiable rewards
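The three-stage recipe can be summarized as a configuration sketch. The stage parameters below (token budget, domains, context lengths, RL algorithm, reward type) come from this page; the dictionary layout and key names are purely illustrative, not the paper's actual config format.

```python
# Illustrative sketch of the PRISM pipeline stages; key names are hypothetical.
PIPELINE = {
    "mid_training": {
        "tokens": "~27B",                    # targeted high-quality mixture
        "domains": ["math", "code", "science"],
        "context_length": 8192,              # mid-training runs at 8k context
    },
    "long_context_restoration": {
        "merge": "linear",                   # merge mid-trained model with base
        "extension_context": 131072,         # brief 128k extension phase
    },
    "rl": {
        "algorithm": "GRPO",
        "domains": ["math", "code", "science"],
        "rewards": "verifiable",             # verifier-based reward signals
    },
}

print(PIPELINE["rl"]["algorithm"])
```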
Data Composition

Mid-Training Data Mixtures

We design three progressively richer data mixtures. The table below shows how domain coverage evolves and the corresponding benchmark performance.

| Mid-Training Mix | Code AVG | GPQA-D | Math AVG | Overall AVG |
|---|---|---|---|---|
| Base (no MT) | 2.07 | 22.56 | 8.95 | 11.19 |
| Math only | 2.81 | 17.34 | 36.43 | 18.86 |
| Math + Code | 10.71 | 19.02 | 44.33 | 24.69 |
| Math + Code + Science | 10.58 | 29.12 | 48.75 | 29.48 |
Domain-specific mid-training results on Granite-3.3-8B. Adding science data yields +10 points on GPQA-Diamond and +4 points on math, with minimal code regression. This composition effect persists and amplifies through RL.
Results

Full Pipeline Evaluation

End-to-end results: base model → PRISM mid-training → RL. AVG is the macro-average across all six benchmarks.

| Model | Stage | AIME'24 | AIME'25 | MATH500 | LCB | CF | GPQA-D | AVG |
|---|---|---|---|---|---|---|---|---|
| Granite-3.3-8B | Base | 0.46 | 0.31 | 26.09 | 2.15 | 1.99 | 22.56 | 8.93 |
| | + PRISM | 37.18 | 27.96 | 81.11 | 10.63 | 10.52 | 29.12 | 32.75 |
| | + PRISM + RL | 40.94 | 32.03 | 84.76 | 19.59 | 20.82 | 51.51 | 41.60 |
| LLaMA-3.1-8B | Base | 0.05 | 0.15 | 6.51 | 0.00 | 0.07 | 20.20 | 4.50 |
| | + PRISM | 16.45 | 19.32 | 73.47 | 6.09 | 5.44 | 21.04 | 23.64 |
| | + PRISM + RL | 21.15 | 22.97 | 77.21 | 15.05 | 12.43 | 39.39 | 31.37 |
| Mistral-Small-24B | Base | 0.78 | 0.73 | 26.92 | 0.00 | 0.29 | 22.55 | 8.55 |
| | + PRISM | 32.91 | 27.34 | 80.80 | 10.03 | 10.08 | 22.05 | 30.54 |
| | + PRISM + RL | 39.69 | 32.40 | 86.89 | 16.97 | 17.59 | 50.00 | 40.59 |
| Nemotron-H-8B | Base | 2.13 | 2.29 | 49.46 | 1.19 | 3.60 | 4.21 | 10.48 |
| | + PRISM | 19.21 | 22.76 | 76.63 | 13.02 | 10.52 | 31.98 | 29.02 |
| | + PRISM + RL | 29.95 | 28.54 | 84.47 | 19.59 | 15.38 | 41.24 | 36.53 |

Full pipeline results (PRISM → RL). All models use Math+Code+Science mid-training. AVG is the macro-average across all six benchmarks. Representative models shown; see paper for all seven.
Reinforcement Learning

Why Mid-Training Before RL?

RL on PRISM-mid-trained models produces large, sustained gains across math, code, and science. RL on base models without mid-training is substantially less effective, with AIME scores remaining near zero.

Base → PRISM → RL
Figure: RL learning curves on PRISM mid-trained Granite-3.3-8B (math benchmarks). AIME'24 reaches 41+, AIME'25 reaches 32+, MATH500 reaches 85+. Strong gains with non-saturating curves.

Base → RL (no mid-training)
Figure: RL learning curves on base Granite-3.3-8B (math benchmarks). AIME'24 stays below 1.5, AIME'25 below 0.6. RL alone cannot bootstrap reasoning from a base model.

Base → PRISM → RL
Figure: RL learning curves on PRISM mid-trained Granite-3.3-8B (code and science). LiveCodeBench reaches 19+, Codeforces 20+, GPQA-Diamond 47+. Science gains are especially large when science data was included at mid-training.

Base → RL (no mid-training)
Figure: RL learning curves on base Granite-3.3-8B (code and science). Without a mid-training foundation, RL produces minimal code/science gains; the base model lacks the reasoning substrate to leverage the RL signal.
Long-Context

Restoring Long-Context Ability

Mid-training at 8k context degrades long-context capabilities. We restore them via linear model merging followed by a brief 128k extension phase.

59.09
Base Model
RULER Score
6.46
After 8k Mid-Training
89.06% drop
11.32
Linear Merge
Partial recovery
42.16
+ 128k Extension
71% recovered
Long-context restoration pipeline: base model, mid-training, Linear merge, 128k extension
How it works: After mid-training at 8k context, we merge the mid-trained model with the original base model. This recovers some portion of long-context ability while retaining reasoning gains. A brief additional training phase on 128k sequences further restores the RULER score from 11.32 to 42.16, recovering 71% of the original capability.
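The merge step amounts to an elementwise linear interpolation between the base and mid-trained checkpoints. A minimal sketch, assuming checkpoints are stored as name-to-array dicts; the merge coefficient `alpha=0.5` is an illustrative placeholder, not the paper's stated value.

```python
import numpy as np

def linear_merge(base, mid, alpha=0.5):
    """Elementwise linear merge: (1 - alpha) * base + alpha * mid.

    base, mid: dicts mapping parameter name -> np.ndarray (same shapes).
    alpha=0.5 is an illustrative choice, not the paper's setting.
    """
    return {k: (1.0 - alpha) * base[k] + alpha * mid[k] for k in base}

# Toy checkpoints: merging zeros with ones at alpha=0.5 gives 0.5 everywhere.
base_ckpt = {"mlp.w": np.zeros((2, 2))}
mid_ckpt = {"mlp.w": np.ones((2, 2))}
merged = linear_merge(base_ckpt, mid_ckpt, alpha=0.5)
print(merged["mlp.w"][0, 0])  # -> 0.5
```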
Mechanistic Analysis

How Do Mid-Training and RL Differ?

We investigate how mid-training and RL change models through weight-level divergence, CKA representation analysis, weight direction comparisons, pass rate landscape interpolation, prediction entropy, and correctness analysis across three 8B models.

>90%
Parameters changed by mid-training
~5%
Parameters changed by RL
300-580x
Magnitude difference (MT vs RL)
>0.998
CKA preserved through RL
17%→76%→80%
Pass rate: Base→MT→RL (MATH500, G33)

Weight Divergence: Dense Restructuring vs. Sparse Refinement

Normalized L2 divergence by component type. Mid-training changes weights orders of magnitude more than RL.

[Chart: normalized L2 divergence by component for Base → Mid-Train, Mid-Train → RL, and Base → RL (no MT). Recoverable per-component values:]

Granite-3.3 (8B), Dense Transformer: Attention 0.175, MLP 0.329
Nemotron-H (8B), Attention-Mamba Hybrid: Attention 0.230, MLP 0.289, Mamba 0.138
RL produces similarly scaled and sparse weight changes regardless of starting point, yet downstream performance differs substantially depending on whether mid-training preceded it.
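The two weight-level statistics in this analysis (normalized L2 divergence and the fraction of parameters changed) can be reproduced with a small sketch. The tensor shapes, noise scales, and change threshold `tau` below are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def normalized_l2_divergence(w_before, w_after):
    # ||delta_w||_2 / ||w_before||_2 for a single weight tensor.
    return np.linalg.norm(w_after - w_before) / np.linalg.norm(w_before)

def changed_fraction(w_before, w_after, tau=1e-4):
    # Fraction of parameters whose absolute change exceeds threshold tau.
    return float(np.mean(np.abs(w_after - w_before) > tau))

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))

# Mid-training-like dense restructuring: every weight gets a sizable update.
mt = w + 0.2 * rng.normal(size=w.shape)

# RL-like sparse refinement: tiny updates to ~5% of parameters.
mask = rng.random(w.shape) < 0.05
rl = mt + 0.001 * mask * rng.normal(size=w.shape)

print(changed_fraction(w, mt) > 0.9)    # dense: >90% of weights move
print(changed_fraction(mt, rl) < 0.06)  # sparse: ~5% of weights move
print(normalized_l2_divergence(w, mt) > normalized_l2_divergence(mt, rl))
```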

CKA Representation Analysis: RL Preserves Mid-Training Geometry

Centered Kernel Alignment (CKA) between mid-trained and RL model representations across inputs and models. Values >0.998 indicate that RL refines without restructuring representations.

CKA representation analysis across 4 models and 3 input distributions
Figure: CKA analysis (PDF format, view in paper)
CKA similarity between mid-trained and RL checkpoints on Wikipedia, C4, and GSM8K inputs for Granite-3.3, LLaMA-3.1, and Nemotron-H (200 samples, batch-size 1). RL consistently preserves the representational geometry established by mid-training across all three models and all three input distributions (minimum MT vs RL CKA >0.998). Bootstrap confidence intervals (20 resamples) confirm the estimates are stable (std <0.0001).
Dense Transformers (G33, LLaMA)
CKA >0.999 across all layers and all three input distributions. Mid-training's representational geometry is consistently preserved through RL across input domains (CKA >0.999 for dense Transformers; >0.998 for hybrids).
Hybrid Architecture (Nemotron-H)
MT vs RL CKA >0.998 across all layers and all three input distributions for Nemotron-H. Note: Nemotron-H shows broader Base vs MT divergence in later layers (CKA ∼0.41 on GSM8K at layer 48), indicating mid-training restructures hybrid models more extensively.
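Linear CKA between activations of two models on the same inputs can be computed as below. This is the standard linear-kernel formulation; the paper's exact kernel choice, layer selection, and batching are assumptions here.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices.

    X, Y: (n_samples, n_features) activations from two models on the same
    inputs; the feature dimensions may differ between the two models.
    """
    # Center each feature (column) to zero mean.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-style similarity of the two Gram matrices, normalized to [0, 1].
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 64))                    # e.g. 200 samples, width 64
nearby = acts + 0.01 * rng.normal(size=acts.shape)   # RL-like small refinement
print(round(linear_cka(acts, acts), 6))  # identical representations -> 1.0
print(linear_cka(acts, nearby) > 0.99)   # small refinement keeps CKA near 1
```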

Weight Direction Analysis: Same Magnitude, Different Direction

MC and MCS mid-training produce nearly identical weight change magnitudes but different directions, suggesting that data composition at mid-training shapes the weight configuration that RL subsequently refines.

Weight direction analysis comparing MC and MCS mid-training
Figure: Weight direction analysis (PDF format, view in paper)
Weight direction comparison between Math+Code (MC) and Math+Code+Science (MCS) mid-training. Both mixtures produce nearly identical L2 magnitudes (0.177 vs 0.175) but diverge in direction: cosine similarity of 0.52 for Granite-3.3 and 0.62 for Nemotron-H. This directional divergence, not magnitude, accounts for the large downstream difference in GPQA-Diamond performance after RL (+17 pts).
0.177 / 0.175
MC vs MCS L2 magnitude
(Granite-3.3, nearly identical)
0.52 / 0.62
MC vs MCS cosine similarity
(G33 / Nemotron-H directions)
+17 pt
GPQA-Diamond gain
from directional difference
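The magnitude-versus-direction comparison reduces to flattening each run's weight delta into one vector and comparing L2 norms and cosine similarity. A minimal sketch; the synthetic "shared plus mixture-specific" update structure below is an illustrative assumption chosen to reproduce the qualitative pattern (near-equal magnitudes, cosine well below 1).

```python
import numpy as np

def flat_delta(before, after):
    # Concatenate all per-tensor weight deltas into a single vector.
    return np.concatenate([(a - b).ravel() for b, a in zip(before, after)])

def update_similarity(base, run_a, run_b):
    """Return (||delta_a||, ||delta_b||, cosine(delta_a, delta_b))."""
    da, db = flat_delta(base, run_a), flat_delta(base, run_b)
    cos = float(da @ db / (np.linalg.norm(da) * np.linalg.norm(db)))
    return np.linalg.norm(da), np.linalg.norm(db), cos

rng = np.random.default_rng(0)
base = [rng.normal(size=(64, 64))]
shared = [0.1 * rng.normal(size=(64, 64))]  # update component both mixes share
mc  = [b + s + 0.1 * rng.normal(size=b.shape) for b, s in zip(base, shared)]
mcs = [b + s + 0.1 * rng.normal(size=b.shape) for b, s in zip(base, shared)]

na, nb, cos = update_similarity(base, mc, mcs)
print(abs(na - nb) / na < 0.1)  # near-identical update magnitudes
print(0.3 < cos < 0.7)          # but clearly different directions
```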

Pass Rate Landscape Along the Training Path

Interpolating model weights along the Base → MT → RL training path reveals a generally increasing pass rate along the training path, consistent with mid-training progressively improving the model's configuration for subsequent RL.

Pass rate landscape interpolation along Base to MT to RL path
Figure: Pass rate landscape (PDF format, view in paper)
Pass rate landscape on held-out MATH500 (Granite-3.3 & LLaMA-3.1). Left panels: pass rate increases generally from Base to MT to RL (G33: 17%→76%→80%; LLaMA: 3%→44%→66%). Right: 2D landscape for Granite-3.3 centered at MT, showing the RL direction as consistently high-reward.
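The 1D landscape comes from evaluating checkpoints interpolated along the straight lines Base → MT and MT → RL. A sketch under simplifying assumptions: checkpoints are name-to-array dicts, and `evaluate_pass_rate` is a hypothetical stand-in for running the math verifier over held-out MATH500 generations.

```python
import numpy as np

def interpolate(ckpt_a, ckpt_b, t):
    """Checkpoint at fraction t along the straight line from ckpt_a to ckpt_b."""
    return {k: (1.0 - t) * ckpt_a[k] + t * ckpt_b[k] for k in ckpt_a}

def landscape_1d(base, mt, rl, evaluate_pass_rate, steps=5):
    """Pass rate at evenly spaced points on the Base -> MT -> RL path."""
    rates = []
    for a, b in [(base, mt), (mt, rl)]:
        for t in np.linspace(0.0, 1.0, steps):
            rates.append(evaluate_pass_rate(interpolate(a, b, t)))
    return rates

# Toy illustration: scalar "checkpoints" and a monotone surrogate pass rate
# (a real run would load model weights and score sampled generations).
base, mt, rl = {"w": np.array(0.0)}, {"w": np.array(1.0)}, {"w": np.array(1.2)}
rates = landscape_1d(base, mt, rl, lambda c: float(c["w"]) / 1.2)
print(rates == sorted(rates))  # generally increasing along the path
```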
2D Pass Rate Landscape Animation — dot moves from Base to MT to RL

Animation: Dot moves from Base → MT → RL on the 2D pass rate landscape. The dashed white isoline and colorbar indicator track the current pass rate level.

17%
Base Model
Pass Rate (G33)
76%
After Mid-Training
Pass Rate (G33)
80%
After RL
Pass Rate (G33)
| Model | Stage | Pass Rate | Med. Length | Neg Log-Prob | Corr. NLP | Incorr. NLP |
|---|---|---|---|---|---|---|
| Granite-3.3 (8B) | Base | 16.9% | 120 | 0.382 | — | 0.383 |
| | MT | 75.5% | 2,254 | 0.138 | 0.128 | 0.153 |
| | RL | 79.5% | 1,700 | 0.141 | 0.135 | 0.160 |
| LLaMA-3.1 (8B) | Base | 2.6% | 158 | 0.758 | — | 0.780 |
| | MT | 43.1% | 1,052 | 0.377 | 0.146 | 0.469 |
| | RL | 64.6% | 1,188 | 0.267 | 0.149 | 0.320 |
| Nemotron-H (8B, Hybrid) | Base | 66.6% | 452 | 0.167 | 0.040 | 0.258 |
| | MT | 61.6% | 1,928 | 0.150 | 0.116 | 0.156 |
| | RL | 83.0% | 1,780 | 0.127 | 0.112 | 0.137 |
Setup: 200 held-out MATH500 problems, 8 samples/prompt, 7680 generation tokens, temperature 0.6, top-p 0.95. Pass rate = mean across 8 samples. Scored with the same math verifier used during RL training. — indicates too few correct samples. Correct responses consistently have lower negative log-probability (higher confidence) than incorrect ones.
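Under this setup (200 prompts, 8 samples per prompt), the pass rate reduces to a mean over per-prompt sample means. A minimal sketch; `verify` is a hypothetical stand-in for the math verifier used during RL training.

```python
def pass_rate(generations, references, verify):
    """Mean pass rate: fraction of samples the verifier accepts per prompt,
    averaged over prompts.

    generations: list of per-prompt lists of sampled answers (e.g. 8 each).
    references: gold answers, one per prompt.
    verify: callable (answer, gold) -> bool; stands in for the RL verifier.
    """
    per_prompt = [
        sum(verify(ans, gold) for ans in samples) / len(samples)
        for samples, gold in zip(generations, references)
    ]
    return sum(per_prompt) / len(per_prompt)

# Toy check: exact-match "verification" over 2 prompts x 4 samples.
gens = [["42", "41", "42", "42"], ["7", "7", "8", "9"]]
refs = ["42", "7"]
print(pass_rate(gens, refs, lambda a, g: a == g))  # -> 0.625
```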

Mid-training teaches reasoning, not just answers

Response lengths increase dramatically: LLaMA goes from 158 tokens to 1,052 tokens (7x). Models learn multi-step problem decomposition with extended reasoning chains.

RL refines toward efficient correctness

RL adjusts response length in a model-dependent way: Granite-3.3 shortens (2,254 → 1,700 tokens) while LLaMA changes only modestly (1,052 → 1,188), suggesting RL optimizes for efficient correctness rather than length alone.

Data composition changes direction, not magnitude

MC and MCS mid-training produce nearly identical weight change magnitudes (L2: 0.177 vs 0.175), yet MCS+RL achieves GPQA-Diamond of 52.9 vs only 35.5 for MC+RL. Cosine similarity between the two update directions is only 0.52 (G33) and 0.62 (Nemotron-H).

Architecture-agnostic RL sparsity

All three component types in Nemotron-H show nearly identical RL sparsity: Attention (93.5%), MLP (94.5%), and Mamba (93.9%). The sparse pattern is universal.

RL optimization is front-loaded

Most weight changes occur in the first ~200-400 steps, then plateau. The active parameter set grows progressively from ~1.5% to ~5%, and MT vs Base starting points produce similarly scaled and sparse update trajectories.

RL preserves representational geometry

CKA similarity between mid-trained and RL representations exceeds 0.998 across all three models (Granite-3.3, LLaMA-3.1, Nemotron-H) on Wikipedia, C4, and GSM8K inputs. RL consistently refines rather than restructures the learned representations.

Pass rate landscape along the training path

Pass rate interpolation on held-out MATH500 shows a generally increasing trend (17%→76%→80% for G33; 3%→44%→66% for LLaMA, 8 samples/prompt). The 2D landscape shows the RL direction as consistently high-reward with no apparent sharp barriers.

Citation

BibTeX

If you find PRISM useful in your research, please consider citing our paper.


    @misc{runwal2026prismdemystifyingretentioninteraction,
          title={PRISM: Demystifying Retention and Interaction in Mid-Training}, 
          author={Bharat Runwal and Ashish Agrawal and Anurag Roy and Rameswar Panda},
          year={2026},
          eprint={2603.17074},
          archivePrefix={arXiv},
          primaryClass={cs.LG},
          url={https://arxiv.org/abs/2603.17074}, 
    }