PRISM: Demystifying Retention and Interaction in Mid-Training

Experimental Scope

Scale of the Study

PRISM systematically evaluates mid-training across a diverse set of modern LLMs, spanning multiple families, architectures, and parameter scales.

Base Models

Model Families

Architecture Types

3B-24B

Parameter Scale

~27B

Mid-Train Tokens

10+

Benchmarks

Model Family	Models	Architecture	Parameters
Granite	Granite-4 Micro (3B), Granite-3.3-8B	Dense Transformer	3B, 8B
Granite	Granite-4-H Micro (3B)	Attention-Mamba Hybrid	3B
LLaMA	LLaMA-3.1-8B	Dense Transformer	8B
Mistral	Mistral-7B-v0.1, Mistral-Small-24B	Dense Transformer	7B, 24B
Nemotron-H	Nemotron-H-8B	Attention-Mamba Hybrid	8B

Abstract

Paper Summary

We present PRISM (Demystifying Retention and Interaction in Mid-Training), a comprehensive empirical study of mid-training design choices for large language models. Through controlled experiments across seven open-source base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters, we systematically investigate what data to use, when to apply mid-training, how it interacts with reinforcement learning (RL), and whether findings generalize across architectures.

Using targeted mid-training mixtures of only ~27B high-quality tokens, PRISM yields +15 to +40 point math gains, +5 to +12 point code gains, and +6 to +13 point science gains across all tested models, while preserving general-purpose performance. Crucially, data composition choices matter most at mid-training, not at RL: including science data during mid-training unlocks +17 to +28 point GPQA-Diamond gains during RL, while changing the RL mix produces less than 2 point differences. The full PRISM-to-RL pipeline transforms base models scoring under 12 AVG into models scoring 29-42 AVG, a 3-4x improvement. Mechanistically, mid-training densely restructures >90% of weights while RL makes sparse, front-loaded refinements to ~5% of parameters. Representational geometry is largely preserved through RL (CKA >0.998 across models and input distributions), while different mid-training data mixtures produce same-magnitude but differently-directed weight updates (cosine similarity 0.52 for Granite-3.3, 0.62 for Nemotron-H). Pass rate landscape interpolation shows a generally increasing pass rate from Base (17%) to Mid-Training (76%) to RL (80%) for Granite-3.3, consistent with mid-training progressively improving the model's configuration for RL. Benefits hold for both dense Transformers and attention-Mamba hybrids, from 3B to 24B parameters.

At a Glance

Key Results

PRISM mid-training produces consistent, large gains across domains and model families using only ~27B tokens.

+15-40

Math Improvement

points across all models

+5-12

Code Improvement

points across all models

+6-13

Science Improvement

GPQA-Diamond

3-4x

Pipeline Boost

base <12 to 29-42 AVG

Key Findings

What We Discovered

Five principal findings that provide practical guidance for designing mid-training pipelines.

Finding 1

Mid-training is a powerful reasoning catalyst

Across all tested models, PRISM yields +15 to +40 pt math gains, +5 to +12 pt code gains, and +6 to +13 pt science gains, while preserving general-purpose performance.

Finding 2

Mid-training is essential for effective RL

The full PRISM → RL pipeline transforms base models scoring under 12 AVG into models scoring 29-42 AVG, a 3-4x improvement. RL applied directly to base models is substantially less effective, with AIME scores remaining near zero.

Finding 3

Data composition matters at MT, not RL

Changing the mid-training mix shifts AVG by +4 to +6 pt, while changing the RL mix produces <2 point differences. Science data at mid-training unlocks +17 to +28 pt GPQA-Diamond gains during RL.

Finding 4

Benefits generalize across architectures

Both dense Transformers and attention-Mamba hybrids benefit consistently from PRISM, from 3B to 24B parameters. Mid-training gains are consistent across all four model families tested.

Finding 5

RL expands the solvability frontier

For Granite-3.3, RL on PRISM-mid-trained models progressively solves prompts that were initially unsolvable, with training curves that remain non-saturating across hundreds of steps.

Finding 6

Mid-training and RL operate at different representational scales, confirmed mechanistically

CKA analysis shows RL preserves mid-training representations (CKA >0.998). Pass rate landscape on held-out MATH500 shows a generally increasing path from Base (17%) to MT (76%) to RL (80%) for Granite-3.3, consistent with mid-training improving the model's configuration for subsequent RL.

Method

The PRISM Pipeline

A three-stage recipe: targeted mid-training with retention-aware data mixtures, optional long-context restoration, then reinforcement learning.

Base Model

Pre-trained LLM
(3B-24B params)

PRISM Mid-Training

~27B tokens
Math + Code + Science

LC Restoration

Linear merging +
128k extension

RL (GRPO)

Math + Code + Science
verifiable rewards

Data Composition

Mid-Training Data Mixtures

We design three progressively richer data mixtures. Click each configuration to explore how domain coverage evolves and see corresponding benchmark performance.

36.4

Math AVG

2.8

Code AVG

17.3

GPQA-D

18.9

Overall AVG

Mid-Training Mix	Code AVG	GPQA-D	Math AVG	Overall AVG
Base (no MT)	2.07	22.56	8.95	11.19
Math only	2.81	17.34	36.43	18.86
Math + Code	10.71	19.02	44.33	24.69
Math + Code + Science	10.58	29.12	48.75	29.48

Domain-specific mid-training results on Granite-3.3-8B. Adding science data yields +10 points on GPQA-Diamond and +4 points on math, with minimal code regression. This composition effect persists and amplifies through RL.

Results

Full Pipeline Evaluation

End-to-end results: base model → PRISM mid-training → RL. AVG is the macro-average over Math, Code, and Science.

Model	Stage	AIME'24	AIME'25	MATH500	LCB	CF	GPQA-D	AVG
Granite-3.3-8B	Base	0.46	0.31	26.09	2.15	1.99	22.56	8.93
	+ PRISM	37.18	27.96	81.11	10.63	10.52	29.12	32.75
	+ PRISM + RL	40.94	32.03	84.76	19.59	20.82	51.51	41.60
LLaMA-3.1-8B	Base	0.05	0.15	6.51	0.00	0.07	20.20	4.50
	+ PRISM	16.45	19.32	73.47	6.09	5.44	21.04	23.64
	+ PRISM + RL	21.15	22.97	77.21	15.05	12.43	39.39	31.37
Mistral-Small-24B	Base	0.78	0.73	26.92	0.00	0.29	22.55	8.55
	+ PRISM	32.91	27.34	80.80	10.03	10.08	22.05	30.54
	+ PRISM + RL	39.69	32.40	86.89	16.97	17.59	50.00	40.59
Nemotron-H-8B	Base	2.13	2.29	49.46	1.19	3.60	4.21	10.48
	+ PRISM	19.21	22.76	76.63	13.02	10.52	31.98	29.02
	+ PRISM + RL	29.95	28.54	84.47	19.59	15.38	41.24	36.53

Full pipeline results (PRISM → RL). All models use Math+Code+Science mid-training. AVG is the macro-average across all six benchmarks. Highlighted rows show the complete pipeline. Representative models shown; see paper for all seven.

Reinforcement Learning

Why Mid-Training Before RL?

RL on PRISM-mid-trained models produces large, sustained gains across math, code, and science. RL on base models without mid-training is substantially less effective, with AIME scores remaining near zero.

Base → PRISM → RL

RL learning curves on PRISM mid-trained Granite-3.3-8B for math benchmarks

AIME'24 reaches 41+, AIME'25 reaches 32+, MATH500 reaches 85+. Strong gains with non-saturating curves.

Base → RL (no mid-training)

RL learning curves on base Granite-3.3-8B for math benchmarks

AIME'24 stays below 1.5, AIME'25 below 0.6. RL alone cannot bootstrap reasoning from a base model.

Base → PRISM → RL

RL learning curves on PRISM mid-trained Granite-3.3-8B for code and science

LiveCodeBench reaches 19+, Codeforces 20+, GPQA-Diamond 47+. Science gains are especially large when science data was included at mid-training.

Base → RL (no mid-training)

RL learning curves on base Granite-3.3-8B for code and science

Without mid-training foundation, RL produces minimal code/science gains. Base model lacks the reasoning substrate to leverage RL signal.

Long-Context

Restoring Long-Context Ability

Mid-training at 8k context degrades long-context capabilities. We restore them via linear model merging followed by a brief 128k extension phase.

59.09

Base Model

RULER Score

6.46

After 8k Mid-Training

89.06% drop

11.32

Linear Merge

Partial recovery

42.16

+ 128k Extension

change recovered

How it works: After mid-training at 8k context, we merge the mid-trained model with the original base model. This recovers some portion of long-context ability while retaining reasoning gains. A brief additional training phase on 128k sequences further restores the RULER score from 11.32 to 42.2, recovering 71% of the original capability.

Mechanistic Analysis

How Do Mid-Training and RL Differ?

We investigate how mid-training and RL change models through weight-level divergence, CKA representation analysis, weight direction comparisons, pass rate landscape interpolation, prediction entropy, and correctness analysis across three 8B models.

>90%

Parameters changed by mid-training

~5%

Parameters changed by RL

300-580x

Magnitude difference (MT vs RL)

>0.998

CKA preserved through RL

17%→76%→80%

Pass rate: Base→MT→RL (MATH500, G33)

Weight Divergence: Dense Restructuring vs. Sparse Refinement

Normalized L2 divergence by component type. Mid-training changes weights orders of magnitude more than RL.

Base → Mid-Train

Mid-Train → RL

Base → RL (no MT)

Granite-3.3 (8B) — Dense Transformer

Attention

0.175

MLP

0.329

Nemotron-H (8B) — Attention-Mamba Hybrid

Attention

0.230

MLP

0.289

Mamba

0.138

RL produces similarly scaled and sparse weight changes regardless of starting point, yet downstream performance differs substantially depending on whether mid-training preceded it.

CKA Representation Analysis: RL Preserves Mid-Training Geometry

Centered Kernel Alignment (CKA) between mid-trained and RL model representations across inputs and models. Values >0.998 indicate that RL refines without restructuring representations.

CKA representation analysis across 4 models and 3 input distributions

CKA similarity between mid-trained and RL checkpoints on Wikipedia, C4, and GSM8K inputs for Granite-3.3, LLaMA-3.1, and Nemotron-H (200 samples, batch-size 1). RL consistently preserves the representational geometry established by mid-training across all three models and all three input distributions (minimum MT vs RL CKA >0.998). Bootstrap confidence intervals (20 resamples) confirm the estimates are stable (std <0.0001).

Dense Transformers (G33, LLaMA)

CKA >0.999 across all layers and all three input distributions. Mid-training's representational geometry is consistently preserved through RL across input domains (CKA >0.999 for dense Transformers; >0.998 for hybrids).

Hybrid Architecture (Nemotron-H)

MT vs RL CKA >0.998 across all layers and all three input distributions for Nemotron-H. Note: Nemotron-H shows broader Base vs MT divergence in later layers (CKA ∼0.41 on GSM8K at layer 48), indicating mid-training restructures hybrid models more extensively.

Weight Direction Analysis: Same Magnitude, Different Direction

MC and MCS mid-training produce nearly identical weight change magnitudes but different directions, suggesting that data composition at mid-training shapes the weight configuration that RL subsequently refines.

Weight direction analysis comparing MC and MCS mid-training

Weight direction comparison between Math+Code (MC) and Math+Code+Science (MCS) mid-training. Both mixtures produce nearly identical L2 magnitudes (0.177 vs 0.175) but diverge in direction: cosine similarity of 0.52 for Granite-3.3 and 0.62 for Nemotron-H. This directional divergence, not magnitude, accounts for the large downstream difference in GPQA-Diamond performance after RL (+17 pts).

0.177 / 0.175

MC vs MCS L2 magnitude
(Granite-3.3, nearly identical)

0.52 / 0.62

MC vs MCS cosine similarity
(G33 / Nemotron-H directions)

+17 pt

GPQA-Diamond gain
from directional difference

Pass Rate Landscape Along the Training Path

Interpolating model weights along the Base → MT → RL training path reveals a generally increasing pass rate along the training path, consistent with mid-training progressively improving the model's configuration for subsequent RL.

Pass rate landscape interpolation along Base to MT to RL path

Pass rate landscape on held-out MATH500 (Granite-3.3 & LLaMA-3.1). Left panels: pass rate increases generally from Base to MT to RL (G33: 17%→76%→80%; LLaMA: 3%→44%→66%). Right: 2D landscape for Granite-3.3 centered at MT, showing the RL direction as consistently high-reward.

2D Pass Rate Landscape Animation — dot moves from Base to MT to RL

Animation: Dot moves from Base → MT → RL on the 2D pass rate landscape. The dashed white isoline and colorbar indicator track the current pass rate level.

17%

Base Model
Pass Rate (G33)

76%

After Mid-Training
Pass Rate (G33)

80%

After RL
Pass Rate (G33)

Model	Stage	Pass Rate	Med. Length	Neg Log-Prob	Corr. NLP	Incorr. NLP
Granite-3.3 (8B)	Base	16.9%	120	0.382	—	0.383
	MT	75.5%	2,254	0.138	0.128	0.153
	RL	79.5%	1,700	0.141	0.135	0.160

LLaMA-3.1 (8B)	Base	2.6%	158	0.758	—	0.780
	MT	43.1%	1,052	0.377	0.146	0.469
	RL	64.6%	1,188	0.267	0.149	0.320

Nemotron-H (8B, Hybrid)	Base	66.6%	452	0.167	0.040	0.258
	MT	61.6%	1,928	0.150	0.116	0.156
	RL	83.0%	1,780	0.127	0.112	0.137

Setup: 200 held-out MATH500 problems, 8 samples/prompt, 7680 generation tokens, temperature 0.6, top-p 0.95. Pass rate = mean across 8 samples. Scored with the same math verifier used during RL training. — indicates too few correct samples. Correct responses consistently have lower negative log-probability (higher confidence) than incorrect ones.

Mid-training teaches reasoning, not just answers

Response lengths increase dramatically: LLaMA goes from 158 tokens to 1,052 tokens (7x). Models learn multi-step problem decomposition with extended reasoning chains.

RL refines toward efficient correctness

RL adjusts response length model-dependently: shortening Granite-3.3 (2,254 → 1,700) while LLaMA (1,052 → 1,188) shows modest changes, optimizing both quality and efficiency.

Data composition changes direction, not magnitude

MC and MCS mid-training produce nearly identical weight change magnitudes (L2: 0.177 vs 0.175), yet MCS+RL achieves GPQA-Diamond of 52.9 vs only 35.5 for MC+RL. Cosine similarity between the two update directions is only 0.52 (G33) and 0.62 (Nemotron-H).

Architecture-agnostic RL sparsity

All three component types in Nemotron-H show nearly identical RL sparsity: Attention (93.5%), MLP (94.5%), and Mamba (93.9%). The sparse pattern is universal.

RL optimization is front-loaded

Most weight changes occur in the first ~200-400 steps, then plateau. The active parameter set grows progressively from ~1.5% to ~5%, and MT vs Base starting points produce similarly scaled and sparse update trajectories.

RL preserves representational geometry

CKA similarity between mid-trained and RL representations exceeds 0.998 across all three models (Granite-3.3, LLaMA-3.1, Nemotron-H) on Wikipedia, C4, and GSM8K inputs. RL consistently refines rather than restructures the learned representations.

Pass rate landscape along the training path

Pass rate interpolation on held-out MATH500 shows a generally increasing trend (17%→76%→80% for G33; 3%→44%→66% for LLaMA, 8 samples/prompt). The 2D landscape shows the RL direction as consistently high-reward with no apparent sharp barriers.

Citation

BibTeX

If you find PRISM useful in your research, please consider citing our paper.


    @misc{runwal2026prismdemystifyingretentioninteraction,
          title={PRISM: Demystifying Retention and Interaction in Mid-Training}, 
          author={Bharat Runwal and Ashish Agrawal and Anurag Roy and Rameswar Panda},
          year={2026},
          eprint={2603.17074},
          archivePrefix={arXiv},
          primaryClass={cs.LG},
          url={https://arxiv.org/abs/2603.17074}, 
    }

PRISM: Demystifying Retention andInteraction in Mid-Training