ArXiv Preprint 2025

PRISM: Demystifying Retention and
Interaction in Mid-Training

A comprehensive empirical study of mid-training design choices for LLMs across 4 model families, 2 architecture types, 7 models, and scales from 3B to 24B parameters.

Bharat Runwal  ·  Ashish Agrawal  ·  Anurag Roy  ·  Rameswar Panda

IBM Research  ·  MIT-IBM Watson AI Lab

PRISM overview. Mid-training decisions are decomposed into their principal design axes: data composition, timing, domain interaction, benchmark selection, RL compatibility, and scaling behavior. PRISM enables holistic evaluation of mid-training choices across model families at scale.
Experimental Scope

Scale of the Study

PRISM systematically evaluates mid-training across a diverse set of modern LLMs, spanning multiple families, architectures, and parameter scales.

7
Base Models
4
Model Families
2
Architecture Types
3B-24B
Parameter Scale
~27B
Mid-Train Tokens
10+
Benchmarks
| Model Family | Models | Architecture | Parameters |
|---|---|---|---|
| Granite | Granite-4 Micro (3B), Granite-3.3-8B | Dense Transformer | 3B, 8B |
| Granite | Granite-4-H Micro (3B) | Attention-Mamba Hybrid | 3B |
| LLaMA | LLaMA-3.1-8B | Dense Transformer | 8B |
| Mistral | Mistral-7B-v0.1, Mistral-Small-24B | Dense Transformer | 7B, 24B |
| Nemotron-H | Nemotron-H-8B | Attention-Mamba Hybrid | 8B |
Abstract

Paper Summary

We present PRISM (Demystifying Retention and Interaction in Mid-Training), a comprehensive empirical study of mid-training design choices for large language models. Through controlled experiments across seven open-source base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters, we systematically investigate what data to use, when to apply mid-training, how it interacts with reinforcement learning (RL), and whether findings generalize across architectures.


Using targeted mid-training mixtures of only ~27B high-quality tokens, PRISM yields +15 to +40 point math gains, +5 to +12 point code gains, and +6 to +13 point science gains across all tested models, while preserving general-purpose performance. Crucially, data composition choices matter most at mid-training, not at RL: including science data during mid-training unlocks +17 to +30 point GPQA-Diamond gains during RL, while changing the RL mix produces less than 2 point differences. The full PRISM-to-RL pipeline transforms base models scoring under 12 AVG into models scoring 29-42 AVG, a 3-4x improvement. Mechanistically, mid-training densely restructures >90% of weights while RL makes sparse, front-loaded refinements to ~5% of parameters. Benefits hold for both dense Transformers and attention-Mamba hybrids, from 3B to 24B parameters.

At a Glance

Key Results

PRISM mid-training produces consistent, large gains across domains and model families using only ~27B tokens.

+15-40
Math Improvement
points across all models
+5-12
Code Improvement
points across all models
+6-13
Science Improvement
GPQA-Diamond
3-4x
Pipeline Boost
base <12 to 29-42 AVG
Key Findings

What We Discovered

Five principal findings that provide practical guidance for designing mid-training pipelines.

Finding 1

Mid-training is a powerful reasoning catalyst

Across all tested models, PRISM yields +15 to +40 pt math gains, +5 to +12 pt code gains, and +6 to +13 pt science gains, while preserving general-purpose performance.

Finding 2

Mid-training is essential for effective RL

The full PRISM → RL pipeline transforms base models scoring under 12 AVG into models scoring 29-42 AVG, a 3-4x improvement. RL applied directly to base models is substantially less effective, with AIME scores remaining near zero.

Finding 3

Data composition matters at MT, not RL

Changing the mid-training mix shifts AVG by +4 to +6 pt, while changing the RL mix produces <2 point differences. Science data at mid-training unlocks +17 to +30 pt GPQA-Diamond gains during RL.

Finding 4

Benefits generalize across architectures

Both dense Transformers and attention-Mamba hybrids benefit consistently from PRISM, from 3B to 24B parameters. Mid-training gains are architecture-agnostic.

Finding 5

RL expands the solvability frontier

RL on PRISM-mid-trained models progressively solves prompts that were initially unsolvable, with training curves that remain non-saturating across hundreds of steps.

Method

The PRISM Pipeline

A three-stage recipe: targeted mid-training with retention-aware data mixtures, optional long-context restoration, then reinforcement learning.

Base Model (pre-trained LLM, 3B-24B parameters)
→ PRISM Mid-Training (~27B tokens: math + code + science)
→ LC Restoration (linear merging + 128k extension)
→ RL (GRPO with verifiable math, code, and science rewards)
Data Composition

Mid-Training Data Mixtures

We design three progressively richer data mixtures. The table below shows how domain coverage shapes benchmark performance as each configuration adds a domain.

| Mid-Training Mix | Code AVG | GPQA-D | Math AVG | Overall AVG |
|---|---|---|---|---|
| Base (no MT) | 2.07 | 22.56 | 8.95 | 11.19 |
| Math only | 2.81 | 17.34 | 36.43 | 18.86 |
| Math + Code | 10.71 | 19.02 | 44.33 | 24.69 |
| Math + Code + Science | 10.58 | 29.12 | 48.75 | 29.48 |
Domain-specific mid-training results on Granite-3.3-8B. Adding science data yields +10 points on GPQA-Diamond and +4 points on math, with minimal code regression. This composition effect persists and amplifies through RL.
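A mixture like the ones above can be realized by sampling each training document's domain in proportion to fixed mix weights. The sketch below is illustrative only: the 0.5/0.3/0.2 weights are assumptions for demonstration, not the paper's actual mixture proportions.

```python
import random

# Illustrative domain weights for a Math+Code+Science mid-training mix.
# These proportions are assumptions, not the paper's actual recipe.
MIX = {"math": 0.5, "code": 0.3, "science": 0.2}

def sample_domain(rng: random.Random, mix: dict[str, float]) -> str:
    """Pick the domain of the next training document, proportional to mix weights."""
    domains, weights = zip(*mix.items())
    return rng.choices(domains, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {d: 0 for d in MIX}
for _ in range(10_000):
    counts[sample_domain(rng, MIX)] += 1
# Empirical domain fractions converge to the target mix as draws increase.
```

In practice such weights would be tuned so that the total token budget (~27B here) is split across domains as intended.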
Results

Full Pipeline Evaluation

End-to-end results: base model → PRISM mid-training → RL. AVG is the unweighted mean over all six benchmarks.

| Model | Stage | AIME'24 | AIME'25 | MATH500 | LCB | CF | GPQA-D | AVG |
|---|---|---|---|---|---|---|---|---|
| Granite-3.3-8B | Base | 0.46 | 0.31 | 26.09 | 2.15 | 1.99 | 22.56 | 8.93 |
| | + PRISM | 37.18 | 27.96 | 81.11 | 10.63 | 10.52 | 29.12 | 32.75 |
| | + PRISM + RL | 53.56 | 37.50 | 89.40 | 19.95 | 16.52 | 46.97 | 43.98 |
| LLaMA-3.1-8B | Base | 2.01 | 1.54 | 44.80 | 5.11 | 4.27 | 27.27 | 14.17 |
| | + PRISM | 22.08 | 14.71 | 72.94 | 10.51 | 9.21 | 30.81 | 26.71 |
| | + PRISM + RL | 37.83 | 26.04 | 83.60 | 16.43 | 14.43 | 41.92 | 36.71 |
| Mistral-Small-24B | Base | 6.88 | 4.79 | 57.40 | 7.93 | 6.30 | 31.31 | 19.10 |
| | + PRISM | 35.71 | 26.04 | 82.20 | 17.11 | 12.54 | 39.39 | 35.50 |
| | + PRISM + RL | 47.92 | 36.46 | 87.20 | 22.89 | 18.40 | 49.49 | 43.73 |
| Nemotron-H-8B | Base | 3.13 | 2.29 | 54.00 | 9.80 | 5.62 | 29.29 | 17.36 |
| | + PRISM | 25.83 | 17.50 | 78.00 | 14.07 | 10.85 | 35.35 | 30.27 |
| | + PRISM + RL | 40.63 | 27.71 | 85.40 | 18.65 | 15.22 | 45.45 | 38.84 |
Full pipeline results (PRISM → RL). All models use Math+Code+Science mid-training. AVG is the macro-average across all six benchmarks. Highlighted rows show the complete pipeline. Representative models shown; see paper for all seven.
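The AVG column is reproducible directly from the six benchmark scores. As a quick check, using the Granite-3.3-8B row after the full pipeline:

```python
# AVG in the table is the unweighted mean of the six benchmark scores.
# Values below are the Granite-3.3-8B "+ PRISM + RL" row.
scores = {
    "AIME'24": 53.56, "AIME'25": 37.50, "MATH500": 89.40,
    "LCB": 19.95, "CF": 16.52, "GPQA-D": 46.97,
}
avg = sum(scores.values()) / len(scores)
print(round(avg, 2))  # 43.98, matching the table's AVG column
```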
Reinforcement Learning

Why Mid-Training Before RL?

RL on PRISM-mid-trained models produces large, sustained gains across math, code, and science. RL on base models without mid-training is substantially less effective, with AIME scores remaining near zero.

RL learning curves on Granite-3.3-8B:

Base → PRISM → RL (math): AIME'24 reaches 41+, AIME'25 reaches 32+, MATH500 reaches 85+. Strong gains with non-saturating curves.

Base → RL without mid-training (math): AIME'24 stays below 1.5, AIME'25 below 0.6. RL alone cannot bootstrap reasoning from a base model.

Base → PRISM → RL (code and science): LiveCodeBench reaches 19+, Codeforces 20+, GPQA-Diamond 47+. Science gains are especially large when science data was included at mid-training.

Base → RL without mid-training (code and science): RL produces minimal code and science gains; the base model lacks the reasoning substrate to leverage the RL signal.
Long-Context

Restoring Long-Context Ability

Mid-training at 8k context degrades long-context capabilities. We restore them via linear model merging followed by a brief 128k extension phase.

RULER score through the restoration pipeline:
- Base model: 59.09
- After 8k mid-training: 6.46 (89% drop)
- After linear merge: 11.32 (partial recovery)
- After 128k extension: 42.16 (71% of the original recovered)
How it works: After mid-training at 8k context, we merge the mid-trained model with the original base model. This recovers a significant portion of long-context ability while retaining reasoning gains. A brief additional training phase on 128k sequences further restores the RULER score from 30.4 to 42.2, recovering 71% of the original capability.
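Linear merging is elementwise interpolation between two checkpoints that share an architecture. A minimal sketch, using plain Python dicts as stand-in "checkpoints"; the 0.5 coefficient is an assumption for illustration, not the paper's actual merge weight:

```python
def linear_merge(base: dict, midtrained: dict, alpha: float = 0.5) -> dict:
    """Elementwise linear interpolation of two checkpoints with shared keys.

    alpha = 1.0 returns the base weights, alpha = 0.0 the mid-trained weights.
    The 0.5 default is an illustrative assumption.
    """
    return {name: alpha * base[name] + (1.0 - alpha) * midtrained[name]
            for name in base}

# Toy scalar 'weights' for illustration; real checkpoints hold tensors.
base = {"mlp.w": 1.0, "attn.w": -2.0}
mid = {"mlp.w": 3.0, "attn.w": 0.0}
merged = linear_merge(base, mid)  # {'mlp.w': 2.0, 'attn.w': -1.0}
```

With real models the same interpolation runs over every tensor in the state dict; tools such as mergekit implement this as the `linear` merge method.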
Mechanistic Analysis

How Do Mid-Training and RL Differ?

We investigate how mid-training and RL change models through weight-level divergence, prediction entropy, and correctness analysis across three 8B models.

>90%
Parameters changed by mid-training
~5%
Parameters changed by RL
300-580x
Magnitude difference (MT vs RL)

Weight Divergence: Dense Restructuring vs. Sparse Refinement

Normalized L2 divergence by component type. Mid-training changes weights orders of magnitude more than RL.

Transitions compared: Base → Mid-Train, Mid-Train → RL, Base → RL (no MT).

Granite-3.3 (8B), Dense Transformer: Attention 0.175, MLP 0.329
Nemotron-H (8B), Attention-Mamba Hybrid: Attention 0.230, MLP 0.289, Mamba 0.138
RL produces identical weight changes whether or not mid-training preceded it, yet only succeeds when applied to mid-trained models.
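One plausible way to compute the per-component divergence reported above is the L2 norm of the weight change, normalized by the norm of the starting weights; the paper's exact normalization may differ. A minimal sketch over flattened weight vectors:

```python
import math

def normalized_l2(w_after: list[float], w_before: list[float]) -> float:
    """||w_after - w_before||_2 / ||w_before||_2 for one flattened weight tensor."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_after, w_before)))
    norm = math.sqrt(sum(b ** 2 for b in w_before))
    return diff / norm

# Toy vectors for illustration; real tensors have millions of entries.
before = [1.0, 0.0, -2.0]
after = [1.1, 0.1, -2.1]
print(normalized_l2(after, before))  # ~0.0775
```

Grouping parameters by component type (attention, MLP, Mamba) before computing this ratio yields per-component bars like those above.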
| Model | Stage | Pass Rate | Med. Length | Neg Log-Prob | Correct NLP | Incorrect NLP |
|---|---|---|---|---|---|---|
| Granite-3.3 (8B) | Base | 9.5% | 444 | 0.237 | 0.203 | 0.240 |
| | MT | 37.0% | 4,364 | 0.153 | 0.141 | 0.159 |
| | RL | 66.5% | 2,902 | 0.156 | 0.143 | 0.181 |
| LLaMA-3.1 (8B) | Base | 0.5% | 10 | 0.706 | n/a | 0.708 |
| | MT | 21.0% | 1,029 | 0.387 | 0.164 | 0.447 |
| | RL | 31.5% | 1,666 | 0.336 | 0.164 | 0.415 |
| Nemotron-H (8B, Hybrid) | Base | 35.5% | 580 | 0.203 | 0.050 | 0.287 |
| | MT | 32.0% | 2,186 | 0.238 | 0.134 | 0.287 |
| | RL | 51.5% | 2,514 | 0.183 | 0.125 | 0.244 |
Setup: 200 math prompts, 8k context (512 prompt + 7680 generation tokens), temperature 0.6, top-p 0.95, scored with the same math verifier used during RL training. Correct responses consistently have lower negative log-probability (higher confidence) than incorrect ones.
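The confidence measure in the table is the mean negative log-probability of a response's generated tokens: lower values mean higher model confidence. A minimal sketch with synthetic token log-probabilities (not actual model outputs):

```python
# Per-response confidence: mean negative log-probability over generated tokens.
# The token log-probs below are synthetic illustrations, not model outputs.
def mean_neg_logprob(token_logprobs: list[float]) -> float:
    """Lower result = more confident generation."""
    return -sum(token_logprobs) / len(token_logprobs)

correct_lp = [-0.10, -0.15, -0.20]    # confident generation
incorrect_lp = [-0.30, -0.25, -0.40]  # less confident generation
assert mean_neg_logprob(correct_lp) < mean_neg_logprob(incorrect_lp)
```

Averaging this statistic separately over verifier-scored correct and incorrect responses gives the last two columns of the table.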

Mid-training teaches reasoning, not just answers

Response lengths increase dramatically: LLaMA goes from 10 tokens to 1,029 tokens (100x). Models learn multi-step problem decomposition with extended reasoning chains.

RL refines toward efficient correctness

RL adjusts response length model-dependently: shortening Granite-3.3 (4,364 → 2,902) while extending LLaMA (1,029 → 1,666), optimizing both quality and efficiency.

Data composition changes direction, not magnitude

Math+Code (MC) and Math+Code+Science (MCS) mid-training produce nearly identical weight changes (L2: 0.177 vs 0.175), yet MCS+RL achieves a GPQA-Diamond score of 52.9 vs only 35.5 for MC+RL.

Architecture-agnostic RL sparsity

All three component types in Nemotron-H show nearly identical RL sparsity: Attention (93.5%), MLP (94.5%), and Mamba (93.9%). The sparse pattern is universal.

RL optimization is front-loaded

Most weight changes occur in the first ~200-400 steps, then plateau. The active parameter set grows progressively from ~1.5% to ~5%, and MT vs Base starting points produce identical trajectories.
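A simple way to quantify the sparsity figures above is the fraction of parameters whose value moved by more than a small threshold between checkpoints; the threshold choice here is an illustrative assumption, not the paper's exact criterion.

```python
def changed_fraction(before: list[float], after: list[float],
                     tol: float = 1e-6) -> float:
    """Fraction of parameters whose value moved by more than tol."""
    changed = sum(1 for b, a in zip(before, after) if abs(a - b) > tol)
    return changed / len(before)

# Toy example: an RL-style sparse update touching 1 of 20 parameters.
before = [0.0] * 20
after = [0.0] * 20
after[3] = 0.01
print(changed_fraction(before, after))  # 0.05, i.e. 5% of parameters changed
```

Under this definition, a dense mid-training update would score near 1.0 (>90% of parameters changed), while the RL updates described above would score around 0.05.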

Citation

BibTeX

If you find PRISM useful in your research, please consider citing our paper.

@article{runwal2025prism,
  title={{PRISM}: Demystifying Retention and
         Interaction in Mid-Training},
  author={Runwal, Bharat and Agrawal, Ashish
          and Roy, Anurag and Panda, Rameswar},
  journal={arXiv preprint},
  year={2025}
}