ArXiv Preprint 2025

PRISM: Demystifying Retention and
Interaction in Mid-Training

A comprehensive empirical study of mid-training design choices for LLMs across 4 model families, 2 architecture types, 7 models, and scales from 3B to 24B parameters.

Bharat Runwal  ·  Ashish Agrawal  ·  Anurag Roy  ·  Rameswar Panda

IBM Research  ·  MIT-IBM Watson AI Lab

PRISM overview. Mid-training decisions are decomposed into their principal design axes: data composition, timing, domain interaction, benchmark selection, RL compatibility, and scaling behavior. PRISM enables holistic evaluation of mid-training choices across model families at scale.
Experimental Scope

Scale of the Study

PRISM systematically evaluates mid-training across a diverse set of modern LLMs, spanning multiple families, architectures, and parameter scales.

7
Base Models
4
Model Families
2
Architecture Types
3B-24B
Parameter Scale
~27B
Mid-Train Tokens
10+
Benchmarks
| Model Family | Models | Architecture | Parameters |
|---|---|---|---|
| Granite | Granite-4 Micro (3B), Granite-3.3-8B | Dense Transformer | 3B, 8B |
| Granite | Granite-4-H Micro (3B) | Attention-Mamba Hybrid | 3B |
| LLaMA | LLaMA-3.1-8B | Dense Transformer | 8B |
| Mistral | Mistral-7B-v0.1, Mistral-Small-24B | Dense Transformer | 7B, 24B |
| Nemotron-H | Nemotron-H-8B | Attention-Mamba Hybrid | 8B |
Abstract

Paper Summary

We present PRISM (Demystifying Retention and Interaction in Mid-Training), a comprehensive empirical study of mid-training design choices for large language models. Through controlled experiments across seven open-source base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters, we systematically investigate what data to use, when to apply mid-training, how it interacts with reinforcement learning (RL), and whether findings generalize across architectures.


Using targeted mid-training mixtures of only ~27B high-quality tokens, PRISM yields +15 to +40 point math gains, +5 to +12 point code gains, and +6 to +13 point science gains across all tested models, while preserving general-purpose performance. Crucially, data composition choices matter most at mid-training, not at RL: including science data during mid-training unlocks +17 to +30 point GPQA-Diamond gains during RL, while changing the RL mix produces less than 2 point differences. The full PRISM-to-RL pipeline transforms base models scoring under 12 AVG into models scoring 29-42 AVG, a 3-4x improvement. Mechanistically, mid-training densely restructures >90% of weights while RL makes sparse, front-loaded refinements to ~5% of parameters. Benefits hold for both dense Transformers and attention-Mamba hybrids, from 3B to 24B parameters.

At a Glance

Key Results

PRISM mid-training produces consistent, large gains across domains and model families using only ~27B tokens.

+15-40
Math Improvement
points across all models
+5-12
Code Improvement
points across all models
+6-13
Science Improvement
GPQA-Diamond
3-4x
Pipeline Boost
base <12 to 29-42 AVG
Key Findings

What We Discovered

Five principal findings that provide practical guidance for designing mid-training pipelines.

Finding 1

Mid-training is a powerful reasoning catalyst

Across all tested models, PRISM yields +15 to +40 pt math gains, +5 to +12 pt code gains, and +6 to +13 pt science gains, while preserving general-purpose performance.

Finding 2

Mid-training is essential for effective RL

The full PRISM → RL pipeline transforms base models scoring under 12 AVG into models scoring 29-42 AVG, a 3-4x improvement. RL applied directly to base models is substantially less effective, with AIME scores remaining near zero.

Finding 3

Data composition matters at MT, not RL

Changing the mid-training mix shifts AVG by +4 to +6 pt, while changing the RL mix produces <2 point differences. Science data at mid-training unlocks +17 to +30 pt GPQA-Diamond gains during RL.

Finding 4

Benefits generalize across architectures

Both dense Transformers and attention-Mamba hybrids benefit consistently from PRISM, from 3B to 24B parameters. Mid-training gains are architecture-agnostic.

Finding 5

RL expands the solvability frontier

RL on PRISM-mid-trained models progressively solves prompts that were initially unsolvable, with training curves that remain non-saturating across hundreds of steps.

Method

The PRISM Pipeline

A three-stage recipe: targeted mid-training with retention-aware data mixtures, optional long-context restoration, then reinforcement learning.

Base Model (pre-trained LLM, 3B-24B parameters)
→ PRISM Mid-Training (~27B tokens: math + code + science)
→ LC Restoration (linear merging + 128k extension)
→ RL (GRPO with verifiable math, code, and science rewards)
Data Composition

Mid-Training Data Mixtures

We design three progressively richer data mixtures. The table below shows how domain coverage shapes benchmark performance as each configuration adds a domain.

| Mid-Training Mix | Code AVG | GPQA-D | Math AVG | Overall AVG |
|---|---|---|---|---|
| Base (no MT) | 2.07 | 22.56 | 8.95 | 11.19 |
| Math only | 2.81 | 17.34 | 36.43 | 18.86 |
| Math + Code | 10.71 | 19.02 | 44.33 | 24.69 |
| Math + Code + Science | 10.58 | 29.12 | 48.75 | 29.48 |
Domain-specific mid-training results on Granite-3.3-8B. Adding science data yields +10 points on GPQA-Diamond and +4 points on math, with minimal code regression. This composition effect persists and amplifies through RL.
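A mixture like the ones above can be realized by sampling each training document's domain in proportion to fixed mix weights. The sketch below is illustrative only: the 0.5/0.3/0.2 weights are assumptions for demonstration, not the paper's actual mixture proportions.

```python
import random

# Illustrative domain weights for a Math+Code+Science mid-training mix.
# These proportions are assumptions, not the paper's actual recipe.
MIX = {"math": 0.5, "code": 0.3, "science": 0.2}

def sample_domain(rng: random.Random, mix: dict[str, float]) -> str:
    """Pick the domain of the next training document, proportional to mix weights."""
    domains, weights = zip(*mix.items())
    return rng.choices(domains, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {d: 0 for d in MIX}
for _ in range(10_000):
    counts[sample_domain(rng, MIX)] += 1
# Empirical domain fractions converge to the target mix as draws increase.
```

In practice such weights would be tuned so that the total token budget (~27B here) is split across domains as intended.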
Results

Full Pipeline Evaluation

End-to-end results: base model → PRISM mid-training → RL. AVG is the unweighted mean over all six benchmarks.

| Model | Stage | AIME'24 | AIME'25 | MATH500 | LCB | CF | GPQA-D | AVG |
|---|---|---|---|---|---|---|---|---|
| Granite-3.3-8B | Base | 0.46 | 0.31 | 26.09 | 2.15 | 1.99 | 22.56 | 8.93 |
| | + PRISM | 37.18 | 27.96 | 81.11 | 10.63 | 10.52 | 29.12 | 32.75 |
| | + PRISM + RL | 53.56 | 37.50 | 89.40 | 19.95 | 16.52 | 46.97 | 43.98 |
| LLaMA-3.1-8B | Base | 2.01 | 1.54 | 44.80 | 5.11 | 4.27 | 27.27 | 14.17 |
| | + PRISM | 22.08 | 14.71 | 72.94 | 10.51 | 9.21 | 30.81 | 26.71 |
| | + PRISM + RL | 37.83 | 26.04 | 83.60 | 16.43 | 14.43 | 41.92 | 36.71 |
| Mistral-Small-24B | Base | 6.88 | 4.79 | 57.40 | 7.93 | 6.30 | 31.31 | 19.10 |
| | + PRISM | 35.71 | 26.04 | 82.20 | 17.11 | 12.54 | 39.39 | 35.50 |
| | + PRISM + RL | 47.92 | 36.46 | 87.20 | 22.89 | 18.40 | 49.49 | 43.73 |
| Nemotron-H-8B | Base | 3.13 | 2.29 | 54.00 | 9.80 | 5.62 | 29.29 | 17.36 |
| | + PRISM | 25.83 | 17.50 | 78.00 | 14.07 | 10.85 | 35.35 | 30.27 |
| | + PRISM + RL | 40.63 | 27.71 | 85.40 | 18.65 | 15.22 | 45.45 | 38.84 |
Full pipeline results (PRISM → RL). All models use Math+Code+Science mid-training. AVG is the macro-average across all six benchmarks. Highlighted rows show the complete pipeline. Representative models shown; see paper for all seven.
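The AVG column is reproducible directly from the six benchmark scores. As a quick check, using the Granite-3.3-8B row after the full pipeline:

```python
# AVG in the table is the unweighted mean of the six benchmark scores.
# Values below are the Granite-3.3-8B "+ PRISM + RL" row.
scores = {
    "AIME'24": 53.56, "AIME'25": 37.50, "MATH500": 89.40,
    "LCB": 19.95, "CF": 16.52, "GPQA-D": 46.97,
}
avg = sum(scores.values()) / len(scores)
print(round(avg, 2))  # 43.98, matching the table's AVG column
```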
Reinforcement Learning

Why Mid-Training Before RL?

RL on PRISM-mid-trained models produces large, sustained gains across math, code, and science. RL on base models without mid-training is substantially less effective, with AIME scores remaining near zero.

RL learning curves on Granite-3.3-8B:

Base → PRISM → RL (math): AIME'24 reaches 41+, AIME'25 reaches 32+, MATH500 reaches 85+. Strong gains with non-saturating curves.

Base → RL without mid-training (math): AIME'24 stays below 1.5, AIME'25 below 0.6. RL alone cannot bootstrap reasoning from a base model.

Base → PRISM → RL (code and science): LiveCodeBench reaches 19+, Codeforces 20+, GPQA-Diamond 47+. Science gains are especially large when science data was included at mid-training.

Base → RL without mid-training (code and science): RL produces minimal code and science gains; the base model lacks the reasoning substrate to leverage the RL signal.
Long-Context

Restoring Long-Context Ability

Mid-training at 8k context degrades long-context capabilities. We restore them via linear model merging followed by a brief 128k extension phase.

RULER score through the restoration pipeline:
- Base model: 59.09
- After 8k mid-training: 6.46 (89% drop)
- After linear merge: 11.32 (partial recovery)
- After 128k extension: 42.16 (71% of the original recovered)
How it works: After mid-training at 8k context, we merge the mid-trained model with the original base model. This recovers a significant portion of long-context ability while retaining reasoning gains. A brief additional training phase on 128k sequences further restores the RULER score from 30.4 to 42.2, recovering 71% of the original capability.
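Linear merging is elementwise interpolation between two checkpoints that share an architecture. A minimal sketch, using plain Python dicts as stand-in "checkpoints"; the 0.5 coefficient is an assumption for illustration, not the paper's actual merge weight:

```python
def linear_merge(base: dict, midtrained: dict, alpha: float = 0.5) -> dict:
    """Elementwise linear interpolation of two checkpoints with shared keys.

    alpha = 1.0 returns the base weights, alpha = 0.0 the mid-trained weights.
    The 0.5 default is an illustrative assumption.
    """
    return {name: alpha * base[name] + (1.0 - alpha) * midtrained[name]
            for name in base}

# Toy scalar 'weights' for illustration; real checkpoints hold tensors.
base = {"mlp.w": 1.0, "attn.w": -2.0}
mid = {"mlp.w": 3.0, "attn.w": 0.0}
merged = linear_merge(base, mid)  # {'mlp.w': 2.0, 'attn.w': -1.0}
```

With real models the same interpolation runs over every tensor in the state dict; tools such as mergekit implement this as the `linear` merge method.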
Mechanistic Analysis

How Do Mid-Training and RL Differ?

We investigate how mid-training and RL change models through weight-level divergence, prediction entropy, and correctness analysis across three 8B models.

>90%
Parameters changed by mid-training
~5%
Parameters changed by RL
300-580x
Magnitude difference (MT vs RL)

Weight Divergence: Dense Restructuring vs. Sparse Refinement

Normalized L2 divergence by component type. Mid-training changes weights orders of magnitude more than RL.

Transitions compared: Base → Mid-Train, Mid-Train → RL, Base → RL (no MT).

Granite-3.3 (8B), Dense Transformer: Attention 0.175, MLP 0.329
Nemotron-H (8B), Attention-Mamba Hybrid: Attention 0.230, MLP 0.289, Mamba 0.138
RL produces identical weight changes whether or not mid-training preceded it, yet only succeeds when applied to mid-trained models.
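One plausible way to compute the per-component divergence reported above is the L2 norm of the weight change, normalized by the norm of the starting weights; the paper's exact normalization may differ. A minimal sketch over flattened weight vectors:

```python
import math

def normalized_l2(w_after: list[float], w_before: list[float]) -> float:
    """||w_after - w_before||_2 / ||w_before||_2 for one flattened weight tensor."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_after, w_before)))
    norm = math.sqrt(sum(b ** 2 for b in w_before))
    return diff / norm

# Toy vectors for illustration; real tensors have millions of entries.
before = [1.0, 0.0, -2.0]
after = [1.1, 0.1, -2.1]
print(normalized_l2(after, before))  # ~0.0775
```

Grouping parameters by component type (attention, MLP, Mamba) before computing this ratio yields per-component bars like those above.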
| Model | Stage | Pass Rate | Med. Length | Neg Log-Prob | Correct NLP | Incorrect NLP |
|---|---|---|---|---|---|---|
| Granite-3.3 (8B) | Base | 9.5% | 444 | 0.237 | 0.203 | 0.240 |
| | MT | 37.0% | 4,364 | 0.153 | 0.141 | 0.159 |
| | RL | 66.5% | 2,902 | 0.156 | 0.143 | 0.181 |
| LLaMA-3.1 (8B) | Base | 0.5% | 10 | 0.706 | n/a | 0.708 |
| | MT | 21.0% | 1,029 | 0.387 | 0.164 | 0.447 |
| | RL | 31.5% | 1,666 | 0.336 | 0.164 | 0.415 |
| Nemotron-H (8B, Hybrid) | Base | 35.5% | 580 | 0.203 | 0.050 | 0.287 |
| | MT | 32.0% | 2,186 | 0.238 | 0.134 | 0.287 |
| | RL | 51.5% | 2,514 | 0.183 | 0.125 | 0.244 |
Setup: 200 math prompts, 8k context (512 prompt + 7680 generation tokens), temperature 0.6, top-p 0.95, scored with the same math verifier used during RL training. Correct responses consistently have lower negative log-probability (higher confidence) than incorrect ones.
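The confidence measure in the table is the mean negative log-probability of a response's generated tokens: lower values mean higher model confidence. A minimal sketch with synthetic token log-probabilities (not actual model outputs):

```python
# Per-response confidence: mean negative log-probability over generated tokens.
# The token log-probs below are synthetic illustrations, not model outputs.
def mean_neg_logprob(token_logprobs: list[float]) -> float:
    """Lower result = more confident generation."""
    return -sum(token_logprobs) / len(token_logprobs)

correct_lp = [-0.10, -0.15, -0.20]    # confident generation
incorrect_lp = [-0.30, -0.25, -0.40]  # less confident generation
assert mean_neg_logprob(correct_lp) < mean_neg_logprob(incorrect_lp)
```

Averaging this statistic separately over verifier-scored correct and incorrect responses gives the last two columns of the table.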

Mid-training teaches reasoning, not just answers

Response lengths increase dramatically: LLaMA goes from 10 tokens to 1,029 tokens (100x). Models learn multi-step problem decomposition with extended reasoning chains.

RL refines toward efficient correctness

RL adjusts response length model-dependently: shortening Granite-3.3 (4,364 → 2,902) while extending LLaMA (1,029 → 1,666), optimizing both quality and efficiency.

Data composition changes direction, not magnitude

Math+Code (MC) and Math+Code+Science (MCS) mid-training produce nearly identical weight changes (L2: 0.177 vs 0.175), yet MCS+RL achieves a GPQA-Diamond score of 52.9 vs only 35.5 for MC+RL.

Architecture-agnostic RL sparsity

All three component types in Nemotron-H show nearly identical RL sparsity: Attention (93.5%), MLP (94.5%), and Mamba (93.9%). The sparse pattern is universal.

RL optimization is front-loaded

Most weight changes occur in the first ~200-400 steps, then plateau. The active parameter set grows progressively from ~1.5% to ~5%, and MT vs Base starting points produce identical trajectories.
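A simple way to quantify the sparsity figures above is the fraction of parameters whose value moved by more than a small threshold between checkpoints; the threshold choice here is an illustrative assumption, not the paper's exact criterion.

```python
def changed_fraction(before: list[float], after: list[float],
                     tol: float = 1e-6) -> float:
    """Fraction of parameters whose value moved by more than tol."""
    changed = sum(1 for b, a in zip(before, after) if abs(a - b) > tol)
    return changed / len(before)

# Toy example: an RL-style sparse update touching 1 of 20 parameters.
before = [0.0] * 20
after = [0.0] * 20
after[3] = 0.01
print(changed_fraction(before, after))  # 0.05, i.e. 5% of parameters changed
```

Under this definition, a dense mid-training update would score near 1.0 (>90% of parameters changed), while the RL updates described above would score around 0.05.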

Citation

BibTeX

If you find PRISM useful in your research, please consider citing our paper.

@article{runwal2025prism,
  title={{PRISM}: Demystifying Retention and
         Interaction in Mid-Training},
  author={Runwal, Bharat and Agrawal, Ashish
          and Roy, Anurag and Panda, Rameswar},
  journal={arXiv preprint},
  year={2025}
}