A comprehensive empirical study of mid-training design choices for LLMs across 4 model families, 2 architecture types, 7 models, and scales from 3B to 24B parameters.
IBM Research
PRISM systematically evaluates mid-training across a diverse set of modern LLMs, spanning multiple families, architectures, and parameter scales.
| Model Family | Models | Architecture | Parameters |
|---|---|---|---|
| Granite | Granite-4 Micro, Granite-3.3-8B | Dense Transformer | 3B, 8B |
| Granite | Granite-4-H Micro | Attention-Mamba Hybrid | 3B |
| LLaMA | LLaMA-3.1-8B | Dense Transformer | 8B |
| Mistral | Mistral-7B-v0.1, Mistral-Small-24B | Dense Transformer | 7B, 24B |
| Nemotron-H | Nemotron-H-8B | Attention-Mamba Hybrid | 8B |
We present PRISM (Demystifying Retention and Interaction in Mid-Training), a comprehensive empirical study of mid-training design choices for large language models. Through controlled experiments across seven open-source base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters, we systematically investigate what data to use, when to apply mid-training, how it interacts with reinforcement learning (RL), and whether findings generalize across architectures.
Using targeted mid-training mixtures of only ~27B high-quality tokens, PRISM yields +15 to +40 point math gains, +5 to +12 point code gains, and +6 to +13 point science gains across all tested models, while preserving general-purpose performance. Crucially, data composition choices matter most at mid-training, not at RL: including science data during mid-training unlocks +17 to +30 point GPQA-Diamond gains during RL, while changing the RL mix produces less than 2 point differences. The full PRISM-to-RL pipeline transforms base models scoring under 12 AVG into models scoring 29-42 AVG, a 3-4x improvement. Mechanistically, mid-training densely restructures >90% of weights while RL makes sparse, front-loaded refinements to ~5% of parameters. Benefits hold for both dense Transformers and attention-Mamba hybrids, from 3B to 24B parameters.
PRISM mid-training produces consistent, large gains across domains and model families using only ~27B tokens.
Five principal findings that provide practical guidance for designing mid-training pipelines.
Across all tested models, PRISM yields +15 to +40 pt math gains, +5 to +12 pt code gains, and +6 to +13 pt science gains, while preserving general-purpose performance.
The full PRISM → RL pipeline transforms base models scoring under 12 AVG into models scoring 29-42 AVG, a 3-4x improvement. RL applied directly to base models is substantially less effective, with AIME scores remaining near zero.
Changing the mid-training mix shifts AVG by +4 to +6 pt, while changing the RL mix produces <2 point differences. Science data at mid-training unlocks +17 to +30 pt GPQA-Diamond gains during RL.
Both dense Transformers and attention-Mamba hybrids benefit consistently from PRISM, from 3B to 24B parameters. Mid-training gains are architecture-agnostic.
RL on PRISM-mid-trained models progressively solves prompts that were initially unsolvable, with training curves that remain non-saturating across hundreds of steps.
A three-stage recipe: targeted mid-training with retention-aware data mixtures, optional long-context restoration, then reinforcement learning.
We design three progressively richer data mixtures: Math only, Math + Code, and Math + Code + Science. The table below shows how benchmark performance evolves as domain coverage expands.
| Mid-Training Mix | Code AVG | GPQA-D | Math AVG | Overall AVG |
|---|---|---|---|---|
| Base (no MT) | 2.07 | 22.56 | 8.95 | 11.19 |
| Math only | 2.81 | 17.34 | 36.43 | 18.86 |
| Math + Code | 10.71 | 19.02 | 44.33 | 24.69 |
| Math + Code + Science | 10.58 | 29.12 | 48.75 | 29.48 |
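As a sanity check on the table above, the Overall AVG column is the unweighted (macro) mean of the three domain scores. A minimal sketch:

```python
def macro_avg(code_avg: float, gpqa_d: float, math_avg: float) -> float:
    """Unweighted mean of the three domain scores, rounded to 2 decimals."""
    return round((code_avg + gpqa_d + math_avg) / 3, 2)

# Reproduces the Overall AVG column:
macro_avg(2.07, 22.56, 8.95)     # → 11.19  (Base, no MT)
macro_avg(10.58, 29.12, 48.75)   # → 29.48  (Math + Code + Science)
```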
End-to-end results: base model → PRISM mid-training → RL. AVG is the macro-average over Math, Code, and Science.
| Model | Stage | AIME'24 | AIME'25 | MATH500 | LCB | CF | GPQA-D | AVG |
|---|---|---|---|---|---|---|---|---|
| Granite-3.3-8B | Base | 0.46 | 0.31 | 26.09 | 2.15 | 1.99 | 22.56 | 8.93 |
| | + PRISM | 37.18 | 27.96 | 81.11 | 10.63 | 10.52 | 29.12 | 32.75 |
| | + PRISM + RL | 53.56 | 37.50 | 89.40 | 19.95 | 16.52 | 46.97 | 43.98 |
| LLaMA-3.1-8B | Base | 2.01 | 1.54 | 44.80 | 5.11 | 4.27 | 27.27 | 14.17 |
| | + PRISM | 22.08 | 14.71 | 72.94 | 10.51 | 9.21 | 30.81 | 26.71 |
| | + PRISM + RL | 37.83 | 26.04 | 83.60 | 16.43 | 14.43 | 41.92 | 36.71 |
| Mistral-Small-24B | Base | 6.88 | 4.79 | 57.40 | 7.93 | 6.30 | 31.31 | 19.10 |
| | + PRISM | 35.71 | 26.04 | 82.20 | 17.11 | 12.54 | 39.39 | 35.50 |
| | + PRISM + RL | 47.92 | 36.46 | 87.20 | 22.89 | 18.40 | 49.49 | 43.73 |
| Nemotron-H-8B | Base | 3.13 | 2.29 | 54.00 | 9.80 | 5.62 | 29.29 | 17.36 |
| | + PRISM | 25.83 | 17.50 | 78.00 | 14.07 | 10.85 | 35.35 | 30.27 |
| | + PRISM + RL | 40.63 | 27.71 | 85.40 | 18.65 | 15.22 | 45.45 | 38.84 |
RL on PRISM-mid-trained models produces large, sustained gains across math, code, and science. RL on base models without mid-training is substantially less effective, with AIME scores remaining near zero.
Mid-training at 8k context degrades long-context capabilities. We restore them via linear model merging followed by a brief 128k extension phase.
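The merge step can be sketched as parameter-wise linear interpolation between the pre-mid-training checkpoint (which retains long-context ability) and the mid-trained checkpoint. A minimal sketch; the equal-weight `alpha=0.5` default is an illustrative assumption, not the paper's exact configuration, and flat lists of floats stand in for real framework tensors:

```python
def linear_merge(base_state, midtrained_state, alpha=0.5):
    """Parameter-wise linear merge: (1 - alpha) * base + alpha * mid-trained.

    Each "weight" here is a flat list of floats for illustration; in practice
    these would be tensors from two checkpoints of the same architecture
    (identical keys and shapes). `alpha` is a hypothetical mixing weight.
    """
    merged = {}
    for name, w_base in base_state.items():
        w_mid = midtrained_state[name]
        merged[name] = [(1.0 - alpha) * a + alpha * b
                        for a, b in zip(w_base, w_mid)]
    return merged
```

The merged model is then given a brief 128k-context extension phase rather than being used as-is.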
We investigate how mid-training and RL change models through weight-level divergence, prediction entropy, and correctness analysis across three 8B models.
Normalized L2 divergence by component type. Mid-training changes weights orders of magnitude more than RL.
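A minimal sketch of the metric (normalized L2 divergence per weight tensor, averaged within each component type); flat lists stand in for real tensors, and the component name tags are illustrative assumptions about parameter naming:

```python
import math

def normalized_l2(before, after):
    """Normalized L2 divergence ||after - before|| / ||before|| for one flattened tensor."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))
    return diff / math.sqrt(sum(b ** 2 for b in before))

def divergence_by_component(before_sd, after_sd, tags=("attn", "mlp", "mamba")):
    """Mean divergence over parameters whose name contains each component tag."""
    out = {}
    for tag in tags:
        vals = [normalized_l2(before_sd[k], after_sd[k])
                for k in before_sd if tag in k]
        out[tag] = sum(vals) / len(vals) if vals else None
    return out
```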
| Model | Stage | Pass Rate | Med. Length | Neg Log-Prob | Corr. NLP | Incorr. NLP |
|---|---|---|---|---|---|---|
| Granite-3.3 (8B) | Base | | 444 | 0.237 | 0.203 | 0.240 |
| | MT | | 4,364 | 0.153 | 0.141 | 0.159 |
| | RL | | 2,902 | 0.156 | 0.143 | 0.181 |
| LLaMA-3.1 (8B) | Base | | 10 | 0.706 | — | 0.708 |
| | MT | | 1,029 | 0.387 | 0.164 | 0.447 |
| | RL | | 1,666 | 0.336 | 0.164 | 0.415 |
| Nemotron-H (8B, Hybrid) | Base | | 580 | 0.203 | 0.050 | 0.287 |
| | MT | | 2,186 | 0.238 | 0.134 | 0.287 |
| | RL | | 2,514 | 0.183 | 0.125 | 0.244 |
Response lengths increase dramatically after mid-training: LLaMA's median response grows from 10 tokens to 1,029 tokens (roughly 100x). Models learn multi-step problem decomposition with extended reasoning chains.
RL adjusts response length model-dependently: shortening Granite-3.3 (4,364 → 2,902) while extending LLaMA (1,029 → 1,666), optimizing both quality and efficiency.
Math + Code (MC) and Math + Code + Science (MCS) mid-training produce nearly identical weight changes (L2: 0.177 vs 0.175), yet MCS + RL achieves a GPQA-Diamond score of 52.9 vs only 35.5 for MC + RL.
All three component types in Nemotron-H show nearly identical RL sparsity: Attention (93.5%), MLP (94.5%), and Mamba (93.9%). The sparse pattern is universal.
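Sparsity figures like these can be measured by counting the fraction of parameters whose RL update falls below a small threshold. A minimal sketch; the tolerance `tol` is an illustrative assumption, not the paper's exact criterion, and flat lists stand in for real tensors:

```python
def update_sparsity(before, after, tol=1e-6):
    """Fraction of parameters effectively untouched by an update (|delta| < tol).

    `before` and `after` are flattened parameter lists from the same component
    (e.g. all attention weights); `tol` is a hypothetical threshold.
    """
    unchanged = sum(1 for a, b in zip(before, after) if abs(a - b) < tol)
    return unchanged / len(before)
```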
Most weight changes occur in the first ~200-400 steps, then plateau. The active parameter set grows progressively from ~1.5% to ~5%, and MT vs Base starting points produce identical trajectories.
If you find PRISM useful in your research, please consider citing our paper.
```bibtex
@article{runwal2025prism,
  title={{PRISM}: Demystifying Retention and Interaction in Mid-Training},
  author={Runwal, Bharat and Agrawal, Ashish and Roy, Anurag and Panda, Rameswar},
  journal={arXiv preprint},
  year={2025}
}
```