A comprehensive empirical study of mid-training design choices for LLMs across 4 model families, 2 architecture types, 7 models, and scales from 3B to 24B parameters.
IBM Research
PRISM systematically evaluates mid-training across a diverse set of modern LLMs, spanning multiple families, architectures, and parameter scales.
| Model Family | Models | Architecture | Parameters |
|---|---|---|---|
| Granite | Granite-4 Micro (3B), Granite-3.3-8B | Dense Transformer | 3B, 8B |
| Granite | Granite-4-H Micro (3B) | Attention-Mamba Hybrid | 3B |
| LLaMA | LLaMA-3.1-8B | Dense Transformer | 8B |
| Mistral | Mistral-7B-v0.1, Mistral-Small-24B | Dense Transformer | 7B, 24B |
| Nemotron-H | Nemotron-H-8B | Attention-Mamba Hybrid | 8B |
We present PRISM (Demystifying Retention and Interaction in Mid-Training), a comprehensive empirical study of mid-training design choices for large language models. Through controlled experiments across seven open-source base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters, we systematically investigate what data to use, when to apply mid-training, how it interacts with reinforcement learning (RL), and whether findings generalize across architectures.
Using targeted mid-training mixtures of only ~27B high-quality tokens, PRISM yields +15 to +40 point math gains, +5 to +12 point code gains, and +6 to +13 point science gains across all tested models, while preserving general-purpose performance. Crucially, data composition choices matter most at mid-training, not at RL: including science data during mid-training unlocks +17 to +28 point GPQA-Diamond gains during RL, while changing the RL mix produces less than 2 point differences. The full PRISM-to-RL pipeline transforms base models scoring under 12 AVG into models scoring 29-42 AVG, a 3-4x improvement. Mechanistically, mid-training densely restructures >90% of weights while RL makes sparse, front-loaded refinements to ~5% of parameters. Representational geometry is largely preserved through RL (CKA >0.998 across models and input distributions), while different mid-training data mixtures produce same-magnitude but differently-directed weight updates (cosine similarity 0.52 for Granite-3.3, 0.62 for Nemotron-H). Pass rate landscape interpolation shows a generally increasing pass rate from Base (17%) to Mid-Training (76%) to RL (80%) for Granite-3.3, consistent with mid-training progressively improving the model's configuration for RL. Benefits hold for both dense Transformers and attention-Mamba hybrids, from 3B to 24B parameters.
PRISM mid-training produces consistent, large gains across domains and model families using only ~27B tokens.
Five principal findings that provide practical guidance for designing mid-training pipelines.
Across all tested models, PRISM yields +15 to +40 pt math gains, +5 to +12 pt code gains, and +6 to +13 pt science gains, while preserving general-purpose performance.
The full PRISM → RL pipeline transforms base models scoring under 12 AVG into models scoring 29-42 AVG, a 3-4x improvement. RL applied directly to base models is substantially less effective, with AIME scores remaining near zero.
Changing the mid-training mix shifts AVG by +4 to +6 pt, while changing the RL mix produces <2 point differences. Science data at mid-training unlocks +17 to +28 pt GPQA-Diamond gains during RL.
Both dense Transformers and attention-Mamba hybrids benefit consistently from PRISM, from 3B to 24B parameters. Mid-training gains are consistent across all four model families tested.
For Granite-3.3, RL on PRISM-mid-trained models progressively solves prompts that were initially unsolvable, with training curves that remain non-saturating across hundreds of steps.
CKA analysis shows RL preserves mid-training representations (CKA >0.998). Pass rate landscape on held-out MATH500 shows a generally increasing path from Base (17%) to MT (76%) to RL (80%) for Granite-3.3, consistent with mid-training improving the model's configuration for subsequent RL.
A three-stage recipe: targeted mid-training with retention-aware data mixtures, optional long-context restoration, then reinforcement learning.
We design three progressively richer data mixtures; the table below shows how domain coverage evolves and the corresponding benchmark performance. Overall AVG is the macro-average of Code AVG, GPQA-D, and Math AVG.
| Mid-Training Mix | Code AVG | GPQA-D | Math AVG | Overall AVG |
|---|---|---|---|---|
| Base (no MT) | 2.07 | 22.56 | 8.95 | 11.19 |
| Math only | 2.81 | 17.34 | 36.43 | 18.86 |
| Math + Code | 10.71 | 19.02 | 44.33 | 24.69 |
| Math + Code + Science | 10.58 | 29.12 | 48.75 | 29.48 |
End-to-end results: base model → PRISM mid-training → RL. AVG is the mean over the six benchmarks (AIME'24, AIME'25, MATH500, LCB, CF, GPQA-D).
| Model | Stage | AIME'24 | AIME'25 | MATH500 | LCB | CF | GPQA-D | AVG |
|---|---|---|---|---|---|---|---|---|
| Granite-3.3-8B | Base | 0.46 | 0.31 | 26.09 | 2.15 | 1.99 | 22.56 | 8.93 |
| | + PRISM | 37.18 | 27.96 | 81.11 | 10.63 | 10.52 | 29.12 | 32.75 |
| | + PRISM + RL | 40.94 | 32.03 | 84.76 | 19.59 | 20.82 | 51.51 | 41.60 |
| LLaMA-3.1-8B | Base | 0.05 | 0.15 | 6.51 | 0.00 | 0.07 | 20.20 | 4.50 |
| | + PRISM | 16.45 | 19.32 | 73.47 | 6.09 | 5.44 | 21.04 | 23.64 |
| | + PRISM + RL | 21.15 | 22.97 | 77.21 | 15.05 | 12.43 | 39.39 | 31.37 |
| Mistral-Small-24B | Base | 0.78 | 0.73 | 26.92 | 0.00 | 0.29 | 22.55 | 8.55 |
| | + PRISM | 32.91 | 27.34 | 80.80 | 10.03 | 10.08 | 22.05 | 30.54 |
| | + PRISM + RL | 39.69 | 32.40 | 86.89 | 16.97 | 17.59 | 50.00 | 40.59 |
| Nemotron-H-8B | Base | 2.13 | 2.29 | 49.46 | 1.19 | 3.60 | 4.21 | 10.48 |
| | + PRISM | 19.21 | 22.76 | 76.63 | 13.02 | 10.52 | 31.98 | 29.02 |
| | + PRISM + RL | 29.95 | 28.54 | 84.47 | 19.59 | 15.38 | 41.24 | 36.53 |
RL on PRISM-mid-trained models produces large, sustained gains across math, code, and science. RL on base models without mid-training is substantially less effective, with AIME scores remaining near zero.
Mid-training at 8k context degrades long-context capabilities. We restore them via linear model merging followed by a brief 128k extension phase.
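The merge step amounts to a convex combination of checkpoints. A minimal sketch, assuming plain per-tensor linear interpolation with a hypothetical mixing weight `alpha` (the exact merge ratio used in the paper is not stated here; scalar weights stand in for tensors):

```python
def linear_merge(long_ctx_state, mid_trained_state, alpha=0.5):
    """Linearly merge two checkpoints: alpha * mid-trained + (1 - alpha) * long-context.

    `alpha` is a hypothetical mixing weight, not a value from the paper.
    In practice each entry would be a weight tensor; floats keep the sketch minimal.
    """
    return {
        name: alpha * mid_trained_state[name] + (1 - alpha) * long_ctx_state[name]
        for name in long_ctx_state
    }
```

The merged model is then briefly trained at 128k context to restore long-range behavior.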
We investigate how mid-training and RL change models through weight-level divergence, CKA representation analysis, weight direction comparisons, pass rate landscape interpolation, prediction entropy, and correctness analysis across three 8B models.
Normalized L2 divergence by component type. Mid-training changes weights orders of magnitude more than RL.
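One common convention for this metric, sketched under the assumption that divergence is the relative L2 change over a component's flattened weights (the paper's exact normalization may differ):

```python
import math

def normalized_l2_divergence(w_before, w_after):
    """Relative L2 change: ||after - before||_2 / ||before||_2 over flat weights.

    Assumed convention for 'normalized L2 divergence'; large values mean the
    stage restructured the component, small values mean light refinement.
    """
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_after, w_before)))
    base = math.sqrt(sum(b * b for b in w_before))
    return diff / base
```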
Centered Kernel Alignment (CKA) between mid-trained and RL model representations across inputs and models. Values >0.998 indicate that RL refines without restructuring representations.
MC and MCS mid-training produce nearly identical weight change magnitudes but different directions, suggesting that data composition at mid-training shapes the weight configuration that RL subsequently refines.
Interpolating model weights along the Base → MT → RL training path reveals a generally increasing pass rate along the training path, consistent with mid-training progressively improving the model's configuration for subsequent RL.
Animation: Dot moves from Base → MT → RL on the 2D pass rate landscape. The dashed white isoline and colorbar indicator track the current pass rate level.
| Model | Stage | Pass Rate | Med. Length | Neg Log-Prob | Corr. NLP | Incorr. NLP |
|---|---|---|---|---|---|---|
| Granite-3.3 (8B) | Base | 17% | 120 | 0.382 | — | 0.383 |
| | MT | 76% | 2,254 | 0.138 | 0.128 | 0.153 |
| | RL | 80% | 1,700 | 0.141 | 0.135 | 0.160 |
| LLaMA-3.1 (8B) | Base | 3% | 158 | 0.758 | — | 0.780 |
| | MT | 44% | 1,052 | 0.377 | 0.146 | 0.469 |
| | RL | 66% | 1,188 | 0.267 | 0.149 | 0.320 |
| Nemotron-H (8B, Hybrid) | Base | — | 452 | 0.167 | 0.040 | 0.258 |
| | MT | — | 1,928 | 0.150 | 0.116 | 0.156 |
| | RL | — | 1,780 | 0.127 | 0.112 | 0.137 |
Response lengths increase dramatically after mid-training: LLaMA's median response grows from 158 to 1,052 tokens (~7x) as models learn multi-step problem decomposition with extended reasoning chains.
RL adjusts response length in a model-dependent way: Granite-3.3 shortens (2,254 → 1,700 tokens) while LLaMA changes only modestly (1,052 → 1,188), balancing reasoning quality against token efficiency.
MC and MCS mid-training produce nearly identical weight change magnitudes (L2: 0.177 vs 0.175), yet MCS+RL achieves GPQA-Diamond of 52.9 vs only 35.5 for MC+RL. Cosine similarity between the two update directions is only 0.52 (G33) and 0.62 (Nemotron-H).
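The direction comparison can be sketched as the cosine similarity between flattened weight deltas relative to the shared base checkpoint; that this is how the 0.52/0.62 figures are computed is an assumption here:

```python
import math

def update_direction_cosine(base, ckpt_a, ckpt_b):
    """Cosine similarity between update directions (ckpt_a - base) and (ckpt_b - base).

    1.0 means the two mixtures moved the weights the same way; values near 0.5
    mean same-magnitude but substantially different directions.
    """
    da = [a - w for a, w in zip(ckpt_a, base)]
    db = [b - w for b, w in zip(ckpt_b, base)]
    dot = sum(x * y for x, y in zip(da, db))
    return dot / (math.sqrt(sum(x * x for x in da)) * math.sqrt(sum(y * y for y in db)))
```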
All three component types in Nemotron-H show nearly identical RL sparsity: Attention (93.5%), MLP (94.5%), and Mamba (93.9%). The sparse pattern is universal.
Most weight changes occur in the first ~200-400 steps, then plateau. The active parameter set grows progressively from ~1.5% to ~5%, and MT vs Base starting points produce similarly scaled and sparse update trajectories.
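The "active parameter set" can be estimated by thresholding per-parameter change magnitude. A sketch with a hypothetical cutoff `tau` (the paper's threshold is not stated here):

```python
def active_fraction(w_before, w_after, tau=1e-5):
    """Fraction of parameters whose absolute change exceeds tau.

    A return of ~0.05 corresponds to the ~5% active / ~95% sparse RL updates
    described above. `tau` is a hypothetical cutoff for this sketch.
    """
    changed = sum(1 for b, a in zip(w_before, w_after) if abs(a - b) > tau)
    return changed / len(w_before)
```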
CKA similarity between mid-trained and RL representations exceeds 0.998 across all three models (Granite-3.3, LLaMA-3.1, Nemotron-H) on Wikipedia, C4, and GSM8K inputs. RL consistently refines rather than restructures the learned representations.
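For reference, the widely used linear CKA formulation (Kornblith et al.) is sketched below; whether PRISM uses the linear or kernel variant is an assumption:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between (n_samples, dim) representation matrices X and Y.

    Invariant to isotropic scaling and orthogonal transforms of either space;
    1.0 indicates identical representational geometry.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Xc.T @ Yc, "fro") ** 2
    return hsic / (np.linalg.norm(Xc.T @ Xc, "fro") * np.linalg.norm(Yc.T @ Yc, "fro"))
```

In this setup `X` and `Y` would hold hidden states from the mid-trained and RL checkpoints on the same batch of Wikipedia, C4, or GSM8K inputs.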
Pass rate interpolation on held-out MATH500 shows a generally increasing trend (17%→76%→80% for G33; 3%→44%→66% for LLaMA, 8 samples/prompt). The 2D landscape shows the RL direction as consistently high-reward with no apparent sharp barriers.
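A minimal sketch of the interpolation itself; `evaluate_pass_rate` is a hypothetical hook standing in for the paper's MATH500 evaluation, and scalar weights stand in for tensors:

```python
def interpolate_checkpoints(state_a, state_b, t):
    """Pointwise linear interpolation (1 - t) * a + t * b between two checkpoints."""
    return {k: (1 - t) * state_a[k] + t * state_b[k] for k in state_a}

def sweep(state_a, state_b, evaluate_pass_rate, steps=5):
    """Evaluate pass rate at evenly spaced points along the a -> b segment.

    Running this for Base -> MT and MT -> RL traces the 1D training path;
    pairing the two directions gives the 2D landscape.
    """
    return [
        evaluate_pass_rate(interpolate_checkpoints(state_a, state_b, i / steps))
        for i in range(steps + 1)
    ]
```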
If you find PRISM useful in your research, please consider citing our paper.
@misc{runwal2026prismdemystifyingretentioninteraction,
title={PRISM: Demystifying Retention and Interaction in Mid-Training},
author={Bharat Runwal and Ashish Agrawal and Anurag Roy and Rameswar Panda},
year={2026},
eprint={2603.17074},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2603.17074},
}