Procedural Memory Distillation

Abstract

Reinforcement learning with verifiable rewards (RLVR), along with recent self-distillation variants such as SDPO, evaluates each rollout against a verifier and updates the policy from that episode-level signal. However, the richer procedural information in the rollout is rarely retained or reused. We propose Procedural Memory Distillation (PMD), which converts cross-episode signals into reusable procedural memory and distills it into the policy's weights during training. The memory serves only as a training scaffold: it is absorbed into the policy itself, so the resulting model needs no memory at inference. PMD organizes the memory at three levels: raw trajectories, self-reflected strategies and lessons, and higher-level behavioral patterns. A memory-conditioned self-teacher draws on accumulated experience to supervise the student, enabling progressive internalization of procedural knowledge. The central design principle is co-evolution: the policy generates rollouts that update memory, and memory shapes the supervision that updates the policy.

Why PMD?

RLVR and self-distillation methods such as SDPO usually learn from episode-local feedback: a rollout is verified, the policy is updated, and the richer procedural information in that rollout is discarded. PMD instead preserves cross-episode learning signals that answer four questions:

✓

Consistent Strategies

Which strategies consistently pass verification?

⚠️

Recurring Mistakes

Which mistakes recur across attempts?

🔄

Transferable Behaviors

Which reasoning behaviors transfer across related problems?

🔗

Memory Evolution

How should memory evolve as the policy itself changes?

PMD treats memory as a training scaffold, not an inference-time dependency. The final student model performs inference without external memory retrieval.

Key Contributions

1. Procedural Memory Distillation Framework

We propose a self-distillation framework that turns episode-local supervision into cross-episode procedural memory, built online from the model's own rollouts and distilled into the policy's weights. The resulting model reasons natively at inference, with no external memory dependency.

2. Co-Evolution as Design Principle

PMD updates policy and memory jointly: rollouts from the current policy refresh memory, and the refreshed memory shapes the supervision that trains the next policy. This online coupling distinguishes PMD from static or offline memory banks and, as the freezing ablation shows, is a key source of the performance gains.

3. Three-Level Memory Hierarchy

We propose three levels of memory (experience, insight, and behavior) and characterize the trade-off between fidelity and transferability across them. Experience memory preserves faithful evidence but remains local; behavior memory transfers broadly but can become coarse; insight memory strikes a balance by retaining problem-grounded lessons in compact form.

Approach at a Glance

1

Student Attempts & Verification

The student policy makes multiple attempts at solving problems and receives verifier feedback on correctness. Each rollout provides rewards, feedback, and concrete evidence of reasoning patterns.

2

Self-Reflection & Memory Building

Successes and failures are summarized into online procedural memory at three levels: raw trajectories (experience), strategies and lessons (insight), and cross-problem behavioral patterns (behavior).

3

Memory-Conditioned Supervision

The teacher retrieves relevant memory to provide richer supervision. The student distills this guidance into its weights, progressively internalizing procedural knowledge for memory-free inference.
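The three steps above can be sketched as a single epoch loop. This is a minimal illustration, not the authors' implementation: `pmd_epoch`, its callable arguments (`generate`, `verify`, `reflect`, `distill`), and the dictionary layout of `memory` are all hypothetical names, and retrieval is reduced to a plain per-problem lookup.

```python
from typing import Callable, Dict, List, Tuple

Rollout = Tuple[str, bool]      # (answer, passed verification?)
Memory = Dict[str, List[str]]   # problem -> accumulated notes (hypothetical layout)


def pmd_epoch(
    problems: List[str],
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
    reflect: Callable[[str, List[Rollout]], List[str]],
    distill: Callable[[str, List[Rollout], List[str]], None],
    memory: Memory,
    n_attempts: int = 4,
) -> Memory:
    """Run one epoch of the three steps; every callable is a placeholder."""
    for problem in problems:
        # Step 1: the student attempts the problem and each attempt is verified.
        rollouts: List[Rollout] = []
        for _ in range(n_attempts):
            answer = generate(problem)
            rollouts.append((answer, verify(problem, answer)))

        # Step 2: self-reflection turns the rollouts into memory entries.
        memory.setdefault(problem, []).extend(reflect(problem, rollouts))

        # Step 3: the teacher sees the retrieved memory and the student is
        # updated toward its guidance (assumed: lookup by problem key only).
        distill(problem, rollouts, memory[problem])
    return memory


# Toy usage with stub components; no real model is involved.
distill_calls = []
memory = pmd_epoch(
    problems=["2+2"],
    generate=lambda p: "4",
    verify=lambda p, a: a == "4",
    reflect=lambda p, rs: [f"{sum(ok for _, ok in rs)}/{len(rs)} attempts passed"],
    distill=lambda p, rs, notes: distill_calls.append((p, list(notes))),
    memory={},
    n_attempts=2,
)
print(memory)
```

Because the same `memory` object is passed into the next epoch, the notes the teacher sees always come from the most recent policy, which is the coupling the page calls co-evolution.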

PMD Overview

Figure 1: Overview of Procedural Memory Distillation. The student makes attempts and receives verification; self-reflection builds procedural memory at three levels; the memory-conditioned teacher provides supervision; and the guidance is distilled into the student for the next epoch.
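The page does not state which loss is used for the distillation step in the overview. One common choice, assumed here purely for illustration, is a per-token KL divergence from the teacher's next-token distribution (computed with memory in the prompt) to the student's (computed without it). The function names below are hypothetical, and the logits are toy numbers rather than model outputs.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(teacher_logits: List[float], student_logits: List[float]) -> float:
    """KL(teacher || student) for a single token position."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(t, s))


def distillation_loss(
    teacher_logits: List[List[float]],  # teacher sees problem + retrieved memory
    student_logits: List[List[float]],  # student sees the problem only
) -> float:
    """Mean per-token KL over a sequence (assumed objective, not from the source)."""
    if not teacher_logits or len(teacher_logits) != len(student_logits):
        raise ValueError("need two non-empty sequences of equal length")
    per_token = [token_kl(t, s) for t, s in zip(teacher_logits, student_logits)]
    return sum(per_token) / len(per_token)


# Toy check: the loss vanishes when the student already matches the teacher.
same = distillation_loss([[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0]])
diff = distillation_loss([[1.0, 2.0, 3.0]], [[3.0, 2.0, 1.0]])
print(same, diff)
```

Under this assumption, driving the loss toward zero means the student reproduces the memory-conditioned behavior without the memory in its context, which is what memory-free inference requires.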

Comparison with Existing Approaches

Method	Evolving Policy	Memory-Free Inference	Persistent Memory	Evolving Memory	Policy-Memory Co-Evolution
Base Model	✗	✓	✗	✗	✗
GRPO / RLVR	✓	✓	✗	✗	✗
SDPO / OPD	✓	✓	✗	✗	✗
Inference-Time Memory Agents	✗	✗	✓	✓	✗
PMD (Ours)	✓	✓	✓	✓	✓

PMD is the only method in this comparison that combines an evolving policy with persistent, evolving memory during training while keeping inference memory-free by distilling the memory into model weights.

Memory Hierarchy

Level 0

Experience Memory

Raw trajectories, rollouts, rewards, and verifier feedback for each problem. Faithful but local evidence.

→

Level 1

Insight Memory

Self-reflected strategies from successful attempts and lessons from failures. Problem-grounded, compact, and reusable.

→

Level 2

Behavior Memory

Cross-problem reasoning patterns distilled from clustered insights. Broadly transferable but more abstract.

Figure 1: The three-level procedural memory hierarchy in PMD. Each level represents a different trade-off between fidelity and transferability.
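The hierarchy can be pictured as three record types held in one container. The sketch below is an assumption about layout, not the authors' data model: the class names, field names, and the `retrieve` policy (problem-local lookup for Levels 0 and 1, all behaviors returned for Level 2) are hypothetical simplifications.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Experience:
    """Level 0: one raw rollout with its verifier outcome."""
    problem_id: str
    trajectory: str
    reward: float
    feedback: str


@dataclass
class Insight:
    """Level 1: a strategy (from a success) or a lesson (from a failure)."""
    problem_id: str
    kind: str  # "strategy" or "lesson"
    text: str


@dataclass
class Behavior:
    """Level 2: a cross-problem pattern summarizing a cluster of insights."""
    pattern: str
    source_problems: List[str]


@dataclass
class ProceduralMemory:
    experiences: Dict[str, List[Experience]] = field(default_factory=dict)
    insights: Dict[str, List[Insight]] = field(default_factory=dict)
    behaviors: List[Behavior] = field(default_factory=list)

    def add_experience(self, exp: Experience) -> None:
        self.experiences.setdefault(exp.problem_id, []).append(exp)

    def add_insight(self, ins: Insight) -> None:
        if ins.kind not in ("strategy", "lesson"):
            raise ValueError(f"unknown insight kind: {ins.kind!r}")
        self.insights.setdefault(ins.problem_id, []).append(ins)

    def add_behavior(self, beh: Behavior) -> None:
        self.behaviors.append(beh)

    def retrieve(self, problem_id: str) -> Dict[str, list]:
        """Levels 0-1 are looked up by problem; Level 2 is shared by all problems."""
        return {
            "experience": list(self.experiences.get(problem_id, [])),
            "insight": list(self.insights.get(problem_id, [])),
            "behavior": list(self.behaviors),
        }


mem = ProceduralMemory()
mem.add_experience(Experience("p1", "tried substitution", 1.0, "correct"))
mem.add_insight(Insight("p1", "strategy", "substitute before expanding"))
mem.add_behavior(Behavior("check units before answering", ["p1", "p2"]))
hit = mem.retrieve("p1")
miss = mem.retrieve("p2")
```

The lookup for the unseen problem `p2` returns no experience or insight but still returns the behavior entry, which mirrors the trade-off in the caption: lower levels are faithful but local, the top level is abstract but transfers.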

Main Results

SciKnowEval

Qwen3-8B Improvement +3.8%

OLMo3-Instruct-7B Improvement +5.5%

Relative improvement over the SDPO baseline on science reasoning tasks

LiveCodeBench

Qwen3-8B Improvement +7.9%

OLMo3-Instruct-7B Improvement +13.6%

Relative improvement over the SDPO baseline on code generation tasks

Comprehensive Comparison

Model	Method	SciKnowEval AVG	LiveCodeBench v6
Qwen3-8B	Base Policy	47.9	27.1
	GRPO	69.4	41.2
	SDPO	74.4	47.9
	PMD	77.2	51.7
OLMo3-Instruct-7B	Base Policy	27.7	27.7
	GRPO	63.9	36.1
	SDPO	69.5	45.0
	PMD	73.3	51.1

Table 2: Comprehensive comparison across both model families and benchmarks. PMD scores highest in every setting, ahead of both GRPO and SDPO.
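The headline percentages in Main Results are relative gains over SDPO, and they can be recomputed from the SDPO and PMD rows of Table 2. The short check below uses only the scores in the table; the helper name `relative_gain` is ours.

```python
# (SDPO score, PMD score) pairs copied from Table 2.
table2 = {
    ("Qwen3-8B", "SciKnowEval AVG"): (74.4, 77.2),
    ("OLMo3-Instruct-7B", "SciKnowEval AVG"): (69.5, 73.3),
    ("Qwen3-8B", "LiveCodeBench v6"): (47.9, 51.7),
    ("OLMo3-Instruct-7B", "LiveCodeBench v6"): (45.0, 51.1),
}


def relative_gain(baseline: float, new: float) -> float:
    """Percentage improvement of `new` over `baseline`."""
    return 100.0 * (new - baseline) / baseline


gains = {key: round(relative_gain(*scores), 1) for key, scores in table2.items()}
for (model, benchmark), gain in gains.items():
    print(f"{model:18s} {benchmark:17s} +{gain}%")
```

The results are +3.8% and +5.5% on SciKnowEval and +7.9% and +13.6% on LiveCodeBench, matching the reported figures. In absolute terms the same rows differ by 2.8 to 6.1 points.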

Training Dynamics

PMD improves steadily across training epochs: as policy and memory co-evolve, the model progressively internalizes procedural knowledge rather than plateauing early.

Figure 2: Training dynamics showing performance improvement across epochs for Qwen3-8B. PMD achieves consistent gains over the SDPO baseline.

The Importance of Co-Evolution

🔄

Co-Evolution is Essential

Variants that freeze either memory or policy substantially underperform full PMD, showing that memory must stay aligned with the changing policy. The tight coupling between policy evolution and memory updates is crucial for performance gains.

Table 3: Freezing either component (memory or policy) leads to significant performance degradation compared to full co-evolution.

Test-Time Compute Scaling

📈

Better Answer Coverage and Scaling

PMD preserves broader answer-space coverage than SDPO, leaving more headroom for verifier reranking and best-of-N sampling. This enables better scaling with additional inference-time compute, opening 2-4× wider verifier headroom on SciKnowEval.

Figure 4: Answer coverage and pass@k scaling behavior. PMD maintains higher diversity in generated answers while achieving better pass@k performance across different k values.
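For readers unfamiliar with pass@k, the sketch below shows the widely used combinatorial estimator: given n sampled answers of which c pass the verifier, it returns the probability that at least one of k answers drawn without replacement is correct. The page does not say which estimator produced Figure 4, so treat this as background rather than a description of the authors' evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Toy example: 10 samples, 3 of them correct.
for k in (1, 2, 5, 8):
    print(k, round(pass_at_k(10, 3, k), 4))
```

A policy whose samples cover more distinct answers tends to have a nonzero c on more problems, so its pass@k curve keeps rising with k. That is the headroom that verifier reranking and best-of-N sampling can exploit.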

Key Takeaways

1

Co-evolution is essential. PMD improves over SDPO by a relative 3.8-5.5% on SciKnowEval and 7.9-13.6% on LiveCodeBench. Variants that freeze either the memory or the policy trail PMD by more than 10% across domains, confirming that joint evolution of policy and memory is crucial for performance gains.

2

Memory-free inference with training-time scaffolding. Unlike memory-augmented agents that rely on external memory at inference time, PMD uses procedural memory only during training. The student progressively internalizes this knowledge, yielding a memory-free model that reasons natively without retrieval overhead.

3

Better test-time compute scaling. PMD continues to gain from additional rollouts where SDPO saturates, so the gap widens as the rollout budget grows, opening 2-4× wider verifier headroom on SciKnowEval. PMD therefore makes better use of additional inference-time computation.