Research Architecture · AI-CLS

Complementary Learning Systems for LLMs

A framework that makes language models learn from their own reasoning — just like the human brain does during sleep.

WHY WE NEED TO CARE

The Problem No One Is Talking About

Every time an LLM solves a hard problem well, that experience is thrown away. The model ends the session exactly as capable as it started.

🧠

Your brain keeps getting better

Every experience you have changes your neural connections. You wake up slightly smarter. The brain replays memories during sleep to consolidate skills.

🤖

LLMs don't learn from use

No matter how well an LLM reasons through a problem, the weights stay frozen. Useful reasoning trajectories are discarded. Nothing is consolidated.

📚

RAG retrieves blindly

Standard retrieval only uses the initial query. It doesn't know where you are in the reasoning chain. You get the same documents regardless of what you've already figured out.

🔁

No feedback loop exists

Knowledge flows one way: documents → model. The model's reasoning never improves the knowledge store or its own parameters. The two systems are completely isolated.

Theoretical Foundation

CLS Theory: Biology → AI

Complementary Learning Systems (McClelland et al., 1995) explains how the hippocampus and neocortex work together to learn: one stores specific episodes quickly, the other generalizes slowly. We map this directly to LLM architecture.

🧠 Brain (Biology)

🤖 AI-CLS Equivalent

Hippocampus — fast, specific memory storage

→

External Memory M — vector store, session context cache

Stores exact episodes ("what happened today")

→

Stores exact documents + reasoning artifacts from session

Neocortex — slow, generalized skill learning

→

LLM Parameters θ — the reasoning policy π_θ

Encodes general patterns across many experiences

→

Encodes general reasoning skills via weight updates

Sleep replay — hippocampus replays to neocortex

→

PRM-guided consolidation — session trajectories → fine-tune θ

Objective

Build a system where retrieval is guided by live reasoning state, correct reasoning is evaluated step-by-step by a reward model, and high-quality reasoning trajectories from each session are consolidated into the LLM's parameters — so the model gets genuinely better from experience.

Problems Being Solved

What Exactly Are We Fixing?

PROBLEM 01

Blind Retrieval

RAG uses only the original query. By step 3 of reasoning, the query alone no longer describes what the model needs, but retrieval doesn't know that.

PROBLEM 02

No Step-Level Validation

Outcome reward models score only final answers. Flawed intermediate steps that happen to reach a correct answer go undetected, and get reinforced.

PROBLEM 03

Experience is Discarded

After every session, the full context cache — including hard-won reasoning steps — is erased. Nothing is learned.

PROBLEM 04

Static Parameters

The LLM's weights never update from inference. A model that solved 1000 hard problems is no better than one that solved zero.

Phase 1

Unification — Reasoning-Guided Retrieval + PRM Consolidation

Connect the two systems. Make retrieval aware of reasoning. Make good reasoning consolidate into weights.

STEP 1.1

Reasoning-Guided Retrieval

Stop retrieving from the question. Start retrieving from where you are in the reasoning chain.

Standard RAG — What Happens Today

Situation

User asks a complex question requiring multi-step reasoning.

Problem

Retrieval happens once, using only the original query, before any reasoning begins.

Consequence

Documents retrieved are relevant to the question but not to where the model is in its reasoning.

d_i = R(q) // only knows the original question

EXAMPLE

Query

"Why did the 2008 financial crisis happen?"

Retrieved

"General overview of 2008 crisis, mortgage definitions, Lehman Brothers bankruptcy"

Problem

By reasoning step 3, the model needs specifics about CDO pricing failures — but retrieval already happened and returned generic docs.

↓ Replace this with →

State-Conditioned Retrieval — What We Do

Situation

We have the original query q and the model's current reasoning step s_t.

Task

Retrieve documents that are relevant not just to the question, but to this specific point in reasoning.

Action

Encode both q and s_t together into the retrieval query. Re-query the memory store at every reasoning step.

Result

Retrieval gets progressively more targeted as reasoning narrows toward the answer.

s_t ~ π_θ(· | q, s_<t, d_<t) // model reasons to step t, given earlier steps and documents
d_t = R(q, s_t) // retrieve using BOTH query + current state
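A minimal sketch of the difference between R(q) and R(q, s_t), assuming a toy bag-of-words embedding in place of a real sentence encoder. The names `embed`, `cosine`, `retrieve` and the three-document `MEMORY` are illustrative stand-ins, not part of AI-CLS itself.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (stand-in for a real sentence encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, state: str, memory: list[str], k: int = 1) -> list[str]:
    """R(q, s_t): rank memory by similarity to the query and the current step together."""
    joint = embed(f"{query} {state}")
    return sorted(memory, key=lambda doc: cosine(joint, embed(doc)), reverse=True)[:k]

MEMORY = [
    "overview of the 2008 financial crisis",
    "cdo pricing and rating agency models",
    "lehman brothers bankruptcy timeline",
]
q = "why did the 2008 financial crisis happen"

print(retrieve(q, "", MEMORY))  # empty state: behaves like R(q)
print(retrieve(q, "cdo pricing relied on rating agency models", MEMORY))
```

With an empty state the generic overview ranks first. Once the state mentions CDO pricing, the same query surfaces the CDO document instead.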

↓ What this looks like step by step →

The Retrieval Loop — Step by Step Example

Situation

Model is mid-reasoning. Each step generates a new state that can trigger a new, better retrieval.

Task

Show how retrieval quality improves as reasoning progresses.

Action

At each step t, embed [q + s_t] → approximate nearest-neighbor (ANN) search over M → top-k new documents → appended to context.

Result

A self-improving loop: better reasoning → better retrieval → better reasoning.

STEP-BY-STEP EXAMPLE — "Why did 2008 crisis happen?"

t=0

State: (empty) → Retrieves: general 2008 overview

t=1

State: "crisis involved mortgage-backed securities..." → Retrieves: CDO structure docs, MBS pricing

t=2

State: "CDOs were mispriced because rating agencies used..." → Retrieves: rating agency incentive models, Basel II rules

t=3

State: "Regulators failed to act because..." → Retrieves: Federal Reserve meeting notes, SEC oversight docs

Result

Each retrieval is narrower and more targeted than the last, so the model's context tracks what the current step actually needs.
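The loop above can be sketched as a small driver that alternates retrieval and generation. This is a sketch under stated assumptions: `next_step` stands in for π_θ, `retrieve` stands in for R(q, s_t), and `SCRIPT` and `DOCS` are hypothetical stubs so the example runs without a model or a vector store.

```python
from typing import Callable

def reasoning_loop(
    query: str,
    next_step: Callable[[str, list[str], list[str]], str],  # stand-in for the policy
    retrieve: Callable[[str, str], list[str]],              # stand-in for R(q, s_t)
    max_steps: int,
) -> tuple[list[str], list[str]]:
    """Alternate retrieval and generation; return (trajectory, context documents)."""
    steps: list[str] = []
    context: list[str] = []
    state = ""  # t = 0: no reasoning yet, so retrieval sees only the query
    for _ in range(max_steps):
        for doc in retrieve(query, state):
            if doc not in context:  # append only documents not already in context
                context.append(doc)
        state = next_step(query, steps, context)
        steps.append(state)
    return steps, context

# Hypothetical stubs: a scripted "model" and a lookup-table "memory".
SCRIPT = ["crisis involved mortgage-backed securities", "cdos were mispriced by rating agencies"]
DOCS = {
    "": ["2008 overview"],
    SCRIPT[0]: ["cdo structure", "mbs pricing"],
    SCRIPT[1]: ["rating agency incentives"],
}

steps, context = reasoning_loop(
    "why did the 2008 crisis happen",
    next_step=lambda q, done, ctx: SCRIPT[len(done)],
    retrieve=lambda q, s: DOCS.get(s, []),
    max_steps=2,
)
print(steps)
print(context)
```

Passing the policy and retriever in as callables keeps the loop independent of any particular model or index.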

State-conditioned retrieval gives the model exactly the right knowledge at the right moment

STEP 1.2

PRM-Guided Consolidation

Score every reasoning step. Keep only the best. Write them into model weights.

Generate the Full Reasoning Trajectory

Situation

Step 1.1 has produced a complete reasoning chain to the answer.

Task

Capture the trajectory τ — the sequence of all reasoning steps — as a learning signal.

Action

Collect τ = (s_1, s_2, ..., s_T). This is the full chain from question to answer.

Result

We now have something concrete to evaluate: not just the answer, but every step that led to it.

τ = (s_1, s_2, ..., s_T) // full reasoning chain

EXAMPLE — τ for the 2008 question

s_1

"The crisis started with US housing market..."

s_2

"MBS were packaged and sold as low-risk, backed by high grades from ratings agencies..."

s_3

"When housing prices fell, MBS became worthless, triggering bank insolvencies..."

s_T

"Therefore, the root cause was mispriced risk + regulatory failure."

↓ Now evaluate each step →

PRM Scores Every Step

Situation

We have trajectory τ. We need to know which steps were logically valid.

Task

Score each intermediate step for correctness — not just the final answer.

Action

Pass each s_t through the Process Reward Model r_φ. The PRM is trained separately on step-level human annotations.

Result

Each step gets a quality score. We can now see where reasoning went right or wrong.

r_t = r_φ(s_t) // score each step
R(τ) = r_1 + r_2 + ... + r_T // total trajectory score

EXAMPLE — PRM scoring τ

s_1 → 0.9

Valid starting point, factually grounded

s_2 → 0.8

Correct causal link, well-supported by retrieved docs

s_3 → 0.3

Skipped a key mechanism — partially wrong

s_T → 0.9

Conclusion correct, but the s_3 gap pulls the overall R(τ) down to moderate
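As a worked check of the arithmetic, the sketch below sums the four scores shown in the example, treating them as the whole trajectory. The mean and minimum are extra aggregates added here for illustration (an assumption, not part of the definition of R(τ)); they show how one weak step can hide inside a sum.

```python
import math

def trajectory_score(step_scores: list[float]) -> float:
    """R(tau) = r_1 + ... + r_T, the sum used in the definition above."""
    return math.fsum(step_scores)

scores = [0.9, 0.8, 0.3, 0.9]  # the four r_t values shown in the example

total = trajectory_score(scores)
mean = total / len(scores)      # length-normalised variant (assumption, not in the source)
weakest = min(scores)           # surfaces the s_3 gap that the sum averages away

print(round(total, 2), round(mean, 3), weakest)
```

The total is 2.9 out of a possible 4.0. A filter on the sum alone would treat this the same as four steps of about 0.725 each, which is why the per-step scores are kept alongside R(τ).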

↓ Filter → Consolidate →

Filter + Fine-Tune (The Consolidation)

Situation

We have scored trajectories from many sessions. Some are good, some are flawed.

Task

Use only high-quality trajectories to update θ. Discard the rest.

Action

Threshold filter: keep τ if R(τ) ≥ threshold. Run policy gradient to increase probability of those reasoning strategies.

Result

Model weights θ now encode verified reasoning patterns. The model is slightly better at this type of reasoning in all future sessions.

if R(τ) ≥ threshold: // filter: only keep high-quality trajectories
    θ ← θ + α ∇_θ 𝔼[Σ_t r_t] // update weights
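To make the filtered update concrete, here is a minimal sketch on a toy problem: the "policy" is a softmax over three hypothetical reasoning strategies, and each trajectory is reduced to a (strategy, R(τ)) pair. The function name, the batch values and the REINFORCE-style estimator are assumptions for illustration; a real system would update LLM weights with a training framework.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def filtered_reinforce_step(
    theta: list[float],
    batch: list[tuple[int, float]],  # (strategy index a, trajectory return R)
    threshold: float,
    alpha: float,
) -> list[float]:
    """One step: theta += alpha * mean over kept trajectories of R * grad log pi(a)."""
    kept = [(a, r) for a, r in batch if r >= threshold]  # the R >= threshold filter
    if not kept:
        return list(theta)  # nothing qualified, weights unchanged
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for a, r in kept:
        for i in range(len(theta)):
            # For a softmax policy, d log pi(a) / d theta_i = 1[i == a] - pi(i).
            grad[i] += r * ((1.0 if i == a else 0.0) - probs[i]) / len(kept)
    return [t + alpha * g for t, g in zip(theta, grad)]

theta = [0.0, 0.0, 0.0]  # three hypothetical reasoning strategies, initially equal
batch = [(0, 3.4), (1, 1.2), (0, 3.1), (2, 2.9)]
new_theta = filtered_reinforce_step(theta, batch, threshold=3.0, alpha=0.1)
print(softmax(new_theta))
```

Only the two trajectories with R(τ) ≥ 3.0 contribute, so probability mass shifts toward strategy 0. Filtering before the update makes this closer to reward-filtered fine-tuning than to an unbiased policy-gradient estimate.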

WHAT THIS MEANS PRACTICALLY

Before

Model reasons about financial crises with generic causal chains

After

Model's weights now encode: "when reasoning about financial crises, trace: pricing → incentives → regulation → systemic failure"

Effect

This reasoning pattern is available in every future session, without needing to re-derive it

Phase 1 result: Smarter retrieval + reasoning skills encoded in weights

FROM STEP 1.1

Context-Aware Retrieval

Every reasoning step gets the documents it specifically needs — not just what the question asked for.

FROM STEP 1.2 B

Step-Level Validation

The PRM scores every step, so only trajectories that clear the quality threshold become training signal.

FROM STEP 1.2 C

Reasoning Skills in Weights

Good reasoning strategies from this session are consolidated into θ — available in all future sessions permanently.

Phase 2

Consolidation — Session Memory into Parameter Update

Important Clarification

What is "session memory" here? It's the full in-context cache the LLM holds during a conversation — every retrieved document, every reasoning step, the PRM scores. This exists only during the session. When the session ends, it's gone. Phase 2's goal: before it disappears, extract the useful reasoning from it and update θ permanently.

Phase 2 is the "sleep replay" moment. The session is over. We mine the reasoning trajectory from the session cache and consolidate it into model parameters.

STEP 2.1

Extract Reasoning Artifacts from Session Cache

Before the session memory is lost, pull out what's worth keeping.

What the Session Cache Contains

Situation

Session just ended. The full context — query, retrieved docs, reasoning steps s_1...s_T, PRM scores r_t — is still in cache.

Task

Extract the reasoning artifacts that are most likely to generalize to future sessions.

Action

Apply extraction function f(τ) to the trajectory. Pull out structured reasoning patterns, not raw text.

Result

A clean set of reasoning artifacts, ready for consolidation into θ.

WHAT f(τ) EXTRACTS

Pattern

"For systemic risk questions: trace pricing → incentive misalignment → regulatory gap → cascade"

Sub-plan

"When CDO structure is relevant, retrieve: ratings methodology + issuer incentives together"

Failure

"Step where I skipped regulatory mechanism — PRM scored 0.3 — avoid this shortcut"
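One possible shape for the output of f(τ), sketched under heavy assumptions: the `Artifact` record, the `failure_below` cut-off and the `summarize` callable are all hypothetical. The source does not specify how patterns are abstracted from raw steps, so the abstraction is left as a pluggable function (in practice likely an LLM call).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Artifact:
    kind: str         # "pattern" or "failure" (labels borrowed from the examples above)
    text: str         # abstracted description, not the raw step text
    source_step: int  # index t of the step s_t the artifact came from

def extract(
    steps: list[str],
    scores: list[float],
    summarize: Callable[[str], str],  # hypothetical abstraction step
    failure_below: float = 0.5,       # assumed cut-off for flagging a failure
) -> list[Artifact]:
    """A toy f(tau): one artifact per step, typed by its PRM score."""
    artifacts = []
    for t, (step, r) in enumerate(zip(steps, scores), start=1):
        kind = "failure" if r < failure_below else "pattern"
        artifacts.append(Artifact(kind, summarize(step), t))
    return artifacts

steps = ["crisis started in housing", "mbs rated low-risk", "prices fell, banks failed"]
scores = [0.9, 0.8, 0.3]
arts = extract(steps, scores, summarize=str.upper)  # str.upper is a placeholder summarizer
print([(a.kind, a.source_step) for a in arts])
```

Keeping `source_step` on every artifact is what lets a later stage look up the PRM score of the step it came from.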

STEP 2.2

PRM Validates — What Gets Promoted

Not everything from the session is worth learning from. The PRM scores decide.

Filter by Step-Level Quality

Situation

We have extracted reasoning artifacts from the session. Some came from well-scored steps. Some from weak steps.

Task

Only promote artifacts where the supporting reasoning steps scored high on the PRM.

Action

Use PRM step scores r_t as a gate. Only artifacts from steps where r_t ≥ threshold proceed to weight update.

Result

Only reasoning patterns whose source steps cleared the PRM threshold enter the weight update. Low-scoring reasoning is blocked.

// From the session, only keep what PRM validated
A_good = {a ∈ f(τ) | r_φ(s_source(a)) ≥ threshold} // s_source(a): the step that artifact a came from
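The gate is a one-line filter once each artifact records the step it came from. In this sketch the dictionary layout, the `source_step` key and the 0.7 threshold are assumptions chosen for illustration.

```python
def gate(artifacts: list[dict], step_scores: dict[int, float], threshold: float) -> list[dict]:
    """Keep only artifacts whose source step scored at or above the threshold."""
    return [a for a in artifacts if step_scores[a["source_step"]] >= threshold]

step_scores = {1: 0.9, 2: 0.8, 3: 0.3}  # PRM scores r_t keyed by step index t
artifacts = [
    {"text": "trace pricing -> incentives -> regulation -> cascade", "source_step": 2},
    {"text": "housing collapse implies insolvency directly", "source_step": 3},
]

a_good = gate(artifacts, step_scores, threshold=0.7)
print([a["source_step"] for a in a_good])  # → [2]
```

The artifact drawn from step 3 is dropped because its source step scored 0.3.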

STEP 2.3

Update LLM Parameters θ (The Consolidation)

This is the "sleep replay." Validated session reasoning is written permanently into model weights.

Fine-Tune θ on Validated Artifacts

Situation

We have A_good — a set of reasoning patterns that the PRM confirmed are correct and useful.

Task

Make the LLM more likely to generate these reasoning patterns in the future — without seeing them in context.

Action

Fine-tune θ using A_good as supervised examples. The reasoning patterns that lived in context cache now live in weights.

Result

The model's reasoning policy π_θ is permanently updated. Session experience becomes permanent capability.

// Session cache → weight update
θ_new = fine_tune(θ, A_good)

// Session cache is now cleared — but its essence lives in θ_new
del session_cache // memory gone, but knowledge kept
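`fine_tune` is left abstract above. One piece that can be sketched without committing to a training framework is turning A_good into supervised records before the cache is cleared. The `prompt`/`completion` field names and the JSONL layout are assumptions; use whatever your fine-tuning job expects.

```python
import json

def to_sft_records(query: str, a_good: list[str]) -> list[dict]:
    """Turn validated artifacts into prompt/completion pairs (field names are assumed)."""
    return [{"prompt": query, "completion": text} for text in a_good]

def to_jsonl(records: list[dict]) -> str:
    """Serialise one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

session_cache = {
    "query": "Why did the 2008 financial crisis happen?",
    "a_good": ["For systemic risk questions: trace pricing, incentives, regulation, cascade."],
}

records = to_sft_records(session_cache["query"], session_cache["a_good"])
payload = to_jsonl(records)  # this is what would be handed to the fine-tuning job
del session_cache            # mirrors the pseudocode: the cache is gone, the records remain
print(payload)
```

The records outlive the cache, which is the point of the step: extraction has to finish before `session_cache` is deleted.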

WHAT CHANGES IN θ AFTER PHASE 2

Before

Model needs documents in context to reason about systemic financial risk

After

Model internally knows the "pricing → incentives → regulation → cascade" structure — it's in θ now

Crucially

This is not the session memory. This is the model's learned reasoning policy. It persists across sessions.

Phase 2 result: Session reasoning → permanent parameter update → model improves

FROM STEP 2.1

Structured Extraction

Reasoning patterns are extracted structurally, not as raw text — ready for learning.

FROM STEP 2.2

Quality Gated

PRM ensures only step-validated reasoning reaches the weight update. No noise enters.

FROM STEP 2.3

Permanent Capability

What existed in the session cache as context now exists in θ as reasoning skill, available in every later session.

Full System

AI-CLS Architecture — How It All Runs

One session, start to finish. Phase 1 runs during inference. Phase 2 runs after the session.

AI-CLS · Full Runtime + Learning Pipeline

📥 User Query arrives

↓

PHASE 1 — Runs during inference (loops T times)

Memory M

retrieve(q, s_t)

docs relevant to
current reasoning step

→

LLM π_θ

generate s_{t+1}

next reasoning step
using new docs

→

PRM r_φ

score r_t

how good was
this step?

↑ s_t feeds back → triggers smarter retrieval next loop ↑

↓

✓ Answer returned to user

📤 Final Answer Generated

↓

session cache contains: full τ + all r_t scores

↓

PHASE 2 — Runs after session ends (async)

Session Cache

τ, r_t scores

⚠ temporary — will be lost

→

f(τ) Extraction

+ PRM filter

only good steps proceed

→

Fine-tune θ

θ_new = θ + Δ

permanent skill update

↓

🧠 LLM θ is permanently smarter at this type of reasoning

Every future session starts from a slightly better model. The cycle repeats.
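Because Phase 2 runs asynchronously, something has to hold finished sessions until a consolidation pass runs. A minimal sketch, assuming a simple in-process buffer: the class name, the batch size and the trajectory-level threshold are all hypothetical choices.

```python
from collections import deque
from typing import Optional

class ConsolidationBuffer:
    """Collects qualifying sessions and releases them in batches for Phase 2."""

    def __init__(self, threshold: float, batch_size: int):
        self.threshold = threshold
        self.batch_size = batch_size
        self._pending: deque = deque()

    def add_session(self, steps: list[str], scores: list[float]) -> Optional[list]:
        """Queue the session if its summed score qualifies; return a batch when full."""
        if sum(scores) >= self.threshold:
            self._pending.append((steps, scores))
        if len(self._pending) >= self.batch_size:
            batch = list(self._pending)
            self._pending.clear()
            return batch  # caller runs extraction, the PRM gate and fine-tuning on this
        return None

buf = ConsolidationBuffer(threshold=2.5, batch_size=2)
print(buf.add_session(["s1", "s2", "s3"], [0.9, 0.8, 0.9]))  # qualifies, queued
print(buf.add_session(["s1", "s2", "s3"], [0.2, 0.4, 0.3]))  # below threshold, dropped
batch = buf.add_session(["s1", "s2", "s3"], [0.9, 0.9, 0.9])
print(len(batch))
```

Batching keeps weight updates off the inference path: users get answers immediately, and θ changes only when enough qualifying sessions have accumulated.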

Comparison

What Makes AI-CLS Different

Aspect	❌ Standard RAG + LLM	✓ AI-CLS
Retrieval trigger	Once, on query only	Every reasoning step, on query + state
Reward signal	Final answer only	Every step, via PRM
Session learning	Discarded on session end	Extracted and consolidated into θ
Parameters θ	Frozen at inference	Update after each session (async)
Memory role	Static document lookup	Dynamic context that informs retrieval state
Model improves?	Never from use	After every qualifying session

Theory Check

How Well Does AI-CLS Mirror Biological CLS?

A direct mapping of every CLS biological mechanism to its AI-CLS counterpart — and where we're still missing pieces.

🧠 Biological CLS Mechanism

🤖 AI-CLS Implementation

Two complementary systems (hippocampus + neocortex)

✓

Two systems: external memory M + parameters θ

Fast learning in hippocampus (single-shot episodic)

✓

Fast context update: session cache M absorbs all docs in one pass

Slow learning in neocortex (many repetitions)

✓

Slow weight update: θ updates from batches of validated sessions

Sleep replay: hippocampus → neocortex transfer

✓

Phase 2: session cache artifacts → fine-tune θ

Selective consolidation (only important memories)

✓

PRM threshold filter: only high-reward trajectories update θ

Memory-guided attention (hippocampus directs cortical focus)

Partial: state-conditioned retrieval mimics this, but not at neural level

Forgetting (interference protection during sleep)

Partial: fine-tuning can cause catastrophic forgetting — needs continual learning methods
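One widely used continual-learning mitigation is rehearsal: mixing previously consolidated examples back into each fine-tuning batch. The sketch below is only one option, not a method the architecture prescribes. The function name, the 0.5 replay ratio and the example strings are assumptions.

```python
import random

def build_training_mix(
    new_examples: list[str],
    replay_pool: list[str],
    replay_ratio: float = 0.5,  # assumed: one replayed example per two new ones
    seed: int = 0,
) -> list[str]:
    """Blend fresh validated examples with earlier ones before a fine-tuning pass."""
    rng = random.Random(seed)
    n_replay = min(len(replay_pool), round(len(new_examples) * replay_ratio))
    mix = new_examples + rng.sample(replay_pool, n_replay)
    rng.shuffle(mix)  # avoid presenting all new examples before all old ones
    return mix

new = ["finance pattern A", "finance pattern B", "finance pattern C", "finance pattern D"]
old = ["medical pattern", "legal pattern", "math pattern"]

mix = build_training_mix(new, old)
print(len(mix), sum(x in old for x in mix))
```

Replaying older examples gives each update a reason to preserve earlier skills while it learns the new ones.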

What If This Works

The Benefits of AI-CLS at Scale

📉

Drastically Reduced Hallucinations

Retrieval aligns with reasoning state and the PRM validates each step, so the model is far less likely to reason confidently from unsupported ground.

🔍

Fully Traceable Decisions

Every answer comes with: retrieved sources, each reasoning step, PRM scores per step. Interpretability is built-in, not an afterthought.

📈

Models That Actually Improve From Use

After millions of sessions, θ has consolidated the best reasoning strategies from all of them. The model gets better — without being retrained from scratch.

🧩

Domain Specialization Through Experience

A model deployed in medicine consolidates medical reasoning patterns. One in law consolidates legal reasoning. Specialization emerges from experience — not fine-tuning on static datasets.

The Core Bet

The brain didn't evolve a single monolithic system for memory and reasoning — it evolved two complementary ones that work together. AI-CLS applies that same principle to language models: keep a fast, flexible context memory and a slow, validated parameter update. Connect them. Let them co-improve. That's how you build a model that learns from experience, not just from pretraining.
