Research Architecture · AI-CLS

Complementary Learning Systems for LLMs

A framework that makes language models learn from their own reasoning — just like the human brain does during sleep.

WHY WE NEED TO CARE

The Problem No One Is Talking About

Every time an LLM solves a hard problem well, that experience is thrown away. The model ends the session exactly as smart as it started.

🧠
Your brain keeps getting better
Every experience you have changes your neural connections. You wake up slightly smarter. The brain replays memories during sleep to consolidate skills.
🤖
LLMs don't learn from use
No matter how well an LLM reasons through a problem, the weights stay frozen. Useful reasoning trajectories are discarded. Nothing is consolidated.
📚
RAG retrieves blindly
Standard retrieval only uses the initial query. It doesn't know where you are in the reasoning chain. You get the same documents regardless of what you've already figured out.
🔁
No feedback loop exists
Knowledge flows one way: documents → model. The model's reasoning never improves the knowledge store or its own parameters. The two systems are completely isolated.

Theoretical Foundation

CLS Theory: Biology → AI

Complementary Learning Systems theory (McClelland et al., 1995) explains how two brain systems, the hippocampus and the neocortex, work together to learn. We map this directly to LLM architecture.

🧠 Brain (Biology) | 🤖 AI-CLS Equivalent
Hippocampus: fast, specific memory storage | External Memory M: vector store, session context cache
Stores exact episodes ("what happened today") | Stores exact documents + reasoning artifacts from session
Neocortex: slow, generalized skill learning | LLM Parameters θ: the reasoning policy π_θ
Encodes general patterns across many experiences | Encodes general reasoning skills via weight updates
Sleep replay: hippocampus replays to neocortex | PRM-guided consolidation: session trajectories → fine-tune θ

Objective

Build a system where retrieval is guided by live reasoning state, correct reasoning is evaluated step-by-step by a reward model, and high-quality reasoning trajectories from each session are consolidated into the LLM's parameters — so the model gets genuinely better from experience.


Problems Being Solved

What Exactly Are We Fixing?

PROBLEM 01
Blind Retrieval
RAG uses only the original query. By step 3 of reasoning, the model's information needs have moved well beyond that query, but retrieval doesn't know that.
PROBLEM 02
No Step-Level Validation
Outcome reward models score only final answers. Flawed intermediate steps that happen to reach a correct answer go undetected and get reinforced.
PROBLEM 03
Experience is Discarded
After every session, the full context cache — including hard-won reasoning steps — is erased. Nothing is learned.
PROBLEM 04
Static Parameters
The LLM's weights never update from inference. A model that solved 1000 hard problems is no better than one that solved zero.

Phase 1

Unification — Reasoning-Guided Retrieval + PRM Consolidation

Connect the two systems. Make retrieval aware of reasoning. Make good reasoning consolidate into weights.

STEP 1.1
Reasoning-Guided Retrieval
Stop retrieving from the question. Start retrieving from where you are in the reasoning chain.
A
Standard RAG — What Happens Today
Situation
User asks a complex question requiring multi-step reasoning.
Problem
Retrieval happens once, using only the original query, before any reasoning begins.
Consequence
Documents retrieved are relevant to the question but not to where the model is in its reasoning.
d_i = R(q)   // only knows the original question
EXAMPLE
Query
"Why did the 2008 financial crisis happen?"
Retrieved
"General overview of 2008 crisis, mortgage definitions, Lehman Brothers bankruptcy"
Problem
By reasoning step 3, the model needs specifics about CDO pricing failures — but retrieval already happened and returned generic docs.
Replace this with →
B
State-Conditioned Retrieval — What We Do
Situation
We have the original query q and the model's current reasoning step s_t.
Task
Retrieve documents that are relevant not just to the question, but to this specific point in reasoning.
Action
Encode both q and s_t together into the retrieval query. Re-query the memory store at every reasoning step.
Result
Retrieval gets progressively more targeted as reasoning narrows toward the answer.
s_t ~ π_θ(· | q, s_<t, d_<t)   // model reasons to step t, given earlier steps and docs
d_t = R(q, s_t)                // retrieve using BOTH query + current state
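A minimal sketch of R(q, s_t), assuming a toy bag-of-words encoder and cosine similarity in place of a real embedding model and vector store. The names `embed`, `cosine`, and `retrieve`, and the three sample documents, are illustrative assumptions rather than part of AI-CLS.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (assumption: stands in for a real encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, state, memory, k=2):
    """R(q, s_t): rank documents against the query and current step encoded together."""
    joint = embed(f"{query} {state}")
    ranked = sorted(memory, key=lambda doc: cosine(joint, embed(doc)), reverse=True)
    return ranked[:k]

# Hypothetical memory store M with three documents.
memory = [
    "General overview of the 2008 financial crisis",
    "How CDO tranches were structured and priced",
    "Rating agency incentive models for mortgage securities",
]
query = "Why did the 2008 financial crisis happen?"

# Empty state: retrieval sees only the question, as in standard RAG.
first = retrieve(query, "", memory, k=1)
# Mid-reasoning state: the same question now pulls a more specific document.
later = retrieve(query, "CDO tranches were mispriced, so how were they structured and priced", memory, k=1)
print(first)
print(later)
```

With an empty state the call reduces to standard query-only retrieval, so the two modes share one code path.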
What this looks like step by step →
C
The Retrieval Loop — Step by Step Example
Situation
Model is mid-reasoning. Each step generates a new state that can trigger a new, better retrieval.
Task
Show how retrieval quality improves as reasoning progresses.
Action
At each step t, embed [q + s_t] → approximate nearest-neighbor (ANN) search over M → top-k new documents → append to context.
Result
A self-improving loop: better reasoning → better retrieval → better reasoning.
STEP-BY-STEP EXAMPLE — "Why did 2008 crisis happen?"
t=0
State: (empty) → Retrieves: general 2008 overview
t=1
State: "crisis involved mortgage-backed securities..." → Retrieves: CDO structure docs, MBS pricing
t=2
State: "CDOs were mispriced because rating agencies used..." → Retrieves: rating agency incentive models, Basel II rules
t=3
State: "Regulators failed to act because..." → Retrieves: Federal Reserve meeting notes, SEC oversight docs
Result
Each retrieval is narrower and more useful than the last, so the model's context tracks what the current step actually needs.
State-conditioned retrieval aims to give the model the right knowledge at the right moment
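The loop in Step 1.1 can be sketched as follows. This is a rough outline under stated assumptions: `reasoning_loop`, `stub_generate`, and `stub_retrieve` are hypothetical names, and the two stubs stand in for the LLM π_θ and the memory store M.

```python
def reasoning_loop(query, generate_step, retrieve, max_steps=4):
    """Alternate retrieval and generation; return (trajectory, context docs)."""
    steps = []
    context = []
    state = ""  # t=0: empty state, so the first retrieval sees only the query
    for _ in range(max_steps):
        for doc in retrieve(query, state):
            if doc not in context:  # append only documents not already in context
                context.append(doc)
        state = generate_step(query, steps, context)  # produce the next reasoning step
        steps.append(state)
    return steps, context

# Scripted stand-ins for the model and the memory store (hypothetical).
SCRIPT = [
    "crisis involved mortgage-backed securities",
    "CDOs were mispriced by rating agencies",
]
DOCS = {
    "": ["2008 overview"],
    SCRIPT[0]: ["CDO structure", "MBS pricing"],
    SCRIPT[1]: ["rating agency incentives"],
}

def stub_generate(query, steps, context):
    return SCRIPT[len(steps)]

def stub_retrieve(query, state):
    return DOCS.get(state, [])

steps, context = reasoning_loop(
    "Why did the 2008 crisis happen?", stub_generate, stub_retrieve, max_steps=2
)
print(steps)
print(context)
```

Passing the generator and retriever in as callables keeps the loop independent of any particular model or index.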
STEP 1.2
PRM-Guided Consolidation
Score every reasoning step. Keep only the best. Write them into model weights.
A
Generate the Full Reasoning Trajectory
Situation
Step 1.1 has produced a complete reasoning chain to the answer.
Task
Capture the trajectory τ — the sequence of all reasoning steps — as a learning signal.
Action
Collect τ = (s_1, s_2, ..., s_T). This is the full chain from question to answer.
Result
We now have something concrete to evaluate: not just the answer, but every step that led to it.
τ = (s_1, s_2, ..., s_T)   // full reasoning chain
EXAMPLE — τ for the 2008 question
s_1
"The crisis started with US housing market..."
s_2
"MBS were packaged and sold as low-risk by ratings agencies..."
s_3
"When housing prices fell, MBS became worthless, triggering bank insolvencies..."
s_T
"Therefore, the root cause was mispriced risk + regulatory failure."
Now evaluate each step →
B
PRM Scores Every Step
Situation
We have trajectory τ. We need to know which steps were logically valid.
Task
Score each intermediate step for correctness — not just the final answer.
Action
Pass each s_t through the Process Reward Model (PRM) r_φ, which is trained separately on step-level human annotations.
Result
Each step gets a quality score. We can now see where reasoning went right or wrong.
r_t = r_φ(s_t)                        // score each step
R(τ) = r_1 + r_2 + ... + r_T       // total trajectory score
EXAMPLE — PRM scoring τ
s_1 → 0.9
Valid starting point, factually grounded
s_2 → 0.8
Correct causal link, well-supported by retrieved docs
s_3 → 0.3
Skipped a key mechanism — partially wrong
s_T → 0.9
Conclusion is correct, but the s_3 gap drags the total down: over the four steps shown, R(τ) = 0.9 + 0.8 + 0.3 + 0.9 = 2.9, a moderate score
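The scoring pass can be sketched in a few lines. A lookup table holding the four scores from the example stands in for a trained PRM. `score_trajectory`, `weakest_step`, and the step labels are illustrative assumptions.

```python
def score_trajectory(steps, prm):
    """Return per-step scores r_t and their sum R(tau)."""
    step_scores = [prm(step) for step in steps]
    return step_scores, sum(step_scores)

def weakest_step(step_scores):
    """Index of the lowest-scoring step, i.e. where the chain is most suspect."""
    return min(range(len(step_scores)), key=step_scores.__getitem__)

# Lookup table standing in for r_phi, using the scores from the example.
EXAMPLE_SCORES = {"s_1": 0.9, "s_2": 0.8, "s_3": 0.3, "s_T": 0.9}

trajectory = ["s_1", "s_2", "s_3", "s_T"]
step_scores, total = score_trajectory(trajectory, EXAMPLE_SCORES.__getitem__)
print(step_scores)
print(round(total, 2))
print(trajectory[weakest_step(step_scores)])
```

Keeping the per-step scores alongside the total is what lets later stages point at s_3 specifically, rather than only seeing a middling trajectory score.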
Filter → Consolidate →
C
Filter + Fine-Tune (The Consolidation)
Situation
We have scored trajectories from many sessions. Some are good, some are flawed.
Task
Use only high-quality trajectories to update θ. Discard the rest.
Action
Threshold filter: keep τ if R(τ) ≥ threshold. Run policy gradient to increase probability of those reasoning strategies.
Result
Model weights θ now encode verified reasoning patterns. The model is slightly better at this type of reasoning in all future sessions.
if R(τ) ≥ threshold:            // filter: only keep high-quality
  θ ← θ + α ∇_θ 𝔼[Σ r_t]       // update weights
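A toy illustration of the filter-then-update rule, assuming a REINFORCE-style estimator (the source does not say which policy-gradient variant is intended). The policy here is a softmax over three hypothetical reasoning strategies, far simpler than an LLM, and `consolidate` is an illustrative name.

```python
import math

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def consolidate(theta, trajectories, threshold, alpha=0.1):
    """Apply one REINFORCE-style step per trajectory that passes the filter.

    trajectories: list of (strategy_index, step_scores) pairs.
    """
    theta = list(theta)  # work on a copy so the caller's weights are untouched
    for strategy, step_scores in trajectories:
        total_reward = sum(step_scores)  # R(tau) = sum of r_t
        if total_reward < threshold:     # filter: discard low-quality trajectories
            continue
        probs = softmax(theta)
        for i in range(len(theta)):
            # Gradient of log softmax w.r.t. logit i: 1[i == strategy] - pi(i)
            grad_log_prob = (1.0 if i == strategy else 0.0) - probs[i]
            theta[i] += alpha * total_reward * grad_log_prob
    return theta

theta = [0.0, 0.0, 0.0]
sessions = [
    (0, [0.9, 0.8, 0.9, 0.9]),  # R is about 3.5, kept
    (1, [0.9, 0.2, 0.3, 0.4]),  # R is about 1.8, discarded
]
theta_new = consolidate(theta, sessions, threshold=3.0)
print(softmax(theta_new))
```

Only the first session moves the weights, and it raises the probability of the strategy it used at the expense of the other two.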
WHAT THIS MEANS PRACTICALLY
Before
Model reasons about financial crises with generic causal chains
After
Model's weights now encode: "when reasoning about financial crises, trace: pricing → incentives → regulation → systemic failure"
Effect
This reasoning pattern is available in every future session, without needing to re-derive it
Phase 1 result: Smarter retrieval + reasoning skills encoded in weights
FROM STEP 1.1
Context-Aware Retrieval
Every reasoning step gets the documents it specifically needs — not just what the question asked for.
FROM STEP 1.2 B
Step-Level Validation
The PRM filters out trajectories with weak steps before they become training signal, so flawed steps are far less likely to be reinforced.
FROM STEP 1.2 C
Reasoning Skills in Weights
Good reasoning strategies from this session are consolidated into θ — available in all future sessions permanently.

Phase 2

Consolidation — Session Memory into Parameter Update

Important Clarification

What is "session memory" here? It's the full in-context cache the LLM holds during a conversation — every retrieved document, every reasoning step, the PRM scores. This exists only during the session. When the session ends, it's gone. Phase 2's goal: before it disappears, extract the useful reasoning from it and update θ permanently.

Phase 2 is the "sleep replay" moment. The session is over. We mine the reasoning trajectory from the session cache and consolidate it into model parameters.

STEP 2.1
Extract Reasoning Artifacts from Session Cache
Before the session memory is lost, pull out what's worth keeping.
A
What the Session Cache Contains
Situation
Session just ended. The full context — query, retrieved docs, reasoning steps s_1...s_T, PRM scores r_t — is still in cache.
Task
Extract the reasoning artifacts that are most likely to generalize to future sessions.
Action
Apply extraction function f(τ) to the trajectory. Pull out structured reasoning patterns, not raw text.
Result
A clean set of reasoning artifacts, ready for consolidation into θ.
WHAT f(τ) EXTRACTS
Pattern
"For systemic risk questions: trace pricing → incentive misalignment → regulatory gap → cascade"
Sub-plan
"When CDO structure is relevant, retrieve: ratings methodology + issuer incentives together"
Failure
"Step where I skipped regulatory mechanism — PRM scored 0.3 — avoid this shortcut"
STEP 2.2
PRM Validates — What Gets Promoted
Not everything from the session is worth learning from. The PRM scores decide.
A
Filter by Step-Level Quality
Situation
We have extracted reasoning artifacts from the session. Some came from well-scored steps. Some from weak steps.
Task
Only promote artifacts where the supporting reasoning steps scored high on the PRM.
Action
Use PRM step scores r_t as a gate. Only artifacts from steps where r_t ≥ threshold proceed to weight update.
Result
Only reasoning patterns that pass the PRM gate enter the weight update. Low-scoring reasoning is blocked, to the extent the PRM's scores are reliable.
// From the session, only keep what PRM validated
A_good = {a from f(τ) | r_φ(s_source) ≥ threshold}
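One way to read the gate above, as a sketch: each artifact remembers which step it came from, and that step's PRM score decides whether it is promoted. The `Artifact` fields, the step indices, and the 0.7 threshold are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Artifact:
    kind: str         # e.g. "pattern", "sub-plan", "failure"
    text: str
    source_step: int  # index of the reasoning step the artifact was extracted from

def gate_artifacts(artifacts, step_scores, threshold):
    """A_good: keep artifacts whose source step scored at or above the threshold."""
    return [a for a in artifacts if step_scores[a.source_step] >= threshold]

step_scores = [0.9, 0.8, 0.3, 0.9]  # r_t for s_1, s_2, s_3, s_T from the example
artifacts = [
    Artifact("pattern", "trace pricing -> incentive misalignment -> regulatory gap -> cascade", 3),
    Artifact("sub-plan", "retrieve ratings methodology + issuer incentives together", 1),
    Artifact("failure", "skipped regulatory mechanism", 2),
]
a_good = gate_artifacts(artifacts, step_scores, threshold=0.7)
print([a.kind for a in a_good])
```

Under this reading the failure artifact is dropped because its source step scored 0.3. A system that wants to learn from failures would need to route them separately.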
STEP 2.3
Update LLM Parameters θ (The Consolidation)
This is the "sleep replay." Validated session reasoning is written permanently into model weights.
A
Fine-Tune θ on Validated Artifacts
Situation
We have A_good — a set of reasoning patterns that the PRM confirmed are correct and useful.
Task
Make the LLM more likely to generate these reasoning patterns in the future — without seeing them in context.
Action
Fine-tune θ using A_good as supervised examples. The reasoning patterns that lived in context cache now live in weights.
Result
The model's reasoning policy π_θ is permanently updated. Session experience becomes permanent capability.
// Session cache → weight update
θ_new = fine_tune(θ, A_good)

// Session cache is now cleared — but its essence lives in θ_new
del session_cache   // memory gone, but knowledge kept
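Before any fine-tuning call, the validated material has to be shaped into supervised examples. The sketch below shows one possible shape, assuming a prompt/completion record per validated step. The record format and the name `to_training_records` are assumptions, and the `fine_tune` step itself is left abstract, as in the source.

```python
import json

def to_training_records(query, validated_steps):
    """One record per validated step: the prompt is the query plus earlier kept steps."""
    records = []
    for i, step in enumerate(validated_steps):
        prompt = "\n".join([query, *validated_steps[:i]])
        records.append({"prompt": prompt, "completion": step})
    return records

# Hypothetical session cache holding only steps that passed the PRM gate.
session_cache = {
    "query": "Why did the 2008 financial crisis happen?",
    "validated_steps": [
        "The crisis started with US housing market...",
        "MBS were packaged and sold as low-risk by ratings agencies...",
    ],
}

records = to_training_records(session_cache["query"], session_cache["validated_steps"])
jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in records)  # one JSON object per line
session_cache.clear()  # the cache is emptied; the records are what remains
print(jsonl)
```

Serializing to JSON Lines before clearing the cache means the training job can run later without the session being alive.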
WHAT CHANGES IN θ AFTER PHASE 2
Before
Model needs documents in context to reason about systemic financial risk
After
Model internally knows the "pricing → incentives → regulation → cascade" structure — it's in θ now
Crucially
This is not the session memory. This is the model's learned reasoning policy. It persists across sessions.
Phase 2 result: Session reasoning → permanent parameter update → model improves
FROM STEP 2.1
Structured Extraction
Reasoning patterns are extracted structurally, not as raw text — ready for learning.
FROM STEP 2.2
Quality Gated
The PRM gate lets only step-validated reasoning reach the weight update, which keeps most noise out.
FROM STEP 2.3
Permanent Capability
What existed in the session cache as context now exists in θ as reasoning skill. Forever.

Full System

AI-CLS Architecture — How It All Runs

One session, start to finish. Phase 1 runs during inference. Phase 2 runs after the session.

AI-CLS · Full Runtime + Learning Pipeline
📥  User Query arrives
PHASE 1 — Runs during inference (loops T times)
Memory M: retrieve(q, s_t) → docs relevant to current reasoning step
LLM π_θ: generate s_{t+1} → next reasoning step using new docs
PRM r_φ: score r_t → how good was this step?
↑ s_t feeds back → triggers smarter retrieval next loop ↑
✓ Answer returned to user
📤  Final Answer Generated
session cache contains: full τ + all r_t scores
PHASE 2 — Runs after session ends (async)
Session Cache: τ, r_t scores (⚠ temporary, will be lost)
f(τ) Extraction + PRM filter: only good steps proceed
Fine-tune θ: θ_new = θ + Δ (permanent skill update)
🧠  LLM θ is permanently smarter at this type of reasoning
Every future session starts from a slightly better model. The cycle repeats.
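The handoff between the two phases can be sketched as a queue: Phase 1 fills a per-session cache, and Phase 2 later drains finished sessions and keeps the qualifying ones for a batched update. `SessionCache`, `end_session`, `drain_for_consolidation`, and the 1.5 threshold are hypothetical names and values, not part of the source design.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class SessionCache:
    query: str
    steps: list = field(default_factory=list)   # trajectory tau
    scores: list = field(default_factory=list)  # PRM scores r_t

    def record(self, step, score):
        """Called during Phase 1, once per reasoning step."""
        self.steps.append(step)
        self.scores.append(score)

    @property
    def total_reward(self):
        return sum(self.scores)  # R(tau)

pending = deque()  # finished sessions awaiting Phase 2

def end_session(cache):
    """Hand a finished session to the async consolidation stage."""
    pending.append(cache)

def drain_for_consolidation(threshold):
    """Phase 2: empty the queue and keep only sessions with R(tau) >= threshold."""
    batch = []
    while pending:
        cache = pending.popleft()
        if cache.total_reward >= threshold:
            batch.append(cache)
    return batch

good = SessionCache(query="Why did the 2008 crisis happen?")
for step, score in [("housing market", 0.9), ("MBS mispriced", 0.8)]:
    good.record(step, score)
weak = SessionCache(query="another question")
weak.record("unsupported leap", 0.3)

end_session(good)
end_session(weak)
batch = drain_for_consolidation(threshold=1.5)
print([c.query for c in batch])
```

Decoupling the two phases through a queue is what allows the answer to return to the user immediately while consolidation runs later.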

Comparison

What Makes AI-CLS Different

Aspect | ❌ Standard RAG + LLM | ✓ AI-CLS
Retrieval trigger | Once, on query only | Every reasoning step, on query + state
Reward signal | Final answer only | Every step, via PRM
Session learning | Discarded on session end | Extracted and consolidated into θ
Parameters θ | Frozen at inference | Update after each session (async)
Memory role | Static document lookup | Dynamic context that informs retrieval state
Model improves? | Never from use | After every qualifying session

Theory Check

How Well Does AI-CLS Mirror Biological CLS?

A direct mapping of every CLS biological mechanism to its AI-CLS counterpart — and where we're still missing pieces.

🧠 Biological CLS Mechanism | 🤖 AI-CLS Implementation
Two complementary systems (hippocampus + neocortex) | Two systems: external memory M + parameters θ
Fast learning in hippocampus (single-shot episodic) | Fast context update: session cache M absorbs all docs in one pass
Slow learning in neocortex (many repetitions) | Slow weight update: θ updates from batches of validated sessions
Sleep replay: hippocampus → neocortex transfer | Phase 2: session cache artifacts → fine-tune θ
Selective consolidation (only important memories) | PRM threshold filter: only high-reward trajectories update θ
Memory-guided attention (hippocampus directs cortical focus) | ~ Partial: state-conditioned retrieval mimics this, but not at the neural level
Forgetting (interference protection during sleep) | ~ Partial: fine-tuning can cause catastrophic forgetting, so continual learning methods are needed

What If This Works

The Benefits of AI-CLS at Scale

📉
Drastically Reduced Hallucinations
Retrieval aligns with reasoning state. PRM validates each step. The model is far less likely to reason confidently from unsupported ground.
🔍
Fully Traceable Decisions
Every answer comes with: retrieved sources, each reasoning step, PRM scores per step. Interpretability is built-in, not an afterthought.
📈
Models That Actually Improve From Use
After millions of sessions, θ has consolidated the best reasoning strategies from all of them. The model gets better — without being retrained from scratch.
🧩
Domain Specialization Through Experience
A model deployed in medicine consolidates medical reasoning patterns. One in law consolidates legal reasoning. Specialization emerges from experience — not fine-tuning on static datasets.

The Core Bet

The brain didn't evolve a single monolithic system for memory and reasoning — it evolved two complementary ones that work together. AI-CLS applies that same principle to language models: keep a fast, flexible context memory and a slow, validated parameter update. Connect them. Let them co-improve. That's how you build a model that learns from experience, not just from pretraining.