Research Architecture · AI-CLS
Complementary Learning Systems for LLMs
A framework that makes language models learn from their own reasoning — just like the human
brain does during sleep.
WHY WE NEED TO CARE
The Problem No One Is Talking About
Every time an LLM solves a hard problem well, that experience is thrown away. The model
ends the session exactly as smart as it started.
🧠
Your brain keeps getting better
Every experience you have changes your neural connections. You wake up slightly
smarter. The brain replays memories during sleep to consolidate skills.
🤖
LLMs don't learn from use
No matter how well an LLM reasons through a problem, the weights stay frozen.
Useful reasoning trajectories are discarded. Nothing is consolidated.
📚
RAG retrieves blindly
Standard retrieval only uses the initial query. It doesn't know where you are in
the reasoning chain. You get the same documents regardless of what you've already figured out.
🔁
No feedback loop exists
Knowledge flows one way: documents → model. The model's reasoning never improves
the knowledge store or its own parameters. The two systems are completely isolated.
Theoretical Foundation
CLS Theory: Biology → AI
Complementary Learning Systems theory (McClelland et al., 1995) describes how the hippocampus
and the neocortex work together to learn. We map this directly to LLM architecture.
🧠 Brain (Biology) → 🤖 AI-CLS Equivalent
Hippocampus: fast, specific memory storage → External Memory M: vector store, session context cache
Stores exact episodes ("what happened today") → Stores exact documents + reasoning artifacts from session
Neocortex: slow, generalized skill learning → LLM Parameters θ: the reasoning policy π_θ
Encodes general patterns across many experiences → Encodes general reasoning skills via weight updates
Sleep replay: hippocampus replays to neocortex → PRM-guided consolidation: session trajectories are used to fine-tune θ
Objective
Build a system where retrieval is guided by live reasoning state, correct reasoning is
evaluated step-by-step by a reward model, and high-quality reasoning trajectories from
each session are consolidated into the LLM's parameters — so the model gets genuinely
better from experience.
Problems Being Solved
What Exactly Are We Fixing?
PROBLEM 01
Blind Retrieval
RAG uses only the original query. By step 3 of reasoning, the original query alone no longer
describes what the model needs, but retrieval doesn't know that.
PROBLEM 02
No Step-Level Validation
Outcome reward models score only final answers. Flawed intermediate steps that happen to lead
to correct answers go undetected and get reinforced.
PROBLEM 03
Experience is Discarded
After every session, the full context cache — including hard-won reasoning steps —
is erased. Nothing is learned.
PROBLEM 04
Static Parameters
The LLM's weights never update from inference. A model that solved 1000 hard
problems is no better than one that solved zero.
Phase 1
Unification — Reasoning-Guided Retrieval + PRM Consolidation
Connect the two systems. Make retrieval aware of reasoning. Make good reasoning consolidate
into weights.
STEP 1.1
Reasoning-Guided Retrieval
Stop retrieving from the question. Start retrieving from where you are in the
reasoning chain.
Situation
User asks a complex question requiring multi-step reasoning.
Problem
Retrieval happens once, using only the original query,
before any reasoning begins.
Consequence
Documents retrieved are relevant to the question but not to where
the model is in its reasoning.
d_i = R(q) // only knows the original question
EXAMPLE
Query
"Why did the 2008 financial crisis happen?"
Retrieved
"General overview of 2008 crisis, mortgage definitions, Lehman Brothers
bankruptcy"
Problem
By reasoning step 3, the model needs specifics about CDO pricing
failures — but retrieval already happened and returned generic docs.
↓ Replace this with →
Situation
We have the original query q and the model's current
reasoning step s_t.
Task
Retrieve documents that are relevant not just to the question, but to
this specific point in reasoning.
Action
Encode both q and s_t together into the retrieval query. Re-query the
memory store at every reasoning step.
Result
Retrieval gets progressively more targeted as reasoning
narrows toward the answer.
s_t ~ π_θ(s_t | q, d_<t)   // model reasons to step t
d_i = R(q, s_t)            // retrieve using BOTH query + current state
↓ What this looks like step by step →
Situation
Model is mid-reasoning. Each step generates a new state that can trigger a
new, better retrieval.
Task
Show how retrieval quality improves as reasoning progresses.
Action
At each step t, embed [q + s_t] → ANN search over M → top-k new documents
→ appended to context.
Result
A self-improving loop: better reasoning → better retrieval →
better reasoning.
STEP-BY-STEP EXAMPLE — "Why did 2008 crisis happen?"
t=0
State: (empty) → Retrieves: general 2008 overview
t=1
State: "crisis involved mortgage-backed securities..." →
Retrieves: CDO structure docs, MBS pricing
t=2
State: "CDOs were mispriced because rating agencies used..." →
Retrieves: rating agency incentive models, Basel II rules
t=3
State: "Regulators failed to act because..." → Retrieves: Federal Reserve meeting notes, SEC oversight docs
Result
Each retrieval is narrower and more useful than the last. The model's
context stays aligned with what the current step needs.
State-conditioned retrieval gives the model exactly the right knowledge at
the right moment
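The loop above (embed [q + s_t], search M, append new top-k documents) can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework's implementation: the bag-of-words `embed`, the brute-force cosine ranking (standing in for ANN search), the `seen` set, and the hard-coded reasoning states are all hypothetical stand-ins.

```python
import math
from collections import Counter
from typing import Optional

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption: stands in for a learned encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, state: str, memory: list, k: int = 1,
             seen: Optional[set] = None) -> list:
    """R(q, s_t): rank memory by similarity to the joint text [q + s_t].

    Brute-force ranking is used here in place of ANN search.
    Indices in `seen` are skipped so each step returns new documents.
    """
    joint = embed(f"{query} {state}")
    ranked = sorted(range(len(memory)),
                    key=lambda i: cosine(joint, embed(memory[i])),
                    reverse=True)
    return [i for i in ranked if not seen or i not in seen][:k]

memory = [
    "overview of the 2008 financial crisis and lehman brothers bankruptcy",
    "cdo structure and mortgage-backed securities pricing",
    "rating agencies incentive models and basel ii rules",
]
q = "why did the 2008 financial crisis happen"
# Hard-coded reasoning states; in the real loop these come from the model.
states = [
    "",
    "crisis involved mortgage-backed securities",
    "cdos were mispriced because rating agencies used flawed models",
]

context, seen = [], set()
for state in states:
    for i in retrieve(q, state, memory, k=1, seen=seen):
        seen.add(i)
        context.append(memory[i])
```

With these toy inputs the general overview is fetched first, and the later states pull in the securities and rating-agency documents as the reasoning mentions them.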
STEP 1.2
PRM-Guided Consolidation
Score every reasoning step. Keep only the best. Write them into model weights.
Situation
Step 1.1 has produced a complete reasoning chain to the answer.
Task
Capture the trajectory τ — the sequence of all reasoning steps — as a
learning signal.
Action
Collect τ = (s_1, s_2, ..., s_T). This is the full chain from question to
answer.
Result
We now have something concrete to evaluate: not just the answer, but
every step that led to it.
τ = (s_1, s_2, ..., s_T)   // full reasoning chain
EXAMPLE — τ for the 2008 question
s_1
"The crisis started with US housing market..."
s_2
"MBS were packaged and sold as low-risk by ratings agencies..."
s_3
"When housing prices fell, MBS became worthless, triggering bank
insolvencies..."
s_T
"Therefore, the root cause was mispriced risk + regulatory failure."
↓ Now evaluate each step →
Situation
We have trajectory τ. We need to know which steps were logically valid.
Task
Score each intermediate step for correctness — not just the final answer.
Action
Pass each s_t through the Process Reward Model r_φ. PRM was trained
separately on step-level human annotations.
Result
Each step gets a quality score. We can now see where reasoning
went right or wrong.
r_t = r_φ(s_t)                 // score each step
R(τ) = r_1 + r_2 + ... + r_T   // total trajectory score
EXAMPLE — PRM scoring τ
s_1 → 0.9
Valid starting point, factually grounded
s_2 → 0.8
Correct causal link, well-supported by retrieved docs
s_3 → 0.3
Skipped a key mechanism — partially wrong
s_T → 0.9
Conclusion correct, but the s_3 gap pulls the total down: R(τ) = 2.9 across the four steps shown
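To make the scoring arithmetic concrete, here is a small sketch that aggregates the four example scores. The sum is the R(τ) defined above. The mean and the minimum are assumptions added for comparison and are not part of the framework as stated; they are shown because a plain sum grows with trajectory length, while the minimum exposes a single weak step.

```python
from statistics import mean

def trajectory_scores(step_scores: list) -> dict:
    """Aggregate per-step PRM scores r_t into trajectory-level numbers."""
    return {
        "sum": sum(step_scores),    # R(tau) as defined in the text
        "mean": mean(step_scores),  # assumption: length-normalised alternative
        "min": min(step_scores),    # assumption: weakest-step alternative
    }

# Scores from the example: s_1, s_2, s_3, s_T
scores = [0.9, 0.8, 0.3, 0.9]
summary = trajectory_scores(scores)
```

Here the sum is about 2.9 and the mean about 0.725, while the minimum of 0.3 points straight at s_3.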
↓ Filter → Consolidate →
Situation
We have scored trajectories from many sessions. Some are good, some are
flawed.
Task
Use only high-quality trajectories to update θ. Discard the rest.
Action
Threshold filter: keep τ if R(τ) ≥ threshold. Run policy gradient to
increase probability of those reasoning strategies.
Result
Model weights θ now encode verified reasoning patterns.
The model is slightly better at this type of reasoning in all future sessions.
if R(τ) ≥ threshold:            // filter: only keep high-quality
    θ ← θ + α ∇_θ 𝔼[Σ r_t]     // update weights
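The filter-then-update rule can be illustrated on a toy policy. The sketch below is a REINFORCE-style update on a categorical policy over three hypothetical "reasoning strategies"; it is an assumption chosen for readability, not the update the framework prescribes. An actual LLM update would use gradients of token log-probabilities, and the `threshold` and `alpha` values here are arbitrary.

```python
import math

def softmax(logits: list) -> list:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def consolidate(theta: list, trajectories: list, threshold: float,
                alpha: float = 0.1) -> list:
    """theta: logits over strategies.
    trajectories: list of (strategy_index, step_scores) pairs."""
    theta = list(theta)
    for strategy, step_scores in trajectories:
        ret = sum(step_scores)       # R(tau) = r_1 + ... + r_T
        if ret < threshold:          # filter: discard low-scoring trajectories
            continue
        probs = softmax(theta)
        for j in range(len(theta)):
            # d/d theta_j of log pi(strategy) for a softmax policy
            grad_log = (1.0 if j == strategy else 0.0) - probs[j]
            theta[j] += alpha * ret * grad_log
    return theta

theta = [0.0, 0.0, 0.0]
trajectories = [
    (0, [0.9, 0.8, 0.9, 0.9]),  # clears the threshold, so it updates theta
    (1, [0.9, 0.8, 0.3, 0.9]),  # the example trajectory, filtered out
]
new_theta = consolidate(theta, trajectories, threshold=3.0)
```

After the update, the probability of strategy 0 rises above its initial one third, and the filtered trajectory leaves no trace in θ.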
WHAT THIS MEANS PRACTICALLY
Before
Model reasons about financial crises with generic causal chains
After
Model's weights now encode: "when reasoning about financial crises,
trace: pricing → incentives → regulation → systemic failure"
Effect
This reasoning pattern is available in every future session,
without needing to re-derive it
Phase 1 result: Smarter retrieval + reasoning skills encoded in weights
FROM STEP 1.1
Context-Aware Retrieval
Every reasoning step gets the documents it specifically needs — not just what
the question asked for.
FROM STEP 1.2
Step-Level Validation
The PRM scores every step, so only trajectories that clear the threshold become
training signal.
FROM STEP 1.2
Reasoning Skills in Weights
Good reasoning strategies from this session are consolidated into θ — available
in all future sessions permanently.
Phase 2
Consolidation — Session Memory into Parameter Update
Important Clarification
What is "session memory" here? It's the full in-context cache the LLM holds during a
conversation — every retrieved document, every reasoning step, the PRM scores. This exists only
during the session. When the session ends, it's gone. Phase 2's goal: before
it disappears, extract the useful reasoning from it and update θ permanently.
Phase 2 is the "sleep replay" moment. The session is over. We mine the reasoning trajectory
from the session cache and consolidate it into model parameters.
STEP 2.1
Extract Reasoning Artifacts from Session Cache
Before the session memory is lost, pull out what's worth keeping.
Situation
Session just ended. The full context — query, retrieved docs, reasoning
steps s_1...s_T, PRM scores r_t — is still in cache.
Task
Extract the reasoning artifacts that are most likely to generalize to
future sessions.
Action
Apply extraction function f(τ) to the trajectory. Pull out structured
reasoning patterns, not raw text.
Result
A clean set of reasoning artifacts, ready for consolidation into θ.
WHAT f(τ) EXTRACTS
Pattern
"For systemic risk questions: trace pricing → incentive misalignment
→ regulatory gap → cascade"
Sub-plan
"When CDO structure is relevant, retrieve: ratings methodology +
issuer incentives together"
Failure
"Step where I skipped regulatory mechanism — PRM scored 0.3 — avoid
this shortcut"
STEP 2.2
PRM Validates — What Gets Promoted
Not everything from the session is worth learning from. The PRM scores decide.
Situation
We have extracted reasoning artifacts from the session. Some came from
well-scored steps. Some from weak steps.
Task
Only promote artifacts where the supporting reasoning steps scored high on
the PRM.
Action
Use PRM step scores r_t as a gate. Only artifacts from steps where r_t ≥
threshold proceed to weight update.
Result
Only artifacts whose source steps cleared the PRM threshold enter the weight update.
Weak reasoning is filtered out, to the extent that the PRM's scores are reliable.
// From the session, only keep what PRM validated
A_good = {a from f(τ) | r_φ(s_source) ≥ threshold}
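A minimal sketch of this gate, assuming each artifact records which step it came from. The `Artifact` fields, the step assignments, and the 0.7 threshold are hypothetical; the source does not specify how f(τ) represents its output.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Artifact:
    kind: str         # "pattern", "sub-plan", or "failure"
    text: str
    source_step: int  # index t of the step s_t the artifact was extracted from

def gate(artifacts: list, step_scores: dict, threshold: float) -> list:
    """A_good = {a from f(tau) | r_phi(s_source) >= threshold}."""
    return [a for a in artifacts if step_scores[a.source_step] >= threshold]

# PRM scores from the running example, keyed by step index (assumed T = 4).
step_scores = {1: 0.9, 2: 0.8, 3: 0.3, 4: 0.9}
artifacts = [
    Artifact("pattern", "trace pricing -> incentive misalignment -> regulatory gap -> cascade", 4),
    Artifact("sub-plan", "retrieve ratings methodology + issuer incentives together", 2),
    Artifact("failure", "skipped regulatory mechanism", 3),
]
a_good = gate(artifacts, step_scores, threshold=0.7)
```

Under this gate the "failure" artifact is dropped, because it is tied to the step that scored 0.3. If failure cases are meant to be learned from as negative examples, they would need a separate rule.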
STEP 2.3
Update LLM Parameters θ (The Consolidation)
This is the "sleep replay." Validated session reasoning is written permanently
into model weights.
Situation
We have A_good — a set of reasoning patterns that the PRM confirmed are
correct and useful.
Task
Make the LLM more likely to generate these reasoning patterns in the
future — without seeing them in context.
Action
Fine-tune θ using A_good as supervised examples. The reasoning patterns
that lived in context cache now live in weights.
Result
The model's reasoning policy π_θ is permanently updated. Session
experience becomes permanent capability.
// Session cache → weight update
θ_new = fine_tune(θ, A_good)
// Session cache is now cleared — but its essence lives in θ_new
del session_cache // memory gone, but knowledge kept
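One way to picture the hand-off from cache to training data is below. It is a sketch only: the prompt/completion record layout and the `build_sft_records` helper are assumptions, and the fine-tuning call itself is left abstract because the source does not name a training stack.

```python
import json

def build_sft_records(query: str, a_good: list) -> list:
    """Turn validated artifacts into JSONL lines of prompt/completion pairs.

    The record schema is hypothetical; adapt it to whatever trainer is used.
    """
    return [
        json.dumps({"prompt": query, "completion": text}, ensure_ascii=False)
        for text in a_good
    ]

session_cache = {
    "query": "Why did the 2008 financial crisis happen?",
    "a_good": [
        "For systemic risk questions: trace pricing, incentive misalignment, regulatory gap, cascade",
        "When CDO structure is relevant, retrieve ratings methodology and issuer incentives together",
    ],
}

# Persist the training records first, then drop the temporary session state.
records = build_sft_records(session_cache["query"], session_cache["a_good"])
session_cache.clear()
```

The ordering is the point: the records are built before the cache is cleared, so nothing validated is lost when the session state goes away.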
WHAT CHANGES IN θ AFTER PHASE 2
Before
Model needs documents in context to reason about systemic financial risk
After
Model internally knows the "pricing → incentives → regulation →
cascade" structure — it's in θ now
Crucially
This is not the session memory. This is the model's learned
reasoning policy, and it persists across sessions.
Phase 2 result: Session reasoning → permanent parameter update → model
improves
FROM STEP 2.1
Structured Extraction
Reasoning patterns are extracted structurally, not as raw text — ready for
learning.
FROM STEP 2.2
Quality Gated
PRM ensures only step-validated reasoning reaches the weight update. No noise
enters.
FROM STEP 2.3
Permanent Capability
What existed in the session cache as context now exists in θ as reasoning
skill. Forever.
Full System
AI-CLS Architecture — How It All Runs
One session, start to finish. Phase 1 runs during inference. Phase 2 runs after the
session.
AI-CLS · Full Runtime + Learning Pipeline
📥 User Query arrives
↓
PHASE 1 — Runs during inference (loops T times)
Memory M: retrieve(q, s_t), docs relevant to the current reasoning step
→
LLM π_θ: generate s_{t+1}, the next reasoning step using the new docs
→
PRM r_φ: score r_t, how good was this step?
↑ s_t feeds back → triggers smarter retrieval next loop ↑
↓
✓ Answer returned to user
📤 Final Answer Generated
↓
session cache contains: full τ + all r_t scores
↓
PHASE 2 — Runs after session ends (async)
Session Cache: τ, r_t scores (⚠ temporary, will be lost)
→
f(τ) Extraction + PRM filter: only good steps proceed
→
Fine-tune θ: θ_new = θ + Δ, a permanent skill update
↓
🧠 LLM θ is permanently smarter at this type of reasoning
Every future session starts from a slightly better model. The cycle repeats.
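The control flow of the pipeline can be sketched with the three components passed in as plain callables. Everything below is an assumption made to show the wiring: the function signatures, the `max_steps` cap, the "Therefore" stop rule, and the stub components are hypothetical.

```python
from typing import Callable

def run_session(query: str,
                retrieve: Callable,
                generate: Callable,
                score: Callable,
                max_steps: int = 4) -> dict:
    """Phase 1 loop. Returns the session cache that Phase 2 would consume."""
    context, steps, scores = [], [], []
    state = ""
    for _ in range(max_steps):
        context.extend(retrieve(query, state))    # Memory M: retrieve(q, s_t)
        state = generate(query, context, steps)   # LLM: next reasoning step
        steps.append(state)
        scores.append(score(state))               # PRM: score r_t
        if state.startswith("Therefore"):         # assumed stop condition
            break
    return {"query": query, "docs": context, "tau": steps, "scores": scores}

# Stub components so the loop can run without a model or a vector store.
script = [
    "The crisis started in the US housing market.",
    "MBS were packaged and sold as low-risk.",
    "Therefore, the root cause was mispriced risk and regulatory failure.",
]
cache = run_session(
    "Why did the 2008 financial crisis happen?",
    retrieve=lambda q, s: [f"doc for: {s or q}"],
    generate=lambda q, ctx, steps: script[len(steps)],
    score=lambda s: 0.9,
)
```

The returned dictionary holds the full τ and every r_t, which is exactly what the post-session phase needs before the cache is discarded.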
Comparison
What Makes AI-CLS Different
| Aspect | ❌ Standard RAG + LLM | ✓ AI-CLS |
| Retrieval trigger | Once, on query only | Every reasoning step, on query + state |
| Reward signal | Final answer only | Every step, via PRM |
| Session learning | Discarded on session end | Extracted and consolidated into θ |
| Parameters θ | Frozen at inference | Update after each session (async) |
| Memory role | Static document lookup | Dynamic context that informs retrieval state |
| Model improves? | Never from use | After every qualifying session |
Theory Check
How Well Does AI-CLS Mirror Biological CLS?
A direct mapping of every CLS biological mechanism to its AI-CLS counterpart — and where
we're still missing pieces.
| 🧠 Biological CLS Mechanism | Match | 🤖 AI-CLS Implementation |
| Two complementary systems (hippocampus + neocortex) | ✓ | Two systems: external memory M + parameters θ |
| Fast learning in hippocampus (single-shot episodic) | ✓ | Fast context update: session cache M absorbs all docs in one pass |
| Slow learning in neocortex (many repetitions) | ✓ | Slow weight update: θ updates from batches of validated sessions |
| Sleep replay: hippocampus → neocortex transfer | ✓ | Phase 2: session cache artifacts → fine-tune θ |
| Selective consolidation (only important memories) | ✓ | PRM threshold filter: only high-reward trajectories update θ |
| Memory-guided attention (hippocampus directs cortical focus) | ~ | Partial: state-conditioned retrieval mimics this, but not at neural level |
| Forgetting (interference protection during sleep) | ~ | Partial: fine-tuning can cause catastrophic forgetting; needs continual learning methods |
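For the catastrophic-forgetting gap, one commonly used continual learning method is rehearsal: mixing a sample of older examples into each new fine-tuning batch. The sketch below is an assumption about how that could look here, not something the framework specifies; `mix_with_replay`, `replay_ratio`, and the example names are hypothetical.

```python
import random

def mix_with_replay(new_examples: list, replay_buffer: list,
                    replay_ratio: float = 0.5, seed: int = 0) -> list:
    """Build a fine-tuning batch from new examples plus sampled older ones."""
    rng = random.Random(seed)
    n_replay = min(len(replay_buffer), int(len(new_examples) * replay_ratio))
    batch = list(new_examples) + rng.sample(replay_buffer, n_replay)
    rng.shuffle(batch)  # interleave old and new examples
    return batch

new_examples = ["finance-1", "finance-2", "finance-3", "finance-4"]
replay_buffer = ["math-1", "law-1", "medicine-1"]
batch = mix_with_replay(new_examples, replay_buffer)
```

Rehearsal is only one option. Regularization-based and parameter-isolation methods address the same problem and may suit some deployments better.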
What If This Works
The Benefits of AI-CLS at Scale
📉
Drastically Reduced Hallucinations
Retrieval aligns with reasoning state. PRM validates each step. The model is less
likely to reason confidently from unsupported ground.
🔍
Fully Traceable Decisions
Every answer comes with: retrieved sources, each reasoning step, PRM scores per
step. Interpretability is built-in, not an afterthought.
📈
Models That Actually Improve From Use
After millions of sessions, θ has consolidated the best reasoning strategies
from all of them. The model gets better — without being retrained from scratch.
🧩
Domain Specialization Through Experience
A model deployed in medicine consolidates medical reasoning patterns. One in law
consolidates legal reasoning. Specialization emerges from experience — not fine-tuning on static
datasets.
The Core Bet
The brain didn't evolve a single monolithic system for memory and reasoning — it evolved two
complementary ones that work together. AI-CLS applies that same principle to language models: keep a
fast, flexible context memory and a slow, validated parameter update. Connect them. Let them co-improve.
That's how you build a model that learns from experience, not just
from pretraining.