Working Memory: Context Window & the Hard Cliff (AI Capabilities & Limitations)

Once, I spent the first 2 hours of a conversation carefully setting up context with Claude.

What you will learn

Explain that the context window is a fixed-size container, and what follows from that for long documents, long conversations, and cross-session memory
Recognize the cliff behavior of Working Memory, in contrast to the gradient of the other 3 properties
Apply context-as-leverage strategies: front-loading, chunking, re-supplying critical context
Identify memory, compaction, projects/workspaces, and larger windows as product features that address this limit
Design a workflow that uses context efficiently for your own tasks

Context window: the AI's working memory

Everything the AI is "paying attention to" lives in a fixed-size workspace called the context window:

┌──────────────────────────────────────────────────────────┐
│                                                          │
│  CONTEXT WINDOW (e.g. 200K tokens ≈ 150K words)          │
│                                                          │
│  ┌────────────────────────────────────────────────┐      │
│  │  System prompt (custom instructions)           │      │
│  ├────────────────────────────────────────────────┤      │
│  │  Uploaded documents (PDFs, images...)          │      │
│  ├────────────────────────────────────────────────┤      │
│  │  Message 1 (you)                               │      │
│  │  Message 1 (Claude response)                   │      │
│  │  Message 2 (you)                               │      │
│  │  Message 2 (Claude response)                   │      │
│  │  ... (all prior turns)                         │      │
│  ├────────────────────────────────────────────────┤      │
│  │  Most recent message (you)                     │      │
│  └────────────────────────────────────────────────┘      │
│                                                          │
│  The model reads this ENTIRE workspace on every reply.   │
│                                                          │
└──────────────────────────────────────────────────────────┘

What happens when it fills up?

When the conversation plus documents exceed the window, something falls out: usually the oldest material, and usually silently (no notice).

The model does not announce that it has dropped the first 4 messages. It just keeps going with what is left.

Before overflow:
  [System] [Doc A] [Msg 1-20] [Msg 21] ← current

After overflow:
  ~~~[System]~~~ ~~~[Doc A]~~~ [Msg 5-20] [Msg 21-25] ← current
  (silently dropped)

Between sessions: reset to zero

By default, the window empties between sessions. Close the chat, open a new one tomorrow, and you start from zero unless a product feature (Memory, Projects) has actively carried something forward.
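The overflow behavior described in this section (oldest material dropped, no notice given) can be sketched in a few lines. This is a minimal illustration, not how any particular product implements it: `fit_to_window` and `rough_tokens` are hypothetical names, the 1.3 tokens-per-word figure is a crude assumption, and real products may pin the system prompt or truncate differently.

```python
from collections import deque


def fit_to_window(messages, window_tokens, count_tokens):
    """Keep the newest messages that fit; silently drop everything older."""
    kept = deque()
    used = 0
    for msg in reversed(messages):          # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > window_tokens:
            break                           # this message and all older ones fall out
        kept.appendleft(msg)
        used += cost
    dropped = len(messages) - len(kept)
    return list(kept), dropped


def rough_tokens(text):
    # Crude assumption: about 1.3 tokens per whitespace-separated word.
    return int(len(text.split()) * 1.3) + 1


# 25 messages of roughly equal size, and a window too small to hold them all.
history = [f"message {i} " + "word " * 50 for i in range(1, 26)]
visible, dropped = fit_to_window(history, window_tokens=1500, count_tokens=rough_tokens)
print(f"visible: {len(visible)}, dropped without notice: {dropped}")
```

Nothing in the returned `visible` list records that earlier messages ever existed, which is why the loss is invisible unless the caller checks `dropped`.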

This is the property with a CLIFF

Unlike the other 3 properties:

┌──────────────────────────────────────────────────────────┐
│                                                          │
│  Next Token Prediction:                                  │
│    Capability ════════════════════▼━━━▼━━━▼ Limitation   │
│                                  GRADIENT: slows down    │
│                                                          │
│  Knowledge:                                              │
│    Capability ══════════════════▼━━▼━━━▼ Limitation      │
│                                GRADIENT: thins out       │
│                                                          │
│  Steerability:                                           │
│    Capability ═══════════════▼━━▼━━━▼━━▼ Limitation      │
│                             GRADIENT: drifts             │
│                                                          │
│  ─────────────────────────────────────────────────       │
│                                                          │
│  Working Memory:                                         │
│    Capability ══════════════════════════║ ▼▼▼ CHAOS      │
│                                         CLIFF            │
│                                                          │
│    Works fine until it suddenly doesn't.                 │
│    No clear warning.                                     │
│                                                          │
└──────────────────────────────────────────────────────────┘

What this means in practice:

Past the window, you get no warning that you have exceeded the limit
Quality can degrade abruptly at the limit rather than tapering off
You have to manage the context window actively; you cannot "just give it more"

Working Memory Continuum

Capability ◄────────────────────────────────────────► Limitation

CAPABILITY ZONE                        LIMITATION ZONE
─────────────                          ───────────────

- Material fits comfortably            - Very long docs/conversations
- Current session                      - Cross-session continuity
- You supply relevant context          - Info buried in middle
- Context fresh, recent                - Window overflowing
                                       - "Cliff": suddenly loses track

Example tasks on the continuum

Task	Position	Why
Summarize one 500-word email	◀ Capability	Fits easily
Draft a 5-page report from 3 small files	◀ Capability	Fits with buffer to spare
Review a 50-page contract	Middle	Fits, but needs a strategy
Analyze 100 emails + 3 reports	Limitation ▶	Near or over the window
"Remember the context of last week's conversation"	Limitation ▶	Cross-session; by default it does not remember
Paste a 200-page legal doc	Limitation ▶	Very long, and the details must be precise

4 characteristic failure modes

Failure 1: Hard length limits (silent truncation)

Symptom: When the window is exceeded, the oldest content is dropped silently. The model keeps going with what remains.

Example:

You: [pastes a 200-page contract + review instructions]
Model: [reviews... skips clauses 1-30 because they have fallen out of the window]
You: "Review clause 23"
Model: "Clause 23 does not appear in the document..."
(Clause 23 was in the document, but it had already been dropped)

Mitigation:

Know the window limit of the model you are using
Chunk the document when it exceeds 70% of the window
Flag it to the model: "this document has X pages; if it exceeds the window, tell me"

Failure 2: Lost in the middle

Symptom: Attention is uneven. Material in the middle of the context window gets less attention than material at the start or the end.

Research: The 2023 "Lost in the Middle" study (Liu et al., Stanford) placed a single fact at different positions in a long context and found:

Accuracy is highest at the start (primacy) or the end (recency)
Accuracy can drop by more than 30% when the fact is buried in the middle

Likely cause: Transformer attention patterns tend to weight the edges of the context more heavily.

Example:

You: [paste 30-page doc] + [ask question about page 17]
Model: Answers based on page 1 + page 30, misses page 17

Same doc, same question, but you put the instruction "check page 17 
carefully" at the start + end → accuracy improves significantly.

Mitigation (details in Lesson 17.8):

Put critical instructions at the start + end
Do not bury key info in the middle of a long doc
If needed, restate the key instruction near the final user message

Failure 3: No persistent memory by default

Symptom: Every session resets. Close the chat → everything is gone.

Example:

Monday: You teach Claude about your company, customer personas, 
        and style guide. 2 hours of setup.
Tuesday: You open a new chat → Claude: "Hi, how can I help?"
         It knows nothing about Monday. You have to set up again.

Mitigation:

Use the Memory feature (if available; see the later section of Lesson 17.7)
Use Projects / Workspaces: standing docs are always present in the context
Save a system prompt / custom instruction
Create a "Context Onboarding" doc to paste at the start of every new session

Failure 4: The model does not learn from correction

Symptom: You correct the model during a conversation. It "understands" (acknowledges). In a new chat, it repeats exactly the same mistake.

Why:

Correction does not change model weights. The model stays the same.
Correction exists only in the CURRENT CONTEXT.
New session, the context resets, and the correction goes with it.

Consequences:

You cannot "train" the AI through conversation (that is fine-tuning, done at a different scale)
Teaching only holds within the current session

Mitigation:

Important corrections → write them into custom instructions / project knowledge
Create a "lessons learned" doc to load into every new session
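The difference between a correction that lives only in one chat and one that is saved and reloaded can be made concrete with a toy model. The sketch below assumes a hypothetical `ChatSession` class (it is not any vendor's API) in which the only thing a session "knows" is the list of strings in its context; the fiscal-year correction is invented purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ChatSession:
    """Hypothetical stand-in for one chat: all it 'knows' is its context."""
    standing_context: list[str] = field(default_factory=list)
    turns: list[str] = field(default_factory=list)

    def say(self, text: str) -> None:
        self.turns.append(text)

    def context(self) -> list[str]:
        # What the model can see: standing context first, then this chat's turns.
        return self.standing_context + self.turns


# Monday: the correction exists only inside this session's context.
monday = ChatSession()
monday.say("Correction: our fiscal year starts in April, not January.")

# Tuesday, cold start: nothing was carried over.
tuesday_cold = ChatSession()

# Tuesday, with a saved "lessons learned" note loaded up front.
lessons_learned = ["Fiscal year starts in April."]
tuesday_warm = ChatSession(standing_context=list(lessons_learned))

print(len(monday.context()), len(tuesday_cold.context()), len(tuesday_warm.context()))
```

The model itself is identical in all three sessions; only what was placed in front of it differs, which is why the durable fix is a saved note rather than a repeated correction.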

Product features that "push the cliff out"

Feature 1: Memory

Claude, ChatGPT, and Gemini all have Memory: it saves selected facts across sessions.

You: "I work in finance in Vietnam, team of 20, targeting SMBs."
Memory: Saved ← saved to long-term memory
Tuesday, new chat: Claude automatically knows this context.

When to use: Standing facts about you, your work, and your preferences.

Limits:

Memory also has a size limit
You should review what has been saved
It does not replace project-specific context (use Projects for that)

Feature 2: Compaction / Summarization

When a chat gets long, the model summarizes the history and replaces the raw turns with the summary, which frees up the window.

Before compaction:
  [Msg 1]...[Msg 50] (window nearly full)

After compaction:
  [Compact summary of Msg 1-40]
  [Msg 41-50 raw]
  (room left for Msg 51, 52...)

Trade-off: You lose detail and keep the gist.

Feature 3: Projects / Workspaces

Claude Projects / ChatGPT Custom GPTs / Gemini Gems: standing docs + instructions are always available, with no need to paste them in again for each chat.

Workflow:
  Create Project "Customer Support"
  Upload: KB articles, tone guide, past examples
  Custom prompt: "You are support agent..."
  
  Every new chat in project: pre-loaded with all of the above, 
  no need to paste it again.

When to use: Recurring task types with stable reference material.

Feature 4: Skills

Anthropic Skills: a bundle of instructions + references. The model loads it when it matches the task, so it does not consume context all the time.

Skills "minimize context use until needed": they are loaded only when active.

Feature 5: Larger context windows

Windows have kept growing:

GPT-4 (2023): 8K tokens
Claude 2 (2023): 100K
Claude 3 (2024): 200K
Claude Opus 4.8 (2026): 1M tokens
Gemini 1.5 Pro: 2M tokens

The "cliff" moves further out. But it does not disappear: the window is still finite, a cliff is still a cliff, and you still have to manage it.

Context-as-leverage: techniques of your own

When you are in the capability zone, context is real leverage:

Technique 1: Front-loading

Rule: In long docs, lead with what matters most.

❌ Bury the critical instruction in the middle of 30 pages
✅ Critical instruction at the start of the system prompt + restated at the end of the user message

Technique 2: Chunking

Rule: Split big work into multiple passes instead of 1 giant upload.

Task: Review 200-page contract

❌ Paste all 200 pages, ask "Find issues"
   → Token limit + lost in the middle

✅ Split into 10 chunks x 20 pages. 
   Pass 1: Pages 1-20. Save findings.
   Pass 2: Pages 21-40. Save findings.
   ...
   Pass 10: Pages 181-200. Save findings.
   Synthesis pass: "Aggregate all findings, prioritize top 5 issues"

Technique 3: Re-supply critical context

Rule: If quality degrades in a long conversation, start fresh with a short summary.

After 50 messages in one chat:
  "I notice quality may be degrading. Starting fresh conversation.
  
  Context recap:
  - I'm working on [project]
  - Key decisions so far: A, B, C
  - Style guide: [paste]
  - Current question: [new question]"

Technique 4: Structure over narrative

Rule: Use clear structure (headers, bullets, tags) in the prompt so the model can attend to it easily.

❌ "I need help with an email to a customer, they're unhappy,
   I've already talked to them 3 times..."

✅ 
## Situation
Customer unhappy with delayed delivery

## Prior interactions
- Call 1 (Jan 15): [summary]
- Call 2 (Jan 22): [summary]
- Email (Jan 28): [summary]

## Goal
Draft apology email + recovery plan

## Constraints
- Max 200 words
- Tone: professional, empathetic
- Include specific compensation offer

Structured prompts help the model attend to the right things.

Technique 5: Explicit context markers

Rule: Use delimiters so the model can tell "this is input context" from "this is an instruction".

INSTRUCTIONS:
Review the document below. Find 3 issues. Output as bullet list.

--- DOCUMENT START ---
[Doc content]
--- DOCUMENT END ---

CRITICAL: Only output issues found in the document above.
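Front-loading, explicit delimiters, and restating the key point at the end can be combined in one small helper. This is a sketch under stated assumptions: `build_prompt` and the two delimiter constants are hypothetical names chosen to mirror the layout shown above, and nothing here calls a real model API.

```python
from typing import Optional

DOC_START = "--- DOCUMENT START ---"
DOC_END = "--- DOCUMENT END ---"


def build_prompt(instruction: str, document: str, reminder: Optional[str] = None) -> str:
    """Front-load the instruction, fence the document, restate the key point last."""
    if DOC_START in document or DOC_END in document:
        # A delimiter inside the content would blur the instruction/input boundary.
        raise ValueError("document already contains a delimiter line")
    # If no explicit reminder is given, repeat the instruction itself at the end.
    closing = reminder or f"CRITICAL: {instruction}"
    return "\n".join([
        "INSTRUCTIONS:",
        instruction,
        "",
        DOC_START,
        document.strip(),
        DOC_END,
        "",
        closing,
    ])


prompt = build_prompt(
    instruction="Review the document below. Find 3 issues. Output as bullet list.",
    document="[Doc content]",
    reminder="CRITICAL: Only output issues found in the document above.",
)
print(prompt)
```

Keeping the layout in one function means the instruction always sits at both edges of the prompt, however long the document in between grows.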

Examples by industry

⚖️ Legal Counsel: 50-page contract review

Pain point: "I paste a 50-page contract into Claude and ask for a review. Some issues on page 23 get missed."

Solution (chunking strategy):

Pass 1: Upload contract, analyze structure only.
        "Give me TOC + key clause locations."

Pass 2: "Review Sections 1-5 (pages 1-12). Find issues."
        [Save findings to notes]

Pass 3: "Review Sections 6-10 (pages 13-28). Find issues."
Pass 4: "Review Sections 11-15 (pages 29-42)."
Pass 5: "Review Sections 16-end (pages 43-50)."

Final: "Aggregate all findings. Rank by risk. 
       Draft email summary to client."

Result: Full coverage. Each chunk gets full attention. Saves 2 hours vs. manual review.

🔍 Research Analyst: literature review of 30+ papers

Pain point: "Uploading 30 PDFs at once → context overflow. Papers 25-30 get ignored."

Solution (batch + synthesis):

Phase 1: Catalog (batch of 5)
  5 PDFs at a time → "Summarize key finding + methodology"
  Save a summary.md for each batch.

Phase 2: Theme identification
  Load only summaries (not full papers) → "Cluster by theme"

Phase 3: Deep dive (per theme)
  Load 3-5 most relevant PDFs per theme → synthesize.

Phase 4: Final integration
  Load all theme summaries → final literature review draft.

Result: 30 papers → a proper lit review in 1 day (vs. 5 days manually).

📝 Content Marketer: multi-platform content repurposing

Pain point: "1 webinar → I need an email campaign + LinkedIn + Twitter + blog. I upload the 90-minute webinar transcript (= 20K tokens), then ask for all of those formats at once → quality is poor toward the end."

Solution (staged):

Stage 1: Upload transcript. "Extract top 10 insights with 
         supporting quotes."
         → insight-list.md (1K tokens)

Stage 2: New chat. Load insight-list only. "Create blog post."
Stage 3: New chat. Load insight-list only. "Create email 
         campaign: 5 emails."
Stage 4: New chat. Load insight-list. "Create 20 LinkedIn 
         posts with different angles."
Stage 5: New chat. Load insight-list. "Create 15 tweets."

Each stage's context fits the window → consistent quality.

💻 Developer: long code review

Pain point: "A 10-file repo needs review. The model misses bugs in files 7-10."

Solution:

Step 1: Upload file list + architecture diagram
Step 2: File by file, 1 file per turn
  "File: auth.py. Review for security issues."
  Save issues.
Step 3: After all files: "Synthesis: cross-file issues, 
         missing error handling patterns"

Also: A structured prompt with sections "### Security", "### Performance", "### Readability" helps the model attend evenly.

🎧 Customer Success Manager: QBR prep for 20 accounts

Pain point: "I'm preparing QBRs for 20 accounts, and each needs a custom deck. The context overflows if I paste usage data for all 20 accounts."

Solution:

Use Claude Projects:
  - Upload generic QBR template + brand guidelines (1 time)
  - Upload product catalog (1 time)
  
Per-account workflow:
  1. New chat in the Project
  2. Upload that 1 account's usage data + past interactions
  3. "Generate QBR deck for this account"
  4. Context-fit, high-quality output

Result: 20 accounts × 10 minutes ≈ 3.3 hours (vs. 2 days before).

Anti-patterns

❌ "More context = better results"

Why it's wrong: Lesson 17.8 goes deeper into "Context Degradation", but in short: too much context pushes the important material into the middle, where attention is weak, and quality drops.

The right way: Curate ruthlessly. Include what is relevant, cut what is not. Structure the prompt so the important parts sit at the edges.

❌ "The AI remembers everything I've said"

Why it's wrong: By default, no. Within a session: yes (until the window overflows). Across sessions: no (unless a Memory feature is in use).

The right way: Assume memory resets with every chat. Set up context at the start of each session, or use Projects/Memory.

❌ "Correct it once and Claude 'knows'"

Why it's wrong: A correction lives only in the current context. New session → forgotten.

The right way: If you keep making the same correction, save it into the system prompt / project instructions. Do not try to "train" through chat.

❌ "1 giant upload is better than 5 small chunks"

Why it's wrong: A giant upload → lost in the middle + possible overflow.

The right way: Use the chunk + synthesize pattern for large docs.
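The chunk + synthesize pattern is mechanical enough to sketch in code. The version below is a minimal illustration under stated assumptions: the document arrives as one string per page, `ask` is a hypothetical callable standing in for whatever model call you use, and `fake_ask` is a placeholder that only echoes the first line of the prompt so the sketch runs without a model.

```python
from typing import Callable, List


def chunk_pages(pages: List[str], pages_per_chunk: int) -> List[List[str]]:
    """Split a document (one string per page) into consecutive chunks."""
    if pages_per_chunk <= 0:
        raise ValueError("pages_per_chunk must be positive")
    return [pages[i:i + pages_per_chunk] for i in range(0, len(pages), pages_per_chunk)]


def review_in_passes(pages: List[str], pages_per_chunk: int, ask: Callable[[str], str]) -> str:
    """One pass per chunk, saving findings, then a final synthesis pass."""
    findings = []
    for n, chunk in enumerate(chunk_pages(pages, pages_per_chunk), start=1):
        first = (n - 1) * pages_per_chunk + 1
        last = first + len(chunk) - 1
        prompt = f"Review pages {first}-{last}. Find issues.\n\n" + "\n".join(chunk)
        findings.append(f"Pages {first}-{last}: {ask(prompt)}")
    # The synthesis pass sees only the saved findings, not the full document.
    synthesis = "Aggregate all findings, prioritize top 5 issues.\n\n" + "\n".join(findings)
    return ask(synthesis)


def fake_ask(prompt: str) -> str:
    # Placeholder for a real model call; it only echoes the prompt's first line.
    return "[model reply to: " + prompt.splitlines()[0] + "]"


contract = [f"(text of page {p})" for p in range(1, 201)]
print(len(chunk_pages(contract, 20)))
print(review_in_passes(contract, 20, fake_ask))
```

Each pass carries only its own chunk, and the final pass carries only the findings, so no single request has to hold the whole document.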

❌ "Keep every old message, just in case"

Why it's wrong: Old messages eat context budget, and attention gets spread thin.

The right way: Do periodic cleanup: delete irrelevant turns, or start a fresh chat with a summary.
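Replacing old turns with a summary is the same move that compaction features make. A minimal sketch, assuming a hypothetical `compact` helper and a `summarize` callable that you supply; the placeholder summarizer below only records how many messages were folded away, where a real one would ask a model for the gist.

```python
from typing import Callable, List


def compact(turns: List[str], keep_recent: int, summarize: Callable[[List[str]], str]) -> List[str]:
    """Replace all but the most recent turns with a single summary entry."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    if len(turns) <= keep_recent:
        return list(turns)                  # nothing old enough to compact
    cut = len(turns) - keep_recent
    old, recent = turns[:cut], turns[cut:]
    return [summarize(old)] + recent


def placeholder_summary(old_turns: List[str]) -> str:
    # Stand-in for a model-written summary; detail is lost, only a gist remains.
    return f"[Compact summary of {len(old_turns)} earlier messages]"


history = [f"Msg {i}" for i in range(1, 51)]
compacted = compact(history, keep_recent=10, summarize=placeholder_summary)
print(compacted[0], compacted[1], len(compacted))
```

The recent turns stay verbatim because they are the ones most likely to matter for the next reply; everything earlier is traded for a short entry that costs far fewer tokens.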

Advanced tips

Tip 1: A mental token counter

Estimate the tokens in your prompt:

1 word (English) ≈ 1.3 tokens
1 word (Vietnamese) ≈ 1.5-2 tokens (higher, because tokenizers typically split accented Vietnamese text into more pieces)
1 page (English, 300 words) ≈ 400 tokens
10 pages ≈ 4K tokens

Model limits:
  Claude 200K: ~150K words = ~500 pages
  Claude 1M:   ~750K words = ~2,500 pages
  GPT-4 128K:  ~95K words = ~320 pages

Knowing the rough number tells you when to start worrying.

Tip 2: A "context budget" plan

Before you ask, budget:

Total: 200K tokens available

System prompt:        2K   (1%)
Reference docs:      50K   (25%)
Conversation history: 20K  (10%)
Room for generation:  10K  (5%)
────────────────────────
Used:                82K   (41%)
Buffer:              118K  (59%) ← safety margin

If the budget is over 80% → chunk or start fresh.

Tip 3: The "structured handoff" pattern for long sessions

Every 20-30 turns, create a "handoff doc":

## Session Summary (Jan 15, 10:00 AM)
### Goals
- [Goal 1 from start of session]
- [Goal 2]

### Progress
- Completed: X, Y
- In progress: Z
- Decisions made: [list]

### Open questions
- [Q1]
- [Q2]

### Next actions
- [Action 1]
- [Action 2]

Start a new chat with the handoff doc → continue with the key context intact.

Tip 4: A "standing context" file

Maintain 1 master context file:

# [Your name] Context File

## Role
[Your job + industry]

## Team & environment  
[Company, team, tools]

## Current projects
- [Project A]: status, deadlines
- [Project B]: status
- ...

## Style preferences
- [Communication style]
- [Document templates]

## Hot topics (last 30 days)
- [Current discussions]

Update it weekly. Paste it into every new chat if you are not using Memory.

Tip 5: "Attention anchors" in a long prompt

When the prompt is long, use markers:

<important>
CRITICAL: Output must follow schema X. Violations = fail.
</important>

[... long content ...]

<reminder>
Before answering, verify your output against the schema X 
mentioned at the top.
</reminder>

Markers at the start + end put the key instruction in both high-attention positions.
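The rough token arithmetic in this lesson (about 1.3 tokens per English word, 1.5-2 per Vietnamese word, 300 words per page) is easy to turn into a quick estimator. The sketch below assumes those rules of thumb and takes 1.75 as the midpoint for Vietnamese; `estimate_tokens` and `pages_that_fit` are hypothetical helper names, and the result is an estimate only, since real counts depend on the specific tokenizer.

```python
# Rules of thumb only; real counts depend on the model's tokenizer.
TOKENS_PER_WORD = {"english": 1.3, "vietnamese": 1.75}  # 1.75 = midpoint of 1.5-2
WORDS_PER_PAGE = 300


def estimate_tokens(word_count: int, language: str = "english") -> int:
    """Rough token estimate for a text of the given word count."""
    return round(word_count * TOKENS_PER_WORD[language])


def pages_that_fit(window_tokens: int, language: str = "english") -> int:
    """Roughly how many 300-word pages fit in a window of this size."""
    tokens_per_page = WORDS_PER_PAGE * TOKENS_PER_WORD[language]
    return int(window_tokens // tokens_per_page)


print(estimate_tokens(300))                       # one English page
print(pages_that_fit(200_000))                    # English pages in a 200K window
print(pages_that_fit(200_000, "vietnamese"))      # fewer pages for Vietnamese text
```

The same window holds noticeably fewer pages of Vietnamese than of English, which is worth remembering when a document that "should fit" starts behaving as if it does not.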

Apply it now

Exercise 1: The Before-and-After (~25 minutes)

Why: Context is leverage. The same task, given the right context, can go from a "mediocre first draft" to "genuinely useful". This exercise makes that concrete.

Step 1: Pick 1 task from Lesson 17.0 whose payoff depends on context only you have: a style guide, a past example of good work, or constraints specific to your role or audience.

In 2-3 lines, write down what "good" looks like for this task's output, clearly enough that a stranger could evaluate it.

Probe 1: Cold start vs. context

Ask for the task with zero context. A bare request. Save the output.

Then start a fresh conversation and run the same task, this time supplying the style guide, past example, or constraints upfront.

Compare the 2 outputs against your definition of "good". Measure the gap.

Probe 2: Lost in the middle

Take 1 long document (or paste 3-4 paragraphs of reference material). Bury 1 specific, important instruction in the middle of it.

Ask a question whose correct answer depends on that buried instruction.

Does the AI catch the instruction?

Now move the instruction to the top of the doc. Ask again. Compare.

Probe 3: Blank slate

Have a short exchange in which you teach the AI something specific about your work context, or correct something it got wrong.

Open a brand-new conversation. Ask a question that assumes it remembers the lesson. Watch it start from zero.

Annotation:

Go back to your task list and add a third annotation to each task:

Task needs standing context set up (project, saved instructions, uploaded refs) to be worth running
Task works fine cold

Stretch goal: If the tool you use has a Memory / Project feature, set one up with the context from Probe 1. Run the task again. Compare effort + quality against the cold start.

Exercise 2 (optional): Context budget

Take 1 task you run on a recurring basis (weekly). Calculate its token budget:

System prompt:        _____ tokens
Reference docs needed: _____ tokens
Typical conversation: _____ tokens
Output expected:      _____ tokens
─────────────────────────────────
Total:                _____ tokens

Model limit:          _____ tokens
% utilization:        _____%

If it is over 60%, plan for chunking or a project setup.
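The budget worksheet is simple addition and one division, so it can be filled in with a short script. The sketch below uses a hypothetical `context_budget` helper, and the numbers are the illustrative 200K-window figures from the context budget plan in this lesson, not measurements; replace them with your own estimates.

```python
def context_budget(window_tokens: int, **parts: int) -> dict:
    """Sum the planned token uses and report utilization of the window."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    used = sum(parts.values())
    return {
        "used": used,
        "buffer": window_tokens - used,
        "utilization_pct": round(100 * used / window_tokens, 1),
    }


budget = context_budget(
    200_000,
    system_prompt=2_000,
    reference_docs=50_000,
    conversation_history=20_000,
    room_for_generation=10_000,
)
print(budget)
if budget["utilization_pct"] > 60:
    print("Over 60%: plan chunking or a project setup.")
```

A negative `buffer` means the plan already exceeds the window before the conversation starts, which is the clearest signal to chunk.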

Lesson reflection

How much did front-loading context change the output quality? Was the gap to your expectation large or small?
What is 1 piece of standing context you will set up this week so you can stop re-explaining yourself?
Was there a recent conversation where you experienced "drift" that you now recognize as the Working Memory cliff?

Lesson summary

🎯 Working Memory is the fact that the AI attends to a fixed-size context window.

🎯 Capability zone: the material fits comfortably, the session is current, and you supply the relevant context.

🎯 Limitation zone: very long docs/conversations, expecting continuity across sessions, burying critical info in the middle of a long input.

🎯 This property has a CLIFF, not a gradient. Silent truncation is the main failure mode. You are not always warned.

🎯 The model does not learn from correction. It only responds to what is currently in context.

🎯 Memory features, compaction, projects, larger windows, and multi-agent workflows all exist to push the cliff further out.

🎯 4D connection: Working Memory is what Description acts on. Understanding how the window works tells you how to structure context, when to front-load, and when to start fresh.

References

Stanford: "Lost in the Middle" (Liu et al., 2023), research on positional attention
Anthropic: Claude 1M context window release (2026)
Anthropic: Memory feature (2025)
Anthropic: Projects & Skills, workspace features
Lesson 17.8: Context Degradation (hands-on)
Lesson 17.11: Working Memory × Steerability = long conversation drift
