2025 Update
Jonas Heinke
December 19, 2025
Welcome! This is an update to my Nerdicon talk from June. Focus: how AI works, why it keeps improving, and practical utility for game developers.
"Beliefs are for being true. Use them for nothing else.
If you need a good thing to happen, use a plan for that."
— Eliezer Yudkowsky
Sets the epistemic stance for the talk. We're here to understand what's true, not to cheerlead or fearmonger.
Structure: Foundation → Why it improves → How to use it → What to do with it → Ethics → Action. About 45-60 minutes + Q&A.
How AI Works
Section 1: The technical foundation. Not too deep, but deep enough to follow the rest of the talk. Goal: Demystify the "black box" - it's prediction, not magic.
This is the whole picture. Don't worry about understanding everything. Key phases: Pretraining (learn from data) → Post-training (make it useful) → Inference (answer questions). Compute/data/etc feed pretraining, human preferences feed the reward model, both combine in post-training.
The architecture that made modern AI possible
Every word attends to every other word simultaneously.
Connect words across the entire context to disambiguate meaning.
Layers stack: grammar → concepts → intentions.
This is why context matters: the model uses all of it.
Don't get lost in the weeds. Key insight: attention lets the model focus on what matters.
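To make "attention" concrete, here is a minimal sketch of scaled dot-product attention in Python with numpy. It's a toy with random vectors, not any real model's weights; in a real transformer, Q, K, and V come from learned projections of the token embeddings.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: every token scores every other token,
    then takes a weighted blend of their values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (tokens, tokens): all-pairs relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                              # each output mixes the whole context

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # 4 tokens, 8-dim embeddings (toy numbers)
print(attention(x, x, x).shape)  # self-attention -> (4, 8)
```

This is why context matters mechanically: every output is a weighted sum over every token in the window.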
Learn from vast amounts of text by predicting the next word
Trillions of examples. Learns grammar, facts, reasoning patterns.
Fine-tune with human feedback to be helpful, harmless, honest
Humans rate responses. Model learns what humans prefer.
You ask a question, model generates response token by token
Each token is a prediction: "What word should come next?"
Pretraining = raw capability from data. Post-training = make it useful/safe. Key insight: It's all prediction. Not understanding, not thinking - prediction at massive scale.
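To see what "token by token" means, here is a toy generation loop. A hand-made lookup table stands in for the neural network; the real model produces a probability distribution over its whole vocabulary at each step, but the control flow is the same: predict, append, repeat.

```python
# Toy stand-in for the model: given the last token, "predict" the next one.
NEXT = {"<start>": "the", "the": "goblin", "goblin": "drops", "drops": "loot"}

def generate(max_tokens=10):
    tokens = ["<start>"]
    for _ in range(max_tokens):  # one prediction per output token
        nxt = NEXT.get(tokens[-1])
        if nxt is None:          # no continuation -> stop
            break
        tokens.append(nxt)
    return " ".join(tokens[1:])

print(generate())  # "the goblin drops loot"
```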
Everything above shapes what the model outputs. You control 4b, 4d, 4e, and sometimes 5.
"Prediction at scale becomes indistinguishable from reasoning"
Three ideas that flow: mechanism → emergent capability → practical implication. Epistemically honest about the reasoning question while acknowledging real capabilities.
Scaling Laws & Forecasts
"The more compute and data you put in, the more intelligence you get out. This effect is so clear and so important that I call the period since 2016 the scaling era of AI."
— Dwarkesh Patel, "The Scaling Era"
Bigger models trained on more data = predictably better performance
This is THE insight. Everything else follows from this. Scaling laws let you predict performance before training. That's why labs invest billions.
More compute → predictably lower loss
Different model sizes (75M–10B parameters) all follow the same pattern
The relationship is smooth and predictable. More compute reliably reduces loss.
You can predict performance before training. This is why labs invest billions.
Similar patterns hold for post-training and inference-time compute.
This is the Chinchilla paper that changed how labs think about training. The same power-law pattern appears in post-training (RLHF, fine-tuning) and inference scaling (test-time compute, chain-of-thought). More compute, applied smartly, predictably improves results.
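As a sketch of what "predictable" means here: scaling laws take a power-law form, loss as a smooth function of compute. The constants below are invented for illustration, not the actual Chinchilla fit.

```python
# Illustrative power law: loss falls smoothly and predictably with compute.
# c, alpha, and irreducible are made-up constants, not the Chinchilla numbers.
def predicted_loss(compute_flops, c=1e9, alpha=0.05, irreducible=1.7):
    return irreducible + (c / compute_flops) ** alpha

for flops in (1e20, 1e22, 1e24):
    print(f"{flops:.0e} FLOPs -> predicted loss {predicted_loss(flops):.3f}")
# 1e+20 FLOPs -> predicted loss 1.982
# 1e+22 FLOPs -> predicted loss 1.924
# 1e+24 FLOPs -> predicted loss 1.878
```

Fit the curve on small runs, then read off where a 1000x bigger run should land. That is the bet the labs are making.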
"One of the first things Ilya Sutskever said to me was, 'Look. The models just want to learn. You have to understand this.' It was a bit like a Zen koan. I listened to this and I became enlightened."
— Dario Amodei, CEO of Anthropic
The architecture matters less than the scale.
Costs fall ~50x per year for equivalent performance
Declines range from 9x to 900x per year depending on the task
What's "too expensive" today will be cheap in 12-18 months
This is the economic engine. Capabilities go up, costs go down. Plan your projects with this trajectory in mind. The fastest declines (900x/year) are in narrow tasks; slower declines (~9x/year) in general knowledge tasks like MMLU.
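The arithmetic is worth internalizing. A rough projection, assuming the ~50x/year decline holds:

```python
# If price-per-capability falls ~50x per year, a task costing $10 today:
cost_today = 10.00
for months in (6, 12, 18):
    print(f"in {months:2d} months: ${cost_today / 50 ** (months / 12):.2f}")
# in  6 months: $1.41
# in 12 months: $0.20
# in 18 months: $0.03
```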
Task length AI agents can complete autonomously (p50 horizon)
6 years of measured data. This is the most important chart for understanding AI progress. From chatbot to coworker in 6 years. And the trend continues. The p50 horizon is the length of task an AI can complete with a 50% success rate.
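The doubling math compounds fast. A sketch, using an illustrative 1-hour starting horizon rather than METR's exact current figure:

```python
# If the p50 task horizon doubles every ~7 months, starting from ~1 hour:
horizon_hours, doubling_months = 1.0, 7
for months in range(0, 43, 7):
    h = horizon_hours * 2 ** (months / doubling_months)
    print(f"+{months:2d} months: ~{h:.0f}-hour tasks")
# 42 months out, that's ~64-hour tasks: multi-day projects, not prompts.
```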
Major releases every 2-4 months. Year-old knowledge is outdated.
"I skate to where the puck is going to be, not where it has been."
— Wayne Gretzky, ice hockey legend
What's frustrating today will be easy in 12 months. What's impossible will be frustrating.
Build workflows that get better as models improve. Don't optimize for current limitations.
Costs down 50x/year, capabilities doubling every 7 months. Plan accordingly.
This is the strategic takeaway from scaling laws. Don't build around current limitations - build for where things are going. The "Use Frontier Models" and "Prioritize Capability" advice moves to Mundane Utility section.
How to Work with AI
* Theory of Mind: the ability to model another agent's mental states — what they know, believe, and intend
Treat AI like a brilliant colleague having their first day at your studio
A genius with anterograde amnesia: encyclopedic knowledge, but no memory of your studio between sessions (Hayek's local knowledge problem, but for AI).
You can only get what you ask for. Knowing what to ask is the hard part.
Domain expertise becomes communication power.
Instead of:
"Implement inventory system using these 5 data structures I specified"
Try:
"I need an inventory system for my roguelike. You've probably seen hundreds of implementations - what would you do here and why?"
Then: Verify assumptions, correct misconceptions, iterate
AI is a collaborator, not a replacement for thinking
AI amplifies your capabilities. Make sure you have capabilities worth amplifying.
Practical Applications for Game Dev
Claude: Soul
GPT: Best on complex math/reasoning
Gemini: Best multimodality
Pay for capability. Use thinking models.
Free versions are significantly worse. Any of the three will work - pick one and commit.
The paid tiers are worth it - the capability gap is substantial. Each has strengths: Claude leads on SWE-bench and has the excellent Claude Code CLI. GPT-5.2 Thinking Pro beats the others on complex, long-horizon math and reasoning tasks. Gemini 3 has a massive 2M-token context and native audio/video understanding. All three have CLI tools. Pick one and commit.
| 天命 Mandate | Google DeepMind, Anthropic |
| 衰 Waning | OpenAI |
| 瞬 Fleeting | xAI (Grok) |
| 龍 Hidden Dragon | DeepSeek, Kimi, MiniMax |
| 落 Fallen | Meta, Mistral |
This will be outdated by the time you see it. Such is the way.
Pick a task where YOU have deep expertise
Re-run periodically on new models. Track progress yourself.
Generic benchmarks don't tell you if AI helps YOUR specific domain. Pick something you've done professionally for years. When a new model drops, run your benchmark. You'll develop intuition for real capability vs hype.
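A minimal harness sketch using the Anthropic Python SDK; the prompts, output layout, and model name are placeholders, not a recommendation. Score the outputs yourself against your own expert judgment, and keep the results per model version.

```python
# Hypothetical personal benchmark: your hardest real prompts, saved per model.
from pathlib import Path
import anthropic

PROMPTS = [
    "Design a loot-drop system for a roguelike with 200 items. Explain trade-offs.",
    "Find the flaw in this pathfinding approach for a 512x512 tile map: ...",
]

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
Path("benchmark_runs").mkdir(exist_ok=True)

for i, prompt in enumerate(PROMPTS):
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder: swap in whatever model drops next
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    Path(f"benchmark_runs/{i:02d}.md").write_text(msg.content[0].text)
```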
Without a style guide: anything goes.
With a style guide: what you actually want.
Style Guide → Clear Target.
This might be familiar to those who have written a GDD or style guide.
Style Guide.
# item-descriptions/SKILL.md
---
name: item-descriptions
description: Generate item descriptions for our roguelike
---
## Voice & Tone
- Terse, punchy. Max 15 words.
- Dark humor okay, never campy.
- Reference mechanics, not just flavor.
## Examples
✓ "Rusted blade. Still sharp. Previous owner wasn't."
✗ "A mystical sword imbued with ancient power!"
Write once, use forever. Works for validation too.
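Validation can even be mechanical for the rules that are mechanical. A sketch: checking generated descriptions against the checkable parts of the skill (word count, exclamation marks as a crude proxy for "campy"); tone itself still needs a human or a model pass.

```python
# Crude validator for the mechanically checkable SKILL.md rules.
def validate(description: str) -> list[str]:
    problems = []
    if len(description.split()) > 15:
        problems.append("over 15 words")
    if "!" in description:
        problems.append("exclamation mark (campy risk)")
    return problems

print(validate("Rusted blade. Still sharp. Previous owner wasn't."))  # []
print(validate("A mystical sword imbued with ancient power!"))
# ['exclamation mark (campy risk)']
```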
Style Guide.
## Win95 Aesthetic
- Beveled borders: #fff/#808080
- System gray: #c0c0c0
- Selection: navy inverted
## Typography
- Monospace, uppercase
- Tabular nums for stats
- [H]otkey hints in status
| 01 | RUSTED BLADE | +12 |
| 02 | IRON SHIELD | +08 |
| 03 | HEALTH VIAL | x03 |
Feed it your existing CSS. It learns your patterns.
Use Cursor, Claude Code, or Codex. It's worth paying for.
Documentation: API docs, comments, READMEs
Prototyping: generate, then iterate
Balance analysis: simulations & statistics
Refactoring: multi-file with context
Code review: bugs & style issues
Debugging: stack traces & fixes
Every best practice you already use (TDD, CI/CD, code review) works with these tools.
Unit tests for game design.
Hypothesis: "Two-handers should beat sword+shield in DPS"
Simulate: 1,000 fights, varied builds, edge cases
Analyze: statistical outliers, failure modes
Iterate: tweak values, re-test in minutes
Then validate with real playtesting. AI finds where to look.
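A minimal sketch of the loop, with invented placeholder stats; plug in your real combat code where `fight_dps` stands.

```python
# "Unit test" for the hypothesis: two-handers should beat sword+shield in DPS.
import random
import statistics

def fight_dps(weapon_dmg, swing_time, crit=0.1, variance=0.2, seconds=60):
    """Toy combat stand-in: replace with your actual combat simulation."""
    total, t = 0.0, 0.0
    while t < seconds:
        dmg = weapon_dmg * random.uniform(1 - variance, 1 + variance)
        if random.random() < crit:
            dmg *= 2  # crits double damage in this toy model
        total += dmg
        t += swing_time
    return total / seconds

two_hander   = [fight_dps(weapon_dmg=48, swing_time=3.2) for _ in range(1000)]
sword_shield = [fight_dps(weapon_dmg=22, swing_time=1.8) for _ in range(1000)]

print(f"two-hander   mean DPS: {statistics.mean(two_hander):.1f}")
print(f"sword+shield mean DPS: {statistics.mean(sword_shield):.1f}")
# Inspect the distributions and outliers, tweak values, re-run in minutes.
```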
AI doesn't get bored, frustrated, or sloppy at 2am.
Something > Nothing. Good enough beats skipped entirely.
.claude/
├── settings.local.json
├── agents/
│ └── documentation-audit.md
├── commands/
│ └── doc-audit.md
├── skills/
│ ├── changelog-updater/
│ └── documentation-updater/
└── hooks/
└── changelog-reminder.sh
CLAUDE.md alone saves hours of repeated explanations per session.
One sentence gets code that already follows your project's rules.
Daily updates + periodic audits = sustainable docs.
Incremental updates
Catch accumulated drift
Neither alone is complete. Skill handles daily work, audit catches what slipped through.
"Love with your heart. Use your head for everything else." — Captain Disillusion
Let's talk about some difficult, complex things with high uncertainty.
Think about them. Be wary of those with high confidence on topics of low certainty.
Disclaimer: This is my personal opinion. Norms vary wildly. When in doubt, ask.
"What's dry vs wet Claude?"
Maybe don't lead with AI at all. Read the room. Some studios have strong feelings.
In both cases: I hire you for you. The AI is everywhere—what makes you different?
Chart: Andy Masley
1 prompt = driving 4 feet
Making 1 pair of jeans = 5 million prompts
0.11% of world emissions
Focus on systemic change, not guilt over individual prompts.
Andy Masley's analysis: individual prompt impact is negligible. The real issue is grid infrastructure lagging behind demand concentration, not data center efficiency (hyperscale PUE ~1.1). Boycotting AI for climate reasons is misallocated activism.
So here's a chart.
Chart: Epoch AI
One prompt ≈ 0.3 Wh.
1,000 prompts = 1% of your daily energy use.
Epoch AI's pessimistic estimates for GPT-4o. Typical query (<100 words) ≈ 0.3 Wh. Long-input (~7,500 words) ≈ 2 Wh. Maximum context (~75,000 words) ≈ 38 Wh. For comparison: microwaving for 30 seconds ≈ 8 Wh; the average US household uses ~18 Wh per minute.
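The slide's "1%" claim checks out as rough arithmetic, taking the ~18 Wh/minute household figure above as the baseline:

```python
# 1,000 prompts at the pessimistic ~0.3 Wh each, vs. average US household use.
wh_per_prompt = 0.3
household_wh_per_day = 18 * 60 * 24  # ~18 Wh/min -> ~25,920 Wh/day
share = 1000 * wh_per_prompt / household_wh_per_day
print(f"1,000 prompts = {1000 * wh_per_prompt:.0f} Wh = {share:.1%} of a day")
# 1,000 prompts = 300 Wh = 1.2% of a day
```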
OpenAI rolled back an update after ChatGPT became "too sycophant-y and annoying" (Sam Altman)
AI that only tells you what you want to hear is not useful. Demand pushback.
AI is the best tool ever invented for learning.
AI is the best tool ever invented for not learning.
Which way, modern man?
Title reference: Neal Stephenson's "The Diamond Age" — the Primer is an AI tutor that raises the protagonist. The Zvi quote captures the core tension perfectly: AI can accelerate learning or become a crutch that prevents it. The choice is ours.
Yes.
20%+ swings are plausible. Markets do that.
No. (yet.)
Language Models Offer Mundane Utility.
The chart shows 24-month forward P/E ratios. Current AI giants (NVIDIA ~28x, Microsoft ~28x) are elevated but nowhere near Cisco at 100x or Industrial Bank of Japan at 140x during their respective bubbles.
The labs genuinely believe they're racing to build superintelligence. Whoever gets there first wins everything.
Deutsche Bank Research (2025), McKinsey (2025), Chart: Federal Reserve Bank of Dallas
Chart: Andy Jones
The crux: Transition speed. Decades = manageable. Years = crisis.
The horse analogy is provocative but may not apply. Horses couldn't own capital, vote, or retrain. Humans can. But 200 years of "this time is different" has always been wrong. The honest answer: we don't know yet.
I have 3 guest passes for 7 days of Claude Pro access. Come find me after the talk!