Transformers for LLM Users: What Developers Actually Need to Know
Learn the transformer concepts that actually matter for developers: context windows, tokenization, sampling parameters, and cost optimization for Claude and Gemini.
TL;DR
Understanding context windows, tokenization, and sampling parameters will save you money and help you debug LLM issues. Claude Opus 4.5 has 200K context at $5/$25 per 1M tokens; Gemini 3.0 Pro has 1M context at $2/$12 per 1M tokens.
Key Takeaways
1. Context windows limit how much text the model can process at once (200K-1M tokens)
2. Code tokenizes worse than prose, often needing up to twice as many tokens for the same content
3. Use temperature 0-0.2 for code generation, higher for creative tasks
4. Always check the stop_reason to catch truncated responses
5. Gemini is cheaper for volume, Claude excels at complex reasoning
You don’t need to understand backpropagation to build applications with LLMs. But understanding a handful of key concepts will save you money, help you debug issues faster, and get consistently better results.
This guide covers what you actually need to know as a developer working with LLM APIs. No machine learning degree required.
Attention: The Core Innovation
The transformer architecture, introduced in the 2017 paper Attention Is All You Need, powers every modern LLM. Its key innovation is the attention mechanism, which lets the model consider all parts of the input simultaneously rather than processing text sequentially.
Think of it like reading with multiple highlighters. When the model generates a response to your code question, it highlights relevant parts of your input (function names, variable types, error messages) and weighs their importance for each word it generates.
Why does this matter to you? It explains why LLMs understand context rather than just predicting the next word. When you ask “what does this function do?”, the model attends to the function definition, its usage patterns, and surrounding code all at once.
That’s all you need to know about attention. Let’s move to the concepts that directly affect your code and your wallet.
Context Windows: Your Working Memory Budget
The context window is the total amount of text an LLM can process in a single request. This includes both your input and the model’s output. Think of it as working memory: everything must fit, and what doesn’t fit gets ignored.
Current Limits (January 2025)
| Model | Context Window | Max Output |
|---|---|---|
| Claude Opus 4.5 | 200K tokens | 64K tokens |
| Gemini 3.0 Pro | 1M tokens | 64K tokens |
To put this in perspective:
- 200K tokens ≈ 150,000 words ≈ a 500-page book
- 1M tokens ≈ 750,000 words ≈ several novels
For most applications, Claude’s 200K is more than sufficient. Gemini’s 1M window becomes valuable when you’re processing entire codebases, long documents, or maintaining very long conversation histories.
The “Lost in the Middle” Problem
Research has shown that LLMs attend more strongly to content at the beginning and end of the context window. Information buried in the middle may receive less attention.
Practical tip: Front-load important context. Put your key instructions and critical code at the start of your prompt, not buried after pages of background.
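As a rough illustration (the variable names and wording here are just placeholders, not a required format), a code-review prompt might be assembled with the instruction and the critical code first and background material last:

// Hypothetical placeholders: the code under review and lower-priority reference material.
const criticalFunctionSource = 'function parsePrice(input: string) { /* ... */ }';
const backgroundNotes = 'Style guide excerpts, related tickets, earlier discussion...';

// Key instruction and critical code first; background last, where it competes least for attention.
const prompt = [
  'Review the following function for correctness and suggest fixes.',
  criticalFunctionSource,
  'Background (lower priority):',
  backgroundNotes,
].join('\n\n');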
What Counts Against Your Context
Everything in the conversation consumes tokens:
- Your system prompt
- Previous messages in the conversation
- The current user message
- The model’s response
This is why long chat conversations eventually degrade. Early context gets pushed out or receives less attention.
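If you manage conversation history yourself, one rough way to stay inside the window is to trim the oldest messages against a token budget. This is only a sketch: the 4-characters-per-token heuristic and the function names are assumptions for illustration, and the token-counting endpoints shown later give exact numbers.

type ChatMessage = { role: 'user' | 'assistant'; content: string };

// Rough heuristic: ~4 characters per token for English text (an approximation, not exact).
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

// Drop the oldest messages until system prompt + history + reply headroom fits the budget.
function trimHistory(
  systemPrompt: string,
  history: ChatMessage[],
  contextBudget: number,   // e.g. 200_000 for Claude Opus 4.5
  outputHeadroom: number   // tokens reserved for the model's reply
): ChatMessage[] {
  const budget = contextBudget - outputHeadroom - approxTokens(systemPrompt);
  const kept: ChatMessage[] = [];
  let used = 0;
  // Walk backwards so the most recent messages survive.
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = approxTokens(history[i].content);
    if (used + cost > budget) break;
    kept.unshift(history[i]);
    used += cost;
  }
  return kept;
}

// e.g. trimHistory(systemPrompt, messages, 200_000, 4_096)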
Tokenization: Why Your Code Costs More Than Prose
LLMs don’t process text character by character. They use tokenization to split text into chunks called tokens. Understanding this explains why code is more expensive than English prose.
How Tokenization Works
Most LLMs use Byte Pair Encoding (BPE), which learns common patterns from training data. Frequent words become single tokens; rare patterns get split into multiple tokens.
// Token counts (approximate)
"hello" → 1 token
"the" → 1 token
"useState" → 2 tokens (use + State)
"handleSubmit" → 3 tokens (handle + Sub + mit)
"@tanstack/query" → 5 tokens
"XMLHttpRequest" → 4 tokens
Code tokenizes poorly because:
- camelCase and PascalCase get split at unusual boundaries
- Symbols like {}, =>, and && often become separate tokens
- Package names and paths are rarely in training data
- Indentation (especially tabs) adds tokens
The same logic expressed in natural English typically uses 30-50% fewer tokens than code.
Current Pricing (January 2025)
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Claude Opus 4.5 | $5.00 | $25.00 |
| Gemini 3.0 Pro (≤200K context) | $2.00 | $12.00 |
| Gemini 3.0 Pro (>200K context) | $4.00 | $18.00 |
Output tokens cost more because generation is computationally harder than processing input.
Counting Tokens in Your Code
Both APIs return token counts in their responses. You can also count before sending:
// Using Anthropic's SDK
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
const yourPrompt = 'Explain what this function does: ...'; // whatever you plan to send

const result = await client.messages.countTokens({
  model: 'claude-opus-4-5-20251101',
  messages: [{ role: 'user', content: yourPrompt }]
});
console.log(`Input tokens: ${result.input_tokens}`);

// For Gemini, use the countTokens endpoint
import { GoogleGenerativeAI } from '@google/generative-ai';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: 'gemini-3-pro' });

const geminiResult = await model.countTokens(yourPrompt);
console.log(`Token count: ${geminiResult.totalTokens}`);
Cost-Saving Tips
- Choose the right model for the task. Gemini 3.0 Pro costs 60% less than Claude Opus 4.5 on input tokens. Use it for high-volume tasks where Claude’s stronger reasoning isn’t essential.
- Use batch processing. Both providers offer ~50% discounts for asynchronous batch requests when you don’t need real-time responses.
- Enable prompt caching. If you’re sending the same system prompt or reference documents repeatedly, caching can save up to 90% on those tokens (see the caching sketch after this list).
- Be selective with context. Don’t dump your entire codebase into every request. Send only what’s relevant.
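As a sketch of the prompt-caching tip, here’s roughly what it looks like with the Anthropic SDK: you mark a large, stable block (here a hypothetical styleGuide string) with cache_control so repeated requests can reuse it. Check the provider docs for exact eligibility rules and minimum cacheable sizes.

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// Hypothetical large, rarely-changing reference text sent with every request.
const styleGuide = '...';

const response = await client.messages.create({
  model: 'claude-opus-4-5-20251101',
  max_tokens: 1024,
  system: [
    // Marking the block as cacheable lets identical repeats be billed at the cached rate.
    { type: 'text', text: styleGuide, cache_control: { type: 'ephemeral' } },
  ],
  messages: [{ role: 'user', content: 'Does this diff follow our style guide? ...' }],
});

// usage reports how the request was billed.
console.log(response.usage);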
Sampling Parameters: Controlling Randomness
When an LLM generates text, it predicts probabilities for possible next tokens. Sampling parameters control how the model selects from these possibilities.
Temperature
Temperature controls randomness. Claude accepts values from 0 to 1, Gemini from 0 to 2:
- 0.0 = Effectively deterministic. The model picks the highest-probability token, so the same input produces (nearly) the same output.
- 1.0 = Standard sampling. Balanced creativity.
- Above 1.0 = Increasingly random. The model considers lower-probability tokens more often.
Recommendations by task:
| Task | Temperature | Why |
|---|---|---|
| Code generation | 0.0 - 0.2 | You want correct, consistent code |
| Code review | 0.2 - 0.4 | Slight variation in phrasing is fine |
| Documentation | 0.5 - 0.7 | More natural language variation |
| Brainstorming | 0.8 - 1.0 | Encourage diverse ideas |
Top-p (Nucleus Sampling)
Top-p limits token selection to a cumulative probability threshold. With top_p: 0.9, the model only considers tokens that together make up 90% of the probability mass, ignoring the long tail of unlikely tokens.
For most use cases, leave top-p at its default (usually 1.0) and adjust temperature instead. Only tune top-p if you need fine-grained control over output diversity.
Setting Parameters in Code
// Claude
const response = await client.messages.create({
  model: 'claude-opus-4-5-20251101',
  max_tokens: 2048,
  temperature: 0, // Deterministic for code generation
  messages: [
    { role: 'user', content: 'Write a function to validate email addresses' }
  ]
});

// Gemini
const result = await model.generateContent({
  contents: [{ role: 'user', parts: [{ text: prompt }] }],
  generationConfig: {
    temperature: 0,
    maxOutputTokens: 2048,
  }
});
Debugging Tip
Getting inconsistent outputs from the same prompt? Check your temperature first. A temperature of 0.7 or higher will produce different responses each time, which may or may not be what you want.
Reading API Responses Like a Pro
Both Claude and Gemini return metadata that helps you understand what happened and debug issues.
Key Fields to Monitor
Claude response:
{
  "id": "msg_01XFDUDYJgAACzvnptvVoYEL",
  "type": "message",
  "role": "assistant",
  "content": [{ "type": "text", "text": "..." }],
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 1523,
    "output_tokens": 847
  }
}
Gemini response:
{
  "candidates": [{
    "content": { "parts": [{ "text": "..." }] },
    "finishReason": "STOP"
  }],
  "usageMetadata": {
    "promptTokenCount": 1523,
    "candidatesTokenCount": 847,
    "totalTokenCount": 2370
  }
}
Stop Reasons Explained
| Claude | Gemini | Meaning |
|---|---|---|
| end_turn | STOP | Completed normally ✓ |
| max_tokens | MAX_TOKENS | Hit output limit ⚠️ |
| stop_sequence | STOP | Hit a custom stop sequence |
| refusal | SAFETY | Blocked by safety filters |
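A minimal sketch of acting on these fields, reusing the Claude client and prompt from the earlier snippets (the handling shown is illustrative):

const response = await client.messages.create({
  model: 'claude-opus-4-5-20251101',
  max_tokens: 1024,
  messages: [{ role: 'user', content: prompt }],
});

// Log usage so unexpected token growth shows up early.
console.log(`tokens in/out: ${response.usage.input_tokens}/${response.usage.output_tokens}`);

if (response.stop_reason === 'max_tokens') {
  // Truncated: raise max_tokens, or split the task into smaller requests.
  console.warn('Response was cut off before completion.');
}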
Red Flags to Watch
- Frequent max_tokens stops: Your responses are being truncated. Either increase max_tokens or break your task into smaller chunks.
- Token count much higher than expected: You might be accidentally including large files, repeated content, or verbose system prompts.
- High latency: For real-time applications, consider streaming responses (see the streaming sketch after this list) or using smaller, faster models for appropriate tasks.
- Empty or minimal responses with safety blocks: Your prompt may be triggering content filters. Rephrase to be more specific about legitimate use cases.
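On the latency point, both SDKs support streaming. Here’s a rough sketch with the Anthropic SDK, again reusing the client from earlier, so users see text as it’s generated rather than waiting for the full response:

const stream = client.messages.stream({
  model: 'claude-opus-4-5-20251101',
  max_tokens: 1024,
  messages: [{ role: 'user', content: 'Summarize this changelog: ...' }],
});

// Print text deltas as they arrive instead of waiting for the full message.
stream.on('text', (delta) => process.stdout.write(delta));

const finalMessage = await stream.finalMessage();
console.log(`\nstop_reason: ${finalMessage.stop_reason}`);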
Quick Cost Mental Math
Here’s a simple formula to estimate costs:
Cost = (input_tokens × input_price + output_tokens × output_price) / 1,000,000
Example: Generating Documentation
Suppose you want to generate documentation for 100 TypeScript files, averaging 500 input tokens and 300 output tokens per file.
Claude Opus 4.5:
- Input: 50,000 tokens × $5.00 / 1M = $0.25
- Output: 30,000 tokens × $25.00 / 1M = $0.75
- Total: $1.00
Gemini 3.0 Pro:
- Input: 50,000 tokens × $2.00 / 1M = $0.10
- Output: 30,000 tokens × $12.00 / 1M = $0.36
- Total: $0.46
For this task, Gemini costs 54% less. But if you need Claude’s stronger reasoning for complex refactoring decisions, the extra cost may be worthwhile.
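If you’d rather not do this in your head, a small helper is enough. The prices below are hard-coded from the tables above (≤200K-token tier for Gemini) and will need updating if pricing changes:

// Per-1M-token prices from the pricing table above.
const PRICES = {
  'claude-opus-4.5': { input: 5.0, output: 25.0 },
  'gemini-3.0-pro': { input: 2.0, output: 12.0 },
} as const;

function estimateCost(
  model: keyof typeof PRICES,
  inputTokens: number,
  outputTokens: number
): number {
  const { input, output } = PRICES[model];
  return (inputTokens * input + outputTokens * output) / 1_000_000;
}

// The documentation example: 100 files × (500 input + 300 output) tokens each.
console.log(estimateCost('claude-opus-4.5', 50_000, 30_000)); // → $1.00
console.log(estimateCost('gemini-3.0-pro', 50_000, 30_000));  // → $0.46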
Rules of Thumb
- Gemini for volume: High-volume, straightforward tasks (summarization, formatting, simple generation)
- Claude for complexity: Tasks requiring nuanced reasoning, complex code understanding, or multi-step analysis
- Batch when possible: 50% savings on both platforms for non-urgent workloads
- Cache repeated content: 90% savings on system prompts and reference documents you send repeatedly
TL;DR Cheat Sheet
| Concept | Key Points |
|---|---|
| Context window | Claude: 200K, Gemini: 1M. Front-load important content. |
| Tokenization | Code costs more than prose. Count tokens before big requests. |
| Temperature | 0-0.2 for code, 0.5-0.7 for docs, 0.8+ for brainstorming. |
| Stop reason | Check for max_tokens to catch truncated responses. |
| Cost | Gemini: cheaper. Claude: better reasoning. Batch saves 50%. |
Going Deeper
These fundamentals will get you far, but there’s much more to building production-ready LLM applications: prompt engineering patterns, RAG architectures, function calling, error handling strategies, and deployment best practices.
If you’re looking to level up your team’s LLM development skills, check out our Programming with Large Language Models training course. It’s a hands-on 3-day program covering everything from prompt engineering to deploying production AI applications.
Written by
Francesco Donzello
Software Engineer & Trainer