Tutorial | Beginner | 10 min read

Transformers for LLM Users: What Developers Actually Need to Know

Learn the 5 transformer concepts that matter for developers: attention, context windows, tokenization, sampling parameters, and cost optimization for Claude and Gemini.

Francesco Donzello

Software Engineer & Trainer

TL;DR

Understanding context windows, tokenization, and sampling parameters will save you money and help you debug LLM issues. Claude Opus 4.5 has 200K context at $5/$25 per 1M tokens; Gemini 3.0 Pro has 1M context at $2/$12 per 1M tokens.

Key Takeaways

  1. Context windows limit how much text the model can process at once (200K-1M tokens)
  2. Code tokenizes worse than prose, often requiring 2-3x more tokens for the same content
  3. Use temperature 0-0.2 for code generation, higher for creative tasks
  4. Always check the stop_reason to catch truncated responses
  5. Gemini is cheaper for volume, Claude excels at complex reasoning

You don’t need to understand backpropagation to build applications with LLMs. But understanding a handful of key concepts will save you money, help you debug issues faster, and get consistently better results.

This guide covers what you actually need to know as a developer working with LLM APIs. No machine learning degree required.

Attention: The Core Innovation

The transformer architecture, introduced in the 2017 paper Attention Is All You Need, powers every modern LLM. Its key innovation is the attention mechanism, which lets the model consider all parts of the input simultaneously rather than processing text sequentially.

Think of it like reading with multiple highlighters. When the model generates a response to your code question, it highlights relevant parts of your input (function names, variable types, error messages) and weighs their importance for each word it generates.

Why does this matter to you? It explains why LLMs understand context rather than just predicting the next word. When you ask “what does this function do?”, the model attends to the function definition, its usage patterns, and surrounding code all at once.

That’s all you need to know about attention. Let’s move to the concepts that directly affect your code and your wallet.

Context Windows: Your Working Memory Budget

The context window is the total amount of text an LLM can process in a single request. This includes both your input and the model’s output. Think of it as working memory: everything must fit, and what doesn’t fit gets ignored.

Current Limits (at the time of writing)

Model           | Context Window | Max Output
Claude Opus 4.5 | 200K tokens    | 64K tokens
Gemini 3.0 Pro  | 1M tokens      | 64K tokens

To put this in perspective:

  • 200K tokens ≈ 150,000 words ≈ a 500-page book
  • 1M tokens ≈ 750,000 words ≈ several novels

For most applications, Claude’s 200K is more than sufficient. Gemini’s 1M window becomes valuable when you’re processing entire codebases, long documents, or maintaining very long conversation histories.

The “Lost in the Middle” Problem

Research has shown that LLMs attend more strongly to content at the beginning and end of the context window. Information buried in the middle may receive less attention.

Practical tip: Front-load important context. Put your key instructions and critical code at the start of your prompt, not buried after pages of background.
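
One simple way to apply this is to assemble prompts in priority order. A minimal sketch (the helper and variable names are purely illustrative, not part of either SDK):

// Illustrative only: key instruction and code first, bulk background last.
function buildPrompt(instruction: string, keyCode: string, background: string): string {
  return [
    `Task: ${instruction}`,
    'Relevant code:',
    keyCode,
    'Background (lower priority):',
    background,
  ].join('\n\n');
}

const longProjectNotes = '...pages of project documentation...'; // placeholder
const prompt = buildPrompt(
  'Explain what this function does and flag any bugs.',
  'function add(a: number, b: number) { return a - b; }',
  longProjectNotes
);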

What Counts Against Your Context

Everything in the conversation consumes tokens:

  • Your system prompt
  • Previous messages in the conversation
  • The current user message
  • The model’s response

This is why long chat conversations eventually degrade. Early context gets pushed out or receives less attention.
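
One common mitigation is to trim old messages before each request. Here is a minimal sketch using the rough heuristic of ~4 characters per token; the budget and helper names are assumptions for illustration, not part of either SDK:

// Drop the oldest messages once the conversation exceeds a token budget.
// estimateTokens is a crude ~4 chars/token approximation; for exact counts,
// use the countTokens calls shown later in this article.
type ChatMessage = { role: 'user' | 'assistant'; content: string };

const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function trimHistory(messages: ChatMessage[], budget = 150_000): ChatMessage[] {
  const kept = [...messages];
  let total = kept.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  while (total > budget && kept.length > 1) {
    const removed = kept.shift()!; // the oldest message goes first
    total -= estimateTokens(removed.content);
  }
  return kept;
}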

Tokenization: Why Your Code Costs More Than Prose

LLMs don’t process text character by character. They use tokenization to split text into chunks called tokens. Understanding this explains why code is more expensive than English prose.

How Tokenization Works

Most LLMs use Byte Pair Encoding (BPE), which learns common patterns from training data. Frequent words become single tokens; rare patterns get split into multiple tokens.

// Token counts (approximate)
"hello"           → 1 token
"the"             → 1 token
"useState"        → 2 tokens (use + State)
"handleSubmit"    → 3 tokens (handle + Sub + mit)
"@tanstack/query" → 5 tokens
"XMLHttpRequest"  → 4 tokens

Code tokenizes poorly because:

  • camelCase and PascalCase get split at unusual boundaries
  • Symbols like {}, =>, && often become separate tokens
  • Package names and paths are rarely in training data
  • Indentation (especially tabs) adds tokens

The same logic expressed in natural English typically uses 30-50% fewer tokens than code.

Current Pricing (at the time of writing)

Model                          | Input (per 1M tokens) | Output (per 1M tokens)
Claude Opus 4.5                | $5.00                 | $25.00
Gemini 3.0 Pro (≤200K context) | $2.00                 | $12.00
Gemini 3.0 Pro (>200K context) | $4.00                 | $18.00

Output tokens cost more because they are generated one at a time, while input tokens can be processed in parallel.

Counting Tokens in Your Code

Both APIs return token counts in their responses. You can also count before sending:

// Using Anthropic's SDK
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

const result = await client.messages.countTokens({
  model: 'claude-opus-4-5-20251101',
  messages: [{ role: 'user', content: yourPrompt }]
});

console.log(`Input tokens: ${result.input_tokens}`);

// For Gemini, use the countTokens endpoint
import { GoogleGenerativeAI } from '@google/generative-ai';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: 'gemini-3-pro' });

const result = await model.countTokens(yourPrompt);
console.log(`Token count: ${result.totalTokens}`);

Cost-Saving Tips

  1. Choose the right model for the task. Gemini 3.0 Pro costs 60% less than Claude Opus 4.5 on input tokens. Use it for high-volume tasks where Claude’s stronger reasoning isn’t essential.

  2. Use batch processing. Both providers offer ~50% discounts for asynchronous batch requests when you don’t need real-time responses.

  3. Enable prompt caching. If you’re sending the same system prompt or reference documents repeatedly, caching can save up to 90% on those tokens (see the sketch after this list).

  4. Be selective with context. Don’t dump your entire codebase into every request. Send only what’s relevant.
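
As referenced in tip 3, here is a minimal sketch of Anthropic's prompt caching: the large, reused system block is marked with cache_control so repeat requests can reuse it. The reference document is a placeholder, and client is the one created earlier:

// Mark the big, unchanging part of the request as cacheable.
const largeReferenceDocument = '...'; // placeholder for a reusable reference text

const cached = await client.messages.create({
  model: 'claude-opus-4-5-20251101',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: largeReferenceDocument, // e.g. a style guide sent with every request
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [{ role: 'user', content: 'Summarize section 3 of the reference.' }],
});

Gemini offers an analogous context-caching feature that is configured separately; check the current docs for details.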

Sampling Parameters: Controlling Randomness

When an LLM generates text, it predicts probabilities for possible next tokens. Sampling parameters control how the model selects from these possibilities.

Temperature

Temperature controls randomness. Claude accepts values from 0 to 1, Gemini from 0 to 2:

  • 0.0 = Deterministic. The model always picks the highest-probability token. Same input produces same output.
  • 1.0 = Standard sampling. Balanced creativity.
  • Above 1.0 (Gemini only) = Increasingly random. The model considers lower-probability tokens more often.

Recommendations by task:

Task            | Temperature | Why
Code generation | 0.0 - 0.2   | You want correct, consistent code
Code review     | 0.2 - 0.4   | Slight variation in phrasing is fine
Documentation   | 0.5 - 0.7   | More natural language variation
Brainstorming   | 0.8 - 1.0   | Encourage diverse ideas

Top-p (Nucleus Sampling)

Top-p limits token selection to a cumulative probability threshold. With top_p: 0.9, the model only considers tokens that together make up 90% of the probability mass, ignoring the long tail of unlikely tokens.

For most use cases, leave top-p at its default (usually 1.0) and adjust temperature instead. Only tune top-p if you need fine-grained control over output diversity.

Setting Parameters in Code

// Claude
const response = await client.messages.create({
  model: 'claude-opus-4-5-20251101',
  max_tokens: 2048,
  temperature: 0,  // Deterministic for code generation
  messages: [
    { role: 'user', content: 'Write a function to validate email addresses' }
  ]
});

// Gemini
const result = await model.generateContent({
  contents: [{ role: 'user', parts: [{ text: prompt }] }],
  generationConfig: {
    temperature: 0,
    maxOutputTokens: 2048,
  }
});
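
If you do need to tune top-p as discussed above, both APIs expose it as a request field. A short sketch (the values are examples only; client, model, and prompt are the objects from the earlier examples):

// Claude: top_p is a top-level request field
const claudeResponse = await client.messages.create({
  model: 'claude-opus-4-5-20251101',
  max_tokens: 1024,
  top_p: 0.9,  // consider only the top 90% of probability mass
  messages: [{ role: 'user', content: prompt }]
});

// Gemini: the equivalent field is topP inside generationConfig
const geminiResponse = await model.generateContent({
  contents: [{ role: 'user', parts: [{ text: prompt }] }],
  generationConfig: { topP: 0.9, maxOutputTokens: 1024 }
});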

Debugging Tip

Getting inconsistent outputs from the same prompt? Check your temperature first. A temperature of 0.7 or higher will produce different responses each time, which may or may not be what you want.

Reading API Responses Like a Pro

Both Claude and Gemini return metadata that helps you understand what happened and debug issues.

Key Fields to Monitor

Claude response:

{
  "id": "msg_01XFDUDYJgAACzvnptvVoYEL",
  "type": "message",
  "role": "assistant",
  "content": [{ "type": "text", "text": "..." }],
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 1523,
    "output_tokens": 847
  }
}

Gemini response:

{
  "candidates": [{
    "content": { "parts": [{ "text": "..." }] },
    "finishReason": "STOP"
  }],
  "usageMetadata": {
    "promptTokenCount": 1523,
    "candidatesTokenCount": 847,
    "totalTokenCount": 2370
  }
}
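
A minimal logging pass over those fields might look like this, assuming response and result are the objects returned by the client.messages.create and model.generateContent calls shown earlier:

// Claude: usage lives directly on the response object
console.log('Claude tokens:', response.usage.input_tokens, '/', response.usage.output_tokens);

// Gemini: usage metadata hangs off result.response
const usage = result.response.usageMetadata;
console.log('Gemini tokens:', usage?.promptTokenCount, '/', usage?.candidatesTokenCount);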

Stop Reasons Explained

Claude        | Gemini     | Meaning
end_turn      | STOP       | Completed normally ✓
max_tokens    | MAX_TOKENS | Hit output limit ⚠️
stop_sequence | STOP       | Hit a custom stop sequence
refusal       | SAFETY     | Declined by the model or blocked by safety filters
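
In practice, a small truncation check saves a lot of debugging. A sketch, again assuming the response objects from the earlier examples:

// Claude: stop_reason sits on the message object
if (response.stop_reason === 'max_tokens') {
  console.warn('Claude output was truncated; raise max_tokens or split the task.');
}

// Gemini: finishReason sits on the first candidate
const candidate = result.response.candidates?.[0];
if (candidate?.finishReason === 'MAX_TOKENS') {
  console.warn('Gemini output was truncated; raise maxOutputTokens or split the task.');
}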

Red Flags to Watch

  1. Frequent max_tokens stops: Your responses are being truncated. Either increase max_tokens or break your task into smaller chunks.

  2. Token count much higher than expected: You might be accidentally including large files, repeated content, or verbose system prompts.

  3. High latency: For real-time applications, consider streaming responses or using smaller, faster models for appropriate tasks.

  4. Empty or minimal responses with safety blocks: Your prompt may be triggering content filters. Rephrase to be more specific about legitimate use cases.

Quick Cost Mental Math

Here’s a simple formula to estimate costs:

Cost = (input_tokens × input_price) + (output_tokens × output_price)
       ─────────────────────────────────────────────────────────────
                              1,000,000
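
The same formula as a small helper, using the prices from the pricing table above; the numbers below match the worked example that follows:

// Prices are quoted per 1M tokens, matching the pricing table above.
function estimateCostUSD(
  inputTokens: number,
  outputTokens: number,
  inputPricePer1M: number,
  outputPricePer1M: number
): number {
  return (inputTokens * inputPricePer1M + outputTokens * outputPricePer1M) / 1_000_000;
}

estimateCostUSD(50_000, 30_000, 5, 25);   // Claude Opus 4.5 → 1.00
estimateCostUSD(50_000, 30_000, 2, 12);   // Gemini 3.0 Pro  → 0.46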

Example: Generating Documentation

Suppose you want to generate documentation for 100 TypeScript files, averaging 500 input tokens and 300 output tokens per file.

Claude Opus 4.5:

  • Input: 50,000 tokens × $5.00 / 1M = $0.25
  • Output: 30,000 tokens × $25.00 / 1M = $0.75
  • Total: $1.00

Gemini 3.0 Pro:

  • Input: 50,000 tokens × $2.00 / 1M = $0.10
  • Output: 30,000 tokens × $12.00 / 1M = $0.36
  • Total: $0.46

For this task, Gemini costs 54% less. But if you need Claude’s stronger reasoning for complex refactoring decisions, the extra cost may be worthwhile.

Rules of Thumb

  • Gemini for volume: High-volume, straightforward tasks (summarization, formatting, simple generation)
  • Claude for complexity: Tasks requiring nuanced reasoning, complex code understanding, or multi-step analysis
  • Batch when possible: 50% savings on both platforms for non-urgent workloads
  • Cache repeated content: 90% savings on system prompts and reference documents you send repeatedly
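
If you want to encode these rules in code, a deliberately simplified sketch might look like this; the task labels are assumptions for illustration, and the model IDs are the ones used earlier in this article:

// Route simple, high-volume work to Gemini and reasoning-heavy work to Claude.
type TaskKind = 'summarize' | 'format' | 'generate' | 'refactor' | 'analyze';

function pickModel(kind: TaskKind): string {
  const needsDeepReasoning = kind === 'refactor' || kind === 'analyze';
  return needsDeepReasoning ? 'claude-opus-4-5-20251101' : 'gemini-3-pro';
}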

TL;DR Cheat Sheet

Concept        | Key Points
Context window | Claude: 200K, Gemini: 1M. Front-load important content.
Tokenization   | Code costs more than prose. Count tokens before big requests.
Temperature    | 0-0.2 for code, 0.5-0.7 for docs, 0.8+ for brainstorming.
Stop reason    | Check for max_tokens to catch truncated responses.
Cost           | Gemini: cheaper. Claude: better reasoning. Batch saves 50%.

Going Deeper

These fundamentals will get you far, but there’s much more to building production-ready LLM applications: prompt engineering patterns, RAG architectures, function calling, error handling strategies, and deployment best practices.

If you’re looking to level up your team’s LLM development skills, check out our Programming with Large Language Models training course. It’s a hands-on 3-day program covering everything from prompt engineering to deploying production AI applications.



Written by

Francesco Donzello

Software Engineer & Trainer