
How to Reduce AI Token Usage: 13 Practical Tips From Pricing to Workflow

Tokens cost money. Learn how Claude, GPT, and Gemini charge per token, what secretly eats your quota, and 13 actionable tips to cut usage by 30-60% without losing output quality.


Every message you send to an AI costs tokens - either money on a pay-per-use API, or quota on a subscription plan. The token is AI's billing unit, and if you don't understand how it works, you end up burning through your allowance on things that don't need to be there. This guide breaks down the mechanics and gives you 13 practical ways to work more efficiently without sacrificing output quality.

Whether you’re using Claude, ChatGPT, or Gemini - they all charge by the token. The problem is most people only think about what they type. In reality, the system prompt, your entire conversation history, any files you attach, and even the AI’s internal “thinking” process all consume tokens quietly in the background.

Is this guide for you? If you’re still well under 50% of your daily quota, you probably don’t need to worry about this yet - just use AI freely. But if you’re consistently hitting 70%+ usage during your workday, adding automation routines, taking on more clients, or watching your AI workload grow exponentially - token optimization isn’t just about being frugal. It’s a skill worth building before it becomes a real bottleneck.

What Is a Token and Why Does It Matter?

A token is the smallest unit an AI processes - not a word, not a character, but a chunk of text. The common rule of thumb: 1,000 tokens equals roughly 750 words in English (other languages like Vietnamese or Japanese often use more tokens per equivalent meaning due to character structure).
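For a quick back-of-envelope check without calling a real tokenizer, you can turn these rules of thumb into code. This is only a heuristic (roughly 4 characters per token for English); an actual tokenizer like tiktoken gives exact counts:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text.
    Real tokenizers give exact counts; this is a ballpark only."""
    return max(1, len(text) // 4)

def tokens_to_words(tokens: int) -> int:
    """The 1,000 tokens ~= 750 words rule of thumb."""
    return int(tokens * 0.75)

print(estimate_tokens("A token is the smallest unit an AI processes."))
print(tokens_to_words(1000))  # 750
```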

A context window is the total number of tokens an AI can “see” in a single processing pass - including both input and output. Exceed this limit and the AI starts forgetting the beginning of your conversation.

[Image] The context window is the AI’s “field of view” - anything outside it gets forgotten

Here’s what most people miss: AI re-reads your entire conversation history on every single message. A 100-message chat means the AI processes the full thread 100 times. That’s why long conversations get expensive fast.
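To see why that adds up so fast, here is a toy calculation. It assumes, purely for illustration, that every message is ~200 tokens and the full history is resent on each turn:

```python
def total_input_tokens(num_messages: int, tokens_per_message: int = 200) -> int:
    """Each new message resends the entire history, so total input
    grows quadratically: 1 + 2 + ... + n messages processed overall."""
    return sum(turn * tokens_per_message for turn in range(1, num_messages + 1))

# A 100-message chat processes 5,050 messages' worth of input in total:
print(total_input_tokens(100))  # 1010000 tokens
```

A chat twice as long costs roughly four times as much input - which is exactly why tips 5 and 7 below (one topic per chat, reset with a summary) pay off.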

Real API Pricing: What Tokens Actually Cost

Understanding pricing is the first step to managing it. Providers charge by $ per million tokens (MTok), billed separately for input and output.

[Image] Anthropic API pricing - knowing your model tiers helps you match the right model to the right task

Claude pricing (2026):

| Model | Input ($/MTok) | Output ($/MTok) |
|---|---|---|
| Claude Haiku 4.5 | $1.00 | $5.00 |
| Claude Sonnet 4.6 | $3.00 | $15.00 |
| Claude Opus 4.7 | $5.00 | $25.00 |
| Claude Opus 4.7 Fast Mode | $30.00 | $150.00 |

For comparison - OpenAI:

| Model | Input ($/MTok) | Output ($/MTok) |
|---|---|---|
| GPT-5.4 | $2.50 | $15.00 |
| GPT-4o mini | ~$0.15 | ~$0.60 |

Real-world numbers: Anthropic reports an average of ~$13 per developer per active workday, and $150-250 per month for enterprise teams. 90% of users stay under $30/day. But nothing stops you from blowing past those numbers if you’re not paying attention.

The hidden cost nobody mentions: Output tokens cost 4-6x more than input tokens (5x across Claude’s tiers). A long answer is far more expensive than a long question. And extended thinking - the AI’s internal reasoning before it responds - is billed as output tokens and can consume tens of thousands of tokens per request.
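A rough cost model for a single request makes the asymmetry concrete. This sketch uses the Sonnet prices from the table above and assumes, as noted, that thinking tokens are billed at the output rate:

```python
def request_cost(input_tokens: int, output_tokens: int, thinking_tokens: int = 0,
                 input_price: float = 3.00, output_price: float = 15.00) -> float:
    """Cost in dollars; prices are $ per million tokens (MTok).
    Extended-thinking tokens are billed as output tokens."""
    return (input_tokens * input_price
            + (output_tokens + thinking_tokens) * output_price) / 1_000_000

# Short question, long answer, plus heavy extended thinking:
print(round(request_cost(500, 2_000, thinking_tokens=20_000), 4))  # 0.3315
```

Note that in this example the 500-token question costs $0.0015 while the thinking alone costs $0.30 - two hundred times more.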

What’s Actually Eating Your Tokens?

Beyond the messages you type, every session also loads:

  • System prompt: ~4,200 tokens, auto-loaded every session
  • Full conversation history: Every new message = the AI re-reads ALL of it
  • File attachments: PDFs, images, documents - especially costly if sent raw without preprocessing
  • Custom instructions / CLAUDE.md: Loaded into context every session, whether relevant or not
  • MCP tool schemas: Tool definitions take up context space
  • Extended thinking: Silent reasoning tokens billed as output
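Adding up just the fixed per-session overhead shows how much you pay before typing a word. The system-prompt figure is from above; the other two are illustrative assumptions, not measured values:

```python
# Per-session baseline context load (illustrative estimates)
overhead = {
    "system_prompt": 4_200,        # from the article
    "custom_instructions": 1_500,  # assumed: a mid-sized CLAUDE.md
    "mcp_tool_schemas": 2_000,     # assumed: a handful of tool definitions
}
baseline = sum(overhead.values())
print(baseline)  # 7700 tokens before your first message
```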

Token-Saving Tips - For Everyday Users

1. Convert PDF/DOCX to plain text before sending

PDFs carry metadata, font embedding, and formatting noise - all of which eat tokens without adding value. Instead of uploading the raw file:

  • Copy just the relevant text and paste it directly
  • Use ChatPDF or PDF2MD to convert to clean markdown first
  • On Mac: open in Preview, select all, copy text

Same content, but as plain text it’s 30-60% lighter than the original PDF.

2. Only send the relevant section - not the whole file

Instead of: “Here’s my 50-page report, can you analyze it?”
Do this: Paste the relevant section (pages 12-15) + your specific question.

3. Create a “Context File” for recurring projects

Instead of re-explaining your situation at the start of every new chat, keep a ready-to-paste file:

# Context - [Project Name]
## Who I am: [Role, company, position]
## Project: [Goal, constraints]
## Output style: [bullet points / formal / casual]
## Last session: [What was done, what's pending]

Paste this at the top of each new chat. Saves 10-20 messages worth of re-explanation every single time.

4. Save AI output as a markdown log file

At the end of every important session:

“Summarize everything we worked on today into a short markdown file I can reuse next time.”

Save that file. Next session, paste the summary instead of explaining from scratch. Cost difference: ~500 tokens vs 5,000 tokens of re-briefing.

I wrote a more detailed guide on how to set this up automatically - including having the AI save session logs to your local drive: Optimize AI Chat History: Auto-Save Session Logs, Preserve Context, and Cut Token Costs.

5. Start a new chat when switching topics

A 200-message marketing chat followed by one HR question means the AI reads 200 irrelevant messages before answering. One topic per chat.

6. Write specific prompts and cut the filler

| Instead of | Use |
|---|---|
| “Could you help me with…” | Get straight to the request |
| “As you know from our previous conversation…” | Paste your context file |
| “Please think carefully and…” | Drop it - the AI already does |
| “Thanks, now could you…” | Drop the pleasantries |

Also specify your output format upfront: “Return 3 bullet points, one line each” - the AI won’t generate unnecessary text.

7. Reset a long chat with a summary

When a conversation has grown too long:

  1. Ask the AI: “Summarize this entire conversation in 10 bullet points”
  2. Copy the summary
  3. Open a new chat and paste the summary
  4. Continue from there

From 50,000 tokens of history down to ~500 tokens of compressed context.

Token-Saving Tips - For Developers / Claude Code

[Image] A real-world cost spike: $500 in 8 hours on Claude API - what happens without token controls in place

8. /clear between tasks

Stale context from unrelated work is pure waste - it’s re-read on every new message. Use /rename to save the session with a meaningful name first, then /clear to start fresh.

9. Use Plan Mode before writing code

Press Shift+Tab to enter Plan Mode - the AI maps out an approach and waits for your approval before touching any code. This prevents the expensive “wrong direction - undo - redo” loop. If the AI starts going off track: hit Escape immediately.

10. Match the model to the task

| Task | Model to use |
|---|---|
| Simple questions, formatting, grep | Haiku |
| Most coding work | Sonnet (default) |
| Architecture decisions, complex reasoning | Opus |

Use /model to switch mid-session. The price gap between Haiku and Opus is 5x on both input and output - and 30x against Opus Fast Mode.
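Using the table prices, here is a quick comparison of the same hypothetical day’s workload on each tier (the 2 MTok in / 0.5 MTok out volumes are made up for illustration):

```python
PRICES = {  # $ per MTok (input, output), from the Claude pricing table
    "haiku": (1.00, 5.00),
    "sonnet": (3.00, 15.00),
    "opus": (5.00, 25.00),
}

def daily_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Dollar cost for a given volume of input/output, in millions of tokens."""
    input_price, output_price = PRICES[model]
    return input_mtok * input_price + output_mtok * output_price

# Same hypothetical day's work on each tier: 2 MTok in, 0.5 MTok out
for model in PRICES:
    print(model, daily_cost(model, 2, 0.5))  # haiku 4.5, sonnet 13.5, opus 22.5
```

Routing even half of the simple tasks down a tier compounds into real savings over a month.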

11. Reduce or disable Extended Thinking for simple tasks

Extended thinking is on by default and can burn tens of thousands of tokens per request. For straightforward tasks:

MAX_THINKING_TOKENS=8000   # Reduce the thinking budget
# or: /effort low within the session
# or: disable it in /config

12. Keep CLAUDE.md under 200 lines - move workflows into Skills

CLAUDE.md loads entirely into context at the start of every session regardless of what you’re working on. Skills only load when explicitly called. Move specific workflow instructions (PR reviews, DB migrations…) into skills and keep CLAUDE.md for only the essentials.

13. Use hooks to filter data before the AI reads it

Instead of letting the AI wade through a 10,000-line log file:

# Pre-filter with a hook: from 50,000 tokens down to a few hundred
cmd 2>&1 | grep -A 5 -E "(FAIL|ERROR)" | head -100

Preprocessing is the highest-leverage move for data-heavy tasks.

Frequently Asked Questions

How much text is 1 million tokens in practice?

About 750,000 words in English - roughly 3-4 average novels. At Claude Sonnet 4.6’s price of $3/MTok for input, 1 million input tokens costs $3. That sounds cheap, but output tokens are 5x more expensive ($15/MTok), and in practice you’re sending and receiving many times per session.

Do subscription plans (Claude Pro/Max) still care about token usage?

Yes. Subscriptions have usage rate limits measured over time windows. Burn through tokens quickly and you’ll get throttled - slowed down or temporarily paused - earlier in your day. The optimization tips in this guide apply equally to subscription plans and pay-per-use API.

How many tokens does an image use?

It depends on the resolution and model. With Claude, a 1080p image can cost 1,500-2,000 tokens just to process. If you’re sending an image so the AI can read text in it, it’s more efficient to crop to just the text area, or run OCR first and paste the extracted text instead.

What is prompt caching and how much does it save?

Prompt caching lets you cache fixed parts of your system prompt or context - repeated calls to that cached content cost only 10% of the normal price, a 90% discount on cache hits. This feature is primarily useful for developers building apps with repeated system prompts, rather than individual end users.
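The savings are easy to quantify. This sketch uses the Sonnet input price from earlier and the 90% hit discount; it deliberately ignores the cache-write surcharge some providers add, so treat it as an upper bound:

```python
def cached_vs_uncached(prompt_tokens: int, calls: int,
                       input_price: float = 3.00,
                       hit_discount: float = 0.9) -> tuple[float, float]:
    """Input cost in dollars with and without prompt caching.
    First call pays full price; later calls pay (1 - discount) on hits."""
    uncached = prompt_tokens * calls * input_price / 1_000_000
    cached = (prompt_tokens * input_price                      # first call, full price
              + prompt_tokens * (calls - 1) * input_price
                * (1 - hit_discount)) / 1_000_000              # cache hits at 10%
    return uncached, cached

# A 10,000-token system prompt reused across 100 calls:
full, cheap = cached_vs_uncached(10_000, 100)
print(full, round(cheap, 3))  # 3.0 vs 0.327
```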

Summary

Tokens are money, and most of them get wasted on things that don’t need to be there: long conversation histories, raw files sent without preprocessing, vague prompts that trigger broad scanning, and premium models used for tasks a cheaper model handles just as well. The two highest-impact changes you can make right now: create reusable context files (cuts re-explanation overhead) and start fresh chats per topic (stops the AI from re-reading irrelevant history). Apply the habits in this guide and you can realistically cut token usage by 30-60% without any drop in output quality.
