Most enterprise AI agent cost overruns don’t stem from expensive models—they stem from wasted tokens.

A typical customer service agent consumes 800-1,200 tokens per conversation on average, but fewer than 60% of those tokens actually contribute to quality responses. The remaining 40% comes from: repetitive system prompts, bloated few-shot examples, inefficient context truncation, and unchecked context accumulation.

This isn’t an isolated problem. It’s an industry-wide pattern.

This article breaks down three实战 (practical) optimization strategies—token efficiency, intelligent model routing, and production-grade semantic caching—that help engineering teams reduce per-interaction costs by 40-70%.

Why Costs Spiral Out of Control

An enterprise AI agent’s cost breakdown typically looks like this:

  • LLM inference: 60-80% of total cost
  • Vector database queries: 5-15%
  • API gateway and network: 5-10%
  • Storage and logging: 3-5%

LLM inference dominates, and token consumption directly drives inference costs.

The bigger problem is cost invisibility: most teams have AI costs buried in cloud billing with no per-agent, per-task-type, or per-user-group breakdown. Engineers can’t see which task types consume the most tokens, making optimization impossible.

Recommended first step: implement token-level cost monitoring. Alibaba Cloud Bailian, Baidu Qianfan, and Tencent Cloud all offer per-request token metering. Enable it in the console under AI Agent services → usage details → export CSV. Review it weekly—you may find that 20% of task types consume 80% of costs.

Token Optimization: A Four-Step Framework

Step 1: Prompt Compression

Long prompts are the primary token drain.

Consider a typical RAG + Agent system:

System prompt: You are a customer service assistant for Company X, helping users with product usage, order inquiries, returns and exchanges...
[~300 characters omitted]

Chat history: [potentially 20+ turns accumulated, context window inflating rapidly]

Current question: I'd like to check the warranty period for the product I bought last week

The chat history is the biggest token sink. Optimization approach:

Recent conversation truncation: Keep only the last 5 turns (typically sufficient to understand current context), and route earlier conversation to vector memory.

LLMLingua compression: Microsoft’s LLMLingua technique can compress prompt tokens by 30-50% while preserving over 95% task accuracy. Particularly effective for system prompts and few-shot examples.

Structured prompt trimming: Remove polite prefixes like “Hello, how can I help you today?”—jump straight to the task. Saving ~15-30 tokens per request compounds significantly at scale.

Step 2: Context Window Optimization

Large language models exhibit soft-max attention decay—earlier tokens receive lower attention weights. Leverage this:

Lead with core information: Place the user’s current question and key parameters at the start of the prompt, background information and history at the end.

Dynamic context depth: Adjust context length based on task type. Simple queries (product information lookup) use 2K tokens; complex analysis (annual report analysis, multi-step reasoning) warrants full context.

Step 3: Tool Call Streamlining

Agent tool calls are cost multipliers—each tool call triggers an LLM inference.

Compressed tool descriptions: Don’t paste full API documentation into prompts. Reduce tool descriptions to “Lookup product info: input product_id, returns warranty period and after-sales policy.” LLMs need to know when to call a tool, not the internal mechanics of how.

Reject meaningless calls: Add a “query history cache” at the Agent architecture layer—if a user asks a similar question within 5 minutes, return the cached result without triggering a new tool call.

Step 4: Output Length Control

Specify output format and length explicitly in the system prompt: “Answer in under 50 characters,” “Return JSON only, no explanation.” Constraining output tokens reduces both TTS (token generation) costs and response latency.

Intelligent Model Routing: Right Model for Right Task

Not every task requires GPT-4o or Claude 3.5.

The GAIA benchmark (General AI Assistants benchmark, arXiv: 2311.12983) provides a critical insight: most real-world tasks (85%+) require only basic reasoning capabilities, with complex reasoning tasks comprising less than 15%.

Model routing’s core idea: classify task difficulty and route simple tasks to smaller models, complex tasks to larger ones.

Routing Strategy Design

Rule-based routing (low-cost approach):

  • Keyword matching: contains “analyze,” “compare,” “predict” → large model; contains “query,” “confirm,” “find” → small model
  • Token length: input < 500 tokens → small model; input > 2,000 tokens → large model

LLM-based routing (high-precision approach): Use a lightweight model (DeepSeek-V3 or Qwen-7B) as a “routing brain” to classify task difficulty and select the appropriate execution model. The routing model’s inference cost is under 5% of the main model’s, but can route 70% of simple tasks to smaller models, saving 70-90% in inference costs.

Domestic Model Cost Advantages

DeepSeek-V3’s API pricing is approximately 1/20th of GPT-4o’s (at comparable MMLU performance), with no significant difference in performance on simple tasks. This gives domestic enterprises greater cost optimization headroom.

Task TypeRecommended ModelRationale
Simple Q&A, information lookupDeepSeek-V3 / Qwen-7BLow cost, minimal latency
Complex reasoning, multi-step analysisDeepSeek-V2.5 / GPT-4oRequires strong reasoning
Long-context analysisClaude 3.5 / Gemini 1.5Context window advantage

Production-Grade Semantic Caching

Semantic caching is the “big lever” of cost optimization—a cache hit bypasses LLM inference entirely.

Cache Architecture

User query → Embedding model → Vector similarity search (threshold > 0.92) → Hit? → Return cached result
                                                                        ↓ (miss)
                                                       LLM inference → Return result → Write to cache

Key parameters:

  • Similarity threshold: Recommended 0.92-0.95. Below 0.92 risks irrelevant results; above 0.95 drives cache hit rate too low
  • Cache granularity: Cache by “question + user profile + time window” dimensions, not simple text matching

Cache Invalidation Strategy

More cache isn’t always better. Actively invalidate in these scenarios:

  • Product information changes: When prices, features, or inventory change, clear related cache
  • Business rules update: When return policies or warranty terms change, clear related cache
  • TTL settings: Recommend 1 hour for simple query cache, 24 hours for complex analysis cache

Cache Hit Rate Targets

Production cache hit rates:

  • Knowledge Q&A agents: 30-50% (high repetition)
  • Data analysis agents: 10-20% (low repetition)
  • Customer service agents: 20-40% (depends on product complexity)

For a customer service agent handling 100,000 requests per day with a 35% cache hit rate and 500 tokens saved per request, that’s 1.75 billion tokens saved daily—translating to significant monthly cost reductions.

China Market Toolchain: Alibaba Cloud Bailian & Baidu Qianfan

Domestic cloud vendors have launched AI agent cost optimization tools:

Alibaba Cloud Bailian:

  • Agent-specific Token optimization SDK, auto-compresses prompts and chat history
  • Model routing service with built-in task difficulty classifier
  • Semantic caching service with vector similarity matching

Baidu Qianfan:

  • ERNIE Agent development platform with per-request token metering
  • Unified access layer supporting DeepSeek, Qwen, and other domestic models
  • Cost dashboard with per-agent and per-user-group breakdown

5 Immediate Optimization Actions

  1. Enable token-level metering: Export your cloud console’s AI agent token consumption report for the past week—find the top 3 highest-consuming task types
  2. Trim prompts: Start with system prompts—remove all “polite” expressions, save 20-50 tokens per prompt
  3. Add a caching layer: Take one high-frequency, low-complexity task and integrate semantic caching (Milvus or managed cloud service)
  4. Test small-model routing: Route the top 2 simple tasks to DeepSeek-V3, measure quality differences and cost savings
  5. Set output length constraints: Add “answers must not exceed X characters” to all Agent system prompts

Conclusion

AI agent cost optimization is a tangible, actionable engineering problem—not a mystery.

Most teams see 30-50% cost reduction after the first optimization pass. A second round (model routing + caching) can push total savings to 60-70%.

The key isn’t “which model to use”—it’s “how to make every token count.”

Next step: Check your cloud console for the AI agent token consumption report. That’s where your optimization journey begins.


References: GAIA Benchmark, arXiv:2311.12983; LLMLingua, arXiv:2310.15736; DeepSeek API Documentation, platform.deepseek.com