Cost Optimization

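LLM APIs bill by the token, and uncontrolled usage in production can easily produce surprise bills. Cost optimization needs a systematic approach across several layers: model selection, prompt compression, caching, routing, and more.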

Token Billing Basics

Cost = (input tokens × input price) + (output tokens × output price)
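
As a worked example of the formula, the sketch below prices a single call. The function name `call_cost` and the token counts are illustrative assumptions; the $3 / $15 prices are the Claude Sonnet 4.6 figures from the price table in this section.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price: float, output_price: float) -> float:
    """Return the cost in USD; prices are in USD per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Hypothetical call: 10,000 input tokens and 2,000 output tokens
cost = call_cost(10_000, 2_000, input_price=3.0, output_price=15.0)
# 10,000 * 3 / 1e6 + 2,000 * 15 / 1e6 = 0.03 + 0.03
print(f"${cost:.4f}")
```

Note that output tokens cost five times as much as input tokens at these prices, so a short prompt with a long answer can cost more than a long prompt with a short answer.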

Model price comparison (USD per 1M tokens, 2025 reference prices):

Model	Input	Output	Cache hit
Claude Opus 4.7	$15	$75	$1.5
Claude Sonnet 4.6	$3	$15	$0.3
Claude Haiku 4.5	$0.8	$4	$0.08
GPT-4o	$2.5	$10	$1.25
GPT-4o mini	$0.15	$0.6	$0.075
Gemini 2.5 Flash	$0.15	$0.6	—

Strategy 1: Model Routing

Not every task needs a top-tier model. Route requests by complexity:

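def route_model(task: str, complexity: str) -> str:
    routing = {
        "simple":  "claude-haiku-4-5",     # classification, summarization, formatting
        "medium":  "claude-sonnet-4-6",    # general Q&A, code
        "complex": "claude-opus-4-7",      # complex reasoning, creative writing
    }
    # Unknown complexity labels fall back to the mid-tier model
    return routing.get(complexity, "claude-sonnet-4-6")

# Simple tasks cost about 95% less on Haiku than on Opus
# Per the table above, Opus is roughly 19x the price of Haiku (15 / 0.8)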
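
To see what routing is worth across a whole workload rather than a single task, the following sketch computes a blended input price. The traffic mix is a hypothetical assumption, as are the names `INPUT_PRICES` and `blended_input_price`; the prices come from the table in this section, and the calculation assumes every tier sends similar token volumes per request.

```python
# Input prices in USD per 1M tokens, taken from the price table
INPUT_PRICES = {"simple": 0.8, "medium": 3.0, "complex": 15.0}

def blended_input_price(mix: dict) -> float:
    """Average input price per 1M tokens, given each tier's share of traffic."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * INPUT_PRICES[tier] for tier, share in mix.items())

# Hypothetical traffic mix: 70% simple, 25% medium, 5% complex
mix = {"simple": 0.70, "medium": 0.25, "complex": 0.05}
routed = blended_input_price(mix)
all_opus = INPUT_PRICES["complex"]
saving = 1 - routed / all_opus
print(f"routed: ${routed:.2f}/1M, all-Opus: ${all_opus:.2f}/1M, saving: {saving:.0%}")
```

The saving depends entirely on the mix: the larger the share of simple tasks, the closer the blended price gets to the Haiku price.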

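Automatic complexity estimation:

# Use a small model to judge complexity, then route to a suitable model
# (call_haiku: assumed helper that sends a prompt to Haiku and returns its text)
def estimate_complexity(prompt: str) -> str:
    judge = call_haiku(f"Rate this task's complexity; answer with one word (simple/medium/complex): {prompt[:200]}")
    label = judge.strip().lower()
    # Fall back to "medium" when the reply is not one of the expected labels
    return label if label in ("simple", "medium", "complex") else "medium"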

Strategy 2: Prompt Caching

Cache the fixed system prompt. Tokens read from the cache are billed about 90% below the normal input price:

# Claude: mark the cacheable part with cache_control
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,       # fixed system prompt of 2000+ tokens
            "cache_control": {"type": "ephemeral"},  # cached for 5 minutes
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)
 
# Inspect cache usage
usage = response.usage
print(f"cache read tokens: {usage.cache_read_input_tokens}")
print(f"cache write tokens: {usage.cache_creation_input_tokens}")

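Conditions for a cache hit:

The cached portion must be at least 1024 tokens (Claude; the exact minimum depends on the model)
The request is repeated within 5 minutes
The cached content is exactly identical, including its position in the prompt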
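
The conditions above determine when caching pays off. The sketch below compares the cost of a shared prompt prefix with and without caching, using the Claude Sonnet 4.6 prices from the table ($3 input, $0.3 cache hit). The function name `prefix_cost`, the call count, and the `write_multiplier` surcharge for the first (cache-writing) call are assumptions not stated in the text; check current pricing before relying on them.

```python
def prefix_cost(calls: int, prefix_tokens: int, input_price: float,
                cache_read_price: float, write_multiplier: float = 1.25):
    """Return (cost_without_cache, cost_with_cache) in USD for the shared prefix only.

    Prices are USD per 1M tokens. write_multiplier is an assumption: the first
    call writes the cache at a surcharge over the normal input price.
    Every later call is assumed to arrive before the cache expires.
    """
    per_token = 1 / 1_000_000
    without = calls * prefix_tokens * input_price * per_token
    first_write = prefix_tokens * input_price * write_multiplier * per_token
    later_reads = (calls - 1) * prefix_tokens * cache_read_price * per_token
    return without, first_write + later_reads

# 100 calls sharing a 2,000-token system prompt
without, with_cache = prefix_cost(100, 2_000, input_price=3.0, cache_read_price=0.3)
print(f"without cache: ${without:.4f}, with cache: ${with_cache:.4f}")
```

If calls arrive more than 5 minutes apart, each one pays the write price again and caching can cost more than it saves.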

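Strategy 3: Prompt Compression

Method	Implementation	Savings
Truncate history	Keep only the last N turns	20-50%
Summarize history	Compress older turns with a small model	50-80%
Remove redundancy	Drop repeated instructions from examples	10-20%
Structured output	Ask for JSON instead of natural language	Fewer output tokens
Drop politeness	Remove "please" and "thank you" from the system prompt	5-10%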

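def compress_history(messages: list, max_tokens: int = 4000) -> list:
    """Keep recent messages; once over budget, replace older ones with a summary."""
    if count_tokens(messages) <= max_tokens or len(messages) <= 6:
        return messages

    old = messages[:-6]
    recent = messages[-6:]  # keep the last 3 turns

    summary = call_haiku(
        f"Summarize the following conversation in under 200 words:\n{format_messages(old)}"
    )

    # The Anthropic Messages API has no "system" role inside messages,
    # so the summary is passed as a leading user message.
    return [{"role": "user", "content": f"[Conversation summary] {summary}"}] + recent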
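
`compress_history` relies on `count_tokens` and `format_messages`, which the text does not define. The sketch below gives minimal stand-ins, plus a `truncate_history` helper for the "truncate history" row of the table. All three are assumptions for illustration: message content is assumed to be a plain string, and the 4-characters-per-token rule is only a rough heuristic (it is less accurate for non-English text), so use a real tokenizer where precision matters.

```python
def count_tokens(messages: list) -> int:
    """Rough estimate: about 4 characters per token (a heuristic, not a tokenizer)."""
    return sum(len(m["content"]) for m in messages) // 4

def format_messages(messages: list) -> str:
    """Render messages as 'role: content' lines for a summarization prompt."""
    return "\n".join(f'{m["role"]}: {m["content"]}' for m in messages)

def truncate_history(messages: list, keep_turns: int = 3) -> list:
    """Keep only the last keep_turns turns (one turn = user message + assistant reply)."""
    return messages[-2 * keep_turns:] if keep_turns > 0 else []

history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello, how can I help?"},
    {"role": "user", "content": "Summarize our Q3 report."},
    {"role": "assistant", "content": "Sure, please paste it."},
]
print(format_messages(truncate_history(history, keep_turns=1)))
```

Truncation is free but loses older context entirely; summarization costs one small-model call but keeps the gist.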

Strategy 4: Semantic Caching

Reuse earlier answers for similar questions to avoid repeated API calls:

from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")
cache = []  # [{embedding, question, answer}]

def semantic_cache_lookup(question: str, threshold: float = 0.92):
    # Normalized embeddings make the dot product equal to cosine similarity
    q_emb = encoder.encode(question, normalize_embeddings=True)
    for item in cache:
        sim = np.dot(q_emb, item["embedding"])
        if sim > threshold:
            return item["answer"]  # cache hit
    return None

def ask(question: str) -> str:
    cached = semantic_cache_lookup(question)
    if cached is not None:
        return cached  # no API cost

    answer = call_llm(question)  # call_llm: assumed helper that queries the LLM
    cache.append({
        "embedding": encoder.encode(question, normalize_embeddings=True),
        "question": question,
        "answer": answer,
    })
    return answer
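
The loop above returns the first entry over the threshold and scans the cache one item at a time. As a possible refinement, the sketch below stacks the cached embeddings into one matrix and returns the best match instead. The function name `best_match` and the toy 2-D vectors are illustrative assumptions; real embeddings would come from the encoder and are assumed to be unit-length.

```python
import numpy as np

def best_match(query: np.ndarray, embeddings: np.ndarray, threshold: float = 0.92):
    """Return the index of the most similar cached row, or None if below threshold.

    query has shape (d,), embeddings has shape (n, d). Both are assumed to be
    unit-length, so the matrix-vector product gives cosine similarities.
    """
    if embeddings.shape[0] == 0:
        return None
    sims = embeddings @ query
    idx = int(np.argmax(sims))
    return idx if sims[idx] > threshold else None

# Toy 2-D vectors standing in for real sentence embeddings
cached = np.array([[1.0, 0.0], [0.6, 0.8]])
print(best_match(np.array([0.6, 0.8]), cached))  # identical to row 1
print(best_match(np.array([0.0, 1.0]), cached))  # best similarity is 0.8, below 0.92
```

For large caches, a vector index or database would replace the in-memory matrix, but the threshold logic stays the same.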

Strategy 5: Batch Processing

For non-real-time workloads, use the Batch API at half price:

# Anthropic Batch API: 50% discount, results returned within 24 hours
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"item-{i}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 512,
                "messages": [{"role": "user", "content": item}],
            }
        }
        for i, item in enumerate(items)
    ]
)
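
Submitting the batch is only half of the workflow; the results have to be collected once processing ends. The following is a minimal sketch of that step, assuming the Anthropic Python SDK's `batches.retrieve` and `batches.results` methods and text-only replies. The function name `wait_for_batch`, the polling interval, and the injectable `sleep` parameter are illustrative choices, not part of the original text.

```python
import time

def wait_for_batch(client, batch_id: str, poll_seconds: float = 60.0,
                   sleep=time.sleep) -> dict:
    """Poll until the batch has ended, then map custom_id -> reply text.

    Requests that did not succeed (errored, canceled, expired) map to None.
    """
    while client.messages.batches.retrieve(batch_id).processing_status != "ended":
        sleep(poll_seconds)

    answers = {}
    for entry in client.messages.batches.results(batch_id):
        if entry.result.type == "succeeded":
            # Assumes the first content block of the reply is text
            answers[entry.custom_id] = entry.result.message.content[0].text
        else:
            answers[entry.custom_id] = None
    return answers
```

A typical call would be `answers = wait_for_batch(client, batch.id)`, after which the `custom_id` keys (`item-0`, `item-1`, ...) link each answer back to its input.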

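Monitoring and Alerting

# Track token usage and cost for every call
# INPUT_PRICE / OUTPUT_PRICE: USD per 1M tokens for the model in use (defined elsewhere)
def tracked_call(prompt, **kwargs):
    response = client.messages.create(messages=[{"role": "user", "content": prompt}], **kwargs)
    usage = response.usage
    # Cache reads and writes are billed at other rates and are not included here
    cost = (usage.input_tokens * INPUT_PRICE + usage.output_tokens * OUTPUT_PRICE) / 1_000_000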
    
    metrics.record("llm_cost_usd", cost)
    metrics.record("llm_input_tokens", usage.input_tokens)
    metrics.record("llm_output_tokens", usage.output_tokens)
    
    if cost > ALERT_THRESHOLD:
        alert(f"Per-call cost exceeded threshold: ${cost:.4f}")
    
    return response
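
A per-call threshold catches single expensive requests but misses a flood of cheap ones. One way to cover that case is a running daily total, sketched below. The class name `DailyBudget` and the $1.00 limit are hypothetical; in practice the returned flag would trigger the same `alert` used above.

```python
from collections import defaultdict
from datetime import date
from typing import Optional

class DailyBudget:
    """Accumulate per-day spend and report when a daily limit is crossed."""

    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent = defaultdict(float)  # date -> USD spent on that day

    def record(self, cost_usd: float, day: Optional[date] = None) -> bool:
        """Add a call's cost; return True only on the call that first crosses the limit."""
        day = day or date.today()
        before = self.spent[day]
        self.spent[day] = before + cost_usd
        return before <= self.limit_usd < self.spent[day]

budget = DailyBudget(limit_usd=1.00)
day = date(2025, 1, 1)
for cost in (0.40, 0.40, 0.40, 0.40):
    if budget.record(cost, day):
        print(f"Daily budget exceeded: ${budget.spent[day]:.2f}")
```

Returning True only once per day keeps the alert channel from being flooded after the limit is crossed.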

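Cost Optimization Priorities

1. Model routing          ← biggest payoff: use small models for simple tasks
2. Prompt caching         ← a must when the fixed system prompt exceeds 1K tokens
3. Batch processing       ← 50% discount for non-real-time tasks
4. Semantic caching       ← repetitive Q&A workloads
5. Prompt compression     ← long conversations
6. Output length control  ← set a sensible max_tokens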
