Agent 架构设计 — 面试版

Agent = LLM + 规划能力 + 工具调用 + 记忆，核心是一个"观察→思考→行动"的循环

1. Agent 核心循环

所有 Agent 框架（LangChain、CrewAI、AutoGen）底层都是这个循环，面试答出这个就够了。

class Agent:
    def run(self, user_input, max_steps=10):
        messages = [system_prompt(), user_msg(user_input)]

        for step in range(max_steps):
            # ② 思考：LLM 决定下一步
            response = llm.chat(messages, tools=self.tools)

            # 判断是否完成
            if response.finish_reason == "stop":
                return response.content          # 最终输出

            # ③ 行动：执行工具调用
            for tool_call in response.tool_calls:
                result = self.execute_tool(tool_call)  # 带超时+异常捕获
                messages.append(tool_result_msg(tool_call.id, result))

            # ① 观察：工具结果已追加到 messages，下轮 LLM 可见

        # 超限兜底
        return fallback_response("思考步数超限，已停止")

2. 工具设计

Agent 的能力边界 = 工具的能力边界。工具设计得好不好，直接决定 Agent 好不好用。

# 一个好的 Tool 定义 — 给 LLM 看的
tools = [
    {
        "name":        "search_orders",
        "description": "按条件搜索订单。用于用户问'我的订单'、'最近买了什么'等场景。",
        "parameters":  {
            "user_id":    { "type": "string", "description": "用户ID", "required": True },
            "status":     { "type": "string", "enum": ["pending", "shipped", "done"] },
            "limit":      { "type": "int", "default": 5, "description": "最多返回条数" },
        }
    }
]

# 工具执行层 — 你的后端代码
def execute_tool(self, tool_call):
    fn = self.tool_registry[tool_call.name]

    # 参数校验 — LLM 可能给错类型
    args = validate_args(tool_call.arguments, fn.schema)

    # 权限检查 — 不是所有工具都允许当前用户调
    check_permission(self.user, tool_call.name)

    # 执行 + 超时保护
    try:
        result = timeout(30)(fn)(**args)
    except TimeoutError:
        result = "工具执行超时"
    except Exception as e:
        result = f"工具执行失败: {e}"       # 把错误告诉 LLM，让它自行调整

    return truncate(result, max_tokens=2000)  # 防止返回太长撑爆上下文

工具设计原则

原子化 — 一个工具只做一件事
Description 写给 LLM 看 — 说清楚什么场景用
返回结构化数据 — 不要返回大段 HTML
参数有 enum 就写 enum — 减少 LLM 猜测

常见坑

工具太多 → LLM 选错工具（建议 ≤15 个）
description 模糊 → LLM 乱调
返回太长 → 吃掉上下文窗口
没有超时 → 一个工具卡住整个循环

生产必须有

参数校验 — LLM 会编造参数
权限控制 — 写操作需二次确认
超时 + 重试 — 外部 API 不可靠
结果截断 — 防撑爆上下文

3. 四种主流 Agent 模式

模式 A：ReAct 最常用

Reasoning + Acting 交替进行，每一步先说想法（Thought），再执行动作（Action），再看结果（Observation）。

模式 B：Plan-and-Execute 复杂任务

先让 LLM 生成完整计划，再逐步执行。适合步骤多、有依赖关系的任务。

class PlanAndExecuteAgent:
    def run(self, task):
        # 阶段一：规划（一次 LLM 调用）
        plan = llm.chat(f"""
            任务: {task}
            请拆解为有序步骤列表，每步包含:
            - step: 步骤描述
            - tool: 需要的工具
            - depends_on: 依赖哪些前置步骤的结果
        """)

        # 阶段二：逐步执行
        results = {}
        for step in plan.steps:
            # 把前置步骤的结果注入当前步骤的上下文
            context = {k: results[k] for k in step.depends_on}
            results[step.id] = execute_step(step, context)

        # 阶段三：汇总
        return llm.chat(f"根据以下执行结果回答用户: {results}")

# 优点：步骤可并行（无依赖的步骤同时执行）
# 缺点：计划可能一开始就错了 → 需要 Replan 机制

模式 C：Router Agent 意图分发

不执行具体任务，只负责理解意图并路由到对应的专业处理链。适合多业务场景。

class RouterAgent:
    routes = {
        "order_query":    OrderChain,       # 查订单 → 确定性工作流
        "product_consult": RAGChain,         # 商品咨询 → RAG 检索
        "complaint":      ComplaintAgent,    # 投诉 → 需要多轮交互的 Agent
        "chit_chat":      DirectLLM,        # 闲聊 → 直接 LLM 回答
    }

    def run(self, query):
        # LLM 只做分类，不做执行
        intent = llm.classify(query, categories=self.routes.keys())
        handler = self.routes[intent]
        return handler.run(query)

# 核心思想：能用确定性链路解决的，不走 Agent 循环
# Router 只在"不确定该走哪条路"时才用 LLM

模式 D：Multi-Agent 协作系统

多个专业 Agent 协作完成任务。每个 Agent 有自己的工具集和 System Prompt。

class MultiAgentOrchestrator:
    agents = {
        "researcher": Agent(tools=[search, read_url],  prompt="你是调研专家..."),
        "coder":      Agent(tools=[run_code, git],   prompt="你是程序员..."),
        "reviewer":   Agent(tools=[lint, test],      prompt="你是代码审查员..."),
    }

    def run(self, task):
        # 1. Orchestrator 拆解任务
        subtasks = llm.chat(f"把任务拆给不同角色: {task}")

        # 2. 分发 + 并行执行无依赖的子任务
        results = {}
        for batch in topological_sort(subtasks):
            futures = {
                st.agent: async_run(self.agents[st.agent], st, results)
                for st in batch
            }
            results.update(await gather(futures))

        # 3. 汇总
        return llm.chat(f"综合所有结果回答: {results}")

4. 模式选型对照表

模式	适用场景	优点	缺点	生产建议
ReAct	简单工具调用、问答	实现简单、可审计	步骤多时 token 爆炸	首选大多数场景够用
Plan-and-Execute	多步骤、有依赖关系	可并行、结构清晰	初始计划可能错、需 Replan	推荐复杂任务
Router	多业务线入口	简单路由不走 Agent 循环	分类错了就全错	首选多场景入口
Multi-Agent	大型协作任务	职责隔离、可组合	通信成本高、debug 难	谨慎内部工具可以，toC 慎用
确定性工作流	步骤固定、无需推理	可靠、快、便宜	不灵活	首选能不用 Agent 就别用

5. 记忆系统

class ConversationMemory:
    def build_context(self, session_id, new_msg):
        # 1. 短期：最近 N 轮原始对话
        recent = redis.get_recent(session_id, limit=20)

        # 2. 如果历史太长 → 压缩旧的部分
        if token_count(recent) > 3000:
            old, keep = recent[:-10], recent[-10:]
            summary = llm.chat(f"压缩这段对话为摘要: {old}")
            recent = [system_msg(f"之前的对话摘要: {summary}")] + keep

        # 3. 长期：检索相关历史记忆
        memories = vector_db.search(embed(new_msg), user_id=..., top_k=3)

        return recent + [system_msg(f"相关历史: {memories}")]

6. 生产环境必须解决的问题

可控性

max_steps 上限 — 防止死循环，到了就强制终止
工具白名单 — Agent 只能调被注册的工具
写操作二次确认 — 删除/修改类操作需要 human-in-the-loop
Token 预算 — 单次对话总 token 有上限
敏感操作审批 — 转账、发邮件等需人工审批

可观测

Trace 链路 — 每步 Thought→Action→Observation 都要记录
工具调用日志 — 参数、返回值、耗时
LLM 调用详情 — prompt、completion、token 数
步骤成功率 — 哪个工具经常失败
用户满意度 — Agent 完成任务了吗

容错

工具失败 → 告诉 LLM，让它自己调整策略
LLM 返回格式错误 → 重试（最多 2 次）
整体超时 → 返回已有结果 + 告知用户未完成
模型降级 — 主模型不可用时切备用
幂等设计 — 工具重复调用不能产生副作用

成本

每轮循环都是一次完整 LLM 调用（含历史上下文）
10 步 Agent = 10 次 LLM 调用，token 成本指数增长
对策：结果截断 + 上下文压缩 + 缓存
简单意图用 Router 分流，不进 Agent 循环
固定流程用确定性工作流，零 LLM 调用

7. 面试回答框架

一句话定义：Agent = LLM 作为推理引擎 + 工具调用作为执行引擎 + 记忆作为状态管理，通过"观察→思考→行动"循环自主完成任务。

面试关键态度：不要把 Agent 说得太万能。面试官想听到你的取舍判断——

"生产环境我会优先用确定性工作流覆盖 80% 的确定场景，只在需要灵活推理的 20% 场景用 Agent。
Agent 的核心问题是不可控和成本高，所以必须有 max_steps、工具白名单、Token 预算、Human-in-the-loop 这些约束机制。
模式选型上，简单场景用 ReAct，多业务入口用 Router，复杂任务用 Plan-and-Execute，Multi-Agent 目前更适合内部场景。"

面试常见追问 & 要点

"Agent 和工作流的区别？" — 工作流步骤固定，Agent 步骤由 LLM 动态决定
"怎么防止 Agent 失控？" — max_steps + 工具白名单 + 写操作审批 + Token 预算
"上下文太长怎么办？" — 滑动窗口 + 历史摘要压缩 + 工具结果截断
"Multi-Agent 通信怎么做？" — 消息传递（最简单）/ 共享黑板 / 事件总线
"怎么评估 Agent 效果？" — 任务完成率 + 平均步骤数 + 成本/任务 + 用户满意度
"用什么框架？" — 了解 LangGraph / CrewAI / AutoGen，但强调理解原理比框架重要