Handle async errors and retries inside a Tool by using structured error returns, implementing retry logic with exponential backoff, classifying error types, and leveraging LangChain's built-in retry utilities and middleware to keep the agent loop running and allow the LLM to recover.
To keep a tool error from crashing the entire agent loop, treat errors as recoverable events rather than fatal exceptions. The core pattern is to catch the exception inside the tool and return structured error information to the agent instead of raising. The LLM can then see what went wrong and attempt to correct its approach. Retry logic for transient failures can be layered on top, so that tool execution stays resilient without breaking the agent's reasoning flow.
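As a minimal sketch of the structured-return pattern, the coroutine below catches its own exceptions and always hands back a JSON string. The name lookup_order, the numeric-ID rule, and the field names (ok, error_type, message, hint) are hypothetical choices for illustration, and registering the function as a LangChain tool is left out so the sketch stays dependency-free.

```python
import asyncio
import json


async def lookup_order(order_id: str) -> str:
    """Hypothetical tool body: returns a JSON string on success and on failure."""
    try:
        if not order_id.isdigit():
            raise ValueError(f"order_id must be numeric, got {order_id!r}")
        await asyncio.sleep(0)  # stand-in for a real async call (HTTP, database, ...)
        return json.dumps({"ok": True, "order_id": order_id, "status": "shipped"})
    except Exception as exc:
        # Turn the exception into data the LLM can read instead of letting it
        # propagate into the agent loop. asyncio.CancelledError derives from
        # BaseException, so task cancellation is not swallowed here.
        return json.dumps({
            "ok": False,
            "error_type": type(exc).__name__,
            "message": str(exc),
            "hint": "Check the arguments and call the tool again.",
        })
```

Calling asyncio.run(lookup_order("abc")) yields a payload with ok set to false and the validation message, which the agent can read and act on in its next step.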
For transient failures such as network timeouts or rate limits, you can implement retry logic directly inside the tool function. Use exponential backoff so that repeated attempts do not overwhelm the failing service. This keeps the retry logic encapsulated within the tool: the agent loop sees only the final success, or a well-formatted error once retries are exhausted.
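One way to sketch in-tool retries with only the standard library is shown below. The helper name call_with_backoff, its default delays, and the set of exceptions treated as transient are assumptions to adapt to the client library you actually use. Random jitter is added to each wait so that concurrent callers do not retry in lockstep.

```python
import asyncio
import json
import random

# Assumed set of retryable exceptions; extend it for your HTTP or SDK client.
TRANSIENT_ERRORS = (asyncio.TimeoutError, TimeoutError, ConnectionError)


async def call_with_backoff(operation, max_attempts=3, base_delay=0.5, max_delay=8.0):
    """Await operation() and retry transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = await operation()
            return json.dumps({"ok": True, "result": result, "attempts": attempt})
        except TRANSIENT_ERRORS as exc:
            if attempt == max_attempts:
                # Retries exhausted: report a structured error instead of raising.
                return json.dumps({
                    "ok": False,
                    "error_type": type(exc).__name__,
                    "message": str(exc),
                    "attempts": attempt,
                })
            # The delay doubles on each attempt and is capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
```

Exceptions outside TRANSIENT_ERRORS are deliberately not caught here, so a caller can still convert them into a structured error or let them fail fast.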
LangChain provides built-in mechanisms for adding retry logic to any Runnable, including tools. The .with_retry() method can be applied to a tool instance and lets you specify which exceptions trigger a retry (retry_if_exception_type), the maximum number of attempts (stop_after_attempt), and whether to wait with exponential backoff and jitter between attempts (wait_exponential_jitter). Note that .with_retry() returns a new, wrapped Runnable rather than modifying the tool in place. This keeps your tool implementation clean while still providing resilience.
For more sophisticated recovery scenarios, you can use the ToolRecoverMiddleware from the langchain-tool-recover package. This middleware classifies errors into categories (timeout, rate_limit, validation_error, empty_result, etc.) and applies a configurable recovery action to each. When a tool fails or returns an empty result, the middleware intercepts the failure and returns a structured JSON message to the agent, so the LLM can reason about what happened and adjust its strategy.
In some cases, a tool may need to signal that the agent loop should end early, even though the agent has not reached its final answer. For example, after a successful verification step, you might want to skip further tool calls and terminate. This can be achieved by having the tool return a Command object with a jump_to instruction. The feature is still experimental and under discussion in LangChain's feature requests, so its API may change, but it offers a way to control the agent's execution flow dynamically.
Production-ready error handling requires classifying errors into categories and applying a suitable recovery strategy to each. Classification should be deterministic and conservative. Tools like ToolRecoverMiddleware implement this classification automatically, but you can also implement your own logic. Common error classes include timeout, rate_limit, validation_error, auth_error, empty_result, and unknown_error. Each class should map to a specific recovery action: retry with backoff for transient errors, return to the agent for validation issues, or fail fast for authentication errors.
timeout or rate_limit: retry with exponential backoff. These failures are transient and often succeed on retry.
validation_error: return to the agent with a structured message. The agent may correct its input format.
empty_result: return to the agent with a suggestion. The agent may adjust its query.
auth_error: fail fast (do not retry). Retrying won't fix authentication issues.
unknown_error: log and fail fast, or return to the agent for recovery, depending on how critical the error is.
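The mapping above can be sketched as a small deterministic classifier. This is hand-rolled illustration code and not the ToolRecoverMiddleware API: RateLimitError and AuthError are hypothetical stand-ins for whatever your client library raises, and the action names are arbitrary labels.

```python
import asyncio
from typing import Optional


class RateLimitError(Exception):
    """Hypothetical stand-in for a client library's rate-limit exception."""


class AuthError(Exception):
    """Hypothetical stand-in for a client library's authentication exception."""


RECOVERY_ACTIONS = {
    "timeout": "retry_with_backoff",
    "rate_limit": "retry_with_backoff",
    "validation_error": "return_to_agent",
    "empty_result": "return_to_agent",
    "auth_error": "fail_fast",
    # Conservative default; returning to the agent is the other option.
    "unknown_error": "fail_fast",
}


def classify_exception(exc: BaseException) -> str:
    """Map an exception to an error class using type checks only."""
    if isinstance(exc, (asyncio.TimeoutError, TimeoutError)):
        return "timeout"
    if isinstance(exc, RateLimitError):
        return "rate_limit"
    if isinstance(exc, (AuthError, PermissionError)):
        return "auth_error"
    if isinstance(exc, (ValueError, TypeError)):
        return "validation_error"
    return "unknown_error"


def classify_result(result: object) -> Optional[str]:
    """Return "empty_result" for None or an empty string, list, or dict."""
    return "empty_result" if result in (None, "", [], {}) else None


def recovery_action(error_class: str) -> str:
    """Look up the action for a class; unrecognized classes fail fast."""
    return RECOVERY_ACTIONS.get(error_class, "fail_fast")
```

Type checks keep the classification deterministic, and anything the classifier does not recognize falls through to unknown_error and fails fast, which matches the conservative stance described above.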
The LangGraph team is also exploring built-in reliability features for create_react_agent, including an error_handling configuration that would automatically classify errors and apply appropriate retry strategies. If adopted, this would simplify production deployments by providing sensible defaults for common failure modes.