An LLM decides which tool to call based on the tool's name and description, which guide the model's reasoning about when and how to use each tool; the description is the most critical factor for effective tool use.
An LLM does not have the ability to execute tools or code on its own. Instead, when you provide it with a set of tool definitions (name, description, and input schema), the model processes these as part of its context. When given a user request, the LLM analyzes the request and the available tool definitions, then decides whether it needs to use a tool to fulfill the request. If it decides a tool is needed, it outputs a structured response containing the name of the tool to call and the arguments to pass to it — but it does not execute the tool itself[citation:8]. The application code is then responsible for actually invoking the tool function and returning the result back to the LLM for the final answer[citation:10].
The tool description is arguably the single most important element in the tool definition for guiding LLM behavior[citation:1][citation:4]. Since the LLM has no innate understanding of what a tool does, it relies entirely on the name and description you provide to understand the tool's purpose and decide when it's appropriate to use it. A clear, detailed description helps the model correctly select the right tool for the right task. Conversely, a vague or missing description can lead to the model either ignoring the tool entirely or using it incorrectly[citation:10].
LangChain's official documentation and community best practices emphasize several guidelines for writing effective tool descriptions.
Be Clear and Descriptive: Explain exactly what the tool does, when to use it, and what the expected outputs are. This is the most important factor for effective tool use[citation:4].
Use Descriptive Tool Names: Choose names that clearly indicate functionality, such as search_weather instead of get_data[citation:4].
Include Examples: Where possible, include examples of proper tool usage in the description to clarify expected input formats[citation:4].
Describe Parameters: Provide detailed descriptions for each parameter, explaining the expected format, constraints, and examples[citation:4].
Use JSON Schema: Specify proper data types and constraints using JSON Schema to guide the LLM's argument generation[citation:8].
Balance Detail with Token Usage: Descriptions become part of the prompt tokens and contribute to the overall cost, so be thorough but not overly verbose[citation:4].
Even with perfect descriptions, tool-calling accuracy varies significantly across different LLMs. In a Docker benchmark evaluating 21 models across 3,570 test cases, OpenAI's GPT-4 achieved a near-perfect tool selection F1 score of 0.974. Among open-source models, Qwen 3 (14B) performed exceptionally well with an F1 score of 0.971, while Qwen 3 (8B) achieved 0.933 with significantly lower latency[citation:7]. Quantized versions of models showed no significant difference in tool-calling accuracy compared to their non-quantized counterparts, suggesting quantization can reduce resource usage without negatively impacting performance[citation:7].
When tool calling fails, the issues typically fall into several categories, based on Docker's testing of local models. Some models exhibit eager invocation, calling tools even for simple greeting messages like "Hi there!" Others show wrong tool selection, choosing an incorrect tool for the task, such as using a search tool when they should use an add-to-cart tool. Invalid arguments are another common failure, where parameters are missing or malformed. Finally, some models display ignored responses, failing to incorporate tool outputs into their final answer, leading to awkward or incomplete conversations[citation:7]. These issues highlight why evaluating model tool-calling capabilities is essential for production applications.
In LangChain, tools are passed to agents, and the LLM decides when and how to invoke them based on the prompt and goal[citation:1]. The agent follows the ReAct (Reasoning + Acting) pattern, alternating between reasoning steps and tool calls, feeding observations back into the loop until it can deliver a final answer[citation:10]. This iterative process continues until the model either emits a final output without tool calls or reaches an iteration limit[citation:10].