
Section 0b: From API Calls to Tool Use

In the last section, the model could only talk. In this one, it learns to do things. The difference is smaller than you think: about 15 lines of code.

Here is what actually happens when a model "uses a tool": your code sends a list of function signatures to the API. The model reads those signatures and, instead of returning a text response, returns a JSON blob that says "call this function with these arguments." Your code calls the function. You send the result back to the model. The model uses the result to generate a text response. That's the entire mechanism. No remote procedure calls. No plugin system. No runtime loaded from the model's side. The model writes JSON. You do the work.

Before we get to function calling, though, we need to talk about the two techniques that make it actually work in production: structured prompting and few-shot examples. These aren't optional. They're the difference between a system that works in demos and one that works at 2am when you're asleep.

Prompting as engineering, not art

System prompts are code. Version them. Test them. Review them in pull requests. If your prompt engineering process is "try things until it works," you're debugging without logs.

Here's a prompt I see in prototypes all the time:

You are a helpful assistant that analyzes data.

This prompt tells the model almost nothing. What kind of data? What does "analyze" mean? What format should the output be in? The model will fill in the gaps with its own interpretation, and that interpretation will change between calls, between models, and between API versions. You've written a function with no type signature and no docstring, then wondered why consumers use it wrong.

Here's the same prompt written like code:

You are a financial data analyst. You receive quarterly revenue data as CSV.
Your job: identify the quarter-over-quarter growth rate for each line item.

Rules:
1. Output valid JSON with the schema: {"line_items": [{"name": str, "q1": float, "q2": float, "growth_pct": float}]}
2. Calculate growth as: (q2 - q1) / q1 * 100, rounded to 1 decimal place.
3. If q1 is zero, set growth_pct to null.
4. Do not include commentary. Return only the JSON object.

What just happened

The second prompt specifies the domain, the input format, the output schema, the calculation method, the edge case handling, and what to omit. Every one of those constraints reduces variance in the output. Each constraint is testable. You can write an assertion that checks whether the output matches the JSON schema. You can verify the growth calculation is correct. You can confirm there's no commentary outside the JSON. You can't test "be helpful."

The principle is straightforward: every ambiguity in your prompt is a degree of freedom the model will use in ways you didn't intend. Remove them. Specify the format. Specify the edge cases. Specify what not to do. Treat your system prompt as a contract, the same way you'd treat a function signature or an API spec.

Version your prompts in source control. When output quality degrades, git diff the prompt. When you swap models, run your prompt test suite. This is not overhead. This is the minimum viable engineering practice for systems that depend on natural language interfaces.

Few-shot examples are your type system

When the model needs to follow a pattern, show it the pattern. This is called few-shot prompting, and in my experience it is the most effective technique for getting consistent structured output.

Consider a task where you need the model to classify customer support messages into categories. Without examples:

messages = [
    {"role": "system", "content": """Classify the following support message
into exactly one category: billing, technical, account, or general.
Return only the category name, lowercase, no punctuation."""},
    {"role": "user", "content": "I can't log into my account after resetting my password"}
]
# Model might return: "account"
# Model might return: "Account"
# Model might return: "This is an account issue."
# Model might return: "technical (login issue)"

The model understood the task. It just couldn't commit to a format. Now with two examples:

messages = [
    {"role": "system", "content": """Classify the following support message
into exactly one category: billing, technical, account, or general.
Return only the category name, lowercase, no punctuation."""},
    {"role": "user", "content": "My credit card was charged twice this month"},
    {"role": "assistant", "content": "billing"},
    {"role": "user", "content": "The dashboard keeps showing a 500 error"},
    {"role": "assistant", "content": "technical"},
    {"role": "user", "content": "I can't log into my account after resetting my password"}
]
# Model returns: "account"
# Every time.

What just happened

Two examples did more than a paragraph of instructions. The model saw the pattern: one word, lowercase, no explanation. It follows that pattern. Few-shot examples are like type annotations for natural language. They constrain the output space by demonstration rather than description.

Two to three examples is the sweet spot for most tasks. One example can be dismissed as coincidence. Four or more starts eating context budget without proportional improvement. Pick examples that cover your edge cases: one straightforward case, one boundary case, and one that's easy to get wrong.
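Keeping those examples as data, rather than baked into a prompt string, makes them easy to swap per task. A sketch of a helper that assembles the message list; `build_fewshot_messages` is hypothetical, not part of the companion code:

```python
def build_fewshot_messages(
    system: str, examples: list[tuple[str, str]], query: str
) -> list[dict[str, str]]:
    """Build a chat message list: system prompt, then each (input, label)
    example as a user/assistant pair, then the real query last."""
    messages = [{"role": "system", "content": system}]
    for user_text, label in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": query})
    return messages
```

With the examples in a list, covering a new edge case is a one-line change, and the same list can double as test fixtures for your classifier.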

Few-shot examples are particularly powerful for tool-using systems. When the model sees examples of choosing the right tool for a given query, it learns the selection criteria far more reliably than from descriptions alone. We'll come back to this when we discuss tool selection later in this section.

Function calling from scratch

This is the core mechanism. Every agent framework wraps it. Every tool-using system depends on it. And there is no magic. The model outputs JSON that says "call this function with these arguments." Your code does the calling.

The cycle has four steps:

  1. You define tool schemas (JSON that tells the model what tools exist)
  2. You send those schemas alongside the user's message
  3. The model responds with a structured tool call (not text, but a JSON object naming the function and its arguments)
  4. Your code validates the arguments, executes the function, and returns the result
Figure 0b.1: The function calling cycle. The model never executes anything. It writes JSON. You do the work.

Let's build this from scratch. First, we need a way to define tools.

The code below uses Pydantic, a Python library for data validation using type annotations. If you have used dataclasses, think of Pydantic as dataclasses with built-in validation: you declare fields with types, and Pydantic rejects any input that does not match. BaseModel is the base class. Field adds constraints like minimum values or defaults. model_validate() checks a dictionary against the type annotations and raises ValidationError if anything is wrong. We use it here because the model will send us JSON arguments, and we need to validate those arguments before running any code.

Here is the Tool class and ToolRegistry from this book's companion code:

from pydantic import BaseModel, Field
from enum import StrEnum
from typing import Any, Callable, Type


class Tool:
    """An entry in the tool registry.

    Holds the callable function and its associated Pydantic input model so
    arguments can be validated before dispatch.
    """

    def __init__(
        self,
        name: str,
        description: str,
        fn: Callable[..., str],
        input_model: Type[BaseModel],
    ) -> None:
        self.name = name
        self.description = description
        self.fn = fn
        self.input_model = input_model


class ToolRegistry:
    """Registry that maps tool names to Tool entries."""

    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(
        self,
        name: str,
        description: str,
        fn: Callable[..., str],
        input_model: Type[BaseModel],
    ) -> None:
        self._tools[name] = Tool(
            name=name, description=description, fn=fn, input_model=input_model
        )

    def get_schemas(self) -> list[dict]:
        """Return schema objects for all registered tools."""
        return [entry.to_schema() for entry in self._tools.values()]

    def get(self, name: str) -> Tool | None:
        return self._tools.get(name)

    def list_tools(self) -> list[str]:
        """Return the names of all registered tools."""
        return list(self._tools)

What just happened

The ToolRegistry is a dictionary with extra structure. Each tool has a name (how the model refers to it), a description (how the model decides whether to use it), a callable (what your code actually runs), and a Pydantic input model (how you validate arguments before calling the function). That last one matters a lot. We'll get to why in the next section.

Now let's define some actual tools. A calculator:

class Operation(StrEnum):
    ADD = "add"
    SUBTRACT = "subtract"
    MULTIPLY = "multiply"
    DIVIDE = "divide"


class CalculatorInput(BaseModel):
    """Input schema for the calculator tool."""
    operation: Operation
    a: float
    b: float


def calculator(operation: str, a: float, b: float) -> str:
    op = Operation(operation)
    if op == Operation.ADD:
        return str(float(a + b))
    elif op == Operation.SUBTRACT:
        return str(float(a - b))
    elif op == Operation.MULTIPLY:
        return str(float(a * b))
    elif op == Operation.DIVIDE:
        if b == 0:
            return "Error: division by zero"
        return str(float(a / b))
    return f"Error: unknown operation '{operation}'"

And a search tool:

import json


class SearchInput(BaseModel):
    """Input schema for the fake search tool."""
    query: str
    max_results: int = Field(default=3, ge=1, le=10)


def fake_search(query: str, max_results: int = 3) -> str:
    results = [
        {"title": f"Result {i + 1} for '{query}'",
         "url": f"https://example.com/{i + 1}"}
        for i in range(max_results)
    ]
    return json.dumps(results, indent=2)

Now the critical piece: the Tool.to_schema() method that converts a Pydantic model into the JSON schema the model needs:

def to_schema(self) -> ToolSchema:
    """Derive a ToolSchema from the Pydantic model's field definitions."""
    parameters: list[ToolParameter] = []
    model_fields = self.input_model.model_fields

    for field_name, field_info in model_fields.items():
        annotation = field_info.annotation
        type_str = _python_type_to_json_type(annotation)

        enum_values: list[str] | None = None
        if isinstance(annotation, type) and issubclass(annotation, StrEnum):
            enum_values = [e.value for e in annotation]

        parameters.append(
            ToolParameter(
                name=field_name,
                type=type_str,
                description=field_info.description or field_name,
                required=field_info.is_required(),
                enum=enum_values,
            )
        )

    return ToolSchema(
        name=self.name,
        description=self.description,
        parameters=parameters
    )

What just happened

to_schema() walks the Pydantic model's fields and converts them into a ToolSchema, the book's provider-neutral representation of a tool definition. This schema gets sent to the model alongside the user's message. The model reads this schema to understand what tools are available, what arguments they accept, and what constraints those arguments have (like enum values for the operation field). The model never sees your Python code. It sees the schema.

This is the full picture. You define a function. You define a Pydantic model that describes its inputs. You register both in the registry. The registry generates schemas. You send those schemas to the model. The model reads them and decides whether to call a tool and with what arguments.

Now let's see what happens on the other side. When the model decides to use a tool, it doesn't return text. It returns a ToolCall object:

# This is what the model returns instead of text:
{
    "id": "call_001",
    "name": "calculator",
    "arguments": {"operation": "add", "a": 100, "b": 200}
}

Your code receives this, looks up the tool in the registry, validates the arguments, and executes the function:

from pydantic import ValidationError


def execute_tool_call(
    registry: ToolRegistry, tool_name: str, arguments: dict[str, Any]
) -> str:
    """Validate and execute a tool call from the model."""
    entry = registry.get(tool_name)
    if entry is None:
        return f"Error: unknown tool '{tool_name}'"

    try:
        validated = entry.input_model.model_validate(arguments)
    except ValidationError as exc:
        return f"Validation error: {exc}"

    try:
        return entry.fn(**validated.model_dump())
    except Exception as exc:
        return f"Error executing tool '{tool_name}': {exc}"

What just happened

Three things happen in execute_tool_call: lookup, validation, execution. The lookup catches hallucinated tool names. The validation (via Pydantic's model_validate) catches malformed arguments before they reach your function. The try/except around execution catches runtime errors. Each layer returns a string error instead of crashing, because this error message gets sent back to the model as the tool result. The model can then apologize, retry with different arguments, or try a different tool. Crashing would end the conversation.

Here's the full flow wired together:

registry = create_default_registry()

# Direct call: what is 6 * 7?
result = execute_tool_call(
    registry, "calculator",
    {"operation": "multiply", "a": 6, "b": 7}
)
print(result)
# "42.0"

# Division by zero: handled gracefully
result = execute_tool_call(
    registry, "calculator",
    {"operation": "divide", "a": 10, "b": 0}
)
print(result)
# "Error: division by zero"

That is the entire mechanism behind function calling. You defined a Pydantic model, wrote to_schema() to generate the JSON schema, built execute_tool_call() to validate and dispatch. Every major framework automates exactly these three steps. LangChain's @tool decorator reads your type hints and generates the schema. Google ADK's FunctionTool reads the docstring. CrewAI's tool registration does the same. The pattern is identical; only the syntax changes. You built it by hand so you understand what the framework is doing when it hides these lines from you. Section 0d will show you the same tools wrapped in ADK and LangChain, and you will see that the 30 lines of schema and validation machinery you just wrote collapse into a single decorator.

Schema validation is your safety net

The model will hallucinate arguments. Not occasionally. Routinely. The model will pass a string where you need an integer. It will invent parameters that don't exist. It will pass "modulo" as an operation to your calculator that only knows add, subtract, multiply, divide.

Without validation, here's what happens:

# Without Pydantic validation: the model passes "modulo"
try:
    Operation("modulo")
except ValueError as e:
    print(f"Raw error: {e}")
# Raw error: 'modulo' is not a valid Operation

That ValueError crashes your tool execution. If you're not catching it, it crashes your agent loop. If you are catching it but returning a generic "something went wrong" message, the model has no idea what went wrong and will likely try the same thing again.

With Pydantic validation, the error is specific and actionable:

from pydantic import ValidationError

# With Pydantic: structured error that the model can learn from
try:
    CalculatorInput(operation="modulo", a=10, b=3)
except ValidationError as exc:
    print(exc)
1 validation error for CalculatorInput
operation
  Input should be 'add', 'subtract', 'multiply' or 'divide' [type=enum, input_value='modulo', input_type=str]

What just happened

The Pydantic error message tells the model exactly what went wrong and what the valid options are. When you send this error back as the tool result, the model can read "Input should be 'add', 'subtract', 'multiply' or 'divide'" and retry with a valid operation. Compare this to a stack trace or a generic error message. Structured validation errors are documentation for the model.

This is why every tool in the companion code has a Pydantic input model. Not for elegance. For survival. Here are the failure modes that validation catches:

Wrong type: The model passes "seven" instead of 7 for a numeric parameter. Pydantic rejects it (or coerces it, depending on your config). Without validation, your arithmetic function receives a string and either crashes or produces nonsense.

Missing required field: The model forgets to include the operation parameter. Pydantic reports exactly which field is missing. Without validation, your function receives None and crashes with a confusing AttributeError.

Extra fields: The model invents a precision parameter that your calculator doesn't support. Pydantic ignores it by default (configurable). Without validation, your function receives an unexpected keyword argument.

Out of range: The SearchInput model constrains max_results to between 1 and 10 using Field(ge=1, le=10). The model passes 500. Pydantic rejects it with a clear message. Without validation, your search function tries to generate 500 results.

# max_results out of range
result = execute_tool_call(
    registry, "search",
    {"query": "python best practices", "max_results": 500}
)
print(result)
# Validation error: 1 validation error for SearchInput
# max_results
#   Input should be less than or equal to 10 [type=less_than_equal, input_value=500, input_type=int]

Failure case study: the hallucinated argument that almost caused silent corruption

A data pipeline agent had a write_record tool that accepted a priority field: an integer from 1 (low) to 5 (critical). Without Pydantic validation, the model passed priority: 10 for a routine log entry. The write succeeded. No error. The record was stored with priority 10, which was outside the application's expected range. Downstream alerting logic treated anything above 5 as a system emergency and paged the on-call team at 3am. With Field(ge=1, le=5) on the Pydantic model, the validation error would have been caught before the write, the model would have received "Input should be less than or equal to 5," and it would have retried with a valid value. The gap between "the function accepts an int" and "the function accepts an int between 1 and 5" is where silent data corruption lives.

Figure 0b.2: Schema validation turns runtime crashes into structured errors that the model can learn from.

Validation isn't just about preventing crashes. It's about giving the model enough information to self-correct. A model that receives "Validation error: Input should be 'add', 'subtract', 'multiply' or 'divide'" will fix its next attempt. A model that receives "Error: unhandled exception in tool execution" will guess, and guess wrong.

Multiple tools and selection

Give the model three tools and a question. Watch what happens.

registry = create_default_registry()

# Three tools registered:
print(registry.list_tools())
# ['calculator', 'word_count', 'search']

When you send a message like "What is 42 times 17?" along with all three tool schemas, the model reads the descriptions and picks calculator. When you send "How many words are in the Gettysburg Address?", it picks word_count. When you send "Find me information about Rust async patterns," it picks search.

This works because the tool descriptions are clear and distinct. The model selects tools by matching the user's intent to the tool descriptions. This is not semantic search. It is not embeddings. It is the model reading your descriptions and making a judgment call, the same way a person would read an API catalog and pick the right endpoint.

Tool descriptions are your API documentation for the model. Write them like you'd write docs for a junior developer who takes everything literally. Because that is exactly what the model is doing: reading your description literally and picking the tool that sounds most relevant.

Here are the descriptions from the companion code:

registry.register(
    name="calculator",
    description="Perform basic arithmetic: add, subtract, multiply, "
                "or divide two numbers.",
    fn=calculator,
    input_model=CalculatorInput,
)
registry.register(
    name="word_count",
    description="Count the number of words in a piece of text.",
    fn=word_count,
    input_model=WordCountInput,
)
registry.register(
    name="search",
    description="Search for information on a topic and return "
                "the top results.",
    fn=fake_search,
    input_model=SearchInput,
)

What just happened

Each description is one sentence that says what the tool does, not how it works. The model doesn't need to know that the calculator uses Python's arithmetic operators or that the search function returns mock data. It needs to know when to use each tool. "Perform basic arithmetic" tells it to use this tool for math. "Count the number of words" tells it to use this tool for word counting. Descriptions are selection criteria, not implementation details.

When does the model choose wrong? When the descriptions overlap or when the query is ambiguous. Consider this message: "What's the word count of 'four score and seven years ago'?"

The model should pick word_count. But I've seen models pick calculator because they interpret "count" as a numeric operation and try to count the words themselves using arithmetic. This happens more often with smaller models and with vague descriptions.

The fix is better descriptions, not more complex routing logic. If you find the model consistently selecting the wrong tool, the description is the first place to look. Make it more specific. Add what the tool is NOT for if the confusion is persistent. "Count the number of words in a piece of text. Do not use for arithmetic or calculations." This sounds redundant to a human reader. It's not redundant to the model.

A few practical guidelines for tool descriptions:

State the input, not just the output. "Perform basic arithmetic on two numbers" is better than "Do math." The model needs to know what to pass, not just what comes back.

Use the vocabulary of the domain. If your users say "look up" and your tool description says "retrieve," the model might hesitate. Match the language your users actually use.

Keep it under two sentences. Long descriptions get lost in the context. The model pays the most attention to the first sentence. Put the selection-critical information there.

Don't describe the implementation. "Uses a PostgreSQL full-text search index with ts_vector" is irrelevant to the model. "Search the knowledge base for documents matching a query" is what it needs.

Failure case study: the vague description

A customer support agent had two tools: get_order ("Get order information") and get_account ("Get account information"). When users asked "Where's my package?", the model picked get_account about 40% of the time. Both descriptions mentioned "information" without saying what kind. The fix: change get_order to "Look up shipping status, delivery date, and tracking number for a specific order ID" and get_account to "Retrieve account profile, billing address, and payment methods for a customer." After that change, routing accuracy on order-tracking queries went from ~60% to over 95%. The model was not confused about the task. It was confused about which tool matched the task. Specific descriptions fixed it.

When you have more than five or six tools, the model's selection accuracy starts to degrade. Not because it can't read six descriptions, but because the probability of description overlap increases. If you find yourself registering fifteen tools, that's an architectural signal. Split them into groups. Use a routing step where one model call picks the category, then a second call picks the specific tool within that category. Chapter 4 covers this pattern in depth.
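The two-stage routing idea can be sketched with plain data structures. Everything here is illustrative; the category names and helper signatures are not from the companion code, and in production both callables would be LLM calls:

```python
from typing import Callable

# Hypothetical grouping of tools into categories.
TOOL_CATEGORIES: dict[str, list[str]] = {
    "math": ["calculator", "unit_converter"],
    "text": ["word_count", "summarize"],
    "research": ["search", "fetch_page"],
}


def route_tool(
    query: str,
    pick_category: Callable[[str, list[str]], str],
    pick_tool: Callable[[str, list[str]], str],
) -> str:
    """Two calls: first narrow to a category, then pick a tool within it.

    Each call sees only a handful of options, so description overlap
    (the cause of degraded selection accuracy) shrinks at both stages."""
    category = pick_category(query, list(TOOL_CATEGORIES))
    return pick_tool(query, TOOL_CATEGORIES[category])
```

Because the selection steps are injected as callables, the routing logic itself is testable with stubs, no API key required.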

The gap this doesn't cover

You now have a system that can take actions. But it takes them once. The model calls a tool, gets the result, and the conversation is over. It can't look at the result and decide "that's not enough, let me search again with different terms." It can't chain together a search, a calculation on the search results, and then a summary. It does one thing, and stops.

This is the single-turn ceiling. It's useful. Single-turn tool use handles a surprising number of real-world tasks: calculators, data lookups, format conversions, API calls. But it's not an agent.

An agent can observe the result of its action, decide it's insufficient, and take another action. An agent can plan a sequence of steps, adjust the plan based on intermediate results, and know when to stop. The difference is a loop: observe, think, act, repeat.

Figure 0b.3: The single-turn ceiling. Tool use without a loop handles one action. The agent loop adds iteration, and that changes everything.

Consider a concrete example. A user asks: "What's the population density of the country with the tallest building in the world?"

With single-turn tool use, the model can call search for "tallest building in the world." It gets back results mentioning the Burj Khalifa in the UAE. But now what? It needs to search again for "UAE population density" and then maybe use the calculator to verify the numbers. In a single-turn system, it got one search result and had to guess the rest. In an agent loop, it would chain those three tool calls together, each informed by the previous result.

That loop is the subject of the next section.

Putting it together

You've gone from text-in-text-out to a system that can act. It can calculate, search, and analyze. The model reads tool schemas, decides which tool to call, and returns structured JSON with the function name and arguments. Your code validates those arguments with Pydantic, executes the function, and returns the result. Every tool-using system you'll encounter, from simple chatbot plugins to complex agent frameworks, is built on this cycle.

The key insights from this section:

System prompts are code. Version them, test them, specify them precisely. Every ambiguity is a bug waiting to happen.

Few-shot examples are the most effective technique for consistent output. Two examples beat two paragraphs of instructions.

Function calling is JSON in, JSON out. The model writes a request. Your code fulfills it. There is no magic on either side.

Schema validation isn't optional. The model will hallucinate arguments. Pydantic catches the bad ones before they reach your functions, and the structured error messages give the model enough information to self-correct.

Tool descriptions are selection criteria. Write them for a literal reader who will pick the tool that sounds most relevant to the query.

And the limitation: all of this operates in a single turn. One tool call, one result. The model cannot look at a search result and decide "that's not what I needed, let me refine my query." It cannot chain a search, a calculation on the search results, and a summary together. It does one thing, and stops.

The next section adds the loop that removes that ceiling. You will build, in about 100 lines of Python, a system that can observe the result of its action, decide whether it is sufficient, and take another action if it is not. The mechanism is a while loop with an LLM inside it. The engineering challenge is knowing when that loop should stop.

For a fully built tool-using assistant with proper error handling and logging, see the Tool-Using Assistant project.