How to tune costs for reasoning models, stop wasting tokens, and call tools only when they earn their keep
Sequel to “The EV+ of GenAI” and “Measuring the EV+ of GenAI.” Same formula. This time the unit of analysis is an agentic workflow.
The first article defined the economic frame.
EV = p(success) x business_value - inference_cost - risk_cost
The second article measured the frame with a small RAG system.
This third step is where the economics become sharper: agentic AI with reasoning models and tools.
I curated this topic from a GenAI Works discussion about the practical layers of AI systems: LLMs, RAG, AI agents, and agentic AI. The useful direction is not another “agents are powerful” post. The useful direction is more operational:
Curated Source
GenAI Works: AI system layers
GenAI Works frames the shift as understanding the layers of AI that many teams miss, from LLMs through RAG and agents into agentic systems.
You cannot run a 2026 business on a 2024 intelligence stack.
How do we stop reasoning models from wasting tokens while deciding whether to call tools?
That question matters because agentic systems do not spend money once. They spend money repeatedly:
- planning,
- reasoning,
- retrieving,
- calling tools,
- observing results,
- reasoning again,
- generating the final answer,
- and sometimes looping because nobody gave the agent a budget.
In a normal chat completion, token waste is annoying.
In an agentic workflow, token waste compounds.
The Agentic Cost Trap
An agentic AI process usually looks like this:
user request
-> planner
-> tool selection
-> tool call
-> observation
-> more reasoning
-> maybe another tool
-> final answer
The dangerous version looks like this:
user request
-> expensive reasoning model
-> verbose plan
-> broad retrieval
-> expensive reasoning model
-> unnecessary tool call
-> large observation copied into context
-> expensive reasoning model
-> second unnecessary tool call
-> final answer that could have been produced after step 2
The model may look intelligent. The trace may look impressive. The answer may even be correct.
But the EV can still be worse than a simpler pipeline.
The agent is not paid for thinking. It is paid for improving the expected outcome enough to justify the cost of thinking.
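To see why waste compounds rather than merely adds up, here is a minimal sketch of the input-token bill for a loop. It assumes the simplest possible accounting: every model call re-reads the full context, every observation is appended to that context, and there is no prompt caching or truncation. The function name and the token counts are illustrative, not measurements.

```python
def cumulative_input_tokens(base_context: int, observations: list[int]) -> int:
    """Total input tokens billed across an agent loop.

    Assumption: each model call re-reads the whole context, and each tool
    observation is appended to the context before the next call.
    """
    context = base_context
    total = context  # the first model call, before any tool has run
    for observation_tokens in observations:
        context += observation_tokens  # observation is pasted into context
        total += context               # and the next call re-reads all of it
    return total


# Two tool calls with compact observations vs. two with raw dumps.
lean = cumulative_input_tokens(500, [300, 300])
verbose = cumulative_input_tokens(500, [3000, 3000])
print(f"lean loop:    {lean} input tokens")
print(f"verbose loop: {verbose} input tokens")
```

Under these assumptions an observation returned at step k is billed again at every later step, so the first tool's output is the most expensive text in the trace.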
The Tool Tax
Every tool call has two costs.
tool_call_cost = direct_tool_cost + tokenized_observation_cost
The direct cost is obvious: API fees, database reads, vector search, browser automation, compute, queue time, or external service calls.
The hidden cost is the observation. Tool results usually get pasted back into the model context. If the result is large, noisy, or redundant, the next reasoning step becomes more expensive.
So the real question before calling a tool is:
expected_tool_gain > tool_call_cost + extra_reasoning_cost
Where:
expected_tool_gain =
(p_success_after_tool - p_success_before_tool) x business_value
+ risk_reduction
Writing p_before and p_after for the success probability before and after the tool call, a tool call is EV-positive when:
(p_after - p_before) x business_value + risk_reduction
>
direct_tool_cost + observation_token_cost + extra_reasoning_cost
This is the missing runtime gate in many agent demos.
They ask:
Can the model call a tool?
Production systems should ask:
Is this tool call worth the next dollar, millisecond, and token?
A Worked Example
Imagine an AI support agent handling a refund question for a user who may be eligible for an exception.
Assume:
business_value = $0.80
risk_cost_if_wrong = $0.18
The agent has three choices.
| Strategy | p(success) | Inference + tool cost | Risk cost | EV |
|---|---|---|---|---|
| Direct answer | 0.58 | $0.006 | $0.18 | $0.278 |
| Retrieve policy only | 0.76 | $0.014 | $0.10 | $0.494 |
| Retrieve policy + call order API + reason deeply | 0.88 | $0.071 | $0.06 | $0.573 |
The full agentic path has the highest EV here.
direct_ev = 0.58 x 0.80 - 0.006 - 0.18 = $0.278
rag_ev = 0.76 x 0.80 - 0.014 - 0.10 = $0.494
agent_ev = 0.88 x 0.80 - 0.071 - 0.06 = $0.573
But now change the business value from $0.80 to $0.30.
| Strategy | p(success) | Inference + tool cost | Risk cost | EV |
|---|---|---|---|---|
| Direct answer | 0.58 | $0.006 | $0.18 | -$0.012 |
| Retrieve policy only | 0.76 | $0.014 | $0.10 | $0.114 |
| Retrieve policy + call order API + reason deeply | 0.88 | $0.071 | $0.06 | $0.133 |
The agentic path is still positive, but the margin over RAG is only:
$0.133 - $0.114 = $0.019
If the tool is slower, the order API is unreliable, the observation is verbose, or the model takes one extra reasoning loop, the premium path stops being worth it.
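The two tables can be reproduced with a few lines of Python. This is a sketch that only re-applies the EV formula to the numbers in the worked example. The `Strategy` class and the short strategy names are illustrative choices of mine, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    p_success: float
    cost: float       # inference + tool cost per request, in dollars
    risk_cost: float  # expected cost of a wrong answer, in dollars


def expected_value(strategy: Strategy, business_value: float) -> float:
    # EV = p(success) x business_value - inference_cost - risk_cost
    return strategy.p_success * business_value - strategy.cost - strategy.risk_cost


# Figures copied from the worked example above.
STRATEGIES = [
    Strategy("direct", 0.58, 0.006, 0.18),
    Strategy("rag", 0.76, 0.014, 0.10),
    Strategy("agent", 0.88, 0.071, 0.06),
]


def ev_table(business_value: float) -> dict[str, float]:
    return {s.name: round(expected_value(s, business_value), 3) for s in STRATEGIES}


for value in (0.80, 0.30):
    table = ev_table(value)
    margin = round(table["agent"] - table["rag"], 3)
    print(f"business_value=${value:.2f} {table} agent-over-rag margin=${margin}")
```

Sweeping `business_value` over a range of request classes is a cheap way to find where the agentic path stops paying for itself before any routing code is written.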
The lesson is not “never use agents.”
The lesson is:
Agentic depth should be purchased only when uncertainty, risk, or business value make the marginal reasoning step EV-positive.
Break-Even Tool Calling
Before a tool call, compute the maximum affordable cost.
max_tool_cost =
(p_after - p_before) x business_value
+ risk_reduction
- extra_reasoning_cost
Example:
p_before = 0.76
p_after = 0.88
business_value = $0.80
risk_reduction = $0.04
extra_reasoning_cost = $0.018
max_tool_cost = (0.88 - 0.76) x 0.80 + 0.04 - 0.018
max_tool_cost = $0.118
If the order API call plus the observation tokens cost less than $0.118, the tool is EV-positive.
If the same request has only $0.30 of business value:
max_tool_cost = (0.88 - 0.76) x 0.30 + 0.04 - 0.018
max_tool_cost = $0.058
Same model. Same tool. Same expected accuracy lift.
Different product economics.
This is why the routing layer needs business context. An agent without value awareness is just an expensive loop with good manners.
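The same inequality can be turned around to ask a routing question: how valuable does a request have to be before this tool is worth calling? The sketch below solves the break-even formula for business value. The function name is mine, and the `tool_cost` of $0.039 is an assumption inferred from the worked example (the agentic path costs $0.071 against $0.014 for retrieval, and $0.018 of that $0.057 difference is the extra reasoning step).

```python
def break_even_business_value(
    *,
    p_before: float,
    p_after: float,
    risk_reduction: float,
    extra_reasoning_cost: float,
    tool_cost: float,
) -> float:
    """Smallest business value at which the tool call stops losing money.

    Solves (p_after - p_before) * V + risk_reduction - extra_reasoning_cost
    = tool_cost for V.
    """
    lift = p_after - p_before
    if lift <= 0:
        # No accuracy lift: the call pays only if risk reduction alone covers it.
        covered = risk_reduction >= tool_cost + extra_reasoning_cost
        return 0.0 if covered else float("inf")
    uncovered_cost = tool_cost + extra_reasoning_cost - risk_reduction
    return max(0.0, uncovered_cost / lift)


threshold = break_even_business_value(
    p_before=0.76,
    p_after=0.88,
    risk_reduction=0.04,
    extra_reasoning_cost=0.018,
    tool_cost=0.039,  # assumed: 0.071 - 0.014 - 0.018 from the worked example
)
print(f"call the order tool only when business value exceeds ${threshold:.3f}")
```

With these inputs the threshold lands around $0.14, which is consistent with the $0.30 request still being EV-positive but only by a thin margin.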
Reasoning Tokens Are a Budget, Not a Personality Trait
Reasoning models are valuable because they can spend more computation on difficult tasks. That does not mean every task deserves the same reasoning budget.
A useful production policy:
| Request type | Reasoning budget | Tool policy |
|---|---|---|
| Simple FAQ | Low | No tool unless retrieval confidence is low |
| Known policy lookup | Low to medium | Retrieve small context only |
| Account-specific support | Medium | Call account/order tool only after policy retrieval |
| High-value transaction | Medium to high | Use tools, verify, and summarize observations |
| Compliance-sensitive action | High | Use tools and require structured evidence |
The budget should be explicit:
max_planning_tokens = 200
max_observation_tokens = 700
max_tool_calls = 2
max_total_cost = $0.08
Those limits are not arbitrary. They are the economic boundaries of the hand.
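One way to make the policy table executable is a lookup from request class to budget. This is a sketch under loud assumptions: the class names are hypothetical labels for the table rows, and only the `account_support` row reuses the limits quoted above. I assigned those limits to that row for illustration, and every other number is a placeholder, not a recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestBudget:
    max_planning_tokens: int
    max_observation_tokens: int
    max_tool_calls: int
    max_total_cost: float  # dollars per request


# Placeholder tiers. Only "account_support" uses the limits from the text;
# the rest are illustrative values to be replaced with measured ones.
BUDGETS = {
    "simple_faq": RequestBudget(50, 300, 1, 0.01),
    "policy_lookup": RequestBudget(100, 400, 1, 0.03),
    "account_support": RequestBudget(200, 700, 2, 0.08),
    "high_value_transaction": RequestBudget(400, 1200, 4, 0.25),
    "compliance_action": RequestBudget(800, 2000, 6, 0.60),
}


def budget_for(request_type: str) -> RequestBudget:
    # Unknown classes fall back to the cheapest tier, so a classifier miss
    # cannot silently unlock an expensive loop.
    return BUDGETS.get(request_type, BUDGETS["simple_faq"])


print(budget_for("account_support"))
print(budget_for("something_unclassified"))
```

Defaulting to the cheapest tier is a design choice. A product where misclassified requests are usually high-stakes might prefer to escalate instead.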
Sample Code: EV-Gated Agent Loop
The following Python example is deliberately small. It does not call a real model. Instead, it models the economics around a reasoning model so the policy is visible.
You can plug the same accounting into OpenAI, Anthropic, Gemini, local vLLM, or any other model stack.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class RateCard:
    # Dollar prices per 1,000 tokens.
    input_per_1k: float
    output_per_1k: float
    reasoning_per_1k: float

@dataclass(frozen=True)
class ToolSpec:
    name: str
    direct_cost: float
    expected_observation_tokens: int
    expected_success_lift: float
    expected_risk_reduction: float

@dataclass
class AgentBudget:
    max_total_cost: float
    max_tool_calls: int
    max_reasoning_tokens: int
    spent: float = 0.0
    tool_calls: int = 0
    reasoning_tokens: int = 0

def token_cost(
    *,
    input_tokens: int,
    output_tokens: int,
    reasoning_tokens: int,
    rates: RateCard,
) -> float:
    return (
        input_tokens / 1000 * rates.input_per_1k
        + output_tokens / 1000 * rates.output_per_1k
        + reasoning_tokens / 1000 * rates.reasoning_per_1k
    )

def max_affordable_tool_cost(
    *,
    p_before: float,
    p_after: float,
    business_value: float,
    risk_reduction: float,
    extra_reasoning_cost: float,
) -> float:
    # Break-even formula from the previous section.
    return (
        (p_after - p_before) * business_value
        + risk_reduction
        - extra_reasoning_cost
    )
def should_call_tool(
    *,
    tool: ToolSpec,
    p_before: float,
    business_value: float,
    extra_reasoning_cost: float,
    observation_token_cost: Callable[[int], float],
    budget: AgentBudget,
) -> bool:
    # Hard cap on the number of tool calls.
    if budget.tool_calls >= budget.max_tool_calls:
        return False
    p_after = min(0.99, p_before + tool.expected_success_lift)
    affordable = max_affordable_tool_cost(
        p_before=p_before,
        p_after=p_after,
        business_value=business_value,
        risk_reduction=tool.expected_risk_reduction,
        extra_reasoning_cost=extra_reasoning_cost,
    )
    expected_cost = (
        tool.direct_cost
        + observation_token_cost(tool.expected_observation_tokens)
    )
    # The budget check includes the follow-up reasoning step, because the
    # loop charges that step to the budget whenever a tool is called.
    projected_spend = budget.spent + expected_cost + extra_reasoning_cost
    return projected_spend <= budget.max_total_cost and expected_cost <= affordable
Now define a few tools.
rates = RateCard(
    input_per_1k=0.003,
    output_per_1k=0.012,
    reasoning_per_1k=0.018,
)
tools = [
    ToolSpec(
        name="policy_search",
        direct_cost=0.002,
        expected_observation_tokens=450,
        expected_success_lift=0.18,
        expected_risk_reduction=0.06,
    ),
    ToolSpec(
        name="order_lookup",
        direct_cost=0.015,
        expected_observation_tokens=900,
        expected_success_lift=0.12,
        expected_risk_reduction=0.04,
    ),
    ToolSpec(
        name="web_search",
        direct_cost=0.035,
        expected_observation_tokens=3000,
        expected_success_lift=0.02,
        expected_risk_reduction=0.005,
    ),
]
budget = AgentBudget(
    max_total_cost=0.08,
    max_tool_calls=3,
    max_reasoning_tokens=1200,
)

def observation_cost(tokens: int) -> float:
    # Observations are billed as input tokens on the next model call.
    return token_cost(
        input_tokens=tokens,
        output_tokens=0,
        reasoning_tokens=0,
        rates=rates,
    )
Simulate the decision.
business_value = 0.80
p_success = 0.58
for tool in tools:
    # Cost of the reasoning step that would consume the tool's observation.
    extra_reasoning_cost = token_cost(
        input_tokens=250,
        output_tokens=120,
        reasoning_tokens=300,
        rates=rates,
    )
    allowed = should_call_tool(
        tool=tool,
        p_before=p_success,
        business_value=business_value,
        extra_reasoning_cost=extra_reasoning_cost,
        observation_token_cost=observation_cost,
        budget=budget,
    )
    print(f"{tool.name:14s} -> {'CALL' if allowed else 'SKIP'}")
    if allowed:
        # Charge the call and its follow-up reasoning, then update the state.
        tool_cost = tool.direct_cost + observation_cost(tool.expected_observation_tokens)
        budget.spent += tool_cost + extra_reasoning_cost
        budget.tool_calls += 1
        budget.reasoning_tokens += 300
        p_success = min(0.99, p_success + tool.expected_success_lift)
Expected output:
policy_search  -> CALL
order_lookup   -> CALL
web_search     -> SKIP
The agent does not skip web_search because web search is bad.
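It skips it because the expected gain does not cover the price: a 0.02 lift on $0.80 of value plus $0.005 of risk reduction, less the reasoning step, leaves about $0.013 to spend, while the call and its 3,000-token observation cost about $0.044. In this sketch the lift estimates are fixed per tool. In a real loop they shrink as earlier tools resolve the uncertainty, and the remaining budget shrinks with them, so the later the tool appears in the loop, the more it has to justify itself.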
That is the agentic version of EV-positive play.
Add a Context Diet
The fastest way to waste money in tool-using agents is to paste raw tool output back into the model.
Do not do this:
Tool returned 9,000 tokens.
Append all 9,000 tokens to context.
Ask the reasoning model what matters.
Do this instead:
Tool returned 9,000 tokens.
Compress to structured evidence.
Append only the 300-700 tokens needed for the next decision.
A simple observation contract:
def compress_order_observation(raw_order: dict) -> dict:
    # Keep only the fields that can change the refund decision.
    return {
        "order_id": raw_order["id"],
        "status": raw_order["status"],
        "delivered_at": raw_order.get("delivered_at"),
        "is_digital": raw_order["product_type"] == "digital",
        "return_window_days": raw_order.get("return_window_days"),
        "prior_refunds": raw_order.get("prior_refunds", 0),
    }
The goal is not to hide information from the model. The goal is to preserve decision-relevant information while removing token noise.
A reasoning model does not need the entire order JSON. It needs the variables that change the decision.
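A contract is easier to keep when the token cap is enforced in code. The sketch below keeps fields in priority order until an observation budget is reached. Both function names are hypothetical, and the four-characters-per-token estimate is a crude assumption that should be replaced by the tokenizer of the model actually in use.

```python
import json


def estimate_tokens(payload: dict) -> int:
    # Crude assumption: about 4 characters per token for compact JSON.
    return max(1, len(json.dumps(payload, separators=(",", ":"))) // 4)


def fit_observation(payload: dict, keep: list[str], max_tokens: int) -> dict:
    """Keep decision-relevant fields, in priority order, within a token cap."""
    compact: dict = {}
    for key in keep:
        if key not in payload:
            continue
        candidate = {**compact, key: payload[key]}
        if estimate_tokens(candidate) > max_tokens:
            break  # lower-priority fields are the first to be dropped
        compact = candidate
    return compact


raw_order = {
    "id": "ord_1",
    "status": "delivered",
    "product_type": "physical",
    "return_window_days": 30,
    "audit_log": ["event"] * 500,  # bulky field that does not change the decision
}
small = fit_observation(
    raw_order,
    keep=["id", "status", "return_window_days"],
    max_tokens=700,
)
print(estimate_tokens(raw_order), "->", estimate_tokens(small), "estimated tokens")
```

The priority order in `keep` is where the product judgment lives: it states which variables the next decision cannot do without.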
Add a Stop-Loss
Poker has bankroll management. Agentic AI needs the same instinct.
Add stop-loss rules:
def stop_loss_triggered(
    *,
    budget: AgentBudget,
    p_success: float,
    min_required_success: float,
) -> bool:
    # Out of money.
    if budget.spent >= budget.max_total_cost:
        return True
    # Out of reasoning tokens.
    if budget.reasoning_tokens >= budget.max_reasoning_tokens:
        return True
    # Out of tool calls and still not confident enough to answer.
    if budget.tool_calls >= budget.max_tool_calls and p_success < min_required_success:
        return True
    return False
Then the runtime policy becomes:
if stop_loss_triggered:
    escalate, ask a clarifying question, or give a bounded answer
This is not a failure mode. It is a product decision.
Sometimes the EV-positive answer is:
I need one more piece of information before I can answer safely.
That sentence can be cheaper and better than another hidden reasoning loop.
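The runtime policy can be made concrete as a small decision function. This is one possible mapping, not the only sensible one: the action names and the `missing_user_input` flag are hypothetical, and `AgentBudget` is redefined here only so the sketch runs on its own.

```python
from dataclasses import dataclass


@dataclass
class AgentBudget:
    max_total_cost: float
    max_tool_calls: int
    max_reasoning_tokens: int
    spent: float = 0.0
    tool_calls: int = 0
    reasoning_tokens: int = 0


def next_action(
    *,
    budget: AgentBudget,
    p_success: float,
    min_required_success: float,
    missing_user_input: bool,
) -> str:
    out_of_money = budget.spent >= budget.max_total_cost
    out_of_thought = budget.reasoning_tokens >= budget.max_reasoning_tokens
    out_of_tools = budget.tool_calls >= budget.max_tool_calls
    confident = p_success >= min_required_success

    # Same trigger conditions as the stop-loss rules above.
    triggered = out_of_money or out_of_thought or (out_of_tools and not confident)
    if not triggered:
        return "continue"
    if confident:
        return "bounded_answer"  # answer with what is known, state the limits
    if missing_user_input:
        return "ask_clarifying_question"
    return "escalate"  # hand off to a human or a higher-cost route


spent_out = AgentBudget(max_total_cost=0.08, max_tool_calls=2,
                        max_reasoning_tokens=1200, spent=0.09)
print(next_action(budget=spent_out, p_success=0.9,
                  min_required_success=0.8, missing_user_input=False))
```

Keeping the mapping in one function makes the product decision reviewable: anyone can read which exhausted budget leads to which user-facing behavior.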
The Agentic EV Trace
Production agents should emit a cost trace for every request.
{
  "request_id": "req_123",
  "route": "agentic_support",
  "business_value_estimate": 0.8,
  "initial_p_success": 0.58,
  "final_p_success": 0.88,
  "tool_calls": [
    {
      "name": "policy_search",
      "direct_cost": 0.002,
      "observation_tokens": 450,
      "decision": "called"
    },
    {
      "name": "order_lookup",
      "direct_cost": 0.015,
      "observation_tokens": 900,
      "decision": "called"
    },
    {
      "name": "web_search",
      "direct_cost": 0.035,
      "observation_tokens": 3000,
      "decision": "skipped",
      "reason": "marginal_ev_below_threshold"
    }
  ],
  "total_cost": 0.061,
  "estimated_ev": 0.583
}
Without this trace, cost tuning becomes vibes.
With this trace, cost tuning becomes engineering:
- Which tools are called most often?
- Which tools improve success probability?
- Which observations are too large?
- Which route spends too much reasoning on low-value requests?
- Which request classes should be downgraded to RAG?
- Which request classes deserve deeper reasoning?
The point is not to make the system cheap everywhere.
The point is to spend expensive reasoning where it compounds.
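A first pass at answering those questions is a plain aggregation over traces. The sketch below assumes traces shaped like the JSON above and reports, per tool, how often it was called when considered, how large its observations were, and what it cost directly. The function name is mine. It does not attribute success lift to individual tools, which would need a per-call probability delta that the trace above does not record.

```python
from collections import defaultdict


def summarize_tool_usage(traces: list[dict]) -> dict[str, dict]:
    """Aggregate per-tool call rate, observation size, and direct spend."""
    totals = defaultdict(
        lambda: {"considered": 0, "called": 0, "tokens": 0, "direct_cost": 0.0}
    )
    for trace in traces:
        for call in trace.get("tool_calls", []):
            row = totals[call["name"]]
            row["considered"] += 1
            if call["decision"] == "called":
                row["called"] += 1
                row["tokens"] += call["observation_tokens"]
                row["direct_cost"] += call["direct_cost"]

    summary = {}
    for name, row in totals.items():
        called = row["called"]
        summary[name] = {
            "call_rate": called / row["considered"],
            "avg_observation_tokens": row["tokens"] / called if called else 0.0,
            "direct_cost": round(row["direct_cost"], 6),
        }
    return summary


example_trace = {
    "request_id": "req_123",
    "tool_calls": [
        {"name": "policy_search", "direct_cost": 0.002,
         "observation_tokens": 450, "decision": "called"},
        {"name": "order_lookup", "direct_cost": 0.015,
         "observation_tokens": 900, "decision": "called"},
        {"name": "web_search", "direct_cost": 0.035,
         "observation_tokens": 3000, "decision": "skipped"},
    ],
}
report = summarize_tool_usage([example_trace])
for name, row in report.items():
    print(name, row)
```

Grouping the same aggregation by `route` or by a business-value bucket is what turns it into a routing decision rather than a dashboard.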
Practical Tuning Rules
Start with retrieval before tools. If a small retrieval step resolves the uncertainty, do not invoke a full agent.
Pass summaries, not transcripts. Tool observations should be structured, minimal, and decision-relevant.
Use confidence deltas. A tool should estimate how much it improves the answer, not merely whether it can be called.
Cap loops by cost, not only by count. Two expensive tool calls can be worse than five cheap ones.
Separate planning from execution. Let a cheaper model classify and route when possible. Use the reasoning model when the expected value clears the threshold.
Cache stable observations. Policies, product metadata, shipping rules, and public documentation should not be re-read expensively for every request.
Measure final EV, not model cleverness. A beautiful chain of thought that spends the margin is not a win.
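Of these rules, caching is the one that needs the least machinery. Here is a minimal sketch of a time-to-live cache for stable observations such as policy text. The class, the key name, and the 15-minute TTL are all illustrative assumptions. A production system would more likely use an existing cache service and invalidate on policy updates.

```python
import time
from typing import Callable


class TTLCache:
    """Tiny in-process cache: reuse a fetched value until it is ttl_seconds old."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._entries: dict[str, tuple[float, object]] = {}

    def get_or_fetch(self, key: str, fetch: Callable[[], object]) -> object:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # fresh enough: no tool call, no direct cost
        value = fetch()
        self._entries[key] = (now, value)
        return value


fetch_count = 0


def fetch_refund_policy() -> str:
    # Stand-in for an expensive retrieval or tool call.
    global fetch_count
    fetch_count += 1
    return "Refunds are accepted within 30 days of delivery."


policy_cache = TTLCache(ttl_seconds=15 * 60)
for _ in range(3):
    policy = policy_cache.get_or_fetch("refund_policy", fetch_refund_policy)
print(f"served 3 requests with {fetch_count} fetch")
```

Caching removes the direct tool cost only. The cached text still costs observation tokens each time it enters the context, so it pairs with compact observations rather than replacing them.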
The Next Layer of AI Engineering
The first phase of GenAI engineering was about making models answer.
The second phase was about grounding them with retrieval.
The third phase is about controlling the economics of autonomous action.
Reasoning models are powerful. Tools are powerful. Agents are powerful.
But power is not the same thing as profit.
The EV-positive agent does not ask:
What else can I call?
It asks:
What is the cheapest next action that changes the expected outcome enough to matter?
That is the next step in the EV+ of GenAI series: from measuring inference to governing agency.
Not less intelligence.
Better bet sizing.