The EV+ of GenAI, Part 3: Agentic Cost Tuning

How to tune costs for reasoning models, stop wasting tokens, and call tools only when they earn their keep

Sequel to “The EV+ of GenAI” and “Measuring the EV+ of GenAI.” Same formula. This time the unit of analysis is an agentic workflow.

The first article defined the economic frame.

EV = p(success) x business_value - inference_cost - risk_cost

The second article measured the frame with a small RAG system.

This third step is where the economics become sharper: agentic AI with reasoning models and tools.

I curated this topic from a GenAI Works discussion about the practical layers of AI systems: LLMs, RAG, AI agents, and agentic AI.

Curated source: GenAI Works, “AI system layers” (LinkedIn graphic on LLMs, RAG, AI agents, and agentic AI)

GenAI Works frames the shift as understanding the layers of AI that many teams miss, from LLMs through RAG and agents into agentic systems.

You cannot run a 2026 business on a 2024 intelligence stack.

The useful direction is not another “agents are powerful” post. The useful direction is more operational:

How do we stop reasoning models from wasting tokens while deciding whether to call tools?

That question matters because agentic systems do not spend money once. They spend money repeatedly:

planning,
reasoning,
retrieving,
calling tools,
observing results,
reasoning again,
generating the final answer,
and sometimes looping because nobody gave the agent a budget.

In a normal chat completion, token waste is annoying.

In an agentic workflow, token waste compounds.

The Agentic Cost Trap

An agentic AI process usually looks like this:

user request
 -> planner
 -> tool selection
 -> tool call
 -> observation
 -> more reasoning
 -> maybe another tool
 -> final answer

The dangerous version looks like this:

user request
 -> expensive reasoning model
 -> verbose plan
 -> broad retrieval
 -> expensive reasoning model
 -> unnecessary tool call
 -> large observation copied into context
 -> expensive reasoning model
 -> second unnecessary tool call
 -> final answer that could have been produced after step 2

The model may look intelligent. The trace may look impressive. The answer may even be correct.

But the EV can still be worse than a simpler pipeline.

The agent is not paid for thinking. It is paid for improving the expected outcome enough to justify the cost of thinking.

The Tool Tax

Every tool call has two costs.

tool_call_cost = direct_tool_cost + tokenized_observation_cost

The direct cost is obvious: API fees, database reads, vector search, browser automation, compute, queue time, or external service calls.

The hidden cost is the observation. Tool results usually get pasted back into the model context. If the result is large, noisy, or redundant, the next reasoning step becomes more expensive.
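To make the hidden cost concrete, here is a minimal sketch. It assumes the observation stays in context and is re-sent as input tokens on every later model call (the usual situation when the full history is replayed and no prompt caching applies). The function names, rate, and token counts are illustrative, not taken from any provider's price list.

```python
def observation_token_cost(
    observation_tokens: int,
    later_model_calls: int,
    input_rate_per_1k: float,
) -> float:
    """Cost of carrying one tool observation through the rest of the loop.

    Assumption: the observation is re-sent as input tokens on each later call.
    """
    return observation_tokens / 1000 * input_rate_per_1k * later_model_calls


def tool_call_cost(
    direct_tool_cost: float,
    observation_tokens: int,
    later_model_calls: int,
    input_rate_per_1k: float,
) -> float:
    # tool_call_cost = direct_tool_cost + tokenized_observation_cost
    return direct_tool_cost + observation_token_cost(
        observation_tokens, later_model_calls, input_rate_per_1k
    )


# Illustrative numbers only: a $0.015 API call, $0.003 per 1k input tokens,
# and three model calls after the tool returns.
small = tool_call_cost(0.015, observation_tokens=500, later_model_calls=3, input_rate_per_1k=0.003)
large = tool_call_cost(0.015, observation_tokens=9000, later_model_calls=3, input_rate_per_1k=0.003)
print(f"small observation: ${small:.4f}")
print(f"large observation: ${large:.4f}")
```

With the same direct fee, the 9,000-token observation makes the call roughly five times as expensive as the 500-token one under these assumptions.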

So the real question before calling a tool is:

expected_tool_gain > tool_call_cost + extra_reasoning_cost

Where:

expected_tool_gain =
(p_success_after_tool - p_success_before_tool) x business_value
+ risk_reduction

A tool call is EV-positive when:

(p_after - p_before) x business_value + risk_reduction
>
direct_tool_cost + observation_token_cost + extra_reasoning_cost

This is the missing runtime gate in many agent demos.

They ask:

Can the model call a tool?

Production systems should ask:

Is this tool call worth the next dollar, millisecond, and token?

A Worked Example

Imagine an AI support agent handling a refund question for a user who may be eligible for an exception.

Assume:

business_value = $0.80
risk_cost_if_wrong = $0.18

The agent has three choices.

Strategy	p(success)	Inference + tool cost	Risk cost	EV
Direct answer	0.58	$0.006	$0.18	$0.278
Retrieve policy only	0.76	$0.014	$0.10	$0.494
Retrieve policy + call order API + reason deeply	0.88	$0.071	$0.06	$0.573

The full agentic path has the highest EV here.

direct_ev = 0.58 x 0.80 - 0.006 - 0.18 = $0.278
rag_ev    = 0.76 x 0.80 - 0.014 - 0.10 = $0.494
agent_ev  = 0.88 x 0.80 - 0.071 - 0.06 = $0.573
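The same arithmetic as a small Python sketch, so the table can be recomputed whenever an assumption changes. `Strategy` and `expected_value` are illustrative names, not part of any library; the numbers are the ones in the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    p_success: float
    cost: float       # inference + tool cost per request, in dollars
    risk_cost: float  # expected cost of a wrong answer, in dollars


def expected_value(s: Strategy, business_value: float) -> float:
    # EV = p(success) x business_value - inference_cost - risk_cost
    return s.p_success * business_value - s.cost - s.risk_cost


strategies = [
    Strategy("direct", 0.58, 0.006, 0.18),
    Strategy("rag", 0.76, 0.014, 0.10),
    Strategy("agent", 0.88, 0.071, 0.06),
]

for s in strategies:
    print(f"{s.name:6s} EV = ${expected_value(s, business_value=0.80):.3f}")
```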

But now change the business value from $0.80 to $0.30.

Strategy	p(success)	Inference + tool cost	Risk cost	EV
Direct answer	0.58	$0.006	$0.18	-$0.012
Retrieve policy only	0.76	$0.014	$0.10	$0.114
Retrieve policy + call order API + reason deeply	0.88	$0.071	$0.06	$0.133

The agentic path is still positive, but the margin over RAG is only:

$0.133 - $0.114 = $0.019

If the tool is slower, the order API is unreliable, the observation is verbose, or the model takes one extra reasoning loop, the premium path stops being worth it.
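How thin is that margin? Setting the two EV expressions equal gives the business value at which the agentic path stops beating retrieval. A minimal sketch using the table's numbers; the function name and the $0.02 cost of an extra loop are assumptions for illustration.

```python
def crossover_business_value(
    p_cheap: float, cost_cheap: float, risk_cheap: float,
    p_deep: float, cost_deep: float, risk_deep: float,
) -> float:
    """Business value at which the deeper strategy's EV equals the cheaper one's.

    Solves p_deep*V - cost_deep - risk_deep = p_cheap*V - cost_cheap - risk_cheap.
    Assumes p_deep > p_cheap.
    """
    extra_cost = (cost_deep + risk_deep) - (cost_cheap + risk_cheap)
    return extra_cost / (p_deep - p_cheap)


v = crossover_business_value(
    p_cheap=0.76, cost_cheap=0.014, risk_cheap=0.10,  # retrieve policy only
    p_deep=0.88, cost_deep=0.071, risk_deep=0.06,     # full agentic path
)
print(f"agentic path wins only above ${v:.3f} of business value")

# One extra reasoning loop (hypothetical $0.02) moves the crossover up.
v_extra_loop = crossover_business_value(
    p_cheap=0.76, cost_cheap=0.014, risk_cheap=0.10,
    p_deep=0.88, cost_deep=0.071 + 0.02, risk_deep=0.06,
)
print(f"with one extra loop: ${v_extra_loop:.3f}")
```

Under these numbers the crossover sits near $0.14, and a single extra $0.02 loop pushes it to about $0.31, just above the $0.30 case.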

The lesson is not “never use agents.”

The lesson is:

Agentic depth should be purchased only when uncertainty, risk, or business value make the marginal reasoning step EV-positive.

Break-Even Tool Calling

Before a tool call, compute the maximum affordable cost.

max_tool_cost =
(p_after - p_before) x business_value
+ risk_reduction
- extra_reasoning_cost

Example:

p_before = 0.76
p_after  = 0.88
business_value = $0.80
risk_reduction = $0.04
extra_reasoning_cost = $0.018

max_tool_cost = (0.88 - 0.76) x 0.80 + 0.04 - 0.018
max_tool_cost = $0.118

If the order API call plus the observation tokens cost less than $0.118, the tool is EV-positive.

If the same request has only $0.30 of business value:

max_tool_cost = (0.88 - 0.76) x 0.30 + 0.04 - 0.018
max_tool_cost = $0.058

Same model. Same tool. Same expected accuracy lift.

Different product economics.

This is why the routing layer needs business context. An agent without value awareness is just an expensive loop with good manners.
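The same inequality can be inverted to ask how much accuracy a tool must buy to pay for itself. A minimal sketch; `min_required_lift` is an illustrative name and the $0.06 tool cost is a hypothetical figure.

```python
def min_required_lift(
    *,
    tool_cost: float,             # direct cost + observation tokens, in dollars
    extra_reasoning_cost: float,
    risk_reduction: float,
    business_value: float,
) -> float:
    """Smallest success-probability lift that makes a tool call break even.

    Rearranged from:
        lift x business_value + risk_reduction >= tool_cost + extra_reasoning_cost
    """
    net_cost = tool_cost + extra_reasoning_cost - risk_reduction
    # If risk reduction alone covers the cost, no accuracy lift is needed.
    return max(0.0, net_cost / business_value)


for value in (0.80, 0.30):
    lift = min_required_lift(
        tool_cost=0.06,  # hypothetical
        extra_reasoning_cost=0.018,
        risk_reduction=0.04,
        business_value=value,
    )
    print(f"business_value=${value:.2f} -> needs a lift of at least {lift:.3f}")
```

With the expected lift of 0.12 from the example, a $0.06 call clears the bar at $0.80 of business value (0.0475 needed) but not at $0.30 (about 0.127 needed).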

Reasoning Tokens Are a Budget, Not a Personality Trait

Reasoning models are valuable because they can spend more computation on difficult tasks. That does not mean every task deserves the same reasoning budget.

A useful production policy:

Request type	Reasoning budget	Tool policy
Simple FAQ	Low	No tool unless retrieval confidence is low
Known policy lookup	Low to medium	Retrieve small context only
Account-specific support	Medium	Call account/order tool only after policy retrieval
High-value transaction	Medium to high	Use tools, verify, and summarize observations
Compliance-sensitive action	High	Use tools and require structured evidence

The budget should be explicit:

max_planning_tokens = 200
max_observation_tokens = 700
max_tool_calls = 2
max_total_cost = $0.08

Those limits are not arbitrary. They are the economic boundaries of the hand.
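One way to make the policy table and the limits enforceable is a lookup from request type to budget. This is a sketch of the shape only: the keys mirror the rows of the table, the `account_support` row reuses the limits listed above, and every other number is a placeholder to be calibrated from your own traces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteBudget:
    max_planning_tokens: int
    max_observation_tokens: int
    max_tool_calls: int
    max_total_cost: float  # dollars per request


# Placeholder limits, one entry per row of the policy table.
# Calibrate these from measured traces before relying on them.
ROUTE_BUDGETS = {
    "simple_faq": RouteBudget(50, 300, 1, 0.01),
    "policy_lookup": RouteBudget(100, 400, 1, 0.02),
    "account_support": RouteBudget(200, 700, 2, 0.08),
    "high_value_transaction": RouteBudget(400, 1000, 3, 0.20),
    "compliance_action": RouteBudget(800, 1500, 4, 0.50),
}


def budget_for(request_type: str) -> RouteBudget:
    # Unknown request types fall back to the cheapest route, not the deepest.
    return ROUTE_BUDGETS.get(request_type, ROUTE_BUDGETS["simple_faq"])


print(budget_for("account_support"))
print(budget_for("unrecognized_type"))
```

Falling back to the cheapest route is a design choice: an unclassified request has not yet earned a larger budget.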

Sample Code: EV-Gated Agent Loop

The following Python example is deliberately small. It does not call a real model. Instead, it models the economics around a reasoning model so the policy is visible.

You can plug the same accounting into OpenAI, Anthropic, Gemini, local vLLM, or any other model stack.

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RateCard:
    input_per_1k: float
    output_per_1k: float
    reasoning_per_1k: float


@dataclass(frozen=True)
class ToolSpec:
    name: str
    direct_cost: float
    expected_observation_tokens: int
    expected_success_lift: float
    expected_risk_reduction: float


@dataclass
class AgentBudget:
    max_total_cost: float
    max_tool_calls: int
    max_reasoning_tokens: int
    spent: float = 0.0
    tool_calls: int = 0
    reasoning_tokens: int = 0


def token_cost(
    *,
    input_tokens: int,
    output_tokens: int,
    reasoning_tokens: int,
    rates: RateCard,
) -> float:
    return (
        input_tokens / 1000 * rates.input_per_1k
        + output_tokens / 1000 * rates.output_per_1k
        + reasoning_tokens / 1000 * rates.reasoning_per_1k
    )


def max_affordable_tool_cost(
    *,
    p_before: float,
    p_after: float,
    business_value: float,
    risk_reduction: float,
    extra_reasoning_cost: float,
) -> float:
    return (
        (p_after - p_before) * business_value
        + risk_reduction
        - extra_reasoning_cost
    )


def should_call_tool(
    *,
    tool: ToolSpec,
    p_before: float,
    business_value: float,
    extra_reasoning_cost: float,
    observation_token_cost: Callable[[int], float],
    budget: AgentBudget,
) -> bool:
    if budget.tool_calls >= budget.max_tool_calls:
        return False

    p_after = min(0.99, p_before + tool.expected_success_lift)
    affordable = max_affordable_tool_cost(
        p_before=p_before,
        p_after=p_after,
        business_value=business_value,
        risk_reduction=tool.expected_risk_reduction,
        extra_reasoning_cost=extra_reasoning_cost,
    )

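    expected_cost = (
        tool.direct_cost
        + observation_token_cost(tool.expected_observation_tokens)
    )

    # The caller also charges the follow-up reasoning step to the budget,
    # so include it when checking the hard spending cap.
    projected_spend = budget.spent + expected_cost + extra_reasoning_cost
    within_budget = projected_spend <= budget.max_total_cost

    return within_budget and expected_cost <= affordable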

Now define a few tools.

rates = RateCard(
    input_per_1k=0.003,
    output_per_1k=0.012,
    reasoning_per_1k=0.018,
)

tools = [
    ToolSpec(
        name="policy_search",
        direct_cost=0.002,
        expected_observation_tokens=450,
        expected_success_lift=0.18,
        expected_risk_reduction=0.06,
    ),
    ToolSpec(
        name="order_lookup",
        direct_cost=0.015,
        expected_observation_tokens=900,
        expected_success_lift=0.12,
        expected_risk_reduction=0.04,
    ),
    ToolSpec(
        name="web_search",
        direct_cost=0.035,
        expected_observation_tokens=3000,
        expected_success_lift=0.02,
        expected_risk_reduction=0.005,
    ),
]

budget = AgentBudget(
    max_total_cost=0.08,
    max_tool_calls=3,
    max_reasoning_tokens=1200,
)


def observation_cost(tokens: int) -> float:
    return token_cost(
        input_tokens=tokens,
        output_tokens=0,
        reasoning_tokens=0,
        rates=rates,
    )

Simulate the decision.

business_value = 0.80
p_success = 0.58

for tool in tools:
    extra_reasoning_cost = token_cost(
        input_tokens=250,
        output_tokens=120,
        reasoning_tokens=300,
        rates=rates,
    )

    allowed = should_call_tool(
        tool=tool,
        p_before=p_success,
        business_value=business_value,
        extra_reasoning_cost=extra_reasoning_cost,
        observation_token_cost=observation_cost,
        budget=budget,
    )

    print(f"{tool.name:14s} -> {'CALL' if allowed else 'SKIP'}")

    if allowed:
        tool_cost = tool.direct_cost + observation_cost(tool.expected_observation_tokens)
        budget.spent += tool_cost + extra_reasoning_cost
        budget.tool_calls += 1
        budget.reasoning_tokens += 300
        p_success = min(0.99, p_success + tool.expected_success_lift)

Expected output:

policy_search  -> CALL
order_lookup   -> CALL
web_search     -> SKIP

The agent does not skip web_search because web search is bad.

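It skips it because the expected gain no longer covers the cost. With a 0.02 lift and $0.005 of risk reduction, the call can justify about $0.013, while the search plus its 3,000-token observation costs $0.044, which would also push total spend past the $0.08 cap. The later a tool appears in the loop, the less budget is left and the more it has to justify itself.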

That is the agentic version of EV-positive play.

Add a Context Diet

The fastest way to waste money in tool-using agents is to paste raw tool output back into the model.

Do not do this:

Tool returned 9,000 tokens.
Append all 9,000 tokens to context.
Ask the reasoning model what matters.

Do this instead:

Tool returned 9,000 tokens.
Compress to structured evidence.
Append only the 300-700 tokens needed for the next decision.

A simple observation contract:

def compress_order_observation(raw_order: dict) -> dict:
    return {
        "order_id": raw_order["id"],
        "status": raw_order["status"],
        "delivered_at": raw_order.get("delivered_at"),
        "is_digital": raw_order["product_type"] == "digital",
        "return_window_days": raw_order.get("return_window_days"),
        "prior_refunds": raw_order.get("prior_refunds", 0),
    }

The goal is not to hide information from the model. The goal is to preserve decision-relevant information while removing token noise.

A reasoning model does not need the entire order JSON. It needs the variables that change the decision.
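A quick way to check that the contract actually saves tokens is to compare payload sizes before and after compression. The sketch below repeats the function so it runs on its own. The raw order is a hypothetical payload, and the 4-characters-per-token estimate is a crude heuristic; use your model's tokenizer for real accounting.

```python
import json


def compress_order_observation(raw_order: dict) -> dict:
    # Same contract as above, repeated so this snippet runs on its own.
    return {
        "order_id": raw_order["id"],
        "status": raw_order["status"],
        "delivered_at": raw_order.get("delivered_at"),
        "is_digital": raw_order["product_type"] == "digital",
        "return_window_days": raw_order.get("return_window_days"),
        "prior_refunds": raw_order.get("prior_refunds", 0),
    }


def rough_token_count(payload: dict) -> int:
    # Crude heuristic: about 4 characters per token of serialized JSON.
    return len(json.dumps(payload)) // 4


# Hypothetical raw payload, padded with fields the refund decision never uses.
raw_order = {
    "id": "ord_001",
    "status": "delivered",
    "delivered_at": "2025-01-10",
    "product_type": "physical",
    "return_window_days": 30,
    "prior_refunds": 1,
    "line_items": [{"sku": f"SKU-{i}", "warehouse_notes": "n/a" * 20} for i in range(25)],
    "audit_log": ["status change"] * 50,
}

compressed = compress_order_observation(raw_order)
print("raw tokens (approx):", rough_token_count(raw_order))
print("compressed tokens (approx):", rough_token_count(compressed))
```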

Add a Stop-Loss

Poker has bankroll management. Agentic AI needs the same instinct.

Add stop-loss rules:

def stop_loss_triggered(
    *,
    budget: AgentBudget,
    p_success: float,
    min_required_success: float,
) -> bool:
    if budget.spent >= budget.max_total_cost:
        return True

    if budget.reasoning_tokens >= budget.max_reasoning_tokens:
        return True

    if budget.tool_calls >= budget.max_tool_calls and p_success < min_required_success:
        return True

    return False

Then the runtime policy becomes:

if stop_loss_triggered:
  escalate, ask a clarifying question, or give a bounded answer

This is not a failure mode. It is a product decision.

Sometimes the EV-positive answer is:

I need one more piece of information before I can answer safely.

That sentence can be cheaper and better than another hidden reasoning loop.
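The runtime policy above leaves open which of the three fallbacks to pick. Here is one possible sketch: `fallback_action`, its `high_risk` flag, and the thresholds are my assumptions, and `AgentBudget` and `stop_loss_triggered` are repeated from earlier so the snippet runs on its own.

```python
from dataclasses import dataclass


@dataclass
class AgentBudget:  # same fields as the AgentBudget defined earlier
    max_total_cost: float
    max_tool_calls: int
    max_reasoning_tokens: int
    spent: float = 0.0
    tool_calls: int = 0
    reasoning_tokens: int = 0


def stop_loss_triggered(*, budget: AgentBudget, p_success: float, min_required_success: float) -> bool:
    # Same rules as above, repeated so the snippet runs on its own.
    if budget.spent >= budget.max_total_cost:
        return True
    if budget.reasoning_tokens >= budget.max_reasoning_tokens:
        return True
    return budget.tool_calls >= budget.max_tool_calls and p_success < min_required_success


def fallback_action(*, p_success: float, min_required_success: float, high_risk: bool) -> str:
    """Pick what to do once the stop-loss fires (illustrative policy)."""
    if p_success >= min_required_success:
        return "bounded_answer"       # confident enough: answer and state the limits
    if high_risk:
        return "escalate"             # wrong answers are costly: hand off to a human
    return "ask_clarifying_question"  # cheap way to raise p(success)


budget = AgentBudget(
    max_total_cost=0.08, max_tool_calls=2, max_reasoning_tokens=1200,
    spent=0.05, tool_calls=2,
)
p_success = 0.70

if stop_loss_triggered(budget=budget, p_success=p_success, min_required_success=0.85):
    print(fallback_action(p_success=p_success, min_required_success=0.85, high_risk=False))
```

Here the tool-call cap is reached while p(success) is still below the bar, so the sketch asks a clarifying question instead of buying another loop.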

The Agentic EV Trace

Production agents should emit a cost trace for every request.

{
  "request_id": "req_123",
  "route": "agentic_support",
  "business_value_estimate": 0.8,
  "initial_p_success": 0.58,
  "final_p_success": 0.88,
  "tool_calls": [
    {
      "name": "policy_search",
      "direct_cost": 0.002,
      "observation_tokens": 450,
      "decision": "called"
    },
    {
      "name": "order_lookup",
      "direct_cost": 0.015,
      "observation_tokens": 900,
      "decision": "called"
    },
    {
      "name": "web_search",
      "direct_cost": 0.035,
      "observation_tokens": 3000,
      "decision": "skipped",
      "reason": "marginal_ev_below_threshold"
    }
  ],
  "total_cost": 0.061,
  "estimated_ev": 0.583
}

Without this trace, cost tuning becomes vibes.

With this trace, cost tuning becomes engineering:

Which tools are called most often?
Which tools improve success probability?
Which observations are too large?
Which route spends too much reasoning on low-value requests?
Which request classes should be downgraded to RAG?
Which request classes deserve deeper reasoning?
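The first and third questions can be answered straight from the traces. A minimal aggregation sketch, assuming traces are available as dictionaries in the shape shown above; the two sample traces and the `summarize_tools` helper are hypothetical.

```python
from collections import defaultdict

# Two hypothetical traces in the shape shown above (fields trimmed).
traces = [
    {"route": "agentic_support", "tool_calls": [
        {"name": "policy_search", "observation_tokens": 450, "decision": "called"},
        {"name": "order_lookup", "observation_tokens": 900, "decision": "called"},
        {"name": "web_search", "observation_tokens": 3000, "decision": "skipped"},
    ]},
    {"route": "agentic_support", "tool_calls": [
        {"name": "policy_search", "observation_tokens": 520, "decision": "called"},
        {"name": "order_lookup", "observation_tokens": 1400, "decision": "skipped"},
    ]},
]


def summarize_tools(traces: list[dict]) -> dict[str, dict]:
    """Per-tool call rate and mean observation size for calls that ran."""
    considered = defaultdict(int)
    called = defaultdict(int)
    tokens = defaultdict(int)
    for trace in traces:
        for call in trace["tool_calls"]:
            considered[call["name"]] += 1
            if call["decision"] == "called":
                called[call["name"]] += 1
                tokens[call["name"]] += call["observation_tokens"]
    return {
        name: {
            "call_rate": called[name] / considered[name],
            "mean_observation_tokens": tokens[name] / called[name] if called[name] else 0.0,
        }
        for name in considered
    }


for name, stats in summarize_tools(traces).items():
    print(name, stats)
```

The same loop extends to the other questions once each trace also records per-step reasoning tokens and the measured outcome.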

The point is not to make the system cheap everywhere.

The point is to spend expensive reasoning where it compounds.

Practical Tuning Rules

Start with retrieval before tools. If a small retrieval step resolves the uncertainty, do not invoke a full agent.

Pass summaries, not transcripts. Tool observations should be structured, minimal, and decision-relevant.

Use confidence deltas. A tool should estimate how much it improves the answer, not merely whether it can be called.

Cap loops by cost, not only by count. Two expensive tool calls can be worse than five cheap ones.

Separate planning from execution. Let a cheaper model classify and route when possible. Use the reasoning model when the expected value clears the threshold.

Cache stable observations. Policies, product metadata, shipping rules, and public documentation should not be re-read expensively for every request.

Measure final EV, not model cleverness. A beautiful chain of thought that spends the margin is not a win.
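The caching rule is the easiest of these to show in code. A minimal sketch using the standard library's `functools.lru_cache`; `fetch_policy_from_source` is a hypothetical stand-in for an expensive retrieval or API call.

```python
from functools import lru_cache

fetch_count = 0  # counts how often the expensive path actually runs


def fetch_policy_from_source(policy_id: str) -> str:
    # Stand-in for an expensive retrieval or API call (hypothetical).
    global fetch_count
    fetch_count += 1
    return f"policy text for {policy_id}"


@lru_cache(maxsize=256)
def get_policy(policy_id: str) -> str:
    # Repeated requests for the same policy are served from memory.
    return fetch_policy_from_source(policy_id)


for _ in range(5):
    get_policy("refund_policy_v3")

print("expensive fetches:", fetch_count)  # prints "expensive fetches: 1"
```

`lru_cache` has no expiry, so this only suits observations that really are stable; call `get_policy.cache_clear()` or version the key when the underlying policy changes.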

The Next Layer of AI Engineering

The first phase of GenAI engineering was about making models answer.

The second phase was about grounding them with retrieval.

The third phase is about controlling the economics of autonomous action.

Reasoning models are powerful. Tools are powerful. Agents are powerful.

But power is not the same thing as profit.

The EV-positive agent does not ask:

What else can I call?

It asks:

What is the cheapest next action that changes the expected outcome enough to matter?

That is the next step in the EV+ of GenAI series: from measuring inference to governing agency.

Not less intelligence.

Better bet sizing.
