AI Economics · Agentic Workflows

The EV+ of GenAI, Part 3

How to tune costs for reasoning models, stop wasting tokens, and call tools only when they earn their keep.

Dan Stativa


EV+ Agents

Agentic depth is a budget decision

Reasoning models, tool calls, and observations all spend margin. This article turns agentic orchestration into an EV-gated policy with budgets, context diets, stop-loss rules, and cost traces.

  • Break-even tool-calling formula
  • Reasoning-token and observation budgets
  • EV-gated agent loop in Python
  • Context diet and stop-loss patterns
  • Production agentic EV trace

Sequel to “The EV+ of GenAI” and “Measuring the EV+ of GenAI.” Same formula. This time the unit of analysis is an agentic workflow.

The first article defined the economic frame.

text
EV = p(success) x business_value - inference_cost - risk_cost

The second article measured the frame with a small RAG system.

This third step is where the economics become sharper: agentic AI with reasoning models and tools.

I curated this topic from a GenAI Works discussion about the practical layers of AI systems: LLMs, RAG, AI agents, and agentic AI.

Curated source: GenAI Works, "AI system layers." GenAI Works frames the shift as understanding the layers of AI that many teams miss, from LLMs through RAG and agents into agentic systems.

You cannot run a 2026 business on a 2024 intelligence stack.

The useful direction is not another “agents are powerful” post. The useful direction is more operational:

How do we stop reasoning models from wasting tokens while deciding whether to call tools?

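That question matters because agentic systems do not spend money once. They spend money repeatedly: on planning, on tool selection, on each tool call, on each observation fed back into context, and on every further round of reasoning.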

In a normal chat completion, token waste is annoying.

In an agentic workflow, token waste compounds.
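A rough way to see the compounding is to count the input tokens billed when every reasoning pass re-reads the whole context. This is a minimal sketch under stated assumptions: no prompt caching, no truncation, and a hypothetical 500-token base context. The observation sizes (450, 900, 3000) match the tool specs used later in this article. The function name `billed_input_tokens` is illustrative.

```python
def billed_input_tokens(base_context: int, observations: list[int]) -> int:
    """Total input tokens billed across the loop when every reasoning pass
    re-reads the full context (assumes no prompt caching or truncation)."""
    context = base_context
    total = context              # the first pass reads only the base context
    for obs in observations:
        context += obs           # the observation is appended to the context
        total += context         # the next pass re-reads everything so far
    return total


raw = billed_input_tokens(500, [450, 900, 3000])
trimmed = billed_input_tokens(500, [300, 300, 300])
print(raw, trimmed)  # 8150 3800
```

The raw observations add up to 4,350 tokens, but the loop is billed for 8,150 input tokens because earlier observations are re-read on every later pass.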

The Agentic Cost Trap

An agentic AI process usually looks like this:

text
user request
 -> planner
 -> tool selection
 -> tool call
 -> observation
 -> more reasoning
 -> maybe another tool
 -> final answer

The dangerous version looks like this:

text
user request
 -> expensive reasoning model
 -> verbose plan
 -> broad retrieval
 -> expensive reasoning model
 -> unnecessary tool call
 -> large observation copied into context
 -> expensive reasoning model
 -> second unnecessary tool call
 -> final answer that could have been produced after step 2

The model may look intelligent. The trace may look impressive. The answer may even be correct.

But the EV can still be worse than a simpler pipeline.

The agent is not paid for thinking. It is paid for improving the expected outcome enough to justify the cost of thinking.

The Tool Tax

Every tool call has two costs.

text
tool_call_cost = direct_tool_cost + tokenized_observation_cost

The direct cost is obvious: API fees, database reads, vector search, browser automation, compute, queue time, or external service calls.

The hidden cost is the observation. Tool results usually get pasted back into the model context. If the result is large, noisy, or redundant, the next reasoning step becomes more expensive.

So the real question before calling a tool is:

text
expected_tool_gain > tool_call_cost + extra_reasoning_cost

Where:

text
expected_tool_gain =
(p_success_after_tool - p_success_before_tool) x business_value
+ risk_reduction

A tool call is EV-positive when:

text
(p_after - p_before) x business_value + risk_reduction
>
direct_tool_cost + observation_token_cost + extra_reasoning_cost

This is the missing runtime gate in many agent demos.

They ask:

Can the model call a tool?

Production systems should ask:

Is this tool call worth the next dollar, millisecond, and token?

A Worked Example

Imagine an AI support agent handling a refund question for a user who may be eligible for an exception.

Assume:

text
business_value = $0.80
risk_cost_if_wrong = $0.18

The agent has three choices.

| Strategy | p(success) | Inference + tool cost | Risk cost | EV |
| --- | --- | --- | --- | --- |
| Direct answer | 0.58 | $0.006 | $0.18 | $0.278 |
| Retrieve policy only | 0.76 | $0.014 | $0.10 | $0.494 |
| Retrieve policy + call order API + reason deeply | 0.88 | $0.071 | $0.06 | $0.573 |

The full agentic path has the highest EV here.

text
direct_ev = 0.58 x 0.80 - 0.006 - 0.18 = $0.278
rag_ev    = 0.76 x 0.80 - 0.014 - 0.10 = $0.494
agent_ev  = 0.88 x 0.80 - 0.071 - 0.06 = $0.573
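The arithmetic above can be reproduced with a few lines of Python. This is a small sketch, not production code: the `Strategy` dataclass and the `expected_value` helper are names introduced here for illustration, and the numbers are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    p_success: float
    cost: float        # inference + tool cost per request
    risk_cost: float   # expected cost of a wrong answer


def expected_value(s: Strategy, business_value: float) -> float:
    """EV = p(success) x business_value - inference_cost - risk_cost."""
    return s.p_success * business_value - s.cost - s.risk_cost


STRATEGIES = [
    Strategy("direct", 0.58, 0.006, 0.18),
    Strategy("rag", 0.76, 0.014, 0.10),
    Strategy("agent", 0.88, 0.071, 0.06),
]

for s in STRATEGIES:
    # Rounds to 0.278, 0.494, and 0.573 for direct, rag, and agent.
    print(s.name, round(expected_value(s, 0.80), 3))
```

Keeping `business_value` as a parameter makes it easy to rerun the same comparison for a different request class.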

But now change the business value from $0.80 to $0.30.

| Strategy | p(success) | Inference + tool cost | Risk cost | EV |
| --- | --- | --- | --- | --- |
| Direct answer | 0.58 | $0.006 | $0.18 | -$0.012 |
| Retrieve policy only | 0.76 | $0.014 | $0.10 | $0.114 |
| Retrieve policy + call order API + reason deeply | 0.88 | $0.071 | $0.06 | $0.133 |

The agentic path is still positive, but the margin over RAG is only:

text
$0.133 - $0.114 = $0.019

If the tool is slower, the order API is unreliable, the observation is verbose, or the model takes one extra reasoning loop, the premium path stops being worth it.
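The effect of an unreliable order API can be made concrete with a break-even reliability. The sketch below rests on a simplifying assumption that is mine, not part of the worked example: when the API call fails, the agent still pays the full agentic cost but falls back to the RAG-level success probability and risk cost. The function names are illustrative.

```python
def agent_ev_with_reliability(
    reliability: float,
    business_value: float,
    *,
    p_agent: float = 0.88,
    p_fallback: float = 0.76,
    agent_cost: float = 0.071,
    risk_agent: float = 0.06,
    risk_fallback: float = 0.10,
) -> float:
    """EV of the agentic path when the tool call only succeeds sometimes."""
    ev_ok = p_agent * business_value - agent_cost - risk_agent
    # Assumption: a failed call still costs the full agentic price.
    ev_failed = p_fallback * business_value - agent_cost - risk_fallback
    return reliability * ev_ok + (1 - reliability) * ev_failed


def rag_ev(business_value: float) -> float:
    return 0.76 * business_value - 0.014 - 0.10


def break_even_reliability(business_value: float) -> float:
    """Tool success rate at which the agentic path only ties with RAG."""
    ev_ok = agent_ev_with_reliability(1.0, business_value)
    ev_failed = agent_ev_with_reliability(0.0, business_value)
    return (rag_ev(business_value) - ev_failed) / (ev_ok - ev_failed)


print(round(break_even_reliability(0.30), 2))  # 0.75
print(round(break_even_reliability(0.80), 2))  # 0.42
```

Under these assumptions, the order API has to succeed about three times out of four before the agentic path beats RAG at $0.30 of business value. At $0.80 it only needs to succeed about 42% of the time.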

The lesson is not “never use agents.”

The lesson is:

Agentic depth should be purchased only when uncertainty, risk, or business value make the marginal reasoning step EV-positive.

Break-Even Tool Calling

Before a tool call, compute the maximum affordable cost.

text
max_tool_cost =
(p_after - p_before) x business_value
+ risk_reduction
- extra_reasoning_cost

Example:

text
p_before = 0.76
p_after  = 0.88
business_value = $0.80
risk_reduction = $0.04
extra_reasoning_cost = $0.018

max_tool_cost = (0.88 - 0.76) x 0.80 + 0.04 - 0.018
max_tool_cost = $0.118

If the order API call plus the observation tokens cost less than $0.118, the tool is EV-positive.

If the same request has only $0.30 of business value:

text
max_tool_cost = (0.88 - 0.76) x 0.30 + 0.04 - 0.018
max_tool_cost = $0.058

Same model. Same tool. Same expected accuracy lift.

Different product economics.

This is why the routing layer needs business context. An agent without value awareness is just an expensive loop with good manners.
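The same break-even formula can be inverted to ask how much a request must be worth before a given tool is affordable. This is a sketch under one explicit assumption: `TOOL_COST = 0.039` is a hypothetical figure, taken as the $0.057 cost gap between the agentic and RAG paths in the worked example minus the $0.018 of extra reasoning. `min_business_value` is a name introduced here.

```python
def max_tool_cost(p_before, p_after, business_value, risk_reduction, extra_reasoning_cost):
    """Maximum affordable tool cost, as defined in the formula above."""
    return (p_after - p_before) * business_value + risk_reduction - extra_reasoning_cost


def min_business_value(p_before, p_after, tool_cost, risk_reduction, extra_reasoning_cost):
    """Smallest business value at which the tool call breaks even."""
    lift = p_after - p_before
    if lift <= 0:
        raise ValueError("the tool must raise p(success) for business value to matter")
    return max(0.0, (tool_cost + extra_reasoning_cost - risk_reduction) / lift)


TOOL_COST = 0.039  # hypothetical: order API fee plus observation tokens

threshold = min_business_value(0.76, 0.88, TOOL_COST, 0.04, 0.018)
print(f"{threshold:.3f}")  # 0.142

for value in (0.80, 0.30, 0.10):
    affordable = max_tool_cost(0.76, 0.88, value, 0.04, 0.018)
    print(value, "CALL" if TOOL_COST <= affordable else "SKIP")
```

With these numbers the tool clears the bar at $0.80 and $0.30 but not at $0.10, which is the kind of distinction a value-aware router can make before the model starts reasoning.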

Reasoning Tokens Are a Budget, Not a Personality Trait

Reasoning models are valuable because they can spend more computation on difficult tasks. That does not mean every task deserves the same reasoning budget.

A useful production policy:

| Request type | Reasoning budget | Tool policy |
| --- | --- | --- |
| Simple FAQ | Low | No tool unless retrieval confidence is low |
| Known policy lookup | Low to medium | Retrieve small context only |
| Account-specific support | Medium | Call account/order tool only after policy retrieval |
| High-value transaction | Medium to high | Use tools, verify, and summarize observations |
| Compliance-sensitive action | High | Use tools and require structured evidence |

The budget should be explicit:

text
max_planning_tokens = 200
max_observation_tokens = 700
max_tool_calls = 2
max_total_cost = $0.08

Those limits are not arbitrary. They are the economic boundaries of the hand.
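One possible way to make those limits enforceable is to compare observed usage against them and report what was exceeded. This is a minimal sketch: `Limits`, `Usage`, and `violations` are hypothetical names, and only the four limit values come from the budget above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    max_planning_tokens: int = 200
    max_observation_tokens: int = 700
    max_tool_calls: int = 2
    max_total_cost: float = 0.08


@dataclass(frozen=True)
class Usage:
    planning_tokens: int
    observation_tokens: int
    tool_calls: int
    total_cost: float


def violations(usage: Usage, limits: Limits) -> list[str]:
    """Return the names of every limit the request has exceeded."""
    checks = {
        "planning_tokens": usage.planning_tokens > limits.max_planning_tokens,
        "observation_tokens": usage.observation_tokens > limits.max_observation_tokens,
        "tool_calls": usage.tool_calls > limits.max_tool_calls,
        "total_cost": usage.total_cost > limits.max_total_cost,
    }
    return [name for name, exceeded in checks.items() if exceeded]


print(violations(Usage(180, 900, 2, 0.05), Limits()))  # ['observation_tokens']
```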

Sample Code: EV-Gated Agent Loop

The following Python example is deliberately small. It does not call a real model. Instead, it models the economics around a reasoning model so the policy is visible.

You can plug the same accounting into OpenAI, Anthropic, Gemini, local vLLM, or any other model stack.

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RateCard:
    input_per_1k: float
    output_per_1k: float
    reasoning_per_1k: float


@dataclass(frozen=True)
class ToolSpec:
    name: str
    direct_cost: float
    expected_observation_tokens: int
    expected_success_lift: float
    expected_risk_reduction: float


@dataclass
class AgentBudget:
    max_total_cost: float
    max_tool_calls: int
    max_reasoning_tokens: int
    spent: float = 0.0
    tool_calls: int = 0
    reasoning_tokens: int = 0


def token_cost(
    *,
    input_tokens: int,
    output_tokens: int,
    reasoning_tokens: int,
    rates: RateCard,
) -> float:
    return (
        input_tokens / 1000 * rates.input_per_1k
        + output_tokens / 1000 * rates.output_per_1k
        + reasoning_tokens / 1000 * rates.reasoning_per_1k
    )


def max_affordable_tool_cost(
    *,
    p_before: float,
    p_after: float,
    business_value: float,
    risk_reduction: float,
    extra_reasoning_cost: float,
) -> float:
    return (
        (p_after - p_before) * business_value
        + risk_reduction
        - extra_reasoning_cost
    )


def should_call_tool(
    *,
    tool: ToolSpec,
    p_before: float,
    business_value: float,
    extra_reasoning_cost: float,
    observation_token_cost: Callable[[int], float],
    budget: AgentBudget,
) -> bool:
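    # Hard cap on the number of tool calls per request.
    if budget.tool_calls >= budget.max_tool_calls:
        return False

    # Cap the post-tool success probability; certainty is never assumed.
    p_after = min(0.99, p_before + tool.expected_success_lift)
    affordable = max_affordable_tool_cost(
        p_before=p_before,
        p_after=p_after,
        business_value=business_value,
        risk_reduction=tool.expected_risk_reduction,
        extra_reasoning_cost=extra_reasoning_cost,
    )

    # Direct fee plus the cost of feeding the observation back into context.
    expected_cost = (
        tool.direct_cost
        + observation_token_cost(tool.expected_observation_tokens)
    )

    # The follow-up reasoning pass is charged to the budget as well, so it is
    # part of the projected spend. The EV gate already nets it out of `affordable`.
    projected_spend = budget.spent + expected_cost + extra_reasoning_cost
    return projected_spend <= budget.max_total_cost and expected_cost <= affordable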

Now define a few tools.

rates = RateCard(
    input_per_1k=0.003,
    output_per_1k=0.012,
    reasoning_per_1k=0.018,
)

tools = [
    ToolSpec(
        name="policy_search",
        direct_cost=0.002,
        expected_observation_tokens=450,
        expected_success_lift=0.18,
        expected_risk_reduction=0.06,
    ),
    ToolSpec(
        name="order_lookup",
        direct_cost=0.015,
        expected_observation_tokens=900,
        expected_success_lift=0.12,
        expected_risk_reduction=0.04,
    ),
    ToolSpec(
        name="web_search",
        direct_cost=0.035,
        expected_observation_tokens=3000,
        expected_success_lift=0.02,
        expected_risk_reduction=0.005,
    ),
]

budget = AgentBudget(
    max_total_cost=0.08,
    max_tool_calls=3,
    max_reasoning_tokens=1200,
)


def observation_cost(tokens: int) -> float:
    return token_cost(
        input_tokens=tokens,
        output_tokens=0,
        reasoning_tokens=0,
        rates=rates,
    )

Simulate the decision.

business_value = 0.80
p_success = 0.58

for tool in tools:
    extra_reasoning_cost = token_cost(
        input_tokens=250,
        output_tokens=120,
        reasoning_tokens=300,
        rates=rates,
    )

    allowed = should_call_tool(
        tool=tool,
        p_before=p_success,
        business_value=business_value,
        extra_reasoning_cost=extra_reasoning_cost,
        observation_token_cost=observation_cost,
        budget=budget,
    )

    print(f"{tool.name:14s} -> {'CALL' if allowed else 'SKIP'}")

    if allowed:
        tool_cost = tool.direct_cost + observation_cost(tool.expected_observation_tokens)
        budget.spent += tool_cost + extra_reasoning_cost
        budget.tool_calls += 1
        budget.reasoning_tokens += 300
        p_success = min(0.99, p_success + tool.expected_success_lift)

Expected output:

policy_search  -> CALL
order_lookup   -> CALL
web_search     -> SKIP

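The agent does not skip web_search because web search is bad.

It skips it because the expected marginal gain is too small for the price. A 0.02 lift and $0.005 of risk reduction are worth about $0.013 after the extra reasoning cost, against roughly $0.044 for the call and its 3,000-token observation. The call would also push total spend just past the $0.08 cap, because the first two tools already used part of the budget. The later a tool appears in the loop, the more it has to justify itself.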

That is the agentic version of EV-positive play.

Add a Context Diet

The fastest way to waste money in tool-using agents is to paste raw tool output back into the model.

Do not do this:

text
Tool returned 9,000 tokens.
Append all 9,000 tokens to context.
Ask the reasoning model what matters.

Do this instead:

text
Tool returned 9,000 tokens.
Compress to structured evidence.
Append only the 300-700 tokens needed for the next decision.

A simple observation contract:

def compress_order_observation(raw_order: dict) -> dict:
    return {
        "order_id": raw_order["id"],
        "status": raw_order["status"],
        "delivered_at": raw_order.get("delivered_at"),
        "is_digital": raw_order["product_type"] == "digital",
        "return_window_days": raw_order.get("return_window_days"),
        "prior_refunds": raw_order.get("prior_refunds", 0),
    }

The goal is not to hide information from the model. The goal is to preserve decision-relevant information while removing token noise.

A reasoning model does not need the entire order JSON. It needs the variables that change the decision.
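To get a feel for the saving, the contract above can be applied to a sample payload and both versions compared by size. This is a rough sketch: `raw_order` is made-up data, and `approx_tokens` uses a crude assumption of about four characters per token rather than a real tokenizer.

```python
import json


def compress_order_observation(raw_order: dict) -> dict:
    # Same observation contract as above, repeated so the sketch runs on its own.
    return {
        "order_id": raw_order["id"],
        "status": raw_order["status"],
        "delivered_at": raw_order.get("delivered_at"),
        "is_digital": raw_order["product_type"] == "digital",
        "return_window_days": raw_order.get("return_window_days"),
        "prior_refunds": raw_order.get("prior_refunds", 0),
    }


def approx_tokens(payload: dict) -> int:
    """Crude size proxy: about 4 characters per token (an assumption)."""
    return len(json.dumps(payload)) // 4


raw_order = {  # hypothetical order payload with noisy fields
    "id": "ord_1001",
    "status": "delivered",
    "delivered_at": "2025-01-10",
    "product_type": "physical",
    "return_window_days": 30,
    "prior_refunds": 1,
    "line_items": [{"sku": f"SKU-{i}", "qty": 1, "notes": "gift wrap"} for i in range(20)],
    "audit_log": ["status changed"] * 50,
}

compact = compress_order_observation(raw_order)
print(approx_tokens(raw_order), "->", approx_tokens(compact))
```

For real budgeting, the model provider's own tokenizer or reported usage figures would be the more reliable measure.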

Add a Stop-Loss

Poker has bankroll management. Agentic AI needs the same instinct.

Add stop-loss rules:

def stop_loss_triggered(
    *,
    budget: AgentBudget,
    p_success: float,
    min_required_success: float,
) -> bool:
    if budget.spent >= budget.max_total_cost:
        return True

    if budget.reasoning_tokens >= budget.max_reasoning_tokens:
        return True

    if budget.tool_calls >= budget.max_tool_calls and p_success < min_required_success:
        return True

    return False

Then the runtime policy becomes:

text
if stop_loss_triggered:
  escalate, ask a clarifying question, or give a bounded answer

This is not a failure mode. It is a product decision.

Sometimes the EV-positive answer is:

I need one more piece of information before I can answer safely.

That sentence can be cheaper and better than another hidden reasoning loop.
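The runtime policy can be sketched as a small dispatcher around the stop-loss check. The fallback ordering in `next_step` is an assumption for illustration: give a bounded answer if confidence is already high enough, otherwise ask for clarification or escalate. The budget class and stop-loss rules are repeated so the block runs on its own.

```python
from dataclasses import dataclass


@dataclass
class AgentBudget:
    max_total_cost: float
    max_tool_calls: int
    max_reasoning_tokens: int
    spent: float = 0.0
    tool_calls: int = 0
    reasoning_tokens: int = 0


def stop_loss_triggered(*, budget: AgentBudget, p_success: float, min_required_success: float) -> bool:
    # Same three rules as the function above.
    if budget.spent >= budget.max_total_cost:
        return True
    if budget.reasoning_tokens >= budget.max_reasoning_tokens:
        return True
    return budget.tool_calls >= budget.max_tool_calls and p_success < min_required_success


def next_step(*, budget: AgentBudget, p_success: float, min_required_success: float) -> str:
    """Hypothetical dispatcher: keep going, or pick a fallback once stopped."""
    if not stop_loss_triggered(
        budget=budget, p_success=p_success, min_required_success=min_required_success
    ):
        return "continue"
    if p_success >= min_required_success:
        return "bounded_answer"
    return "clarify_or_escalate"


healthy = AgentBudget(0.08, 3, 1200, spent=0.03, tool_calls=1, reasoning_tokens=300)
out_of_money = AgentBudget(0.08, 3, 1200, spent=0.08, tool_calls=2, reasoning_tokens=600)
out_of_tools = AgentBudget(0.08, 3, 1200, spent=0.05, tool_calls=3, reasoning_tokens=900)

print(next_step(budget=healthy, p_success=0.76, min_required_success=0.85))       # continue
print(next_step(budget=out_of_money, p_success=0.88, min_required_success=0.85))  # bounded_answer
print(next_step(budget=out_of_tools, p_success=0.70, min_required_success=0.85))  # clarify_or_escalate
```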

The Agentic EV Trace

Production agents should emit a cost trace for every request.

{
  "request_id": "req_123",
  "route": "agentic_support",
  "business_value_estimate": 0.8,
  "initial_p_success": 0.58,
  "final_p_success": 0.88,
  "tool_calls": [
    {
      "name": "policy_search",
      "direct_cost": 0.002,
      "observation_tokens": 450,
      "decision": "called"
    },
    {
      "name": "order_lookup",
      "direct_cost": 0.015,
      "observation_tokens": 900,
      "decision": "called"
    },
    {
      "name": "web_search",
      "direct_cost": 0.035,
      "observation_tokens": 3000,
      "decision": "skipped",
      "reason": "marginal_ev_below_threshold"
    }
  ],
  "total_cost": 0.061,
  "estimated_ev": 0.583
}

Without this trace, cost tuning becomes vibes.

With this trace, cost tuning becomes engineering.

The point is not to make the system cheap everywhere.

The point is to spend expensive reasoning where it compounds.
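One way to put such traces to work is to aggregate them per route or per day. The sketch below assumes traces shaped like the JSON example above; the `summarize` helper and its output keys are illustrative, not a fixed schema.

```python
from collections import Counter


def summarize(traces: list[dict]) -> dict:
    """Aggregate cost, EV, tool usage, and skip reasons across traces."""
    if not traces:
        raise ValueError("need at least one trace")
    called = Counter()
    skip_reasons = Counter()
    for trace in traces:
        for call in trace["tool_calls"]:
            if call["decision"] == "called":
                called[call["name"]] += 1
            else:
                skip_reasons[call.get("reason", "unspecified")] += 1
    n = len(traces)
    return {
        "requests": n,
        "mean_cost": sum(t["total_cost"] for t in traces) / n,
        "mean_ev": sum(t["estimated_ev"] for t in traces) / n,
        "calls_by_tool": dict(called),
        "skip_reasons": dict(skip_reasons),
    }


sample = {
    "request_id": "req_123",
    "tool_calls": [
        {"name": "policy_search", "decision": "called"},
        {"name": "order_lookup", "decision": "called"},
        {"name": "web_search", "decision": "skipped", "reason": "marginal_ev_below_threshold"},
    ],
    "total_cost": 0.061,
    "estimated_ev": 0.583,
}

print(summarize([sample]))
```

A summary like this shows which tools are actually being paid for and which skip reasons dominate, which is where threshold tuning can start.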

Practical Tuning Rules

Start with retrieval before tools. If a small retrieval step resolves the uncertainty, do not invoke a full agent.

Pass summaries, not transcripts. Tool observations should be structured, minimal, and decision-relevant.

Use confidence deltas. A tool should estimate how much it improves the answer, not merely whether it can be called.

Cap loops by cost, not only by count. Two expensive tool calls can be worse than five cheap ones.

Separate planning from execution. Let a cheaper model classify and route when possible. Use the reasoning model when the expected value clears the threshold.

Cache stable observations. Policies, product metadata, shipping rules, and public documentation should not be re-read expensively for every request.

Measure final EV, not model cleverness. A beautiful chain of thought that spends the margin is not a win.
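The caching rule is the easiest of these to sketch. The example below uses Python's standard `functools.lru_cache`. The `cached_policy_search` function is a hypothetical stand-in for an expensive retrieval call, and the counter exists only to show that the backend is hit once.

```python
from functools import lru_cache

# Counts how often the stubbed backend is actually hit.
backend_hits = {"policy_search": 0}


@lru_cache(maxsize=256)
def cached_policy_search(query: str) -> str:
    """Hypothetical stand-in for an expensive policy retrieval call."""
    backend_hits["policy_search"] += 1
    return f"policy text for: {query}"


cached_policy_search("refund window for physical goods")
cached_policy_search("refund window for physical goods")  # served from the cache
print(backend_hits["policy_search"])  # 1
```

`lru_cache` has no expiry, so a real system would need to call `cached_policy_search.cache_clear()` or use a time-bounded cache when policies change.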

The Next Layer of AI Engineering

The first phase of GenAI engineering was about making models answer.

The second phase was about grounding them with retrieval.

The third phase is about controlling the economics of autonomous action.

Reasoning models are powerful. Tools are powerful. Agents are powerful.

But power is not the same thing as profit.

The EV-positive agent does not ask:

text
What else can I call?

It asks:

text
What is the cheapest next action that changes the expected outcome enough to matter?
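That question can be read as a selection rule: among the actions whose net gain clears a threshold, take the cheapest. This is a sketch under assumptions: the `Action` type, `net_gain`, and the `min_net_gain` threshold are hypothetical, follow-up reasoning cost is ignored for brevity, and the costs are the direct fee plus observation tokens from the three tools defined earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    name: str
    cost: float            # direct fee plus observation token cost
    success_lift: float
    risk_reduction: float


def net_gain(action: Action, business_value: float) -> float:
    return action.success_lift * business_value + action.risk_reduction - action.cost


def cheapest_worthwhile_action(
    actions: list[Action], business_value: float, min_net_gain: float = 0.0
) -> Optional[Action]:
    """Cheapest action whose net gain clears the threshold, or None."""
    worthwhile = [a for a in actions if net_gain(a, business_value) > min_net_gain]
    return min(worthwhile, key=lambda a: a.cost, default=None)


ACTIONS = [
    Action("policy_search", 0.00335, 0.18, 0.06),
    Action("order_lookup", 0.0177, 0.12, 0.04),
    Action("web_search", 0.044, 0.02, 0.005),
]

choice = cheapest_worthwhile_action(ACTIONS, business_value=0.80)
print(choice.name if choice else "stop")  # policy_search
```

When nothing clears the threshold, the function returns `None`, which is the signal to stop calling tools and answer or escalate.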

That is the next step in the EV+ of GenAI series: from measuring inference to governing agency.

Not less intelligence.

Better bet sizing.
