
Benchmark: grepai vs grep on Claude Code

A controlled benchmark comparing semantic search with grepai versus traditional grep in Claude Code, showing 27.5% cost savings and 97% input token reduction.

Yoan Bernabeu · January 21, 2026

Disclaimer: This benchmark was conducted by the grepai maintainer. While we’ve aimed for methodological rigor (5 runs with consistent results), we encourage users to run their own tests on their codebases.

TL;DR

We ran a controlled benchmark comparing grepai (semantic search) versus traditional grep in Claude Code on the Excalidraw codebase (155,000+ lines of TypeScript code).

Results:

  • -27.5% on API billing ($6.78 → $4.92)
  • -55% tool calls (139 → 62)
  • -97% input tokens (51,147 → 1,326)
  • -71% cache creation tokens (563,883 → 162,289)

Methodology

Test Environment

We compared two identical clones of the Excalidraw repository:

  • Session 1 (Baseline): Standard Claude Code without grepai
  • Session 2 (With grepai): Claude Code + grepai with Ollama embeddings (nomic-embed-text model), running locally on a MacBook Pro M3 Pro

The Five Test Questions

We posed identical questions across both sessions in the same order. These questions were designed to reflect real developer exploration patterns—describing what the code does rather than searching for known function names:

  1. “Locate the exact mathematical function used to determine if a user’s cursor is hovering inside a ‘diamond’ shape.”
  2. “Explain how the application calculates the intersection point when an arrow is attached to an ellipse.”
  3. “Find the algorithm responsible for simplifying or smoothing the points of a ‘freedraw’ line after the user releases the mouse.”
  4. “Identify the code responsible for snapping dragged elements to the grid.”
  5. “How does the codebase handle sending an element ‘backward’ in the z-order?”

Metric Collection

Data was extracted directly from Claude Code’s JSON logs located at ~/.claude/projects/<project-hash>/. The analysis included both main session logs and subagent logs found in <session-uuid>/subagents/ directories.
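For reference, here is a minimal sketch of that aggregation, assuming the log layout described above. The message.usage field names mirror the API's token counters, but the exact schema may differ across Claude Code versions, so treat this as illustrative rather than the exact extraction script:

```typescript
// Sketch: sum token usage across Claude Code JSONL session logs,
// including subagent logs nested under <session-uuid>/subagents/.
// Assumption: assistant entries carry a message.usage object with the
// four token counters used by the API.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

interface Usage {
  input_tokens: number;
  cache_read_input_tokens: number;
  cache_creation_input_tokens: number;
  output_tokens: number;
}

function aggregateUsage(projectDir: string): Usage {
  const totals: Usage = {
    input_tokens: 0,
    cache_read_input_tokens: 0,
    cache_creation_input_tokens: 0,
    output_tokens: 0,
  };

  // Recursively collect every .jsonl file (main session + subagents).
  const walk = (dir: string): string[] =>
    readdirSync(dir, { withFileTypes: true }).flatMap((entry) =>
      entry.isDirectory()
        ? walk(join(dir, entry.name))
        : entry.name.endsWith(".jsonl")
          ? [join(dir, entry.name)]
          : [],
    );

  for (const file of walk(projectDir)) {
    for (const line of readFileSync(file, "utf8").split("\n")) {
      if (!line.trim()) continue;
      let usage;
      try {
        usage = JSON.parse(line)?.message?.usage;
      } catch {
        continue; // skip malformed lines
      }
      if (!usage) continue;
      totals.input_tokens += usage.input_tokens ?? 0;
      totals.cache_read_input_tokens += usage.cache_read_input_tokens ?? 0;
      totals.cache_creation_input_tokens += usage.cache_creation_input_tokens ?? 0;
      totals.output_tokens += usage.output_tokens ?? 0;
    }
  }
  return totals;
}

// Usage: aggregateUsage("~/.claude/projects/<project-hash>") with the path expanded.
```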


Understanding Claude Code’s Token Economics

Before diving into results, it’s important to understand how Claude Code bills API usage. The API differentiates between four token categories with distinct costs (Claude Opus 4.5 pricing):

| Token Type | Cost per Million | Description |
| --- | --- | --- |
| input_tokens | $5.00 | Fresh tokens processed for the first time |
| cache_read_input_tokens | $0.50 | Previously cached tokens being reused (90% discount) |
| cache_creation_input_tokens | $6.25 | New tokens being cached for future reuse (25% premium) |
| output_tokens | $25.00 | Tokens generated by Claude |

The prompt caching system stores frequently used context (system prompts, conversation history, previously read files). Each new subagent in Claude Code starts with a fresh context that must be cached, incurring the 1.25× premium on cache creation.
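To make the billing model concrete, here is a minimal sketch of the cost formula implied by the table above, using the Opus 4.5 list prices quoted in this post. Fed with the token counts reported in the results below, it reproduces the $6.78 and $4.92 totals:

```typescript
// Sketch: billed cost (USD) implied by the pricing table above.
const PRICE_PER_MILLION = {
  input_tokens: 5.0,                 // fresh input
  cache_read_input_tokens: 0.5,      // 90% discount vs. fresh input
  cache_creation_input_tokens: 6.25, // 25% premium vs. fresh input
  output_tokens: 25.0,
} as const;

type TokenCounts = Record<keyof typeof PRICE_PER_MILLION, number>;

function billedCost(tokens: TokenCounts): number {
  return Object.entries(PRICE_PER_MILLION).reduce(
    (sum, [kind, price]) =>
      sum + (tokens[kind as keyof TokenCounts] * price) / 1_000_000,
    0,
  );
}

// Baseline session (token counts from the results below): ≈ $6.78
console.log(billedCost({
  input_tokens: 51_147,
  cache_read_input_tokens: 5_973_161,
  cache_creation_input_tokens: 563_883,
  output_tokens: 476,
}).toFixed(2));

// grepai session: ≈ $4.92
console.log(billedCost({
  input_tokens: 1_326,
  cache_read_input_tokens: 7_775_888,
  cache_creation_input_tokens: 162_289,
  output_tokens: 347,
}).toFixed(2));
```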


Results

Token Metrics

| Metric | Baseline | grepai | Change |
| --- | --- | --- | --- |
| Subagents launched | 5 | 0 | -100% |
| Tool calls | 139 | 62 | -55% |
| input_tokens | 51,147 | 1,326 | -97% |
| cache_read_input_tokens | 5,973,161 | 7,775,888 | +30% |
| cache_creation_input_tokens | 563,883 | 162,289 | -71% |
| output_tokens | 476 | 347 | -27% |

Cost Breakdown

| Category | Baseline | grepai | Difference |
| --- | --- | --- | --- |
| input_tokens cost | $0.26 | $0.01 | -97% |
| cache_read cost | $2.99 | $3.89 | +30% |
| cache_creation cost | $3.52 | $1.01 | -71% |
| output_tokens cost | $0.01 | $0.01 | -27% |
| Total billed cost | $6.78 | $4.92 | -27.5% |

Tool Usage Breakdown

| Tool | Baseline | grepai |
| --- | --- | --- |
| Bash (including grepai) | 41 | 9 |
| Grep | 37 | 20 |
| Glob | 13 | 0 |
| Read | 43 | 30 |
| Task (subagents) | 5 | 0 |

Why the Difference?

The Subagent Problem

Without grepai, Claude Code’s typical workflow looks like this:

Question → Task(subagent_type: Explore) → Multiple iterations:
  ├─ Grep("pattern") returns 40+ files
  ├─ Read files sequentially to filter
  ├─ Launch additional searches
  └─ Each subagent = separate context = cache_creation charges

With grepai, the workflow simplifies to:

Question → Bash: grepai search "semantic query" → Targeted results
  ├─ Directly identifies relevant files
  ├─ Minimal iteration needed
  └─ No subagent spawning = no new cache contexts

The baseline scenario launched 5 subagents, requiring 563,883 cache_creation tokens. The grepai scenario eliminated subagent launches entirely, reducing this to 162,289 tokens—a savings of $2.51 on cache creation alone.

Why Cost Savings Don’t Match Token Reductions

The -97% reduction in fresh input_tokens translated into only -27.5% savings on the total bill because fresh input was a small slice of that bill to begin with: cache operations dominated token consumption in both scenarios, and cache reads cost 10× less than fresh input tokens.

  • Baseline fresh input tokens: ~$0.26 of the $6.78 total (3.8%), so eliminating them entirely could save at most $0.26
  • Cache operations accounted for the rest: ~$6.51 of $6.78
  • Most of the actual saving came from the -71% drop in cache_creation cost ($3.52 → $1.01), partly offset by a +$0.90 increase in cache_read cost

The Glob Elimination

A critical insight: grepai reduced Glob tool calls from 13 to 0. The Glob tool searches for files matching patterns (e.g., **/*.ts) but returns dozens of results requiring sequential reads to filter. Semantic search eliminates this trial-and-error phase by returning only pertinent files immediately.


Limitations and Caveats

We want to be transparent about what this benchmark does and doesn’t show:

What We Measured

  • Actual tokens billed by the API
  • Financial cost differences
  • Number of tool invocations

What We Did NOT Measure

  • Answer quality (both approaches found correct solutions)
  • Real-world execution time (varies by server load)
  • Reproducibility across different codebases
  • Performance with smaller repositories

Non-Deterministic Behavior

Claude Code can take different paths for identical questions. Results were consistent across the five repetitions of this experiment, but behavior varies in general. The baseline's use of 5 subagents versus none for grepai was a significant cost driver, and different runs might show different subagent counts.


Benchmark conducted on the Excalidraw codebase (155,000+ lines of TypeScript).

Original article (in French) by the grepai maintainer: grepai Benchmark