Joyjeet Sarkar

Prompting Technique for Browser Automation

Joyjeet Sarkar — Tue, 07 Apr 2026 13:26:18 GMT

Brief

I asked Claude Code to crawl a SaaS helpdesk platform, visit every page recursively, and save full-page screenshots. It was a reasonable ask: I wanted a visual inventory of every screen in our instance. Claude ran for 3 hours, took 898 assistant turns (683 of them tool calls), and consumed 174.8 million tokens. The actual useful output? About 108K tokens: Claude’s responses and the screenshots it saved. The other 99.94% was context, almost all of it the system re-reading the same growing conversation history on every single turn.

This post breaks down exactly why this happened, what Claude inferred from my prompt, the strategy it adopted, and four alternative approaches that would have cut the cost by 74% to 99.96%.

The original prompt

use chrome cdp to visit all pages in https://[redacted].helpdesk.com/
and take screenshots of each page. Recursively watch for the pages.
If you have any confusion, ask. Dont fill anything, or "request demo"
or purchase anything. If your click navigates you away from
[redacted].helpdesk.com then ignore that page.

After a false start (Claude wasn’t saving the screenshots), I clarified:

yes, save the screenshots, otherwise whats the point?
Restart everything and save full DOM screenshots

And later expanded scope:

there are lots of action buttons in the settings pages as well.
are you clicking them as well? take two screenshots in this case.
one of the page, then second when the action buttons is clicked

What Claude inferred

From these prompts, Claude understood the goal as: exhaustively crawl every page within this helpdesk SaaS instance, capture full-page screenshots of each, and expand scope to include interactive elements like settings action buttons.

The key word was “recursively” — Claude interpreted this as a depth-first crawl of the entire site. It would visit a page, find all links, visit each linked page, find more links, and so on. For a SaaS platform with dashboards, ticket views, forums, knowledge base, admin settings, and sub-settings, this expanded into hundreds of unique pages.

The strategy Claude adopted

Claude chose the simplest possible approach: a single long-running session where it manually browsed every page one by one, clicking through the UI, taking screenshots, and saving them to disk.

The session flow looked like this, repeated hundreds of times:

1. Navigate to a page (or click a link)
2. Wait for load
3. Take a GIF screenshot via the gif_creator tool
4. Save the screenshot to disk via Bash
5. Look for more links on the page
6. Navigate to the next one

This produced 898 assistant turns and 683 tool calls:

| Tool              | Calls | Purpose                                 |
| ----------------- | ----- | --------------------------------------- |
| `computer`        | 235   | Clicking/interacting with page elements |
| `gif_creator`     | 212   | Recording screenshots/GIFs              |
| `Bash`            | 98    | Saving files to disk                    |
| `navigate`        | 93    | Navigating to pages                     |
| `javascript_tool` | 30    | DOM manipulation                        |
| Other             | 15    | Tabs, tasks, reads, writes              |

Why this strategy was expensive

The quadratic context problem

Every turn in a Claude conversation re-sends the entire conversation history. That is how the API protocol works: the full message list is sent each time. Caching reduces the price of re-reading the unchanged portion, but those tokens are still processed and still counted.

As Claude browsed more pages, each tool call result (including screenshot data, page content, navigation confirmations) accumulated in the conversation. The context grew steadily:

| Turn | Context size per turn |
| ---- | --------------------- |
| 1    | ~20K tokens           |
| 100  | ~65K tokens           |
| 300  | ~134K tokens          |
| 500  | ~213K tokens          |
| 700  | ~290K tokens          |
| 898  | ~370K tokens          |

The total cost is the sum of all context sizes across all turns. This grows quadratically — it’s not 898 x 370K (the final size), it’s 898 x ~195K (the average size). That’s how you get to 175M tokens.
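To make the arithmetic concrete, here is a minimal sketch that assumes the context grows linearly from about 20K tokens at turn 1 to about 370K at turn 898. That is a simplification of the table above (the real curve is only roughly linear), and the function name is just an illustration.

```python
def total_tokens_read(turns: int, first_ctx: int, last_ctx: int) -> float:
    """Sum the per-turn context sizes, assuming linear growth between the endpoints."""
    step = (last_ctx - first_ctx) / (turns - 1)  # tokens added to the context per turn
    return sum(first_ctx + step * i for i in range(turns))


total = total_tokens_read(turns=898, first_ctx=20_000, last_ctx=370_000)
average = total / 898

print(f"average context per turn: {average:,.0f} tokens")
print(f"total tokens processed:   {total:,.0f}")
```

Under this model the total is the number of turns times the average context, and because the average itself grows with the number of turns, the total grows with the square of the session length.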

Token breakdown

| Type         | Tokens | %     | Note                               |
| ------------ | ------ | ----- | ---------------------------------- |
| Cache read   | 170.9M | 97.8% | Re-reading prior context each turn |
| Cache create | 3.8M   | 2.2%  | New content added to context       |
| Output       | 108K   | 0.06% | Claude’s actual responses          |
| Input        | 1.2K   | ~0%   | Uncached inputs                    |

97.8% of tokens were the system re-reading conversation history. Claude’s actual work — deciding what to click, where to navigate, what to save — was 108K tokens of output. The rest was overhead.
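The percentages in the table can be reproduced from the raw counts. A small sketch using the figures above; the dictionary keys are labels chosen for this example, not field names from any API.

```python
tokens = {
    "cache_read": 170_900_000,
    "cache_create": 3_800_000,
    "output": 108_000,
    "input": 1_200,
}

total = sum(tokens.values())
# Share of the total for each token type, in percent.
shares = {kind: 100 * count / total for kind, count in tokens.items()}

for kind, pct in shares.items():
    print(f"{kind:<13}{pct:7.2f}%")
```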

Why this task specifically triggers the problem

Crawling a website is a high-turn, low-interdependence task. Each page visit is essentially independent — Claude doesn’t need to remember what page 47 looked like to screenshot page 48. But the single-session approach forced Claude to carry the full history of all previous pages on every turn, as if it needed that context. It didn’t.

A better strategy

The fundamental problem was that Claude treated a parallelizable, stateless task as a sequential, stateful one. The prompt didn’t give it a reason to do otherwise — “visit all pages recursively” naturally reads as “start browsing and keep going.”

Here are four alternative approaches, ranked by token savings, along with the prompt you’d use to steer Claude toward each one.

Strategy 1: Script-first — have Claude write a crawler, not be the crawler

Instead of Claude manually browsing 898 times, ask it to write a script that does the browsing. The script runs outside Claude with zero token cost.

The prompt

Write a Node.js Playwright script that:
1. Opens https://[redacted].helpdesk.com/ (I'll handle login manually first)
2. Crawls all internal links recursively, up to depth 3
3. Takes a full-page screenshot of each unique page
4. Saves screenshots to screenshots/ with descriptive filenames based on the URL path
5. Writes a manifest.json mapping URLs to screenshot filenames
6. For settings pages, also clicks each action button and captures the resulting state
7. Skips any links that navigate away from [redacted].helpdesk.com
8. Doesn't fill forms, click "request demo", or purchase anything

Run the script after writing it. If it errors, fix and retry.

What happens

Claude writes ~200 lines of Playwright code in one turn, then spends maybe 3-5 turns refining it after test runs. The script then crawls the entire site in 10-20 minutes, and the crawl itself costs almost no tokens because only the script’s console output comes back into the conversation. Total session: maybe 10 turns, ~70K tokens.
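The traversal logic such a script needs is small. The sketch below is in Python rather than Node.js and deliberately leaves the browser out: `get_links` is a hypothetical callback standing in for whatever Playwright would return from a page's `a[href]` elements, and `screenshot_name` is one possible filename scheme. It covers only the depth limit, the same-host filter, and deduplication.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(start_url, get_links, max_depth=3):
    """Breadth-first walk of same-host links, visiting each URL once."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)  # a real script would take the screenshot here
        if depth == max_depth:
            continue
        for href in get_links(url):
            # Resolve relative links and drop #fragments so duplicates collapse.
            absolute = urldefrag(urljoin(url, href)).url
            if urlparse(absolute).netloc != host or absolute in seen:
                continue  # skip external links and pages already queued
            seen.add(absolute)
            queue.append((absolute, depth + 1))
    return order


def screenshot_name(url):
    """Derive a filename from the URL path, e.g. /a/tickets -> a_tickets.png."""
    path = urlparse(url).path.strip("/") or "index"
    return path.replace("/", "_") + ".png"


# A fake three-page site standing in for the real one.
site = {
    "https://example.test/": ["/a/dashboard", "/a/tickets", "https://other.test/x"],
    "https://example.test/a/dashboard": ["/a/tickets#top"],
    "https://example.test/a/tickets": ["/a/tickets/filters/all"],
}
order = crawl("https://example.test/", lambda u: site.get(u, []))
manifest = {url: screenshot_name(url) for url in order}
print(manifest)
```

The `manifest` dictionary plays the role of the manifest.json requested in the prompt.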

If you want Claude to analyze the screenshots afterward, start a new session:

Read manifest.json and review the screenshots in screenshots/.
Summarize all the pages and features you find.

Token comparison

Original:              174.8M tokens
Script approach:       ~70K tokens (writing + debugging the script)
Savings:               99.96%

When this works best

When the task is mechanical and predictable. Crawling a standard SaaS dashboard with known navigation patterns is exactly this. There’s no visual judgment needed mid-crawl — Claude doesn’t need to see a page to decide whether to screenshot it.

When this doesn’t work

If the site has unpredictable UI that requires visual judgment (e.g., “screenshot anything that looks like a bug”), or if anti-bot protections make Playwright difficult to use. But for internal SaaS tools behind a login, scripts usually work well.

Strategy 2: Top agent + sub-agents — coordinator maps the site, workers capture pages

Split the work into a lightweight coordinator that collects URLs, then independent workers that each handle a small batch with fresh context.

The prompt

I need screenshots of every page in https://[redacted].helpdesk.com/.
Do this in two phases:

Phase 1: Navigate the site and extract all internal URLs using JavaScript
(document.querySelectorAll('a[href]')). Visit each top-level section to find
sub-pages. Write the full deduplicated URL list to urls.json grouped by section.
Don't take screenshots in this phase.

Phase 2: Read urls.json. For each section, spawn a sub-agent that visits those
URLs, takes full-page screenshots via Bash (use Playwright CLI), and saves them
to screenshots/{section}/. Run sub-agents in parallel where possible.

For settings pages with action buttons, spawn additional sub-agents that click
each button and capture the result.

What happens

Phase 1 — Claude browses the site with JavaScript-only link extraction. No screenshots, no GIFs. Maybe 50 turns, context stays under 30K. Outputs a structured urls.json:

{
  "pages": [
    {"url": "/a/dashboard", "section": "dashboard"},
    {"url": "/a/tickets", "section": "tickets"},
    {"url": "/a/tickets/filters/all", "section": "tickets"},
    {"url": "/a/admin/general", "section": "settings"},
    ...
  ]
}

Phase 2 — Claude spawns sub-agents via the Agent tool. Each sub-agent gets a batch of 5-10 URLs. Each starts with a fresh, empty context — no history from the coordinator or other agents. Multiple sub-agents can run in parallel.
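One way the coordinator could turn urls.json into per-agent work, assuming the file has the shape shown above. The function name and the default batch size of 10 are assumptions for illustration.

```python
from collections import defaultdict


def batches_by_section(pages, batch_size=10):
    """Group deduplicated URLs by section, then cut each group into small batches."""
    by_section = defaultdict(list)
    for page in pages:
        urls = by_section[page["section"]]
        if page["url"] not in urls:  # drop duplicates within a section
            urls.append(page["url"])

    batches = []
    for section, urls in by_section.items():
        for start in range(0, len(urls), batch_size):
            batches.append({"section": section, "urls": urls[start:start + batch_size]})
    return batches


pages = [
    {"url": "/a/dashboard", "section": "dashboard"},
    {"url": "/a/tickets", "section": "tickets"},
    {"url": "/a/tickets/filters/all", "section": "tickets"},
    {"url": "/a/tickets", "section": "tickets"},  # duplicate entry
    {"url": "/a/admin/general", "section": "settings"},
]

for batch in batches_by_section(pages, batch_size=2):
    print(batch)
```

Each resulting batch is what one sub-agent would receive as its entire task description.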

Each sub-agent’s context stays small (~25K tokens/turn for ~40 turns = ~1.0M tokens per agent).

Token math

Original: sum of (context_at_turn_i) for 898 turns
        ≈ 195K average × 898 turns = 175M tokens

With sub-agents:
Discovery:  25K avg × 50 turns   =   1.25M
Agent A:    25K avg × 40 turns   =   1.0M
Agent B:    25K avg × 40 turns   =   1.0M
...x10 agents...
Total: ~1.25M + (10 × 1.0M)     =  ~11M tokens
Savings: ~94%

The key insight: 10 short conversations are dramatically cheaper than 1 long conversation, even if the total number of turns is the same, because context doesn’t accumulate across agents.
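A rough way to see this, assuming every conversation starts at about 20K tokens and grows by a constant amount per turn (about 390 tokens, which is what a straight line through the first and last rows of the context table gives). Both numbers are modeling assumptions, and real sub-agents would not all be the same length.

```python
def session_tokens(turns: int, base: int = 20_000, growth: float = 350_000 / 897) -> float:
    """Tokens processed by one conversation whose context grows linearly per turn."""
    return sum(base + growth * i for i in range(turns))


one_long = session_tokens(898)
ten_short = 10 * session_tokens(90)  # 900 turns in total, split across 10 fresh contexts

print(f"1 x 898 turns: {one_long / 1e6:6.1f}M tokens")
print(f"10 x 90 turns: {ten_short / 1e6:6.1f}M tokens")
```

Under these assumptions the split version processes roughly a fifth of the tokens for the same amount of browsing. The estimate in the token math above is lower still because it also assumes fewer total turns.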

Limitations

Sub-agents spawned via the Agent tool don’t have access to Chrome MCP tools by default — they use Bash, Read, Write, Grep, etc. The sub-agents would need to use Bash to run Playwright commands, or the top agent would need to do the Chrome work itself in batches.
If the site requires login and cookies, each sub-agent may need its own auth flow (or you share cookies via a file).
Coordinating between agents requires writing intermediate files (urls.json, progress tracking). This is simple but needs to be explicit in your prompts.

Strategy 3: Batched sessions with state file — you are the scheduler

Keep Claude as the browser operator, but break the work across multiple independent sessions that share progress via a file on disk.

The prompts

Session 1:

Navigate https://[redacted].helpdesk.com/. List every top-level section and
sub-page you can find. Write them to crawl-state.json with status "pending".
Don't take screenshots yet.

Session 2 (new Claude session):

Read crawl-state.json. Visit the first 15 pages with status "pending".
Take a full-page screenshot of each and save to screenshots/.
Update each page's status to "done" in the JSON file.

Session 3, 4, 5... (repeat):

Read crawl-state.json. Visit remaining pages with status "pending".
Take screenshots, save to screenshots/, update status to "done".

What happens

Each session starts with a fresh context. The state file provides continuity without the token cost of carrying conversation history. Claude reads crawl-state.json at the start, does its batch of work, updates the file, and exits.

{
  "pages": [
    {"url": "/a/dashboard", "status": "done", "section": "main"},
    {"url": "/a/tickets", "status": "done", "section": "main"},
    {"url": "/a/admin/general", "status": "pending", "section": "settings"},
    {"url": "/a/admin/email", "status": "pending", "section": "settings"},
    ...
  ]
}
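A sketch of the bookkeeping each session performs, assuming crawl-state.json has the shape shown above. The helper names are hypothetical; in practice Claude would do the equivalent with its own file tools.

```python
import json
from pathlib import Path


def load_state(path: str) -> dict:
    return json.loads(Path(path).read_text(encoding="utf-8"))


def save_state(path: str, state: dict) -> None:
    Path(path).write_text(json.dumps(state, indent=2), encoding="utf-8")


def next_batch(state: dict, size: int = 15) -> list[str]:
    """Return up to `size` URLs that are still pending, in file order."""
    pending = [p["url"] for p in state["pages"] if p["status"] == "pending"]
    return pending[:size]


def mark_done(state: dict, urls: list[str]) -> None:
    """Flip the given URLs to 'done' in place."""
    finished = set(urls)
    for page in state["pages"]:
        if page["url"] in finished:
            page["status"] = "done"


# In-memory example; a real session would call load_state() and save_state().
state = {
    "pages": [
        {"url": "/a/dashboard", "status": "done", "section": "main"},
        {"url": "/a/admin/general", "status": "pending", "section": "settings"},
        {"url": "/a/admin/email", "status": "pending", "section": "settings"},
    ]
}
batch = next_batch(state, size=1)
mark_done(state, batch)
print(batch, next_batch(state))
```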

Token math

Original:     1 session  × 898 turns × 195K avg context = 175M tokens
Batched:      1 session  × 50 turns  × 25K avg context  =  1.25M  (mapping)
            + 5 sessions × 60 turns  × 35K avg context  = 10.5M   (capture)
Total:                                                   ≈ 12M tokens
Savings:      ~93%

Trade-offs

Manual orchestration: You have to start each session yourself and tell it what batch to work on. You are the scheduler.
Login state: If the site requires login, you may need to log in at the start of each session (or keep the browser open so cookies persist).
State file can get out of sync: If a session crashes mid-batch, some pages might be visited but not marked “done.” You’d need to check for existing screenshots before re-visiting.
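That check can be automated. A minimal sketch, assuming screenshots are named after the URL path; the naming scheme is an assumption, since the prompts above leave filenames to Claude.

```python
from pathlib import Path


def screenshot_path(url_path: str, out_dir: Path) -> Path:
    """Map /a/admin/general to <out_dir>/a_admin_general.png (assumed naming)."""
    name = url_path.strip("/").replace("/", "_") or "index"
    return out_dir / f"{name}.png"


def reconcile(state: dict, out_dir: Path) -> int:
    """Mark pending pages as done when their screenshot already exists on disk."""
    fixed = 0
    for page in state["pages"]:
        if page["status"] == "pending" and screenshot_path(page["url"], out_dir).exists():
            page["status"] = "done"
            fixed += 1
    return fixed
```

Running `reconcile(state, Path("screenshots"))` at the start of each session would repair the state file after a crash before the next batch is chosen.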

When to use this

When you want Claude doing the browsing (maybe you need visual judgment or the site is tricky) but you’re willing to manage multiple sessions. A good middle ground between full automation and script-first.

Strategy 4: Reduce per-turn bloat — optimize the single-session approach

If you want to stick with a single session, you can still cut the cost significantly by controlling what enters the conversation context.

The prompt

Use chrome CDP to visit all pages in https://[redacted].helpdesk.com/ and
take screenshots. Rules:

- Save screenshots by running Playwright CLI commands via Bash, not via
  gif_creator. The screenshots should go to disk directly, not through
  our conversation.
- Navigate to pages by URL whenever possible. Don't click through menus
  to reach pages you already know the URL for.
- Don't summarize progress. Don't list what you've already done. Only
  speak if you hit an error or need my input.
- After every 50 pages, use /compact to reset the conversation context.
- Write visited URLs to visited.txt so you can resume after compacting.

What each rule does

4a. Bash screenshots instead of GIF recording

The original session made 212 gif_creator calls. GIF recording captures multiple frames and embeds image data into the conversation context. Every subsequent turn replays all that image data.

Instead, running a Playwright screenshot command via Bash:

npx playwright screenshot --full-page "https://example.com/dashboard" screenshots/dashboard.png

This saves the image to disk without it ever entering Claude’s conversation. Claude only sees “Screenshot saved” — a few tokens instead of thousands.
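If a script or agent issues this command for many URLs, the call can be wrapped so that only a one-line confirmation comes back. A sketch, with the output naming and the helper names as assumptions; the `npx playwright screenshot --full-page` invocation is the same one shown above.

```python
import subprocess
from urllib.parse import urlparse


def screenshot_command(url: str, out_dir: str = "screenshots") -> list[str]:
    """Build the Playwright CLI call that writes a full-page screenshot to disk."""
    name = urlparse(url).path.strip("/").replace("/", "_") or "index"
    return ["npx", "playwright", "screenshot", "--full-page", url, f"{out_dir}/{name}.png"]


def take_screenshot(url: str) -> str:
    """Run the command and return a short status line instead of the full output."""
    cmd = screenshot_command(url)
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return f"saved {cmd[-1]}"
    return f"failed {url}: exit code {result.returncode}"


print(screenshot_command("https://example.com/dashboard"))
```

Note that the CLI opens a fresh browser, so a page behind a login would also need saved authentication state supplied to it.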

Estimated savings: Removing GIF data from context could reduce per-turn context by 30-50%, which compounds across 898 turns. ~50M tokens saved.

4b. Direct URL navigation instead of UI clicking

The session made 235 computer clicks and 93 navigate calls. Clicking through menus costs multiple turns:

Clicking through a menu:
  Turn 1: Click "Admin" in sidebar       → +500 tokens to context
  Turn 2: Wait for menu to expand        → +300 tokens
  Turn 3: Click "Email Settings"         → +500 tokens
  Turn 4: Wait for page to load          → +300 tokens
  = 4 turns, ~1,600 tokens added

Direct navigation:
  Turn 1: navigate to /a/admin/email     → +400 tokens to context
  = 1 turn, ~400 tokens added

4x fewer turns, 4x less context growth per page. ~30M tokens saved across the session.

4c. Suppress progress reporting

Every time Claude says “I’ve now visited 15 pages, here’s what I found so far...” that text enters the conversation and is replayed on every subsequent turn. A 500-token progress report at turn 100 costs 500 x 798 = 399K tokens in cache reads over the remaining turns.

~10M tokens saved.
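The replay cost of any single message follows the same arithmetic. A tiny sketch, with a hypothetical function name, that assumes the text stays in the context unchanged until the end of the session:

```python
def replay_cost(tokens_added: int, at_turn: int, total_turns: int = 898) -> int:
    """Tokens re-read on later turns because of text added to the context at `at_turn`."""
    return tokens_added * (total_turns - at_turn)


early = replay_cost(500, at_turn=100)
late = replay_cost(500, at_turn=800)
print(early, late)
```

The same 500-token report is about eight times cheaper at turn 800 than at turn 100, which is why chatter early in a long session is the most expensive kind.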

4d. Periodic context reset via /compact

If the conversation gets long, /compact summarizes the history and replaces it with that summary, so later turns replay a much smaller context. Even one mid-session reset at turn 450 would save roughly:

Without reset: turns 451-898 replay ~190K-370K context each
With reset:    turns 451-898 replay ~20K-190K context each
Savings:       ~40M tokens from this one reset alone
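Because compaction discards detail, the visited.txt file from the prompt is what lets the crawl resume without revisiting pages. A sketch of that bookkeeping, with hypothetical helper names:

```python
from pathlib import Path

VISITED = Path("visited.txt")


def record_visit(url: str, path: Path = VISITED) -> None:
    """Append one URL per line so the log survives a context reset."""
    with path.open("a", encoding="utf-8") as f:
        f.write(url + "\n")


def load_visited(path: Path = VISITED) -> set[str]:
    """Read the log back; an absent file means nothing has been visited yet."""
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def remaining(all_urls: list[str], path: Path = VISITED) -> list[str]:
    """URLs still to visit, in their original order."""
    seen = load_visited(path)
    return [url for url in all_urls if url not in seen]
```

After a compaction, calling `remaining(...)` with the full URL list gives the pages still to capture, without relying on anything in the summarized history.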

Combined impact

| Optimization                        | Estimated savings                |
| ----------------------------------- | -------------------------------- |
| Bash screenshots instead of GIFs    | ~50M tokens                      |
| Direct navigation instead of clicks | ~30M tokens                      |
| Suppress progress reports           | ~10M tokens                      |
| One mid-session `/compact`          | ~40M tokens                      |
| **Combined**                        | **~130M tokens (74% reduction)** |

Even without restructuring the approach, being deliberate about what enters the conversation context could have cut the cost from 175M to roughly 45M tokens, if the four estimates are treated as additive.

Recommendation matrix

| Strategy                  | Token savings | Effort                             | Best when...                                  |
| ------------------------- | ------------- | ---------------------------------- | --------------------------------------------- |
| 1. Script-first           | ~99.96%       | Low — one prompt                   | Task is mechanical and predictable            |
| 2. Top agent + sub-agents | ~94%          | Medium — structured prompts        | You want Claude browsing but need parallelism |
| 3. Batched sessions       | ~93%          | Medium — manual session management | You want interactive control per batch        |
| 4. Reduce bloat           | ~74%          | Low — prompt changes only          | You want the simplest change                  |

For mechanical tasks (crawling, scraping, bulk screenshots): Strategy 1 is the clear winner. There’s no reason for Claude to be in the loop during the crawl. Write the script once, run it, done.

For tasks requiring judgment mid-browse (e.g., “explore this site and identify UX issues”): Strategy 2 gives you the best balance of intelligence and efficiency.

The general rule: If Claude doesn’t need to see page N to decide what to do on page N+1, the pages are independent work units. Don’t process them sequentially in a single growing conversation. Use a script, spawn parallel agents, or batch into separate sessions. The cost of a long conversation isn’t linear — it’s quadratic. That’s the trap.

The AI-Native Vertical Integration Thesis

Joyjeet Sarkar — Sat, 04 Apr 2026 19:05:33 GMT

Core Claim: A fully vertically integrated AI-native company will outperform traditionally structured competitors by an order of magnitude — not through any single advantage, but through compounding gains across every layer of the business.

The Flywheel

Layer 1 — Developer Velocity (Verifiable)

Individual developers produce more, faster. AI-augmented coding, testing, and debugging are instrumented and measurable — commits, PR throughput, cycle time are all observable. This isn’t a productivity claim taken on faith. The AI tooling itself generates the audit trail. Verifiability matters because it makes the advantage demonstrable to investors, recruits, and customers.

Layer 2 — Review & Deployment Throughput

Speed at the developer level is worthless if review, QA, and release remain traditional bottlenecks. An AI-native company attacks this layer with the same intensity — AI-assisted code review, automated test generation, AI-augmented CI/CD. The critical insight: Layer 1 must not create a new bottleneck. Layer 2 has to scale with it, or the system stalls.

Layer 3 — Product Judgment as a Structural Function

More shipping capacity is not inherently more value. Speed without product taste produces bloat faster. This layer is the governor on the flywheel — a deliberate, embedded function that determines what gets built, not just how fast. AI-native companies can afford to experiment cheaply (see Layer 5), but someone has to decide which experiments matter. This is the layer most AI productivity frameworks ignore, and where most will fail.

Layer 4 — Feature Velocity & Market Expansion

With Layers 1-3 functioning, the company ships more of the right things. It can serve adjacent market segments without proportionally growing headcount. Competitors see a company operating at a pace that seems impossible for its team size. Timelines compress. The same deliverables that took quarters now take weeks.

Layer 5 — Combinatorial Exploration

This is the true source of exponential divergence. When building a feature costs 10x less in time and resources, the company can afford to try 10x more things. Most experiments will fail. But the absolute number of hits goes up dramatically, even if the hit rate stays the same. The company discovers market opportunities that competitors cannot economically afford to explore. This is not linear speed; it is combinatorial advantage.

Layer 6 — Customer Impact & Retention

Better product, faster iteration on feedback, more responsiveness. Usage increases, retention improves, expansion revenue grows. More usage generates more signal, which feeds back into Layer 3 (what to build) and Layer 1 (how to build it). The flywheel closes.

The Hidden Structural Advantages

Organizational Redesign

The deepest advantage is not technical — it is organizational. An AI-native company has a fundamentally different shape: fewer people, flatter hierarchy, less coordination overhead. A 15-person AI-native team competing against a 150-person traditional team isn’t just cheaper. It is faster at deciding — fewer stakeholders, fewer meetings, fewer approval chains. The speed gain from organizational simplicity may exceed the speed gain from tooling.

Compounding Institutional Knowledge

When AI is embedded in the development process, institutional knowledge accumulates differently. Coding patterns, architectural decisions, and testing strategies get encoded in prompts, rules, and configurations. New hires ramp faster. Key-person risk drops. This creates a durable, invisible moat.

Talent Arbitrage

AI-native companies can hire fewer, more senior people who are comfortable working alongside AI, rather than building large teams of mid-level engineers. The cost structure, culture, and output quality all shift.

Honest Constraints

Partial adoption fails. A company that AI-augments developers but leaves review, deployment, or product decisions traditional will simply relocate the bottleneck. Vertical integration is the thesis — not just AI adoption.

Cost shifts, not just falls. AI tooling carries real costs — API spend, infrastructure, workflow maintenance, the debugging tax on AI-generated code at scale. The defensible claim is that cost per unit of value delivered falls. Total spend may stay flat or rise because the company is doing far more.

10x requires org redesign. Bolting AI onto a traditional organization yields perhaps 2-3x. The order-of-magnitude claim requires rethinking team structure, decision-making, and hiring — not just tooling. This distinction must be explicit.

The product judgment gap is existential. Without embedded product taste and customer understanding, the flywheel spins into waste. Speed amplifies whatever direction a company is already heading — including the wrong one.

Summary

The AI-native vertical integration thesis is not a technology argument. It is a systems argument. Each layer — developer velocity, deployment throughput, product judgment, market expansion, combinatorial exploration, and customer impact — must be AI-native, or the chain breaks. The companies that understand this will not simply be faster versions of their competitors. They will be structurally different organisms, operating at a pace and cost structure that traditional organizations cannot match without fundamental reinvention.