The Right Way for AI Agents to Understand a Web Page

When an AI agent needs to interact with a web page, the three usual approaches all fall short.

Screenshot + vision model: The agent takes a screenshot and asks a vision model to describe the UI. This works but burns tokens parsing pixels into intent that was already in the DOM as structured data.

Raw DOM: The agent passes the full HTML to the model. A typical page is 50–200 KB of HTML. After tokenization, that is roughly 15,000–60,000 tokens, most of it irrelevant noise from style attributes, tracking scripts, and wrapper divs.

Manual selector guessing: The agent tries #submit, then .submit-btn, then button[type=submit], failing forward until something clicks. Fine for a demo, wrong for production.
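
For a sense of where the raw-DOM token figures above come from, here is a back-of-the-envelope estimate. The ratio of about 3.3 characters per token is an assumption chosen to match those figures; real tokenizers vary by model and by how dense the markup is. `estimateTokens` is a hypothetical helper, not part of any API.

```javascript
// Rough token estimate for an HTML payload.
// ASSUMPTION: ~3.3 characters per token; the real ratio depends on the tokenizer.
const CHARS_PER_TOKEN = 3.3;

function estimateTokens(htmlBytes, charsPerToken = CHARS_PER_TOKEN) {
  return Math.round(htmlBytes / charsPerToken);
}

console.log(estimateTokens(50_000));  // on the order of 15,000 tokens
console.log(estimateTokens(200_000)); // on the order of 60,000 tokens
```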

There's a better primitive: ask for the structured element map directly.

What `/inspect` returns

PageBolt's /inspect endpoint visits a URL and returns only what matters for interaction:

const res = await fetch('https://pagebolt.dev/api/v1/inspect', {
  method: 'POST',
  headers: { 'x-api-key': process.env.PAGEBOLT_API_KEY, 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://yourapp.com/signup' })
});

const map = await res.json();

Response:

{
  "elements": [
    {
      "tag": "button",
      "role": "button",
      "text": "Create account",
      "selector": "#signup-submit",
      "attributes": { "type": "submit" },
      "rect": { "x": 120, "y": 520, "width": 200, "height": 44 }
    },
    {
      "tag": "input",
      "role": "textbox",
      "text": "",
      "selector": "#email",
      "attributes": { "type": "email", "placeholder": "Email address", "required": true },
      "rect": { "x": 120, "y": 360, "width": 320, "height": 40 }
    },
    {
      "tag": "input",
      "role": "textbox",
      "text": "",
      "selector": "#password",
      "attributes": { "type": "password", "placeholder": "Password", "required": true },
      "rect": { "x": 120, "y": 420, "width": 320, "height": 40 }
    }
  ],
  "forms": [
    { "selector": "#signup-form", "action": "/api/signup", "method": "post" }
  ],
  "links": [
    { "text": "Terms of service", "selector": "footer a[href='/terms']", "href": "/terms" }
  ],
  "headings": [
    { "level": "h1", "text": "Start your free trial", "selector": "h1.page-title" }
  ]
}

This is what structured page understanding looks like. No pixels. No DOM noise. An agent that receives this map knows which selectors to use immediately, without a trial-and-error loop.
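
Because the map is plain JSON, an agent can query it with ordinary array methods. The sketch below assumes the response shape shown above; `findElement` and `sampleMap` are hypothetical names used for illustration only.

```javascript
// Hypothetical helper: return the first element that matches every given criterion.
// Assumes the shape shown above: { elements: [{ tag, role, text, selector, attributes }] }.
function findElement(map, { tag, role, type, text } = {}) {
  const needle = text?.toLowerCase();
  return (map.elements ?? []).find((el) =>
    (tag === undefined || el.tag === tag) &&
    (role === undefined || el.role === role) &&
    (type === undefined || el.attributes?.type === type) &&
    (needle === undefined || (el.text ?? '').toLowerCase().includes(needle))
  );
}

// Trimmed copy of the sample response above.
const sampleMap = {
  elements: [
    { tag: 'button', role: 'button', text: 'Create account', selector: '#signup-submit', attributes: { type: 'submit' } },
    { tag: 'input', role: 'textbox', text: '', selector: '#email', attributes: { type: 'email' } },
  ],
};

console.log(findElement(sampleMap, { role: 'button', text: 'create' })?.selector); // "#signup-submit"
console.log(findElement(sampleMap, { type: 'email' })?.selector);                  // "#email"
```

The helper returns `undefined` when nothing matches, so callers can decide whether a missing element is an error or just a page variant.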

Using it before automation

The pattern that matters: inspect first, then act.

// Step 1: understand the page
const inspectRes = await fetch('https://pagebolt.dev/api/v1/inspect', {
  method: 'POST',
  headers: { 'x-api-key': process.env.PAGEBOLT_API_KEY, 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://yourapp.com/signup' })
});
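if (!inspectRes.ok) throw new Error(`inspect failed: HTTP ${inspectRes.status}`);
const map = await inspectRes.json();

// Step 2: pull selectors from the elements array
const emailInput    = map.elements.find(e => e.attributes?.type === 'email');
const passwordInput = map.elements.find(e => e.attributes?.type === 'password');
const submitBtn     = map.elements.find(e => e.tag === 'button' && e.attributes?.type === 'submit');

// Stop early if the page did not expose the expected elements
if (!emailInput || !passwordInput || !submitBtn) {
  throw new Error('signup form elements not found in inspect map');
}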

// Step 3: execute with confidence — no guessed selectors
const res = await fetch('https://pagebolt.dev/api/v1/sequence', {
  method: 'POST',
  headers: { 'x-api-key': process.env.PAGEBOLT_API_KEY, 'Content-Type': 'application/json' },
  body: JSON.stringify({
    steps: [
      { action: 'navigate', url: 'https://yourapp.com/signup' },
      { action: 'fill', selector: emailInput.selector, value: 'user@example.com' },
      { action: 'fill', selector: passwordInput.selector, value: 'securepassword' },
      { action: 'click', selector: submitBtn.selector },
      { action: 'screenshot' }
    ]
  })
});

No guessed selectors. No vision model overhead. One inspect call gives the agent a reliable map; the sequence runs against verified selectors.
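
If several agents run the same flow, the inspect-then-act steps can be packaged as one pure function that is easy to unit test without touching the network. This is a minimal sketch: `buildSignupSteps` is a hypothetical helper, and the step objects simply mirror the sequence example above.

```javascript
// Hypothetical helper: turn an inspect map into sequence steps, failing early
// and descriptively if an expected element is missing from the map.
function buildSignupSteps(map, url, { email, password }) {
  const byType = (type) => map.elements?.find((el) => el.attributes?.type === type);
  const required = {
    email: byType('email'),
    password: byType('password'),
    submit: byType('submit'),
  };

  const missing = Object.keys(required).filter((name) => !required[name]);
  if (missing.length > 0) {
    throw new Error(`inspect map for ${url} is missing: ${missing.join(', ')}`);
  }

  return [
    { action: 'navigate', url },
    { action: 'fill', selector: required.email.selector, value: email },
    { action: 'fill', selector: required.password.selector, value: password },
    { action: 'click', selector: required.submit.selector },
    { action: 'screenshot' },
  ];
}
```

The returned array can be passed as the `steps` field of the sequence request body; the error message names exactly which elements were absent, which helps when a page layout changes.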

Why this matters for agents at scale

Agents that interact with the same app repeatedly shouldn't re-parse the DOM on every run. Cache the inspect result. Most deploys don't change a page's core form structure, so inspecting once per deployment, storing the map, and running sequences against it is usually enough.
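
One way to sketch that caching pattern is an in-memory map keyed by deployment and URL. Everything here is an assumption for illustration: `deployId` stands for whatever version string your deploy pipeline exposes, and `inspectFn` is your own wrapper around the /inspect call shown earlier. A production setup would more likely use a shared store than process memory.

```javascript
// In-memory cache of inspect results, keyed by deployment identifier plus page URL.
const inspectCache = new Map();

async function getPageMap(url, deployId, inspectFn) {
  const key = `${deployId}:${url}`;
  if (!inspectCache.has(key)) {
    // Store the promise so concurrent callers share one in-flight request.
    const pending = inspectFn(url).catch((err) => {
      inspectCache.delete(key); // do not cache failures
      throw err;
    });
    inspectCache.set(key, pending);
  }
  return inspectCache.get(key);
}

// Usage with a stand-in inspect function that counts how often it is called.
let calls = 0;
const fakeInspect = async (url) => { calls += 1; return { elements: [], url }; };

(async () => {
  await getPageMap('https://yourapp.com/signup', 'deploy-41', fakeInspect);
  await getPageMap('https://yourapp.com/signup', 'deploy-41', fakeInspect); // cache hit
  await getPageMap('https://yourapp.com/signup', 'deploy-42', fakeInspect); // new deploy, re-inspect
  console.log(calls); // 2
})();
```

Keying on the deployment identifier means a new release invalidates the map automatically, with no explicit expiry logic.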

At 1,000 automations/day, a cached map replaces up to 1,000 DOM-parsing steps and 1,000 vision model calls with a single lookup each, and it removes the selector failures that come from visual ambiguity.

The right way to give an AI agent eyes on a page isn't a screenshot. It's a structured map of what's there and how to reach it.
