The Right Way for AI Agents to Understand a Web Page
Most AI agents read the DOM or take a screenshot to understand a page. Neither is efficient. The /inspect endpoint returns a structured element map — exactly what an agent needs to act, with zero visual parsing.
When an AI agent needs to interact with a web page, the usual approaches are wrong.
Screenshot + vision model: The agent takes a screenshot and asks a vision model to describe the UI. This works but burns tokens parsing pixels into intent that was already in the DOM as structured data.
Raw DOM: Pass the full HTML to the model. A typical page is 50–200KB of HTML. After tokenization, that's 15,000–60,000 tokens — most of it irrelevant noise from style attributes, tracking scripts, and wrapper divs.
Manual selector guessing: The agent tries #submit, then .submit-btn, then button[type=submit], failing forward until something clicks. Fine for a demo, wrong for production.
There's a better primitive: ask for the structured element map directly.
What /inspect returns
PageBolt's /inspect endpoint visits a URL and returns only what matters for interaction:
const res = await fetch('https://pagebolt.dev/api/v1/inspect', {
  method: 'POST',
  headers: { 'x-api-key': process.env.PAGEBOLT_API_KEY, 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://yourapp.com/signup' })
});
const map = await res.json();
Response:
{
  "elements": [
    {
      "tag": "button",
      "role": "button",
      "text": "Create account",
      "selector": "#signup-submit",
      "attributes": { "type": "submit" },
      "rect": { "x": 120, "y": 520, "width": 200, "height": 44 }
    },
    {
      "tag": "input",
      "role": "textbox",
      "text": "",
      "selector": "#email",
      "attributes": { "type": "email", "placeholder": "Email address", "required": true },
      "rect": { "x": 120, "y": 360, "width": 320, "height": 40 }
    },
    {
      "tag": "input",
      "role": "textbox",
      "text": "",
      "selector": "#password",
      "attributes": { "type": "password", "placeholder": "Password", "required": true },
      "rect": { "x": 120, "y": 420, "width": 320, "height": 40 }
    }
  ],
  "forms": [
    { "selector": "#signup-form", "action": "/api/signup", "method": "post" }
  ],
  "links": [
    { "text": "Terms of service", "selector": "footer a[href='/terms']", "href": "/terms" }
  ],
  "headings": [
    { "level": "h1", "text": "Start your free trial", "selector": "h1.page-title" }
  ]
}
This is what structured page understanding looks like: no pixels, no DOM noise. An agent receives this and immediately knows which selectors to use, with no screenshot-and-retry loop.
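Because every element in the response carries both a role and its visible text, an agent can resolve a target by accessible name instead of hard-coding selectors. A minimal helper sketch (the `findSelector` function is ours, for illustration, not part of the PageBolt API):

```javascript
// Resolve a selector from an inspect map by ARIA role and visible text.
// Matches text case-insensitively so "create account" finds "Create account".
function findSelector(map, role, text) {
  const el = map.elements.find(
    e => e.role === role && e.text.toLowerCase() === text.toLowerCase()
  );
  return el ? el.selector : null;
}
```

With the sample response above, `findSelector(map, 'button', 'create account')` resolves to `#signup-submit`.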
Using it before automation
The pattern that matters: inspect first, then act.
// Step 1: understand the page
const inspectRes = await fetch('https://pagebolt.dev/api/v1/inspect', {
  method: 'POST',
  headers: { 'x-api-key': process.env.PAGEBOLT_API_KEY, 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://yourapp.com/signup' })
});
const map = await inspectRes.json();

// Step 2: pull selectors from the elements array
const emailInput = map.elements.find(e => e.attributes?.type === 'email');
const passwordInput = map.elements.find(e => e.attributes?.type === 'password');
const submitBtn = map.elements.find(e => e.tag === 'button' && e.attributes?.type === 'submit');

// Guard against a page that doesn't match expectations before acting
if (!emailInput || !passwordInput || !submitBtn) {
  throw new Error('Expected signup fields not found in inspect map');
}

// Step 3: execute with confidence — no guessed selectors
const res = await fetch('https://pagebolt.dev/api/v1/sequence', {
  method: 'POST',
  headers: { 'x-api-key': process.env.PAGEBOLT_API_KEY, 'Content-Type': 'application/json' },
  body: JSON.stringify({
    steps: [
      { action: 'navigate', url: 'https://yourapp.com/signup' },
      { action: 'fill', selector: emailInput.selector, value: 'user@example.com' },
      { action: 'fill', selector: passwordInput.selector, value: 'securepassword' },
      { action: 'click', selector: submitBtn.selector },
      { action: 'screenshot' }
    ]
  })
});
No guessed selectors. No vision model overhead. One inspect call gives the agent a reliable map; the sequence runs against verified selectors.
Why this matters for agents at scale
Agents that interact with the same app repeatedly shouldn't re-parse the DOM on every run. Cache the inspect result. Pages don't change their core form structure on every deploy — inspect once per deployment, store the map, run sequences against it.
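A minimal sketch of that caching pattern. The cache key, the `deployVersion` parameter, and the in-memory `Map` are our assumptions for illustration; use whatever deploy identifier your app exposes (git SHA, build number) to invalidate on redeploy:

```javascript
// Cache inspect results keyed by deploy version + URL, so repeated
// automations reuse one element map instead of re-inspecting every run.
const inspectCache = new Map();

async function getElementMap(url, deployVersion) {
  // deployVersion is an assumed identifier that changes on each deploy
  const key = `${deployVersion}:${url}`;
  if (inspectCache.has(key)) return inspectCache.get(key);

  const res = await fetch('https://pagebolt.dev/api/v1/inspect', {
    method: 'POST',
    headers: {
      'x-api-key': process.env.PAGEBOLT_API_KEY,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ url })
  });
  if (!res.ok) throw new Error(`inspect failed: ${res.status}`);

  const map = await res.json();
  inspectCache.set(key, map);
  return map;
}
```

After the first call per deploy, every subsequent automation reads the map from memory; a shared store like Redis would serve the same role across processes.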
At 1,000 automations/day, that's 1,000 fewer DOM-parsing steps, 1,000 fewer vision model calls, and a dramatic reduction in selector failures caused by visual ambiguity.
The right way to give an AI agent eyes on a page isn't a screenshot. It's a structured map of what's there and how to reach it.