Extract Clean Markdown from Any URL in 3 Lines
You're building an AI agent. It needs to read web pages. So you pass raw HTML to your LLM — and burn 96% of your tokens on scripts, ads, and navigation menus. There's a better way.
The Problem: HTML Noise
When you feed raw HTML to an LLM, you're giving it:
- Scripts and stylesheets (ignored)
- Navigation menus and footers (ignored)
- Ads and tracking pixels (ignored)
- 48KB of combined markup and boilerplate (wasted tokens)
- 2KB of actual content (what you need)
Your agent pays for all 50KB but can only use 2KB. That's 96% waste — and that waste compounds across every request your agent makes.
The Solution: /extract
PageBolt's /extract endpoint takes a URL, pulls the main content, converts it to clean Markdown, and returns it. One call. No HTML parsing. No noise.
```javascript
const response = await fetch('https://pagebolt.dev/api/v1/extract', {
  method: 'POST',
  headers: {
    'x-api-key': process.env.PAGEBOLT_API_KEY,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ url: 'https://example.com/blog/article' })
});

const { markdown, title, byline } = await response.json();
// markdown: "# Article Title\n\nClean content..."
```
That's it. Three lines. The URL becomes Markdown your LLM can actually use.
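In production you'll want to handle failures (rate limits, unreachable URLs, bad API keys) rather than parse an error body as content. A minimal sketch of a wrapper around the call above — the endpoint and header names come from the example; the error handling is an assumption, so adjust it to the API's real error responses:

```javascript
// Wraps the /extract call with basic error handling.
// fetchImpl is injectable only for testability; in real use, omit it.
async function extractMarkdown(url, fetchImpl = fetch) {
  const res = await fetchImpl('https://pagebolt.dev/api/v1/extract', {
    method: 'POST',
    headers: {
      'x-api-key': process.env.PAGEBOLT_API_KEY,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ url })
  });
  if (!res.ok) {
    // Surface the status instead of feeding an error body to your LLM.
    throw new Error(`Extract failed for ${url}: HTTP ${res.status}`);
  }
  return res.json();
}
```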
Real Example: AI Research Summarizer
Consider an AI agent that summarizes research papers: you feed it URLs, and it extracts and understands the content.
```javascript
const Anthropic = require('@anthropic-ai/sdk');

const client = new Anthropic();

async function summarizeResearchPaper(paperUrl) {
  // Step 1: Extract clean Markdown
  const extractRes = await fetch('https://pagebolt.dev/api/v1/extract', {
    method: 'POST',
    headers: {
      'x-api-key': process.env.PAGEBOLT_API_KEY,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ url: paperUrl })
  });
  const { markdown, title, byline } = await extractRes.json();

  // Step 2: Pass clean Markdown to Claude
  const message = await client.messages.create({
    model: 'claude-opus-4-6',
    max_tokens: 1024,
    messages: [{
      role: 'user',
      content: `Summarize in 3 bullet points:\n\nTitle: ${title}\nAuthor: ${byline || 'Unknown'}\n\n${markdown}`
    }]
  });

  return message.content[0].text;
}

const summary = await summarizeResearchPaper('https://arxiv.org/abs/2406.12345');
console.log(summary);
```
The Token Math
Token efficiency directly impacts cost and speed for any agent processing web content at scale:
| Approach | Input Tokens | Cost per Request |
|---|---|---|
| Raw HTML → LLM | ~50,000 | ~$1.50 |
| /extract → LLM | ~2,000 | ~$0.06 |
25x cheaper, with no loss in output quality. LLMs also parse Markdown more reliably than raw HTML, so accuracy often improves as well.
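The table's numbers reduce to simple arithmetic. A quick sketch — the ~$30-per-million-input-token rate is inferred from the table above, not an official model price:

```javascript
// Cost per request, given input tokens and a per-million-token rate.
function costPerRequest(inputTokens, ratePerMillion) {
  return (inputTokens / 1_000_000) * ratePerMillion;
}

// Rate of ~$30/M input tokens is assumed from the table above.
const rawHtml = costPerRequest(50_000, 30);   // ≈ $1.50
const extracted = costPerRequest(2_000, 30);  // ≈ $0.06
const savingsFactor = rawHtml / extracted;    // ≈ 25x
```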
What /extract Returns
```json
{
  "markdown": "# Article Title\n\nArticle content in clean Markdown...",
  "title": "Article Title",
  "byline": "Author Name",
  "excerpt": "First paragraph or meta description...",
  "wordCount": 1200,
  "siteName": "Example",
  "lang": "en",
  "url": "https://example.com/blog/article"
}
```
Everything you need: content, metadata, and source context — ready to pipe directly into your agent's context window.
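If you're piping responses straight into an agent, a small guard keeps malformed responses out of your context window. A sketch checked against the fields shown above (treating everything beyond the core fields as optional):

```javascript
// Validates that an /extract response has the fields an agent relies on.
// Field names mirror the example response above.
function isExtractResult(data) {
  return (
    typeof data === 'object' && data !== null &&
    typeof data.markdown === 'string' &&
    typeof data.title === 'string' &&
    typeof data.url === 'string'
  );
}
```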
Use Cases
- Research aggregator — Extract from 100 papers, summarize trends
- Competitive intelligence — Extract competitor pages, feed to analysis agent
- Documentation agent — Extract API docs from URLs, answer questions about them
- News digest — Extract articles, summarize daily news for users
- Content curator — Extract blog posts, categorize by topic automatically
- Customer support — Extract help docs, keep support agent current
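The research-aggregator case means firing many extractions at once, and a naive Promise.all over 100 URLs can trip rate limits. A concurrency-limited batch helper — the limit of 5 is an arbitrary choice, and extractFn stands in for a real /extract call:

```javascript
// Runs extractFn over urls with at most `limit` requests in flight,
// preserving input order in the results.
async function extractAll(urls, extractFn, limit = 5) {
  const results = new Array(urls.length);
  let next = 0;
  async function worker() {
    while (next < urls.length) {
      const i = next++; // safe: no await between read and increment
      results[i] = await extractFn(urls[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, urls.length) }, worker)
  );
  return results;
}
```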
Pricing
| Plan | Requests/Month | Cost |
|---|---|---|
| Free | 100 | $0 |
| Hobby | 500 | $9/mo |
| Starter | 5,000 | $29/mo |
| Growth | 25,000 | $79/mo |
| Scale | 100,000 | $199/mo |
For agents processing web content regularly, /extract is the most token-efficient way to feed your LLM real-world data. At 5,000 extractions/month you pay $29, while at the per-request rates above your token bill drops from roughly $7,500 (raw HTML) to about $300.
Frequently Asked Questions
Does /extract work on pages behind JavaScript rendering?
Yes. PageBolt runs a real Chromium browser under the hood, so JavaScript-rendered pages, SPAs, and React/Next.js apps are all fully rendered before extraction. You get the same content a real user would see.
What does "clean Markdown" mean exactly?
The endpoint uses Readability (the same algorithm Firefox Reader Mode uses) to identify the main content area, then converts it to Markdown. Navigation menus, ads, sidebars, footers, and cookie banners are stripped. Only the article or page body remains — headings, paragraphs, lists, code blocks, and links.
Can I extract pages that require authentication?
Yes. Pass session cookies via the cookies parameter or HTTP headers via headers. This lets you extract content from paywalled sites, internal tools, or any page that requires a logged-in session.
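A sketch of what an authenticated request body might look like. The exact shape of the cookies and headers parameters here is an assumption, so check the API reference for the real format:

```javascript
// Hypothetical payload shape for an authenticated extraction.
const payload = {
  url: 'https://app.example.com/internal/report',
  // Assumed shape: name/value pairs forwarded to the browser session.
  cookies: [{ name: 'session_id', value: process.env.SESSION_ID }],
  // Assumed shape: extra HTTP headers sent with the page request.
  headers: { 'Authorization': `Bearer ${process.env.API_TOKEN}` }
};
```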
Give your AI agent clean web content
Start free — 100 extractions/month, no credit card.
Start Free at pagebolt.dev/pricing