Extract Clean Markdown from Any URL in 3 Lines
You're building an AI agent. It needs to read web pages. So you pass raw HTML to your LLM — and burn 96% of your tokens on scripts, ads, and navigation menus. There's a better way.
The Problem: HTML Noise
When you feed raw HTML to an LLM, you're giving it:
- Scripts and stylesheets (ignored)
- Navigation menus and footers (ignored)
- Ads and tracking pixels (ignored)
- 48KB of combined markup and boilerplate (wasted tokens)
- 2KB of actual content (what you need)
Your agent pays for all 50KB but can only use 2KB. That's 96% waste — and that waste compounds across every request your agent makes.
The Solution: /extract
PageBolt's /extract endpoint takes a URL, pulls the main content, converts it to clean Markdown, and returns it. One call. No HTML parsing. No noise.
```javascript
const response = await fetch('https://pagebolt.dev/api/v1/extract', {
  method: 'POST',
  headers: {
    'x-api-key': process.env.PAGEBOLT_API_KEY,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ url: 'https://example.com/blog/article' })
});

const { markdown, title, byline } = await response.json();
// markdown: "# Article Title\n\nClean content..."
```
That's it. Three lines. The URL becomes Markdown your LLM can actually use.
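In production you'll want to handle failures (rate limits, unreachable URLs, bad API keys) rather than parse an error body as content. A minimal sketch of a wrapper around the call above — the endpoint and header names come from the example; the error handling is an assumption, so adjust it to the API's real error responses:

```javascript
// Wraps the /extract call with basic error handling.
// fetchImpl is injectable only for testability; in real use, omit it.
async function extractMarkdown(url, fetchImpl = fetch) {
  const res = await fetchImpl('https://pagebolt.dev/api/v1/extract', {
    method: 'POST',
    headers: {
      'x-api-key': process.env.PAGEBOLT_API_KEY,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ url })
  });
  if (!res.ok) {
    // Surface the status instead of feeding an error body to your LLM.
    throw new Error(`Extract failed for ${url}: HTTP ${res.status}`);
  }
  return res.json();
}
```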
Real Example: AI Research Summarizer
Consider an AI agent that summarizes research papers: you feed it URLs, and it extracts and understands the content.
```javascript
const Anthropic = require('@anthropic-ai/sdk');

const client = new Anthropic();

async function summarizeResearchPaper(paperUrl) {
  // Step 1: Extract clean Markdown
  const extractRes = await fetch('https://pagebolt.dev/api/v1/extract', {
    method: 'POST',
    headers: {
      'x-api-key': process.env.PAGEBOLT_API_KEY,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ url: paperUrl })
  });
  const { markdown, title, byline } = await extractRes.json();

  // Step 2: Pass clean Markdown to Claude
  const message = await client.messages.create({
    model: 'claude-opus-4-6',
    max_tokens: 1024,
    messages: [{
      role: 'user',
      content: `Summarize in 3 bullet points:\n\nTitle: ${title}\nAuthor: ${byline || 'Unknown'}\n\n${markdown}`
    }]
  });

  return message.content[0].text;
}

const summary = await summarizeResearchPaper('https://arxiv.org/abs/2406.12345');
console.log(summary);
```
The Token Math
Token efficiency directly impacts cost and speed for any agent processing web content at scale:
| Approach | Input Tokens | Cost per Request |
|---|---|---|
| Raw HTML → LLM | ~50,000 | ~$1.50 |
| /extract → LLM | ~2,000 | ~$0.06 |
25x cheaper, with no loss in output quality. LLMs also parse Markdown more reliably than raw HTML, so accuracy often improves as well.
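The table's numbers reduce to simple arithmetic. A quick sketch — the ~$30-per-million-input-token rate is inferred from the table above, not an official model price:

```javascript
// Cost per request, given input tokens and a per-million-token rate.
function costPerRequest(inputTokens, ratePerMillion) {
  return (inputTokens / 1_000_000) * ratePerMillion;
}

// Rate of ~$30/M input tokens is assumed from the table above.
const rawHtml = costPerRequest(50_000, 30);   // ≈ $1.50
const extracted = costPerRequest(2_000, 30);  // ≈ $0.06
const savingsFactor = rawHtml / extracted;    // ≈ 25x
```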
What /extract Returns
```json
{
  "markdown": "# Article Title\n\nArticle content in clean Markdown...",
  "title": "Article Title",
  "byline": "Author Name",
  "excerpt": "First paragraph or meta description...",
  "wordCount": 1200,
  "siteName": "Example",
  "lang": "en",
  "url": "https://example.com/blog/article"
}
```
Everything you need: content, metadata, and source context — ready to pipe directly into your agent's context window.
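If you're piping responses straight into an agent, a small guard keeps malformed responses out of your context window. A sketch checked against the fields shown above (treating everything beyond the core fields as optional):

```javascript
// Validates that an /extract response has the fields an agent relies on.
// Field names mirror the example response above.
function isExtractResult(data) {
  return (
    typeof data === 'object' && data !== null &&
    typeof data.markdown === 'string' &&
    typeof data.title === 'string' &&
    typeof data.url === 'string'
  );
}
```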
Use Cases
- Research aggregator — Extract from 100 papers, summarize trends
- Competitive intelligence — Extract competitor pages, feed to analysis agent
- Documentation agent — Extract API docs from URLs, answer questions about them
- News digest — Extract articles, summarize daily news for users
- Content curator — Extract blog posts, categorize by topic automatically
- Customer support — Extract help docs, keep support agent current
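The research-aggregator case means firing many extractions at once, and a naive Promise.all over 100 URLs can trip rate limits. A concurrency-limited batch helper — the limit of 5 is an arbitrary choice, and extractFn stands in for a real /extract call:

```javascript
// Runs extractFn over urls with at most `limit` requests in flight,
// preserving input order in the results.
async function extractAll(urls, extractFn, limit = 5) {
  const results = new Array(urls.length);
  let next = 0;
  async function worker() {
    while (next < urls.length) {
      const i = next++; // safe: no await between read and increment
      results[i] = await extractFn(urls[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, urls.length) }, worker)
  );
  return results;
}
```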
Pricing
| Plan | Requests/Month | Cost |
|---|---|---|
| Free | 100 | $0 |
| Hobby | 500 | $9/mo |
| Starter | 5,000 | $29/mo |
| Growth | 25,000 | $79/mo |
| Scale | 100,000 | $199/mo |
For agents processing web content regularly, /extract is the most token-efficient way to feed your LLM real-world data. At 5,000 extractions/month you pay $29, while at the per-request rates above your token bill drops from roughly $7,500 (raw HTML) to about $300.
Frequently Asked Questions
Does /extract work on pages behind JavaScript rendering?
Yes. PageBolt runs a real Chromium browser under the hood, so JavaScript-rendered pages, SPAs, and React/Next.js apps are all fully rendered before extraction. You get the same content a real user would see.
What does "clean Markdown" mean exactly?
The endpoint uses Readability (the same algorithm Firefox Reader Mode uses) to identify the main content area, then converts it to Markdown. Navigation menus, ads, sidebars, footers, and cookie banners are stripped. Only the article or page body remains — headings, paragraphs, lists, code blocks, and links.
Can I extract pages that require authentication?
Yes. Pass session cookies via the cookies parameter or HTTP headers via headers. This lets you extract content from paywalled sites, internal tools, or any page that requires a logged-in session.
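A sketch of what an authenticated request body might look like. The exact shape of the cookies and headers parameters here is an assumption, so check the API reference for the real format:

```javascript
// Hypothetical payload shape for an authenticated extraction.
const payload = {
  url: 'https://app.example.com/internal/report',
  // Assumed shape: name/value pairs forwarded to the browser session.
  cookies: [{ name: 'session_id', value: process.env.SESSION_ID }],
  // Assumed shape: extra HTTP headers sent with the page request.
  headers: { 'Authorization': `Bearer ${process.env.API_TOKEN}` }
};
```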
Give your AI agent clean web content
Start free — 100 extractions/month, no credit card.
Start Free at pagebolt.dev/pricing