Native Markdown Output in Lightpanda

Adrià Arrufat

Software Engineer

TL;DR

Lightpanda now converts HTML to markdown natively inside the browser. You can use it from the CLI with --dump markdown, or programmatically via a custom CDP command, LP.getMarkdown. The conversion happens at the DOM level, after JavaScript execution, so you get the actual rendered content. For AI agents, this means up to 80% fewer tokens per page with zero external dependencies.

HTML Is Expensive for Machines

If you’re building AI agents that browse the web, you already know this problem. You navigate to a page, extract the HTML and feed it to an LLM. Most of that HTML is noise: navigation bars, script tags, style attributes, wrapper divs. None of it carries semantic value, yet it all consumes tokens.

A typical web page might contain 1,000 words of actual content but produce 8,000-10,000 tokens of HTML when you account for all the markup. Markdown preserves the content and its structure (headings, links, lists, tables) while stripping everything else. Cloudflare measured this on their own blog. A post that took 16,180 tokens in HTML dropped to 3,150 tokens in markdown; that’s an 80% reduction.
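To make the arithmetic concrete, here is a small sketch that computes the reduction from the two token counts quoted above. The helper name tokenReduction is ours, chosen for illustration; it is not part of any library.

```javascript
// Percentage of tokens saved when moving from HTML to markdown.
// `tokenReduction` is a hypothetical helper name used for illustration.
function tokenReduction(htmlTokens, markdownTokens) {
  if (htmlTokens <= 0) {
    throw new RangeError('htmlTokens must be positive');
  }
  return ((htmlTokens - markdownTokens) / htmlTokens) * 100;
}

// Cloudflare's figures quoted above: 16,180 HTML tokens vs 3,150 markdown tokens.
const saved = tokenReduction(16180, 3150);
console.log(`${saved.toFixed(1)}% fewer tokens`); // prints "80.5% fewer tokens"
```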

The usual approach is to do this conversion outside the browser. You fetch the HTML, pipe it through a conversion library, clean up the output, and hope nothing important was lost. That works, but it adds a dependency, another failure point, and processing time you didn’t need to spend.

Because we’re building Lightpanda from scratch for machines, we control the whole codebase. That let us take a different approach: we built the conversion directly into the browser.

How It Works

Lightpanda’s markdown conversion operates on the DOM tree after JavaScript has executed. This is an important distinction. You’re not converting raw HTML source code but the live document, the same one a user would see after all client-side rendering has completed.

The converter walks the DOM and translates each node into its markdown equivalent. Headings become # syntax. Links become [text](url). Lists, tables, code blocks, and emphasis are all handled. Wrapper elements with no semantic meaning (like <div> and <span>) contribute no markup of their own; only their content is kept. Script and style elements are excluded entirely.
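As a rough illustration of that kind of tree walk, here is a toy converter over plain JavaScript objects. It is not Lightpanda's implementation: the node shape ({ tag, attrs, children }), the toMarkdown name, and the handful of supported tags are all assumptions made for the sketch, including the choice to keep the content of wrapper elements while emitting no markup for them.

```javascript
// Toy DOM-to-markdown walker. Nodes are plain objects ({ tag, attrs, children })
// or strings for text. Illustration only, not Lightpanda's implementation.
const SKIPPED = new Set(['script', 'style']);

function toMarkdown(node) {
  if (typeof node === 'string') return node; // text node
  if (SKIPPED.has(node.tag)) return '';      // excluded entirely

  // Convert children first, then wrap them according to the parent's tag.
  const inner = (node.children ?? []).map(toMarkdown).join('');
  const heading = /^h([1-6])$/.exec(node.tag);
  if (heading) return `${'#'.repeat(Number(heading[1]))} ${inner}\n\n`;

  switch (node.tag) {
    case 'p': return `${inner}\n\n`;
    case 'a': return `[${inner}](${node.attrs?.href ?? ''})`;
    case 'strong': return `**${inner}**`;
    case 'em': return `*${inner}*`;
    case 'li': return `- ${inner}\n`;
    case 'ul': return `${inner}\n`;
    default: return inner; // div, span, etc.: keep content, add no markup
  }
}

const page = {
  tag: 'div',
  children: [
    { tag: 'script', children: ['track()'] },
    { tag: 'h1', children: ['Hello'] },
    {
      tag: 'p',
      children: ['Read the ', { tag: 'a', attrs: { href: '/docs' }, children: ['docs'] }, '.'],
    },
  ],
};

console.log(toMarkdown(page).trim());
```

The script node disappears, the wrapper div leaves no trace, and the heading and link keep their structure.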

There are two ways to use it.

CLI: The --dump markdown Flag

If you use Lightpanda’s fetch command, you can now pass --dump markdown to get markdown output directly:

lightpanda fetch --dump markdown https://example.com

This fetches the page, executes JavaScript, and prints the markdown representation to stdout. It needs no browser session, CDP connection, or scripting, which makes it useful for quick extraction or piping into other tools.
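If the other tool is a Node.js script, one way to consume that output is to spawn the CLI and read its stdout. The sketch below assumes a lightpanda binary is available on your PATH; the fetchArgs and fetchMarkdown names are ours, and only the fetch subcommand and --dump markdown flag come from the command shown above.

```javascript
// Build the argument list for `lightpanda fetch --dump markdown <url>`.
// `fetchArgs` and `fetchMarkdown` are hypothetical helper names.
function fetchArgs(url) {
  return ['fetch', '--dump', 'markdown', url];
}

// Run the CLI and resolve with whatever it prints to stdout.
// Assumes a `lightpanda` binary is on PATH.
async function fetchMarkdown(url) {
  // Dynamic import keeps the snippet usable from both ES modules and CommonJS.
  const { execFile } = await import('node:child_process');
  return new Promise((resolve, reject) => {
    const options = { maxBuffer: 16 * 1024 * 1024 }; // allow large pages
    execFile('lightpanda', fetchArgs(url), options, (err, stdout) => {
      if (err) reject(err);
      else resolve(stdout);
    });
  });
}

// Usage:
// fetchMarkdown('https://example.com').then((md) => console.log(md));
```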

CDP: The LP.getMarkdown Command

For programmatic use, Lightpanda exposes a custom CDP command under its own LP domain. This integrates with your existing Puppeteer or Playwright scripts.

Here is how it looks with Puppeteer:

import puppeteer from 'puppeteer-core';

// Connect to a running Lightpanda instance over CDP.
const browser = await puppeteer.connect({
  browserWSEndpoint: 'ws://127.0.0.1:9222',
});
const context = await browser.createBrowserContext();
const page = await context.newPage();

// page._client() is a private Puppeteer API, so it may change between releases.
const client = page._client();

await page.goto('https://example.com', { waitUntil: 'networkidle0' });

// Ask Lightpanda for the markdown version of the loaded page.
const cmd = await client.send('LP.getMarkdown', {});
console.log(cmd.markdown);

await page.close();
await context.close();
await browser.disconnect();

And with Playwright:

import { chromium } from 'playwright-core';

// Connect to a running Lightpanda instance over CDP.
const browser = await chromium.connectOverCDP('ws://127.0.0.1:9222');
const context = await browser.newContext();
const page = await context.newPage();

await page.goto('https://example.com');

// Open a CDP session for the page and request its markdown.
const client = await page.context().newCDPSession(page);
const cmd = await client.send('LP.getMarkdown');
console.log(cmd.markdown);

await page.close();
await context.close();
await browser.close();

The key detail: you call client.send('LP.getMarkdown', {}) and get back an object with a markdown property containing the converted content. No external libraries or post-processing pipeline.
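Since both sessions expose the same send(method, params) shape, the call can be wrapped in a small helper shared by Puppeteer and Playwright scripts. This is a minimal sketch: the getMarkdown helper name and the defensive type check are our additions, and fakeClient is a stand-in so the snippet runs without a browser.

```javascript
// Works with any CDP session object exposing send(method, params),
// such as the Puppeteer and Playwright clients shown above.
// `getMarkdown` is a hypothetical helper name.
async function getMarkdown(client) {
  const result = await client.send('LP.getMarkdown', {});
  if (typeof result?.markdown !== 'string') {
    throw new TypeError('LP.getMarkdown did not return a markdown string');
  }
  return result.markdown;
}

// Stand-in session so the sketch runs without a browser.
const fakeClient = {
  async send(method) {
    if (method !== 'LP.getMarkdown') throw new Error(`unexpected method ${method}`);
    return { markdown: '# Example Domain' };
  },
};

getMarkdown(fakeClient).then((md) => console.log(md)); // prints "# Example Domain"
```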

Converting Specific Elements

You do not always need the entire page as markdown. Sometimes you only care about the article body, or a specific section. The LP.getMarkdown CDP command supports converting a sub-portion of the page by targeting specific DOM nodes.

This is useful when you want to skip headers, footers, and sidebars and extract only the content that matters for your AI workflow. Less noise means fewer tokens and better results from your LLM.
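The article does not name the parameter used to target a node, so treat the following as a sketch under assumptions rather than the documented API. It assumes LP.getMarkdown accepts a nodeId parameter, and that the node is resolved with the standard CDP commands DOM.getDocument and DOM.querySelector. The getMarkdownFor name is ours.

```javascript
// Convert only the subtree matching `selector`.
// ASSUMPTION: LP.getMarkdown accepts a `nodeId` parameter. The article does not
// name the parameter, so verify it against the Lightpanda documentation.
// `getMarkdownFor` is a hypothetical helper name.
async function getMarkdownFor(client, selector) {
  // DOM.getDocument and DOM.querySelector are standard CDP commands.
  const { root } = await client.send('DOM.getDocument');
  const { nodeId } = await client.send('DOM.querySelector', {
    nodeId: root.nodeId,
    selector,
  });
  if (!nodeId) throw new Error(`no element matches ${selector}`);

  const { markdown } = await client.send('LP.getMarkdown', { nodeId });
  return markdown;
}

// Usage, with a `client` obtained as in the Puppeteer or Playwright examples:
// const articleMarkdown = await getMarkdownFor(client, 'article');
```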

Why This Matters for AI Agents

The shift toward markdown as the standard format for AI consumption is well underway. Cloudflare recently launched Markdown for Agents at the CDN layer. Tools like Claude Code already send Accept: text/markdown headers when fetching web content.

However, these approaches require either the origin server or a CDN to support markdown conversion, and many sites do not. When your AI agent navigates to an arbitrary website, you cannot rely on the server providing markdown.

Lightpanda solves this at the browser level. Because the conversion happens inside the browser, it works on any website. The browser fetches the page, executes the JavaScript, builds the DOM, and converts the result to markdown. No cooperation from the server required.

For AI agent architectures, this changes the economics:

  • Token reduction: Up to 80% fewer input tokens per page means lower API costs and more room in your context window for actual reasoning.
  • No external dependencies: The conversion is built into the browser. No separate library to install, update, or debug.
  • Post-JavaScript content: Unlike server-side conversion, this captures dynamically rendered content. SPAs, lazy-loaded data, client-side routing: all visible in the output.
  • Targeted extraction: Convert only the parts of the page you need, reducing noise further.

The LP Domain

The LP.getMarkdown command is the first in a new custom CDP domain we are introducing: LP. This domain will be the home for Lightpanda-specific capabilities that go beyond what standard CDP offers.

We wrote extensively about CDP’s limitations for automation. The protocol was built for debugging, not for machines. The LP domain is where we can build commands that make sense for automation-first use cases, without being constrained by a protocol designed for Chrome DevTools.

Markdown output is the first, but more commands are coming.

Get Started

Try the markdown output today. The quickstart guide will get you running in under 10 minutes. Working examples for both Puppeteer and Playwright are in the demo repo.


FAQ

What format does the markdown output use?

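The output follows standard CommonMark markdown syntax. Headings, links, images, lists, tables, code blocks, and emphasis are all supported. Note that tables are an extension to core CommonMark; GitHub Flavored Markdown defines the widely used pipe-table syntax. Elements without semantic meaning (like layout divs) are omitted.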

Does the markdown conversion happen before or after JavaScript execution?

After. The converter walks the live DOM tree, so it captures all dynamically rendered content. This means SPAs and pages that load content via JavaScript produce accurate markdown output.

Can I convert only part of a page to markdown?

Yes. The LP.getMarkdown CDP command supports targeting specific DOM nodes, so you can extract only the content you care about rather than the entire document.

How much does markdown reduce token usage compared to HTML?

It depends on the page. Pages with heavy markup (navigation, footers, ads, script tags) see the largest reductions. Cloudflare measured an 80% reduction on their blog. Content-heavy pages with minimal markup will see smaller but still meaningful reductions.

Does this work with Lightpanda Cloud?

Yes. The LP.getMarkdown command works the same way whether you run Lightpanda locally or connect to Lightpanda Cloud.

Is the LP domain compatible with standard CDP?

The LP domain is a Lightpanda-specific extension. It is not part of the Chrome DevTools Protocol specification. Standard CDP commands continue to work as expected. The LP domain adds new capabilities on top.

Do I need to install any additional libraries?

No. The markdown conversion is built into the Lightpanda browser binary. There are no external dependencies.


Adrià Arrufat

Software Engineer

Adrià is an AI engineer at Lightpanda, where he works on making the browser more useful for AI workflows. Before Lightpanda, Adrià built machine learning systems and contributed to open-source projects across computer vision and systems programming.