Retrieve an HTML webpage

In this guide, you’ll connect a CDP client to Lightpanda and extract all reference links from a Wikipedia page.

Unlike curl, which only fetches raw HTML, Lightpanda executes JavaScript and runs query selectors directly in the browser, which makes it suitable for dynamic pages.

Prerequisites

You’ll need Node.js installed on your computer.

Create an hn-scraper directory and initialize a new Node.js project. You can accept all the default values in the npm init prompts.


mkdir hn-scraper && \
  cd hn-scraper && \
  npm init

Install the @lightpanda/browser npm package along with either puppeteer-core or playwright-core.

Unlike puppeteer and playwright, puppeteer-core and playwright-core don’t download a Chromium browser.



npm install --save puppeteer-core @lightpanda/browser

This guide uses a local Lightpanda instance; see the installation guide for setup details. The code can be adapted to connect to Lightpanda Cloud.
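
The connection target is the only part that should need to change between a local instance and a remote one. As a minimal sketch, the helper below builds the WebSocket endpoint either from a host and port or from a full URL supplied through an environment variable. The function name `resolveEndpoint` and the variable `LPD_WS_ENDPOINT` are hypothetical, chosen only for illustration; the actual Lightpanda Cloud endpoint and its authentication details are not covered here.

```javascript
// Hypothetical helper: pick the CDP WebSocket endpoint to connect to.
// LPD_WS_ENDPOINT is an assumed variable name, not a Lightpanda convention.
function resolveEndpoint(opts, env = {}) {
  const override = env.LPD_WS_ENDPOINT;
  if (override) {
    // Validate early so a typo fails here rather than at connect time.
    const url = new URL(override);
    if (url.protocol !== 'ws:' && url.protocol !== 'wss:') {
      throw new Error('Expected a ws:// or wss:// URL, got ' + override);
    }
    return override;
  }
  // Default: a local instance addressed by host and port.
  return 'ws://' + opts.host + ':' + opts.port;
}

const endpoint = resolveEndpoint({ host: '127.0.0.1', port: 9222 }, process.env);
console.log(endpoint);
```

The returned string can be passed as the `browserWSEndpoint` option of `puppeteer.connect`.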

Create index.js

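Create an index.js file. Import the packages, configure the connection, and set up the browser lifecycle. The script uses ES module import syntax, so if your Node.js version does not detect module syntax automatically, set "type": "module" in package.json: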



'use strict'
 
import { lightpanda } from '@lightpanda/browser';
import puppeteer from 'puppeteer-core';
 
const lpdopts = {
  host: '127.0.0.1',
  port: 9222,
};
 
const puppeteeropts = {
  browserWSEndpoint: 'ws://' + lpdopts.host + ':' + lpdopts.port,
};
 
(async () => {
  // Start Lightpanda browser in a separate process.
  const proc = await lightpanda.serve(lpdopts);
 
  // Connect Puppeteer to the browser.
  const browser = await puppeteer.connect(puppeteeropts);
  const context = await browser.createBrowserContext();
  const page = await context.newPage();
 
  // Disconnect Puppeteer.
  await page.close();
  await context.close();
  await browser.disconnect();
 
  // Stop Lightpanda browser process.
  proc.stdout.destroy();
  proc.stderr.destroy();
  proc.kill();
})();
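
In the skeleton above, the cleanup steps only run if everything before them succeeds. One way to make shutdown unconditional is a try/finally wrapper. The sketch below is generic and uses a stand-in resource so that it runs without a browser; `withResource` is a hypothetical helper, not part of Puppeteer or the Lightpanda package.

```javascript
// Hypothetical helper: acquire a resource, use it, and always release it.
async function withResource(acquire, release, use) {
  const resource = await acquire();
  try {
    return await use(resource);
  } finally {
    // Runs whether use() returned or threw.
    await release(resource);
  }
}

// Stand-in for a browser process, so the sketch runs on its own.
const events = [];
withResource(
  async () => { events.push('start'); return { name: 'fake-browser' }; },
  async () => { events.push('stop'); },
  async () => { events.push('work'); throw new Error('navigation failed'); },
).catch((err) => {
  events.push('caught: ' + err.message);
  // prints "start, work, stop, caught: navigation failed"
  console.log(events.join(', '));
});
```

In the script, `acquire` would correspond to `lightpanda.serve(lpdopts)` and `release` to destroying the process streams and calling `proc.kill()`.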

Navigate and extract

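Use page.goto to navigate to the Wikipedia page, then run a query selector with page.evaluate to extract all external reference links. Place this code inside the async function, after the page is created and before the disconnect step: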



  // Go to Wikipedia page.
  await page.goto("https://en.wikipedia.org/wiki/Web_browser");
 
  // Extract all links from the references list of the page.
  const reflist = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('.references a.external')).map(row => {
      return row.getAttribute('href');
    });
  });
 
  // Display the result.
  console.log("all reference links", reflist);
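
`getAttribute('href')` returns the attribute exactly as written in the markup, so a value may be relative or protocol-relative. If absolute URLs are needed, one option is to resolve each value against the page URL after extraction. The sketch below uses the standard `URL` constructor; `toAbsolute` is a hypothetical name, and the sample values are made up for illustration.

```javascript
// Hypothetical post-processing step: resolve raw href values against the page URL.
function toAbsolute(hrefs, pageUrl) {
  const out = [];
  for (const href of hrefs) {
    if (!href) continue; // getAttribute returns null when href is absent
    try {
      out.push(new URL(href, pageUrl).href);
    } catch {
      // Skip values that cannot be parsed as a URL.
    }
  }
  return out;
}

// Illustrative input, not taken from the actual page.
const sample = ['https://example.org/a', '//example.org/b', '/wiki/Foo', null];
console.log(toAbsolute(sample, 'https://en.wikipedia.org/wiki/Web_browser'));
```

With this input, the protocol-relative value inherits `https:` from the page URL, the path-only value is resolved against `en.wikipedia.org`, and the `null` entry is dropped.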

Full script



'use strict'
 
import { lightpanda } from '@lightpanda/browser';
import puppeteer from 'puppeteer-core';
 
const lpdopts = {
  host: '127.0.0.1',
  port: 9222,
};
 
const puppeteeropts = {
  browserWSEndpoint: 'ws://' + lpdopts.host + ':' + lpdopts.port,
};
 
(async () => {
  // Start Lightpanda browser in a separate process.
  const proc = await lightpanda.serve(lpdopts);
 
  // Connect Puppeteer to the browser.
  const browser = await puppeteer.connect(puppeteeropts);
  const context = await browser.createBrowserContext();
  const page = await context.newPage();
 
  // Go to Wikipedia page.
  await page.goto("https://en.wikipedia.org/wiki/Web_browser");
 
  // Extract all links from the references list of the page.
  const reflist = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('.references a.external')).map(row => {
      return row.getAttribute('href');
    });
  });
 
  // Display the result.
  console.log("all reference links", reflist);
 
  // Disconnect Puppeteer.
  await page.close();
  await context.close();
  await browser.disconnect();
 
  // Stop Lightpanda browser process.
  proc.stdout.destroy();
  proc.stderr.destroy();
  proc.kill();
})();

Run it


node index.js


$ node index.js
🐼 Running Lightpanda's CDP server... { pid: 34389 }
all reference links [
  'https://gs.statcounter.com/browser-market-share',
  'https://radar.cloudflare.com/reports/browser-market-share-2024-q1',
  'https://web.archive.org/web/20240523140912/https://www.internetworldstats.com/stats.htm',
  'https://www.internetworldstats.com/stats.htm',
  'https://www.reference.com/humanities-culture/purpose-browser-e61874e41999ede',
  ...
]
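
The result can contain the same source more than once, for example an original URL next to its archived copy. As a rough sketch of a follow-up step, the function below removes exact duplicates and counts links per hostname. `countByHost` is a hypothetical helper, it assumes every entry is an absolute URL, and the input is the subset of links shown in the output above.

```javascript
// Hypothetical helper: drop exact duplicates, then count links per hostname.
// Assumes every entry is an absolute URL; new URL() throws otherwise.
function countByHost(links) {
  const counts = new Map();
  for (const link of new Set(links)) {
    const host = new URL(link).hostname;
    counts.set(host, (counts.get(host) || 0) + 1);
  }
  return counts;
}

const reflist = [
  'https://gs.statcounter.com/browser-market-share',
  'https://radar.cloudflare.com/reports/browser-market-share-2024-q1',
  'https://web.archive.org/web/20240523140912/https://www.internetworldstats.com/stats.htm',
  'https://www.internetworldstats.com/stats.htm',
  'https://www.reference.com/humanities-culture/purpose-browser-e61874e41999ede',
];
for (const [host, n] of countByHost(reflist)) {
  console.log(host, n);
}
```

The archived copy is counted under `web.archive.org`, separately from `www.internetworldstats.com`, because only the hostname of the outer URL is considered.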