Web Scraping with Playwright in Lambdas

12 Mar, 2024

Web scraping is something all developers know and loathe. Thankfully, serverless workers like AWS Lambda functions make it a bit less painful. In this post I'll guide you through setting up a basic worker for web scraping with a headless browser automation tool like Playwright.

Playwright (and other similar tools) requires a browser executable (in our case, a Chromium executable). This is tricky because a serverless worker's execution environment doesn't include a binary of the sort.

But there are numerous ways to get an executable into your function. The easiest is to use the @sparticuz/chromium NPM package.

Security Consideration

Executing a binary from an NPM package should be considered insecure. If your function deals with sensitive information, consider more secure alternatives like using AWS Lambda Layers or container images to include the Chromium executable. Additionally, for web scraping tasks, you might explore architectural alternatives, such as orchestrating tasks using AWS Step Functions, though this would represent a different approach to direct browser automation.

Anywho - onwards.

After importing,

const { chromium: playwright } = require("playwright-core");
const chromium = require("@sparticuz/chromium");

we create our browser and context.

// Note: `await` means this must run inside an async function,
// such as the Lambda handler.
const browser = await playwright.launch({
  headless: true,
  // Path to the Chromium binary that @sparticuz/chromium unpacks.
  executablePath: await chromium.executablePath(),
  proxy: {
    server: process.env.PROXY_HOST,
    username: process.env.PROXY_USERNAME,
    password: process.env.PROXY_PASSWORD,
  },
  args: [
    "--incognito",
    "--disable-extensions",
    "--ignore-certificate-errors",
    "--disable-blink-features=AutomationControlled",
    ...chromium.args,
  ],
  ignoreDefaultArgs: ["--enable-automation"],
});

// Every Playwright context is already isolated, much like an incognito
// profile, so no extra incognito option is needed here.
const context = await browser.newContext({
  // ignoreHTTPSErrors is a context option, not a launch option.
  ignoreHTTPSErrors: true,
  userAgent:
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36",
  extraHTTPHeaders: {
    "sec-ch-ua":
      '" Not A;Brand";v="99", "Chromium";v="90", "Google Chrome";v="90"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Windows"',
  },
});
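Since the snippets above use await, they have to live inside an async function, and in a Lambda that is the handler. Below is a minimal sketch of how that wrapper might look. The createHandler factory, the injected launchBrowser function, and the event.url field are all assumptions made for illustration; they are not part of Playwright or Lambda.

```javascript
// Hypothetical factory: `launchBrowser` is injected so the handler can be
// exercised with a stub instead of a real Chromium binary.
function createHandler(launchBrowser) {
  return async function handler(event) {
    const browser = await launchBrowser();
    try {
      const context = await browser.newContext();
      const page = await context.newPage();
      // `event.url` is an assumed event shape, not something Lambda defines.
      await page.goto(event.url, { waitUntil: "domcontentloaded" });
      const title = await page.title();
      return { statusCode: 200, body: JSON.stringify({ title }) };
    } finally {
      // Close the browser even when scraping throws, so a warm container
      // does not accumulate stray Chromium processes.
      await browser.close();
    }
  };
}
```

In the real function you would pass something like () => playwright.launch({ ... }) with your own launch options and assign the result to exports.handler.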

When creating the browser object, we specify a few things. For the most part the configuration is self-explanatory.

--incognito starts each session fresh.
--disable-extensions prevents any extensions from being loaded.
--ignore-certificate-errors instructs the browser to ignore TLS/SSL certificate errors.
--disable-blink-features=AutomationControlled disables the AutomationControlled feature of the Blink rendering engine, which is what causes navigator.webdriver to report true to the page.
ignoreDefaultArgs: ["--enable-automation"] tells Playwright not to pass the --enable-automation flag that it would otherwise add by default.

args: [
  "--incognito",
  "--disable-extensions",
  "--ignore-certificate-errors",
  "--disable-blink-features=AutomationControlled",
  ...chromium.args,
],
ignoreDefaultArgs: ["--enable-automation"],
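Because ...chromium.args spreads in a list you don't control, the same switch can end up in the array twice. Here is a small sketch of one way to merge the two lists so that it is explicit which value gets passed. The mergeArgs helper is hypothetical, and the "first occurrence wins" rule is a choice made for this example, not something Playwright or Chromium prescribes.

```javascript
// Hypothetical helper: merge custom flags with a base list, keeping only the
// first occurrence of each switch name (the part before "=").
function mergeArgs(customArgs, baseArgs) {
  const seen = new Set();
  const merged = [];
  for (const arg of [...customArgs, ...baseArgs]) {
    const name = arg.split("=")[0];
    if (seen.has(name)) continue; // a custom flag already claimed this switch
    seen.add(name);
    merged.push(arg);
  }
  return merged;
}
```

With this in place, the launch options could use args: mergeArgs([...your flags], chromium.args) instead of spreading both lists together.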

You can optionally set a proxy to route your traffic through, which helps keep your requests from being blocked by IP.

proxy: {
  server: process.env.PROXY_HOST,
  username: process.env.PROXY_USERNAME,
  password: process.env.PROXY_PASSWORD,
},
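Since the proxy is optional, it helps to leave the option out entirely when no proxy is configured, because Playwright expects proxy.server to be a string whenever proxy is present. A minimal sketch, assuming the same PROXY_* environment variable names as above (the proxyFromEnv helper itself is hypothetical):

```javascript
// Hypothetical helper: build Playwright's `proxy` option from environment
// variables, or return undefined when no proxy host is configured.
function proxyFromEnv(env) {
  if (!env.PROXY_HOST) return undefined;
  const proxy = { server: env.PROXY_HOST };
  // Credentials are only attached when they are actually set.
  if (env.PROXY_USERNAME) proxy.username = env.PROXY_USERNAME;
  if (env.PROXY_PASSWORD) proxy.password = env.PROXY_PASSWORD;
  return proxy;
}
```

You could then compute const proxy = proxyFromEnv(process.env) and spread ...(proxy ? { proxy } : {}) into the launch options, so the key is omitted when there is nothing to set.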

Defining the context object requires only setting a few configuration options. Again, most are self-explanatory.

We specify extraHTTPHeaders to overwrite the default sec-ch-ua headers that the headless browser would otherwise send, so that nothing in the request mentions the word headless or automation of any sort.

extraHTTPHeaders: {
  "sec-ch-ua": '" Not A;Brand";v="99", "Chromium";v="90", "Google Chrome";v="90"',
  "sec-ch-ua-mobile": "?0",
  "sec-ch-ua-platform": '"Windows"',
},
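One thing to watch is that the version in the user agent and the version in sec-ch-ua agree with each other, since a mismatch is an easy tell. Here is a rough sketch that derives both from a single version string. The buildFingerprint helper is hypothetical, and the brand list simply mirrors the values used above rather than anything Chromium guarantees.

```javascript
// Hypothetical helper: derive the userAgent and sec-ch-ua headers from one
// Chrome version string so the two can never drift apart.
function buildFingerprint(chromeVersion) {
  const major = chromeVersion.split(".")[0];
  return {
    userAgent:
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
      `(KHTML, like Gecko) Chrome/${chromeVersion} Safari/537.36`,
    extraHTTPHeaders: {
      "sec-ch-ua": `" Not A;Brand";v="99", "Chromium";v="${major}", "Google Chrome";v="${major}"`,
      "sec-ch-ua-mobile": "?0",
      "sec-ch-ua-platform": '"Windows"',
    },
  };
}
```

The result can be spread straight into the context options, for example browser.newContext({ ...buildFingerprint("90.0.4430.212") }).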