Web Scraping with Playwright in Lambdas
Web scraping is something all developers know and loathe. Serverless workers like AWS Lambda functions make it a bit less painful. In this post I'll walk you through setting up a basic worker for web scraping using a headless browser automation tool, Playwright.
Playwright (like other similar tools) requires a browser executable (in our case, a Chromium executable). This is tricky because a serverless worker's execution environment doesn't include a binary of the sort.
But there are numerous ways to get an executable into your function. The easiest is to use the @sparticuz/chromium NPM package.
Security Consideration
Executing a binary from an NPM package should be considered insecure. If your function deals with sensitive information, consider more secure alternatives like using AWS Lambda Layers or container images to include the Chromium executable. Additionally, for web scraping tasks, you might explore architectural alternatives, such as orchestrating tasks using AWS Step Functions, though this would represent a different approach to direct browser automation.
Anywho - onwards.
After importing,
const { chromium: playwright } = require("playwright-core");
const chromium = require("@sparticuz/chromium");
we create our browser and context.
const browser = await playwright.launch({
  headless: true,
  executablePath: await chromium.executablePath(),
  proxy: {
    server: process.env.PROXY_HOST,
    username: process.env.PROXY_USERNAME,
    password: process.env.PROXY_PASSWORD,
  },
  args: [
    "--incognito",
    "--disable-extensions",
    "--ignore-certificate-errors",
    "--disable-blink-features=AutomationControlled",
    ...chromium.args,
  ],
  ignoreDefaultArgs: ["--enable-automation"],
});
// Each Playwright context is already isolated, much like an incognito profile,
// so no extra incognito option is needed here.
const context = await browser.newContext({
  // ignoreHTTPSErrors is a context option in Playwright, not a launch option.
  ignoreHTTPSErrors: true,
  userAgent:
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36",
  extraHTTPHeaders: {
    "sec-ch-ua":
      '" Not A;Brand";v="99", "Chromium";v="90", "Google Chrome";v="90"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Windows"',
  },
});
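The snippet above uses await, so in a CommonJS Lambda it has to run inside an async function. The following is a minimal sketch of one way to wrap it, not a definitive implementation. The names scrapeTitle and launchBrowser, and the event.url input field, are hypothetical and not part of the original setup; launchBrowser stands for any async function that resolves to a Playwright browser, such as a wrapper around the playwright.launch call above.

```javascript
// Hypothetical helper: `launchBrowser` is any async function that resolves to
// a Playwright Browser, e.g. () => playwright.launch({ /* options */ }).
async function scrapeTitle(launchBrowser, url) {
  const browser = await launchBrowser();
  try {
    const context = await browser.newContext();
    const page = await context.newPage();
    // Wait for the DOM to be parsed rather than for every asset to load.
    await page.goto(url, { waitUntil: "domcontentloaded" });
    return await page.title();
  } finally {
    // Always close the browser, even when navigation fails, so a reused
    // Lambda container does not accumulate Chromium processes.
    await browser.close();
  }
}

// Sketch of a handler; `event.url` is an assumed input shape.
// exports.handler = async (event) => ({
//   title: await scrapeTitle(() => playwright.launch({ /* options */ }), event.url),
// });
```

Passing the launcher in as a function keeps the scraping logic separate from the Chromium setup, which also makes it easy to exercise with a stub browser outside Lambda.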
When creating the browser object, we specify a few things. For the most part the configuration is self-explanatory.
--incognito starts each session fresh.
--disable-extensions prevents any extensions from being loaded.
--ignore-certificate-errors instructs the browser to ignore TLS/SSL certificate errors.
--disable-blink-features=AutomationControlled disables the AutomationControlled feature of the Blink rendering engine.
ignoreDefaultArgs: ["--enable-automation"] tells Playwright not to pass the --enable-automation flag.
args: [
  "--incognito",
  "--disable-extensions",
  "--ignore-certificate-errors",
  "--disable-blink-features=AutomationControlled",
  ...chromium.args,
],
ignoreDefaultArgs: ["--enable-automation"],
You can optionally set a proxy to route your traffic through, which reduces the chance of your requests being blocked by IP.
proxy: {
  server: process.env.PROXY_HOST,
  username: process.env.PROXY_USERNAME,
  password: process.env.PROXY_PASSWORD,
},
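Because the proxy is optional, it may be cleaner to attach the proxy block only when a host is actually configured, rather than passing an undefined server to Playwright. The sketch below is one possible way to do that. The helper name buildLaunchOptions and its parameter names are hypothetical; executablePath and baseArgs are assumed to come from await chromium.executablePath() and chromium.args.

```javascript
// Sketch: build the options object passed to playwright.launch().
// `executablePath` and `baseArgs` are assumed to come from @sparticuz/chromium;
// `env` defaults to process.env and can be replaced for local testing.
function buildLaunchOptions({ executablePath, baseArgs = [], env = process.env }) {
  const options = {
    headless: true,
    executablePath,
    args: [
      "--incognito",
      "--disable-extensions",
      "--ignore-certificate-errors",
      "--disable-blink-features=AutomationControlled",
      ...baseArgs,
    ],
    ignoreDefaultArgs: ["--enable-automation"],
  };

  // Only attach a proxy when a host is configured.
  if (env.PROXY_HOST) {
    options.proxy = { server: env.PROXY_HOST };
    // Credentials are optional; add them only when a username is present.
    if (env.PROXY_USERNAME) {
      options.proxy.username = env.PROXY_USERNAME;
      options.proxy.password = env.PROXY_PASSWORD;
    }
  }
  return options;
}
```

The result can then be handed straight to the launcher, for example playwright.launch(buildLaunchOptions({ executablePath: await chromium.executablePath(), baseArgs: chromium.args })).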
Defining the context object only requires setting a few configuration options. Again, most are self-explanatory.
We specify extraHTTPHeaders to overwrite the default sec-ch-* client hint headers, ensuring they make no mention of the word "headless" or of automation of any sort.
extraHTTPHeaders: {
  "sec-ch-ua": '" Not A;Brand";v="99", "Chromium";v="90", "Google Chrome";v="90"',
  "sec-ch-ua-mobile": "?0",
  "sec-ch-ua-platform": '"Windows"',
},
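The user agent string and the sec-ch-ua header both carry the Chrome version, so it probably helps to derive them from a single value to keep them from drifting apart. A minimal sketch, assuming a Windows desktop identity like the one used above; the helper name buildChromeIdentity is hypothetical:

```javascript
// Sketch: derive a matching userAgent and sec-ch-* header set from one
// Chrome version string such as "90.0.4430.212".
function buildChromeIdentity(fullVersion) {
  // The client hint header only carries the major version ("90").
  const major = fullVersion.split(".")[0];
  return {
    userAgent:
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
      `(KHTML, like Gecko) Chrome/${fullVersion} Safari/537.36`,
    extraHTTPHeaders: {
      "sec-ch-ua": `" Not A;Brand";v="99", "Chromium";v="${major}", "Google Chrome";v="${major}"`,
      "sec-ch-ua-mobile": "?0",
      "sec-ch-ua-platform": '"Windows"',
    },
  };
}
```

The returned object uses the same option names as the context configuration above, so it can be spread into it, for example browser.newContext({ ...buildChromeIdentity("90.0.4430.212") }).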