Lede 2024 Info Session: Advanced Web Scraping with Playwright

Past Event

February 10, 2024

10:00 AM - 12:00 PM (America/New_York)

Pulitzer Hall, 2950 Broadway, New York, NY 10027

Curious about joining Columbia Journalism School's 2024 Lede Program this summer to conquer the world of data journalism? This sample class + info session will give you a taste of what class is like, with Program Director Jonathan Soma!

----------------------------

Scraping dynamic websites with browser automation tools

High-value information is often locked behind hard-to-scrape pages: form fields must be filled out, "next" buttons clicked, content scrolled and scrolled to load all of the results. If page interaction is preventing you from scraping the data you need, Playwright is here to help!

Playwright is a next-generation browser automation tool that lets you use Python or JavaScript to scrape almost any web page. It can help with downloading pages of government documents, screenshotting tweets before they get deleted, or even just getting past the cookie consent banner. Beyond the basics, it can also monitor and log network requests, and it fits right into your traditional BeautifulSoup scraping approach.
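To give a rough feel for that workflow, here is a minimal sketch in Python. It assumes Playwright and BeautifulSoup are installed (`pip install playwright beautifulsoup4`, then `playwright install` to download the browsers). The function names `screenshot_path` and `capture`, the `screenshots` folder, and the choice to collect link text are all illustrative assumptions, not material from the session itself.

```python
from pathlib import Path
from urllib.parse import urlparse


def screenshot_path(url: str, folder: str = "screenshots") -> Path:
    """Build a file name for a screenshot from a URL (hypothetical helper)."""
    parts = urlparse(url)
    slug = (parts.netloc + parts.path).strip("/").replace("/", "_")
    return Path(folder) / f"{slug or 'page'}.png"


def capture(url: str) -> list[str]:
    """Load a page in headless Chromium, screenshot it, and return its link text."""
    # Imported here so screenshot_path() can be used without a browser installed.
    from bs4 import BeautifulSoup
    from playwright.sync_api import sync_playwright

    target = screenshot_path(url)
    target.parent.mkdir(parents=True, exist_ok=True)

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # full_page=True captures the whole scrollable page, not just the viewport.
        page.screenshot(path=str(target), full_page=True)
        # page.content() returns the HTML after the page's JavaScript has run.
        html = page.content()
        browser.close()

    # From here on it is ordinary BeautifulSoup scraping.
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.find_all("a")]
```

Calling `capture("https://example.com")` should save a screenshot under `screenshots/` and return the text of each link on the page. Playwright only does the browsing here; once the rendered HTML is in hand, the parsing is the same as in any other scraper.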

We'll look at:

- Installing Playwright
- Accessing elements on the page
- Interacting with web pages (clicking, navigating)
- Filling out form fields (text, dropdowns, radio buttons)
- Taking full-page and single-element screenshots
- Sending pages to traditional scraping tools like BeautifulSoup
- Sneaky tricks like CAPTCHA breaking
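Several of the topics above can be sketched in one short Python function. This is only an illustration under stated assumptions: the URL, every CSS selector, the `search_and_collect` and `main` names, and the `results.png` file name are hypothetical placeholders, and a real site will need its own selectors and waiting logic.

```python
def search_and_collect(page, query: str, max_pages: int = 5) -> list[str]:
    """Fill out a search form, then click through result pages collecting row text.

    `page` is a Playwright Page object. All selectors and the URL below are
    hypothetical placeholders for illustration.
    """
    page.goto("https://example.com/search")       # hypothetical URL
    page.fill("input[name='q']", query)           # text field
    page.select_option("select#year", "2024")     # dropdown
    page.check("input[value='pdf']")              # radio button
    page.click("button[type='submit']")           # submit the form

    rows: list[str] = []
    for _ in range(max_pages):
        page.wait_for_selector("table.results")
        rows.extend(page.locator("table.results tr").all_inner_texts())
        next_link = page.locator("a.next")
        if next_link.count() == 0:
            break  # no "next" link, so this was the last page
        next_link.click()

    # Screenshot a single element rather than the whole page.
    page.locator("table.results").screenshot(path="results.png")
    return rows


def main() -> None:
    # Requires `pip install playwright` followed by `playwright install`.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        print(search_and_collect(page, "budget"))
        browser.close()
```

Passing the `page` in as an argument keeps the browser setup separate from the scraping steps, which makes the steps easier to reuse and to try out against different sites. On pages that swap results in place with JavaScript, the simple `wait_for_selector` call may return before the new rows arrive, so a stricter wait would likely be needed there.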

For those of you who have tackled similar problems with Selenium: Playwright does the same job with a better interface, a smoother install/upgrade process, and far better usability. It might be time to upgrade!

Contact Information

Columbia Journalism School