Curious about joining Columbia Journalism School's 2024 Lede Program this summer to conquer the world of data journalism? This sample class + info session will give you a taste of what class is like, with Program Director Jonathan Soma!
Scraping dynamic websites with browser automation tools
High-value information is often locked behind hard-to-scrape pages: form fields must be filled out, "next" buttons clicked, content scrolled and scrolled to load all of the results. If page interaction is preventing you from scraping the data you need, Playwright is here to help!
We'll look at:
- Installing Playwright
- Accessing elements on the page
- Interacting with web pages (clicking, navigating)
- Filling out form fields (text, dropdowns, radio buttons)
- Taking full-page and single-element screenshots
- Sending pages to traditional scraping tools like BeautifulSoup
- Sneaky tricks like CAPTCHA breaking
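To give a flavor of the topics above, here is a minimal sketch using Playwright's synchronous Python API. It is an illustration under assumptions, not the class material itself: the URL argument, the `input[name='q']`, `button[type='submit']`, and `.result` selectors, and the function names `scrape_results` and `extract_titles` are all hypothetical stand-ins for whatever page you are scraping. It also assumes you have already run `pip install playwright beautifulsoup4` and then `playwright install` to download the browsers.

```python
def extract_titles(html):
    """Hand rendered HTML to BeautifulSoup and pull out the result text."""
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    # ".result" is a hypothetical class name; adjust it for your page
    return [el.get_text(strip=True) for el in soup.find_all(class_="result")]


def scrape_results(url, query, screenshot_path="results.png"):
    """Fill out a search form, click submit, and scrape what loads."""
    from playwright.sync_api import sync_playwright  # pip install playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)

        # Fill a text field and click a button (selectors are assumptions)
        page.fill("input[name='q']", query)
        page.click("button[type='submit']")

        # Wait until at least one result has been rendered
        page.wait_for_selector(".result")

        # Full-page screenshot, then a screenshot of just the first result
        page.screenshot(path=screenshot_path, full_page=True)
        page.locator(".result").first.screenshot(path="first_result.png")

        # Grab the rendered HTML for a traditional scraping tool
        html = page.content()
        browser.close()

    return extract_titles(html)
```

Calling something like `scrape_results("https://example.com/search", "budget")` would open a headless browser, run the search, save two screenshots, and return a list of strings. Splitting the parsing into its own function means you can test it on saved HTML without launching a browser at all.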
For those of you who have tackled similar problems using Selenium: Playwright is a comparable tool with a better interface, a smoother install/upgrade process, and ten times the usability. It might be time to switch!