You can upload files, grab data from a public URL, load Google Sheets document, or just paste in data. But sadly, much of the world’s data is locked into web pages that were never designed to be machine-readable. If you’ve used Google Sheets, you may have used the ImportHTML command to read tables. Workbench Scrape Table module works exactly the same way:
Workbench, A Data Platform For All Reporters
We are building Columbia Journalism's Workbench to be a simpler, more powerful way for journalists to work with data.
Data journalism involves steps such as getting the data, cleaning it, and visualizing it. It also requires other tasks that might be less obvious, like checking for updated data, scraping tables out of web pages, and sending a completed story to your editor for review.
While there are tools for each of these, there's no tool that does all of these together, without requiring any coding . Journalists can sign up for Workbench - currently in private beta - to access a streamlined data workflow that allows you to:
- Scrape data from web pages
- Clean data using an Open Refine-like interface
- Monitor live data for changes
- Use spreadsheet formulas and regular expressions
- Export live charts and data
- Build and share custom processing modules
A data tool that you can’t get data into isn’t much good, which is why Workbench has a variety of Add Data modules.
Here the “4” means scrape the 4th table on the page, which is Wikipedia’s list of the longest rivers:
You can also use HTML Scraper to scrape a list of URLs, saving the HTML for each one. Full xpath support, like Google Sheets’ ImportXML, is coming soon.
The data that journalists work with is messy. Really messy. It comes from literally everywhere and is full of all sorts of errors. If you’ve been working with data for a while, you’ve probably discovered Open Refine. It’s best known for its faceting and clustering features which make it easy to standardize values and fixed types. So, we made our own.
The Refine module shows a list of all values in a column, sorted by how many times each appears. It’s useful both for cleaning and filtering. You can include or exclude any value (perhaps we don’t want the blank rows) or edit all rows with the same value at the same time, by double clicking (we can change every “Seattle” to “Seattle -Tacoma.”)
Refine is the crown jewel of Workbench’s cleanup tools, which also include modules for filtering values, dropping empty row and empty columns, splitting and merging columns, filling empty cells or just editing values.
Monitor data sources for changes
Every Add Data module has an “update button” that checks for new data. Or, you can make Workbench check for you automatically.
Previously loaded data is automatically versioned and saved, so you can go back to it later — for example if the data you were using was deleted! And when Workbench finds that the data has changed, it can notify you by email.
Formulas and regular expressions
Maybe you’re a spreadsheet genius and you know all about how to use formulas to clean or calculate. Or maybe you’ve mastered the art of regular expressions to search through text or extract information. Or perhaps you’d prefer Python? Workbench supports them all.
Here’s an Excel formula to extract the first word of every value in the second column. In Workbench, formulas are automatically “filled” downward, so a formula creates an entire new column, not just one cell. The Formula module also supports Python syntax if that’s what you prefer.
And here’s the same thing using our Regular Expression Extractor:
Export live charts and data
Workbench currently has basic column and line charts you can embed live into your own pages. Just click on the embed icon (</>) to get code you can paste into your HTML pages, or your CMS.
This chart will automatically update whenever the workflow does, and reflect changes in the data source, if you’ve set your Add Data module to do so. Of course, you can simply download a static PNG or SVG file.
In fact, every single module has a live API endpoint which serves CSV or JSON, which you can access from the menu on top right of each module. You can use this to feed the output of Workbench to other systems — perhaps use it to drive a live interactive visualization.
Build and share custom modules
Workbench is designed to get as much as possible done without coding. But there will always be places where only programming will go. You can always just paste in some code to create a custom processing step.
Code is better when it’s shared, and so you can package it up to create a custom module, and import it into the Module Library for others to use. This way, developers can support specific teams and projects, and contribute back to the Workbench community.
Custom modules can even produce custom visualizations! Think of Workbench as a package manager for user-friendly data processing tools.
Try it yourself
Workbench is currently in private beta, and you can sign up here. Or, join us as a developer to create your own modules. Either way, we hope you’ll share our excitement for this powerful new way to work with data.