Even if you missed one of the most influential stories to come out of Columbia Journalism School in 2018, you probably heard about its aftermath.
In January, the New York Times published The Follower Factory, an exposé on the shady business of selling followers on social media that estimated as many as 48 million Twitter profiles may be fraudulent. Months later, following investigations by the New York state attorney general and others, the social media platform announced the purge of millions of suspicious accounts.
The move rocked the social media world—and cast suspicion on a number of celebrity users as their follower counts dropped dramatically overnight. But this may never have occurred had a group of Columbia data journalism students not sat down for a semester examining information networks and algorithms in Mark Hansen’s Spring 2017 Computational Journalism course.
Here, we speak with Professor Hansen to learn more about how a student project from this course analyzing bots led to the New York Times story and to find out how journalism students can harness the latest computational tools and techniques to inform their reporting.
How did the project of analyzing bots come about?
The focus of Computational Journalism that spring was how the networks that we rely on for information can be fragile. We started the semester looking at trending topics on Twitter: What makes something trend? What do we know about the algorithms being used by different platforms? What do they publish? What do they tell us? We were then able to see examples of how people's activities could manipulate those systems.
For example, the first thing we looked at were two hashtags, #schumershutdown versus #trumpshutdown. There seemed to be a push to try to get one to trend over the other, because if you got it to trend on Twitter it must mean that either Schumer or Trump was responsible.
We also looked at algorithmic recommender systems. These are systems that say, well maybe if you like X, then you’re like Y. We also looked at how information travels across networks and had a guest from Microsoft Research talk to the class about so-called "information cascades" and how information goes viral.
The students got exposure to network science as well as basic machine learning ideas around how Twitter decides, given all the many tweets, what’s the thing being talked about—what’s trending.
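Twitter's actual trending algorithm is proprietary, but the core idea the class explored can be sketched in a few lines: a hashtag "trends" when its volume in a recent window spikes relative to its baseline. Everything below, including the sample data and the threshold values, is illustrative, not Twitter's method.

```python
# Toy illustration of spike detection for trending topics.
# The ratio and min_count thresholds are arbitrary choices for the demo.
from collections import Counter

def spiking_hashtags(recent, baseline, ratio=3.0, min_count=5):
    """Flag hashtags whose recent volume spikes relative to baseline."""
    recent_counts = Counter(recent)
    baseline_counts = Counter(baseline)
    trending = []
    for tag, count in recent_counts.items():
        expected = baseline_counts.get(tag, 1)  # avoid division by zero
        if count >= min_count and count / expected >= ratio:
            trending.append(tag)
    return trending

# Hypothetical hashtag streams: a coordinated push on one tag.
baseline = ["#news"] * 10 + ["#schumershutdown"] * 2
recent = ["#news"] * 9 + ["#schumershutdown"] * 12

print(spiking_hashtags(recent, baseline))  # ['#schumershutdown']
```

This kind of ratio test also shows why buying engagement works: a burst of coordinated activity in a short window is exactly what a spike detector rewards.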
How did this lead you to Devumi, the company at the center of the New York Times piece?
That’s easy. If you Google buy Twitter followers, Devumi is usually at the top of the results. That’s because the head of the company has a background in search engine optimization: getting content to rank high on search engine results.
Their website talked about getting real followers. That was interesting to us after studying Twitter’s trending algorithms because it was clear that if you could get a lot of people to respond to your tweets or use your hashtag in a short period of time, you could start trending, then away you go. Buying followers seemed like maybe the easiest way to do that.
To see for ourselves, we set up a Twitter account called @RosieTuring and bought 2500 followers from Devumi. These accounts looked amazing. Everything matched up, from the profile picture, to the header image, to the bios. For example, the bio would say, "I'm a ski enthusiast" and the background picture would show the same person from the profile photo skiing. Sometimes they would even have links to Instagram or Facebook with even more images that matched. I thought, "Wow, someone's clearly put a lot of effort into creating these bots." It was crazy!
You say bots rather than profiles. How did you conclude these weren’t real Twitter users?
At first, I thought maybe they were real people, but the economics didn’t make sense. We weren’t paying enough for 2500 people to create profiles and tweet on our behalf. They’d be getting pennies. I was very suspicious.
As we looked closer, we noticed something strange in many of our followers' usernames. If the name was, say, Brian, the "i" would be replaced with a "1" to create "Br1an." This type of morphing happened often enough that we began to question it. I tried turning the "1" back into an "i" and ended up on what appeared to be the same profile, or mostly the same. The profile pictures had been shifted a pixel in each direction and the colors changed slightly. This meant that the simplest algorithm, one comparing images pixel by pixel, would conclude these were different images.
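A toy example of the weakness described above: a naive duplicate detector compares images pixel by pixel, so shifting a cloned profile picture by a single pixel defeats it. The tiny grayscale grids here are made up for illustration.

```python
# Sketch of why naive pixel-by-pixel comparison misses near-duplicates.
# Images are represented as lists of rows of grayscale values.

def pixels_identical(img_a, img_b):
    """Naive comparison: images match only if every pixel is equal."""
    return img_a == img_b

def shift_right(img, fill=0):
    """Shift every row one pixel to the right, padding on the left."""
    return [[fill] + row[:-1] for row in img]

original = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]
cloned = shift_right(original)  # same picture, nudged one pixel

print(pixels_identical(original, original))  # True
print(pixels_identical(original, cloned))    # False: the clone evades
```

Catching these clones reliably would require a perceptual comparison that tolerates small shifts and color changes, which is exactly the extra effort the bot makers were betting no one would apply.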
Essentially, with a few slight changes to the names and images, a brand new account had been cloned from a real user account. Then these new accounts would start out tweeting a little bit, usually retweeting something random, before they started tweeting Devumi client content. Sometimes they would mix in stuff from the president or other high-profile users so it looked more legitimate.
When did you know you had a story?
At this point, we thought we had a story about the theft of people's online identities. We weren't completely clear on what the law in this area was, but it didn't seem right. After the story was published, the New York state attorney general said this kind of impersonation is illegal and that his office would pursue companies that do it.
How did you go back and forth between computation and traditional reporting?
Everything we do in this class is a back and forth. We use a tool called the Jupyter Notebook, which allows students to enter code, see the output, make some notes, type some more code, and keep going back and forth between what they think is happening and trying it out to see if it holds up. For example, when we noticed the 1s where there should have been i's, we used code to analyze the entire batch of 2500 profiles, find the commonalities, and reverse engineer the names to find the real profiles and see which bios matched. At that point, we knew we had something interesting, but weren't quite sure what it was yet.
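The de-obfuscation step Hansen describes might look something like this in a notebook cell. Only the "1" for "i" swap comes from the interview; the other substitutions are illustrative guesses at similar character swaps, and the handles are made up.

```python
# Sketch of reversing leetspeak-style username morphing to recover
# candidate originals. Only "1" -> "i" is documented in the interview;
# the rest of the mapping is hypothetical.
SUBSTITUTIONS = {"1": "i", "0": "o", "3": "e", "5": "s"}

def candidate_original(username):
    """Lowercase the handle (for matching) and undo digit swaps."""
    return "".join(SUBSTITUTIONS.get(ch, ch) for ch in username.lower())

# Hypothetical bot handles from a purchased-follower batch.
bot_handles = ["Br1an", "Jess1ca", "R0bert"]
for handle in bot_handles:
    print(handle, "->", candidate_original(handle))
```

Each candidate could then be looked up on Twitter and its bio compared against the bot's, which is how the class confirmed the accounts were clones of real users.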
How did you move from your data findings to being published in the New York Times?
We continued investigating the other angles of the story and then at the NICAR conference, I ran into my friend at The Times, Gabriel Dance, and I told him about what we'd found. He was already interested in bots, and The Times wanted to explore social media manipulation, so we decided to pool our efforts.
I shared the idea with the class and emphasized that The Times might be interested, but they would have to pitch it. So, we had the students write their pitch and send it to The Times to see what happened.
How long was the reporting process after the class started working with The Times?
It was a very long process. We submitted the pitch back in May 2017 and the story was published in January 2018.
As The Times started to dig into the story, it turned out the head of Devumi was allegedly not a very nice boss. Employees were happy to talk, and as we learned more about how the company worked, it became clear that Devumi doesn't actually make any bots. They buy them from third parties on a site called BlackHat World, where the cost is a fraction of what Devumi sells them for.
The more we dug, the more we found. Researchers at The Times looked at the Internet Archive to see how the resume of the head of the company had changed over the years. He had advertised attending various universities, graduating in 2000, and earning a degree from M.I.T. that the institute doesn't offer. In reality, he was 27 years old, and the company's address in Manhattan doesn't actually exist. The company is in West Palm Beach, Florida.
The Times also found a lawsuit involving Devumi. The company had exported its software to places like the Philippines, and one of its employees there had allegedly used a list of Devumi accounts to create a new company called DevumiBoost.
The story kept morphing and so it took quite a while.
After the story around Devumi became clearer, what affected parties did you reach out to?
It was one thing to speak to people who had their identity stolen. Eventually, harder conversations had to happen with Twitter and with Devumi clients.
We identified a number of artists, politicians, athletes and other high-profile people who appeared to be clients, people like Hilary Rosen, a CNN commentator, or Paul Hollywood, host of the Great British Baking Show. We had to verify what we'd found before contacting them, so those conversations didn't come until the end; reaching out earlier would have disrupted the system we were observing.
In the case of Paul Hollywood, he took his account down after The Times contacted him.
In your analysis of the reporting, what is the biggest issue?
One of the biggest issues was that Twitter wasn't incentivized to do anything about the problem. We found that in many cases the real accounts were inactive while the bots were tweeting quite frequently. So if Twitter got rid of the bots, its engagement numbers would decline.
After our work, we found it easy to spot the fake accounts. Meanwhile, Twitter had access to every sign-up and every action that happens on its platform, and could have taken action earlier.
Describe the skills that students need to be successful in your computational journalism class.
Students just have to be good journalists and open to seeing code as a tool that will help their reporting. The school provides a lot of resources to help them move beyond the code and to help if they get stuck. Come looking for help, and have faith that you'll get there.
They're not going to be expert programmers after a few weeks, but they'll be exposed to a range of coding tools that will help them be great collaborators. They'll be able to prototype something that will help inform their reporting.