Thing-a-day #8: Python script for Twitter data collection

Daniel Ginsberg is a doctoral student in Linguistics at Georgetown University, Washington, DC (2010 – present), with a sociolinguistics concentration. You can read more about Daniel’s research by following his blog and on Twitter @NemaVeze


Edit: this post originally misrepresented Naomi Barnes’ take on the CMC literature; corrected version below.

One of my term projects this semester is a seminar paper in forensic linguistics that involves statistical authorship attribution — that is, given a piece of text, can you train a computer to figure out who wrote it? I’m trying to replicate Rico-Sulayes (2011) but where he used forum posts, I’m planning to use tweets.

Why tweets? Well, there are actual forensic contexts where someone sends a text message, and the actual identity of the sender of the message is questionable. A teenager goes missing and then their parents get a text that says “Don’t worry, everything’s fine,” and the question is whether it’s really from their kid — something like that. There are various methods that investigators use to try and figure this out, but it’s difficult to replicate experimentally because people don’t like to share their text messages. So how do you conduct research to improve investigative methods?

I’m hoping to use tweets as a publicly available substitute for texts. They’re of a similar length, which is one of the big problems in quantitative authorship attribution — computational linguists like big data sets, but criminal investigators often get only very small texts to work with. There are differences, of course, but from the methodological perspective of comparing a questioned message to a reference corpus of messages of known authorship, it may not matter so much.
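The core comparison — scoring a questioned message against each candidate author's reference corpus — can be illustrated with a toy sketch. To be clear, this is not Rico-Sulayes's method: the texts, the character-trigram features, and the cosine similarity measure below are my own illustrative choices, not anything from the paper.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Character n-gram counts, a common feature choice for very short texts."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two Counter feature vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical reference texts for two candidate authors
author_a = "cant wait for the weekend!! lol see u there"
author_b = "I cannot attend the meeting; please reschedule."
questioned = "lol cant make it, see u later"

scores = {
    "author_a": cosine(char_ngrams(questioned), char_ngrams(author_a)),
    "author_b": cosine(char_ngrams(questioned), char_ngrams(author_b)),
}
best = max(scores, key=scores.get)  # the candidate whose style is closest
```

A real study would use a trained classifier over many features rather than a single similarity score, but the shape of the problem — one questioned text, several known-author corpora — is the same.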

This is my second time collecting social media data for a project. (A lot of researchers start off expecting, as sociologist Naomi Barnes writes of her own experience, that “Not once [would] you find a study that actually believed Facebook status updates are worthy pieces of data,” but as she found, the literature is deep; linguistic-ethnographic studies of computer-mediated communication go back over a decade.) The first time I was studying blog comments, and I cut and pasted them by hand into a text file. This time, I thought I could be a little more sophisticated. So I put together this Python script (thanks to Liz Merkhofer for putting me on track):

import twitter
import random

# unauthenticated access worked under the old Twitter API v1;
# later API versions require OAuth credentials here
api = twitter.Api()

phdchat = api.GetSearch(term='%23phdchat', per_page=100)
# I'm going to collect tweets from contributors to the #phdchat hashtag

allnames = sorted(set(tweet.user.screen_name for tweet in phdchat))
names = random.sample(allnames, 10)
# this randomly chooses ten of the people who posted the last 100 tweets on #phdchat

corpus = {}
for name in names:
    corpus[name] = api.GetUserTimeline(id=name, count=200)
# this pulls as many tweets as I can get from each person in my list of names

It worked, as far as it goes, but it still needs some thinking. I’m not sure #phdchat is the right search term to use, to ensure rough comparability of tweets across authors. (I don’t want to compare, say, Kim Kardashian to Horse ebooks, because it might be too easy a computational problem, and not representative of the forensic situation.) For example, my script did pull in mainly real grad students, but it also included The Guardian Higher Education Network. The Warwick Institute of Advanced Study also got pulled in for having retweeted a @GdnHigherEd post on #phdchat. So I’ve got some more work to do.
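One way to start on that cleanup would be a filter over the search results before sampling authors: drop retweets (so institutions retweeting #phdchat posts aren't counted as contributors) and drop accounts whose follower counts suggest an organization rather than an individual. The sketch below works on plain dicts standing in for tweet objects, and the follower threshold is an arbitrary guess on my part, not a validated cutoff.

```python
ORG_FOLLOWER_THRESHOLD = 20000  # hypothetical heuristic, not a principled value

def likely_individuals(tweets):
    """Return screen names of users who posted original (non-RT) tweets
    and don't look like large organizational accounts."""
    names = set()
    for t in tweets:
        if t["text"].startswith("RT @"):
            continue  # skip retweets: the retweeter didn't author this text
        if t["followers_count"] > ORG_FOLLOWER_THRESHOLD:
            continue  # probably an institution or media outlet
        names.add(t["screen_name"])
    return sorted(names)

# toy stand-ins for search results, mirroring the cases described above
sample_tweets = [
    {"screen_name": "grad_student1",
     "text": "Writing chapter 3 today #phdchat", "followers_count": 300},
    {"screen_name": "GdnHigherEd",
     "text": "Our latest feature #phdchat", "followers_count": 150000},
    {"screen_name": "WarwickIAS",
     "text": "RT @GdnHigherEd: Our latest feature #phdchat", "followers_count": 5000},
]
```

Here `likely_individuals(sample_tweets)` keeps only the grad student, filtering the media account by size and the institute by its retweet. Neither heuristic is airtight — a popular academic would be wrongly dropped — so manual inspection of the sampled names would still be needed.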


2 Responses

  1. […] Cross-posted at SocPhD […]

  2. Thanks for sharing this.

    I am really new to programming – in fact, roadblocks in my PhD research, stemming from the fact that I didn’t know how to program and couldn’t retrieve all the data I wanted, have led me to take on a computer science degree part-time, as I think knowing how to code well will be invaluable for future research.

    I’m only a semester in, so still really not very good at programming. However, I’d be keen to know more about how you go about this. For instance, what python platform (app?) are you using? I’m more able to understand what I’m looking at when I look at code now, at least!

    The whole Twitter API thing confuses me, too. Any pointers for where to begin — even a good website or blog that lays it out clearly?


