Current plan - possibly a bad one - for older Twitter stuff

Aug 01 2013 Published by under dissertation, information analysis

I'm hesitant to post this because I'm not sure how it rates in the terms of service, but here's what seems to be working.

When last I posted, I had overcome some OAuth hurdles to successfully pull tweets using the API, only to find that the tweets I wanted (#AGU11, #AGU2011, #AGU12, #AGU2012) were too old to be covered by the API. Crap on a banana, as my old boss would say.

I did find that if you do a search in the web interface and scroll a whole bunch, you can actually get all the tweets from that far back on the screen. So I did that and copied the results into Excel.

Unfortunately I ended up with something like this:

[image: twitterraw]

There are 5 lines for each tweet and only two have any text I want. I also would like fields and I've just got two long messes. What to do?

OpenRefine to the rescue, sorta. OpenRefine used to be called Google Refine. It just helps with data munging. I kinda didn't really get it until I tried it for this, because I assumed I could do all this just as easily in Excel, but that's not the case.

I'm tempted to actually post the JSON file so the steps could be repeated, but so far I haven't found that I can create a clean enough script to run from one end to the other. Nevertheless, I'm willing to post it if there's interest. Here's what I've done so far:

  • Upon import, got rid of blank rows
  • transposed to 3 columns
  • deleted the column with "expand", "view photo", or "view conversation" in it
  • split the first column at @ to get a new column with just the twitter user name
  • split the twitter user name column at the first space to get just the twitter user name and a new column with the date*
  • copied the tweet content to a new column
  • filtered for RT OR MT and starred those - might do something with that later... probably will omit for most of the work
  • split at @, for each of those columns (up to 6!)
  • for each of those columns, split at space and limited to 2 columns, then deleted the second column. I tried some GREL fanciness that would have done this in one shot, and the preview looked good, but it said something about the result being an object that couldn't end up in a cell.
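
For anyone who would rather script this than click through it, here is a rough Python sketch of the same steps. It is not what I actually ran. It assumes each tweet pastes as three non-blank lines (a "Name @handle date" line, the tweet text on a single line, then an "Expand" / "View photo" / "View conversation" line), and the function and field names are made up for illustration.

```python
import re

# Assumed layout of the pasted text once blank rows are gone:
#   line 1: "Display Name @handle date"
#   line 2: tweet text (assumed to sit on one line)
#   line 3: "Expand", "View photo", or "View conversation"
WIDGET_LINES = {"expand", "view photo", "view conversation"}
MENTION = re.compile(r"@(\w+)")      # grabs the handle after each @
RETWEET = re.compile(r"\b(RT|MT)\b")  # flags retweets / modified tweets


def parse_tweets(raw_text):
    """Turn the pasted search results into one dict per tweet."""
    # Drop blank rows and the expand/view lines.
    lines = [ln.strip() for ln in raw_text.splitlines()]
    lines = [ln for ln in lines if ln and ln.lower() not in WIDGET_LINES]

    tweets = []
    # What is left should alternate: header line, tweet line.
    for header, text in zip(lines[0::2], lines[1::2]):
        # Split the header at the first @ to separate name from handle.
        name, _, rest = header.partition("@")
        # Split at the first space: handle, then whatever date text is left.
        handle, _, date = rest.partition(" ")
        tweets.append({
            "name": name.strip(),
            "handle": handle,
            "date": date.strip(),
            "tweet": text,
            "is_rt_or_mt": bool(RETWEET.search(text)),
            "mentions": MENTION.findall(text),
        })
    return tweets
```

The regular expression pulls every @mention out of the tweet in one pass, which stands in for the split-at-@ then split-at-space routine. If a tweet wraps onto more than one line, or a name itself contains an @, this will misalign and would need more care.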

So here are my columns:

Name, Twitter handle, date, tweet, first@, second@, third@....

So it would be quite easy to convert this to one of the UCINET file types that has you list the nodes that are connected. I'll do that as soon as I've done this for the other files and combined the 2011 files and the 2012 files.
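
To make that concrete, here is a minimal Python sketch of the conversion, assuming each row has been boiled down to a handle plus the list of handles it mentions. The function names are hypothetical, and the header lines are my understanding of the DL "edgelist1" layout, so check them against the UCINET documentation before relying on this.

```python
def mention_edges(rows):
    """Return (sender, mentioned) pairs, one per @mention.

    rows: iterable of (handle, [mentioned handles]) tuples.
    Handles are lowercased so @AGU and @agu become the same node.
    """
    edges = []
    for handle, mentions in rows:
        for target in mentions:
            edges.append((handle.lower(), target.lower()))
    return edges


def to_dl_edgelist(edges):
    """Format the pairs as text in the (assumed) DL edgelist1 layout."""
    nodes = sorted({node for edge in edges for node in edge})
    lines = [
        f"dl n={len(nodes)} format=edgelist1",
        "labels embedded",
        "data:",
    ]
    lines += [f"{src} {dst}" for src, dst in edges]
    return "\n".join(lines) + "\n"
```

Something like `to_dl_edgelist(mention_edges([("jdoe", ["AGU", "bob"])]))` would give a three-node file with one line per connected pair. People who tweet without mentioning anyone drop out of the network entirely here, which may or may not be what you want.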

*I'm missing a lot of what I would have gotten easily with the old API, like a full date and time stamp, id, geocodes, what software was used, mentions, etc.
