My ongoing struggle with the Twitter API, R, … copy paste

I’m posting this in hopes that someone with experience in any or all of the above (or maybe Perl) can point out that I’m doing something stupid or have overlooked something obvious. If nothing else, you might read this to see what not to try.

Here’s the issue: it’s totally obvious that I need to look at the other tweets that were sent by #agu10 tweeters (the ones not marked with the hash tag) if I want to understand how Twitter was used at the meeting. But it’s now five months later and there are 860 of them (although I would be fine with looking at the most prolific non-institutional tweeters).

I first looked at the Twitter API itself: I tried just adding terms to URLs and got the recent timeline for one user at a time, but I couldn’t see a way to get a user’s timeline for a set period of time (the conference time period plus a week or so on each end).

I asked two experts and they both said that you couldn’t combine the user timeline with a time period.

Darn. So my next idea was to see if I could actually access someone’s timeline that far back through the regular interface. I tried one of the more prolific tweeters and I could. Ok, so if I can pull down all of their tweets, then I could pick out the ones I wanted. Or, even better, I could also look at the evolution of the social network over time. Did people meet at the meeting and then continue to tweet at each other or are these people only connected during the actual meeting?  Did the network exist in the same way before the meeting?

I was looking for ways to automate this a bit, and I noticed that there were things already built for Perl and for R. I used Perl (with a lot of handholding) to get the commenter network for an earlier paper, and I used R for both that same article and in lieu of Stata for my second semester of stats. I’m not completely comfortable with either one, and I don’t always find the help helpful. I decided to start with R.

The main package is twitteR by Jeff Gentry. I updated my installation of R and installed and loaded that package and the dependencies. First thing I did was to get my own standard timeline:

testtweets <- userTimeline("cpikas")

Then I typed out the first few to see what I got (like when you’re using DIALOG), and I saw my tweets in the format:

[1] "username: text"

I checked the length of that and got 18 – the current timeline was 18 items. I tried the same thing substituting the user ID, but that didn’t work. So then I tried to retrieve 500 items, and that worked fine, too.

testlonger <- userTimeline("cpikas", n = 500)

Great. Now let me see the dates so I can cut off the ones I want. Hm. OK, how do I get at the other columns? What type of object is this, anyhow? The manual is no help. I tried some things with object$field. No joy. I tried to edit it. No joy: it was upset about the < in the image URL, and it was also telling me the object was of type S4. The manual said it wasn’t, but I can’t argue if that’s what R is reading. I somehow figured out it was a list. I tried objectname[[1]][2]. NULL. Then I eventually tried

Hrumph. It says 1 slot. So as far as I can tell, it’s a deprecated object type, and it didn’t retrieve or keep all of the other information needed to narrow by date.
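For anyone else stuck at this point: base R does have introspection tools that answer the “what type of object is this” question. A quick illustration on a plain list (a stand-in here, not the actual twitteR object):

```r
# Base-R introspection: these calls work on any object
x <- list(author = "cpikas", text = "hello world")
class(x)    # what kind of object it is -- here, "list"
length(x)   # how many elements it holds -- here, 2
names(x)    # the element names, if any -- here, "author" "text"
str(x)      # compact dump of the whole structure: types and values
```

str() in particular would have shown whether the date field was populated at all.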

When googling around, I ran across this entry by Heuristic Andrew on text mining Twitter data with R. I haven’t tried his method with the XML package yet (I may). I did try the package listed in the comments, tm.plugin.webcorpus by Mario Annau. That does get the whole tweet and puts the pieces in slots the right way (object$author), but it looks like you can only do a word search. Oh wait, this just worked:

testTM <- getMeta.twitter('from:cpikas')

But that’s supposed to default to 100 things returned per page, and it only returned 7 for me. I guess the next thing to try is the XML version, unless someone reading this has a better idea?

Edit: I forgot the copy-paste part. When I tried to just look at the tweets I wanted on screen and then copy them into a text document, it crashed Firefox. Who knows why.


18 responses so far

  • Bob O'H says:

    What type of object is this anyhow? The manual is no help.

    Oh, it is. ?userTimeline gives the help page, and under Value it says "A list of ‘status’ objects". So each tweet is a "status" object, and the whole thing is a list of these.

    What's a status object? Well, the same help page lists these under "See Also":

    ‘getUser’, ‘status’, ‘registerTwitterOAuth’

    So use ?status. At this point things start to get fun. The way to access the elements of a status object is a bit non-standard: there's a pile of get functions (e.g. getScreenName(), getCreated()), which are used like this:

    > testtweets[[1]]$getScreenName()
    [1] "cpikas"

    So, now let's try to get the time the tweet was created:

    > testtweets[[1]]$getCreated()
    [1] NA

    Bugger. So either the field doesn't exist (but I think it would give a NULL), or it exists but isn't filled properly.

    The solution is to email Jeff Gentry (or tweet him!), as he's the package maintainer.

    • Christina Pikas says:

      Ah, so it's supposed to be there, but it's not so it should be reported as a bug? I'll do that. Thanks!

    • Christina Pikas says:

      Oh wait - I just tried >testtweets[[1]]$getCreated() and it worked! yay.

      • Bob O'H says:

        What, you got a time? Damn, I'm only getting an NA.

        Ah well, if it's working, that's the main thing. It might be a good tactical move to ask Jeff Gentry further questions - it might encourage him to work on it a bit more (e.g. writing extractor functions).

        • Christina Pikas says:

          > testtweets[[1]]$getCreated()
          [1] "2011-05-14 01:25:41 UTC"

          I'm pretty ignorant about R. what would extractor functions get me?

          • Bob O'H says:

            The bits of the object you want. Functions like coef(): they should be more convenient, and will also be documented.

            (I'm not sure extractor function is the right term)
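            Something like this, maybe (a toy sketch with a made-up stand-in class, since the point is just pulling one field out of every element of the list):

            ```r
            library(methods)  # for setRefClass

            # Hypothetical stand-in for twitteR's status objects
            Status <- setRefClass("Status",
              fields  = list(created = "character"),
              methods = list(getCreated = function() created))

            statuses <- list(Status$new(created = "2011-05-14 01:25:41 UTC"),
                             Status$new(created = "2011-05-14 02:00:00 UTC"))

            # The extractor: one call pulls the same field from every element
            getCreatedAll <- function(xs) sapply(xs, function(s) s$getCreated())
            getCreatedAll(statuses)
            # [1] "2011-05-14 01:25:41 UTC" "2011-05-14 02:00:00 UTC"
            ```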

  • When you do get this sorted out, Christina, I hope you'll write more about the assumptions, mistakes, and bugs that caused problems along the way. In my experience some combination of the 3 are where I get stuck. And this looks like a package that might be useful to me in the future!

  • [...] lots of help from Bob O’Hara (thank you!), I was able to solve my problems. I am looking at the tweets around #AGU10 but it occurred to me that I wanted to know what other [...]

  • Jeff Gentry says:

    Hi Christina ...

    Just came across this via some deep googling. It sounds as if you got what you needed to get, but to echo the comments that Bob makes, please do contact me whenever you have questions, suggestions or anything else. Speaking of, even things that don't make sense in the documentation, pass those comments along as well.

    As interest/usage has grown, I've been trying to transition the whole package from something that was just a goofy experiment on my part into something that's more robust, usable, etc. That transition hasn't been without warts, but user feedback is the best way for me to know where those warts are 🙂


    • Christina Pikas says:

      Thanks for your reply! I'm definitely not an expert R user so more information in the documentation would help me.

      • Jeff Gentry says:

        Out of curiosity, did you read through the vignette? I'm curious if things still didn't make sense after that or if it was purely the man pages. When available (and with CRAN packages, that's not often), the package vignettes can be a great resource. They can also be useless, which might be possible in my case, I dunno 🙂

        • Christina Pikas says:

          I did read through the vignette. I just couldn't figure out how to pull out a piece of the status object. It might be me though. In the blog post after this I wrote what I finally did:

          • Jeff Gentry says:

            Ok, that's good to know. Thanks Christina.

            In that post - on the bit about varying return lengths - the 'n' is supposed to be a maximum (something I know isn't made clear enough in the documentation) and isn't necessarily what you'll get. You shouldn't get *more* than 'n' though. I've been working to make the underlying communication w/ the twitter servers more uniform so things like that should be fixed (or will be soon).

  • Kirk Mettler says:

    I was going to recommend that you contact Jeff (Pappa Smurf) Gentry, but I see he already contacted you. The R community in general and the package creators in particular are a really helpful and responsive bunch. If you are interested in more twitter sentiment stuff, the Boston Predictive Analytics meetup just did three talks on the subject and posted the slides on their site, which includes a tutorial by Jeffery Breen.

    • Christina Pikas says:

      Thanks for the pointer - looks interesting and helpful! I didn't want to irritate the community with what I thought was a stupid question, but I've since gotten offers of help from some co-workers who use R. You all are a helpful bunch 🙂

  • hey, it looks like you need to

    x <- userTimeline('friend')
    x.df <- twListToDF(x)
    times <- x.df$created
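    Then, to keep just the conference window, filter on that created column. A toy example (these dates just stand in for the AGU10 period plus a week on each end):

    ```r
    # Toy data frame in the same shape twListToDF would give: text + created
    x.df <- data.frame(
      text    = c("before", "during", "after"),
      created = as.POSIXct(c("2010-12-01", "2010-12-15", "2011-02-01"), tz = "UTC"),
      stringsAsFactors = FALSE
    )

    # Keep only rows whose timestamp falls inside the window
    window <- x.df[x.df$created >= as.POSIXct("2010-12-06", tz = "UTC") &
                   x.df$created <= as.POSIXct("2010-12-24", tz = "UTC"), ]
    window$text
    # [1] "during"
    ```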
