Archive for the 'dissertation' category

So... um... what if I'm still enjoying it?

Feb 05 2015 Published under dissertation

Am I supposed to kind of hate my dissertation topic by now? If I don't, does that mean I'm not working on it hard enough (maybe)? I'm doing it wrong? Maybe it's a phase and it will pass.

Making progress. Learning new stuff from my data. Feeling horribly inadequate when watching tweets fly by from another doctoral student dissertating on how scientists use blogs.... (holy moly how many scientists did she actually interview? hundreds? cray-cray... or am I a hater?)

Working on it every chance I get - taking a morning off every week. Staying up late... I will have to add more time off. If only we could afford more childcare!

2 responses so far

Trying another CAQDAS, MaxQDA

Jul 12 2014 Published under dissertation, information analysis

CAQDAS: Computer Assisted Qualitative Data Analysis Software.

Previously, I'd used NVivo and I found it to be miserable and horribly expensive (I didn't pay, but I would have to this time). I really did most of my work offline with colored pencils, note cards, etc. I am fully willing to admit that I didn't know how to use it or that maybe I was doing it wrong, but instead of saving me time, it was costing me time.

I started trying to use it with my dissertation data and ew. It's fine with my interview transcripts, but tweets and blog posts, just yuck. So then I started coding just using Excel and meh.

So back to the drawing board. I read a bunch of reviews of different products on Surrey's pages, but it's really hard to tell. I also started looking at prices and oh la la! I was thinking maybe Dedoose, despite their earlier issues, but that's like at least $10/month. Not bad until you think you might do this for a while.

After all that MaxQDA - a German product - seemed like a slightly better choice than some. The student license doesn't expire and is for the full product (they have a semester one, but that doesn't make sense for me) - that's probably the biggest selling point.

So far so good. Importing about a thousand blog posts as individual documents was no big deal. Interview transcripts were even quicker. Adding in my codes was super quick and it was super quick to move one when I stuck it in the wrong hierarchy. I think I'll work with this data for a bit before doing the twitter archives - particularly since I don't know how I might sample.

I'm still on the 30-day trial. Not sure if I need to try to start paying for it with a week or so to spare so the verification can be completed. My university doesn't have an expiration date on our IDs. Not sure if my advisor has to send an e-mail or what.

Yeah, there's something for R (of course), but it doesn't have the features. I was still thinking I might do some more machine learning and other tricks with my data using R which is easy now that it's all in spreadsheets.

One response so far

Quick note: Now on GitHub

Jul 12 2014 Published under dissertation, information analysis

Scripts mentioned previously are now on GitHub with an MIT license which should hopefully be permissive enough. I can't say that anyone would want to use these, but this also backs up my code even better which is useful.

I'm using RStudio so probably if/when I do more analysis in R, I'll just use Git from there.

The url is, unsurprisingly:

No responses yet

Getting blog post content in a usable format for content analysis

Jul 04 2014 Published under dissertation

So in the past I've just gone to the blogs and saved down posts in whatever format to code (as in analyze). One participant with lots of equations asked me to use screenshots if my method didn't show them adequately.

This time I have just a few blogs of interest and I want to go all the way back, and I'll probably do some quantitative stuff as well as just coding at the post level. For example just indicating if the post discusses their own work, other scholarly work, a method (like this post!), a book review... , career advice, whatever. Maybe I'll also select some to go deeper but it isn't content analysis like linguists or others do at the word level.

Anyway, lots of ways to get the text of web pages. I wanted to do it in R completely, and I ended up getting the content there, but I found Python to work much better for parsing the actual text out of the yucky tags and scripts galore.

I had *a lot* of help with this. I lived on StackOverflow, got some suggestions at work and on friendfeed (thanks Micah!), and got a question answered on StackOverflow (thanks alecxe!). I tried some books in Safari but meh?

I've had success with this on Blogger and WordPress blogs. Last time, when I was customizing a Perl script to pull the commenter urls out, every blog was so different from the others that I had to do all sorts of customization. These methods require very little change from one to the next. Plus I'm working on local copies when I'm doing the parsing, so hopefully having as little impact as possible (now that I know what I'm doing - I actually got myself blocked from my own blog earlier because I sent so many requests with no user agent).

So using R to get the content of the archive pages. Largest reasonable archive pages possible instead of pulling each post individually, which was my original thought. One blog seemed to be doing an infinite scroll but when you actually looked at the address block it was still doing the blogurl/page/number format.  I made a csv file with the archive page urls in one column and the file name in another. I just filled down for these when they were of the format I just mentioned.
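The fill-down step could also be scripted. Here is a minimal sketch in Python, assuming the blog uses the blogurl/page/number pattern described above. The blog.example.com address, the function names, and the archivep file-name prefix are all placeholders for illustration.

```python
import csv
import io


def archive_rows(blog_url, last_page, prefix="archivep"):
    """Return (url, file name) pairs for archive pages 1..last_page."""
    base = blog_url.rstrip("/")
    return [(f"{base}/page/{n}", f"{prefix}{n}") for n in range(1, last_page + 1)]


def write_rows(rows, handle):
    """Write the pairs as a two-column csv: url, then file name."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["url", "fileName"])
    writer.writerows(rows)


# Example: three archive pages of a made-up blog, written to an in-memory file.
rows = archive_rows("http://blog.example.com/", 3)
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

Passing an open file instead of the in-memory buffer would produce the same two columns on disk.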

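Read them into R. Then had the function:
library(RCurl)
getFullContent <- function(link, fileName) {
 UserAgent <- "pick something"
 # pass the variable itself, not the quoted string "UserAgent"
 temp <- getURL(link, timeout = 8, ssl.verifypeer = FALSE, useragent = UserAgent)
 nameout <- paste(fileName, ".htm", sep="")
 write (temp,file=nameout)
}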

I ended up doing it in chunks. If you're calling this function with one page, it's like:

getFullContent("","archivep1" )

More often I did a few:


So I moved the things around to put them in a folder.
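For comparison, the download step could be sketched in Python too. This is only a rough equivalent of the R function above, not what I actually ran. The function names and the pause length are assumptions, and unlike the R call it leaves SSL verification switched on. The two things it shows are sending a user agent and pausing between requests.

```python
import time
import urllib.request

# Placeholder: identify yourself so the blog's host can see who is asking.
USER_AGENT = "pick something"


def build_request(link, user_agent=USER_AGENT):
    """Attach a User-Agent header; requests without one may get blocked."""
    return urllib.request.Request(link, headers={"User-Agent": user_agent})


def get_full_content(link, file_name, timeout=8):
    """Download one archive page and save it as file_name.htm."""
    with urllib.request.urlopen(build_request(link), timeout=timeout) as response:
        html = response.read()
    nameout = file_name + ".htm"
    with open(nameout, "wb") as out:
        out.write(html)
    return nameout


def fetch_all(rows, delay_seconds=5):
    """Fetch each (url, file name) pair, pausing between requests."""
    for link, file_name in rows:
        get_full_content(link, file_name)
        time.sleep(delay_seconds)
```

Calling fetch_all() with the rows from the csv file would save one .htm file per archive page in the working directory.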

Then this is the big help I got from StackOverflow. Here's how I ended up with a spreadsheet.

from bs4 import BeautifulSoup
import os, os.path

# from
# this is the file to write out to
posts_file = open ("haposts.txt","w")

def pullcontent(filename):

    soup = BeautifulSoup(open(filename))
    posts = []
    for post in soup.find_all('div', class_='post'):
        title = post.find('h3', class_='post-title').text.strip()
        author = post.find('span', class_='post-author').text.replace('Posted by', '').strip()
        content = post.find('div', class_='post-body').p.text.strip()
        date = post.find_previous_sibling('h2', class_='date-header').text.strip()

        posts.append({'title': title,
             'author': author,
             'content': content,
             'date': date})
    #print posts
    posts = str(posts)
    posts_file.write (posts)

# this is from

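for filename in os.listdir("files"):
    # parse each saved archive page in the "files" folder
    pullcontent(os.path.join("files", filename))

posts_file.close()
print ("All done!")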


So then I pasted it into Word, put in some line breaks and tabs, and pasted into Excel. I think I could probably go from that file or the data directly into Excel, but this works.
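Going straight to something Excel can open is probably a matter of writing a csv instead of a stringified list. A minimal sketch, assuming each post is a dictionary with the same four keys as in the script above. The write_posts name and the example post are made up.

```python
import csv
import io

FIELDS = ["title", "author", "content", "date"]


def write_posts(posts, handle):
    """Write one row per post; the csv module quotes commas and line breaks."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(posts)


# Made-up example post, just to show the shape of the data.
example_posts = [
    {"title": "A post, with a comma",
     "author": "Someone",
     "content": "First line.\nSecond line.",
     "date": "Friday, July 4, 2014"},
]
buffer = io.StringIO(newline="")
write_posts(example_posts, buffer)
```

With a real file, opening it with newline="" and encoding="utf-8-sig" usually helps Excel read accented characters correctly.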

Really very minor tweaking between blogs. Most I don't actually need an author for but I added in the url using something like this:

url = post.find('h2').a.get('href')

The plan is to import this into something like NVivo or ATLAS.ti for the analysis. Of course it would be very easy to load it into R as a corpus and then do various text mining operations.

No responses yet

Using R TwitteR to Get User Information

I'm gonna keep stating the obvious, because this took me a few hours to figure out. Maybe not working continuously, but still.

So, I have like more than 6000 tweets from one year of AGU alone, so I'm gonna have to sample somehow. Talking this over with my advisor, he suggested that we have to find some reasonable way to stratify and then sample randomly within each stratum. I haven't worked all the details out yet - or really any of them - but I started gathering user features I could base the decision on. Number of tweets with the hashtag was super quick in Excel. But I was wondering if they were new to Twitter, if they tweeted a lot, and if they had a lot of followers. That's all available through the API using the twitteR package by Jeff Gentry. Cool.
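To make the stratify-then-randomize idea concrete, here is a minimal sketch in Python. The strata (how often someone tweeted with the hashtag), the cutoffs, and the sample size per stratum are all invented for illustration, since none of those decisions have been made yet.

```python
import random
from collections import defaultdict


def stratified_sample(users, stratum_of, per_stratum, seed=2012):
    """Pick up to per_stratum users at random from each stratum."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    strata = defaultdict(list)
    for user in users:
        strata[stratum_of(user)].append(user)
    sample = {}
    for name, members in strata.items():
        k = min(per_stratum, len(members))
        sample[name] = rng.sample(members, k)
    return sample


def by_tweet_count(user):
    """Hypothetical strata: how often the user tweeted with the hashtag."""
    n = user["hashtag_tweets"]
    if n == 1:
        return "once"
    if n <= 10:
        return "a few"
    return "heavy"


# Made-up users: u0 has 1 tweet, u1 has 2, ... u29 has 30.
users = [{"name": f"u{i}", "hashtag_tweets": i + 1} for i in range(30)]
sample = stratified_sample(users, by_tweet_count, per_stratum=5)
```

Account age or follower count could be swapped in by passing a different stratum_of function.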

So getUser() is the function to use. I made up a list of the unique usernames in Excel and imported that in. Then I went to loop through.

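library("twitteR", lib.loc="C:/Users/Christina/Documents/R/win-library/3.0")
#get the data (data holds the imported usernames in a user column)
#the function body is a best-guess reconstruction
userInfo<-function(USER){
 USERdata<-vector()
 temp<-getUser(USER, cainfo="cacert.pem")
 USERdata<-c(USER, as.character(temp$created), temp$statusesCount, temp$followersCount)
 return(USERdata)}
 #test for users 4-6 (test4to6 is a stand-in name)
 test4to6<-sapply(data$user[4:6],userInfo)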

But that was sorta sideways... I had a column for each user... sorta weird. Bob O'H helped me figure out how to transpose that and I did, but it was sorta weird.

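So then I tried this way:

#getUserRange and USER.name are stand-in names; the loop body is a best guess
getUserRange<-function(startno,stopno){
# set up the vectors first
USER.name<-USER.created<-USER.posts<-USER.foll<-vector()
for (i in startno:stopno) {
thing<-getUser(data$user[i], cainfo="cacert.pem")
USER.name[i]<-data$user[i]
USER.created[i]<-as.character(thing$created); USER.posts[i]<-thing$statusesCount; USER.foll[i]<-thing$followersCount
}
return(data.frame(name=USER.name,created=USER.created, posts=USER.posts,followers=USER.foll, stringsAsFactors=FALSE))
}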

So that was cool, until it wasn't. I mean, turns out that 2% of the users have deleted their accounts or blocked me or are private or something. So it didn't recover from that error and I tried to test with is.null() and is.na() but it failed....
So then I went back to the mailing list and there was a suggestion to use try() but eek.
So then I noticed that if you have a pile to look through you're actually supposed to use
lookupUsers(users, includeNA=FALSE, ...)
And I did, and I wanted to keep the NA so that I could align with my other data later... but once again, no way to get the NAs out. And it's an object that's a pile of lists... which I was having trouble wrapping my little mind around (others have no issues).
So I went back and used that command again, and this time said to skip the NA (the not found users). Then I think from the mailing list or maybe from Stack Overflow? I had gotten the idea to use unlist. So here's what I did then:
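easy.tweeters.noNA<-lookupUsers(data$user, cainfo="cacert.pem")
#check how many fewer this was
length(easy.tweeters.noNA)
#1247 so there were 29 accounts missing hrm
#tweeters.df is a stand-in name for the combined data frame
tweeters.df<-data.frame()
for (i in 1:1247){holddf<-twListToDF(easy.tweeters.noNA[i])
tweeters.df<-rbind(tweeters.df,holddf)}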

And that created a lovely dataframe with all kinds of goodies for it. I guess I'll have to see what I want to do about the 29 accounts.

I really would have been happier if it was more graceful with users that weren't found.
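The graceful version I was wishing for looks roughly like this, sketched in Python rather than R. The idea is to catch the error for each user and keep a placeholder row, so the results still line up with the original list. The fetch_user argument is a hypothetical stand-in for whatever does the real lookup, and fake_fetch only exists so the sketch runs without credentials.

```python
def lookup_all(usernames, fetch_user):
    """Look up every username, keeping a placeholder row for failures."""
    rows = []
    for name in usernames:
        try:
            info = fetch_user(name)
        except Exception as err:  # deleted, private, blocked, ...
            rows.append({"user": name, "found": False, "error": str(err)})
            continue
        rows.append({"user": name, "found": True, **info})
    return rows


# Stand-in for a real API call, so the sketch runs without credentials.
FAKE_ACCOUNTS = {
    "alice": {"created": "2009-03-01", "posts": 1200, "followers": 340},
    "carol": {"created": "2012-11-20", "posts": 15, "followers": 8},
}


def fake_fetch(name):
    return FAKE_ACCOUNTS[name]  # raises KeyError for missing accounts


rows = lookup_all(["alice", "bob", "carol"], fake_fetch)
missing = [row["user"] for row in rows if not row["found"]]
```

Because every input name gets exactly one row, the output can be pasted next to the other per-user columns without realigning anything.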

Also, note: for every single command you have to use the cainfo="cacert.pem" thingy... Every time, every command.

ALSO, I had figured out oauth, but the twitter address went from http:// to https:// and so that was broken, but I fixed it. I hope I don't have to reboot my computer soon! (Yeah, I saved my credentials to a file, but I don't know... )

No responses yet

Using R to expand Twitter URLS

So this should be so simple and obvious that it's not worth a post, but I keep forgetting how to do everything so I'm gonna put this here to remind myself.

Here's where I am. I have a list of 4011 tweets with the #agu12 or #agu2012 hashtag. A lot of these are coded as "pointers" - their main function is to direct readers' attention somewhere else. So I got to wondering: where? Are they directing people to online versions of the posters? Are they just linking to more NASA press releases? What percentage go to a .edu?
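Once the urls are expanded, answering those questions is mostly counting host names. A minimal sketch in Python, using made-up urls. The helper names are assumptions, and stripping a leading "www." is just one reasonable way to group hosts.

```python
from collections import Counter
from urllib.parse import urlparse


def host_of(url):
    """Return the lower-cased host name, without a leading www."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def share_ending_in(urls, suffix):
    """Fraction of urls whose host ends with suffix, e.g. '.edu'."""
    hosts = [host_of(u) for u in urls]
    if not hosts:
        return 0.0
    return sum(h.endswith(suffix) for h in hosts) / len(hosts)


# Made-up expanded urls for illustration.
urls = [
    "http://www.nasa.gov/press/release1",
    "http://www.nasa.gov/press/release2",
    "https://www.example.edu/posters/123",
    "http://blog.example.org/agu-recap",
]
host_counts = Counter(host_of(u) for u in urls)
edu_share = share_ending_in(urls, ".edu")
```

host_counts.most_common() would then list the most linked-to sites first.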

Of course all the URLs are shortened and there are services you can use to expand them, but in R, it's already right there in the twitteR package as

decode_short_url()

This uses the API . All you have to do is plug in the URL. Cool!

So here was my original plan: find the tweets with urls, extract the urls, expand them, profit! And I was going to do all this in R. But then it got a little ridiculous.
So instead I used OpenRefine to find all the urls, then assigned IDs to all the records, and then used filtering and copying and pasting to get them all in two columns: ID, URL.

Issues: non-printing characters (Excel has a clean command), extra spaces (trim - didn't really work so I did a find and replace), random commas (some needed to be there), random other punctuation (find and replace), #sign
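Most of that cleanup can be folded into the extraction step. A minimal sketch in Python, assuming the tweets are (id, text) pairs. The pattern and the list of trailing punctuation are assumptions, and a url that really does end in a comma or a period would get clipped, so the results still deserve a spot check.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
# Punctuation that often trails a url in a tweet but is rarely part of it.
TRAILING = ".,;:!?)\"'"


def extract_urls(text):
    """Find urls in a tweet, ignoring non-printing characters and trailing punctuation."""
    # Swap non-printing characters for spaces so they cannot hide inside a url.
    printable = "".join(ch if ch.isprintable() else " " for ch in text)
    return [match.rstrip(TRAILING) for match in URL_PATTERN.findall(printable)]


def id_url_rows(tweets):
    """Turn (id, text) pairs into (id, url) rows, one row per url."""
    return [(tweet_id, url) for tweet_id, text in tweets for url in extract_urls(text)]


# Made-up tweets for illustration.
tweets = [
    (1, "Poster is up: http://t.co/abc123, come by! #agu12"),
    (2, "No link here #agu12"),
    (3, "Two links http://t.co/x1 and http://bit.ly/y2\u200b."),
]
rows = id_url_rows(tweets)
```

The rows come out in the same two-column ID, URL shape described above.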

The idea in R was to do a for loop to iterate through each url, expand it, append it to a vector (or concatenate, whichever), then add that to the dataframe and do stats on it or maybe just export to Excel and monkey with it there.

For loop, fine, append - not for love nor money despite the fact that I have proof that I successfully did it in my Coursera class. I don't know. And the API was failing for certain rows. For a couple of rows, I found more punctuation. Then I found the rest of the issues were really all about length. They don't expect shortened urls to be long (duh)!  So then I had to pick a length, and only send ones shorter than that (50) to the api. I finally gave up with the stupid append, and I just printed them to the screen and copied them over to Excel. Also I cheated with how long the for loop had to be - I should have been able to just say the number of rows in the frame but meh.
Anyhow, this worked:

 setwd("~/ mine")
library("twitteR", lib.loc="C:/Users/Christina/Documents/R/win-library/3.0")
#get the data
data <- read.csv("agu12justurl.csv", colClasses = "character")
#check it out
#test a single one
#this was for me trying to append, sigh
full.vec <- vector(mode="character")
#create a vector to put the new stuff in, then I'll append to the data frame, I hope
#check the for loop 
 for (i in 1:200){print(data.sub$url[i])}
#that works
for (i in 1:3){print(decode_short_url(data.sub$url[i]))}
#that works - good to know, though, that if it can't be expanded it comes back null

#appending to the vector is not working, but printing is so will run with that 
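for (i in 1:1502){ if(nchar(data$url[i])>50){
 #too long to be a shortened url, so print it unchanged (this branch is a guess)
 print(data$url[i])
 } else {
 print(decode_short_url(data$url[i]))
 }}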

If anyone wants to tell me what I'm doing wrong with the append, it would be appreciated. I'm sure it must be obvious.
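For comparison, here is the same loop as a minimal Python sketch, where appending one result per url keeps the output aligned with the input. The expand argument is a hypothetical stand-in for decode_short_url(), and fake_expand only exists so the sketch runs offline. One possible culprit on the R side (a guess, since the failing append isn't shown) is that c(full.vec, NULL) returns full.vec unchanged, so a url that comes back null adds nothing and everything after it shifts up.

```python
MAX_SHORT_LENGTH = 50  # same cutoff as above


def expand_all(urls, expand):
    """Return one result per input url so the output lines up with the input."""
    expanded = []
    for url in urls:
        if len(url) > MAX_SHORT_LENGTH:
            expanded.append(url)  # too long to be shortened; keep unchanged
            continue
        result = expand(url)
        # Keep a placeholder when nothing comes back, so rows stay aligned.
        expanded.append(result if result is not None else "NA")
    return expanded


# Stand-in lookup table instead of a real expansion service.
KNOWN = {"http://t.co/abc": "http://www.example.edu/poster.pdf"}


def fake_expand(url):
    return KNOWN.get(url)  # None when the url cannot be expanded


urls = ["http://t.co/abc", "http://t.co/gone", "http://www.example.org/" + "x" * 40]
full = expand_all(urls, fake_expand)
```

Since full has exactly as many entries as urls, it can be added as a new column next to the originals.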

So what's the answer? Not sure. I'll probably do a post on string splitting and counting... OR I'll be back in Open Refine. How do people only ever work in one tool?

3 responses so far

My poster from UM iSchool's Research Day

Mar 11 2014 Published under dissertation

Pikas research day 2014 poster from Christina Pikas

Comments are off for this post

Keeping up with a busy conference - my tools aren't doing it

I wrote about trying to use TwitteR to download AGU13 tweets. I'm getting fewer and fewer with my calls. I was very excited to try Webometric Analyst from Wolverhampton and described by Kim Holmberg in his ASIST webinar (BIG pptx, BIG wmv).

One of the things Webometric Analyst will do is do repeated searches until you tell it to stop. This was very exciting. But I tried it and alas, I think Twitter thinks I'm abusive or something because it was way throttled. Like I could see the tweets flying up on the screen but the search was retrieving like 6. I ran the R search mid-day today and got 99 tweets back which covered 5 minutes O_o. I asked for up to 2000, from the whole day, and had it set to retry if stopped.


Comments are off for this post

AGU13 - like whoa

Dec 11 2013 Published under dissertation

So I said I wouldn't analyze the data from '13 because I'm already underwater, I have plenty, and I need to get done. However, I figured that I had already figured out oauth and using twitteR so no harm in running a couple commands, stashing the data somewhere, and maybe pulling it out if there's a specific question or maybe later when turning my dissertation into an article (should I live so long!).

I thought well, it will give me about 2 weeks' worth, but maybe I should give it a try while the conference is still going to make sure everything works ok. Well crap. I'm getting anywhere from 99 to 1000 tweets per query... and that's covering like at most 3 hours... and I can't seem to fill in the rest. Bummer.

The search has a sinceid but no untilid... and it has since and until for dates - but full days not down to the hour or minute or anything. So I'm really able to get 9pm-midnight GMT. Huh.

I watched Kim Holmberg's fabulous webinar today, so I'm going to try something he suggested to see if that helps. Otherwise, I kinda need to run the search throughout the day, which I can do if I work from home, but I will have missed the most important days of the conference. It's tapering off now. Sigh.
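If I do end up running the search several times a day, the batches will overlap and need stitching together. A minimal sketch in Python of one way to do that, assuming each tweet is a dictionary with a numeric id and that ids grow over time. The function names and the example tweets are made up.

```python
def merge_batches(batches):
    """Combine search results, keeping one copy of each tweet id."""
    by_id = {}
    for batch in batches:
        for tweet in batch:
            by_id.setdefault(tweet["id"], tweet)
    return [by_id[tweet_id] for tweet_id in sorted(by_id)]


def next_since_id(tweets, previous=0):
    """Highest id seen so far; a later search can ask only for newer tweets."""
    return max([previous] + [tweet["id"] for tweet in tweets])


# Made-up results from two overlapping searches.
morning = [{"id": 101, "text": "a"}, {"id": 102, "text": "b"}]
afternoon = [{"id": 102, "text": "b"}, {"id": 105, "text": "c"}]
merged = merge_batches([morning, afternoon])
since_id = next_since_id(merged)
```

Feeding since_id back into the next search as its sinceid should keep each run from re-downloading what is already stashed.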

Comments are off for this post

CASCI Talk about dissertation work

Nov 27 2013 Published under dissertation

I can't say I recommend listening to this as it will be well over an hour and there was a lot of (very helpful) feedback. Maybe flipping through the slides is the best approach.

BTW I think I butchered the description of arsenicscience ... please don't hold that against me!

One response so far
