Getting blog post content in a usable format for content analysis

Jul 04 2014 Published by under dissertation

So in the past I've just gone to the blogs and saved down posts in whatever format to code (as in analyze). One participant with lots of equations asked me to use screenshots if my method didn't show them adequately.

This time I have just a few blogs of interest and I want to go all the way back, and I'll probably do some quantitative stuff as well as just coding at the post level. For example just indicating if the post discusses their own work, other scholarly work, a method (like this post!), a book review... , career advice, whatever. Maybe I'll also select some to go deeper but it isn't content analysis like linguists or others do at the word level.

Anyway, lots of ways to get the text of web pages. I wanted to do it in R completely, and I ended up getting the content there, but I found python to work much better for parsing the actual text out of the yucky tags and scripts galore.

I had *a lot* of help with this. I lived on StackOverflow, got some suggestions at work and on friendfeed (thanks Micah!), and got a question answered on StackOverflow (thanks alecxe!). I tried some books in Safari but meh?

I've had success with this on blogger and wordpress blogs. Last time when I was customizing a perl script to pull the commenter urls out every blog was so different from the others that I had to do all sorts of customization. These methods require very little change from one to the next. Plus I'm working on local copies when I'm doing the parsing so hopefully having as little impact as possible (now that I know what I'm doing - I actually got myself blocked from my own blog earlier because I sent so many requests with no user agent)

So using R to get the content of the archive pages. Largest reasonable archive pages possible instead of pulling each post individually, which was my original thought. One blog seemed to be doing an infinite scroll but when you actually looked at the address block it was still doing the blogurl/page/number format.  I made a csv file with the archive page urls in one column and the file name in another. I just filled down for these when they were of the format I just mentioned.

Read them into R. Then had the function:
 
getFullContent<-function(link,fileName){
 UserAgent <- "pick something"
 temp <- getURL(link, timeout = 8, ssl.verifypeer = FALSE, useragent = "UserAgent")
 nameout <- paste(fileName, ".htm", sep="") 
 write (temp,file=nameout)
 }

I ended up doing it in chunks.  if you're doing this function with one it's like:

getFullContent("http://scientopia.org/blogs/christinaslisrant/","archivep1" )

More often I did a few:

mapply(getFullContent,archiveurls$url[1:20],archiveurls$name[1:20])

So I moved the things around to put them in a folder.

Then this is the big help I got from StackOverflow. Here's how I ended up with a spreadsheet.

from bs4 import BeautifulSoup
import os, os.path

# from http://stackoverflow.com/questions/24502139/using-beautifulsoup-to-pull-multiple-posts-from-a-single-blog-archive-page-with
# this is the file to write out to
posts_file = open ("haposts.txt","w")

def pullcontent(filename):

    soup = BeautifulSoup(open(filename))
    posts = []
    for post in soup.find_all('div', class_='post'):
        title = post.find('h3', class_='post-title').text.strip()
        author = post.find('span', class_='post-author').text.replace('Posted by', '').strip()
        content = post.find('div', class_='post-body').p.text.strip()
        date = post.find_previous_sibling('h2', class_='date-header').text.strip()

        posts.append({'title': title,
             'author': author,
             'content': content,
             'date': date})
    #print posts
    posts = str(posts)
    posts_file.write (posts)

# this is from http://stackoverflow.com/questions/13014862/parse-each-file-in-a-directory-with-beautifulsoup-python-save-out-as-new-file?rq=1

for filename in os.listdir("files"):
    pullcontent(filename)

print ("All done!")

posts_file.close()

So then I pasted it into word, put in some line breaks and tabs and pasted into excel.  I think I could probably go from that file or the data directly into Excel, but this works.

Really very minor tweaking between blogs. Most I don't actually need an author for but I added in the url using something like this:

url = post.find('h2').a.get('href')

The plan is to import this into something like nvivo or atlas.ti for the analysis. Of course it would be very easy to load it in to R as a corpus and then do various textmining operations.

Comments are off for this post