Archive for the 'information analysis' category

I'm a coding fool... use of the nuggets mentioned last post

Jul 18 2014 Published by under information analysis, Uncategorized

Self-efficacy FTW... may I never crash!

Last post I mentioned determining if a string is an element of a character vector and opening Excel files in R. Here's what I was doing and how it worked.

I had a directory of xls files downloaded from a research database, one per organization, showing the high-level subject categories in which they published. The subject categories were actually narrower than what I needed (think condensed matter, high energy, AMO, and chemical physics when I need "physics"). I needed to rank the organizations by articles published in these composite categories, each of which included maybe 10 or so of the categories from the database.

Originally, I was going to open all the files and just count them up, but whoa, that would suck. R for the win!

First, run RStudio as either 32- or 64-bit, depending on which Java you have installed.

Next, get the list of files. I had saved them in a directory with other things, too, so I needed to search by pattern. I had already set my working directory to the data directory, for good or bad.

fileList <- list.files(pattern = "subject")

Get the list of database categories that make up my composite category (physics here). This was just a plain text list:

physics <- read.delim("physics.txt", header=F)

Make sure the words are characters and not factors, and that the numbers are just numbers. By trial and error and inspection, I figured out that there was a non-numeric character in the count field.
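That conversion was roughly like this (a sketch; read.delim's default column name V1 is assumed):

# keep just the first column as a plain character vector so is.element() can match against it
physics <- as.character(physics$V1)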

Here's the function:
countPhysics <- function(file){
  # vector to hang on to the counts
  phys.hold <- vector(mode = "numeric")
  # pattern so I can make sure I just have numbers in the count field
  pattern <- "[[:digit:]]+"
  # needs library(xlsx); sheet 1, data starts at row 4
  physfile <- read.xlsx(file, 1, startRow = 4)
  # find the digits in each Count value and keep just that part
  m <- regexpr(pattern, physfile$Count)
  physfile$Count <- as.numeric(regmatches(physfile$Count, m))
  # one of these days I'll just import right the first time instead of this part
  physfile$Analysis.Value <- as.character(physfile$Analysis.Value)
  # keep the count if the category is in my physics list, otherwise 0
  for (j in 1:length(physfile$Count)){
    if (is.element(physfile$Analysis.Value[[j]], physics)) {
      phys.hold[j] <- physfile$Count[[j]]
    } else {
      phys.hold[j] <- 0
    }
  }
  total <- sum(phys.hold)
  return(c(file, total))
}

So you run this like so:

physicsResult <- sapply(fileList, countPhysics)

I transposed it and then pasted it into an Excel file I was working on, but this is essentially the answer. I did the same thing for the other composite categories separately; obviously I should have checked each line against each/all of my categories before moving to the next line and then output a data frame. Oh well.
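The transpose-and-export step could have been scripted too; a minimal sketch (the output filename is made up):

# one row per file: file name and total article count, written out for Excel
physicsResult.t <- t(physicsResult)
write.csv(physicsResult.t, "physics_counts.csv", row.names = FALSE)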

Comments are off for this post

Another little R nugget

Jul 16 2014 Published by under information analysis

Actually 2 - I'm on a tear. Using what I've learned for work!

First:

is.element(thing,vector)

Pretty cool, huh? And useful. Found on StackOverflow; there are other ideas there, too.

Second:

Opening an Excel file. What you're probably thinking is, "just save as CSV or tab-delimited and open that the normal way"... well, yes, but I've got something like 50 files that were exported from the database as .xls, and I didn't open each one during the export to re-save it.

So opening is one thing, but I also need just a couple of columns from the first sheet, and there's some junk at the top that I don't need.

The package is xlsx (pdf). I had just installed Python (I thought I had it at work, but I guess not) and was preparing to do it there, but then I figured that surely R would have a way.

The issue with my work computer is that it's 64-bit, but I have 32-bit Java, 32-bit Office, etc., and I was running R in RStudio as 64-bit. I tried everything to get Java updated and consistent without killing other things I needed. Finally - duh - I just pointed RStudio at the 32-bit version of R that was already installed, and then rJava (which xlsx requires) ran just peachy.

Here's the command I'm using so far:
file <- read.xlsx("analysis.xls", 1, startRow = 4, colClasses = c("character", "numeric"))

The 1 is the sheet index. The colClasses argument is pretty slick, too. You can also name sheets and only grab certain columns; a sketch of that follows.
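For example (the column choices here are made up, and sheetName= is accepted in place of the numeric index):

# grab only columns 1 and 3 of sheet 1, skipping the junk rows at the top
mycols <- read.xlsx("analysis.xls", 1, colIndex = c(1, 3), startRow = 4)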

So now to iterate through all the files in a directory: opening each one, counting up the articles listed for the categories I have in another vector, and reporting that out. Should be cool. Or not. And I promised to have the results - however they're done - by Friday, and I'm going to be out tomorrow. Back to it!

One response so far

Trying another CAQDAS, MaxQDA

Jul 12 2014 Published by under dissertation, information analysis

CAQDAS: Computer Assisted Qualitative Data Analysis Software (more).

Previously, I'd used NVivo and found it to be miserable and horribly expensive (I didn't pay then, but I would have to this time). I really did most of my work offline with colored pencils, note cards, etc. I am fully willing to admit that I didn't know how to use it, or that maybe I was doing it wrong, but instead of saving me time, it was costing me time.

I started trying to use it with my dissertation data and ew. It's fine with my interview transcripts, but tweets and blog posts, just yuck. So then I started coding just using Excel and meh.

So back to the drawing board. I read a bunch of reviews of different products on Surrey's pages: http://www.surrey.ac.uk/sociology/research/researchcentres/caqdas/, but it's really hard to tell from reviews alone. I also started looking at prices and ooh la la! I was thinking maybe Dedoose, despite their earlier issues, but that's at least $10/month - not bad until you consider that you might do this for a while.

After all that, MaxQDA - a German product - seemed like a slightly better choice than some of the others. The student license doesn't expire and covers the full product (they also have a semester license, but that doesn't make sense for me) - that's probably the biggest selling point.

So far so good. Importing about a thousand blog posts as individual documents was no big deal, and the interview transcripts were even quicker. Adding my codes was super quick, as was moving one when I stuck it in the wrong place in the hierarchy. I think I'll work with this data for a bit before tackling the Twitter archives - particularly since I don't know yet how I might sample them.

I'm still on the 30-day trial. I'm not sure if I need to start the purchase with a week or so to spare so the student verification can be completed - my university doesn't put an expiration date on our IDs, so I don't know if my advisor has to send an e-mail or what.

Yeah, there's something for R (of course), but it doesn't have the features I need. I'm still thinking I might do some more machine learning and other tricks with my data in R, which is easy now that it's all in spreadsheets.

One response so far

Quick note: Now on GitHub

Jul 12 2014 Published by under dissertation, information analysis

The scripts mentioned previously are now on GitHub with an MIT license, which should hopefully be permissive enough. I can't say that anyone would want to use them, but this also backs up my code, which is useful.

I'm using RStudio, so if/when I do more analysis in R, I'll probably just use Git from there.

The url is, unsurprisingly: https://github.com/cpikas

Comments are off for this post

Sizing bars in a bar chart in R

Jun 24 2014 Published by under bibliometrics, information analysis

Yet another stupid thing... but I did it so here's to remembering how.

I wanted to show, for all the venues where my place of work (MPOW) published in the past 5 years, what their impact factors were and how many articles were in each venue. (Yes, all the usual caveats about the impact factor apply, but this was in response to a request.)

So I have a citation manager collection of the articles we've written, gathered through database alerts in all the major databases. I exported that and cleaned up the journal names in VantagePoint (not affiliated, yadda, yadda... use OpenRefine if you don't have VP), and then a co-worker and I laboriously went through and added the impact factors. Then I created a shortened name for each journal (it would have been easier if I'd kept the official abbreviations) by first replacing journal with j, transactions with trans, proceedings with proc, letters with let, etc., and then using this Excel formula to truncate anything still too long:

=IF(LEN(A2)>25,LEFT(A2,25),A2)

Then I copied the values and saved as CSV: short name, number of articles published, impact factor.
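Reading that CSV back into R as the data frame used below would be something like this (a sketch; the filename is made up):

# short name, article count, and impact factor for each journal
mpow2010.2014jnl <- read.csv("mpow_journals_2010_2014.csv", stringsAsFactors = FALSE)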

Here's how I graphed it.

library(ggplot2)

# bar width = number of articles in that journal
w <- mpow2010.2014jnl$total.articles
# x position of the center of each bar, so the variable-width bars sit side by side
pos <- 0.5 * (cumsum(w) + cumsum(c(0, w[-length(w)])))
# one index per journal (257 of them), used only for the fill
x <- c(1:257)
y <- mpow2010.2014jnl$Impact.Factor
my.labs <- mpow2010.2014jnl.s$Jnl.Short

p <- ggplot() +
  geom_bar(aes(x = pos, width = w, y = y, fill = x), stat = "identity") +
  scale_x_continuous(labels = my.labs, breaks = pos)

p + ylab("Impact Factor") + xlab("Journal") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
        legend.position = "none")

The labels were a wonky afterthought - not elegant - and I wanted to get rid of the legend, hence legend.position = "none". Note that I just made a placeholder vector for x (it only drives the fill) and added the labels later via scale_x_continuous. It worked... it's not pretty.

I would like to thank the kind folks on StackOverflow.

Here it is - intentionally squished so I don't give away too much, I hope. (I can remove the image if there's an issue; let me know, don't sue.)

[chart: Where MPOW published, 2010 - June 2014]

I ended up going back and zooming in on pieces. I shoulda made a function so I could just look at whatever part I wanted (a sketch of one follows)... sigh.
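Something like this, using ggplot2's coord_cartesian (the range values are made up):

# zoom the existing plot to a chosen x-axis range without changing anything else
zoomBars <- function(plot, from, to) {
  plot + coord_cartesian(xlim = c(from, to))
}
zoomBars(p, 0, 100)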

 

Comments are off for this post

Current plan - possibly a bad one - for older Twitter stuff

Aug 01 2013 Published by under dissertation, information analysis

I'm hesitant to post this because I'm not sure how it squares with the terms of service, but here's what seems to be working.

When last I posted, I had overcome some OAuth hurdles to successfully pull tweets using the API, only to find that the tweets I wanted (#AGU11, #AGU2011, #AGU12, #AGU2012) were too far back to be covered by the API. Crap on a banana, as my old boss would say.

I did find that if you do a search in the web interface and scroll a whole bunch, you can actually get all the tweets from back that far onto the screen. So I did that and copied them into Excel.

Unfortunately I ended up with something like this:

[screenshot of the raw pasted tweets]

There are 5 lines for each tweet and only two of them have any text I want. I would also like separate fields, and instead I've just got two long messes. What to do?

OpenRefine to the rescue, sorta. OpenRefine used to be called Google Refine; it helps with data munging. I didn't really get the point until I tried it for this, because I had assumed I could do all of this just as easily in Excel - but that's not the case.

I'm tempted to post the JSON file of operations so the steps could be repeated, but so far I haven't found that I can create a clean enough script to run from one end to the other. Nevertheless, I'm willing to post it if there's interest. Here's what I've done so far:

  • Upon import, got rid of blank rows
  • transposed to 3 columns
  • deleted the column with "expand", "view photo", or "view conversation" in it
  • split the first column at @ to get a new column with just the twitter user name
  • split the twitter user name column at the first space to get just the twitter user name and a new column with the date*
  • copied the tweet content to a new column
  • filtered for RT OR MT and starred those - might do something with that later... probably will omit for most of the work
  • split the copied tweet content at @, which produced up to 6 new columns(!)
  • for each of those columns, split at space limited to 2 columns, then deleted the second column. I tried some GREL fanciness that would have done this in one shot, and the preview looked good, but it complained that the result wasn't an object that could be stored in a cell.

So here are my columns:

Name, Twitter handle, date, tweet, first @, second @, third @...

So it should be quite easy to convert this to one of the UCINET file types that has you list pairs of connected nodes - as soon as I do this for the other files and combine the 2011 files and the 2012 files.
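If I end up doing that conversion in R, reshaping the @ columns into a two-column edge list might look roughly like this (a sketch; the data frame and column names are made up):

# 'tweets' is assumed to have one row per tweet: a sender handle plus one column per @mention
library(reshape2)
edges <- melt(tweets, id.vars = "handle",
              measure.vars = c("at1", "at2", "at3", "at4", "at5", "at6"),
              value.name = "mentioned")
# keep only the rows where there actually was a mention
edges <- edges[!is.na(edges$mentioned) & edges$mentioned != "", c("handle", "mentioned")]
write.csv(edges, "agu_edgelist.csv", row.names = FALSE)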

*I'm missing a lot of what I would have gotten easily with the old API, like a full date-time stamp, tweet id, geocodes, what software was used, mentions, etc.

One response so far

Beginning adventures in Python

Jul 03 2013 Published by under information analysis, Uncategorized

I had a little slow period a month or so ago at work (not anymore, for sure!) and decided it was time to start working on a goal I had set for myself for the year: learn to do some analysis that actually uses the full text of the document rather than just the metadata. Elsewhere I have discussed using Sci2, VantagePoint, bibliometrics, and Carrot2 (which uses the text of the abstract), but I need to go further. I don't aspire to become an expert in natural language processing (NLP), but there are times I end up having to stop before I want to because I just don't know how to go on.

Anyhoo... the first step was to see what I could do in R using the tm package and whatever else. I figured out how to do a word cloud, but meh on some of the other tm stuff. I tried a little LDA, but my corpus didn't work well with that. While doing the word cloud I realized I really wanted to lemmatize instead of stem. I looked around for ways to do that in R - there is a WordNet package for R (thanks, Greg Laden, for pointing it out!) - but it just wasn't doing it for me. I had recently worked my way through a bunch of the Python lessons on Codecademy and had also bookmarked NLTK, the natural language toolkit for Python, so I thought: ah-ha!
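For reference, the LDA attempt was along these lines - a rough sketch using the topicmodels package rather than my original code, with an arbitrary topic count:

# assumes a tm DocumentTermMatrix called dtm built from the corpus
library(topicmodels)
lda.fit <- LDA(dtm, k = 10)   # fit a 10-topic model
terms(lda.fit, 5)             # top 5 terms per topic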

The first big deal was installing the stupid thing - the language itself. Argh. I started with Eclipse and PyDev, but alas, I am so not willing to figure out how that really works. I got one sample program running, but when I tried the next program it kept running the first one, so meh.

I started working my way through the NLTK book, which uses the interactive shell, where you get immediate responses. Then there's installing packages - I never did figure out how to do that in Perl, and it's easy in R, but alas... so I gave up on PyDev and installed ActivePython, which has a handy-dandy package installer that, lo and behold, works for people like me who only know enough to be dangerous.

The other thing I'm learning: holy cow, ignore what your computer is and do everything 32-bit, for the love of chocolate. I had a bunch of problems from installing 64-bit versions when everything else was looking for 32-bit. Uninstall and try again.

I still haven't figured out how to use the development environment that ships with ActivePython. I really like how RStudio autocompletes things, which is why I wanted to use Eclipse. I'll have to try that next.

Anyway, I hope to take some notes and leave them here for my future recall as it's easy to forget how things worked.

Comments are off for this post

A local way to create a word cloud

Jul 03 2013 Published by under information analysis, Information Science

There's a lot of debate about whether word clouds are useful visualizations, but they do give you some insight into the prevalence of words in some amount of text.

Wordle is a way to do this online, but you have to paste your text into a box or provide a URL to an RSS or Atom feed. So that won't work for some content because of sensitivity, and it also won't work for batches of files. Here's a way I stumbled upon while trying to do some other analysis/visualizations.

Needed:

  • R
  • packages: tm (and all of its many dependencies), wordcloud, RColorBrewer (optional)
  • data

I originally had a 470-page PDF that I wanted to get a feel for. It was divided into sections, each on a specific program. I ended up saving out each section as RTF by hand, although I could have done it programmatically in some other program. I tried saving directly to plain text, but that failed. I then used DocFrac to batch convert the RTF files to plain text (probably not necessary, but I wasn't sure). So then I had a directory of plain text files.

Load the whole directory into R:

library(tm)
thingy <- Corpus(DirSource("plaintext"), readerControl = list(reader = readPlain()))

Make a copy in case of emergencies, but mostly to use at a later step (stem completion):

thingy.orig <- thingy

Then I did a bunch of cleanup tasks that are pretty typical for this type of work. (There is an easier way to do this, but this way works. I also didn't think much about the order of the steps, which I probably should have - for example, lowercasing before removing stop words.)

#remove stop words
exchanger <- function(x) removeWords(x, stopwords("english"))
thingy<- tm_map(thingy, exchanger)

#stem
exchanger <- function(x) stemDocument(x)
thingy<- tm_map(thingy, exchanger)

#lower case
exchanger <- function(x) tolower(x)
thingy<- tm_map(thingy, exchanger)

# remove numbers
exchanger <- function(x) removeNumbers(x)
thingy<- tm_map(thingy, exchanger)

# using gsub to remove punctuation
# I used this instead of the built in because I wanted it to leave a space where the punctuation had been
exchanger <- function(x) gsub('[[:punct:]]', ' ', x)
thingy<- tm_map(thingy, exchanger)

In addition to the regular stop words, I removed a list of words that were particular to this text.

#custom list
mystop <- c(" "," ")
exchanger <- function(x) removeWords(x, mystop)
thingy<- tm_map(thingy, exchanger)

Next you should take a peek at at least one specific document.

inspect(thingy[3])

And you can add more terms to the custom list or do a gsub or whatever.

Then you need to build a document-term matrix (documents as rows, terms as columns).

# create document-term matrix
dtm <- DocumentTermMatrix(thingy)

And it needs to be in plain matrix format. I understand there's a limit to how big these can be; there are some options to make them smaller (one is sketched after the next line).

dtm.mat <- as.matrix(dtm)
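One such option - an aside here, since I went ahead with the full matrix - is tm's removeSparseTerms(), which drops the rarest terms before converting:

# drop terms that are absent from more than 99% of the documents
dtm.small <- removeSparseTerms(dtm, 0.99)
dtm.mat <- as.matrix(dtm.small)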

Then you need a vector of all the words, and it's nice to have them in order of decreasing frequency. You also need the frequencies themselves.

v <- sort(colSums(dtm.mat), decreasing=TRUE)
myNames <- names(v)

Then you pick your ColorBrewer color scheme and draw your cloud:

library(wordcloud)
library(RColorBrewer)

pal <- brewer.pal(12, "Paired")
# add in a seed to make it reproducible
set.seed(300)
wordcloud(words = myNames, freq = v, scale = c(8, .3), min.freq = 7,
          random.order = TRUE, random.color = TRUE, rot.per = .15, colors = pal)

Now here's the issue: our words are still stemmed. There is a stem completion function (stemCompletion) which goes through and finds the most prevalent complete version of each word in a dictionary. In the examples, people often use the original corpus as the dictionary... the problem is that it's very, very, very slow. In fact, it can take days. I have some options I'm considering to get around this:

  1. try the wordnet package
  2. use Python NLTK to lemmatize prior to doing the R stuff (lemmas are real words, unlike stems)
  3. change the script to make it not iterate so much

stemCompletion ran for a week on my computer before it was forced to reboot by an update... I haven't had a chance to try the other methods yet.
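For reference, the call that was grinding away was roughly like this - a sketch of tm's stemCompletion using the saved original corpus as the dictionary, not my exact code:

# complete each stem to its most prevalent full form in the original corpus
myNames.complete <- stemCompletion(myNames, dictionary = thingy.orig, type = "prevalent")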

One response so far
