I'm a coding fool... use of the nuggets mentioned last post

Jul 18 2014 Published by under information analysis, Uncategorized

Self-efficacy FTW... may I never crash!

Last post I mentioned determining if a string is an element of a character vector and opening Excel files in R. Here's what I was doing and how it worked.

I had a directory of xls files downloaded from a research database, one per organization, showing the high level subject categories in which they published. The subject categories were actually narrower than what I needed (think like condensed matter, high energy, amo, chemical physics and I need "physics"). I needed to rank the organizations by articles published in these composite categories that included maybe 1o or so of the categories from the database.

Originally, I was going to open all the files and just count them up, but whoa, that would suck. R for the win!

First, run RStudio with either 32- or 64- bit depending on which Java you have installed.

Next, get the list of files. I had saved them in a directory with other things, too, so I needed to search by pattern. I had already set my working directory to the data directory, for good or bad.

fileList<-list.files(pattern = "subject")

Get the list of their categories for my composite category (physics here). This was just a list:

physics <- read.delim("physics.txt", header=F)

Make sure words are characters and not factors, numbers are just numbers. By trial and error and inspection I figured out that there was a non-numeric character in the count field.

Here's the function:
countPhysics<- function(file){
  #this creates a vector to hang on to the numbers of counts
  phys.hold <- vector(mode="numeric")
  #this is so i can make sure i just have numbers in the count field
  pattern<-"[[:digit:]]+"
  physfile<-read.xlsx(file,1,startRow=4)
  #this finds matching records and then keeps just the part we want
  m<-regexpr(pattern,physfile$Count)
  physfile$Count<-as.numeric(regmatches(physfile$Count,m))
  #one of these days i'll just import right the first time instead of this part
  physfile$Analysis.Value<-as.character(physfile$Analysis.Value)
  for (j in 1:length(physfile$Count)){
    if (is.element(physfile$Analysis.Value[[j]],physics)) {
      phys.hold[j]<- (physfile$Count[[j]])}
    else phys.hold[j]<-0
  }
  total<- sum(phys.hold)  
  return(c(file,total))
  rm(phys.hold)
}

So you run this like so:

physicsResult <-sapply(fileList,countPhysics)

I transposed it and then pasted it into an excel file I was working on but this is essentially the answer. I did the same thing for the other categories separately, but obviously I should have checked each line for matching each/all of my categories before moving to the next line and then outputted a frame. Oh well.

Tags: ,

Comments are off for this post