Archive for the 'information analysis' category

Which are the bestest? Top articles from a diverse organization - part 1

Apr 01 2015 Published under bibliometrics, information analysis

In which Christina goes into the weeds, yet not really thoroughly enough... anyhoo.

So MPOW is approaching an anniversary and we're looking at retrospectives of all sorts. What are the top articles we've published in the literature? What do you mean by top? Ok, so let's say that top means most cited - just for argument's sake. Is it really fair to compare a biomed article to an aerospace engineering article? An article published last week (ok, if in a special issue it might come complete with citations attached) with one published 5 years ago? 10? 20? Review articles with .... you see where this is going.

I had thought to normalize by 5 or 10 year periods and use the subject categories in WoS. But... 1) there are a lot of them 2) they overlap 3) argh.  And things like Acoustics, for example. JASA covers biomed like hearing stuff and it covers underwater sound... but they're not cited the same... at...all.  The acoustics category covers medical journals, physics journals, and maybe some math and engineering (I'd have to look again to be sure).

At the same time, the nice folks over on SIGMETRICS had an argument starting last weekend and running through the beginning of the week on various normalization schemes. One of the complaints against the impact factor is that it's an average, and averages don't work on skewed distributions. And the WoS categories suck.

So... what I'm trying to do now is both fractional counting (and I'm checking to make sure I know what that is, but I think that instead of getting credit for 1 citation, you get credit for 1/(total things cited by the citing article) - so a citation from a review article is worth a lot less than one from a regular article, because it might be +1/200 vs. +1/30) and then normalizing by percentile. Not even the normal percentile, but this Hazen (1914) percentile. Tricky.
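If I'm understanding the Hazen version right, it just shifts the rank down by half a step before dividing - a tiny sketch in R, with a made-up vector of counts:

# tiny sketch of Hazen (1914) percentiles - 'cites' is a made-up example vector of
# (fractional) citation counts for the articles in one year/subject bin
cites <- c(0, 1, 1, 3, 7, 12, 45)

# Hazen plotting position: the i-th ranked value of n gets percentile (i - 0.5)/n
# (average ranks are one way to handle the ties)
hazen.pct <- (rank(cites, ties.method = "average") - 0.5) / length(cites) * 100
round(hazen.pct, 1)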

I'll be sure to share the script once I've got it. So far the method looks like:

  1. Find my org, relevant time period, articles only in WoS.
  2. Sort by cited, pull off the most cited or all the ones cited more than x or something. Save them down in plain text full record (probably don't need citations?)
  3. Then for each of the top, click on Times Cited. Export them all down in Tab del Windows UTF-8
  4. Move them over to data folder
  5. Run R script (to be shared when I'm sure it's right - a rough sketch is below the note) to get the new TCs and stick them into the file from step 2

*Note: if your thingy was cited more than 500 times, you can't export all the citing records at once. Also, this would not be practical if you have something with thousands of citations. If you do, I would just take the plunge and call that one of the best. We only had 5 over 500.
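Here's roughly what I expect step 5 to look like - a sketch only, until I've verified the real script, and it assumes the tab-delimited export reads cleanly with read.delim and has the NR (cited reference count) column to use as the denominator:

# sketch of step 5: for each exported file of citing articles, sum fractional credit,
# where each citing article is worth 1/(number of references it cites)
files <- list.files("data", pattern = "\\.txt$", full.names = TRUE)

fractionalTC <- function(f) {
  citing <- read.delim(f, quote = "", stringsAsFactors = FALSE)
  sum(1 / citing$NR[citing$NR > 0], na.rm = TRUE)
}

newTC <- sapply(files, fractionalTC)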

Next, I'll put them into the ISI.exe script and then the i3 script from here.  See what happens.

As for normalizing by year: I was thinking about maybe omitting a couple of years and then doing 5-year bins 3 times and then 10-year bins. Not sure. Willing to take advice. It's a 75-year history, but there was a similar paper done in 1986, so I only agreed to go back to 1980. Before a certain point - no longer necessarily 1973 - the affiliation/address fields aren't there. One very nice retiree I had the pleasure to meet just died, and I found that he was listed in Garfield's most-cited articles. His work on polar gases is not coming up in the search, so coverage is definitely not complete that far back.


Using more of the possible dimensions in a network graph

Jan 30 2015 Published under information analysis

When doing bibliometrics, or social network analysis or any kind of network graph, there are only so many different ways to convey information.

  • Size of nodes
  • Shape of nodes (including pictures)
  • Color of nodes
  • Border of nodes (or multiple borders)
  • Labels (node or edge)
  • Edge weight
  • Edge color
  • Arrows
  • Shading areas around/behind nodes
  • Layout or arrangement of nodes

Of these, I almost always size nodes by degree (connections to other nodes), do thickness of lines by their weight, and do some sort of energy or spring layout.

If I do some sort of clustering or community detection or even want to call out components, I'll do that with node color.
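Here's a minimal sketch of those defaults in igraph - g is assumed to be an igraph object with a weight attribute on the edges, and fastgreedy is just one of several community detection options:

library(igraph)

# my usual defaults: size by degree, color by community, spring-type layout
V(g)$size <- degree(g)
comm <- fastgreedy.community(as.undirected(g))
V(g)$color <- membership(comm)
plot(g, layout = layout.fruchterman.reingold, edge.width = E(g)$weight)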

My normal things are easy in any package that will graph networks. I was working on a project where we were looking at the maturity of a particular part of an industry. As part of this, we wanted to know if the necessary component systems were available from multiple suppliers, if those suppliers had relationships with different system integrators, and if their offerings were operational or just for lab or testing purposes.

We could have done a graph for each subsystem, but they wanted this to be just one slide in a fairly small deck. I tried various approaches in Gephi and NetDraw and wasn't excited. So back to R and igraph. In the end (anonymized):


Resulting graph - minus any labels.

I used:

  • node shape for if a component or a system integrator
  • color for type of component
  • size for degree
  • line dashed or dotted for if it was in operation or not

I really wanted to show different shapes for each category, but igraph only has like 6 default ones and they don't look all that different from each other. NetDraw has more. I tried to use raster images, but I'm on a Windows machine and I found all that very confusing.

One unfortunate thing about this graph is that I had to list companies multiple times if they had offerings in multiple categories.

Customer seemed to like it.

I'm not going to take the time to anonymize all the code but here are some key pieces - ask if there's anything I figured out that you don't immediately see how to do.
I started with a spreadsheet (3 of us librarians were adding data)
nodetable tab:
id label category

edgetable tab:
source target yes/no notes

These I imported into Gephi (super easy)... and then tried all sorts of stuff... and then exported to GraphML.
library(igraph)

#read in the graph
g<-read.graph("g.graphml", format="graphml")


#shape nodes. these work, but you can't have n/a, so there has to be a default. also, there is an easier way (vectorized - see the sketch below the edge loop)
for (i in 1:101) ifelse(V(g)[i]$Category=='category', V(g)[i]$shape<-'circle', V(g)[i]$shape<-'square')

#color of nodes - a simple number will draw from the palette. see below
for (i in 1:101)if(V(g)[i]$Category=="category"){V(g)[i]$color<-1}

#calculate and keep the degree. i use it again for label placement (not shown) and to bold some labels (not shown)
V(g)$degree<-degree(g, mode="all")

#when I tested the graphing, the isolates were all mixed in and messed up all the labels.
#subgraph to show isolates separately
gi<-induced.subgraph(g,V(g)$degree==0)
gnoni<-induced.subgraph(g,V(g)$degree!=0)

#make dotted lines for not operational
for (i in 1:76) ifelse (E(gnoni)[i]$"operational"=="yes", E(gnoni)[i]$edge.lty<-1,E(gnoni)[i]$edge.lty<-2)
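For the record, the easier way mentioned in the comments above is to skip the loops and assign whole attribute vectors at once - a sketch, with 'category' still the anonymized placeholder value:

#vectorized versions of the loops above - same result, no hard-coded index ranges
V(g)$shape <- ifelse(V(g)$Category == "category", "circle", "square")
V(g)[V(g)$Category == "category"]$color <- 1
E(gnoni)$edge.lty <- ifelse(E(gnoni)$operational == "yes", 1, 2)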

#prettier colors
library("RColorBrewer", lib.loc="~/R/win-library/3.1")
mypalette<-brewer.pal(6,"Paired")
palette(mypalette)

#legend definitions
colors <- c('gray40', 1, 2, 3, 4, 5, 6)
labels <- c(...)  #the (anonymized) vector of category names for the legend


#plot graph keep device open
plot.igraph(gnoni, layout=layout.fruchterman.reingold, edge.arrow.size=0.1, edge.color="black", vertex.size=V(gnoni)$degree, vertex.label.dist=V(gnoni)$vertex.label.dist, vertex.label.color="black", vertex.label.family="sans",edge.curved=TRUE, vertex.label.cex=0.8, edge.lty=E(gnoni)$edge.lty)

#put legends on - isolates are just shown as a legend so they are neatly lined up
#could have been done by plotting points

legend("bottomright",legend=labels, fill=colors, border="black", cex=0.7, inset=c(-0.1,0))
legend("topleft", legend=V(gi)$label, pch=19, col=V(gi)$color, cex=0.7, bty="n", y.intersp=0.5)
legend("topright", legend=c("Yes", "No"), lty=c(1,2), cex=0.7,inset=c(-0.02,0)) Continue Reading »


I'm a coding fool... use of the nuggets mentioned last post

Jul 18 2014 Published under information analysis, Uncategorized

Self-efficacy FTW... may I never crash!

Last post I mentioned determining if a string is an element of a character vector and opening Excel files in R. Here's what I was doing and how it worked.

I had a directory of xls files downloaded from a research database, one per organization, showing the high-level subject categories in which they published. The subject categories were actually narrower than what I needed (think condensed matter, high energy, AMO, chemical physics, when I need "physics"). I needed to rank the organizations by articles published in these composite categories, each of which included maybe 10 or so of the categories from the database.

Originally, I was going to open all the files and just count them up, but whoa, that would suck. R for the win!

First, run RStudio with either 32- or 64-bit R, depending on which Java you have installed.

Next, get the list of files. I had saved them in a directory with other things, too, so I needed to search by pattern. I had already set my working directory to the data directory, for good or bad.

library(xlsx)  #for read.xlsx below - see the last post for getting rJava sorted out

fileList <- list.files(pattern = "subject")

Get the list of their categories for my composite category (physics here). This was just a list:

physics <- read.delim("physics.txt", header=FALSE, stringsAsFactors=FALSE)$V1  #keep it as a plain character vector so is.element() works below

Make sure words are characters and not factors, and numbers are just numbers. By trial and error and inspection, I figured out that there was a non-numeric character in the count field.

Here's the function:
countPhysics<- function(file){
  #this creates a vector to hang on to the numbers of counts
  phys.hold <- vector(mode="numeric")
  #this is so i can make sure i just have numbers in the count field
  pattern<-"[[:digit:]]+"
  physfile<-read.xlsx(file,1,startRow=4)
  #this finds matching records and then keeps just the part we want
  m<-regexpr(pattern,physfile$Count)
  physfile$Count<-as.numeric(regmatches(physfile$Count,m))
  #one of these days i'll just import right the first time instead of this part
  physfile$Analysis.Value<-as.character(physfile$Analysis.Value)
  for (j in 1:length(physfile$Count)){
    if (is.element(physfile$Analysis.Value[[j]],physics)) {
      phys.hold[j]<- (physfile$Count[[j]])}
    else phys.hold[j]<-0
  }
  total<- sum(phys.hold)  
  return(c(file, total))
}

So you run this like so:

physicsResult <-sapply(fileList,countPhysics)

I transposed it and then pasted it into the Excel file I was working on, but this is essentially the answer. I did the same thing for the other categories separately; obviously I should have checked each line against all of my composite categories before moving to the next line and then output a data frame. Oh well.
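Something like this is what I mean - a sketch only, not run against the real files, with a made-up materials.txt as a second composite category:

# sketch: count every composite category in one pass per file and return a data frame
# (reuses fileList from above; categoryLists holds one character vector per composite category)
categoryLists <- list(
  physics   = read.delim("physics.txt", header = FALSE, stringsAsFactors = FALSE)$V1,
  materials = read.delim("materials.txt", header = FALSE, stringsAsFactors = FALSE)$V1
)

countAll <- function(file) {
  sheet <- read.xlsx(file, 1, startRow = 4)
  # same cleanup as in countPhysics: keep just the digits in the count field
  m <- regexpr("[[:digit:]]+", sheet$Count)
  sheet$Count <- as.numeric(regmatches(sheet$Count, m))
  cats <- as.character(sheet$Analysis.Value)
  sapply(categoryLists, function(catList) sum(sheet$Count[cats %in% catList], na.rm = TRUE))
}

results <- data.frame(t(sapply(fileList, countAll)))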


Another little R nugget

Jul 16 2014 Published under information analysis

Actually 2 - I'm on a tear. Using what I've learned for work!

First:

is.element(thing,vector)

Pretty cool, huh? And useful. Found on Stack Overflow. Other ideas there, too.

Second:

Opening an Excel file. Now what you're probably thinking is: "just save as csv or tab delimited and open that the normal way"... well yes, but I've got like 50 files and they were exports from the database as .xls and I didn't open them each while exporting to re-save.

So opening is one thing but I need a couple of columns out of the first sheet and there's some junk at the top that I don't need.

Package is xlsx (pdf). I had just installed Python (thought I had it at work, but I guess not) and was preparing to do it there, but then I was thinking that surely R would have a way.

The issue in my case is that my work computer is 64-bit, but I have 32-bit Java, 32-bit Office, etc. I was running R in RStudio as 64-bit. I tried everything to get Java updated and sorted out (without killing other things I needed). Finally - duh - I just pointed RStudio at the 32-bit version of R that was already installed, and then rJava, which xlsx requires, ran just peachy.

Here's the command I'm using so far:

library(xlsx)
file <- read.xlsx("analysis.xls", 1, startRow=4, colClasses=c("character", "numeric"))

1 is the sheet index. The colClasses is pretty slick, too. You can also name sheets and only grab certain columns.
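For example - untested here, and the sheet name is made up - the same function also takes sheetName and colIndex:

# grab only columns 1 and 3 from a sheet referenced by name, skipping the junk rows at the top
# ("Report" is a made-up sheet name)
subjects <- read.xlsx("analysis.xls", sheetName = "Report", startRow = 4,
                      colIndex = c(1, 3), colClasses = c("character", "numeric"))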

So now to iterate through all the files in a directory, opening each one, counting up the articles listed for the categories I have in another vector, and reporting that out. Should be cool. Or not. And I promised to have the results - however they're done - by Friday, and I'm going to be out tomorrow. Back to it!


Trying another CAQDAS, MaxQDA

Jul 12 2014 Published under dissertation, information analysis

CAQDAS: Computer Assisted Qualitative Data Analysis Software (more).

Previously, I'd used NVivo and I found it to be miserable and horribly expensive (I didn't pay, but I would have to this time). I really did most of my work offline with colored pencils, note cards, etc. I am fully willing to admit that I didn't know how to use it or that maybe I was doing it wrong, but instead of saving me time, it was costing me time.

I started trying to use it with my dissertation data and ew. It's fine with my interview transcripts, but tweets and blog posts, just yuck. So then I started coding just using Excel and meh.

So back to the drawing board. I read a bunch of reviews of different products on Surrey's pages: http://www.surrey.ac.uk/sociology/research/researchcentres/caqdas/ , but it's really hard to tell. I also started looking at prices and oh la la! I was thinking maybe Dedoose, despite their earlier issues, but that's at least $10/month. Not bad until you think you might do this for a while.

After all that, MaxQDA - a German product - seemed like a slightly better choice than some. The student license doesn't expire and is for the full product (they have a semester one, but that doesn't make sense for me) - that's probably the biggest selling point.

So far so good. Importing about a thousand blog posts as individual documents was no big deal. Interview transcripts were even quicker. Adding in my codes was super quick and it was super quick to move one when I stuck it in the wrong hierarchy. I think I'll work with this data for a bit before doing the twitter archives - particularly since I don't know how I might sample.

I'm still on the 30-day trial. Not sure if I need to try to start paying for it with a week or so to spare so the verification can be completed. My university doesn't have an expiration date on our IDs. Not sure if my advisor has to send an e-mail or what.

Yeah, there's something for R (of course), but it doesn't have the features. I was still thinking I might do some more machine learning and other tricks with my data using R which is easy now that it's all in spreadsheets.


Quick note: Now on GitHub

Jul 12 2014 Published under dissertation, information analysis

Scripts mentioned previously are now on GitHub with an MIT license, which should hopefully be permissive enough. I can't say that anyone would want to use these, but it also backs up my code, which is useful.

I'm using RStudio so probably if/when I do more analysis in R, I'll just use Git from there.

The url is, unsurprisingly: https://github.com/cpikas


Sizing bars in a bar chart in R

Jun 24 2014 Published under bibliometrics, information analysis

Yet another stupid thing... but I did it so here's to remembering how.

I wanted to show, for all the venues my place of work (MPOW) published in over the past 5 years, what their impact factors were and how many articles were in each venue. (Yes, caveat the IF, but this was in response to a request.)

So I have a citation manager collection with the articles we've written, collected through database alerts in all the major databases. I exported that and cleaned up the journal names in VantagePoint (not affiliated, yadda, yadda... use Open Refine if you don't have VP), and then a co-worker and I laboriously went through and added the IFs. Then I created a shortened name for each journal (woulda been easier if I'd kept the official abbreviations) by first replacing journal with j, transactions with trans, proceedings with proc, letters with let, etc., and then using this Excel formula:

=IF(LEN(A2)>25,LEFT(A2,25),A2)

Then I copied the values and saved as CSV: short name, number of articles published, IF.

Here's how I graphed it.

library(ggplot2)

# bar widths = number of articles published in each journal
w <- mpow2010.2014jnl$total.articles
# bar centers: cumulative width so far plus half of each bar's own width
pos <- 0.5 * (cumsum(w) + cumsum(c(0, w[-length(w)])))
x <- 1:257
y <- mpow2010.2014jnl$Impact.Factor
my.labs <- mpow2010.2014jnl.s$Jnl.Short

p <- ggplot() +
  geom_bar(aes(x = pos, width = w, y = y, fill = x), stat = "identity") +
  scale_x_continuous(labels = my.labs, breaks = pos)

p + ylab("Impact Factor") + xlab("Journal") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1), legend.position = "none")

The labels were a wonky afterthought. Not elegant... and I wanted to get rid of the legend. Note I just made a plain index vector for x (used for the fill) and then added the labels later. It worked... not pretty.

I would like to thank the kind folks on Stack Overflow.

Here it is - intentionally smushed so I don't give away too much. I hope. I can remove image if there's an issue, lmk, don't sue.

Where MPOW published 2010-2014 June

I ended up going back and zooming in on pieces. I shoulda made a function so I could just look at whatever part I wanted... sigh.
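Something like this would have done it - a quick sketch reusing the vectors above (same wonky width-in-aes approach), so I could look at any run of journals:

# sketch: plot just journals i through j of the big chart, so I can zoom in on any
# slice without re-pasting code
plotSlice <- function(i, j) {
  keep <- i:j
  ws <- w[keep]
  poss <- 0.5 * (cumsum(ws) + cumsum(c(0, ws[-length(ws)])))
  ggplot() +
    geom_bar(aes(x = poss, width = ws, y = y[keep], fill = keep), stat = "identity") +
    scale_x_continuous(labels = my.labs[keep], breaks = poss) +
    ylab("Impact Factor") + xlab("Journal") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
          legend.position = "none")
}

plotSlice(1, 50)   # e.g., the first 50 journals, however the data frame is sorted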

 


Current plan - possibly a bad one - for older Twitter stuff

Aug 01 2013 Published under dissertation, information analysis

I'm hesitant to post this because I'm not sure how it rates in the terms of service, but here's what seems to be working.

When last I posted, I had overcome some oauth hurdles to successfully pull tweets using the API only to find that the tweets I wanted (#AGU11, #AGU2011, #AGU12, #AGU2012) were not covered by the API. Crap on a banana as my old boss would say.

I did find that if you do a search in the interface and scroll a whole bunch you can actually get all the tweets on the screen from back that far. So I did that and I copied into Excel.

Unfortunately I ended up with something like this:

There are 5 lines for each tweet and only two have any text I want. I also would like separate fields, but I've just got two long messes. What to do?

Open Refine to the rescue, sorta. Open Refine used to be called Google Refine. It just helps with data munging. I kinda didn't really get it until I tried it for this because I assumed I could do all this as easily in Excel, but that's not the case.

I'm tempted to actually post the json file so the steps could be repeated, but so far I haven't actually found that I can create a clean enough script to run from one end to the other. Nevertheless, I'm willing to post if there's interest.  Here's what I've done so far:

  • Upon import, got rid of blank rows
  • transposed to 3 columns
  • deleted the column with "expand", "view photo", or "view conversation" in it
  • split the first column at @ to get a new column with just the twitter user name
  • split the twitter user name column at the first space to get just the twitter user name and a new column with the date*
  • copied the tweet content to a new column
  • filtered for RT OR MT and starred those - might do something with that later... probably will omit for most of the work
  • split at @, for each of those columns (up to 6!)
  • for each of those columns, split at space and limited to 2 columns, then deleted the second column. I tried some GREL fanciness that would have done this in one shot, and the preview looked good, but it said something about the result not being an object that could end up in a cell.

So here are my columns:

Name, Twitter handle, date, tweet, first@, second@, third@...

So it would be quite easy to convert this to one of the UCInet file types that has you list the nodes that are connected - as soon as I do this for the other files and combine the 2011 files and the 2012 files. (A rough sketch of the conversion is below.)
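For my own notes, the conversion in R would look roughly like this - a sketch, assuming I've exported the cleaned-up project as a CSV with the columns above (the file and column names here are made up):

# sketch: melt the wide first@/second@/... columns into a two-column edge list
# (tweeter -> mentioned account), which UCInet and igraph can both take
tweets <- read.csv("agu11_clean.csv", stringsAsFactors = FALSE)

mention.cols <- grep("^at", names(tweets))   # however the @-columns end up named
edges <- do.call(rbind, lapply(mention.cols, function(col) {
  data.frame(source = tweets$handle, target = tweets[[col]], stringsAsFactors = FALSE)
}))
# drop the rows where there was no nth mention
edges <- edges[!is.na(edges$target) & edges$target != "", ]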

*I'm missing a lot of what I would have gotten easily with the old API like a full date time stamp, id, geocodes, what software was used, mentions, etc.


Beginning adventures in Python

Jul 03 2013 Published under information analysis, Uncategorized

I had a little slow period a month or so ago at work (not anymore, for sure!) and I decided it was time to start working on a goal I had set for myself for the year: learn to do some analysis that actually uses the full text of the document vs. just the metadata. Elsewhere I have discussed using Sci2, VantagePoint, bibliometrics, and actually Carrot2 (using the text of the abstract), but I need to go further. I don't aspire to become an expert in natural language processing (NLP) but there are some times I end up having to stop before I want to because I just don't know how to go on.

Anyhoo... first step was to see what I could do in R using the TM package and whatever else. I figured out how to do a word cloud but meh on some of the other tm stuff. I tried a little LDA but my corpus didn't work well with that. When doing the word cloud I realized I really wanted to lemmatize instead of stem. I looked around for ways to do it in R, and there is a WordNet package for R (thanks Greg Laden for pointing it out!) but it just wasn't doing it for me. I had recently worked my way through a bunch of the Python lessons on Code Academy and also bookmarked nltk - the natural language toolkit that works in python so I thought - ah-ha!

The first big deal was installing the stupid thing - the language itself. Argh. I started with Eclipse and PyDev but alas, I am so not willing to figure out how that really works. I got one sample program running, but for the next program it kept running the first one, so meh.

I started working my way through the nltk book, and that uses the interactive shell, I guess? Where you get immediate responses? Installing packages - I never did figure out how to do that in Perl; it's easy in R; but in Python, alas... so I gave up on PyDev and installed ActivePython, which has a handy-dandy package installer that, lo and behold, works for people like me who only know enough to be dangerous.

The other thing I'm learning: holy cow, ignore what your computer is and do everything 32-bit, for the love of chocolate. I had a bunch of problems from installing 64-bit versions of things where everything else is looking for 32-bit. Uninstall and try again.

I still haven't figured out how to use the programming environment (?) that ships with ActivePython. I really like how RStudio completes things and that's why I wanted to use Eclipse. I'll have to try that next.

Anyway, I hope to take some notes and leave them here for my future recall as it's easy to forget how things worked.


A local way to create a word cloud

Jul 03 2013 Published under information analysis, Information Science

There's a lot of debate about whether word clouds are useful visualizations, but they do give you some insight into the prevalence of words in some amount of text.

Wordle is a way to do this online, but you have to paste your text into a box or provide a URL to an RSS or Atom feed. So it won't work for sensitive content, and it also won't work for batches of files. Here's a way I stumbled upon when I was trying to do some other analysis/visualizations.

Needed:

  • R
  • packages: tm (and all of its many dependencies), wordcloud, RColorBrewer (optional)
  • data

I originally had a 470-page PDF that I wanted to get a feel for. It was divided up into sections, each on a specific program. I ended up saving out each section as RTF by hand, although I could have done it programmatically in some other program. I tried saving directly to plain text, but that failed. I then used DocFrac to batch convert them to plain text (probably not necessary, but I wasn't sure). So then I had a directory of plain text files.

Load the whole directory into R:

library(tm)
thingy <- Corpus(DirSource("plaintext"), readerControl = list(reader = readPlain()))

make a copy in case of emergencies, but mostly to use at a later step:

thingy.orig <- thingy

Then I did a bunch of cleanup tasks pretty typical for this type of work. (There is an easier way to do this, but this way works. I also didn't think much about the order of these steps, which I probably should have.)

#remove stop words
exchanger <- function(x) removeWords(x, stopwords("english"))
thingy<- tm_map(thingy, exchanger)

#stem
exchanger <- function(x) stemDocument(x)
thingy<- tm_map(thingy, exchanger)

#lower case
exchanger <- function(x) tolower(x)
thingy<- tm_map(thingy, exchanger)

# remove numbers
exchanger <- function(x) removeNumbers(x)
thingy<- tm_map(thingy, exchanger)

# using grep sub to remove punctuation
# I used this instead of the built in because I wanted it to leave a space where the punctuation had been
exchanger <- function(x) gsub('[[:punct:]]', ' ', x)
thingy<- tm_map(thingy, exchanger)

In addition to the regular stop words, I removed a list of words that were particular to this text.

#custom list
mystop <- c(" "," ")
exchanger <- function(x) removeWords(x, mystop)
thingy<- tm_map(thingy, exchanger)

Next you should take a peek at at least one specific document.

inspect (thingy[3])

And you can add more terms to the custom list or do a gsub or whatever.

Then you need to build a document-term matrix.

# create document-term matrix (documents as rows, terms as columns)
dtm <- DocumentTermMatrix(thingy)

And it needs to be in matrix format. I understand there's a limit to how big these can be. There are some options to make them smaller.
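One of those options is removeSparseTerms(), which drops terms that show up in very few documents - the 0.95 here is just a number to tune:

# optional: shrink the matrix by dropping terms that appear in very few documents
# sparse = 0.95 keeps terms present in at least ~5% of the documents
dtm <- removeSparseTerms(dtm, 0.95)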

dtm.mat <- as.matrix(dtm)

Then you need to get a vector of all the words and it's nice to have them in order of decreasing frequency. Plus you need the frequencies.

v <- sort(colSums(dtm.mat), decreasing=TRUE)
myNames <- names(v)

Then you pick your ColorBrewer color scheme and draw your cloud:

library(wordcloud)
library(RColorBrewer)

pal <- brewer.pal(12,"Paired")
#add in seed to make it reproducible
set.seed(300)
#positional arguments here: words, freq, scale, min.freq, (max.words left at its default),
#random.order, random.color, rot.per, colors
wordcloud(myNames,v,c(8,.3),7,,TRUE,TRUE,.15,pal)

Now here's the issue. Our words are still stemmed. Well, there is a stem completion function which goes through and finds the most prevalent complete version of that word in a dictionary. In the examples, they often use the original corpus as the dictionary... the problem is that it's very, very, very slow. In fact, it can take days. I have some options I'm considering to get around this:

  1. try the wordnet package
  2. use python nltk to lemmatize prior to doing the r stuff (lemmas are real words unlike the stems)
  3. change the script to make it not iterate so much

stemCompletion ran for a week on my computer before it was forced to stop by a reboot from an update... I haven't had a chance to try the other methods yet. (A rough sketch of what I mean by #3 is below.)
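What I mean by #3 is something like only completing the relatively few stems that will actually appear in the cloud instead of every term - a sketch, not yet tested on the big corpus, using the thingy.orig copy from above as the dictionary:

# sketch: complete only the top stems that will show up in the cloud, rather than
# every term in the matrix - far fewer lookups against the dictionary corpus
top.stems <- myNames[1:200]    # 200 is arbitrary - roughly the max size of the cloud
completed <- stemCompletion(top.stems, dictionary = thingy.orig, type = "prevalent")

# fall back to the stem itself where no completion was found
missing <- is.na(completed) | completed == ""
completed[missing] <- top.stems[missing]

wordcloud(completed, v[1:200], c(8,.3), 7, , TRUE, TRUE, .15, pal)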

