Archive for the 'information analysis' category

Getting article metadata from MS Academic: some R code

As promised, I went back and did this myself instead of relying on a partner in crime (earlier referred to as an SME, but he outed himself). It's funny because I had his code, but he does things differently than I do, so I needed to work through it myself.

On my first mostly successful run I ended up with about 44% of the rows missing the metadata. I discovered fairly quickly that tm's removePunctuation was, of course (in retrospect), closing up intraword dashes instead of leaving a space. You can have it preserve those, but you can't have it leave a space instead. I first did some finding and replacing in Excel, but that only got me down to 32%. Then I was like, duh, just do the gsub for [[:punct:]] and see if that's better. I hope I haven't used up my quota!
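
For the curious, here's the difference in one line - the tm default runs the words together, while the gsub leaves a space:

library("tm")
removePunctuation("state-of-the-art")         # "stateoftheart" - words run together
gsub("[[:punct:]]", " ", "state-of-the-art")  # "state of the art"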

Here's the code. Sign up for your key here. Also note: not affiliated, not endorsing.

#microsoft academic to try to find affiliations for article titles

library("httr", lib.loc="~/R/win-library/3.3")
library("tm", lib.loc="~/R/win-library/3.3")
library("jsonlite", lib.loc="~/R/win-library/3.3")

setwd("~/DataScienceResearchInitiative")

#don't forget the following or you will regret it
options(stringsAsFactors = FALSE)

# api info https://dev.projectoxford.ai/docs/services/56332331778daf02acc0a50b/operations/565d753be597ed16ac3ffc03

# https://api.projectoxford.ai/academic/v1.0/evaluate[?expr][&model][&count][&offset][&orderby][&attributes]

#key:
msakey1<-"put yours here"

apiurl<-"https://api.projectoxford.ai/academic/v1.0/evaluate?expr="
searchexpr<-"Ti='example'"
apiattrib<-"Ti,Y,AA.AuN,AA.AfN,C.CN,J.JN,E"

#test on one to see how it works
testcite <- GET(apiurl, 
         query = list(expr = searchexpr,count = 1, attributes = apiattrib), add_headers("Ocp-Apim-Subscription-Key"= msakey1))

#get the json out into usable format
#could look for errors first
testcite$status_code

#comes out raw so need to make into text
testciteContent <- rawToChar(testcite$content)

test<-fromJSON(testciteContent)
test$entities$AA
test$entities$AA[[1]]$AuN
#this will get a ; separated vector
paste(test$entities$AA[[1]]$AuN, collapse = ';')

test$entities$AA[[1]]$AfN
test$entities$J$JN
test$entities$Y
test$entities$Ti

# initiate a dataframe
# for each title, go out and search using that title
# could add in a warn_for_status(r)  when status is not 200 (happy)
# if status !200 go to the next one,  if status =200
# extract ti, y, authors (paste), affil (paste), jn, cn, and out of entities VFN, V, FP LP DOI D
# write them to the data frame
#1904 is the length of my article title list

CitesOut<- data.frame(ti = rep(NA,1904),
                      y = rep(NA_integer_, 1904),  #NA rather than 0, so unfetched rows stay obvious
                      au = rep(NA,1904),
                      af = rep(NA,1904),
                      jn = rep(NA,1904),
                      cn = rep(NA,1904),
                      vfn = rep(NA,1904),
                      v = rep(NA,1904),
                      fp = rep(NA,1904),
                      lp = rep(NA,1904),
                      doi = rep(NA,1904),
                      abs = rep(NA,1904),
                      stringsAsFactors = FALSE)
  
getMScites<- function(citeNo){
  apiurl<-"https://api.projectoxford.ai/academic/v1.0/evaluate?expr="
  searchexpr<- paste0("Ti='",TitlesToFindf[citeNo],"'")
  apiattrib<-"Ti,Y,AA.AuN,AA.AfN,C.CN,J.JN,E"
  holding<-GET(apiurl,
               query = list(expr = searchexpr,count = 1, attributes = apiattrib), 
               add_headers("Ocp-Apim-Subscription-Key"= msakey1))
  print(paste("cite number", citeNo,"status is:", holding$status_code))
  print(holding$headers$`content-length`)
  holdingContent <- rawToChar(holding$content)
  holdC<-fromJSON(holdingContent)
  cciterow<-data.frame(
    ti=ifelse(is.null(holdC$entities$Ti),NA,holdC$entities$Ti),
    y=ifelse(is.null(holdC$entities$Y),NA,as.integer(holdC$entities$Y)), 
    au=ifelse(is.null(holdC$entities$AA[[1]]$AuN),NA,paste(holdC$entities$AA[[1]]$AuN, collapse = ';')),
    af=ifelse(is.null(holdC$entities$AA[[1]]$AfN),NA,paste(holdC$entities$AA[[1]]$AfN, collapse = ';')),
    jn=ifelse(is.null(holdC$entities$J$JN),NA,holdC$entities$J$JN),
    cn=ifelse(is.null(holdC$entities$C$CN),NA,holdC$entities$C$CN))
  print(cciterow)
  if(is.null(holdC$entities$E)){
    eciterow<-data.frame(
      vfn=NA,
      v=NA,
      fp=NA,
      lp=NA,
      doi=NA,
      abs=NA)
  } else {
    holdE<-fromJSON(holdC$entities$E)
    eciterow<-data.frame(
      vfn=ifelse(is.null(holdE$VFN),NA,holdE$VFN),
      v=ifelse(is.null(holdE$V),NA,holdE$V),
      fp=ifelse(is.null(holdE$FP),NA,holdE$FP),
      lp=ifelse(is.null(holdE$LP),NA,holdE$LP),
      doi=ifelse(is.null(holdE$DOI),NA,holdE$DOI),
      abs=ifelse(is.null(holdE$D),NA,holdE$D)
    )
  }
  print(eciterow)
  citerow<-cbind(cciterow,eciterow, stringsAsFactors=FALSE)
  print("this is citerow")
  print(citerow)
  return(citerow)
} 

#troubleshooting
apiurl<-"https://api.projectoxford.ai/academic/v1.0/evaluate?expr="
searchexpr<- paste0("Ti='",TitlesToFindf[4],"'")
apiattrib<-"Ti,Y,AA.AuN,AA.AfN,C.CN,J.JN,E"
troubleshoot<-GET(apiurl,
               query = list(expr = searchexpr,count = 1, attributes = apiattrib), 
               add_headers("Ocp-Apim-Subscription-Key"= msakey1))

troubleshoot$status_code
troubleshoot$headers$`content-length`

troubleshootcontent<-rawToChar(troubleshoot$content)  
troubleC<-fromJSON(troubleshootcontent)
troubleE<-fromJSON(troubleC$entities$E)

#prepare title list
## IMPORTANT - all the titles have to be lower case and there can't be any punctuation
TitlesToFind <- read.delim("~/DataScienceResearchInitiative/TitlesToFind.csv", header=FALSE)

TitlesToFindl<-apply(TitlesToFind,1,tolower)

TitlesToFindf<- gsub("[[:punct:]]"," ",TitlesToFindl)

head(TitlesToFindf)

#use the sys.sleep so you don't get an error for too many requests too quickly
for (i in 21:1904){
  temp<-getMScites(i)
  CitesOut[i,]<-temp
  Sys.sleep(2)}
write.csv(CitesOut,"MSdsCites.csv")

length(which(is.na(CitesOut$ti)))
length(which(is.na(CitesOut$abs)))

missCites<-which(is.na(CitesOut$ti))

for (i in 1:length(missCites)) {
  temp<-getMScites(missCites[i])
  CitesOut[missCites[i],]<-temp
  Sys.sleep(2)
}

edited: to fix formatting. Also, the missing cites were writing to the wrong rows... sigh.

No responses yet

Retrieving article metadata from Microsoft Academic Scholar

Nov 19 2016 Published by under bibliometrics, information analysis

In the ongoing saga of doing information analysis and bibliometrics of some sort in computer science... now I need affiliations. As a reminder, I did the first bit of this work in Inspec* because it has high quality metadata, but then I discovered after reviewing results with SMEs that it was totally missing a bunch of important conferences in the field - most notably some big ones from ACM. So I searched DBLP using their API, ACM's Guide to the Computing Literature, ArXiv, and CiteSeer and found a bunch more interesting articles. I de-duplicated against the Inspec set and then did topic modelling using whatever I had (abstract, title, and keywords). Well, ACM doesn't export abstracts and DBLP doesn't even have them.

And then I got all turned around after linking the article titles back to the topics and working with the SMEs to name and select the interesting topics... so, oops... now I had a list of ~2000 bare titles with no other information, when what I actually needed to provide was a list of top organizations and top venues for these interesting topics... Uh-oh.

Of course Google Scholar doesn't have an API. Aminer does, but a quick check had it returning 0 results for my first few titles through the web interface. CiteSeer, I don't even know. What to do? Ah-ha: Microsoft Academic Search* does have an API, but it's not all that comprehensive yet... oh wait - it actually IS quite good in computer science.

Ideally, there would already be an rOpenSci package to search it, but the only package I found was for using some of the other Microsoft Cognitive Services APIs. The main Academic Knowledge site makes it very easy to sign up to make up to 10k requests a month for free. There's even a console you can use to test your queries separately from your code.

So what's the problem, you ask? Just iterate through searching for each title, pull down JSON for just the fields you need (C.CN, J.JN, AA.AfN), parse into a data frame, then tot them up... Yet our searches were not getting any results... until we happened on a StackOverflow question: you need to lower case and remove all punctuation prior to searching.

An SME at work ended up doing the actual coding for this part, but I'm going to try to reproduce it on my own to make sure I have it. When I do, I will surely share.

Long story short: 1) it would be lovely to have a package to use for this API, 2) MAG does work fine for this purpose and this topic, and 3) be sure to lower case and remove punctuation as a first step.

*no endorsement intended

3 responses so far

Parsing citations for dabblers

Jun 26 2016 Published by under information analysis

Warning! This post is more about questions than answers!

Wouldn't it be nice to be able to grab the references from the end of an article - say the article is in PDF, even - and have them in a usable tagged format? I'm surely not the only one to consider this. In fact, everyone seems to do it: CiteSeer, the ACM Digital Library, and others. Deborah Fitchett had success figuring this out.

My incentive is a bit different. I'm looking at a pile of proposals and I want to know what citations they have in common. Everyone cites themselves, of course, but we think there are several schools of thought that we should be able to identify.

My general thought was

  1. extract the bibliography
  2. parse
  3. label each citation with firstauthorlastnamepubyear - even if there are multiple works by a single author, I think it should be good enough? pretty rare to volte-face mid year? (see the sketch just after this list)
  4. Make a matrix proposers x citation labels
  5. Graph and profit
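
Here's a minimal sketch of steps 3-5, assuming a hypothetical data frame refs with one row per proposer-citation pair and columns proposer, first_author, and year:

#build the firstauthorlastnamepubyear labels
refs$label <- paste0(tolower(refs$first_author), refs$year)
#proposers x citation-labels incidence matrix, binarized
m <- table(refs$proposer, refs$label)
m[m > 1] <- 1
#shared[i,j] = number of citations proposers i and j have in common
shared <- m %*% t(m)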

There are several ways to extract the bibliography. I realize now that I probably should have done something with Beautiful Soup or the like - and even if not, some tools actually take in a whole document and find the bibliography themselves. Anyway, I have them.

Now for parsing here is a list of ones that may be around and usable in some fashion (this is a helpful listing)

  • ParsCit - this seems really to be the most popular and the most successful
  • FreeCite - a web service from Brown University libraries
  • Using CrossRef
  • ParaCite / ParaTools from Southampton - perl modules
  • eta: AnyStyle.io - I tried the web interface and it worked pretty well

ParsCit seems to be the most popular, so I thought I would give it a go. The page is not terribly hopeful about running it on Windows. Well... so I did request and receive an Ubuntu VM to play with... hoo-boy, the instructions are fairly off-putting and really sort of condescending (if you know how to work in Unix, this is trivial).

So now, instead, I'm playing with using the Brown library service and RCurl to see what I can do. Of course, then I have to deal with what I get back. Meh.
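
So far it's only a sketch - and I'm assuming FreeCite's documented endpoint and citation[] form field here, which may well have changed:

library(RCurl)
#POST one reference string to the service; ask for XML back
resp <- postForm("http://freecite.library.brown.edu/citations/create",
                 "citation[]" = "Doe, J. (2010). A made-up title. Some Journal, 1(1), 1-10.",
                 .opts = list(httpheader = c(Accept = "text/xml")))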

If I get it all figured out, I'll be sure to report back.


Edited: Richard Urban reminded me about AnyStyle.io so I added it. I think I will try to get an API key to try to get to it through R using RCurl. Because of time constraints... right now I'm just extracting the network using Excel 🙁

No responses yet

Notes from International Symposium on Science of Science 2016 (#ICSS2016) - Day 1

This conference was held at the Library of Congress March 22 and 23, 2016. The conference program is at: http://icss.ist.psu.edu/program.html

I had the hardest time remembering the hashtag, so you may want to search for variants with more or fewer Cs and Ss.

This conference was only one track but it was jam-packed and the days were pretty long. On the first day, my notes were by hand and my tweets were by phone (which was having issues). The second day I brought a power strip along and then took notes and tweeted by laptop.

One thing I want to do here is gather the links to the demo tools and data sets that were mentioned, with some short commentary where appropriate. I do wish I could have gotten myself together enough to submit something, but what with the dissertation and all (and then I'm only a year late on a draft of a paper and then I need to write up a few articles from the dissertation and and and and...).

Maryann Feldman SciSIP Program Director

As you would expect, she talked about funding in general and the program. There are dear colleague letters. She really wants to hear from researchers in writing - send her a one-pager to start a conversation. She funded the meeting.

Katy Börner Indiana University

She talked about her Mapping Exhibit - they're working on the next iteration and are also looking for venues for the current one. She is interested in information analysis/visualization literacy (hence her MOOC and all her efforts with Sci2 and all). One thing she's trying now is a weather report format. She showed an example.

She did something with descriptive models of the global scientific food web: where are the sources and where are the sinks of citations?

Something more controversial was her idea of collective allocation of funding: give each qualified PI a pot of money that they *must* allocate to other projects. So instead of a small body of reviewers, everyone in the field would be a reviewer. If a top PI got more than a certain amount, they would have to re-allocate the excess to other projects.

I'm not sure I got this quote exactly but it was something like:

Upcoming conference at National Academy of Science on Modeling Sci Tech Innovations May 16-18.

They have a data enclave at Indiana with research data they and their affiliates can use. (I guess LaRiviere also has and has inherited a big pile o' data? This has been a thought of mine... getting data in a format so I could have it lying around if I wanted to play with it.)

Filippo Radicchi Indiana University

He spoke about sleeping beauties in science: articles that receive few citations for many years and then are re-discovered and start anew. This is based on this article. It turns out the phenomenon occurs fairly regularly and across disciplines. In some cases it's a model that becomes more useful when computing catches up. In other cases it's when something gets picked up by a different discipline. One case is something used to make graphene. He's skeptical that one of the top articles in this category is actually being read by the people who cite it, because it's only available in print, in German, from just a few libraries! (However, a librarian in the session *had* gotten a copy for a staff member who could read German.)

I would love to take his 22M-article data set and try longitudinal k-means. If sleeping beauty is found often, what are the other typical shapes beyond the standard one?
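
(If I ever get to it, the kml package is probably where I'd start - a rough sketch, assuming a hypothetical matrix citesByYear of articles by citations-per-year-since-publication:)

library(kml)
cld <- clusterLongData(traj = citesByYear)
kml(cld, nbClusters = 2:6)  #longitudinal k-means at several cluster counts
choice(cld)                 #inspect and pick a clustering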

He also touched on his work with movies - apparently using an oft-overlooked section of IMDb that provides information on references (uses the same framing as x, adopts the cinematography style of y, remakes z... I don't know, but relationships).

Carl Bergstrom University of Washington

The first part of his talk reviewed Eigenfactor work, which should be very familiar to this audience (well, except a speaker on the second day had no idea it was a new-ish measure that has since been adopted by JCR - he should update his screenshot - anyhoo).

Then he went on to discuss a number of new projects they're working on. Slides are here.

Where ranking journals has a certain level of controversy, they did continue on to rank authors (ew?), and most recently articles which required some special steps.

Cooler, I think, was the next work discussed: a mapping technique for reducing a busy graph to find patterns. "Good maps simplify and highlight relevant structures." Their method did well when compared to other methods and made it possible to compare over years. Nice graphic showing the emergence of neuroscience. They then did a hierarchical version. Also pretty cool. I'd have to see this in more detail, but it looks like a better option than the pruning and path methods I've seen to do similar things. This hierarchical map thing is now being used as a recommendation engine. See babel.eigenfactor.org. I'll have to test it out to see.

Then (it was a very full talk): women vs. men. Men self-cite more, which means they have higher h-indices.

Jacob Foster UCLA (Sociology)

If the last talk seemed packed, this was like whoa. He talked really, really fast and did not slow down. The content was pretty heavy duty, too. It could be that the rest of the room basically knew it all so it was all review. I have read all the standard STS stuff, but it was fast.

He defines science as "the social production of collective intelligence."

Rumsfeld unknown unknowns... he's more interested in unknown knowns. (what do you know but do not know you know... you know? 🙂 )

Ecological rationality - the rationality of choices depends on context - vs. rational choice theory, which is just based on rules, not context.

Think of scientists as ants: a complex sociotechnical system. It's an information processing problem, using Marr's levels:

  • computational level: what does the system do (e.g.: what problems does it solve or overcome) and similarly, why does it do these things
  • algorithmic/representational level: how does the system do what it does, specifically, what representations does it use and what processes does it employ to build and manipulate the representations
  • implementational/physical level: how is the system physically realised (in the case of biological vision, what neural structures and neuronal activities implement the visual system)

https://en.wikipedia.org/wiki/David_Marr_(neuroscientist)#Levels_of_analysis

Apparently less studied in humans is the representational-to-hardware level... ? (I have really, really bad handwriting.)

Science leverages and tunes basic information processing (?)... cluster attention.

(Incidentally, it's totally weird that Google Scholar doesn't know about "American Sociological Review"? or ASR? I ended up browsing.)

Foster, J.G., Rzhetsky, A., Evans, J.A. (2015) Tradition and Innovation in Scientists' Research Strategies. ASR 80, 875-908. doi: 10.1177/0003122415601618

Scientists try various strategies to optimize between tradition (more likely to be accepted) and innovation (bigger pay offs). More innovative papers get more citations but conservative efforts are rewarded with more cumulative citations.

Rzhetsky, A., Foster, I.T., Foster, J.G., Evans, J.A. (2015) Choosing experiments to accelerate collective discovery. PNAS 112, 14569–14574. doi: 10.1073/pnas.1509757112

This article looked at chemicals in PubMed. Innovative meant new chemicals; traditional meant staying in the neighborhood of old ones. They found that scientists spend a lot of time in the neighborhood of established, important chemicals when they could advance science better by looking elsewhere. (hmmm, but... hm.)

The next bit of work I didn't get a citation for - not even enough to search - but they looked at JSTOR and word overlap. Probabilistic distribution of terms. Joint probability. (maybe this article? pdf). It looked at linguistic similarity (maybe?) and then the export/import of citations. So ecology kept to itself while the social sciences were integrated. I asked about how different social science fields use the same word with vastly different meanings - mentioned Fleck. He responded that it was true, but often there is productive ambiguity in a new field misusing or misinterpreting another field's concept (e.g., capital). I'm probably less convinced about this one, but I would need to read further.

Panel 1: Scientific Establishment

  • George Santangelo - NIH portfolio management. Meh.
  • Maryann Feldman - geography and Research Triangle Park
  • Iulia Georgescu, Veronique Kiermer, Valda Vinson - publishers who, apparently, want what might already be available? Who are unwilling (except PLOS) or unable to share data/information quid pro quo in return for things. Who are skeptical (except for PLOS) that anything could be done differently? That's my take, anyway. Maybe others in the room found it more useful.

Nitesh Chawla University of Notre Dame

(scanty notes here - not feedback on the talk)

Worked with ArnetMiner data to predict h-indices.

Paper: http://arxiv.org/abs/1412.4754 

It turns out that, according to them, venue is key. So despite all of the articles that found poor correlation between JIF and an individual paper's likelihood of being cited... they say it's actually a pretty good predictor when combined with the researcher's authority. Yuck!

Janet Vertesi Princeton University

Perked up when I realized who she is - she's the one who studied the Rover teams! Her book is Seeing Like a Rover. Her dissertation is also available online, but everyone should probably go buy the book. She looked at a more meso level of knowledge; she's really interested in teams. She found that different teams - even teams with overlapping membership - managed knowledge differently. The way instrument time (or really, spacecraft maneuvering so you can use your instrument time) was handled was very different. A lot had to do with the move in the '90s toward faster... better... cheaper (example: MESSENGER). She used co-authorship networks in ADS and did community detection. Co-authorship shows team membership, as the same casts of characters write together. This field is very different from others, as publications are in mind while the instruments are being designed.

She compared Discovery-class missions: Mars Exploration Rover - collectivist, integrated; everyone must give a go-ahead for decisions. MESSENGER - design system working groups (oh my handwriting!)

vs. Flagship - Cassini - hierarchical, separated. Divided up sections of spacecraft. Conflict and competition. Used WWII as a metaphor (?!!). No sharing even among subteams before release.  Clusters are related to team/instrument.

A new PI is working to merge across these divisions - this did show in the evolution of the network to a certain extent.

Galileo is another flagship example: it breaks down into separate clusters, not coordinated.

Organization of teams matters.

I admitted my fan-girl situation and asked about the engineers. She only worked with the scientists because she's a foreign national (that may not mean anything to my readers who aren't in this world, but others will be nodding their heads). She is on a team for an upcoming mission so will see more then. She also has a doctoral student who is a citizen who may branch off and study some of these things.

Ying Ding Indiana University

She really ran out of time in the end. I was interested in her presentation but she flew past the meaty parts.

Ignite Talks (15 seconds per slide, 2 minutes overall, or similar)

  • Filippo Menczer - http://scholarometer.indiana.edu/ - tool to view more information about authors and their networks. Browser extension.
  • Caleb Smith,
  • Orion Penner - many of us were absolutely transfixed that he dropped his note pages on the floor as he finished. It was late in the day!  He has a few articles on predicting future impact (example). On the floor.
  • Charles Ayoubi,
  • Michael Rose,
  • Jevin West,
  • Jeff Alstott - awesome timing, left 15 seconds for a question and 15 for its answer. Audience didn't play along.

Lee Giles Penn State University

It was good to save his talk for last. A lot going on besides keeping CiteSeer up and running. They do make their data and their algorithms freely available (see: http://csxstatic.ist.psu.edu/about). This includes extracting references. They are also happy to add in new algorithms that make improvements and work in their system. They accept any kind of document that works in their parsers, so typically journal articles and conference papers.

RefSeer - recommends cites you should add

TableSeer - extracts tables (didn't mention and there wasn't time to ask... he talked a lot about this for chemistry... I hope he's working with the British team doing the same?)

He also has things to extract formulas, plots, and equations. Acknowledgements. It recommends collaborators (0 for me, sniff). See his site for links.


2 responses so far

Citation Manager Frustration

Nov 20 2015 Published by under information analysis

I've used most of the major citation managers: RefWorks, Flow, EndNote (web and client), Zotero, Mendeley, ProCite (remember that one?). I've looked at Papers. I've dabbled with BibTeX using a couple of different tools. I've watched the videos on the new RefWorks 3 coming out. I've given training on RefWorks and EndNote... and I'm frustrated.

Almost all of these have the model that you are one person, with one field of research, who will continue to use the same somewhat limited set of references over many years.

As a librarian, I like to compile references for people into RefWorks collections and then turn them over. I've done this by setting up new accounts for them. This will not be possible in the new version, in which there's only one account per e-mail. Sharing folders doesn't work because I don't want my dissertation and professional work database to be crammed with thousands of unrelated articles.

I have tried to collaborate with people in Zotero and I can't seem to get rid of a thousand or so articles that were relevant to a project from several years ago.

Further, they only get worse at deduplication. I use a citation manager to compile references and deduplicate for bibliometrics. RefWorks chokes when your database gets above a few thousand records. Plus, it's not configurable: you can't say ignore the title, just look at year, authors, publication, or something like that. Or only look at the title. You can't review duplicates, find that they're part 1 and part 2, or a conference presentation and the journal paper, and then have them not show up every time. These are really not intended for my uses.

So I'm going to go back to trying client software. First I'll try EndNote X (an ancient copy that has been sitting here in a box, still sealed). Then I'll probably go to BibTeX, maybe using SVN, and maybe using code for some of the tricks. Why should I have to?

And that's another reason to get my @##% dissertation done while RefWorks and the Word plugin still work on my home computer.

5 responses so far

Which are the bestest? Top articles from a diverse organization - part 1

Apr 01 2015 Published by under bibliometrics, information analysis

In which Christina goes into the weeds, yet not really thoroughly enough... anyhoo.

So MPOW is approaching an anniversary and we're looking at retrospectives of all sorts. What are the top articles we've published in the literature? What do you mean by top? Ok, so let's say that top means most cited - just for argument's sake. Is it really fair to compare a biomed article to an aerospace engineering article? An article published last week (ok, if in a special issue it might come complete with citations attached) with one published 5 years ago? 10? 20? Review articles with .... you see where this is going.

I had thought to normalize by 5- or 10-year periods and use the subject categories in WoS. But... 1) there are a lot of them, 2) they overlap, 3) argh. And things like Acoustics, for example: JASA covers biomed stuff like hearing and it covers underwater sound... but they're not cited the same... at... all. The acoustics category covers medical journals, physics journals, and maybe some math and engineering (I'd have to look again to be sure).

At the same time, the nice folks on SIGMETRICS had an argument starting last weekend and going through the beginning of the week on various normalization schemes. One of the complaints against the impact factor is that it's an average, and averages don't work on skewed distributions. And the WoS categories suck.

So... what I'm trying to do now is both fractional counting (and I'm checking to make sure I know what that is, but I think you don't get credit for 1 citation - you get credit for 1/(total things cited by the citing article), so a citation from a review article is worth a lot less than one from a regular article because it may be +1/200 vs. +1/30) and then normalizing by percentile. Not even the normal percentile, but the Hazen (1914) percentile. Tricky.
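
In R terms, the idea looks something like this - a sketch only, assuming a hypothetical data frame cites with one row per citation, the cited article's id, and n_refs = the number of references in the citing article:

#fractional credit: each citation is worth 1/(references in the citing article)
cites$credit <- 1 / cites$n_refs
fractional <- aggregate(credit ~ cited_id, data = cites, FUN = sum)
#Hazen (1914) percentile: 100 * (rank - 0.5) / n
n <- nrow(fractional)
fractional$pctile <- 100 * (rank(fractional$credit, ties.method = "average") - 0.5) / n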

I'll be sure to share the script once I've got it. So far the method looks like:

  1. Find my org, relevant time period, articles only, in WoS.
  2. Sort by times cited; pull off the most cited, or all the ones cited more than x, or something. Save them down as plain text full records (probably don't need the cited references?)
  3. Then, for each of the top articles, click on Times Cited. Export them all in tab-delimited Windows UTF-8.
  4. Move them over to the data folder.
  5. Run the R script (to be shared when I'm sure it's right) to get the new TCs and stick them into the file from step 2.

*note: if your thingy was cited more than 500 times, you can't export them all at once. Also, this would not be practical if you have something with thousands of citations. If you do, I would just take the plunge and call that one of the best. We only had 5 over 500.
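
The heart of step 5 will be something like this - a sketch only, assuming each export file holds the citing records for one of our top articles and has the usual WoS NR column (the citing article's cited-reference count):

files <- list.files("data", pattern = "\\.txt$", full.names = TRUE)
fracTC <- sapply(files, function(f) {
  citing <- read.delim(f, quote = "", stringsAsFactors = FALSE)
  sum(1 / citing$NR, na.rm = TRUE)  #each citing record contributes 1/NR
})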

Next, I'll put them into the ISI.exe script and then the i3 script from here.  See what happens.

As for normalizing by year: I was thinking about maybe omitting a couple of years or so, then doing 5-year bins 3 times, and then doing 10-year bins. Not sure. Willing to take advice. It's a 75-year history, but there was a similar paper done in 1986, so I only agreed to go back to 1980. Before a certain time - no longer necessarily 1973 - the affiliation/address isn't there. One very nice retiree I had the pleasure to meet just died, and I found that he was listed in Garfield's most-cited articles. His work on polar gases is not coming up in the search, so the data is definitely not complete that far back.

Comments are off for this post

Using more of the possible dimensions in a network graph

Jan 30 2015 Published by under information analysis

When doing bibliometrics, or social network analysis, or any kind of network graph, there are only so many ways to convey information:

  • Size of nodes
  • Shape of nodes (including pictures)
  • Color of nodes
  • Border of nodes (or multiple borders)
  • Labels (node or edge)
  • Edge weight
  • Edge color
  • Arrows
  • Shading areas around/behind nodes
  • Layout or arrangement of nodes

Of these, I almost always size nodes by degree (connections to other nodes), set line thickness by edge weight, and use some sort of energy or spring layout.

If I do some sort of clustering or community detection or even want to call out components, I'll do that with node color.

My normal things are easy in any package that will graph networks. But I was working on a project where we were looking at the maturity of a particular part of an industry. As part of this, we wanted to know if the necessary component systems were available from multiple suppliers, if those suppliers had relationships with different system integrators, and if their offerings were operational or just for lab or testing purposes.

We could have done a graph for each subsystem, but they wanted this to really just be one slide in a fairly small deck. I tried various approaches in Gephi and NetDraw and wasn't excited. So back to R and igraph. In the end (anonymized):

[Figure: the resulting graph - minus any labels]

I used:

  • node shape for whether it's a component or a system integrator
  • color for type of component
  • size for degree
  • line dashed or dotted for whether or not it was in operation

I really wanted to show a different shape for each category, but igraph only has like 6 default ones and they don't look all that different from each other. NetDraw has more. I tried to use raster images, but I'm on a Windows machine and I found all that very confusing.
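
(For anyone looking: igraph will at least tell you what it has, and in principle you can register your own shape - I didn't get that far:)

library(igraph)
shapes()  #the built-in vertex shape names - only a handful
#add_shape() can register a custom shape, but that takes real work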

One unfortunate thing about this graph is that I had to list companies multiple times if they had offerings in multiple categories.

Customer seemed to like it.

I'm not going to take the time to anonymize all the code, but here are some key pieces - ask if there's anything I figured out that you don't immediately see how to do.

I started with a spreadsheet (3 of us librarians were adding data):

nodetable tab:
id label category

edgetable tab:
source target yes/no notes

These I imported into Gephi (super easy)... and then tried all sorts of stuff... and then exported to graphml.
#read in the graph
g<-read.graph("g.graphml", format="graphml")


#shape nodes. these work, but you can't have n/a. so there has to be a default. also, there is an easier way
for (i in 1:101)ifelse(V(g)[i]$Category=='category', V(g)[i]$shape<-'circle', V(g)[i]$shape<-'square')

#color of nodes - a simple number will draw from the palette. see below
for (i in 1:101)if(V(g)[i]$Category=="category"){V(g)[i]$color<-1}

#calculate and keep the degree. i use it again for label placement (not shown) and to bold some labels (not shown)
V(g)$degree<-degree(g, mode="all")

#when I tested the graphing, the isolates were all mixed in and messed up all the labels.
#subgraph to show isolates separately
gi<-induced.subgraph(g,V(g)$degree==0)
gnoni<-induced.subgraph(g,V(g)$degree!=0)

#make dotted lines for not operational
for (i in 1:76) ifelse (E(gnoni)[i]$"operational"=="yes", E(gnoni)[i]$edge.lty<-1,E(gnoni)[i]$edge.lty<-2)

#prettier colors
library("RColorBrewer", lib.loc="~/R/win-library/3.1")
mypalette<-brewer.pal(6,"Paired")
palette(mypalette)

#legend definitions
colors <- c('gray40', 1,2,3,4,5,6)
labels <- c("category 1", "category 2")  #placeholder - the real vector of category names is anonymized


#plot graph keep device open
plot.igraph(gnoni, layout=layout.fruchterman.reingold, edge.arrow.size=0.1,
            edge.color="black", vertex.size=V(gnoni)$degree,
            vertex.label.dist=V(gnoni)$vertex.label.dist, vertex.label.color="black",
            vertex.label.family="sans", edge.curved=TRUE, vertex.label.cex=0.8,
            edge.lty=E(gnoni)$edge.lty)

#put legends on - isolates are just shown as a legend so they are neatly lined up
#could have been done by plotting points

legend("bottomright",legend=labels, fill=colors, border="black", cex=0.7, inset=c(-0.1,0))
legend("topleft", legend=V(gi)$label, pch=19, col=V(gi)$color, cex=0.7, bty="n", y.intersp=0.5)
legend("topright", legend=c("Yes", "No"), lty=c(1,2), cex=0.7,inset=c(-0.02,0)) Continue Reading »

2 responses so far

I'm a coding fool... use of the nuggets mentioned last post

Jul 18 2014 Published by under information analysis, Uncategorized

Self-efficacy FTW... may I never crash!

Last post I mentioned determining if a string is an element of a character vector and opening Excel files in R. Here's what I was doing and how it worked.

I had a directory of xls files downloaded from a research database, one per organization, showing the high-level subject categories in which they published. The subject categories were actually narrower than what I needed (think condensed matter, high energy, AMO, chemical physics, when I need "physics"). I needed to rank the organizations by articles published in composite categories that each included maybe 10 or so of the categories from the database.

Originally, I was going to open all the files and just count them up, but whoa, that would suck. R for the win!

First, run RStudio as either 32- or 64-bit, depending on which Java you have installed.

Next, get the list of files. I had saved them in a directory with other things, too, so I needed to search by pattern. I had already set my working directory to the data directory, for good or bad.

#read.xlsx below needs the xlsx package (see last post)
library(xlsx)

fileList<-list.files(pattern = "subject")

Get the list of their categories for my composite category (physics here). This was just a list:

physics <- read.delim("physics.txt", header=F)

Make sure words are characters and not factors, numbers are just numbers. By trial and error and inspection I figured out that there was a non-numeric character in the count field.

Here's the function:
countPhysics<- function(file){
  #this creates a vector to hang on to the numbers of counts
  phys.hold <- vector(mode="numeric")
  #this is so i can make sure i just have numbers in the count field
  pattern<-"[[:digit:]]+"
  physfile<-read.xlsx(file,1,startRow=4)
  #this finds matching records and then keeps just the part we want
  m<-regexpr(pattern,physfile$Count)
  physfile$Count<-as.numeric(regmatches(physfile$Count,m))
  #one of these days i'll just import right the first time instead of this part
  physfile$Analysis.Value<-as.character(physfile$Analysis.Value)
  for (j in 1:length(physfile$Count)){
    #match against the first (only) column of the category list, not the whole data frame
    if (is.element(physfile$Analysis.Value[[j]],physics$V1)) {
      phys.hold[j]<- (physfile$Count[[j]])}
    else phys.hold[j]<-0
  }
  total<- sum(phys.hold)
  return(c(file,total))
}

So you run this like so:

physicsResult <-sapply(fileList,countPhysics)

I transposed it and then pasted it into an Excel file I was working on, but this is essentially the answer. I did the same thing for the other categories separately, though obviously I should have checked each line for matches against each/all of my categories before moving to the next line and then output a single data frame. Oh well.

Comments are off for this post

Another little R nugget

Jul 16 2014 Published by under information analysis

Actually 2 - I'm on a tear. Using what I've learned for work!

First:

is.element(thing,vector)

pretty cool, huh? and useful. Found on StackOverflow. Other ideas there, too.
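
For example:

is.element("physics", c("chemistry", "physics"))  #TRUE
#same idea as the %in% operator: "physics" %in% c("chemistry", "physics")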

Second:

Opening an Excel file. Now what you're probably thinking is: "just save as csv or tab delimited and open that the normal way"... well yes, but I've got like 50 files that were exported from the database as .xls, and I didn't open each one while exporting to re-save it.

So opening is one thing but I need a couple of columns out of the first sheet and there's some junk at the top that I don't need.

Package is xlsx (pdf). I had just installed Python (thought I had it at work, but I guess not) and was preparing to do it there, but then I was thinking that surely R would have a way.

The issue in my case is that my work computer is 64-bit, but I have 32-bit Java, 32-bit Office, etc. I was running R in RStudio as 64-bit. I tried everything to get Java updated and 32-bit (without killing other things I needed). Finally - duh - I just pointed RStudio at the 32-bit version of R installed, and then rJava, which xlsx requires, ran just peachy.

Here's the command I'm using so far:
file<-read.xlsx("analysis.xls",1,startRow=4, colClasses=c("character", "numeric"))

1 is the sheet index. The colClasses is pretty slick, too. You can also name sheets and only grab certain columns.
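
For instance, variations on the same call going by the package docs (the sheet name here is made up):

#grab a sheet by name instead of index, and only columns 1 and 3
file <- read.xlsx("analysis.xls", sheetName = "Sheet1", colIndex = c(1, 3))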

So now to iterate through all the files in a directory, opening, counting up the articles listed for the categories I have in another vector and reporting that out. Should be cool. Or not.  And I promised to have the results - however they're done - Friday and I'm going to be out tomorrow. Back to it!

One response so far

Trying another CAQDAS, MaxQDA

Jul 12 2014 Published by under dissertation, information analysis

CAQDAS: Computer Assisted Qualitative Data Analysis Software (more).

Previously, I'd used NVivo, and I found it to be miserable and horribly expensive (I didn't pay, but I would have to this time). I really did most of my work offline with colored pencils, note cards, etc. I am fully willing to admit that I didn't know how to use it or that maybe I was doing it wrong, but instead of saving me time, it was costing time.

I started trying to use it with my dissertation data and ew. It's fine with my interview transcripts, but tweets and blog posts, just yuck. So then I started coding just using Excel and meh.

So back to the drawing board. I read a bunch of reviews of different products on Surrey's pages: http://www.surrey.ac.uk/sociology/research/researchcentres/caqdas/ , but it's really hard to tell. I also started looking at prices and oh la la!  I was thinking maybe dedoose, despite their earlier issues, but that's like at least $10/month. Not bad until you think you might do this for a while.

After all that MaxQDA - a German product - seemed like a slightly better choice than some. The student license doesn't expire and is for the full product (they have a semester one, but that doesn't make sense for me) - that's probably the biggest selling point.

So far so good. Importing about a thousand blog posts as individual documents was no big deal. Interview transcripts were even quicker. Adding in my codes was super quick and it was super quick to move one when I stuck it in the wrong hierarchy. I think I'll work with this data for a bit before doing the twitter archives - particularly since I don't know how I might sample.

I'm still on the 30-day trial. Not sure if I need to try to start paying for it with a week or so to spare so the verification can be completed. My university doesn't have an expiration date on our IDs. Not sure if my advisor has to send an e-mail or what.

Yeah, there's something for R (of course), but it doesn't have the features. I was still thinking I might do some more machine learning and other tricks with my data using R which is easy now that it's all in spreadsheets.

One response so far
