DBLP > EndNote using R

(by Christina Pikas) Oct 17 2016

I'm doing a study in which I'm mapping the landscape for an area of Computer Science. I did the initial work in Inspec, and once I found the best search (hint: use a classification code and then the term), I was pretty happy with the results. When I showed it to my SMEs, however, they fairly quickly noticed I was missing some big-name ACM conferences in the field. I've contacted Inspec about those being missing from the database, but in the meantime, oops! What else is missing?

The more comprehensive databases in CS are things like the ACM Guide to Computing Literature, CiteSeer, and DBLP.... ACM is very difficult to be precise with, and you can only export either all of the references or one at a time... CiteSeer was giving me crazy results... DBLP had good results but, once again, you export one at a time.

So here's how to use DBLP's API through R and then get the results into EndNote (using X7 desktop).

#getting stuff faster from dblp
options(stringsAsFactors = FALSE)
library("httr", lib.loc="~/R/win-library/3.3")
library("jsonlite", lib.loc="~/R/win-library/3.3")
library("XML", lib.loc="~/R/win-library/3.3")
library("plyr", lib.loc="~/R/win-library/3.3")
library("dplyr", lib.loc="~/R/win-library/3.3")


#http://dblp.org/search/publ/api for publication queries


# Parameters (name - description; default; example):
# q      - the query string to search for, as described on a separate page (...?q=test+search)
# format - the result format: "xml", "json", or "jsonp"; default xml (...?q=test&format=json)
# h      - maximum number of search results (hits) to return, capped at 1000 for bandwidth reasons; default 30 (...?q=test&h=100)
# f      - the first hit in the numbered sequence of search results (starting with 0) to return; combined with h, can be used for pagination; default 0 (...?q=test&h=100&f=300)
# c      - maximum number of completion terms to return, capped at 1000 for bandwidth reasons; default 10 (...?q=test&c=0)

#ask for JSON explicitly (the API defaults to XML) since we parse with jsonlite below
raw.result <- GET("http://dblp.org/search/publ/api?q=wrangl&format=json")

this.raw.content <- rawToChar(raw.result$content)

#parse the JSON, keeping nested lists so each hit can be flattened into a data frame row;
#the hits live in result$hits$hit in the response
this.content.list <- fromJSON(this.raw.content, simplifyVector = FALSE)

this.content.frame <- ldply(this.content.list$result$hits$hit, data.frame)

#update to be sure to use the correct field names - except for author, because those still need to be combined later
#two-word field names have to be made into one word for R - have to edit them back later
#ReferenceType has to be first to import multiple types in one file; the order of the others doesn't matter
content.frame3 <- data.frame(ReferenceType = this.content.frame$info.type,
                            Title = this.content.frame$info.title,
                            author = this.content.frame$info.authors.author,
                            author1 = this.content.frame$info.authors.author.1,
                            author2 = this.content.frame$info.authors.author.2,
                            author3 = this.content.frame$info.authors.author.3,
                            author4 = this.content.frame$info.authors.author.4,
                            author5 = this.content.frame$info.authors.author.5,
                            author6 = this.content.frame$info.authors.author.6,
                            SecondaryTitle = this.content.frame$info.venue,
                            Pages = this.content.frame$info.pages,
                            Year = this.content.frame$info.year,
                            URL = this.content.frame$info.url,
                            Volume = this.content.frame$info.volume,
                            Number = this.content.frame$info.number,
                            SecondaryAuthor = this.content.frame$info.author,
                            Publisher = this.content.frame$info.publisher)

#want to get all authors together and get it basically in the format for TR. 
# first get all authors together separated by ; 
# http://stackoverflow.com/questions/6308933/r-concatenate-row-wise-across-specific-columns-of-dataframe
# example: data <- within(data, id <- paste(F, E, D, C, sep=""))

content.frame4 <- within(content.frame3, Author <- paste(author, author1, author2, author3, author4, author5, author6, sep="; "))

# http://stackoverflow.com/questions/22854112/how-to-skip-a-paste-argument-when-its-value-is-na-in-r
content.frame4$Author <- gsub("NA; ", "", content.frame4$Author)
#the paste above leaves a trailing "; NA" when the later author slots are empty - strip that too
content.frame4$Author <- gsub("; NA$", "", content.frame4$Author)


#remove NA from other fields - handled at the end by passing na = "" to write.table


#now drop unwanted columns using df <- subset(df, select = -c(a,c) )  from http://stackoverflow.com/questions/4605206/drop-data-frame-columns-by-name

content.frame5 <- subset(content.frame4, select = -c(author, author1, author2, author3, author4, author5, author6))

#add in a gsub for the correct reference types
content.frame5$ReferenceType<-gsub("Conference and Workshop Papers","Conference Paper", content.frame5$ReferenceType)
content.frame5$ReferenceType<-gsub("Parts in Books or Collections","Book Section", content.frame5$ReferenceType)
content.frame5$ReferenceType<-gsub("Books and Theses","Book", content.frame5$ReferenceType)
content.frame5$ReferenceType<-gsub("Journal Articles","Journal Article", content.frame5$ReferenceType)

#need tab delimited, no row names, and column names updated to have the necessary spaces

correctnames<- c("Reference Type","Title", "Secondary Title", "Pages", "Year",  "URL", "Volume", "Number", "Secondary Author", "Publisher", "Author")

# if only one type of reference, add *Generic to the top of the file; also add a vector of correct column names
#write.table(content.frame5, "dblptestnew.txt", append = T, quote = F, sep = "\t", row.names = F, col.names = correctnames, na = "", fileEncoding = "UTF-8")

#if multiple types use this one
write.table(content.frame5, "dblp30wrangl.txt", quote = F, sep = "\t", row.names = F, col.names = correctnames, na = "", fileEncoding = "UTF-8")

(this is also on GitHub because WP keeps messing up the code)

After you have this file, import it into EndNote using the boilerplate tab delimited import option, with UTF-8 translation.


Unpacking Societies Publishing With For Profit Companies

(by Christina Pikas) Aug 06 2016

This week, Jon Tennant went off on a riff on Wiley and the poor experience he had with a particular journal published for a society by Wiley.

First - I'm not affiliated and so very much not endorsing any companies, etc.

Second - I'm on record saying some things are worth paying for and I still feel that way.

I've reviewed for a Wiley-published society journal but not published with one. The ScholarOne interface is like whoa, yuck, but that is, by the way, actually a TR product. All my interactions with the editorial staff have been very professional and pleasant.

I've also been helping a colleague navigate ScholarOne to submit to a Taylor and Francis journal. It has been more than a year and we're still going back and forth with them. E-mails to the editor go unanswered. One reviewer was just like "this isn't science" and didn't do any more reviewing. The other has provided detailed feedback, which the authors have appreciated.

Over the years, I've seen plenty of organizations think they can just do it all themselves. Why, though, should they not outsource to vendors who already have set-ups? I mean, OJS is just ugly. Free CMSes are plentiful, but just because you can put articles online for cheap doesn't mean that they'll work with the rest of the ecosystem.

From what I can tell about what Tennant said, his real problem is with the society and the editors, not with the platform.

The other thing to think about is: if the society had to pay the intermediate vendors themselves (Atypon, etc.) and manage those relationships, would that really be cheaper than an all-in-one package? Maybe? Not sure.

Remember, too, that journals are sometimes expensive because the society sees them as a revenue stream so they can pay expensive executives and lobbyists and maybe a scholarship here or there.

If you're part of a society trying to make the decision, you'll likely have the numbers to help - but I don't think the decision is as obvious as everyone thinks.


Using spiffy WordPress themes in an IE environment

(by Christina Pikas) Jul 22 2016

At MPOW we apparently have "compatibility mode" on by default via group policy. This disables all the cool HTML5 and does weird things in general with a lot of web pages. If you look through the WordPress plugins, there are a few that show visitors nasty messages telling them they have to change or update their browser, but that's just super unhelpful for the many visitors who don't actually have a choice.

Anyhoo... I pieced this together from a few different sites. Go into the network dashboard, then Themes, edit the theme, and find the header.php file (back it up first just in case).

Then, directly under <head>, make the next line:

<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1">

and update... it works. This tells IE not to use compatibility view.


The Dissertation is Now Available for Your Enjoyment

(by Christina Pikas) Jun 30 2016

Or, you know, for bedtime reading.

Christina K. Pikas, PhD, 2016

This dissertation presents a literature-based framework for communication in science (with the elements partners, purposes, message, and channel), which it then applies in and amends through an empirical study of how geoscientists use two social computing technologies (SCTs), blogging and Twitter (both general use and tweeting from conferences). How are these technologies used and what value do scientists derive from them?

Method: The empirical part used a two-pronged qualitative study, using (1) purposive samples of ~400 blog posts and ~1000 tweets and (2) a purposive sample of 8 geoscientist interviews. Blog posts, tweets, and interviews were coded using the framework, adding new codes as needed. The results were aggregated into 8 geoscientist case studies, and general patterns were derived through cross-case analysis.

Results: A detailed picture of how geoscientists use blogs and Twitter emerged, including a number of new functions not served by traditional channels. Some highlights: Geoscientists use SCTs for communication among themselves as well as with the public. Blogs serve persuasion and personal knowledge management; Twitter often amplifies the signal of traditional communications such as journal articles. Blogs include tutorials for peers, reviews of basic science concepts, and book reviews. Twitter includes links to readings, requests for assistance, and discussions of politics and religion. Twitter at conferences provides live coverage of sessions.

Conclusions: Both blogs and Twitter are routine parts of scientists' communication toolbox, blogs for in-depth, well-prepared essays, Twitter for faster and broader interactions. Both have important roles in supporting community building, mentoring, and learning and teaching. The Framework of Communication in Science was a useful tool in studying these two SCTs in this domain. The results should encourage science administrators to facilitate SCT use by scientists in their organizations and information providers to search SCT documents as an important source of information.


Parsing citations for dabblers

(by Christina Pikas) Jun 26 2016

Warning! This post is more about questions than answers!

Wouldn't it be nice to be able to grab references from the end of an article - say the article is in PDF, even - and have them in a usable tagged format? I'm surely not the only one to consider this. In fact, everyone seems to do it: CiteSeer, the ACM Digital Library, and others. Deborah Fitchett had success figuring this out.

My incentive is a bit different. I'm looking at a pile of proposals and I want to know what citations they have in common. Everyone cites themselves, of course, but we think there are several schools of thought that we should be able to identify.

My general thought was:

  1. extract the bibliography
  2. parse
  3. label each citation with firstauthorlastnamepubyear - even if there are multiple works by a single author, I think that should be good enough? Pretty rare to volte-face mid-year?
  4. Make a matrix of proposers x citation labels (see the sketch after this list)
  5. Graph and profit
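
For step 4, here's a minimal sketch in R, assuming the parsed references have already been boiled down to a two-column data frame of proposer and citation label (the data below are made-up stand-ins):

#toy data: one row per (proposal, cited work) pair - stand-ins for real parsed output
cites <- data.frame(proposer = c("proposalA", "proposalA", "proposalB", "proposalC", "proposalC"),
                    label = c("smith2012", "jones2014", "smith2012", "jones2014", "lee2015"),
                    stringsAsFactors = FALSE)

#incidence matrix: proposers x citation labels (step 4)
cite.matrix <- table(cites$proposer, cites$label)

#proposer x proposer counts of shared references - ready to graph (step 5)
shared <- cite.matrix %*% t(cite.matrix)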

There are several ways to extract the bibliography. I realize now that I probably should have done something with Beautiful Soup, and even if not, some tools actually take in a document and find the bibliography themselves. Anyway, I have them.

Now for parsing, here is a list of tools that may be around and usable in some fashion (this is a helpful listing):

  • ParsCit - this seems really to be the most popular and the most successful
  • FreeCite - a web service from Brown University libraries
  • Using CrossRef
  • ParaCite / ParaTools from Southampton - Perl modules
  • eta: AnyStyle.io - I tried the web interface and it worked pretty well

ParsCit seems to be the most popular, so I thought I would give it a go. The page is not terribly hopeful about running it on Windows. Well... so I did request and receive an Ubuntu VM to play with... hoo-boy, the instructions are fairly off-putting and really sort of condescending ("if you know how to work in Unix, this is trivial").

So now, instead, I'm playing with using the Brown library service and RCurl to see what I can do. Of course, then I have to deal with what I get back. Meh.
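
Here's a minimal sketch of what that call might look like, using httr rather than RCurl; the endpoint and the citation[] form field are my reading of FreeCite's documentation, so treat the details as assumptions (and the citation string is a made-up example):

library("httr")

#hypothetical example citation string
ref <- "Fitchett, D. (2016). Parsing citations for dabblers. Journal of Made-Up Examples, 1(1), 1-10."

#FreeCite accepts a POST of one or more citation[] form fields (an assumption from its docs)
resp <- POST("http://freecite.library.brown.edu/citations/create",
             body = list(`citation[]` = ref),
             encode = "form",
             accept("application/json"))

#tagged fields (authors, title, year, ...) come back in the response body
parsed <- content(resp, as = "parsed")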

If I get it all figured out, I'll be sure to report back.


Edited: Richard Urban reminded me about AnyStyle.io so I added it. I think I will try to get an API key so I can get to it through R using RCurl. Because of time constraints... right now I'm just extracting the network using Excel 🙁


Uneasy or even unhealthy relationships between liaison librarians and developers?

(by Christina Pikas) May 16 2016

Two things I've seen today show breakdowns and "other"-ing between liaison librarians and UX or IT or acquisitions or other parts of the library team. This post is about one of them.

A liaison librarian, bibliographer, collection developer - similar with slight differences - is the pointy end of the spear in academic libraries. They go out and coordinate with the researchers. They may teach classes. They select materials and create resource guides on various subjects. Other librarians work within the library to organize information, negotiate and track licenses, build and maintain systems to provide access to licensed resources, etc.

This article by Tim Broadwater is horrific: "Why am I doing this to our users?" http://libux.co/why-am-i-doing-this-to-our-users/ (via Francoeur on Slack)

The author is a UX person who is part of a development team. They're basically jerked around by a committee of liaison librarians who turn their personal experience and their limited interactions with a few users of the system into requirements for the design. The UX person tried to pause the process to study what was needed, to no avail, and it all went downhill from there. At some point the committee had to approve every single design decision.

First, I have to say that while I think there's often some tension, I haven't seen anything this bad in the places I've been. The developer who was at MPOW certainly had some disagreements with liaison librarians, but there was always a tremendous amount of respect in both directions. For one thing, he did see librarians as some of the primary users of the catalog and other tools. Really, who does use the catalog more than librarians? Shouldn't their needs be important, too? The other thing is that he reached out to us to ask for data. Not that he did everything we suggested, but he listened if we saw something helpful or had feedback, and many hands make light work in testing.

I have a lot of respect for UX researchers... but they are not infallible, either. For example, a new intranet site makes it impossible to learn how to answer a standard question: "is this particular book available online from our collections?" or "do we have a copy of this book at all?" It turns out that the questions they used in testing were not run past any of the librarians. They asked "tell me how you would find an ebook"... and the users were all very successful in locating a list of ebook collections that was only developed to support research guides. It was never intended to be freestanding. This was not the right question. And, it turns out, the freaking catalog was not linked from anywhere in the original production version of the page. On the old page it was front and center.

So it may seem obvious, but it's another case of each side needing to respect the other's expertise. Also, someone should have stepped on the brakes before the relationships were completely trashed.* Developers: liaison librarians can be your best allies if you can work with them successfully. Librarians: developers, UX team members, and software engineers are not peons who must slavishly take all of your suggestions - they are experts, too, and you need to listen!

*author dude surely must be looking for a new job or is he assuming co-workers won't read his article?


Making things besides journal articles available

(by Christina Pikas) Apr 18 2016

Communication is central to science, and the vast majority of it happens outside of peer-reviewed journal articles. Some informal scholarly communication is intended to be ephemeral, but in the past couple of decades more informal communication has been conducted in online text-based media in ways that could be captured, saved, searched, and re-used. Often, it isn't.

Libraries have always had a problem with gray literature. Unpublished dissertations, technical reports, conference papers, government documents, maps, working documents... they are all some level of difficult to find. Some say, "well, if it's good information it will be in the journal literature" or "if it's worth saving, it will be in the journal literature." But we know that details are left out of the methods sections, data are not included, negative results are under-reported, etc. In some fields conference papers are as easy to find as journal articles, whereas in other fields they're impossible (and some of that is due to the level of review and the importance of the communication to that field).

Practically, if you get the idea for something from a blog post, then you need to attribute the blog post. If the blog post goes missing, then your readers are out of luck.

This is all lead-up to a panegyric on the efforts of the John G. Wolbach Library of the Harvard-Smithsonian Center for Astrophysics with ADS, and particularly Megan Potterbusch, Chloe Besombes, and Chris Erdmann, who have been working on a number of initiatives to archive this information and make it available, searchable, and citable.

Here is a quick listing of their projects:

Open Online Astronomy Thesis Collection, https://zenodo.org/communities/about/astrothesis/

Information about it is here: http://www.astrobetter.com/blog/2016/04/11/an-open-online-astronomy-thesis-collection

Even if your dissertation is in an institutional repository and is available from the university, this will make it easier to find. Also, you can link to your datasets and whatnot.

Conference Materials: http://altbibl.io/gazette/open-access-publishing-made-easy-for-conferences/

We have folks who have been very dissatisfied with the existing options for hosting conference proceedings. I know one group that went from AIP, where they had been for decades, to the Astronomical Society of the Pacific, to IOP, and still wasn't happy. They wanted to make the information available but not super expensive. This may be an option for long-term access and preservation.

Informal astronomy communications: https://github.com/arceli/charter

This is more for things like blog posts.

Research software: https://astronomy-software-index.github.io/2015-workshop/


All of this is pulled together by ADS (see also ADS labs), which is a freely available research database for Astro and related subjects (we are more interested in planetary science and solar physics at MPOW). PubMed gets all the love, but this is pretty powerful stuff.

Communications Theories - the continuing saga

(by Christina Pikas) Apr 16 2016

The dissertation was accepted by the grad school and is on its way to the institutional repository and PQ to be made available to all (I will link to it as soon as it's available). Yet I still fight the battle to own theory: if I'll never be a native, at least I can become semi-fluent.

Late in the dissertation I identified this book: Cobley, P., & Schulz, P. J. (Eds.). (2013). Theories and Models of Communication. Berlin: De Gruyter. I browsed it a bit on Google Books and then requested it from another library. I'm just getting the chance to look at it more carefully now. A lot is not new, but it is well organized here.

Chapter 2:

Eadie, W. F., & Goret, R. (2013). Theories and Models of Communication:  Foundations and Heritage. In P. Cobley, & P. J. Schulz (Eds.), Theories and Models of Communication (pp. 17-36). Berlin: De Gruyter.

Communication as a distinct discipline emerged after WWII. Theories and researchers came from psychology, sociology, philosophy, political science... I guess probably engineering and physics, too. Then again, physicists turn up everywhere 🙂

This chapter described 5 broad categories of approaches to communication:

  1. communication as shaper of public opinion - this came from WWII propaganda work. Main dudes: Park, Lippmann, Lazarsfeld, Lasswell
  2. communication as language use - this is like semiotics. Main dudes: Saussure, Peirce
  3. communication as information transmission - this would be where you find the linear models like Shannon & Weaver as well as updates like Schramm and Berlo. From those came Social Learning/Social Cognitive Theory (Bandura), Uses and Gratifications, Uncertainty Reduction Theory (Berger and Calabrese), and eventually Weick, who we all know from the sensemaking stuff.
  4. communication as developer of relationships - Bateson, Watzlawick "interactional view", Expectancy Violations Theory (Burgoon), Relational Dialectics Theory (Baxter)
  5. communication as definer, interpreter, and critic of culture - this is where you get the critical theory (like critical race theory, etc.). Frankfurt School (Marcuse, Adorno, Horkheimer, Benjamin), Structuralism, Gramsci, Habermas

Chapter 3:

Craig, R. T. (2013). Constructing Theories in Communication Research. In P. Cobley, & P. J. Schulz (Eds.), Theories and Models of Communication (pp. 39-57). Berlin: De Gruyter.

"A scientific theory is a logically connected set of abstract statements from which empirically testable hypotheses and explanations can be derived." (p.39)

"Metatheory articulates and critiques assumptions underlying particular theories or kinds of theories" (p. 40)

He uses words in a different way than I think I learned. Like metatheory - his is like meta about theories, but I think other people may use it like overarching big mama theory with baby theories?

Anyhoo. He says there are these metatheoretical assumptions useful for understanding the landscape of communication theories:

  1. about objects that are theorized (ontology)
  2. basis for claims of truth or validity (epistemology)
  3. normative practices for generating, presenting, using theories (praxeology)
  4. values that determine worth of a theory (axiology)

Ontology - what is communication? Basically transmission models or constitutive models.  "symbolic process whereby reality is produced, maintained, repaired, transformed" (Carey, 2009)

His constitutive metamodel of communication theories (these were described better in chapter 2, but reiterated by the author himself in chapter 3):

  1. rhetorical - communication is a practical art
  2. semiotic - intersubjective mediation via signs
  3. phenomenological - experiencing otherness through authentic dialog (or perhaps BS - no it doesn't say that 🙂 )
  4. cybernetic - communications = information processing
  5. sociopsychological - communications = expression, interaction, influence
  6. sociocultural - communications = means to (re)produce social order
  7. critical - discursive reflection on hegemonic ideological forces and their critiques

Theory means something different in physics than it does in sociology. This is due to the objects of study and how and what we can know about them as well as by what values we judge the theory. Two main approaches to constructing theory in comms are: empirical-scientific and critical-interpretive.

Functions of a scientific theory: description, prediction, explanation, and control.

Two kinds of explanation: causal and functional. Communication explanatory principles: hedonistic (pleasure seeking), understanding-driven, consistency-driven, goal-driven, process-driven, or functional (cites Pavitt, 2010).

Criteria to judge quality: empirical support, scope, precision, aesthetic (elegance), heuristic value.

Theory != model != paradigm. A model is a representation; a theory provides an explanation. A paradigm is a standard research framework used in a particular field.

Epistemological assumptions:

  • Realist - underlying causal mechanisms can be known
  • Instrumentalist - scientific concepts need not correspond to real things; they are useful instruments for making predictions
  • Constructivist - phenomena can't be known independently of our theories - the paradigm determines how empirical data will be interpreted

A classical issue is level of analysis - do you go biological or psychological or do you go more sociological? Small groups? Societies?

Also do you build the whole theory at once or add to it as you go along to build it up?

Critical-Interpretive - these come from the humanities: rhetoric, textual criticism, etc. "Purpose has been ideographic (understanding historical particulars) rather than nomothetic (discovering universal laws)" p. 49

Interpretive. Methods (praxeology): conversation analysis, ethnography, rhetorical criticism. These emphasize the heuristic functions of theory: not generalizable causal explanations, but conceptual frames to assist in interpreting new data. It's accepted to use multiple theories to better understand "diverse dimensions of an object" instead of insisting on one right path. Carbaugh and Hastings (1992) give 4 phases of theory construction:

  1. developing a basic orientation to communication
  2. conceptualizing specific kinds of communicative activity
  3. formulating the general way in which communication is patterned within a socioculturally situated community
  4. evaluating the general theory from the vantage point of the situated case (p.51)

Critical. The purpose of critical theory is social change.

Anyway, more to follow as I hopefully continue on in the book.


Notes from Dan Russell Advanced Skills for Investigative Searching

(by Christina Pikas) Apr 15 2016

This class was held 4/15/2016 at the University of Maryland Journalism School, hosted by the Future of Information Alliance. Some information is here. Slides are here. The updated Tip Sheet is here.

I've previously taken his MOOC and enjoyed tips on his blog but things change so quickly it was good to get an update.

Of course I didn't bring my laptop so... these are from handwritten notes.

  • Capitalization doesn't matter, except for OR, where it's crucial. Don't use AND; it doesn't do anything.
  • Diacriticals do matter. e and é are basically interchangeable but a and å are not. (It does offend native speakers of languages that use these....)
  • If you need to search for emoji you'll have to use Baidu. This is relevant when searching for businesses in Japan, for example.
  • filetype: works for any extension. If you're looking for datasets you may use filetype:csv . Regular Google searches don't search Google Docs; you'll need to search those separately.
  • site: it's different if you use nyc.gov, www.nyc.gov, or .nyc.gov. To be most general, use site:.nyc.gov; that . after the : acts like a * if there are subdomains.
  • There is no NOT. Instead use -<term>.  No space between the minus and the term.
  • Synonyms are automatic. Use quotes around a single term to search it verbatim (also turns off spell check for that term). If quotes are around a phrase, it does not do a verbatim search.
  • There are no stop words
  • inurl:   ... this is useful if pages have a certain format like profile pages on Google Plus
  • If you want the advanced search screen, click on the gear to select it. The gear is in the upper right hand corner. That's the only way to get limiting by region (region limiting isn't always by domain), number search, or language search. Some advanced search options can also be gotten by using the dropdown boxes after searching or by using things like inurl: and filetype:
  • related:<url> gets you sites with term overlap (not linking/linked similarity).
  • Google custom search engine  - lets you basically OR a bunch of site: searches to always search across them.

Image Search

  • Tabs across the top of results for topic clusters found
  • Search by image - click on camera and then point to or upload image. Can drag an image in or control click on an image. After search can then add in terms to narrow to domain.
  • Example - find a tool in the basement, take a picture on a white background with it in a normal orientation, then search to find it in catalogs, etc.
  • Crop images to the salient bit.
  • On mobile devices the standard search is actually a Google appliance search - not as powerful. Open Chrome and search from there if you need more.

Other notes

  • Things are changing all the time because of adversarial arrangements with optimization people.
  • link:   was removed this week.
  • Result counts are an estimate. When you narrow, you sometimes get more results because the search starts with only the first tier of resources. The first tier has millions of results in it - the ones that have been assessed as highest quality. If it doesn't find enough in the first tier - like when you narrow a lot - it will bump down to the second tier with billions more results.
  • consider using alerts.
  • to find any of these services - just Google for them
  • Google Trends is interesting. You can narrow by time or region. Also look for suggestions when searching. You can search for an entity or for a search term. Remember, trends are worldwide.
  • Google Correlate - example: Spanish tourism authorities want to know what UK tourists are looking for. Find the search for Spain and tourism, and see what keywords used by UK searchers correlate.
  • Country versions are more than just languages. Consider using a different country version to get a different point of view.
  • Wikipedia country versions are useful for national heroes and also controversial subjects (example: Armenian genocide)
  • define   (apparently no : needed)

I think all librarians should probably take his class. Good stuff.


Notes from International Symposium on Science of Science 2016 (#ICSS2016) - Day 2

(by Christina Pikas) Mar 30 2016

This day's notes were taken on my laptop - I remembered to bring a power strip! But I was also pretty tired, so it's a toss-up.


Luis Amaral, Northwestern

What do we know now?

Stringer et al JASIST 2010 distribution of number of citations

25% of papers overall in WoS (1955-2006) haven't been cited at all, yet for particular journals (e.g., Circulation) there may be no papers that haven't been cited.

Stringer et al. PLoS ONE 2008 – set of papers from a single journal

Discrete log normal distribution – articles published in a journal in a year

Works well for all but large, multidisciplinary journals – Science, Nature, PNAS, but also PRL and JACS

For most journals takes 5-15 years to reach asymptotic state

Moreira et al PLOS ONE 2015 – set of papers from a department. Also discrete log normal.

Also did work on significant movies - citations using IMDB connections section (crowd sourced annotation of remakes, reuse of techniques like framing, references/tributes, etc.)

Brian Uzzi, Northwestern

Age of Information and the Fitness of Scientific Ideas and Inventions

How do we forage for information – given a paper is published every 20 minutes – such that we find information related to tomorrow’s discoveries?

He's going to show WoS, patents, and law, and how the pattern works.

Foraging with respect to time (Evans 2008, Jones & Weinberg 201?)

Empirical strategies of information foraging: some papers reference work tightly packed by year; some have high mean age, high age variance…

Average age of information (mean of the citing paper's PY minus the PYs of the cited articles)

Low mean age, high age variance is most likely to be tomorrow’s hits (top 5% cited in a field)
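
To make that metric concrete (my own illustration, not from the talk), here's how the mean age and age variance of one paper's reference list would be computed in R:

#my illustration, not from the talk: mean age / age variance of one paper's references
paper.year <- 2016
cited.years <- c(2015, 2014, 2014, 2013, 2001, 1986)  #hypothetical reference list

ages <- paper.year - cited.years
mean(ages)  #low mean age = foraging near the research front
var(ages)   #high age variance = also reaching into the past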

Tried this same method in the patent office - inventors don't pick all the citations; the examiner assigns citations. Patents have the same hotspot.


Audience q: immediacy index, other previous work similar…

A: they mostly indicate you want the bleeding edge. Turns out not really: you need to tie it to the past.

Cesar Hidalgo, MIT

Science in its Social Context

Randall Collins: “the production of socially decontextualized knowledge”, “knowledge whose veracity doesn’t depend on who produced it”

But science is produced in a social context

He is not necessarily interested in science for science's sake but rather, how people can do things together better than they can do individually.

What teams make work that is more cited?

Several articles show that larger teams produce work that is more cited, but these papers were disputed. Primary criticism: other explanatory factors, like larger things being more cited, more connected teams, self-promotion/self-citation with more authors, and cumulative advantage – after you get one paper in a high impact journal it's easier to get more in there.

Various characteristics – number of authors, field, JIF, diversity (fields, institution, geographic, age).

Author disambiguation (used Google Scholar – via scraping)

Connectivity – number of previous co-authorship relationships

Collaboration-negative fields vs. collaboration-positive fields

On average, the more connected the team, the more cited the paper. There's an interaction between JIF and connectivity. Weak but consistent evidence that larger and more connected teams get cited more. Effects of team composition are negligible compared to area of publication and JIF.


How do people change the work they do?

Using Scholar 99.8%, 97.6% authors publish in four or more fields… typically closely related fields

Policy makers need to assign money to research fields – what fields are you likely to succeed in?

Typically use citations but can’t always author in fields you can cite (think statistics)

Use career path? Fields that cite each other are not fields authors traverse in their career path.

Q: is the data set from Google Scholar sharable?

A: He's going to ask them; when his paper is out, then he will.

Guevara et al (under review ) arxiv.org/abs/1602.08409

Data panel

Alex Wade, Microsoft Research – motivation: knowledge graph of scholarly content. Knowledge neighborhood within larger knowledge graph usable for Bing (context, and conversations, scaling up the knowledge acquisition process), Cortana, etc. Can we use approaches from this field (in the tail) for the web scale? Microsoft Academic Graph (MAG). MS academic search is mothballed. Now on Bing platform building this graph – institutions, publications, citations, events, venues, fields of study. >100M publications. Now at academic.microsoft.com  - can see graph, institution box. Pushed back into Bing – link to knowledge box, links to venues, MOOCs, etc. Conversational search… Cortana will suggest papers for you, suggest events. Aka.ms/academicgraph

[aside: has always done better at computer science than any other subject. Remains to be seen if they can really extend it to other fields. Tried a couple of geoscientists with ok results.]

James Pringle, Thomson Reuters – more recent work using the entire corpus. Is the Web of Science up to it? 60 M records core collection. Partnered with regional citation databases (Chinese, SciELO, etc). "One person’s data is another person’s metadata." Article metadata for its own use. Also working with figshare and others. Building massive knowledge graph. As a company interested in mesolevel. Cycle of innovation. Datamining, tagging, visualization… drug discovery…connection to altmetrics… How do we put data in the hands of who needs it. What model to use? Which business model?

Mark Hahnel, Figshare

Figshare for institutions – non-traditional research outputs, data, video … How can we *not* mess this up? Everything you upload can be tracked with a DOI. Linked to GitHub. Tracked by Thomson Reuters data cite database. Work with institutions to help them hold data. Funder mandates for keeping data but where’s the best place?

Funders require data sharing but don’t provide infrastructure.

Findable, interoperable, usable, need an api … want to be able to ask on the web: give me all the information on x in csv and get it. Can’t ask the question if data aren’t available.

Need persistent identifiers. Share beta search.

Daniel Calto, Research Intelligence, Elsevier

Data to share – big publisher, Scopus, also Patent data and patent history,

Sample work: comparing cities, looking at brain circulation (vs. brain drain) – Britain has a higher proportion of publications by researchers only there for 2 years  - much higher than Japan, for example

Mash their data with open public information.

Example: mapping gender in Germany. Women were more productive in physics and astronomy than men. Elsevier Research Intelligence web page full global report coming

Panel question: about other data besides journal citations

Hahnel: all sorts of things including altmetrics

Pringle: usage data  - human interactions, click stream data, to see what’s going on in an anonymous way. What’s being downloaded to a reference manager; also acknowledgements

Calto: usage data also important. Downloading an abstract vs. downloading a full text – interpreting still difficult. How are academic papers cited in patents.


Reza Ghanadan, DARPA

Simplifying Complexity in Scientific Discovery (aka Simplex)


Datafication > knowledge representation > discovery tools

Examples: neuroscience, novel materials, anthropology, precision genomics, autonomy

Knowledge representation

Riq Parra – Air Force Office of Scientific Research

(Like the Army RO and ONR.) Their budget is ~$60M, all basic research (6.1).

All Air Force 6.1 money goes to AFOSR.

40 portfolios – 40 program officers (he’s 1 of 40). They don't rotate like NSF. They are career.

Air Space, Outer Space, Cyber Space.

Some autonomy within agency. Not panel based. Can set direction, get two external reviews (they pick reviewers), talk a lot with the community

Telecons > white papers > submissions > review > funding

How to talk about the impact of funding? Mostly anecdotal – narratives, like transitions. Over their 65 years they've funded 78 Nobel Prize winners, on average 17 years prior to selection.

Why he's here – they do not use these methods to show their impact. He would like, in the spirit of transparency, to show why they fund what they fund, what impact it has, and how it helps the Air Force and its missions.

Ryan Zelnio, ONR

Horizon scan to see where ONR Global should look, where to spend attention and money, and to assess the portfolio.

global technology awareness quarterly meetings

20-30 years out forecasting

Bibliometrics is one of a number of things they look at. Have qualitative aspects, too.

Need more in detecting emerging technologies

Dewey Murdick, DHS S&T

All the R&D (or most) for the 22 former agencies. Nearer term than an ARPA - ready within months to a couple of years. R&D budget 450M … but divide it over all the mission areas and you can buy everyone a Snickers.

Decision Support Analytics Mission – for big/important/impactful decisions. Analytics of R&D portfolio.

Establishing robust technical horizon scanning capability. Prototype anticipatory analytics capability.

Brian Pate, DTRA

Awareness and forecasting for C-WMD Technologies

Combat support agency – 24x7 reachback capability. Liaison offices at all US Commands.

6.1-6.3 R&D investments.

Examples: ebola response, destruction of chem weps in Syria, response to Fukushima.

Low probability event with high consequences. No human studies. Work with DoD agencies, DHS, NIH, others.

Move from sensing happening with state actors to anticipatory, predicting, non-state actors.

Deterrence/treaty verification, force protection, global situational awareness, counter wmd

BSVE – biosurveillance architecture, cloud based social self-sustaining, pre-loaded apps

Transitioned to JPEO-CWD – wearable CB exposure monitor

FY17 starting DTRA tech forecasting

Recent DTRA RFI – on identifying emerging technologies.

Audience q: Do you have any money for me?

Panel a: we will use your stuff once someone else pays for it

Ignite talks - random notes

Ethnea - instance-based ethnicity; Genni (JCDL 2013); Author-ity (names disambiguated)

Predict ethnicity, gender, age

MapAffil - affiliation geocoder

Ethnicity-specific gender over time using 10M+ PubMed papers


Larremore: Modeling faculty hiring networks


Bruce Weinberg, Ohio State

Toward a Valuation of Research

IRIS (Michigan) – people based approach to valuing research. People are the vectors by which ideas are transmitted, not disembodied publications

- CIC/AAU/Census

Innovation in an aging society – aging biomedical research workforce

Data architecture

  • bibliometric
  • dissertations
  • web searches
  • patents
  • funding
  • star metrics (other people in labs), equipment, vendors
  • tax records
  • business census

Metrics for transformative work

  •  text analytics
  • citation patterns from WoS

Impact distinct from transformative. Mid-career researchers moving more into transformative work.

Some findings not captured in my notes: how women PhD graduates are doing (same positions, paid slightly more, held back by family otherwise). PhD graduates in industry staying in the same state, making decent money (some non-negligible proportion in companies with median salaries >200k ... median.)

John Ioannidis, Stanford

Defining Meta-research: an evolving discipline

- how to perform, communicate, verify, evaluate, and reward science

- paper in PLOS Biology, JAMA
