Archive for the 'bibliometrics' category

Poster for METRICS2017: Methods for Bibliometric Institutional Profiles for the Practitioner

Oct 29 2017 Published by under bibliometrics, Conferences

The poster:

I don't know if it's actually clear enough to read. This big PDF should work: Pikas Methods for Bibliometric Institutional Profiles for the Practitioner

The submission has a little more about my motivation in the poster: Pikas Institutional Profiles MET17 (pdf)

The scripts are here: https://github.com/cpikas/institutionalprofiles

Here's a lovely map I had to cut from the poster for size. Viewers may not appreciate that it is actually very unusual for us to collaborate outside of the US.

Affiliations of co-authors, sized by number of articles.

No responses yet

ASIST2017 METRICS PM

Oct 27 2017 Published by under bibliometrics, Conferences

Edwin Henneken, Alberto Accomazzi, Sergio Blanco-Cuaresma, August Muench, Lars Holm Nielsen - Asclepias: Capturing Software Citations in Astronomy

Asclepias project. Enabling software citation & discovery workflows. The goal is to "promote scientific software into an identifiable, citable, and preservable object." Adding DOI-based software citations to ADS. Tracking events.

Collaborative codebase (GitHub) > repository (Zenodo) > software broker (harvests repository events and software citations).

Example: corner.py (triangle.py) - published in JOSS, with 60 regular citations, but also deposited in Zenodo. Counting citations to every single version of the software gives a total of 100 citations.

Journals need to be able to accept software citations (an actual citation to the software, not to a related article). Just slapping a DOI on it isn't enough.

End to end go from original proposal through all the data, papers, software, etc. and have analytics along the way.

Q: Is it difficult to get people doing the right thing with the repositories? Yes - but astro is amenable; there's a long history of linking data.

Q2: Is this like the bigger world of citing the things themselves, not papers about the things? Yes.

Eto Masaki - Increasing Source Documents of Rough Co-citation to Expand Co-citation Networks for Scientific Paper Searches

Rough co-citation is a generation back from co-citation.

If a and b are cited together, that's co-citation; if a and c are also cited together, infer a relationship between c and b. This did improve information retrieval - it retrieved relevant documents that didn't exist in the plain co-citation network.

Pei-Ying Chen (speaker), Erica Hayes, Stefanie Haustein, Vincent Larivière, Cassidy R. Sugimoto -  Politics of platforms: the ideological perspectives of social reference manager users on scholarly communication

Looking at Mendeley and Zotero - the hypothesis is that Zotero users will be more inclined toward open data, etc., and Mendeley users will be more traditional because they're using a corporate platform.

Mendeley provided a stratified random sample of 26k users, response from about 1200. Zotero was an anonymous link advertised by Zotero at conferences.

In the survey they didn't provide a category for librarians, so they got a lot of "others."

From both groups: all advocate for open source software, all are adopters of new technologies, most advocate for open access.

Majority of both think peer review system is broken and publishers aren't necessary for scholarly communication.

Some similarities and differences, but no real clear support for their hypothesis, as far as I could tell.

Q: Did you try to look at the contents of the libraries to see whether there are more OA or paywalled journals?

 

Eshan Mohammadi, Mike Thelwall, Kristi Holmes - Interpret the meaning of academic tweets: A multi-disciplinary survey

Altmetrics - who uses Twitter to communicate scholarly info? Does Twitter play an important role in communicating scholarly info? Why? Does it depend on discipline?

Started with Twitter users who tweeted or retweeted academic publications at least once, identified via Altmetric.com: 4.5M Twitter accounts.

looked at personal web page urls 1.7 urls

Using web mining, they identified email addresses.

sent online survey to 57k twitter users, got 2000 responses.

most respondents tweeting scholarly information were from the social sciences and humanities

most agree:

  • Twitter changes the way they read and disseminate scientific info
  • Twitter facilitates knowledge flows
  • Twitter reflects research impact
  • Twitter helps them share academic findings with the general public

Motivations for using Twitter and the type of content shared depend on discipline, occupation, and employment sector.

They have a paper under review in a journal so stand by.

 

Philippe Mongeon - Is there a Matilda effect in academic patenting?

We know men publish more papers than women and their papers are more cited

Now for patenting: only about 15% of inventors are women. The study uses patent-paper pairs - the same discovery published in both a paper and a patent.

Are women less likely to be listed as inventors than men when we control for position on the byline, discipline, reputation, and contribution?

Previous studies: no gender difference (Haeussler & Sauermann, 2013); females more likely to be excluded from inventorship (Lissoni et al., 2013).

All articles with 2 or more authors in WoS 1991-2016; USPTO patent applications 1986-2015.

Paper-patent pairs: the paper falls within -1 to 5 years of the patent application, and all inventors appear on the author list.

Pairs matched on text similarity of titles and abstracts.

discipline - based on discipline of journals cited by the paper

attribution of gender - based on Wikipedia pages (Berube in preparation)

automatic disambiguation of authors

accumulated number of citations at time of app.

contributions - manual extraction, where there were statements coded conception, analysis, performed...

regression models...

It turns out that place in the author list has much more impact than gender, but gender is significant for all disciplines except engineering.

When taking contribution into account (many fewer papers), conception role is important  - which makes sense.

There is a small effect of gender on the attribution of inventorship, with the gender gap occurring earlier in the research process.

 

No responses yet

ASIST METRICS2017 Workshop AM

Oct 27 2017 Published by under bibliometrics, Conferences

This event was held Friday October 27, 2017

Kate McCain  - Undercounting the gift givers: issues when tallying acknowledgements in life sciences research

Ongoing research effort - she originally worked on this 20 years ago but has come back to it recently. Background: model organisms are useful to organize research around - community databases, stock centers, and a community ethos with respect to sharing.

Ways to focus research - by journal is often done, but she uses this model organism. She is looking at 1980-2004, during the growth phase when there was more sharing because it was a nascent research area. And she is looking at acknowledgements.

Compared to cited authors, the people acknowledged are more likely to be alive.

Personal acknowledgements vs. funding acknowledgements - she's interested in personal acknowledgements: "peer interactive communication."

Acknowledgements may be in lots of different places: an end note, the methods section, the end of the text with no section label, ... There is no control or standardization of how people are named, what granularity they are thanked for, etc.

WoS mostly gets funding ack, and only secondarily sweeps up some personal ack (if they are in the same block, which is not always the case).

Undercounting is a big deal: text extraction relying on a formal acknowledgements section, personal name disambiguation, and sampling or single-year studies.

Check her slides to see what she found where. She also categorized types of ack - animals, software, data, editing, etc.

Top 15 individuals listed - first few time periods dominated by University of Oregon - founders and suppliers of fish early on.

She then went through profiles of some individuals with the diversity of how they appeared.

Trends - fewer examples of thanking for research materials: labs have their own, get them from a repository, or get them from a stock center.

Questions: Did she do this manually? Yes. Did she learn things that could help automate it? Yes, but there are lots and lots and lots of ways to trip up. Also, just picking up surnames is not enough, because then you get citations mixed in, named equations/methods, etc.

Reminds me of: http://www.npr.org/2017/03/30/521931310/-thanksfortyping-spotlights-unnamed-women-in-literary-acknowledgements

Questions: acknowledgements in the lab vs. outside of the lab; also tracking people who are frequently acknowledged but not often co-authors/cited.

Questions: a comment about collaboration - someone set up something from PMC data (already coded in XML), but it only uses the acknowledgements section and not the Materials & Methods (M&M) section.

 

Isabelle Dorsch - Relative Visibility

How well known is an author? She's comparing personal publication lists against information services (like WoS).

Relative visibility (in an information service) = (d / r) * 100, where
d = the number of the author's documents found in the information service, and r = the number of documents on the personal publication list.
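
As a quick sanity check, the calculation is trivial; here's a sketch in R with made-up numbers:

# relative visibility of an author in a given information service
# d = number of the author's works found in the information service
# r = number of works on the author's own publication list
relative_visibility <- function(d, r) {
  (d / r) * 100
}

relative_visibility(45, 60)  # e.g. 45 of 60 listed works indexed -> 75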

Previous work - Cronin & Stock, and ISSI board study

Issues - finding the personal list, is it up to date and complete, is it structured to be used at all, what types of documents to keep (novels? newspaper articles?), keep in press?

(*discussion of this on SIGMETRICS really found that a combined edited list is probably best, but these aren't universally available - list maintained by information service but updated by author)

Which information service you use matters (of course) - visibility within one field when an author publishes in multiple fields, conference paper coverage, book coverage, etc.

Questions: what about a new author with only two publications - would that be 100%? (They only looked at established authors.) The measure is very dependent on the database.

Judit Bar-Ilan - CiteScore vs JIF and Other Journal Indicators

Criticisms of the JIF abound but it is still heavily used. Standard definition. Criticisms include lack of transparency and things counted in the numerator that are not included as "citable items" in the denominator. They also now offer a 5-year JIF.

CiteScore - publication window of 3 years. They count all items, so there is no numerator/denominator coverage mismatch. Transparent - you can see all the citations that are counted. Freely available. Some criticism that it covers too many different document types.

Eigenfactor, SJR: PageRank-type indicators - more weight is given to citations from more influential sources.

Article Influence - normalized - average journal is 1.

She looked at correlations between the indicators for journals that appear in most of the sources.

Quite high - CiteScore vs. JIF is 0.94.

HOWEVER - The Lancet is ranked 5th by JIF and 314th by CiteScore, so there are huge differences for individual journals, which she suspects are due to notes, editorials, etc.

Top 20 by CS are almost all review journals (Annual Review of... , Progress in... )

Eigenfactor doesn't include journal self-citation, and doesn't correlate as well with others.

Note also that even though the correlations are high, there are these big differences for individual journals.

Question/comment: there is a real correlation between the size of a journal and its JIF; Eigenfactor is the only one that corrects for this.

--

Student papers

Zhao, Mao, & Kun Lu (speaking, not student) - An Exploratory Study on Co-word Network Simulation

Network centrality and other network measures for co-word networks: are they correlated? Are there differences between disciplines in these measures? Looking at the generative process of a co-word network.

Q: co-word can mean 3 different things: words that appear in the text; co-descriptors, which use carefully assigned terms; and KeyWords Plus, which is a separate thing (not controlled, but drawn from the titles of cited articles). Are you simulating second-hand, natural-language assigned terms?

Antoine Archambault, Philippe Mongeon (speaking), Vincent Larivière  - The concentration of journal use in Canadian universities

Context: Canadian universities have to cut big packages due to budgetary issues.

Evaluating three things: download statistics from the 28 universities (~300 Excel files, 5M lines); references (articles written by authors at these universities citing these journals); and perceived importance of journals (which journals do you perceive as important to your research and your teaching?), surveyed at 23 of the 28 universities with 5,500 participants (of which 3k were from their own university, so actually a disappointing response).

Cleaning the list of important journals: title disambiguation, manual validation, and classification by major disciplinary area (AH, SS, BM, NSE) using WoS, NSF, Ulrich's, and Google; they also verified that each title was a research journal and not a newsletter, etc.

47k unique journals.

Priority journals - an 80/20 rule: anything in the set accounting for the top 80% of downloads, references, or mentions (roughly 10% of subscriptions account for 80% of any of these measures).
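
Not their code, but the cutoff is easy to sketch in R. Assuming a dataframe called usage with hypothetical columns journal and downloads, the priority set would look something like:

library(dplyr)

# journals that together account for the top 80% of downloads;
# the same logic would be repeated for references and survey mentions
priority <- usage %>%
  arrange(desc(downloads)) %>%
  mutate(cum_share = cumsum(downloads) / sum(downloads)) %>%
  filter(cum_share <= 0.80)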

66% of the 47k journals are not in the top anywhere.

Broken out by publisher: for Springer, 80% of publications were not in anyone's top set; for Sage, only 22% were not in anyone's.

Only 41.6% overlap of core journals between universities

Correlation of cites, downloads, and mentions (cites are super time-consuming for institutions to compile themselves - can they just use downloads?): the answer is no. You have to use all 3 measures; they are not completely correlated.

Q: Could you use some sort of demand-driven acquisition?

Q: Are there libraries of record that keep journals even if they aren't used locally?

Q: What about combining the visibility presentation from earlier with this?

Christine Meschede - Cross-Metric Compatibility of Altmetrics: Consistency of the Metrics from PlumX and Altmetric.com

(posting before notes - battery going right now - will try to update)

No responses yet

Metaknowledge (python) and Bibliometrix (R) - more or less comprehensive bibliometrics packages for standard data science environments

Aug 28 2017 Published by under bibliometrics, Uncategorized

I thought for sure I had mentioned Metaknowledge here before, but I can't find it, so I must have misremembered. ...

There are tons of tools for bibliometrics, and a lot of people really just code their own for simplicity's sake, even if they eventually visualize their results using an off-the-shelf network analysis tool. Sci2, VOSviewer, and CiteSpace are all close-to-comprehensive, freely available, and pretty easy-to-use tools. What need is there for another product? If you want to stay within the rest of your data science workflow or experiment with new algorithms that are not available in the above, then these two packages are good options.

When I was doing the longitudinal clustering for citation trajectories, I inadvertently saved the 6,666 (I know!) records from 1980-2015 in the regular WoS* format instead of the more useful tab-delimited one. I quite easily pulled out the pub year, accession number, times cited, and other simple fields using R... it's just that now, when I actually want to follow up with some natural language processing on the titles and abstracts, I realize my kludge won't actually work for either the title or the abstract. So I fooled with it a couple of different ways before heading out to see if there was anything new for processing these records, since they were such a hassle to get in the first place. It turns out there is a new, fairly comprehensive R package: Bibliometrix. I had already experimented with Metaknowledge in Python. The extensive instructions (paywall) are very helpful, but I really just wanted to stay in R.
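
For what it's worth, the kludge was along these lines (a reconstruction, with a hypothetical file name). WoS plain-text exports tag each field with a two-letter code at the start of the line, which is why the single-line fields were easy:

recs <- readLines("savedrecs.txt")   # hypothetical file name

# single-line fields: PY = pub year, UT = accession number, TC = times cited
pub_year    <- sub("^PY ", "", grep("^PY ", recs, value = TRUE))
accession   <- sub("^UT ", "", grep("^UT ", recs, value = TRUE))
times_cited <- sub("^TC ", "", grep("^TC ", recs, value = TRUE))

# titles (TI) and abstracts (AB) wrap onto indented continuation lines,
# so this line-by-line approach truncates them - hence the move to a real parser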

What follows is a general intro to these two tools and my observations.

Bibliometrix

http://www.bibliometrix.org/

This package appears to be quite new, with recent releases. The first thing I tried - reading in a directory full of WoS export files - was like magic. In a snap, I had a dataframe with everything in the right column.

Literally:

library(bibliometrix)  # isi2df() parses WoS plain-text records
library(plyr)          # ldply() binds the per-file dataframes together

filenames <- list.files("directory", full.names = TRUE)

# read one WoS export file and convert it to a dataframe
getWoSdf <- function(filename){
  holdrecs <- readLines(filename)
  recsdf <- isi2df(holdrecs)
  return(recsdf)
}

WoSall <- ldply(filenames, getWoSdf)

Seems like BibTeX files are preferred over this format, but it was plenty quick for the 500 records per file I had. A nice feature is that it tells you every hundred records that it's making progress.

A nice thing is that there are pre-built basic summary/descriptive functions. It exports the standard networks but it also does co-word with a pretty neat visualization.

Multiple Correspondence Analysis (MCA) using keywords
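
For reference, the built-in calls are roughly the following - a sketch based on the bibliometrix documentation at the time, so double-check the argument names against the current version. WoSall is the dataframe built above.

library(bibliometrix)

# descriptive statistics: top authors, sources, countries, etc.
results <- biblioAnalysis(WoSall, sep = ";")
summary(results, k = 10)
plot(results, k = 10)

# keyword co-occurrence (co-word) network
NetMatrix <- biblioNetwork(WoSall, analysis = "co-occurrences",
                           network = "keywords", sep = ";")

# the MCA co-word map shown above
conceptualStructure(WoSall, field = "ID", method = "MCA")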

This tool doesn't do anything to help you clean the data or pick out any weirdnesses. The visualizations shown aren't super pretty, but it's quite easy to use another R graphing tool with the data.

MetaKnowledge

http://networkslab.org/metaknowledge/

I worked through the journal article, but using my own WoS data. For WoS data, everything worked as expected and I was able to quickly get really nice results. You can also download a Jupyter notebook with their sample data to work through the process. A neat thing you don't see every day is that it will break authors down by male/female, guessing gender using a popular algorithm. It also does Reference Publication Year Spectroscopy (meh), and besides extracting all of the standard networks you might want, it also has ways to extract text for text mining.

Some negatives, based on my brief experiments: I couldn't get Scopus data to work, for whatever reason. Also, it doesn't really facilitate any sort of cleaning, and data sets that come out of WoS still have issues. The graph examples were not very pretty, and I got some sort of error from the way they had you call one of the graphing tools. Still, it's quite easy to export the data or just substitute your favorite graphing package, because there are a million of them.

 

* no endorsement intended, for real.

 

2 responses so far

Animating a graph

Apr 21 2017 Published by under bibliometrics, information analysis

What I really want is an interactive network graph that my viewer can click on, slide things, flip switches, etc. I'll get there someday. In the meantime, this was a fairly easy way to show the evolution of a network over time.

The network is a company network generated in Quid*. I used the time filter and exported a PNG every 3-5 years for this network. The point being that there has been a lot of growth, and in particular the emergence of clusters of companies.

Here it is:

Probably need to click to enlarge to see the animation

ImageMagick is an awesome open source image tool. There have been other ways to get to it from R or python, but the magick package is the easiest I've seen. The vignettes and a blog post from rOpenSci were helpful.

 

library("magick", lib.loc="~/R/win-library/3.3")
setwd("I:/Christina's/person")
newcomp_files<-list.files("folder", pattern="*.png", full.names=TRUE)
newcomp<-lapply(newcomp_files, image_read)
#need image_join because the above produces a list instead of a magick vector 
#dispose deletes previous image
newcomp_animate<-image_animate(image_join(newcomp), fps = 0.5, dispose = "previous")
image_write(newcomp_animate, 'newcomp_animate_82-.gif')

 
Note: the final version you see also used magick to crop (not great) and to annotate with the year. I slowed the frame rate down quite a bit.
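
The cropping and year annotation were done with magick calls roughly like these - a sketch only, and the geometry value and year labels here are made up (one label per frame):

library(magick)

# crop each frame, then stamp the (made-up) year in a corner
newcomp <- lapply(newcomp_files, image_read)
newcomp <- lapply(newcomp, image_crop, geometry = "800x600+0+0")

years <- c("1982", "1987", "1992")   # hypothetical labels, one per frame
newcomp <- mapply(image_annotate, newcomp, years,
                  MoreArgs = list(size = 30, color = "black",
                                  gravity = "southwest"),
                  SIMPLIFY = FALSE)

newcomp_animate <- image_animate(image_join(newcomp), fps = 0.5,
                                 dispose = "previous")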
 

I tried morphing and it was pretty slow and really not worth it so much.

*not affiliated and not endorsing

No responses yet

Who are my researchers citing? A quick and dirty way to find out

Apr 21 2017 Published by under bibliometrics, Collection Development

This probably shouldn't warrant a post, but a few very experienced and up-to-date librarians didn't immediately know so probably worth sharing. Also, it turns out that Scopus* has hidden or removed a lot of the useful stuff and made it confusing to even know where to enter the search.**

In Scopus, search for your researchers. In my case, an affiliation search works.***

 

The affiliation search is a tab from the home page. There's no way to know it's a tab. It's just an underlined word. You actually then click around on the page until you find some place to type your query.

From the results list - and mind you, I do this exact search regularly, so yours might not be so precise - go ahead and click on all the documents. If you're at NIH, then crap, because you have more than 10,000 journal articles per year, so you have to do some major slicing and dicing. I just limited to 2016 and journal articles, just because.

Then you look for the "..." menu, but you realize it's grayed out and you can't actually click on it.

So you click to select all, and then you click on "..." and you see "View references."

From here, you can list the top sources and, theoretically, analyze them. They're not completely clean, though. My set had JGR abbreviated as well as spelled out, plus the specific sections; likewise with ApJ. So how quick and how dirty is OK? For collection development, you're probably fine just reading the list off. Otherwise you could export and then use OpenRefine or similar to clean it up.

* Not affiliated, not endorsing!

** plus - this thing in advanced search in which it is forever putting in codes I do not want ... anyway...

***hey, all the branding for my larger institution is gone? aw come on.

No responses yet

Nebulous Connections Presentation: Using bibliometrics to keep up with the Joneses

Apr 04 2017 Published by under bibliometrics

I presented this today, April 4, 2017, at the SLA Maryland event held at NASA Goddard's Library.

No responses yet

Reflection on librarians and bibliometrics

Mar 25 2017 Published by under bibliometrics

I only attended a few of the sessions related to bibliometrics at ACRL2017, but I talked with a bunch of new people and I saw what was presented. This confirmed my view that:

Librarians are sophisticated and nuanced producers and consumers of bibliometrics

Last fall I sat in a presentation by Ludo Waltman (slides pdf) in which he talked of librarians as "citizen" bibliometricians who are, essentially, only able to uncritically use the JIF and h-index because those are all we understand, and only able to look up metrics when carefully handed to us by vendors.

I was irate, because I do not see myself in that category at all. Nor do I see my colleagues at CMU, NYU, UIC, Northwestern, ARL, NIH, Cornell, and CWRU that way. But maybe I live in a bubble?

No. It was clear from ACRL that there are librarians everywhere who follow these things, care deeply about them, think critically about them, and who are sophisticated in creating and using metrics.

So I'm blowing a raspberry at your "citizen" bibliometrics. We are not the citizenry; we are the insurgency. Ok, that's too dramatic. Let's say we're the army of the less-well-funded ally that nevertheless has lots of troops in the battle.

2 responses so far

ACRL2017: When Tradition and Reality Collide: Metrics, Impact and Beyond

Mar 25 2017 Published by under bibliometrics, Conferences

Friday morning Abigail Goben, Meg Smith, and I presented at the Association of College and Research Libraries conference in Baltimore. I am not an academic librarian but I do serve researchers. I would say that SLA is probably more appropriate for librarians serving researchers in government, industry, and other settings. This was local, though!

The polls yielded some interesting feedback.

  • Our audience members were overall fairly experienced in metrics, with some experts. They knew most of the terms we threw out
  • Many of their libraries have informal support for metrics with a few libraries having formal support
  • Librarians sometimes have an uneasy role with metrics:
    • Frustrated with inappropriate use or use by uninformed people
    • Difficulty working with researchers and with administration: who should pay for the tools? who should do the work?
    • Librarian as neutral vs. metrics for competition
  • Many organizations do have RIM thingies, but they are mostly at the office of research or provost's office. There is a need for more help in how librarians can work with these offices.

No responses yet

Getting article metadata from MS Academic: some R code

As promised, I went back and did this myself instead of relying on a partner in crime (earlier referred to as an SME, but he outed himself). It's funny because I had his code, but he did things differently than I do, so I needed to do it myself.

On the first mostly successful run, I ended up with about 44% of the rows missing the metadata. I discovered fairly quickly that tm's removePunctuation was, of course (in retrospect), closing up intraword dashes instead of leaving a space. You can have it preserve those dashes, but you can't have it leave a space instead. I first did some finding and replacing in Excel, but that only got me down to 32%. Then I was like, duh, just do the gsub for [[:punct:]] and see if that's better. I hope I haven't used up my quota!
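
Here's the difference on a made-up hyphenated title fragment (a quick illustration, not from my actual data):

library(tm)

x <- "Data-driven decision-making in real-time systems"

removePunctuation(x)
# "Datadriven decisionmaking in realtime systems"  - intraword dashes close up

removePunctuation(x, preserve_intra_word_dashes = TRUE)
# "Data-driven decision-making in real-time systems"  - keeps dashes, but no option to substitute a space

gsub("[[:punct:]]", " ", x)
# "Data driven decision making in real time systems"  - every punctuation character becomes a space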

Here's the code. Sign up for your key here. Also note: not affiliated, not endorsing.

#microsoft academic to try to find affiliations for article titles

library("httr", lib.loc="~/R/win-library/3.3")
library("tm", lib.loc="~/R/win-library/3.3")
library("jsonlite", lib.loc="~/R/win-library/3.3")

setwd("~/DataScienceResearchInitiative")

#don't forget the following or you will regret it
options(stringsAsFactors = FALSE)

# api info https://dev.projectoxford.ai/docs/services/56332331778daf02acc0a50b/operations/565d753be597ed16ac3ffc03

# https://api.projectoxford.ai/academic/v1.0/evaluate[?expr][&model][&count][&offset][&orderby][&attributes]

#key:
msakey1<-"put yours here"

apiurl<-"https://api.projectoxford.ai/academic/v1.0/evaluate?expr="
searchexpr<-"Ti='example'"
apiattrib<-"Ti,Y,AA.AuN,AA.AfN,C.CN,J.JN,E"

#test on one to see how it works
testcite <- GET(apiurl, 
         query = list(expr = searchexpr,count = 1, attributes = apiattrib), add_headers("Ocp-Apim-Subscription-Key"= msakey1))

#get the json out into usable format
#could look for errors first
testcite$status_code

#comes out raw so need to make into text
testciteContent <- rawToChar(testcite$content)

test<-fromJSON(testciteContent)
test$entities$AA
test$entities$AA[[1]]$AuN
#this will get a ; separated vector
paste(test$entities$AA[[1]]$AuN, collapse = ';')

test$entities$AA[[1]]$AfN
test$entities$J$JN
test$entities$Y
test$entities$Ti

# initiate a dataframe
# for each title, go out and search using that title
# could add in a warn_for_status(r)  when status is not 200 (happy)
# if status !200 go to the next one,  if status =200
# extract ti, y, authors (paste), affil (paste), jn, cn, and out of entities VFN, V, FP LP DOI D
# write them to the data frame
#1904 is the length of my article title list

CitesOut<- data.frame(ti = rep(NA,1904),
                      y = integer(1904),
                      au = rep(NA,1904),
                      af = rep(NA,1904),
                      jn = rep(NA,1904),
                      cn = rep(NA,1904),
                      vfn = rep(NA,1904),
                      v = rep(NA,1904),
                      fp = rep(NA,1904),
                      lp = rep(NA,1904),
                      doi = rep(NA,1904),
                      abs = rep(NA,1904),
                      stringsAsFactors = FALSE)
  
getMScites<- function(citeNo){
  apiurl<-"https://api.projectoxford.ai/academic/v1.0/evaluate?expr="
  searchexpr<- paste0("Ti='",TitlesToFindf[citeNo],"'")
  apiattrib<-"Ti,Y,AA.AuN,AA.AfN,C.CN,J.JN,E"
  holding<-GET(apiurl,
               query = list(expr = searchexpr,count = 1, attributes = apiattrib), 
               add_headers("Ocp-Apim-Subscription-Key"= msakey1))
  print(paste("cite number", citeNo,"status is:", holding$status_code))
  print(holding$headers$`content-length`)
  holdingContent <- rawToChar(holding$content)
  holdC<-fromJSON(holdingContent)
  cciterow<-data.frame(
    ti=ifelse(is.null(holdC$entities$Ti),NA,holdC$entities$Ti),
    y=ifelse(is.null(holdC$entities$Y),NA,as.integer(holdC$entities$Y)), 
    au=ifelse(is.null(holdC$entities$AA[[1]]$AuN),NA,paste(holdC$entities$AA[[1]]$AuN, collapse = ';')),
    af=ifelse(is.null(holdC$entities$AA[[1]]$AfN),NA,paste(holdC$entities$AA[[1]]$AfN, collapse = ';')),
    jn=ifelse(is.null(holdC$entities$J$JN),NA,holdC$entities$J$JN),
    cn=ifelse(is.null(holdC$entities$C$CN),NA,holdC$entities$C$CN))
  print(cciterow)
  if(is.null(holdC$entities$E)){
    eciterow<-data.frame(
      vfn=NA,
      v=NA,
      fp=NA,
      lp=NA,
      doi=NA,
      abs=NA)
  } else {
    holdE<-fromJSON(holdC$entities$E)
    eciterow<-data.frame(
      vfn=ifelse(is.null(holdE$VFN),NA,holdE$VFN),
      v=ifelse(is.null(holdE$V),NA,holdE$V),
      fp=ifelse(is.null(holdE$FP),NA,holdE$FP),
      lp=ifelse(is.null(holdE$LP),NA,holdE$LP),
      doi=ifelse(is.null(holdE$DOI),NA,holdE$DOI),
      abs=ifelse(is.null(holdE$D),NA,holdE$D)
    )
  }
  print(eciterow)
  citerow<-cbind(cciterow,eciterow, stringsAsFactors=FALSE)
  print("this is citerow")
  print(citerow)
  return(citerow)
} 

#troubleshooting
apiurl<-"https://api.projectoxford.ai/academic/v1.0/evaluate?expr="
searchexpr<- paste0("Ti='",TitlesToFindf[4],"'")
apiattrib<-"Ti,Y,AA.AuN,AA.AfN,C.CN,J.JN,E"
troubleshoot<-GET(apiurl,
               query = list(expr = searchexpr,count = 1, attributes = apiattrib), 
               add_headers("Ocp-Apim-Subscription-Key"= msakey1))

troubleshoot$status_code
troubleshoot$headers$`content-length`

troubleshootcontent<-rawToChar(troubleshoot$content)  
troubleC<-fromJSON(troubleshootcontent)
troubleE<-fromJSON(troubleC$entities$E)

#prepare title list
## IMPORTANT - all the titles have to be lower case and there can't be any punctuation
TitlesToFind <- read.delim("~/DataScienceResearchInitiative/TitlesToFind.csv", header=FALSE)

TitlesToFindl<-apply(TitlesToFind,1,tolower)

TitlesToFindf<- gsub("[[:punct:]]"," ",TitlesToFindl)

head(TitlesToFindf)

#use the sys.sleep so you don't get an error for too many requests too quickly
for (i in 21:1904){
  temp<-getMScites(i)
  CitesOut[i,]<-temp
  Sys.sleep(2)}
write.csv(CitesOut,"MSdsCites.csv")

length(which(is.na(CitesOut$ti)))
length(which(is.na(CitesOut$abs)))

missCites<-which(is.na(CitesOut$ti))

for (i in 1:length(missCites)) {
  temp<-getMScites(missCites[i])
  CitesOut[missCites[i],]<-temp
  Sys.sleep(2)
}

Edited to fix formatting; also, the missing cites were writing to the wrong rows... sigh.

No responses yet

Older posts »