ASIST2017: Making A Case for Open Research: Implications for Reproducibility and Transparency

(by Christina Pikas) Oct 31 2017

Came in super late (darn traffic - left home, 30 mi away, 2 hours before getting here)

Caught the end of Erik Mitchell and Edward M. Corrado - they did a survey of JASIST authors and the responses were bleak. Surprisingly, only 25% or so had an IRB? Few shared data or had data management plans. Few shared code. Few really did that much about open access.

John M. Budd: A retraction walks into the bar. Bartender says: what will you have? Retraction says: Nevermind. and doesn't leave the bar.

Retractions - lots. And lots of things that had been cited, the citations were substantive. Marking of retractions is poor. Work to be done and presented next year

Audience discussion:

Q: Is anxiety an issue? some researchers have been attacked for sharing data.

A: Well, in qualitative research it isn't really appropriate to talk about reproducibility

A: We didn't see this anxiety in our work, but maybe a qualitative study would

Q: Question about the result that said lack of consent was a reason not to share. Audience member was member of a project that had to go through specific consent forms to see if data could be used for new protocols.

(fwiw, I did actually reveal names of my participants in my dissertation research but I went back and re-asked consent giving examples of how it would be done)

Q: IRBs or research sites requiring destruction of data

Q: Works at a DOE national lab - and they have strict requirements for DMPs. Isn't that going to be more the norm now with funder requirements?

A: Not evenly held accountable. Different agencies coming online at different points.

A: Someone from DOT - we're just now having funding calls that have this requirement. There are new requirements for PII data and DMP. They are part of the compliance chain at the National Transportation Library. They haven't gotten any data back in for it yet. It will be that you will be ineligible for future funding if you do not provide identifiers (this might be part of one contract broken into blocks). Many if not most large funders of science - Gates, Wellcome, other funders requiring.

Q: if I do a qualitative study of how people like riding buses, would the interview transcripts be deposited and available?

A: Yes - but probably some sort of de-identification, compilation, anonymization, etc. (I added this part).



ASIST2017: Social Media Papers session

(by Christina Pikas) Oct 30 2017

Fei Shu (speaker) & Stefanie Haustein - On the Citation Advantage of Tweeted Papers at the Journal Level

Previous research - twitter exposure leads to an overall increase of citations. Correlation is weak. Low social media impact in countries where Twitter, for example, is limited or blocked.

Research questions - compare the normalized citation rate of articles shared on Twitter with similar papers from the same year. 22% of WoS papers are tweeted? (talking fast!) This causes problems - so look at the journal level, and control for journal, discipline, and country of origin of the author.

Data: Web of Science and Altmetric. Used the DOI to search both. In Altmetric you can see where the tweets originate. They used thresholds to deal with outliers. Used tweets and citations from 2012 to 2015. Since some journals had very few tweeted papers, these would be difficult to compare, so they used journals with at least 10 tweeted and 10 non-tweeted papers. ... They also did a threshold of 100 and 100 - in this case 308 journals, with 36% of papers tweeted.

Tweeted papers receive 68.4% more citations on average than non-tweeted (not corrected). Corrected by journal: 30% citation advantage (significant at p<0.05). By discipline - varies - significant in 9 disciplines - not significant in chem, engr, human, math due to sample size. Source countries (based on author institution) - threshold level. Country with top tweeted - Netherlands. Sweden 91% citation advantage.

Citation advantage 30%, in all disciplines but extent varies.

Most tweets are from 6 months after publication
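To make the journal-level comparison concrete, here is a minimal sketch of how a per-journal citation advantage could be computed with the "at least 10 tweeted and 10 non-tweeted papers" threshold. This is my own illustration, not the authors' code: the function name, the input layout, and the use of a simple ratio of means (rather than whatever normalization they actually used) are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def journal_citation_advantage(papers, min_per_group=10):
    """Percent citation advantage of tweeted over non-tweeted papers, per journal.

    papers: iterable of (journal, tweeted, citations) tuples (hypothetical layout).
    Journals with fewer than min_per_group papers in either group are skipped,
    mirroring the threshold described in the talk.
    """
    groups = defaultdict(lambda: {True: [], False: []})
    for journal, tweeted, citations in papers:
        groups[journal][bool(tweeted)].append(citations)

    advantage = {}
    for journal, group in groups.items():
        if len(group[True]) < min_per_group or len(group[False]) < min_per_group:
            continue  # too few papers on one side to compare
        baseline = mean(group[False])
        if baseline == 0:
            continue  # ratio is undefined when non-tweeted papers have no citations
        advantage[journal] = (mean(group[True]) / baseline - 1) * 100
    return advantage
```

Comparing within each journal first is what keeps a handful of heavily tweeted, heavily cited journals from driving the overall number.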

Chris Hubbles (speaker), David W. McDonald, & Jin Ha Lee - F#%@ That Noise: SoundCloud As (A-)Social Media?

SoundCloud is used to share and communicate about music. Has timestamped commenting and allows social interaction among fans woven into the playback feature. Used to distribute music and podcasts, and even used by some government organizations. "Social Multimedia". Qualitative content analysis on these comments. Used the search API (to identify popular tracks) and then the track API to pull all the comments for these tracks. Whole year of 2013. 100-200 tracks per day uploaded except for a weird spike. They removed spoken word from the sample. Kept tracks of 0-10 minutes with 10-500 comments. Collaboratively coded by the authors. Codebook with 39 codes. 58 songs, 5,608 comments. 69% electronic music and hip-hop. Music was uploaded by artists, labels, promotion companies, fans, etc. Comments were mostly positive. Were full of profanity, caps, emoji, exclamation points. But also about features of the music, and stories of where the music was heard and what it meant. Few of the comments were part of conversation threads. One track had 77 comments with no replies. Uploader replies were almost as common as fan replies.

Similar to what Dana Rotman found with YouTube. The presence of affordances doesn't mean will form community.

The display could be better to support participation.

"A-social party" - expression and not interaction. Broadcasting, graffiti, co-presence, mutually shared experience.

Quan Zhou, Chei Sian Lee (Speaker), & Sei-Ching Joanna Sin - Using Social Media in Formal Learning: Investigating Learning Strategies and Satisfaction

Self-regulated learning (Pintrich, 2000, p453) - "an active constructive process whereby learners set goals ... then monitor regulate... constrained by ...goals... environment". forethought, performance control, self-reflection

survey  - undergrad and grad students, if they used social media for any class, standard scales for learning strategies and satisfaction... n=270

PCA and regression. all 4 learning strategies significant. Goal setting most influential predictor of learning satisfaction. Self-evaluation second (social comparison - is a motivating force, unlike general studies of social media where comparison makes you unhappy). Keep in mind, their students are maybe more highly motivated than some other samples.

Limitations - didn't look at whether use was voluntary or mandatory. one university

q: how did you define social media? big list

q: did you ask for how the social media were used in the class? (no, not really?)


ASIST 2017: Digital Literacy in the Era of Fake News: Key Roles for Information Professionals

(by Christina Pikas) Oct 30 2017

They were having problems with the projector so started with Connaway going through studies they've done related to information literacy. Important to provosts and universities - learning doesn't stop when students graduate. How do we get students to use public libraries and use information in everyday life decision making?

  • How do people who work with the public in libraries get updated on information literacy
  • What do students know about how search engines work
  • How do people assess information on the web and in social media

Heidi Julien - engage in issues and model approaches

  • social media campaign about facts
  • express views publicly and stand up to confront misinformation
  • educate representatives at all levels of government - these issues are important and institutions like library need to be supported
  • advocate for importance of digital literacy.
  • Aldous Huxley "facts do not cease to exist because they are ignored"
  • (other international infographics and things to share)

Seadle - information professionals provide context and nuanced view.


Alex Kasprak - Science Writer at

Some things they're seeing are more like overblown - Yellowstone volcano may erupt sooner... > we're all gonna die!

More things like autism/vaccination.

Deeper expose on retired scientist who is peddling snake oil cure for cancer.

"50 studies say... " - he's never found one of these in which the studies do support the claim

Recent from B saying 400 articles saying climate change a hoax. Kasprak asked author "how long did it take to prepare" and Delingpole said "as little time as possible" (can share this because Delingpole posted that an "impertinent pup" from Snopes was fact-checking him with this comment).

questions: debunking - is it really useful or is it just giving more attention? Snopes won't necessarily solve the problem but can serve as a reference and affect the financial viability of these sites. Real world implications when Snopes debunks.

is it really about believing things that are untrue or is it more taking away debate - yes

is the term "fake news" too charged now to use. yes - probably not a useful term

other terms: lazy journalism, hucksterism, pseudoscience, etc. use more precise term

more blame on producers - but can we increase the cost of being wrong (reposting these stories)

Habermas-ian - public sphere as a place for exchange of rational ideas - but with a Foucault hat on, is our problem trying to maintain this notion of civil society in the face of people who are no longer interested in the ideal? The rational argument against an emotional or financial gain... beat our head against the wall.

Seadle response - like both schools of thought. but people aren't rational. behavioral economics. pure number of hits on a website gets you more money. Incentive structures to bring people back to

Julien - we are beating our heads against the wall, multiple cognitive biases, all operating in our own echo chambers - ideal

My q: influence operations by state actors vs. this

Kasprak - the state actors were taking messages already existing or making new messages modeled on existing, and then amplifying, paying to target these. So combating is actually similar, but we're not winning, and there are higher numbers.




Poster for METRICS2017: Methods for Bibliometric Institutional Profiles for the Practitioner

(by Christina Pikas) Oct 29 2017

The poster:

I don't know if it is actually clear enough to read? This big PDF should work: Pikas Methods for Bibliometric Institutional Profiles for the Practitioner

The submission has a little more about my motivation in the poster: Pikas Institutional Profiles MET17 (pdf)

The scripts are here:

Here's a lovely map I had to cut from the poster for size. Viewers may not appreciate that it is actually very unusual for us to collaborate outside of the US.

Affiliations of co-authors, sized by number of articles.



(by Christina Pikas) Oct 27 2017

Edwin Henneken, Alberto Accomazzi, Sergio Blanco-Cuaresma, August Muench, Lars Holm Nielsen Asclepias – Capturing Software Citations in Astronomy

Asclepias project. Enabling software citation & discovery workflows. To "promote scientific software into an identifiable, citable, and preservable object. " Adding DOI based software citations to ADS. Tracking events.

Collaborative codebase (GitHub) > repository (Zenodo) > software broker (harvests repository events, software citations)

example ( - published in JOSS, 60 regular citations to, but also deposited in Zenodo. Citations to every single version of the software and a total of 100 citations.

Journals need to be able to accept software citations (actual citation to the software and not a related article). Just slapping a doi on it isn't enough.

End to end go from original proposal through all the data, papers, software, etc. and have analytics along the way.

Q: difficult to get people doing the right thing with the repositories? yes - but astro is amenable. long history of linking data

Q2: like bigger world of citing things not papers about things? yes

Eto Masaki - Increasing Source Documents of Rough Co-citation to Expand Co-citation Networks for Scientific Paper Searches

rough co-citation is a generation back from co-citation.

a + b cited together, co-citation... a+c cited together infer relationship with b, this did increase information retrieval retrieved documents that didn't exist in the network.
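For anyone who wants to see the idea in code, here is a rough sketch of plain co-citation counting plus the one-step inference as I wrote it down above (b and c are each co-cited with a, so infer a b-c relationship). The function names are mine, and the inference step is only my reading of the talk; it may well not match the paper's actual definition of rough co-citation.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count how often each pair of documents appears in the same reference list."""
    counts = Counter()
    for refs in reference_lists:
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts


def inferred_pairs(counts):
    """Pairs never co-cited directly that share at least one co-citation partner.

    Assumption: this one-hop inference stands in for the "rough" extension
    described in the notes; it is not taken from the paper.
    """
    partners = {}
    for a, b in counts:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)

    inferred = set()
    for shared in partners.values():
        for pair in combinations(sorted(shared), 2):
            if pair not in counts:
                inferred.add(pair)
    return inferred
```

With reference lists [a, b] and [a, c], the direct network has the pairs a-b and a-c, and the inferred set adds b-c, which is the kind of document pair that would not otherwise exist in the network.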

Pei-Ying Chen (speaker), Erica Hayes, Stefanie Haustein, Vincent Larivière, Cassidy R. Sugimoto -  Politics of platforms: the ideological perspectives of social reference manager users on scholarly communication

Looking at Mendeley and Zotero - hypothesis that Zotero users will be more open to open data, etc., and Mendeley users will be more traditional because they are using a corporate platform.

Mendeley provided a stratified random sample of 26k users, response from about 1200. Zotero was an anonymous link advertised by Zotero at conferences.

In survey they didn't provide a category for librarians so they got a lot of "others"

From both groups: all advocate for open source software, all are adopters of new technologies, most advocate for open access.

Majority of both think peer review system is broken and publishers aren't necessary for scholarly communication.

Some similarities and differences, but no real clear support for their hypothesis, as far as I could tell.

Q: try to look at the contents of the library to see if more oa or paywall journals?


Ehsan Mohammadi, Mike Thelwall, Kristi Holmes - Interpret the meaning of academic tweets: A multi-disciplinary survey

Altmetrics - who uses twitter to communicate scholarly info, does twitter play an important role in communicating scholarly info, why, does it depend on discipline

twitter users who re/tweeted academic publications at least once using 4.5m twitter accounts

looked at personal web page urls 1.7 urls

using web mining, identified email addresses

sent online survey to 57k twitter users, got 2000 responses.

most respondents tweeting scholarly information were from the social sciences and humanities

most agree:

  • change way to read and disseminate sci info
  • twitter facilitates knowledge flows
  • reflects research impact
  • share academic findings with the general public

motivations for using and type of content shared depend on discipline, occupation and employment sector

They have a paper under review in a journal so stand by.


Philippe Mongeon - Is there a Matilda effect in academic patenting?

We know men publish more papers than women and their papers are more cited

Now for patenting. Only about 15% of inventors are women. Patent-paper pairs. Same discovery published in a paper and patent

are women less likely to be inventor than men when we control for: position on the byline, discipline, reputation, contribution

Previous studies: no gender difference (Haeussler & Sauermann, 2013), female more likely excluded from inventorship (Lissoni et al 2013)

all articles with 2 or more authors in wos 1991-2016, uspto patent applications 1986-2015

papers patents -1 to 5 years of app, all inventors on the authors list

text similarity of title and abstract.

discipline - based on discipline of journals cited by the paper

attribution of gender - based on Wikipedia pages (Berube in preparation)

automatic disambiguation of authors

accumulated number of citations at time of app.

contributions - manual extraction, where there were statements coded conception, analysis, performed...

regression models...

turns out place in author list has much more impact than gender, but gender is significant for all but engineering.

When taking contribution into account (many fewer papers), conception role is important  - which makes sense.

Small effect of gender on the attribution of inventorship, gender gap occurring earlier in the research process




(by Christina Pikas) Oct 27 2017

This event was held Friday October 27, 2017

Kate McCain  - Undercounting the gift givers: issues when tallying acknowledgements in life sciences research

Ongoing research effort - she originally worked on this 20 years ago but has come back to it recently. Background - model organisms - useful to organize research around. Community databases, stock centers, community ethos wrt sharing.

Ways to focus research - by journal is often done, but she uses this model organism. She is looking at 1980-2004 during growth phase when there is more sharing because nascent research area. And she is looking at acknowledgements.

Compared to citations - acknowledged most likely to be alive.

Personal ack vs. funding - she's interested in personal ackn. "peer interactive communication"

May be lots of different places: end note, methods section, end of text with no section label, ... No control or standardization of how people are named, what granularity they are thanked for, etc.

WoS mostly gets funding ack, and only secondarily sweeps up some personal ack (if they are in the same block, which is not always the case).

Undercounting big deal: text extraction relying on formal ack section. personal name disambiguation. Sampling or single year studies.

Check her slides to see what she found where. She also categorized types of ack - animals, software, data, editing, etc.

Top 15 individuals listed - first few time periods dominated by University of Oregon - founders and suppliers of fish early on.

She then went through profiles of some individuals with the diversity of how they appeared.

Trends - fewer examples of thanking for research materials - have their own, get from repository, or get from stock center

questions: manually - yes? learn things to help automate - yes, but lots and lots and lots of ways to trip up. Also just picking up surnames is not enough because then get some citations mixed in, named equations/methods, etc.

Reminds me of:

questions: in the lab outside of the lab. also tracking people who are frequently acknowledged and not often co-authors/cited

questions: comment - collaboration - set up something from PMC data (already coded in XML), but only using ack section and not the Materials & Methods (M&M) section.


Isabelle Dorsch - Relative Visibility

How well known. She's comparing personal publication list and information services (like WoS).

Relative visibility (IS) = (d/r)*100
d= in information services, r=publication list
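As a quick worked example of the formula, here is a minimal sketch that treats both lists as sets of publication identifiers (DOIs, say). The function name and the choice to count only items that are on the personal list are my assumptions, not something from the talk.

```python
def relative_visibility(indexed_ids, publication_list_ids):
    """Relative visibility = (d / r) * 100.

    d = items from the personal publication list found in the information service
    r = items on the personal publication list
    """
    listed = set(publication_list_ids)
    if not listed:
        raise ValueError("publication list is empty, so r = 0")
    found = listed & set(indexed_ids)
    return len(found) / len(listed) * 100
```

An author with four publications, two of which the service indexes, gets 50.0. A brand new author with two publications that are both indexed gets 100.0, which is the edge case raised in the questions.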

Previous work - Cronin & Stock, and ISSI board study

Issues - finding the personal list, is it up to date and complete, is it structured to be used at all, what types of documents to keep (novels? newspaper articles?), keep in press?

(*discussion of this on SIGMETRICS really found that a combined edited list is probably best, but these aren't universally available - list maintained by information service but updated by author)

Which information service matters (of course)  -  visibility to one field when author publishes in multiple. Conference paper coverage, book coverage, etc.

questions: new author - only two publications - 100% (they only looked at established authors). Very dependent on the database

Judit Bar-Ilan - CiteScore vs JIF and Other Journal Indicators

Criticisms of JIF but still heavily used. Standard definition. Criticisms like lack of transparency. Things in the numerator not included as "citable items" in the denominator. Also now offer a 5year JIF

Citescore - publication window 3 years. They count all items so no numerator/denominator coverage mismatch. Transparent - can see all the citations that are covered. Freely available. Some criticism that covers too many different document types

EigenFactor, SJR, pagerank type indicators - more weight to more influential sources

Article Influence - normalized - average journal is 1.

She looked at correlations - for those journals that appear in most of the sources.

Quite high - CS-JIF 0.94,

HOWEVER - Lancet is 5 in JIF, 314 in CS - so huge differences and she suspects due to notes, editorials, etc.

Top 20 by CS are almost all review journals (Annual Review of... , Progress in... )

Eigenfactor doesn't include journal self-citation, and doesn't correlate as well with others.

Note also that even though high correlation, there are these big differences.

question: comment - real correlation between size of journal and JIF, Eigenfactor is the only one that corrects for this.


Student papers

Zhao, Mao, & Kun Lu (speaking, not student) - An Exploratory Study on Co-word Network Simulation

Network centrality and other network measures for co-word network. Are they correlated. Are there differences in disciplines in these measures. Looking at generative process of a co-word network.

Q: co-word can mean 3 different things: words that appear in the text, co-descriptor - uses carefully assigned things, keywords plus - is another thing separately (not controlled, but titles of articles cited). Are you simulating second hand natural language assigned things.

Antoine Archambault, Philippe Mongeon (speaking), Vincent Larivière  - The concentration of journal use in Canadian universities

As Canadian universities have to cut big packages due to budgetary issues.

Evaluating - download statistics from the 28 universities (~300 Excel files, 5M lines); references (articles written by authors at these universities citing these journals); perceived importance of journals (which journals do you perceive as important to your research, your teaching) - 23 of 28 universities, 5,500 participants (of which 3k from their university, so actually a disappointing response)

Cleaning important journals - title disambiguation, manual validation, classification by major disciplinary area (AH, SS, BM, NSE) - WoS, NSF, Ulrich's, Google, also verified research journal and not newsletter, etc.

47k unique journals.

Priority journals - 80/20 rule - anything in top of 80% downloads, references, mentions (10% of subscriptions account for 80% of any of these measures)
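Here is a small sketch of how a priority list like this could be derived: rank journals by one measure, keep the top ones until they cover 80% of the total, then take the union across measures. The function names and input layout are hypothetical, and the tie-breaking and exact cutoff handling are my guesses rather than the authors' method.

```python
def priority_journals(usage, share=0.8):
    """Smallest set of top-ranked journals covering `share` of total use.

    usage: dict mapping journal title -> count (downloads, references, or mentions).
    """
    total = sum(usage.values())
    if total == 0:
        return []
    selected, running = [], 0
    ranked = sorted(usage.items(), key=lambda item: item[1], reverse=True)
    for journal, count in ranked:
        if running >= share * total:
            break  # the journals selected so far already cover the target share
        selected.append(journal)
        running += count
    return selected


def priority_in_any(measures, share=0.8):
    """Journals that make the priority list for at least one measure."""
    result = set()
    for usage in measures.values():
        result.update(priority_journals(usage, share))
    return result
```

Journals missing from the union are the ones that are "not in the top anywhere", and keeping the three measures separate matters because, as the correlation results show, downloads alone don't stand in for the other two.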

66% of the 47k journals are not in the top anywhere.

Broke out by publishers - Springer 80% of publications were not in anyone's top. Sage only 22% were not in anyone's

Only 41.6% overlap of core journals between universities

Correlation of cites, downloads, mentions (cites are super lengthy for institutions to do themselves can they just use downloads?) - answer is no. Have to use the 3 measures, not completely correlated.

Q: can you do some sort of demand driven acquisition?

Q: are there libraries of record - keep even if don't use locally

Q: combining visibility presentation earlier with this.

Christine Meschede - Cross-Metric Compatibility of Altmetrics: Consistency of the Metrics from PlumX and

(posting before notes - battery going right now - will try to update)


Brief notes from Maryland SLA's Storytelling with Data

(by Christina Pikas) Oct 20 2017

This one-day meeting/course/workshop/seminar (?) was held at the University of Maryland (go Terps!) on October 12, 2017. As with all events planned by my local SLA chapter, it was very well organized and run. The speakers were all excellent. Amazingly, the parking was close and pre-paid. The food was great, too.

Keith Marzullo - the dean of the iSchool - gave some welcoming remarks. He was so positive and seemed to really get the point of the day.

The opening keynote was by Ya-Ling Lu from the National Institutes of Health library (not NLM but the campus library). I have mostly heard her speak tag-teaming with Chris Belter on bibliometrics techniques but it was wonderful to have the opportunity to hear a long presentation just by her on visualization. She talked about having a low floor - starting at the beginning - and a high ceiling - keep learning and improving.

She talked about learning design and how choices convey emotion and meaning. Her example was from Picture This: How Pictures Work by Molly Bang

  WorldCat link

It was amazing to see how simple rectangles and triangles, their color, size, and location really told the story.

She also provided examples of developing information products. The first was to celebrate the life and career of someone retiring. She needed data and visualizations and a story for people, research, and leadership.

A second example was graphing how she spends her day to try to find more time for the things she wants to do.

Finally, she skipped over an example of how she successfully fought a traffic ticket using data and visualizations.

Oh, and she often uses Excel for her visualizations - even when she can make them in R or Matlab.


Jessie Sigman from University of Maryland spoke next about using cytoscape and gephi to do graphs showing coverage of agricultural topics across research databases.

Vendor updates were provided by the sponsoring companies: Clarivate, Ebsco, and Cambridge University Press. CUP is doing a neat new thing that's sort of like Morgan & Claypool - it's like a monographic series, but the volumes are 40-70 pages. Peer reviewed and series are edited like journals.

David Durden and Joseph Koivisto of University of Maryland spoke next about the different stories that can be told with repository usage data. So it turns out that DSpace has separate data for the content (say PDF) and the metadata, and integrating this mess to get a real, accurate picture of how the system is being used is a bit of a bitch. It's indexed by Solr, but Solr doesn't keep the same index number for the content - it assigns its own. Google Analytics does a lot, but maybe not the right things. RAMP, a project out of Montana State University, helps with Google data but also has shortcomings. Things based on Google do the best they can to filter out bots. HOWEVER, if it's a bot a professor on campus wrote to analyze data, then that's a great use to track. Also Google doesn't capture the full text downloads.


Brynne Norton from NASA Goddard spoke of a cool visualization using interlibrary loan data. Standard statistics are just things like time to get requests filled and % of requests filled. The data are horribly messy, with some citations lacking even an article title. She compiled the article titles using a series of regex searches and searched them through the Web of Science GUI. Yeah, the GUI. Apparently you can OR about 500 articles at a time! (as an aside: yes, there is indeed a WoS API, but you cannot use it for this purpose. You are only allowed to search for yourself. I know.) Then she loaded the results into VOSviewer and did a topic map. It was really cool and she narrated how it showed certain areas they might consider collecting in.
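If you wanted to try the same trick, the title clean-up and batching could look something like the sketch below. This is not her code: the function names, the particular regex clean-up, and wrapping each batch in a TI= title search are all my assumptions, and 500 is just the rough limit mentioned in the talk.

```python
import re


def clean_title(raw):
    """Strip punctuation that could break a query and collapse whitespace."""
    title = re.sub(r"[^\w\s-]", " ", raw)
    return re.sub(r"\s+", " ", title).strip()


def build_or_queries(titles, batch_size=500):
    """Group cleaned titles into OR'ed title searches, batch_size titles per query."""
    cleaned = [t for t in (clean_title(raw) for raw in titles) if t]
    queries = []
    for start in range(0, len(cleaned), batch_size):
        batch = cleaned[start:start + batch_size]
        # Assumption: TI= is the title field tag in the advanced search box
        queries.append("TI=(" + " OR ".join(f'"{t}"' for t in batch) + ")")
    return queries
```

Each string in the returned list would be pasted into the search box as one query, and the result sets combined afterwards.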


Sally Gore did the closing keynote and boy is she awesome. I highly recommend librarians sign up for her webinar when SLA schedules it. She was also super encouraging. She spoke of how she figured out how to do these amazing infographics on her own - she even uses PowerPoint and sometimes draws her own icons. She recommended books by Stephanie Evergreen to learn design.  I have more notes, but they're at work and I'm trying to get this published - so I'll add if I find anything else I wanted to note

The closing remarks were actually terrible. The guy who gave them had not actually attended any of the day or really read the descriptions of the speakers. His comments were like on research data management which is irrelevant to the day's topic. Boo.

But then we drank wine and had some more food so it was ok 🙂


Metaknowledge (python) and Bibliometrix (R) - more or less comprehensive bibliometrics packages for standard data science environments

(by Christina Pikas) Aug 28 2017

I thought for sure I had mentioned Metaknowledge here before but I can't find it so I must have mis-remembered. ...

There are tons of tools for bibliometrics and a lot of people really just code their own for simplicity's sake, even if they eventually visualize their results using an off-the-shelf network analysis tool or other. Sci2, VOSviewer, and CiteSpace are all close to comprehensive, freely available, and pretty easy to use tools. What need is there for another product? If you want to keep the rest of your workflow in one environment (R or Python), or experiment with new algorithms that are not available in the above, then these two packages are good options.

When I was doing the longitudinal clustering for citation trajectories, I inadvertently saved the 6,666 (I know!) records from 1980-2015 in the regular WoS* format instead of a more useful tab delimited.  I quite easily pulled out the pub year, accession number, times cited, and other simple fields using R. ... it's just now when I actually want to follow up with some natural language processing on the titles and abstracts that I realize my kludge won't actually work for either the title or abstract. So I fooled with it a couple of different ways before heading out to see if there was anything new out there for processing these since they were such a hassle to get in the first place. It turns out there is a new, fairly comprehensive R package: Bibliometrix. I had already experimented with Metaknowledge in Python. The extensive instructions (paywall) are very helpful, but I really just wanted to stay in R.
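For what it's worth, the reason a quick line-by-line kludge breaks on titles and abstracts is that in the WoS plain text export a field starts with a two-letter tag, long values wrap onto continuation lines indented with spaces, and each record ends with ER. Here is a minimal Python sketch of a parser that handles the wrapping. It is my own illustration under those layout assumptions, not what either package does internally, and the function names are made up.

```python
def parse_wos_records(lines):
    """Parse WoS tagged plain text into a list of {tag: [values]} dicts."""
    records, current, tag = [], {}, None
    for raw in lines:
        line = raw.lstrip("\ufeff").rstrip("\r\n")
        if not line.strip():
            continue  # blank separator between records
        head, value = line[:2], line[3:].strip()
        if head == "ER":  # end of record
            records.append(current)
            current, tag = {}, None
        elif head in ("FN", "VR", "EF"):  # file header and end-of-file markers
            tag = None
        elif head == "  ":  # continuation of the previous field
            if tag is not None:
                current[tag].append(value)
        else:  # a new field tag such as AU, TI, AB, PY
            tag = head
            current.setdefault(tag, []).append(value)
    return records


def field_text(record, tag):
    """Join a wrapped field (e.g. TI or AB) back into a single string."""
    return " ".join(record.get(tag, []))
```

Multi-valued fields such as AU stay as lists, one entry per line, while field_text(record, "TI") and field_text(record, "AB") give the full title and abstract ready for text processing.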

What follows is a general intro to these two tools and my observations.


Bibliometrix

This package appears to be quite new with recent releases. The first thing I tried - reading in a directory full of WoS export files - was like magic. In a snap, I had a dataframe with everything in the right column.


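# ldply() comes from plyr; getWoSdf is a helper (definition not shown here)
# that reads one WoS export file and returns a data frame
library(plyr)

filenames <- list.files("directory", full.names=TRUE)

WoSall <- ldply(filenames, getWoSdf)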

Seems like BibTeX files are preferred over this format, but it was plenty quick for the 500 records per file I had. A nice feature is that it tells you every hundred records that it's making progress.

A nice thing is that there are pre-built basic summary/descriptive functions. It exports the standard networks but it also does co-word with a pretty neat visualization.

Multiple Correspondence Analysis (MCA) using keywords

This tool doesn't do anything to help you clean the data or pick out any weirdnesses. The visualizations shown aren't super pretty, but it's quite easy to use another R graphing tool with the data.


Metaknowledge

I worked through the journal article but using my own WoS data. For WoS data, everything worked as expected and I was able to quickly get really nice results.  You can also download a Jupyter notebook with their sample data to work through the process. A neat thing you don't see every day is that it will break down by male/female by guessing using a popular algorithm.  It also does Reference Publication Year Spectroscopy (meh) and besides extracting all of the standard networks you might want, it also has ways to extract text for text mining.

Some negatives based on my brief experiments. I couldn't get Scopus data to work for whatever reason. Also, it doesn't really facilitate any sort of cleaning. Data sets that come out of WoS still have issues. The graph examples were not very pretty and some sort of error came out of the way they had you do one graphing tool. It's quite easy to export data or just substitute your favorite graph because there are a million.


* no endorsement intended, for real.



Data point on if we need expensive specialty research databases

(by Christina Pikas) Aug 17 2017

***NOTE: Not endorsing any commercial products***

A search for "quantum computing" in the Abstract, Title, Keywords in Scopus yields 6,415 records

A search for DE (quantum computing) in Inspec yields 20,403.   By DE I mean descriptor - "quantum computing" has been in Inspec's thesaurus since 1999. Use the correct formatting for your platform.

One really quick insight: the US is the top publishing country in Scopus, while China is the top one in Inspec.


Notes from 1.5 days of: Collections as Data: Hack-to-Learn

(by Christina Pikas) Aug 10 2017

You guys - this post has been in draft since May 22, 2017! I'm just posting it...

Collections as Data: Hack-to-Learn was a fabulous workshop put on by the Library of Congress, George Washington University Libraries, and George Mason University Libraries. It was a neat gathering of interesting and talented people, nifty data, and very cool tools.  It didn't hurt either that it was in a beautiful conference room with a view of the Capitol the first day and at the renovated Winston Churchill Center at GWU the second. A lot of it was geared toward metadata librarians and digital humanities librarians, but I felt welcomed. Readers of this blog will know that I really want to bring these tools to more public services/liaison/etc., librarians so it was good.

Unfortunately, I had to leave mid-day on day 2 because of a family emergency 🙁 (everybody is ok) but here are some interesting tidbits to save and share.

Data Sets:

LoC MARC Records

Have you heard that LoC freed a ton of their cataloging data? FREE. Should have always been freely available. Actually this is only up to December 2013 and the remainder are still under paid subscription ... but ... still! People are already doing cool things with it (neat example). We had a part of this that the organizers had kindly already done some editing on.

Phyllis Diller Gag File

This was a sort of poorly formatted csv of several drawers of the file. Hard not to just sit and chuckle instead of analyzing

Eleanor Roosevelt's My Day Columns

Apparently Roosevelt wrote these from the 1930s to her death in 1962. Originally she wrote them 5 days a week but tapered to 3 when her health failed. They are a few paragraphs and more or less dryly list her activities.

End of Term Tumblr Archive (no link)

This was archived as part of the efforts to capture the outgoing administration's stuff before it disappeared. It was a very interesting collection of things from museums to astronauts.


Somewhere in here we covered TEI - I had no idea this existed. How cool. So like when you're doing transcripts of interviews you can, for example, keep the erm, uh, coughs... or ignore depending on the level of analysis?  TEI lets you annotate texts with all sorts of detail and make it linked data for entities, etc.


  • OpenRefine - more detailed use and examples of reconciliation
  • Voyant - very, very cool tool to at least do preliminary analysis of text. NB: installing on my work Windows machine was a bit rough. I ended up getting a Linux VM and it works well/easily. The visualizations are great. There is a limit on the number of texts you can import at a time.
  • MALLET - did you think this one was too hard and required java or some such? Turns out there's a command line one anyone can use. We did topic models for some of the sets. I think I will probably stay with the way I've been doing them in R because seems like they're easier to understand.
  • Gephi - yeah, again, and i still can't get along with it. I have to face that it's just me.
  • Carto - a cool mapping tool

Also, on day 2 someone suggested spaCy instead of NLTK for natural language processing in Python. This is another thing I couldn't get working for anything on my Windows box from work. I don't know if there is something being blocked or what. Installs and works beautifully on the Linux machine, though.



