Notes from 1.5 days of: Collections as Data: Hack-to-Learn

Aug 10, 2017. Filed under: Conferences, information analysis

You guys - this post has been in draft since May 22, 2017! I'm just posting it...

Collections as Data: Hack-to-Learn was a fabulous workshop put on by the Library of Congress, George Washington University Libraries, and George Mason University Libraries. It was a neat gathering of interesting and talented people, nifty data, and very cool tools. It didn't hurt either that it was in a beautiful conference room with a view of the Capitol the first day and at the renovated Winston Churchill Center at GWU the second. A lot of it was geared toward metadata librarians and digital humanities librarians, but I felt welcomed. Readers of this blog will know that I really want to bring these tools to more public services/liaison/etc. librarians, so that was good to see.

Unfortunately, I had to leave mid-day on day 2 because of a family emergency 🙁 (everybody is ok) but here are some interesting tidbits to save and share.

Data Sets:

LoC MARC Records

Have you heard that LoC freed a ton of their cataloging data? FREE. It should have always been freely available. Actually, the release only covers records through December 2013, and the rest are still behind a paid subscription ... but ... still! People are already doing cool things with it (neat example). We worked with a subset that the organizers had kindly already done some editing on.
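If you want to poke at those files yourself, pymarc makes it pretty painless in Python. A minimal sketch, assuming you've downloaded one of the binary MARC files (the filename here is my placeholder):

```python
# Minimal sketch: reading LoC's binary MARC files with pymarc.
# "records.mrc" is a placeholder for one of the downloaded files.
from pymarc import MARCReader

with open("records.mrc", "rb") as fh:
    for record in MARCReader(fh):
        # Field 245 is the title statement; subfield 'a' is the title proper.
        for field in record.get_fields("245"):
            print(field.get_subfields("a"))
```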

Phyllis Diller Gag File

This was a somewhat poorly formatted CSV covering several drawers of the file. It was hard not to just sit and chuckle instead of analyzing.
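With a file like this, loading it is half the battle. A rough sketch of the kind of cleanup pass I mean, using pandas - the filename, encoding, and column handling are guesses, not the real gag-file schema:

```python
# Rough sketch: loading and tidying a messy CSV with pandas.
# The file name and encoding are assumptions, not the actual export details.
import pandas as pd

gags = pd.read_csv("gag_file.csv", encoding="latin-1", skipinitialspace=True)

# Normalize ragged column names and trim stray whitespace in text columns.
gags.columns = gags.columns.str.strip().str.lower().str.replace(" ", "_")
for col in gags.select_dtypes(include="object"):
    gags[col] = gags[col].str.strip()

# Drop rows that are entirely empty (blank lines in the export).
gags = gags.dropna(how="all")
print(gags.head())
```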

Eleanor Roosevelt's My Day Columns

Apparently Roosevelt wrote these from 1936 until her death in 1962. Originally she wrote them six days a week but tapered to three when her health failed. They are a few paragraphs each and more or less dryly list her activities.

End of Term Tumblr Archive (no link)

This was archived as part of the efforts to capture the outgoing administration's stuff before it disappeared. It was a very interesting collection of things from museums to astronauts.


Somewhere in here we covered TEI (the Text Encoding Initiative) - I had no idea this existed. How cool. When you're doing transcripts of interviews, for example, you can keep the erms, uhs, and coughs... or ignore them, depending on the level of analysis. TEI lets you annotate texts with all sorts of detail and turn entities into linked data, etc.
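To make that concrete, here's a tiny made-up fragment using real TEI elements for spoken texts - `<u>` for an utterance, `<vocal>` for non-lexical sounds like coughs - plus standard-library Python to pull out either reading:

```python
# Tiny illustration of TEI-style transcript markup. The <u> and <vocal>
# elements are real TEI; the fragment itself is my own invented example.
import xml.etree.ElementTree as ET

tei_fragment = """
<u who="#interviewee">Well, <vocal><desc>cough</desc></vocal> erm, we
started the project <vocal><desc>laughs</desc></vocal> back in 2015.</u>
"""

utterance = ET.fromstring(tei_fragment)

# Close reading: keep everything, including the coughs and laughs.
print(" ".join(" ".join(utterance.itertext()).split()))

# Distant reading: skip the <vocal> annotations but keep the words after them.
parts = [utterance.text or ""]
for child in utterance:
    if child.tag != "vocal":
        parts.append("".join(child.itertext()))
    parts.append(child.tail or "")
print(" ".join("".join(parts).split()))
```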

Tools:

  • OpenRefine - more detailed use, with examples of reconciliation (matching messy values against external authorities)
  • Voyant - very, very cool tool for at least doing preliminary analysis of text. NB: installing it on my work Windows machine was a bit rough. I ended up getting a Linux VM, and it works well/easily there. The visualizations are great. One limitation: the number of texts you can import at a time.
  • MALLET - did you think this one was too hard and required writing Java or some such? Turns out there's a command-line version anyone can use (Java has to be installed, but you don't write any). We did topic models for some of the sets; there's a sketch of the basic commands after this list. I think I will probably stay with the way I've been doing them in R because those seem easier to understand.
  • Gephi - yeah, again, and I still can't get along with it. I have to accept that it's just me.
  • Carto - a cool mapping tool
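
The MALLET sketch I promised above: this is roughly the two-step command-line workflow, wrapped in Python's subprocess so it can be scripted. The install path and the docs/ directory are my placeholders; import-dir, train-topics, and their flags are real MALLET commands.

```python
# Sketch: driving MALLET's CLI from Python. MALLET_BIN and "docs/" are
# placeholders; adjust to wherever you unpacked MALLET, and keep one
# plain-text document per file in the input directory.
import subprocess

MALLET_BIN = "mallet-2.0.8/bin/mallet"  # hypothetical install path

# Step 1: import a directory of .txt files into MALLET's binary format.
subprocess.run([
    MALLET_BIN, "import-dir",
    "--input", "docs",
    "--output", "docs.mallet",
    "--keep-sequence",      # required for topic modeling
    "--remove-stopwords",
], check=True)

# Step 2: train an LDA topic model and dump the top words per topic.
subprocess.run([
    MALLET_BIN, "train-topics",
    "--input", "docs.mallet",
    "--num-topics", "20",
    "--output-topic-keys", "topic-keys.txt",  # top words for each topic
    "--output-doc-topics", "doc-topics.txt",  # topic mix for each document
], check=True)
```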

Also, on day 2 someone suggested spaCy instead of NLTK for natural language processing in Python. This is another thing I couldn't get working for anything on my work Windows box. I don't know if there is something being blocked or what. It installs and works beautifully on the Linux machine, though.
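For the curious, this is about all it takes to get going - a minimal sketch assuming you've installed spaCy and downloaded the small English model (`python -m spacy download en_core_web_sm`); the sample sentence is made up:

```python
# Minimal spaCy sketch: part-of-speech tags and named entities.
import spacy

nlp = spacy.load("en_core_web_sm")

# A made-up sentence in the spirit of a My Day column.
doc = nlp("Yesterday I had lunch with the Secretary of Labor in Washington.")

for token in doc:
    print(token.text, token.pos_)

for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Washington" -> GPE
```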
