Archive for the 'Information Science' category

Government cost recovery gone awry: PACER and NTIS

Aug 27 2014 Published by under information policy, Information Science

(reiterating these are just my personal opinion and do not reflect anything from my place of work - if you know what that is - or anything else)

For many years, the US federal government has tried to cut costs by outsourcing anything that isn't inherently governmental, making sure that government doesn't compete with industry, and requiring cost recovery for government agencies that provide services to other agencies (see A-76 ).

Old examples that might have changed: GPO had to do all printing of history books for military historians, but the quality was ok, the distribution was crap, and the DoD history organizations and readers had to pay a lot of money. So what they did when I worked there was to give the book to a university press that would do a decent job with it. The books were not copyrightable anyway because they were work for hire by a government employee. Everyone was happy. Another old example was that Navy was required to send all records to NARA. But then Navy all the sudden had to pay NARA to keep the documents (I think this has changed - my example is from late 1990s). This was things like deck logs. Hugely important documents.

NTIS has long been caught up in this. Agencies producing technical reports are required by law to send them to NTIS (if they are unlimited distribution). NTIS is required to recover the cost of their administration and archiving by selling the documents. This is hard because first, agencies are not thorough in sending stuff to NTIS (often because their central repository isn't even getting copies - even though required by regulations, instructions, etc.) and second, agencies make these documents available for free from their own sites.  NTIS also has picked up a few bucks here and there doing web and database consulting and licensing their abstracting and indexing database to vendors who resell to libraries. Why pay for it from a third-party vendor? Cross search with your favorite engineering database. Better search tools.

PACER is also caught up in this. There's actually a law that says US Courts has to recover the cost of running the system by charging for access or for documents. They do not want to but there is a law that they must obey.  This is information that really should be freely available and easily accessible. A famous activist tried to download the whole thing and make available, but he was stopped.

The results of forcing these agencies - GPO, NTIS, US Courts - to recover their costs are great and they directly work against the open government we need and deserve. It causes the agencies to cut corners and not have the systems they need. It causes customer agencies and citizens alike to distrust and dislike them.

Now, US Courts has removed large collections of historical documents from PACER because of an IT upgrade. Read the Washington Post article. Various people in Congress are trying to shut NTIS down, again. GPO seems to be ok, for now - lots of cool neat things from them.

Libraries  - like mine - have been burdened by cost recovery, too, and it often signals the beginning of the end. Superficially, makes sense to show how much something is valued and by whom. In practice, you need a lot more accounting systems and controls over the professional workers that prevent them from doing their job. These services are directly in support of strategic requirements (open government and accountability) but are infrastructure. People are blind to infrastructure until it's no longer there.  NTIS, PACER, GPO and others need to stop with this cost recovery business (meaning Congress has to pass a law that removes that requirement) and be funded as infrastructure. Outsource to get needed skills you can't hire in government, but be smart about it.

No responses yet

Fragile knowledge of coding and software craftsmanship

Aug 11 2014 Published by under Information Science

To continue the ongoing discussion, I think my concerns and experiences with informal education in coding (and perhaps formal education offered to those not being groomed into being software engineers or developers) fall into two piles: fragile knowledge and craftsmanship.

Fragile knowledge.

Feynman (yes, I know) described fragile knowledge of physics as learning by rote and by being able to work problems directly from a textbook but not having a deeper understanding of the science that enables application in conditions that vary even slightly from the taught conditions. I know my knowledge of physics was fragile - I was the poster child of being able to pass tests without fully understanding what was going on. I didn't know how to learn. I didn't know how to go about it any other way. I had always just shown up for class, done was I was asked, and been successful. In calculus I was in a class that had discussion sections in which we worked problems in small groups - is this why my knowledge isn't fragile in that or is it that I did have to apply math to physics problems? Who knows.

Looking back, now, it seems like a lot of the informal education I've seen for how to code is almost intentionally aimed at developing fragile knowledge. It's not how to solve problems with code and building a toolkit that has wide application. Showing lots of examples from different programs. It's more like list the n types of data.



There is actually a movement with this name and I didn't take the time to read enough about it to know if it matches my thoughts. Here I'm talking coding environment, code quality, reproducibility, sharing.... Not only solving the problem, but doing it in a way that is efficient, clean, and doesn't open up any large issues (memory leaks, openings for hackers, whatever else). Then taking that solution and making it so that you can figure out what it did in a week or so or so that you could share with someone else who could see what it did. Solving the problem so that you can solve the same problem with new data the same way. My code is an embarrassment - but I'm still sharing, because it's the best I know how to do and at least there's something to start with.

A couple of people suggested the Software Carpentry classes - they sound great. Maybe SLA or ASIST or another professional association could host one of these as a pre-conference workshop? Maybe local (US - Maryland - DC ) librarian groups could host one?  We could probably get enough people.

No responses yet

My current to-read list

Jun 27 2014 Published by under Information Science

I've been keeping a million tabs open at work and a home, because I haven't even had the time to add things to my citation manager... I also have some things in print that I've been carrying back and forth to work every day (boy is my bag heavy!).  Most of these things probably rate a post of their own, but sigh...  Add to that my obsession du jour with screenscraping and text mining using R, Python, and Perl.... and the fact that I'm not good at it so everything takes longer (also would take less time if I actually RTFM instead of just hopped to code and tried it).

So here are some things on my radar (I'm giving no credit to whoever pointed me to these because I honestly don't remember! Sorry):

  • Hadas Shema,  Judit Bar-Ilan,  Mike Thelwall (in press) How is research blogged? A content analysis approach. JASIST. DOI: 10.1002/asi.23239
    She tweeted a link to the pre-print if you don't have access. I got a about 2/3 through this as soon as I saw it announced and then realized I had been working on a very important work thing and dropped it. Very interesting so far.
  • Lisa Federer (2014) Exploring New Roles for Librarians: The Research Informationist.Synthesis Lectures on Emerging Trends in Librarianship. New York: Morgan and Claypool. doi:10.2200/S00571ED1V01Y201403ETL001
    I was first like meh about this (another name) but then I relooked and I'm interested in their version of embedding.
  • Vanessa P. Dennen. (2014) Becoming a blogger: Trajectories, norms, and activities in a community of practice. Computers in Human Behavior 36, 350-358, doi: 10.1016/j.chb.2014.03.028
  • Paige Brown (11 June 2014) How Academics are Using Social Media. From the Lab Bench.
    This and all the linked reports look very interesting.
  • Pablo Moriano,Emilio Ferrara,Alessandro Flammini,Filippo Menczer (2014). Dissemination of scholarly literature in social media.
  • Jeff Seaman and Hester Tinti-Kane (2013) SOCIAL MEDIA FOR TEACHING AND LEARNING. Boston: Pearson Learning.
    This was probably cited in the blog post above.
  • Liu, Y., Kliman-Silver,C.,Mislove,A. (2014) The tweets they are a-changin': Evolution of Twitter Users and Behavior. ICWSM. (google for it - I have the printout)
    This was mentioned by some folks from MPOW who went to the conference. Provides a nice overview.
  • Tenopir, C, Volentine,R., King, DW, (2013) Social Media and scholarly reading. Online Information Review 37, 193-216. doi: 10.1108/oir-04-2012-0062
    I might have actually read this but it's still riding around in my bag
  • Nentwich, M., König, R.. (2012). Cyberscience 2.0: Research in the age of digital social networks. Frankfurt: Campus Verlag.
    This one is time sensitive as I borrowed it from Columbia.
  • Holmberg, K. Thelwall, M(2013) Disciplinary differences in twitter scholarly communication. ISSI Proceedings 2013.  <- that was typed from my handwriting and not checked. google for it. I think I may have read this, but i have it in the stack to read again
  • Thelwall et al (in press) Tweeting links to academic articles. Cybermetrics J (google for preprint)
  • Haustein, et al. Tweeting biomedicine: an analysis of tweets and citations in the biomedical literature. ArXiv 1308.1838
  •  Sayes,E. (2014) Actor–Network Theory and methodology: Just what does it mean to say that nonhumans have agency? Social Studies of Science 44, 134-149.  doi:10.1177/0306312713511867

And this is just on my screen or in my bag. I think the babies tore up 3 articles i had waiting to be read by my couch :(  So far behind!


No responses yet

Auditors, really?

May 13 2014 Published by under Information Science

I'm all for responsible use of our money and we are very, very careful with our limited funds... so really I would have said, "bring it on" to auditors. Of course, I'm not the financial person, so I don't have to deal with them directly.

So here's the crazy bit: they want us to justify each and every one of our subscriptions for 2006 and 2007. What did we pay? What was the going rate? How many downloads were there for ONLY our lab?  Were there cheaper substitutes? Problem is, my place of work is a research lab that's a division of a major university. Only a few of our licenses are only for our lab. The vast majority are managed by the larger institution - as is proper.

In case you didn't know, usage is hardly ever available by IP. Counter compliant statistics, I'm told, are actually less informative than what was available previously.

So what happens if they disallow our licenses from 2007 in 2014? No clue whatsoever.

The only supporting info I could think to offer beyond what everyone else has done (the tech services folks at the parent institution have been awesome), is what sources we cited in those years. To me, that demonstrates that they were useful/used - even if the year is somewhat delayed.

I was pondering approaches, but I was able to do it in a few clicks in Scopus - not super clean, but useful:

  1. Search for affiliation and pick yours off.
  2. Limit by publication year (2007 in this case)
  3. Select all (we only had ~450/year during this time, ymmv if you have a much larger institution)
  4. Under More > view References
  5. In the refine sidebar, at the bottom, in small print > export refine

So I did see things written years after they were cited, but they were mostly things like standards and technical reports, and unpublished things. The publications also needed a little cleaning up  - aj and apj were there as well as the journals listed under their full names.

Dunno. I guess we'll see if the auditors get this - I'm getting the feeling that they have no concept of how science works.

No responses yet

Random observation: Expanding database scope can sometimes be annoying

Apr 28 2014 Published by under Information Science

Observation, rant, whatever.

I sorta look like an idiot - more than usual, anyway - because I can't pinpoint the number of articles MPOW wrote each year from 2009-2013. I mean, I give a number and then a couple of weeks later they ask me to check the research databases again (I've probably mentioned that we didn't have a mandatory comprehensive tracking system internally until this year) and the number has changed.

In the last 4 weeks, ~77 articles were added to this one set of databases we have with our name in the address field for any author. Of those, 45 were published in 2013 or 2014.

They were all from conferences that should be covered by the database... so I guess I don't really have a solution. I'm comparing our research output with other organizations, though, so if a new organization is added, it's hardly fair to do them now and use last month's numbers for MPOW. SIGH.

3 responses so far

C&RL Vote for articles - write in campaign for Taylor!

Apr 22 2014 Published by under Information Science

Ok, so the only thing I'm doing for this campaign is posting here, but anyway.

College & Research Libraries is having a 75th anniversary special issue and they're asking readers to vote for the best articles: .

I don't know about a lot of the choices. I mean FW Lancaster (framed pdf) FTW, of course! And Kilgour, probably. Plus the anxiety one by Mellon (framed pdf) has definitely had an impact.

BUT they forgot the best one ever:

Taylor, R. S. (1968). Question-negotiation and information seeking in libraries. College & Research Libraries, 29(3), 178-194. (pdf)

Luckily there's a write in block. So you know what to do... write it in!


No responses yet

Post I would like to write: New trend for linking external and internal information

I first noticed this in chemistry but now I'm seeing it in engineering, too. Major publishers (?), content vendors, indexes (?) are starting to offer services whereby your company can use their search tool to index your local content and display it side by side with their content OR a way to use their api to pull their content in to your local tool.

That's a common complaint in research labs and in companies. Knowledge management of internal information gets funded and defunded and is cool and then not cool... External information is familiar to people coming out of school... how can you search across both.

We have the artist formerly known as Vivissimo (now IBM something spherical I think) as an intranet search, and I would love to see it use our discovery layer type search as a tab. I don't see why it couldn't.

This deserves analysis and thought - no time. sorry!

2 responses so far

A word on ebook iPhone app usability

Aug 29 2013 Published by under Information Science

I'm not a usability expert although I certainly have read a bunch and seen a bunch of presentations (and know a few experts personally), but there are some basic ideas about understanding your user and the tasks they have to perform with your app or device or site that should be somewhat obvious.

I often read books and articles on my iPhone while nursing/rocking my babies. Maybe it makes me a bad mother but it sure has helped with patience over the past almost 18 months! If they're awake and up to shenanigans, I put the phone away and give them my full attention... but anyway. People are shocked and amazed that I can put up with reading a book on my iPhone. I'm not sure why - it's not a tiny font, I can make the font whatever size I need. I have the phone with me anyway. I don't need a separate light source. I can get new books right there instead of having to connect it to my laptop.

One of the things that is super, super important for an immersive reading experience is the ability to quickly turn pages - without even thinking about it and without losing your train of thought. When you're reading on a small screen, you might have like 4 page turns to every one you would have with a print book so it's something you do a lot. (particularly if you're reading <ahem> trashy bodice ripper romances <ahem> that read very quickly!)

Overdrive is the only app you're supposed to be able to use with the Overdrive license my local publib has. They have made two major mistakes with page turning - it's like they don't really get it? First, a while ago they added automation when you turn a page. So it would look like the corner turning up and going over - what a colossally bad idea! No one turns pages because it's cool - you turn pages to see what happens next. They quickly reversed that and made it an option. In the most recent update they've added a bunch of cool things like synching across platforms (good), but they've now made it a swipe instead of a tap to turn the page... and you can't even swipe from the side because that opens a menu, you have to swipe in the middle... which is hard to do one-handed while holding the device. And it's slow... it has to think about it before turning. So then you have to go back and check what was happening and then go forward again... I had a book on there that I had had on hold for a while and I just gave up on it. I'm going back to reading about Web Corpus Construction in a pdf reader like Good Reader.

Update: This afternoon Overdrive released a new version that fixes the page turning issue. I can only hope that they learned from it this time when they didn't learn from it last time.

One response so far

A local way to create a word cloud

Jul 03 2013 Published by under information analysis, Information Science

There's a lot of debate about whether word clouds are useful visualizations, but they do give you some insight into the prevalence of words in some amount of text.

Wordle is a way to do this online but you have to paste your text into a box or provide a url to an RSS or ATOM feed. So that won't work for some content because of sensitivity and it also won't work for batches of files. Here's a way I stumbled upon when I was trying to do some other analysis/visualizations.


  • R
  • packages: tm (and all of its many dependencies), wordcloud, RColorBrewer (optional)
  • data

I originally had a 470 page pdf that I wanted to get a feel for. It was divided up into sections, each on a specific program. I ended up saving out each section as rtf by hand although I could have done it programmatically in some other program. I tried saving directly to plain text but that failed. I then used DocFrac to batch convert them to plain text (probably not necessary, but I wasn't sure). So then I had a directory of plain text files.

Load into R the whole directory:

thingy <- Corpus(DirSource("plaintext"), readerControl = list(reader = readPlain()))

make a copy in case of emergencies, but mostly to use at a later step:

thingy.orig <- thingy

Then I did a bunch of clean up tasks pretty typical for this type of work (there is an easier way to do it but this way works. I didn't take into account the order of doing this, which I probably should have)

#remove stop words
exchanger <- function(x) removeWords(x, stopwords("english"))
thingy<- tm_map(thingy, exchanger)

exchanger <- function(x) stemDocument(x)
thingy<- tm_map(thingy, exchanger)

#lower case
exchanger <- function(x) tolower(x)
thingy<- tm_map(thingy, exchanger)

# remove numbers
exchanger <- function(x) removeNumbers(x)
thingy<- tm_map(thingy, exchanger)

# using grep sub to remove punctuation
# I used this instead of the built in because I wanted it to leave a space where the punctuation had been
exchanger <- function(x) gsub('[[:punct:]]', ' ', x)
thingy<- tm_map(thingy, exchanger)

In addition to the regular stop words, I removed a list of words that were particular to this text.

#custom list
mystop <- c(" "," ")
exchanger <- function(x) removeWords(x, mystop)
thingy<- tm_map(thingy, exchanger)

Next you should take a peek at at least one specific document.

inspect (thingy[3])

And you can add more terms to the custom list or do a gsub or whatever.

Then you need to build a term document matrix.

# create term document matrix
dtm <- DocumentTermMatrix(thingy)

And it needs to be in matrix format. I understand there's a limit to how big these can be. There are some options to make them smaller.

dtm.mat <- as.matrix(dtm)

Then you need to get a vector of all the words and it's nice to have them in order of decreasing frequency. Plus you need the frequencies.

v <- sort(colSums(dtm.mat), decreasing=TRUE)
myNames <- names(v)

Then you pick your colorbrewer color scheme and draw your cloud:

pal <- brewer.pal(12,"Paired")
#add in seed to make it reproducible
set.seed (300)

Now here's the issue. Our words are still stemmed. Well, there is a stem completion function which goes through and finds the most prevalent complete version of that word in a dictionary. In the examples, they often use the original corpus as the dictionary... the problem is that it's very, very, very slow. In fact, it can take days. I have some options I'm considering to get around this:

  1. try the wordnet package
  2. use python nltk to lemmatize prior to doing the r stuff (lemmas are real words unlike the stems)
  3. change the script to make it not iterate so much

Stemcompletion ran for a week on my computer before it was forced to reboot from an update... I haven't had a chance to try the other methods yet.

One response so far

Instrument bibliographies, data citation, searching for data

Jan 11 2013 Published by under bibliometrics, Information Science

My place of work builds spacecraft and instruments that fly on other folks' spacecraft. So one of the things that we need to do is to come up with a list of publications that use our data. It's the same thing with telescopes and it ends up being a lot more difficult than you might expect. There are a few articles on how to do it from ADS staff and from librarians from ESO (Uta Grothkopf and co-authors), STSCI, and other telescopes. It turns out that you really have to do a full text search to get closer to comprehensive. ADS has a fulltext search covering some things, but I asked the experts in the Physics-Math-Astro Division of SLA and many of them also use the fulltext searches on the journal publisher pages (which are of varying quality). I found that Google Scholar was the only thing that got book chapters. This is all pretty complicated if your instrument has a pretty common name or a name that is a very common word.

Other suggestions were to use funding data from Web of Science or elsewhere (soon to be part of CrossRef data), but that really only gets the science team for the instrument. Our main point is to find out who is downloading the data from the NASA site (or elsewhere?) and doing good science with it.

Heather Piwowar has done a lot of research on data citation (is it done, how do you find it), but I believe mostly with life sciences data. Joe Hourclé has also presented several times on data citation and there is the DataCite organization to work on this issue. But this is all future stuff. Right now it's the wild west.

No responses yet

Older posts »