Archive for the 'Uncategorized' category

I'm a coding fool... use of the nuggets mentioned last post

Jul 18 2014 Published by under information analysis, Uncategorized

Self-efficacy FTW... may I never crash!

Last post I mentioned determining if a string is an element of a character vector and opening Excel files in R. Here's what I was doing and how it worked.

I had a directory of xls files downloaded from a research database, one per organization, showing the high-level subject categories in which they published. The subject categories were actually narrower than what I needed (think condensed matter, high energy, AMO, and chemical physics when I need "physics"). I needed to rank the organizations by articles published in these composite categories, each of which included maybe 10 or so of the categories from the database.

Originally, I was going to open all the files and just count them up, but whoa, that would suck. R for the win!

First, run RStudio in either 32- or 64-bit mode, depending on which Java you have installed.

Next, get the list of files. I had saved them in a directory with other things, too, so I needed to search by pattern. I had already set my working directory to the data directory, for good or bad.

fileList <- list.files(pattern = "subject")

Get the list of their categories for my composite category (physics here). This was just a list:

physics <- read.delim("physics.txt", header = FALSE, stringsAsFactors = FALSE)

Make sure words are read in as characters and not factors, and that numbers are really numbers. By trial and error and inspection, I figured out that there was a non-numeric character in the count field.

Here's the function:
countPhysics <- function(file){
  #read the workbook in - assuming XLConnect here, which is the part that needs Java
  physfile <- XLConnect::readWorksheetFromFile(file, sheet = 1)
  #this creates a vector to hang on to the numbers of counts
  phys.hold <- vector(mode = "numeric")
  #this is so i can make sure i just have numbers in the count field
  physfile$Count <- as.numeric(gsub("[^0-9]", "", physfile$Count))
  #this finds matching records and then keeps just the part we want
  #one of these days i'll just import right the first time instead of this part
  for (j in 1:length(physfile$Count)){
    if (is.element(physfile$Analysis.Value[[j]], physics$V1)) {
      phys.hold[j] <- physfile$Count[[j]]
    } else {
      phys.hold[j] <- 0
    }
  }
  total <- sum(phys.hold, na.rm = TRUE)
  return(total)
}

So you run this like so:

physicsResult <- sapply(fileList, countPhysics)

I transposed it and then pasted it into an Excel file I was working on, but this is essentially the answer. I did the same thing for the other categories separately; obviously I should have checked each line against each of my categories before moving on to the next line and then output a data frame. Oh well.

Comments are off for this post

Beginning adventures in Python

Jul 03 2013 Published by under information analysis, Uncategorized

I had a little slow period a month or so ago at work (not anymore, for sure!) and I decided it was time to start working on a goal I had set for myself for the year: learn to do some analysis that actually uses the full text of the document vs. just the metadata. Elsewhere I have discussed using Sci2, VantagePoint, bibliometrics, and actually Carrot2 (using the text of the abstract), but I need to go further. I don't aspire to become an expert in natural language processing (NLP) but there are some times I end up having to stop before I want to because I just don't know how to go on.

Anyhoo... first step was to see what I could do in R using the TM package and whatever else. I figured out how to do a word cloud but meh on some of the other tm stuff. I tried a little LDA but my corpus didn't work well with that. When doing the word cloud I realized I really wanted to lemmatize instead of stem. I looked around for ways to do it in R, and there is a WordNet package for R (thanks Greg Laden for pointing it out!) but it just wasn't doing it for me. I had recently worked my way through a bunch of the Python lessons on Code Academy and also bookmarked nltk - the natural language toolkit that works in python so I thought - ah-ha!

The first big deal was installing the stupid thing - the language itself. Argh. I started with Eclipse and PyDev, but alas, I am so not willing to figure out how that really works. I got one sample program running, but for the next program it kept running the first one, so meh.

I started working my way through the nltk book, which uses the interactive shell, where you get immediate responses. Installing packages - I never did figure out how to do that in Perl - it's easy in R, but alas... so I gave up on PyDev and installed ActivePython, which has a handy-dandy package installer that, lo and behold, works for people like me who only know enough to be dangerous.

The other thing I'm learning: holy cow, ignore what your computer is and do everything 32-bit, for the love of chocolate. I hit a bunch of problems from installing 64-bit versions where everything was looking for 32-bit. Uninstall and try again.

I still haven't figured out how to use the programming environment that ships with ActivePython. I really like how RStudio completes things, and that's why I wanted to use Eclipse. I'll have to try that next.

Anyway, I hope to take some notes and leave them here for my future recall as it's easy to forget how things worked.

Comments are off for this post

Knowing what you know, or rather, what you've written

When I first came to work where I work now, I asked around for the listing of recent publications so I could familiarize myself with what types of work we do. No such listing existed even though all publications are reviewed for public release and all copyright transfer agreements are *supposed* to be signed by our legal office. Long story short, I developed such a listing and I populated it by having alerts on the various research databases.

Now, 9 years later, it's still running and it is even used to populate an external publications area on our expertise search app.

By its nature and how it's populated, there's absolutely no way it could be totally comprehensive, and it is also time-delayed. It's probably a little better now with how fast the databases have gotten and because Inspec and Compendex now index all author affiliations, not just the first author's.

Anyway, our leadership is on an innovation kick, looking at metrics to see how we compare to our peers and whether any interventions have positive effects. The obvious thing to look at is patents, but that's complicated because policies toward patenting changed dramatically over the years. They're looking now at number of publications - something I think they probably ought to track as part of being in the Sci/Tech business.

My listing has been looked at, but that only started in 2003/2004. From here forward the public release database can be used... but what about older stuff? Well, in the old days the library (and the director's office) kept reprint copies of everything published. Awesome. Well, except they're kinda just bound volumes of all sorts of sizes and shapes of articles. I guess these got scanned somehow and counted, but a few articles ended up with no dates or citations (title and author but no venue). Three of these got passed to me to locate. They're not in the research databases mentioned above, but we know they were published (since reprints were provided) and that they aren't technical reports.

The answer? Google. Of course. The first was a book chapter that was cited in a special issue of a journal dedicated to the co-author. The second was a conference paper that appeared on the second author's CV (originally written in 1972 - thank goodness for old professors with electronic CVs!). The third was a conference paper cited by a book chapter indexed by Google Books. BUT to find the year, I have to request the book from the medical library... which I have done.

At least back in the day the leadership understood the value of keeping a database (print volumes) of our work. From at least 2003 until 2012, there was no such recognition. Now that I will be benchmarking us against peer organizations, I wonder if they're in the same boat or if they've kept their house in order with respect to their intellectual contributions.

Comments are off for this post

Mobile device pairing. Kind of cool and way overdue.

May 15 2012 Published by under Uncategorized

I often get the question: isn't there something I can do to identify my work laptop so that I can go home and the journals, etc., will still recognize me without having to use the proxy or VPN?

Seemed kind of far-fetched. A publisher willing to do that would be... gasp... giving up some control!

In a recent announcement, the American Mathematical Society informed us that their users are able to do just that.

They're not the first or only one. You can roam with EndNote Web for a year. I think there is something similar with some of the Elsevier apps (maybe just Scopus?). The ArtStor app used to do this (they might still... not sure). Maybe EngNetBase (but that was really clunky when I tried it).

This is nice - it takes down some barriers for users, increases usage, and still links downloads/reads to institutional subscriptions.

5 responses so far

When your web services change terms

Dec 23 2011 Published by under Uncategorized

Jonathan Rochkind has written a ton about web services and APIs that libraries can/should/do use. His posts are written from the point of view of someone who understands the programming bit, the data bit, and the library bit. This post is written by someone who watches that stuff with interest and has worked, on occasion, with programmers.

I mentioned some time ago that we got an internal (to my place of work) "ignition grant" to build a system for supporting the listing, searching, lending of personal or desk copies of books. It should be noted that the money was from lab leadership, but we were voted in by lab staff. We have an internal social networking tool that's running on Elgg so we decided to build it to hang off of that. My collaboration partners are from 2 sponsor-facing departments and work in information assurance type CS jobs, not as software developers. My contribution was really in how to track books and how people search for books, and lending books... oh and barcode scanners 🙂

So anyway, after a lot of discussion, we went with the Amazon API to provide book metadata, including descriptions and book cover images. Unfortunately, Amazon changed their terms of service in November to require an Associates ID. We ran this past various parties at the lab, including legal. No go. We couldn't sign up for an Associates ID because of other things in the license. So our beautiful system couldn't add any new books! And our grant was long over.

Luckily, some folks in the IT department stepped up to make a fix, but the problem was, what API to use? I used Jonathan's posts and some other things around the web and came up with WorldCat and Open Library for cover images. So we're now back up and running, but with no book descriptions.
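For the curious, Open Library cover lookups follow a simple, predictable URL pattern, which is part of what makes them attractive as a fallback. A minimal sketch (the helper name and sample ISBN are mine, not from our actual Elgg plugin):

```python
# Open Library's Covers API serves images at a predictable URL:
#   https://covers.openlibrary.org/b/isbn/{isbn}-{size}.jpg
# where size is S, M, or L.
def openlibrary_cover_url(isbn, size="M"):
    """Build a cover-image URL for an ISBN (size S, M, or L)."""
    return f"https://covers.openlibrary.org/b/isbn/{isbn}-{size}.jpg"

print(openlibrary_cover_url("0385472579"))
# https://covers.openlibrary.org/b/isbn/0385472579-M.jpg
```

No API key, no terms to sign - which, after the Amazon experience, was the whole point.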

Assuming we get the go ahead from legal, we hope to make our Elgg add-on open source and make it available from the Elgg site. If/when we do, we'll probably have screen shots to share and more information. It's a neat idea on another way to find expertise and to support collaboration (and saving money) within an organization.

The moral of the story is: watch out for the terms of service on APIs, and keep watching, because they can change and then your functioning service can go up in smoke. We feel a lot better about Open Library and somewhat better about WorldCat... but vigilance is important.

Comments are off for this post

Research Database Vendors Should Know

Research database vendors - the ones who design the interfaces that the end users use - should know that data export is not a trivial addition. Rather it is an essential part of their product.

Over and over and over again, librarians complain about one interface that works one day and doesn't work the next. The one that doesn't output the DOI unless you select the complete format. The one that all of a sudden stopped exporting the journal name. The interfaces that don't work with any known citation manager. The ones that download a text file with 3 random fields instead of directly exporting the full citation and abstract.

But you blow us off and you act like it's not important.

Well. I was just talking to a faculty member at another institution - even though a particular database is the most appropriate for her research area and she finds interesting papers there, she now refuses to use it because it doesn't export to EndNote correctly. She's tired of the frustration, and she's tired of finding that she has to edit everything she imports, so she's given it up.

Once again librarians are telling you something and you need to listen. Researchers and faculty are super busy. They will not keep using your product if it makes their life harder. If they don't use your product then we'll stop subscribing. That's all there is to it.

2 responses so far

What, if anything, does Google owe us?

Jul 27 2011 Published by under information policy, Uncategorized

Google is an advertising company. They provide services - very useful services - in return for showing us advertising. They are there to make a profit, and there's absolutely nothing wrong with that. They have a motto of not doing evil, which some people argue they don't always live up to.

So, they provide a service like search and there's no explicit agreement you need to make to use the service but their "universal terms" still hold and are accepted implicitly. You could use some sort of ad blocker to cover the ads and probably get away with it although that does violate some terms. With other services like gmail, gdocs, g+, there are explicit terms of service that you need to agree to in order to have access to the service.

For blogger, you can't post things for which you do not either have the copyright or have a license (such as Creative Commons) to post. In news this week (via Sir Shuping on Friendfeed), they did something pretty cool. They made it both easier for bulk DMCA takedown orders to be processed and easier for bloggers to fight a bulk order (file a counter-notice). Why is this cool, if it's just the law and business? Well, copies of all of the takedown orders go to the Chilling Effects cease & desist clearinghouse, where they are available for searching. Second, they don't take a sledgehammer to the problem. They move your post into draft status, and you can edit it to take out the offending content and then re-post. To me, this seems entirely reasonable.

On the other hand, one of the terms of service for g+ is that you use your real name to sign up. Many people have said in many different places why this doesn't make sense - it's just stupid for Google to have this requirement and it certainly does not further their goal of preventing the morass of trolls and flames and disgusting comments that is found on YouTube. But that's not my point right now. My point is that their terms are that you must use a real name (and be a certain age) and their reaction if you don't hold up your part of the bargain is that they terminate your relationship with google across the board. They lock you out of gmail, blogger, picasa, gdocs, reader... everything.  According to this Gawker piece, it's really messed up a bunch of kids as well as the pseudonymous scientists and celebrities we've heard from. I actually don't see the part about real names in the universal terms, and there's no link from that page to google plus terms.

It's disproportionate. It's punishing instead of just doing business. It's decided algorithmically instead of by humans and clearly the algorithm needs to be tweaked.

I have an ongoing relationship with Google. I expect it to deliver services if I follow its terms of service. The universal terms, afaict, do not mention using my real name. That means that it isn't required for all services, just g+. In that case, remove access to just the service where it's written into the ToS.

I think Google owes people like Grrl Scientist an apology and access to all of her data. Even if they do not reactivate these accounts, they should provide download links for the content.

4 responses so far

Big news for April 1

Apr 01 2011 Published by under Uncategorized

We're moving to Occam's Typewriter. What with all of the infighting between Janet and Sci and the new subscription model (with a paywall like the NY Times)... it's time to go.

Luckily, there will be lots of room over there.

3 responses so far

Happy pi day

Mar 14 2011 Published by under Uncategorized

Better late than never. We had chicken pot pie for dinner to celebrate. This is the America's Test Kitchen recipe, with me copping out on the topping (store-brand Bisquick dropped on there). Looks like dog food but is tasty!


One response so far

About the preservation of databases

Feb 16 2011 Published by under Information Science, Uncategorized

Egon Willighagen asked on Chm-Inf about why libraries aren't preserving databases. Beth Brown provided one reply.

I commented there, and hopefully my comment will show up eventually, but I seriously doubt we'll be able to help with this.

NASA, DOD, NIH, NSF, and others fund the development and first few years of hundreds if not thousands of databases. Then the database becomes less about new science and more about infrastructure or operations. Then the PI gets bored. Then maybe the users start to drop off... then the database disappears. I was just looking for information in a NASA database that was referenced all over the place. When I got there, all I found was a notice that it wasn't funded anymore - so no data for me!

We've been hearing this about data - how it costs so much to gather but then is abandoned. Libraries are working to take up some of the slack, but it's hard. Look, if NASA and DOD, with their big offices for science and technology information, can't preserve their own stuff, they're not going to fund us to do it. Libraries don't have the money or the mandate.

I was at SLA whenever it was in DC and saw a presentation about yet another NASA database - even at the time, the only thing I could think was: how close is the PI to retirement?

Funders should ask about preservation plans for these things. I don't think they do.

3 responses so far
