Post I wish I had time to write: Scientific meetings and motherhood

(by Christina Pikas) Feb 24 2015

I was reading Potnia's new post on meetings - why to go to them - and nodding my head vigorously (ouch), connecting it to the part of the dissertation I'm writing now on tweeting meetings, and thinking about the research over the years on how scientific meetings work and what they contribute...

and I got very sad. I'm a real extrovert and a magpie for all sorts of different kinds of research, but I can't justify spending my limited time reading articles that aren't pretty directly relevant to my job or my dissertation. When I went to bunches of meetings, I could soak up a million little tidbits, meet the people doing the work, browse lots of posters, and talk to their authors. It's really a very efficient way to see what's up in a field.

and now... I haven't been to a conference since I was in my first trimester with my twins :(   Sure, I've listened in to some webinars and followed some tweets. It's not enough.

Would childcare at a venue help?  I don't know... I'd still have to get them there, I'd have to trust the childcare (what if I got there and checked them out and didn't like what I saw?), and I'm paying for childcare at home even when I go and money is super tight now with my income being the only one in our household for more than a year.  I thought about bringing my sister along and then we could see the sights together outside of hours. My work would pay my travel and my room and so I'd just have to pay her travel and everyone's food. But I can't really even swing that right now....

 

So yeah... at least there's twitter. The post I'd like to write actually cites references and what not.

And I'm only the 10 millionth person to have this issue this year, so I know I'm not a special snowflake, but that doesn't mean I can't still bitch about it.

2 responses so far

Exporting high resolution graphs from RStudio

(by Christina Pikas) Feb 12 2015

This may not be obvious until you look into it, but the default export from RStudio - if you use the nifty little tool in the Plots tab on the lower right-hand side - is 72 dpi. That's fine for showing on web pages, typically, but not enough for print, particularly if you're submitting to a journal or something like that. There's lots of advice out there, but I found it somewhat confusing.

[Image: RStudio interface for Windows, from RStudio.com]

I found these posts helpful:

  • http://blog.revolutionanalytics.com/2009/01/10-tips-for-making-your-r-graphics-look-their-best.html
  • https://danieljhocking.wordpress.com/2013/03/12/high-resolution-figures-in-r/
  • http://www.r-bloggers.com/exporting-nice-plots-in-r/

I think someone I was reading just got out of RStudio and did his work in the standard R interface. Really, there's no need for that. I also read somewhere that Cairo is not really used any more? There is a way to export to PDF from RStudio and check a box to use Cairo...
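
For comparison, the plain grDevices route - no Cairo at all - is just a matter of setting the resolution yourself. A minimal sketch (the filename and the plot call are placeholders):

png(filename="myplot.png", width=10, height=7, units="in", res=300)  # res=300 overrides the 72 dpi default
plot(1:10)  # stand-in for your actual plot call
dev.off()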

Here's what I did.

library(Cairo)   # CairoPDF() comes from the Cairo package
library(igraph)  # g below is an igraph object with degree, label, and lty attributes already set

CairoPDF(file="something.pdf", width=11, height=8.5, family="Helvetica", pointsize=11)

set.seed(1337)  # fix the layout so it's reproducible between runs

plot.igraph(g, layout=layout.fruchterman.reingold, edge.arrow.size=0.4, edge.color="black", vertex.size=V(g)$degree, vertex.label.dist=V(g)$vertex.label.dist, vertex.label.color="black", vertex.label.family="sans", edge.curved=TRUE, vertex.label.cex=V(g)$vertex.label.cex, edge.lty=E(g)$edge.lty, vertex.frame.color=V(g)$frame.color)

dev.off()  # close the device to write the file

A couple of notes:

  • I found I needed to increase the arrowhead size
  • I needed to decrease the font size
  • I needed to set a seed so I was only changing one thing at a time as I experimented
  • When I did png, my dotted lines didn't look so dotted anymore. I didn't feel like messing with that further.


# Cairo() from the same package writes the PNG; note dpi=300 instead of the 72 dpi default
Cairo(file="something.png", type="png", units="in", width=10, height=7, pointsize=12, dpi=300)

set.seed(1337)

plot.igraph(g, layout=layout.fruchterman.reingold, edge.arrow.size=0.1, edge.color="black", vertex.size=V(g)$degree, vertex.label.dist=V(g)$vertex.label.dist, vertex.label.color="black", vertex.label.family="sans", edge.curved=TRUE, vertex.label.cex=V(g)$vertex.label.cex, edge.lty=E(g)$edge.lty, vertex.frame.color=V(g)$frame.color)

dev.off()

One response so far

So... um... what if I'm still enjoying it?

(by Christina Pikas) Feb 05 2015

Am I supposed to kind of hate my dissertation topic by now? If I don't, does that mean I'm not working on it hard enough (maybe)? That I'm doing it wrong? Maybe it's a phase and it will pass.

Making progress. Learning new stuff from my data. Feeling horribly inadequate when watching tweets fly by from another doctoral student dissertating on how scientists use blogs.... (holy moly how many scientists did she actually interview? hundreds? cray-cray... or am I a hater?)

Working on it every chance I get - taking a morning off every week, staying up late... I will have to add more time off. If only we could afford more childcare!

2 responses so far

Using more of the possible dimensions in a network graph

(by Christina Pikas) Jan 30 2015

When doing bibliometrics, social network analysis, or any other kind of network graph, there are only so many ways to convey information:

  • Size of nodes
  • Shape of nodes (including pictures)
  • Color of nodes
  • Border of nodes (or multiple borders)
  • Labels (node or edge)
  • Edge weight
  • Edge color
  • Arrows
  • Shading areas around/behind nodes
  • Layout or arrangement of nodes

Of these, I almost always size nodes by degree (connections to other nodes), do thickness of lines by their weight, and do some sort of energy or spring layout.

If I do some sort of clustering or community detection or even want to call out components, I'll do that with node color.
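
Put together, that standard treatment is just a few lines of igraph. A minimal sketch on a toy random graph (everything here is illustrative, not from the project below):

library(igraph)
set.seed(42)
g <- erdos.renyi.game(30, 0.1)               # toy graph for illustration
V(g)$size <- degree(g) + 2                   # size nodes by degree (offset so isolates still show)
fc <- fastgreedy.community(g)                # community detection
V(g)$color <- membership(fc)                 # color nodes by community
plot(g, layout=layout.fruchterman.reingold)  # spring layout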

My normal things are easy in any package that will graph networks. I was working on a project where we were looking at the maturity of a particular part of an industry. As part of this, we wanted to know if the necessary component systems were available from multiple suppliers, if those suppliers had relationships with different system integrators, and if their offerings were operational or just for lab or testing purposes.

We could have done a graph for each subsystem, but they wanted this graph to be just one slide in a fairly small deck. I tried various approaches in Gephi and NetDraw and wasn't excited. So back to R and igraph. In the end (anonymized):

[Figure: the resulting graph, minus any labels]

I used:

  • node shape for whether it's a component or a system integrator
  • color for type of component
  • size for degree
  • solid or dashed lines for whether it was in operation or not

I really wanted to show a different shape for each category, but igraph only has about six defaults and they don't look all that different from each other. NetDraw has more. I tried to use raster images, but I'm on a Windows machine and I found all of that very confusing.

One unfortunate thing about this graph is that I had to list companies multiple times if they had offerings in multiple categories.

Customer seemed to like it.

I'm not going to take the time to anonymize all the code, but here are some key pieces - ask if there's anything I figured out that you don't immediately see how to do.
I started with a spreadsheet (3 of us librarians were adding data):
nodetable tab:
id | label | category

edgetable tab:
source | target | yes/no | notes

These I imported into Gephi (super easy)... and then tried all sorts of stuff... and then exported to GraphML.
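(You could also skip the Gephi round trip and build the graph straight from the two tabs - a sketch, assuming each tab is saved out as a CSV with the column layout above; the filenames are placeholders:)

nodes <- read.csv("nodetable.csv")   # id, label, category
edges <- read.csv("edgetable.csv")   # source, target, yes/no, notes
g <- graph.data.frame(edges, directed=TRUE, vertices=nodes)   # extra columns become edge/node attributes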
#read in the graphml export
library(igraph)
g <- read.graph("g.graphml", format="graphml")


#shape nodes. these work, but you can't have n/a, so there has to be a default. also, there is an easier way (vectorized version below)
for (i in 1:vcount(g)) ifelse(V(g)[i]$Category=='category', V(g)[i]$shape<-'circle', V(g)[i]$shape<-'square')

#color of nodes - a plain number indexes into the current palette (set up with RColorBrewer below)
for (i in 1:vcount(g)) if(V(g)[i]$Category=="category"){V(g)[i]$color<-1}
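
#the easier way mentioned above, as a sketch: vectorized assignment instead of loops,
#under the same assumption about the Category attribute
V(g)$shape <- ifelse(V(g)$Category=='category', 'circle', 'square')
V(g)[V(g)$Category=="category"]$color <- 1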

#calculate and keep the degree. i use it again for label placement (not shown) and to bold some labels (not shown)
V(g)$degree<-degree(g, mode="all")

#when I tested the graphing, the isolates were all mixed in and messed up all the labels.
#subgraph to show isolates separately
gi<-induced.subgraph(g,V(g)$degree==0)
gnoni<-induced.subgraph(g,V(g)$degree!=0)

#make dashed lines for edges that aren't operational (lty 1 = solid, 2 = dashed); the same vectorized trick would work here too
for (i in 1:ecount(gnoni)) ifelse(E(gnoni)[i]$"operational"=="yes", E(gnoni)[i]$edge.lty<-1, E(gnoni)[i]$edge.lty<-2)

#prettier colors (the lib.loc is machine-specific; plain library("RColorBrewer") works too)
library("RColorBrewer", lib.loc="~/R/win-library/3.1")
mypalette<-brewer.pal(6,"Paired")
palette(mypalette)   # numeric colors now index into this palette

#legend definitions
colors <- c('gray40', 1, 2, 3, 4, 5, 6)
labels <- paste("category", 1:7)   # placeholder: the real (anonymized) category names, one per color above


#plot the graph and keep the device open for the legends below
plot.igraph(gnoni, layout=layout.fruchterman.reingold, edge.arrow.size=0.1, edge.color="black", vertex.size=V(gnoni)$degree, vertex.label.dist=V(gnoni)$vertex.label.dist, vertex.label.color="black", vertex.label.family="sans", edge.curved=TRUE, vertex.label.cex=0.8, edge.lty=E(gnoni)$edge.lty)

#put legends on - isolates are just shown as a legend so they are neatly lined up
#could have been done by plotting points

legend("bottomright",legend=labels, fill=colors, border="black", cex=0.7, inset=c(-0.1,0))
legend("topleft", legend=V(gi)$label, pch=19, col=V(gi)$color, cex=0.7, bty="n", y.intersp=0.5)
legend("topright", legend=c("Yes", "No"), lty=c(1,2), cex=0.7,inset=c(-0.02,0)) Continue Reading »

No responses yet

Ebook Explosion

(by Christina Pikas) Dec 17 2014

Seems like all the publishers and all the societies are trying to get into the ebook game. The newest announcement is from AAS (using IOP as a publisher). Considering that a lot of these domains - computer science, for example, with ACM's new ebook line - are known not for monographs but for conference proceedings and journal articles, it seems kinda weird.

Someone mentioned that maybe it was due to the ebook aggregators' demand-driven acquisition plans - but I think it's just the opposite. Many major publishers have jacked up prices (pdf) on EBL and Ebrary recently - all to push libraries into licensing "big deal" bundles of the entire front list or entire subject categories. And it is super attractive to buy from the publishers, because their books are often DRM-free, come as PDFs (one big publisher even offers a whole book in a single PDF; most are one PDF per chapter), can be viewed online, are easily findable using Google, and come with nice MARC records for adding to the catalog.

The ebook aggregators have nasty DRM. They have concurrent-user rules. They have special rules for things that are considered textbooks. We have to log in with our enterprise login (which isn't my lab's day-to-day login), and the data about what books we view is tied to our identities. The new prices end up being as much as 30-40% of the cover price for a one-day loan. That's right: the customer can look, and maybe print a couple of pages, for 24 hours, and the library is charged a third of the cover price of the book.

But on the society and publisher side, what seems like a one-time purchase has now become yet another subscription. If you buy the 2014 front list, will you not feel the pressure to buy the 2015 and 2016 publications?

Aggregators had seemed like part of the answer, but not so much at these prices. We've already mugged all the other budgets for our journal habit, so where does the money for these new things come from? The print budget was gone ages ago. The reference budget was also raided. And the ebooks we have licensed do get used a lot at MPOW.

No responses yet

Continuing value and viability of specialized research databases

(by Christina Pikas) Nov 26 2014

There was an interesting thread yesterday on the PAMnet listserv regarding "core" databases in mathematics and which ones could be cut to save money.

One response was that it's better to search full text anyway (I couldn't disagree more).

Ben Wagner expressed concern that Google Scholar was going to drive all of the databases out of business and then Google would abandon the project.

Joe Hourclé posted about ADS - a core database in astro. Fred Stoss posted about PubMed - needs no intro here, surely!

Here's my response.

I think Scopus and WoS are the biggest immediate threats to the smaller domain-specific indexes, particularly when the largest number of academic users are looking for a few reasonable things and aren't doing complex queries or needing to be very precise and have very high recall. In my world, I'm like the goalie: by the time they ask me, they've tried Google, they've asked their friends, they've asked their mother*... the question has gotten past 10 people without an adequate answer. For these hard questions, I need the power of a good database (like Inspec). But... if you look at quantities and numbers of users... does that justify the huge cost? Maybe? But do our auditors agree? Infrequent big wins vs. day-to-day common usage?

As Ben has often chronicled, we've shifted money out of every other budget to support our sci/tech journal habit. We've starved the humanities. We've dropped databases. All for more and more expensive journals. If the content does get paid for out of other budgets - via page charges or institutional support for open access publishing - that might make it even more important for libraries to have better ways to find the distributed content. But, like Ben, I worry that we'll put these finding tools out of business.

Another observation: two of the "core" databases mentioned, ADS and PubMed, are government supported as a service to the community. The solar physics bibliography is a very specialized resource but is also super important to those researchers. Maybe if building specialty research databases is no longer profitable but there remains a need, the community-built tools will improve/grow/gain support? Maybe they'll be backwards and using technology from 1995, though :)

I'm working with some projects that are taking big piles of full-text documents and using computational methods to classify them against an ontology built by subject matter experts (with some advice from a professional taxonomist in my group). The volume/velocity/yadda yadda of the data precludes the careful indexing done by our fancy databases... but this and other projects like it, I think, show a swing back toward the importance of good indexing and of having domain experts review the classification system.

 

* My mom is a statistician so I might ask her first

 

No responses yet

PyCharm FTW

(by Christina Pikas) Oct 12 2014

Another random Python note. I asked again at work, in the Python group of our internal social networking thingy, and the consensus was that I should try PyCharm as a development environment.

All the stinking tutorials are like: use a text editor and the command line - and that's what I'd been doing - but with R, RStudio is so fantastic that I thought surely there must be something workable for Python. I had tried the Eclipse plugin, but I couldn't even get it to run a program, and I couldn't figure out what it was doing, and ugh.

PyCharm now has a community edition, so you don't even have to prove you're a student or pay for it. It's lovely, really. I don't see why I should have to use vi like it's 1991, or beat on something with rocks, just to see where I'm missing a quote or have the wrong indents. Why not have help? I'm trying to accomplish a task, not create art.

I really do have to continue coding and stop playing with Python. Particularly since when I do I end up losing hours of my life when I'm supposed to be sleeping!

2 responses so far

Post migration settling in

(by Christina Pikas) Oct 06 2014

We migrated to a new server and updated our software. See here: http://scientopical.scientopia.org/2014/10/02/server-migration/

There might be some settling-in issues. Let me know if you see anything here. I suspect my code on the sidebar needs attention - which I do not have time to give :(  Also, comments might get caught in the pending queue or the spam filter. I hope not, but let me know.

In case you missed me: dissertation progressing, committee ok with progress, toddler twins will likely kill me, family issues (health and otherwise), work is busy in a good way.

No responses yet

WRT NCTA ads: No.

(by Christina Pikas) Sep 16 2014

Net Neutrality FUD is all over the place. NCTA has ads in the WaPo claiming that 1) you can find everything on the internet, and 2) you can't find any reason the internet should be regulated like a utility.

No.

You can, indeed find many well-reasoned essays on why the internet should be regulated like a utility - see, for example, books and essays and blog posts by Lessig, Wu, Cerf, and others. All shared freely. For now.

You can't find everything on the internet, either, and even many "kids today" know this. See.

No responses yet

Government cost recovery gone awry: PACER and NTIS

(by Christina Pikas) Aug 27 2014

(reiterating that these are just my personal opinions and do not reflect anything from my place of work - if you know what that is - or anything else)

For many years, the US federal government has tried to cut costs by outsourcing anything that isn't inherently governmental, making sure that government doesn't compete with industry, and requiring cost recovery for government agencies that provide services to other agencies (see A-76).

Old examples that might have changed: GPO had to do all printing of history books for military historians; the quality was ok, but the distribution was crap, and the DoD history organizations and readers had to pay a lot of money. So what they did when I worked there was give the book to a university press that would do a decent job with it. The books were not copyrightable anyway, because they were works for hire by government employees. Everyone was happy. Another old example: the Navy was required to send all records to NARA, but then the Navy suddenly had to pay NARA to keep the documents (I think this has changed - my example is from the late 1990s). These were things like deck logs - hugely important documents.

NTIS has long been caught up in this. Agencies producing technical reports are required by law to send them to NTIS (if they are unlimited distribution). NTIS is required to recover the cost of its administration and archiving by selling the documents. This is hard because, first, agencies are not thorough in sending stuff to NTIS (often because their own central repository isn't even getting copies, even though that's required by regulations, instructions, etc.) and, second, agencies make these documents available for free from their own sites. NTIS has also picked up a few bucks here and there doing web and database consulting and licensing its abstracting and indexing database to vendors who resell it to libraries. Why pay for it from a third-party vendor? Cross-search with your favorite engineering database. Better search tools.

PACER is also caught up in this. There's actually a law that says US Courts has to recover the cost of running the system by charging for access or for documents. They may not want to, but there is a law, and they must obey it. This is information that really should be freely available and easily accessible. A famous activist tried to download the whole thing and make it available, but he was stopped.

The consequences of forcing these agencies - GPO, NTIS, US Courts - to recover their costs are serious, and they work directly against the open government we need and deserve. It causes the agencies to cut corners and not have the systems they need. It causes customer agencies and citizens alike to distrust and dislike them.

Now, US Courts has removed large collections of historical documents from PACER because of an IT upgrade - read the Washington Post article. Various people in Congress are trying to shut NTIS down, again. GPO seems to be ok for now - lots of cool, neat things from them.

Libraries - like mine - have been burdened by cost recovery, too, and it often signals the beginning of the end. Superficially, it makes sense: show how much something is valued and by whom. In practice, you need a lot more accounting systems and controls over the professional workers, which prevent them from doing their jobs. These services directly support strategic requirements (open government and accountability), but they are infrastructure, and people are blind to infrastructure until it's no longer there. NTIS, PACER, GPO and others need to stop with this cost recovery business (meaning Congress has to pass a law that removes the requirement) and be funded as infrastructure. Outsource to get needed skills you can't hire in government, but be smart about it.

No responses yet
