'Hot or Not' in Academic Research
Success in graduate school is all about learning. This isn’t exactly news, but I think the types of learning that contribute most to success aren’t what one would expect from the outside. In the popular presentation, a grad student’s job is to learn everything there is to know about one extremely narrow topic, and eventually push the boundaries of human knowledge a tiny bit farther out. Being a good graduate student certainly requires learning the main research results of your field; however, as I’ve spent more time in grad school I’ve come to understand that knowing the human context surrounding those results is equally important. Identifying potential collaborators (or competitors) who are working on similar problems can have a huge impact. Likewise, knowing what types of problems are currently in vogue can acutely affect a grad student’s academic prospects. As much as professors like to tout the intellectual freedom of academic research, the people who sail with the prevailing winds go furthest.
This sort of contextual information doesn’t come easily. Graduate advisors can convey some of what they know, but most learning in grad school is self-directed. What grad students learn on their own is typically absorbed haphazardly through article author bylines, Google Scholar searches, and research group websites. Given the volume of contemporary scientific output and the limited bandwidth of most students, this necessarily results in a shallow and imprecise view of the field. This brings us to the point of this post: having encountered this problem myself, I’m going to try to extract some contextual information about my own field (condensed matter physics) from web-scraped data.
Condensed matter physicists have one main conference every year. Conveniently for our purposes, the title, abstract, and authors of every talk given at the conference over the past decade are available online. The titles and authors can be scraped in bulk, so that’s what we’ll be working with. A similar analysis could be done for other fields given similar data (for example, the authors and titles of peer-reviewed journal articles); unfortunately that data is somewhat hard to come by, since publishing companies are a little touchy about people scraping their websites. If you’re interested in condensed matter physics, though, the data used in this analysis is available here.
The data is in a separate file for each year, so we’ll pull in all the pieces, concatenate them, and then clean up the data a little bit.
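In rough sketch form, the loading step looks something like this; the file-name pattern is a placeholder, but the columns match the preview that follows:

```python
import glob
import pandas as pd

# One scraped file per year of the conference (placeholder file pattern)
frames = [pd.read_csv(path) for path in sorted(glob.glob('data/abstracts_*.csv'))]
df = pd.concat(frames, ignore_index=True)

# Light cleanup: drop rows with no title, strip stray whitespace, fix the year type
df = df.dropna(subset=['abs_title'])
df['abs_title'] = df['abs_title'].str.strip()
df['year'] = df['year'].astype(int)
```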
After pre-processing, our data looks like this:
abs_authors abs_title abs_url year
1 Keiji Ono Nuclear Spin Induced Oscillatory Current in Sp... /Meeting/MAR05/Session/A1.1 2005
1 Go Yusa Electron-nuclear spin coupling in nano-scale d... /Meeting/MAR05/Session/A1.2 2005
1 Xuedong Hu Nuclear spin polarization and coherence in sem... /Meeting/MAR05/Session/A1.3 2005
1 Mikhail Lukin Controlling Nuclear Spin Environment of Quantu... /Meeting/MAR05/Session/A1.4 2005
1 Silke Paschen Hall effect indicates destruction of large Fer... /Meeting/MAR05/Session/A2.1 2005
Term popularity by year
As a first step, let’s analyze the popularity of a few of the most common terms in abstract titles using scikit-learn’s CountVectorizer. This class takes a list of strings, for example abstract titles, and returns a matrix where each row corresponds to a title and each column corresponds to a single word. The (i, j)-th entry of the matrix is the number of times word j appears in title i. By summing over the rows we get a count of how many times each word shows up across all abstract titles. If we separate the titles by year and repeat the analysis for each year, we can compare the relative popularity of different terms over the past decade.
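Roughly, that step looks like the sketch below; the helper function and variable names are my own:

```python
from sklearn.feature_extraction.text import CountVectorizer

def count_terms(titles):
    """Return a {term: count} dict summed over a list of title strings."""
    vectorizer = CountVectorizer(stop_words='english')
    counts = vectorizer.fit_transform(titles)         # rows = titles, columns = words
    totals = counts.sum(axis=0).A1                    # sum the counts over all titles
    return dict(zip(vectorizer.get_feature_names_out(), totals))
```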
The code above handles the term counting; next we’ll split the abstracts up by year and count each subset individually.
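In sketch form, using the `df` and `count_terms` defined above:

```python
# Count terms separately for each year of the conference
yearly_counts = {}
for year, group in df.groupby('year'):
    yearly_counts[year] = count_terms(group['abs_title'])
```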
Finally we pull everything into a pandas DataFrame and tidy things up a bit.
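The exact tidying steps aren’t important, but roughly:

```python
# One row per term, one column per year, plus a grand total for sorting
term_counts = pd.DataFrame(yearly_counts).fillna(0).astype(int)
term_counts['total_counts'] = term_counts.sum(axis=1)
term_counts = term_counts.sort_values('total_counts', ascending=False)
```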
After counting the occurrences of each individual word for each year, we end up with data that looks like this:
2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 total_counts
total_words 48797 52772 52367 52464 54423 57558 58921 69717 67202 71277 70511 656009
total_titles 6347 6856 6742 6767 6953 7288 7433 8733 8204 8722 8601 82646
quantum 572 548 582 552 597 630 658 793 788 841 804 7365
spin 445 439 465 476 469 530 567 684 680 738 734 6227
graphene 5 31 105 246 322 437 504 577 561 552 496 3836
magnetic 462 422 430 427 451 465 448 517 474 496 488 5080
dynamics 289 313 352 328 291 341 355 388 356 440 456 3909
properties 331 388 365 374 324 385 379 404 402 445 445 4242
effect 298 275 277 291 295 322 297 329 344 443 394 3565
topological 16 10 18 33 38 115 188 275 311 415 392 1811
Let’s take a look at the trends for a couple of terms.
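The plotting itself is straightforward with matplotlib; here’s a sketch, with the particular terms picked by hand:

```python
import matplotlib.pyplot as plt

years = [c for c in term_counts.columns if c != 'total_counts']
for term in ['nanotubes', 'graphene', 'topological']:   # hand-picked terms
    if term not in term_counts.index:
        continue
    plt.plot(years, term_counts.loc[term, years], marker='o', label=term)

plt.xlabel('year')
plt.ylabel('occurrences in abstract titles')
plt.legend()
plt.show()
```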
This broadly tracks with what I know about the field:
- carbon nanotubes were hot stuff 10-15 years ago, but are a bit passé now
- graphene first showed up in 2004, blew up for a couple years, but now interest is fading
- these days the cool kids are looking at new 2D materials like MoS\(_2\) and ‘topological’ materials
So far we’re just confirming pre-existing impressions; what if we want to discover new trends without knowing what terms to search for? One simple option is to make a heatmap of term frequency versus year. Here we’ll do that for the 25 most common single words and the 25 most common two-word tuples:
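Here’s a sketch of the two-word version (the single-word version just drops the `ngram_range` argument); I’m using seaborn for the heatmap, though any plotting library would do:

```python
import pandas as pd
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer

def yearly_term_matrix(df, top_n=25, ngram_range=(1, 1)):
    """Rows = the top_n most common terms, columns = years, values = counts."""
    vectorizer = CountVectorizer(stop_words='english', ngram_range=ngram_range)
    counts = {}
    for year, group in df.groupby('year'):
        mat = vectorizer.fit_transform(group['abs_title'])
        totals = mat.sum(axis=0).A1
        counts[year] = dict(zip(vectorizer.get_feature_names_out(), totals))
    table = pd.DataFrame(counts).fillna(0)
    top_terms = table.sum(axis=1).nlargest(top_n).index
    return table.loc[top_terms]

sns.heatmap(yearly_term_matrix(df, ngram_range=(2, 2)), cmap='viridis')
```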
The heatmaps convey the same information we plotted above (topological insulators are hot, carbon nanotubes are not), but they also expose a new trend: studies of spin-orbit interactions are just as popular as studies of topological insulators. This is exactly the kind of contextual information that’s useful to have: knowing what’s popular, I can search for the specific subject and read a few of the most important papers on the topic. This leaves me much better informed about the state of the field than reading a smattering of papers from the top journals.
So far we’ve addressed the broad outlines of the field, but as a grad student I’m more often concerned with the small subset of the field that’s directly related to my own research. Next let’s identify those researchers whose work is most similar to my own.
Similar authors
To find researchers doing similar work we’re going to use a word-counting approach much like the one above; this time, however, we’ll group the abstract titles by author instead of by year.
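In sketch form, assuming the DataFrame has one row per (author, abstract) pair:

```python
# Concatenate each author's titles into one long string and count their abstracts
by_author = (df.groupby('abs_authors')
               .agg(num_abstracts=('abs_title', 'size'),
                    concat_titles=('abs_title', ' '.join)))

# Drop authors who appear on only a single abstract
by_author = by_author[by_author['num_abstracts'] > 1]
```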
Note that we’re removing all authors who only show up on one abstract. Without this filtering the similar author list is often dominated by a single relevant abstract title. After re-arranging, our data looks like this:
num_abstracts concat_titles
abs_authors
Andrey F. Vilesov 4 Infrared spectra and intensities of H$_{2}$O-N...
Andrey Gromov 8 Exact soliton solutions in a many-body system ...
Andrey Ignatov 2 Nanoelectrical probing with multiprobe SPM Sys...
Andrey Iljin 3 Light sensitive liquid crystals: Focusing on s...
Andrey Kiselev 4 Measurement of the Spin Relaxation Lifetime (T...
Previously we were interested in the raw frequencies of different terms. Now we want to assess the similarity of two sets of words (those in the titles of two different authors’ abstracts). We could simply compare word frequencies between the two authors (i.e. take the inner product of their count vectors); however, that comparison would be dominated by a few common words. Instead, we need some way to weight each word so that rarer, more informative words count for more.
One popular solution to this problem is term frequency-inverse document frequency (tf-idf) weighting. The Wikipedia article gives a thorough explanation, but to summarize, the weighting combines two factors (there’s a formula sketch after this list):
- term frequency: how often the term appears in a given author’s abstract titles
- inverse document frequency: a penalty based on how many titles the term appears in overall, so that ubiquitous words are down-weighted
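Putting the two together, the weight of term \(t\) for author \(a\) has the general form

\[ w_{t,a} = \mathrm{tf}(t, a) \times \log\frac{N}{n_t}, \]

where \(\mathrm{tf}(t, a)\) is the number of times term \(t\) appears in author \(a\)’s titles, \(N\) is the total number of authors, and \(n_t\) is the number of authors whose titles contain \(t\). (Scikit-learn’s implementation adds some smoothing and normalization, but the idea is the same.)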
Scikit-learn has a ready-made class which we’ll use to transform the data according to this weighting.
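A sketch of that step; I’m using TfidfVectorizer here, though the same thing can be done with CountVectorizer plus TfidfTransformer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
author_matrix = tfidf.fit_transform(by_author['concat_titles'])   # authors x terms
```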
Now that we have an appropriately weighted count matrix, we can find similar authors by taking the dot product of the count matrix and a count vector corresponding to terms that we care about.
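Something like the sketch below does the job; since TfidfVectorizer L2-normalizes each row by default, these dot products are effectively cosine similarities:

```python
def most_similar_authors(query, n=10):
    """Rank authors by tf-idf similarity to a query string."""
    query_vec = tfidf.transform([query])                  # project into the same tf-idf space
    scores = author_matrix.dot(query_vec.T).toarray().ravel()
    return by_author.assign(score=scores).nlargest(n, 'score')
```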
As a test, let’s find authors whose work is similar to my most recently submitted paper, which is titled ‘Single Gate P-N Junctions in Graphene-Ferroelectric Devices’.
num_abstracts concat_titles
abs_authors
J. Henry Hinnefeld 3 Strain Effects in Graphene Transport Measureme...
Ruijuan Xu 3 Scanned Probe Measurements of Graphene on Ferr...
Mohammed Yusuf 4 Characterization Of Graphene-Ferroelectric Sup...
Xu Du 26 Bragg Spectroscopy of Excitations of a Quantum...
Maria Moura 2 Rippling of Graphene Tearing of Graphene
Chunning Lau 10 Supercurrent in Graphene Josephson Transistors...
Gang Liu 20 Nano-meter structured three-phase contact line...
Wenzhong Bao 45 Thermopower of Few-Layer Graphene Devices Spin...
Philip Kim 128 Electric Transport in MoSe molecular Nanowires...
Chun Ning Lau 36 Scanned Probe Imaging of Nanoscale Conducting ...
Looks pretty good – the two most similar researchers are my collaborator and me. I recognize some of the other names; for example, Philip Kim is one of the two or three biggest names in graphene research. Several of the other names are a mystery to me though. As a final step, let’s use some graph-based analysis to figure out the context surrounding these other researchers.
Graph-based analysis
We’ll start by building a graph of all coauthors of the 10 most similar authors.
For each author in the graph we’ll add edges joining that author to anyone else in the graph with whom they have been coauthors. Having populated our graph, I’ll borrow some D3.js code from here to visualize the graph and look for relationships between the similar authors.
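The graph construction itself is a few lines of networkx; here’s a sketch, again assuming one row per (author, abstract) pair as above:

```python
import itertools
import networkx as nx

G = nx.Graph()
similar_authors = most_similar_authors(my_title).index    # the 10 authors found above

# Every abstract involving at least one similar author, and everyone listed on it
relevant_urls = df[df['abs_authors'].isin(similar_authors)]['abs_url']
coauthor_rows = df[df['abs_url'].isin(relevant_urls)]

# Join every pair of authors who share one of those abstracts
for url, group in coauthor_rows.groupby('abs_url'):
    for a, b in itertools.combinations(group['abs_authors'].unique(), 2):
        G.add_edge(a, b)

# Store each author's total abstract count; the D3 figure uses it for circle size
abstract_counts = df['abs_authors'].value_counts()
for author in G.nodes:
    G.nodes[author]['num_abstracts'] = int(abstract_counts.get(author, 0))
```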
In the figure below, each circle corresponds to an author, and the size of each circle indicates how many abstracts list that author as a contributor. (Try clicking and dragging one of the circles)
A few things are immediately apparent:
- Philip Kim is a big deal.
- Wenzhong Bao, Chun Ning Lau, and Gang Liu work together closely. They also have strong ties to Philip Kim.
- A few other groups work largely independently.
As with the keyword analysis above, now that I have some contextual information I can keep track of developments in the field much more efficiently. First, I know which other authors are doing work that is particularly relevant to my own, so I can follow their publications closely. Second, I have a better feel for the relationships between different researchers working in my area, so I can be more strategic about looking for collaborators (and scoping out potential competitors). Finally, the raw size of each cluster helps me develop an intuitive feel for the relative prominence of different heavyweights in the field. Best of all, I can easily gain the same level of intuition about my next project by changing one or two lines of code. Taken together, these factors leave me much better positioned to be a productive, informed, and successful grad student. Now to build a time machine and explain all this to my first-year self …
If you’d like to play around with the data or the analysis yourself, the Jupyter notebook for this post is here and the data is here.