We have the data – now what?

Reading: “As We May Think” by Vannevar Bush, Atlantic Monthly, 176(1):101-108. (July 1945).

Many of the passages in Vannevar Bush’s article “As We May Think” resonate with the current state of scientific research. As I was reading the first part of the article, I could see the desire to accumulate and aggregate massive amounts of information, making data collection a more efficient process. In the back of my mind I kept thinking, “But what next? What happens after you have obtained all this information?” In biological discovery, we are exactly at this point; the development of high-throughput technologies is producing enormous amounts of observations about biological systems. As examples, 1,000 human genomes have been sequenced to better understand genetic variation in human populations, and thousands of cancer genomes have been measured to identify underlying mechanisms of cancer. How are we going to use these datasets to further scientific discovery? Then, I came across this paragraph:

So much for the manipulation of ideas and their insertion into the record. Thus far we seem to be worse off than before—for we can enormously extend the record; yet even in its present bulk we can hardly consult it. This is a much larger matter than merely the extraction of data for the purposes of scientific research; it involves the entire process by which man profits by his inheritance of acquired knowledge. The prime action of use is selection, and here we are halting indeed. There may be millions of fine thoughts, and the account of the experience on which they are based, all encased within stone walls of acceptable architectural form; but if the scholar can get at only one a week by diligent search, his syntheses are not likely to keep up with the current scene.

The passage above begins Bush’s response to the pivotal moment when we have collected, annotated, and recorded so much data that it becomes difficult to manage. In computer science, the sub-discipline of Big Data has emerged exactly to address Bush’s comment that “we can enormously extend the record; yet even in its present bulk we can hardly consult it.” How do we manage this data to aid scientific discovery by placing the right information in front of the right experts? But first, we must know what we’re looking for.

Bush writes “there may be millions of fine thoughts…all encased within stone walls of acceptable architectural form.” However, one may need to find only a handful of these thoughts to advance scientific understanding. We have a needle-in-a-haystack problem, except that the individuals who manage the haystack may not have the expertise to identify the needle once they’ve found it. Bush’s notion of selection and indexing in subsequent passages begins to capture the need for clear data organization, and the memex is his ultimate idea of organizing one’s knowledge.

Biological research is becoming more and more collaborative, as enormous datasets require sophisticated technologies and algorithms for analysis. Building upon Bush’s memex, we now need tools and paradigms to describe one individual’s knowledge to others. Perhaps we will develop a framework that allows researchers to benefit from the “inheritance of acquired knowledge” in an efficient and descriptive manner. Wikipedia may in fact be a CliffsNotes for everything in life (its statistics page boasts about 800 new articles a day). Providing a means of passing on inherited knowledge may be a necessity to, as Bush puts it, “keep up with the current scene” of scientific discovery.


  1. The “now what” problem is key. Corporations certainly know what to do with our data: use it to sell to us in ever more creative ways that tend to pigeonhole and stalk us across multiple platforms (http://bit.ly/1FDLVf4). But what are the rest of us to do with it? Much of it we don’t even have access to, nor would we have the tools to parse it if we did, as you point out. Many interesting things to ponder here.


  2. That passage stood out to me too, mainly because it made me think about how freakin’ much I use search functions on computers. I mean, I remember being young and learning how to “search” within word processing documents and among file names on my own computer, but if I recall correctly, you had to be pretty precise about what you were looking for and know what you had called it. Now I regularly search not just within the texts of files (thanks to OCR’ed PDFs) and websites, but also within books and articles I don’t even own. When we discuss readings in class, I often search for passages based on key phrases I remember from them, which is so different from having to remember more specifically where the passage was. I wonder how these changing search practices affect people’s work practices more generally.


