Summer Research Highlight

Seven fantastic undergrads & recent grads are working with me this summer, and we’ve already made a ton of progress.  We have a separate student blog, The Pathway Not Taken, which was established as part of a Computing Research Association Collaborative REU grant (I hope that program comes back; it was great).

First up is Amy Rose Lazarte, who just graduated from Reed. Before heading to Puppet Labs as a software engineer, she’s working to build models of phytoplankton fitness in freshwater lakes.  Read her post for more info:

Ecology Modeling: Thermal Variation and Phytoplankton Fitness

(If you want to learn a bit about all projects, read my summer kickstarter post).

Gender & Racial Disparities in Big Cancer Data

As a researcher who works with large publicly available biological datasets, I was reminded of the potential biases in big data when I came across this blog post from the University of Michigan Health Lab:

How Genomic Sequencing May Be Widening Racial Disparities in Cancer Care. Nicole Fawcett, Aug 17, 2016.

Cancer is a notoriously heterogeneous disease, meaning that different patients with the same cancer type may harbor different sets of mutations.  Further, many genes associated with cancer tend to be mutated at very low frequencies in tumors [1].   In order to gain enough statistical power to confidently identify these rare “driver” mutations, we need data from hundreds to thousands of tumor samples.  Obtaining such a large number of samples often requires collecting tissues whenever possible.
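
To get a feel for the scale involved, here is a rough back-of-the-envelope sketch of my own (an illustration, not the power analysis from any particular study): it estimates how many tumor samples a cohort needs before a mutation present in a given fraction of tumors is likely to show up enough times to stand out.

```python
# Back-of-the-envelope: how many tumor samples do we need before a mutation
# present in a fraction f of tumors is likely to appear enough times to call?
# Illustrative only -- real driver-detection methods model background mutation
# rates per gene, not just presence/absence.
from scipy.stats import binom

def samples_needed(f, min_hits=10, prob=0.95, max_n=20000):
    """Smallest cohort size n such that we observe >= min_hits mutated samples
    with probability >= prob, if the mutation occurs in a fraction f of tumors."""
    for n in range(min_hits, max_n):
        # P(X >= min_hits) when X ~ Binomial(n, f)
        if 1 - binom.cdf(min_hits - 1, n, f) >= prob:
            return n
    return None

for f in (0.10, 0.05, 0.02):
    print(f"mutation frequency {f:.0%}: roughly {samples_needed(f)} samples")
```

Even under these generous assumptions, a 10% mutation frequency calls for a cohort in the low hundreds, and the requirement climbs toward a thousand samples as the frequency drops to a few percent, which is why consortium-scale collections matter.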

The Cancer Genome Atlas (TCGA) is a massive data repository for dozens of cancers, containing data from hundreds to thousands of individuals for most cancer types.  The post above describes a recent study that determined the racial breakdown of tumor samples in 10 of the 31 tumor types from TCGA.  They found that while the samples were racially diverse — even, in some cases, matching the U.S. population — the numbers of African-American, Asian, and Hispanic samples were too small to identify group-specific mutations occurring at 10% frequency for any tumor type except breast cancer in African-Americans. On the other hand, there were enough Caucasian samples in every tumor type to identify mutations occurring at 10% frequency in that population (and at 5% frequency for 8 of the 10 tumor types assessed).  Consequently, we identify more “rare” mutations that pertain to Caucasians simply because we have more data to support the findings.  Further, only 3% of the total samples were Hispanic, while Hispanics comprise 16% of the U.S. population.

This disparity is not limited to race.  Gender representation in big cancer data has also been in the press.  The under-representation of women in studies of sex-nonspecific cancers over the past 15 years has been reviewed by Hoyt and Rubin (Cancer, 2012), who noted that this gap may be widening.

Want to see the discrepancies for yourself?  The raw data are easy enough to obtain, but Enpicom has already built a fantastic interactive visualization of the entire TCGA data repository by patient gender, race, and age.
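
If you do want to pull the demographics yourself, here is a minimal sketch using the NCI Genomic Data Commons REST API, which now hosts the TCGA data. The endpoint, field names, and project IDs (TCGA-LGG and TCGA-GBM for the glioma cohorts discussed below) are my assumptions; check the current GDC API documentation before relying on them.

```python
# Sketch: tally gender and race for the TCGA glioma cohorts via the GDC API.
# Field names and project IDs are assumptions -- consult the GDC API docs.
import json
from collections import Counter
import requests

CASES_ENDPOINT = "https://api.gdc.cancer.gov/cases"

filters = {
    "op": "in",
    "content": {
        "field": "project.project_id",
        # Assumed TCGA glioma cohorts: lower-grade glioma and glioblastoma
        "value": ["TCGA-LGG", "TCGA-GBM"],
    },
}

params = {
    "filters": json.dumps(filters),
    "fields": "demographic.gender,demographic.race",
    "size": "2000",          # large enough to return every case in one page
    "format": "JSON",
}

hits = requests.get(CASES_ENDPOINT, params=params).json()["data"]["hits"]

genders = Counter(h.get("demographic", {}).get("gender", "not reported") for h in hits)
races = Counter(h.get("demographic", {}).get("race", "not reported") for h in hits)

print("gender:", dict(genders))
print("race:  ", dict(races))
```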

[Screenshot: Enpicom’s interactive visualization of the TCGA repository by gender, race, and age]

Consider glioma, for example – while the incidence rate of brain tumors is higher in women than in men [2], women comprised only 41.4% of the over 1,100 samples.

[Screenshot: gender breakdown of the TCGA glioma samples]

Even more alarmingly, over 88% of the samples are Caucasian.

[Screenshot: racial breakdown of the TCGA glioma samples]

There is evidence of higher incidence rates of brain cancer in Caucasians compared to African-Americans and Hispanics, but surely this does not justify the over-representation in this dataset.

So, what should we do?

On one hand, we need to carefully design data collection efforts to ensure that different racial/ethnic groups are adequately represented – not simply to reflect their proportions in the U.S. population, but to gain enough statistical power to confidently identify rare mutations.  On the other hand, “convenience sampling” – obtaining tumors from wherever they are most readily available, even if the resulting population is homogeneous – is what has enabled consortia to collect enough data in the first place.  In fact, we largely understand the “rare mutation” problem because of the mostly-white patient data collected by TCGA and others.

The only clear answer is that we need more data.


[1] This is often called the “long tail” distribution of cancer gene mutations.  For more information, see, for example,  Lessons from the Cancer Genome. Garraway and Lander.  Cell 2013.

[2] All primary malignant and non-malignant brain and CNS tumors.  In fact, the incidence rate of malignant brain tumors is slightly higher in men.  Cancer statistics from the Central Brain Tumor Registry of the United States.

Yep, cancer is still complicated


If you haven’t read The Emperor of All Maladies: A Biography of Cancer by Siddhartha Mukherjee, I would highly recommend it. And if you would rather watch, Ken Burns produced a documentary based on the book that recently aired on PBS.  While we have come a long way in cancer research, it is alarming how little we still know about the disease.  In the age of personalized medicine and a plethora of cancer datasets, you would think that cancer would be getting, at the very least, more understandable.  This New York Times opinion article gives a few examples where finding a druggable mutation is not as easy as one would hope.

Trying to Fool Cancer – NYTimes.com.

Ephemeralization

This WIRED article resonated with the New Media Seminar I’m taking at Virginia Tech.

Big Data: One Thing to Think About When Buying Your Apple Watch | WIRED.

I hadn’t heard of the term ephemeralization, coined by Buckminster Fuller, before: it is the promise of technology to do “more and more with less and less until eventually you can do everything with nothing.” Fuller cites Ford’s assembly line as one example of ephemeralization.  Ali Rebaie, the author of the WIRED article, writes that the Big Data movement is another form of it: our ability to analyze huge datasets has led to the design of more efficient technology.  All in all, Fuller seems to fit right in with the others we have been reading in the seminar.

The vision of machine learning, from 1950

Reading: “Computing Machinery and Intelligence” by Alan Turing. Mind: A Quarterly Review of Psychology and Philosophy 59(236):433-460. October 1950. (Reprints are easy to find with a quick Google search.)

Computer science majors will learn about the famous Turing Machine in any introductory Theory of Computation class.  They might get only a cursory mention of the “Imitation Game,” the subject of this article (with the recent movie, this may change).  I am intrigued by so many aspects of this article, but I will limit my observations to two items.

Part I: Could this article be published today?

The notion of the “Imitation Game” and an exploration of its feasibility is incredibly forward-thinking for Turing’s time  — so much so that he admits to his audience that he doesn’t have much in the way of proof.

The reader will have anticipated that I have no very convincing arguments of a positive nature to support my views.  If I had I should not have taken such pains to point out the fallacies in contrary views.

The article was published in a philosophy journal, so Turing was able to take idealistic positions that were not practical at the time (though many of his arguments are closer to reality today).  Yet he does not dwell on arguments establishing the feasibility of such a computer (or its program); instead, he lays out a framework for “teaching” machines to play the Imitation Game.  In his descriptions I can easily see the foundations of fundamental computer science sub-disciplines such as artificial intelligence and machine learning.  He truly was an innovative thinker for his time.  I wonder whether a similarly forward-thinking article, with so little evidence for its idealistic scenarios, would be published today.  Perhaps there is a Turing of 2015 trying to convince the scientific community of a potential technological capacity that will only be confirmed fifty years from now.

Part II: Scale

There are many numbers in Turing’s article relating to the amount of storage capacity required for a computer to successfully participate in the Imitation Game.  He didn’t seem to be too worried about storage requirements:

I believe that in about fifty years’ time it will be possible to programme computers, with a storage capacity of about 10⁹, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent. chance of making the right identification after five minutes of questioning.

I was interested in seeing how accurate his estimates were.  Keep in mind that 10×10² = 10³; that is, each time the exponent increases by one, we multiply the quantity by 10. For example, consider the capacity of the Encyclopaedia Britannica:

  • 2×10⁹: capacity of the Encyclopaedia Britannica, 11th Ed. (Turing, 1950)
  • 8×10⁹: capacity of the Encyclopaedia Britannica, 2010 Ed. (the last one to be printed)

We see that the size of the encyclopedia has quadrupled in the past 60 years.  Now, let’s look at Turing’s estimates of the capacities of both a future computer and the human brain.

  • 10⁹: capacity of a computer by 2000 (Turing, 1950)
  • 10¹⁰-10¹⁵: estimated capacity of the human brain (Turing, 1950)
  • 3×10¹⁰: standard memory of a MacBook Pro, 2015 (4 GB of memory)
  • 4×10¹²: standard storage of a MacBook Pro, 2015 (500 GB of storage)
  • 8×10¹²-8×10¹³: estimated capacity of the human brain (Thanks Slate, 2012)
  • 2×10¹³: a pretty cheap external hard drive (3 TB)

Our current laptops can hold more bits in memory than Turing estimated a computer would need for the game!  Pretty amazing.  Consider the speed (in FLOPS = floating point operations per second) of two of the world’s supercomputers:

  • 80×10¹²: IBM’s Watson, designed to answer questions on Jeopardy (80 TeraFLOPS)
  • 33.86×10¹⁵: Tianhe-2, the world’s fastest supercomputer according to TOP500 (33.86 PetaFLOPS)

In 2011, USC researchers suggested that we could store about 295 exabytes of information, which translates to 2.3×10²¹ bits.  That’s a number even I cannot comprehend.
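
If you want to check these conversions yourself, here is a quick sketch. It assumes decimal prefixes (1 GB = 10⁹ bytes, 1 TB = 10¹² bytes, 1 EB = 10¹⁸ bytes) and 8 bits per byte.

```python
# Sanity check of the bit counts above (decimal prefixes, 8 bits per byte).
BITS_PER_BYTE = 8

laptop_memory_bits = 4e9 * BITS_PER_BYTE      # 4 GB of RAM   -> 3.2e10 bits
laptop_storage_bits = 500e9 * BITS_PER_BYTE   # 500 GB disk   -> 4.0e12 bits
external_drive_bits = 3e12 * BITS_PER_BYTE    # 3 TB drive    -> 2.4e13 bits
world_storage_bits = 295e18 * BITS_PER_BYTE   # 295 exabytes  -> ~2.36e21 bits

turing_estimate_bits = 1e9                    # Turing's figure for the year 2000

print(f"laptop RAM vs. Turing's estimate: {laptop_memory_bits / turing_estimate_bits:,.0f}x")
print(f"world storage (2011 estimate) in bits: {world_storage_bits:.2e}")
```

By this count, the RAM in a 2015 laptop alone is roughly 30 times the storage Turing thought sufficient for the imitation game.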

We have the data – now what?

This is the first post for a New Media Seminar that I am participating in at Virginia Tech.  The main site for this seminar includes comments on weekly readings.  Seminar Twitter Hashtag: #vtnmss15.

Reading: “As We May Think” by Vannevar Bush, Atlantic Monthly, 176(1):101-108. (July 1945).

Many of the passages in Vannevar Bush’s article As We May Think resonate with the current state of scientific research.  As I was reading the first part of the article, I could see the desire to accumulate and aggregate massive amounts of information, making data collection a more efficient process.  In the back of my mind I kept thinking, “But what next? What happens after you have obtained all this information?”  In biological discovery, we are exactly at this point: the development of high-throughput technologies is producing enormous amounts of observations about biological systems.  As examples, over 1,000 human genomes have been sequenced to better understand genetic variation in human populations, and thousands of cancer genomes have been measured to identify underlying mechanisms of cancer.  How are we going to use these datasets to further scientific discovery?  Then I came across this paragraph:

So much for the manipulation of ideas and their insertion into the record. Thus far we seem to be worse off than before—for we can enormously extend the record; yet even in its present bulk we can hardly consult it. This is a much larger matter than merely the extraction of data for the purposes of scientific research; it involves the entire process by which man profits by his inheritance of acquired knowledge. The prime action of use is selection, and here we are halting indeed. There may be millions of fine thoughts, and the account of the experience on which they are based, all encased within stone walls of acceptable architectural form; but if the scholar can get at only one a week by diligent search, his syntheses are not likely to keep up with the current scene.

The passage above begins Bush’s response to the pivotal moment when we have collected, annotated, and recorded so much data that it becomes difficult to manage.  In computer science, the sub-discipline of Big Data has emerged exactly to address Bush’s observation that “we can enormously extend the record; yet even in its present bulk we can hardly consult it.” How do we manage this data to aid in scientific discovery, by placing the right information in front of the right experts?  But first, we must know what we’re looking for.

Bush writes that “there may be millions of fine thoughts…all encased within stone walls of acceptable architectural form.”  However, one may need to find just a handful of these thoughts to advance scientific understanding.  We have a needle-in-a-haystack problem, except that the individuals who manage the haystack may not have the expertise to identify the needle once they’ve found it.  Bush’s notion of selection and indexing in subsequent passages begins to capture the need for clear data organization, and the memex is his ultimate vision for organizing one’s knowledge.

Biological research is becoming more and more collaborative, as enormous datasets require sophisticated technologies and algorithms for analysis.  Building upon Bush’s memex, we now need tools and paradigms to describe one individual’s knowledge to others.  Perhaps we will develop a framework that allows researchers to benefit from the “inheritance of acquired knowledge” in an efficient and descriptive manner.  Wikipedia may in fact be a CliffsNotes for everything in life (their statistics page boasts about 800 new articles a day).   Providing a means for inherited knowledge may be a necessity to, as Bush puts it, “keep up with the current scene” of scientific discovery.