Gender & Racial Disparities in Big Cancer Data

As a researcher who works with large publicly available biological datasets, I was reminded of the potential biases in big data when I came across this blog post from the University of Michigan Health Lab:

How Genomic Sequencing May Be Widening Racial Disparities in Cancer Care.  Nicole Fawcett, Aug 17, 2016.

Cancer is a notoriously heterogeneous disease, meaning that different patients with the same cancer type may harbor different sets of mutations.  Further, many genes associated with cancer tend to be mutated at very low frequencies in tumors [1].   In order to gain enough statistical power to confidently identify these rare “driver” mutations, we need data from hundreds to thousands of tumor samples.  Obtaining such a large number of samples often requires collecting tissues whenever possible.

The Cancer Genome Atlas (TCGA) is a massive data repository for dozens of cancers, containing data from hundreds to thousands of individuals for most cancer types.  The post above describes a recent study that determined the racial breakdown of tumor samples in 10 of the 31 tumor types from TCGA.  They found that while the samples were racially diverse (in some cases even matching the U.S. population), the numbers of African-American, Asian, and Hispanic samples were too small to identify group-specific mutations with 10% frequency for any tumor type except breast cancer in African-Americans.  On the other hand, there were enough Caucasian samples in every tumor type to identify mutations with 10% frequency in the population (and 5% frequency for 8 of the 10 tumor types assessed).  Consequently, we identify more “rare” mutations that pertain to Caucasians simply because we have more data to support the findings.  Further, only 3% of the total samples were Hispanic, while Hispanics make up 16% of the U.S. population.
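To get a rough feel for why sample size matters here, the sketch below asks a simplified question: how many tumor samples from one group would we need so that a mutation present at a given frequency shows up in at least k of them with high probability?  This is a plain binomial model of my own, not the power calculation used in the study, and the function names, the threshold k, and the 90% target are all hypothetical choices for illustration.

```python
from math import comb


def prob_at_least(n: int, freq: float, k: int) -> float:
    """Probability that at least k of n samples carry a mutation,
    assuming each sample independently carries it with probability freq
    (i.e. X ~ Binomial(n, freq), return P(X >= k))."""
    below = sum(comb(n, i) * freq**i * (1 - freq) ** (n - i) for i in range(k))
    return 1.0 - below


def min_samples(freq: float, k: int = 3, power: float = 0.9,
                max_n: int = 100_000) -> int:
    """Smallest n for which P(X >= k) reaches the target power.

    k and power are illustrative assumptions, not values from the study.
    """
    for n in range(k, max_n + 1):
        if prob_at_least(n, freq, k) >= power:
            return n
    raise ValueError("no sample size up to max_n reaches the target power")


if __name__ == "__main__":
    for freq in (0.10, 0.05, 0.01):
        print(f"frequency {freq:.0%}: need about {min_samples(freq)} samples")
```

Even in this toy model, the required number of samples grows quickly as the mutation frequency drops, which is why a group that contributes only a few dozen samples to a tumor type is effectively invisible for rare mutations.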

This disparity is not limited to race.  Gender representation in big cancer data has also been in the press.  The under-representation of women in studies of sex-nonspecific cancers over the past 15 years has been reviewed by Hoyt and Rubin (Cancer 2012), who noted that this gap may be widening.

Want to see the discrepancies for yourself?  The data is easy enough to obtain, but Enpicom has a fantastic interactive visualization of the entire TCGA data repository by patient gender, race, and age.

Consider glioma, for example – while the incidence rate of brain tumors is higher in women than in men [2], women comprised only 41.4% of the over 1,100 samples.

Even more alarmingly, over 88% of the samples are Caucasian.

There is evidence of higher incidence rates of brain cancer in Caucasians compared to African-Americans and Hispanics, but surely this doesn’t justify the over-representation in this dataset.

So, what should we do?

On one hand, we need to carefully design data collection efforts to ensure that different racial/ethnic groups are adequately represented – not simply to reflect the proportion in the U.S. population but to gain enough statistical power to confidently identify rare mutations.  On the other hand, “convenience sampling” methods of obtaining tumors from the most convenient places, even if the population is homogeneous, are what enabled consortia to collect enough data in the first place.  In fact, we better understand the “rare mutation” concept because of the mostly-white patient data collected by TCGA and others.

The only clear answer is that we need more data.


[1] This is often called the “long tail” distribution of cancer gene mutations.  For more information, see, for example,  Lessons from the Cancer Genome. Garraway and Lander.  Cell 2013.

[2] All primary malignant and non-malignant brain and CNS tumors.  In fact, the incidence rate of malignant brain tumors is slightly higher in men.  Cancer statistics from the Central Brain Tumor Registry of the United States.

 

 

Ready, Set, Year Two

I have returned from summer break to begin teaching a new course this fall.  My break  included a hiatus in blog posts; now that classes have started up, I’m back to writing them.  Other lessons from my first true “summer break:”

  1. Yep, I still love research. Summer was a refreshing change of pace, where I was able to chip away at existing research projects and establish new collaborations here in Portland.
  2. Pacific Northwest summer weather is great.  No humidity + few bugs. I didn’t think that was possible.
  3. Feelings of preparedness are relative.  Despite having a year under my belt, there are enough new tasks and responsibilities that I still feel like a newbie.

Happy back-to-school for those who live by the academic calendar, and welcome to the Reed Class of 2020.

The Class of 2020 at Reed College’s Convocation.  Photo by Leah Nash.

 

Spotted in the Lab

This is the last week of classes.  Reed seniors are finalizing their theses — a culmination of their year-long projects — before sending them off to faculty readers.  As we near the end, my computational biology lab has a new round of students working night and day.  Don’t worry, though – Monty the Motivation Whale is there for you.

Monty’s appearance might be due to the fact that one of the Reed seniors is a lead scientist at the Orca Behavior Institute, a non-profit he started in 2015.

Pre-prints as a speedup to scientific communication

Tomorrow, I’ll sit on a panel about Open Data and Open Science as part of Reed’s Digital Scholarship Week.  I am somewhat familiar with these topics in computer science, but I decided to read up on the progress with Open Access in Biology.

As a junior professor trying to get a foothold in a research program, I’ll admit that I haven’t spent a lot of time thinking about Open Science.  In fact, the first thing I did was look up what it meant:

Open science is the movement to make scientific research, data and dissemination accessible to all levels of an inquiring society.  – Foster Project Website

Ok, this seems obvious,  especially since so much research is funded by taxpayer dollars.  Surprisingly, Open Science is not yet a reality.  In this post, I’ll focus on the speed of dissemination – the idea that once you have a scientific finding, you want to communicate it to the community in a timely manner.

Biology findings are often shared in the form of peer-reviewed journal publications, where experts in the field comment on drafts before they are deemed acceptable for publication.  Peer-review may be controversial and even compromised (just read a few RetractionWatch posts), but in theory it’s a good idea for others to rigorously “check” your work.  However, the peer-review process can be slow. Painfully slow.  Findings are often published months to even years after the fact.

In computer science, my “home” research discipline, it’s a different story.  Computer science research is communicated largely through conferences, which often include paper deadlines, quick peer-review turnaround times, and a chance to explain your research to colleagues.  Manuscripts that haven’t undergone peer review yet may be posted to arXiv.org, a server hosting over one million papers in physics, mathematics, and other quantitative fields.  Manuscripts submitted to arXiv are freely available to anyone with an internet connection, targeting “all levels of an inquiring society.”

A biology version of the site, BioRxiv.org, was created in 2013 — more than 20 years after arXiv was established.   It only contains about three thousand manuscripts.  What is the discrepancy here?  Why is the field reluctant to change?

Last February, a meeting was held at the Howard Hughes Medical Institute (HHMI) Headquarters to discuss the state of publishing in the biological sciences.  The meeting, Accelerating Science and Publication in Biology (appropriately shortened to ASAPbio), considered how “pre-prints” may accelerate and improve research.  Pre-prints are manuscript drafts that have not yet been peer-reviewed but are freely available to the scientific community.  ASAPbio posted a great video overview about pre-prints, for those unfamiliar with the idea.  While the general consensus was that publishing needs to change, there are still some major factors that make biologists reluctant to post pre-prints (see the infographic below).

This is an excellent time to talk open science in Biology.  It has become a hot topic in the last few months (though some in the field have been pushing for open science for years). The New York Times recently wrote about the Nobel Laureates who are posting pre-prints, and The Economist picked up a story about Zika virus experiment results that were released in real time in an effort to help stop the Zika epidemic.

Open Science has the potential to lead to more scientific impact than any journal or conference publication.  The remaining obstacles lie in determining what pre-prints mean for an academic’s career: in publishing the manuscripts, determining priority of discovery (meaning “I found this first”), and obtaining grants.  I rely on freely-available data and findings in my own research, yet I’ve never published a pre-print.  After writing this post, I think I may start doing so.

[Infographic: opinions on pre-prints]

Additional Sources:

Mick Watson’s 2/22/2016 post about generational change on his blog Opiniomics.

Michael Eisen’s  2/18/2016 post about pre-print posting on his blog it is NOT junk.

Handful of Biologists Went Rogue and Published Directly to Internet, New York Times, 3/15/2016.

Taking the online medicine, The Economist, 3/19/2016.