New Reed CompBio Blog

As part of a recently-funded collaborative REU (generously supported by the CRA-W), my colleague Derek Applewhite and I are working with undergraduates to study machine learning methods to predict genes that regulate cell movement patterns in schizophrenia.  The team will post their work on a new Reed College blog, The Pathway Not Taken, and I may re-post selected pieces here.  The first post gives a general idea of the problem we will work on, and how biology and computer science are intertwined in the project.

Summer Reading List

For the first time in years, I’m making an effort to read some books for fun this summer.  I even made a list! There’s a theme, though – I might throw in a mystery novel for good measure.

  1. The Gene: An Intimate History by Siddhartha Mukherjee.  This book, by the author of The Emperor of All Maladies (which I wrote a bit about in an earlier post), is a detailed history of genes – from initial theories to current events.  (Cover image from Google Books.)
  2. Inventing the Mathematician: Gender, Race, and Our Cultural Understanding of Mathematics by Sara N. Hottinger.  The author is a professor of Women’s and Gender Studies.  I first heard of her book after reading her piece on the Inside Higher Ed blog, where she described why she decided to pursue a degree in feminist studies despite her passion and aptitude for mathematics.
  3. The Fuzzy and the Techie: Why the Liberal Arts Will Rule the Digital World by Scott Hartley.  The author is a venture capitalist who writes about a new generation of entrepreneurs with a mix of STEM and liberal arts training.

 

Fixing science, one researcher at a time

A recent column in Nature caught my attention today.  The piece, No researcher is too junior to fix science by John Tregoning, talks about the problematic competitiveness of science.  Tregoning ends the column with an appeal to combat this current scientific culture.

Let’s strive instead to stand together. One science historian called last month’s science march unprecedented in its scale and breadth. That energy and optimism need not dissipate — it should be funnelled into making the system function better. The pay-off might not be immediate, but let’s play the long game so that all can win.

I couldn’t have put it better.

Congratulations Seniors

It’s been a long while since I posted, so I figured it’s appropriate to resume this blog by congratulating all the seniors at Reed who have labored over their theses throughout the year.  This afternoon, they will ceremoniously burn all their drafts at the bonfire.  Happy Renn Fayre!

Spotted outside my office door in the Biology building.

Field Trip! Computational Biology on the Road

A few weeks ago I took my students to the Association for Computing Machinery Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB) in Seattle, WA.  It was a fantastic experience for everyone involved – the organizers did an excellent job running the conference.  I asked my students to reflect on the conference, and I figured I should do the same.

With such a large cohort of undergraduates at a scientific conference, my role shifted to encompass that of an educator as well as a researcher.  I homed in on the accessibility of the material in talks, feeling a bit of pride when the speakers showed an image or mentioned a topic I have taught in class.  I also had some moments of “wow, should have taught them that” when a speaker presented a fundamental concept we have not yet covered.  Many of my students came out of sessions excited about what they had just learned – they talked with the speakers, asked for their papers, and are now delving into this new material.  Graduate student attendees became mentors, fielding questions about why they went to graduate school and how they picked their research topic.

ACM-BCB was an ideal size – the conference had compelling talks and tutorials while being small enough to chat with the keynote speakers and conference organizers.  I caught up with existing colleagues and met some potential collaborators in the Pacific Northwest.  I also found myself in discussions with  graduate students about my position in a liberal arts environment.  Reed had a research presence, since three Reed students submitted posters to the poster session.  My students had garnered enough research experience — either through their thesis, summer research, or independent projects in class — to have engaging conversations with other attendees.

Finally, the trip to ACM-BCB as a class taught everyone (including me) the importance of logistics.  Some gems:

  1. Make sure the taxi to the train station can fit the entire group.
  2. Remember who you gave the posters to in your mad dash to find parking before your train departs (see #1).
  3. Make sure your PCard credit limit is set so it’s not declined at the hotel.
  4. Tell your students the correct time of the first keynote.

And the question of the day: is a (very detailed) receipt for a can of soda written on a napkin by a bartender reimbursable?

Gender & Racial Disparities in Big Cancer Data

As a researcher who works with large publicly available biological datasets, I was reminded of the potential biases in big data when I came across this blog post from the University of Michigan Health Lab:

How Genomic Sequencing May Be Widening Racial Disparities in Cancer Care.  Nicole Fawcett, Aug 17, 2016.

Cancer is a notoriously heterogeneous disease, meaning that different patients with the same cancer type may harbor different sets of mutations.  Further, many genes associated with cancer tend to be mutated at very low frequencies in tumors [1].   In order to gain enough statistical power to confidently identify these rare “driver” mutations, we need data from hundreds to thousands of tumor samples.  Obtaining such a large number of samples often requires collecting tissues whenever possible.

The Cancer Genome Atlas (TCGA) is a massive data repository for dozens of cancers, containing data from hundreds to thousands of individuals for most cancer types.  The post above describes a recent study that determined the racial breakdown of tumor samples in 10 of the 31 tumor types from TCGA.  The authors found that while the samples were racially diverse – even, in some cases, matching the U.S. population – the numbers of African-American, Asian, and Hispanic samples were too small to identify group-specific mutations with 10% frequency for any tumor type except breast cancer in African-Americans.  On the other hand, there were enough Caucasian samples in every tumor type to identify mutations with 10% frequency in the population (and 5% frequency for 8 of the 10 tumor types assessed).  Consequently, we identify more “rare” mutations that pertain to Caucasians simply because we have more data to support the findings.  Further, only 3% of the total samples were Hispanic, while Hispanics comprise 16% of the U.S. population.
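To get a rough feel for why the number of samples matters so much, here is a minimal sketch of a sample-size calculation.  It is not the method used in the study above: it assumes a plain binomial model in which each tumor independently carries a given mutation with probability `freq`, and it asks how many samples are needed to observe the mutation in at least `k` of them with a chosen probability.  The function names and the default of 90% are my own illustrative choices, and a real power analysis would also account for background mutation rates.

```python
from math import comb


def prob_at_least(n: int, freq: float, k: int) -> float:
    """P(X >= k) for X ~ Binomial(n, freq): the chance that at least k of n
    samples carry a mutation that occurs at frequency `freq`."""
    below = sum(comb(n, i) * freq**i * (1 - freq) ** (n - i) for i in range(k))
    return 1.0 - below


def min_samples(freq: float, k: int, power: float = 0.9, max_n: int = 100_000) -> int:
    """Smallest n for which P(at least k mutated samples) >= power.

    Simplifying assumption: samples are independent and detection is perfect.
    """
    for n in range(k, max_n + 1):
        if prob_at_least(n, freq, k) >= power:
            return n
    raise ValueError("no sample size up to max_n reaches the requested power")


if __name__ == "__main__":
    # Compare the two mutation frequencies mentioned in the post (10% and 5%).
    for freq in (0.10, 0.05):
        print(freq, min_samples(freq, k=5))
```

Even in this simplified model, halving the mutation frequency roughly doubles the number of samples required, which is why a group with only a few dozen samples cannot support the same discoveries as a group with several hundred.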

This disparity is not limited to race.  Gender representation in big cancer data has also been in the press.  The under-representation of women in sex-nonspecific cancer research over the past 15 years has been reviewed by Hoyt and Rubin (Cancer 2012), who noted that this gap may be widening.

Want to see the discrepancies for yourself?  The data is easy enough to obtain, but Enpicom has a fantastic interactive visualization of the entire TCGA data repository by patient gender, race, and age.

Consider glioma, for example – while the incidence rate of brain tumors is higher in women than in men [2], women comprised only 41.4% of the over 1,100 samples.

Even more alarmingly, over 88% of the samples are Caucasian.

There is evidence of higher incidence rates of brain cancer in Caucasians compared to African-Americans and Hispanics, but surely this doesn’t justify the over-representation in this dataset.

So, what should we do?

On one hand, we need to carefully design data collection efforts to ensure that different racial/ethnic groups are adequately represented – not simply to reflect the proportion in the U.S. population but to gain enough statistical power to confidently identify rare mutations.  On the other hand, “convenience sampling” methods of obtaining tumors from the most convenient places, even if the population is homogeneous, have enabled consortia to collect enough data in the first place.  In fact, we better understand the “rare mutation” concept due to the mostly-white patient data collected by TCGA and others.

The only clear answer is that we need more data.


[1] This is often called the “long tail” distribution of cancer gene mutations.  For more information, see, for example,  Lessons from the Cancer Genome. Garraway and Lander.  Cell 2013.

[2] All primary malignant and non-malignant brain and CNS tumors.  In fact, the incidence rate of malignant brain tumors is slightly higher in men.  Cancer statistics from the Central Brain Tumor Registry of the United States.