Is Healthcare Big Data Biased?

Posted on November 30, 2012 | Written By

Mandi Bishop is a hardcore health data geek with a Master's in English and a passion for big data analytics, which she brings to her role as Dell Health’s Analytics Solutions Lead. She fell in love with her PCjr at 9 when she learned to program in BASIC. Individual accountability zealot, patient engagement advocate, innovation lover and ceaseless dreamer. Relentless in pursuit of answers to the question: "How do we GET there from here?" More byte-sized commentary on Twitter: @MandiBPro.

Have you ever wondered whether YOUR healthcare data is included in the “big data” everyone’s talking about? After all, healthcare big data analytics are going to change the world; shouldn’t those changes be representative of the population they will impact?

To answer that question, we have to identify the sources of the healthcare big data being used to effect change, and consider the likelihood that your data may have been captured and consumed by one of the reporting organizations. So let’s start with the “capture” part of that equation.

Have you received some type of healthcare service this year? That includes, but is not limited to: hospital visit, physical therapy, doctor visit, chiropractor visit, urgent care visit, e-visit or phone consultation, health risk assessment or health fair.

Have you purchased or requested any regulated healthcare product this year, such as prescription drugs?

Do you have private health insurance?

Are you enrolled in Medicare or Medicaid?

If you answered yes to any of the above, and to the last question in particular, then YES, your data is included in the “big data” analytics currently shaping policy. It is likely that each billable product and service is attached to your Electronic Health Record, available for review and reporting by each involved party, from your PCP (Primary Care Provider) to your friendly insurance call center agent. Your individual collection of data points is aggregated into a larger population, then sliced and diced to provide insights for groundbreaking research efforts. Congratulations! But does that inclusion mean that the conclusions driven by healthcare big data are representative?

In principle, data-driven insights become more reliable as the population, and the number of data points, grows. But size alone does not guarantee a representative sample. What if the outliers for the general population are the norm for your data set? Are your conclusions skewed?
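A toy simulation can make that question concrete. The sketch below is purely illustrative: the 20% sick rate and the capture probabilities (90% for sick people, 20% for healthy people) are invented assumptions, not figures from any real data set or report, and the function names are hypothetical. It simply compares the share of sick people in a whole population with their share among the records that actually get captured.

```python
import random


def sick_share(records):
    """Fraction of records flagged as sick (each record is True or False)."""
    return sum(1 for is_sick in records if is_sick) / len(records)


def simulate(n_people=100_000, p_sick=0.20,
             p_capture_sick=0.90, p_capture_healthy=0.20, seed=42):
    """Compare the whole population with the subset whose data is captured.

    All probabilities are made-up assumptions for illustration only.
    """
    rng = random.Random(seed)
    # True means the person is sick; False means healthy.
    population = [rng.random() < p_sick for _ in range(n_people)]
    # A record is captured only if the person used a healthcare service,
    # which is assumed to be far more likely for sick people.
    captured = [
        is_sick for is_sick in population
        if rng.random() < (p_capture_sick if is_sick else p_capture_healthy)
    ]
    return sick_share(population), sick_share(captured), len(captured)


true_share, captured_share, n_captured = simulate()
print(f"Sick share of whole population: {true_share:.1%}")
print(f"Sick share of captured records: {captured_share:.1%}")
print(f"Records captured: {n_captured}")
```

Under these made-up numbers, about one person in five is sick, yet sick people should make up roughly half of the captured records. Collecting more records the same way would not correct that: the data set gets bigger, but it stays skewed.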

What if you represent a population segment that is recognized as underserved? Consider the following, from the first Health Disparities and Inequalities Report, prepared in 2011 by the CDC (Centers for Disease Control and Prevention): “Increasingly, the research, policy, and public health practice literature report substantial disparities in life expectancy, morbidity, risk factors, and quality of life, as well as persistence of these disparities among segments of the population…defined by race/ethnicity, sex, education, income, geographic location, and disability status.”

If your access to healthcare is limited by any of the factors indicated above, your data may not be captured unless/until there is an acute episode which requires medical intervention. In the report, the CDC acknowledges the challenge of capturing national data to support health initiatives for these populations; it is widely accepted as a barrier to healthcare equality that must be overcome.

What if you’re healthy? I’ll use myself as an example. I don’t go to the doctor unless it’s urgent, and I haven’t visited my PCP in over a year. I’ve injured my shoulder and my back over the past year, both of which required MRI and CAT scans to diagnose severity; however, I do not follow any medically supervised treatment plan for rehabilitation. I don’t take any routine prescription medication. I’m an exercise enthusiast who works out intensely 5-6 days/week, and I sleep 8-9 hours a night. Yes, I do sleep that much. And no, putting all this information into a blog does not mean the data has been captured for use in healthcare big data analytics. Because I haven’t needed to go to my PCP lately, don’t take routine prescription medication, and am neither old enough for Medicare nor at an income level that qualifies for Medicaid, the only current healthcare data available for analysis for me is orthopedic in nature and revolves around imaging data, not traditional clinical measures. Someone like me who had NOT experienced an acute care episode would have no current data available for consumption and reporting as part of a larger population.

Could it be that much, if not most, healthcare big data cited for research purposes consists primarily of a triangle of outlier population segments: 1) oldest, 2) poorest, and 3) sickest?

Perhaps. So, when reading about advances in healthcare big data analytics, ask yourself whether that “big data” means “YOUR data”.

PS – For those of you curious about defining “big data” in healthcare, read Dr. Graham Hughes’s blog post for SAS, “How Big Is Big Data In Healthcare?”, which details the nuances of the term as it relates to data size, complexity, and usage. Also, I’d like to thank the good folks at Vanderbilt University for compiling a fairly comprehensive list of healthcare data resources; it has been highly educational. Finally, if you’d like to read the complete CDC report, you can find it here.