Newly Released Open Source Libraries for Health Analytics from Health Catalyst

Posted on December 19, 2016 | Written By

Andy Oram is an editor at O'Reilly Media, a highly respected book publisher and technology information provider. An employee of the company since 1992, Andy currently specializes in open source, software engineering, and health IT, but his editorial output has ranged from a legal guide covering intellectual property to a graphic novel about teenage hackers. His articles have appeared often on EMR & EHR and other blogs in the health IT space. Andy also writes often for O'Reilly's Radar site and other publications on policy issues related to the Internet and on trends affecting technical innovation and its effects on society. Print publications where his work has appeared include The Economist, Communications of the ACM, Copyright World, the Journal of Information Technology & Politics, Vanguardia Dossier, and Internet Law and Business. Conferences where he has presented talks include O'Reilly's Open Source Convention, FISL (Brazil), FOSDEM, and DebConf.

I celebrate and try to report on each addition to the pool of open source resources for health care. Some, of course, are more significant than others, and I suspect the new libraries released by the Health Catalyst organization will prove to be one of the significant offerings. One can do a search for health care software on sites such as GitHub and turn up thousands of hits (of which many are probably under open and free licenses), but for a company with the reputation and accomplishments of Health Catalyst to open up the tools it has been using internally gives great legitimacy from the start.

According to Health Catalyst’s Director of Data Science Levi Thatcher, the main author of the project, these tools are tried and tested. Many of them are based on popular free software libraries in the general machine learning space: he mentions in particular the Python Scikit-learn library and the R language’s caret and data.table libraries. The contribution of Health Catalyst is to build on these general tools to produce libraries tailored for the needs of health care facilities, with their unique populations, workflows, and billing needs. The company has used the libraries to deploy models related to operational, financial, and clinical questions. Eventually, Thatcher says, most of Health Catalyst’s applications will use predictive analytics based on these libraries, and now other programmers can too.

Currently, Health Catalyst is providing libraries for R and Python. Moving them from internal projects to open source was not particularly difficult, according to Thatcher: the team mainly had to improve the documentation and broaden the range of usable data connections (ODBC and more). The packages can be installed in the manner common to free software projects in these languages. The documentation includes guidelines for submitting changes, so that an ecosystem of developers can build up around the software. When I asked about RESTful APIs, Thatcher answered, “We do plan on using RESTful APIs in our work, mainly as a way of integrating these tools with ETL processes.”

I asked Thatcher one more general question: why did Health Catalyst open the tools? What benefit do they derive as a company by giving away their creative work? Thatcher answers, “We want to elevate the industry and educate it about what’s possible, because a rising tide will lift all boats. With more data publicly available each year, I’m excited to see what new and open clinical or socio-economic datasets are used to optimize decisions related to health.”