What Do You Think Of Data Lakes?

Since I am not a high-end technologist, I’m not always up on the latest trends in database management – so the following may not be news to everyone who reads this. As for me, though, the notion of a “data lake” is a new one, and I think it is a valuable idea that could hold a lot of promise for managing unruly healthcare data.

The following is a definition of the term appearing on a site called KDnuggets which focuses on data mining, analytics, big data and data science:

A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured and unstructured data. The data structure and requirements are not defined until the data is needed.

According to article author Tamara Dull, a data warehouse contains data that is structured and processed, is expensive to store, relies on a fixed configuration and is used by business professionals. A data lake, by contrast, contains everything from raw to structured data, is designed for low-cost storage (made possible largely because it relies on the open source software Hadoop, which can be installed on cheaper commodity hardware), can be configured and reconfigured as needed and is typically used by data scientists. It’s no secret where she comes down as to which model is more exciting.
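
To make the idea that structure is "not defined until the data is needed" a bit more concrete, here is a minimal Python sketch. Everything in it is hypothetical and made up for illustration: the record fields, the `raw_lake` list and the `read_heart_rates` function. A real data lake would sit on something like Hadoop rather than a Python list, so treat this as a toy under those assumptions, not as how any particular product works.

```python
import json

# Hypothetical raw records, kept exactly as they arrived (no upfront schema).
# In a real data lake these would be files in low-cost storage, not a list.
raw_lake = [
    '{"source": "wearable", "patient_id": "p1", "heart_rate": 72}',
    '{"source": "telemedicine", "patient_id": "p2", "video_uri": "consult-001.mp4"}',
    '{"source": "wearable", "patient_id": "p2", "heart_rate": "88", "steps": 4100}',
]


def read_heart_rates(lake):
    """Apply a structure only at read time, for one specific question."""
    rows = []
    for line in lake:
        record = json.loads(line)
        if "heart_rate" not in record:
            # Not relevant to this question; the record still stays in the lake.
            continue
        # Normalize the type here, at read time, instead of at load time.
        rows.append((record["patient_id"], int(record["heart_rate"])))
    return rows


heart_rates = read_heart_rates(raw_lake)
print(heart_rates)
```

The telemedicine record is skipped by this particular read but is never discarded, so a different question asked later can still use it. A warehouse-style load would instead have had to decide up front which fields to keep and in what format.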

Perhaps the only downside she identifies with data lakes is that security may still be a concern, at least when compared to data warehouses. “Data warehouse technologies have been around for decades,” Dull notes. “Thus, the ability to secure data in a data warehouse is much more mature than securing data in a data lake.” But this issue is likely to receive attention in the near future, as the big data industry has been focused tightly on security of late, and to her it’s not a question of if security will mature but when.

It doesn’t take much to envision how the data lake model might benefit healthcare organizations. After all, it may make sense to collect data even when we don’t yet have a well-developed idea of how it will be used. Wearables data comes to mind, as does video from telemedicine consults, but there are probably many other examples you could supply.

On the other hand, one could always counter that there’s not much value in storing data for which you don’t have an immediate use, and which isn’t structured for handy analysis by business analysts on the fly. So even if data lake technology is less costly than data warehousing, it may or may not be worth the investment.

For what it’s worth, I’d come down on the side of the data-lake boosters. Given the growing volume of heterogeneous data being generated by healthcare organizations, it’s worth asking whether deploying a healthcare data lake makes sense. With a data lake in place, healthcare leaders can at least catalog and store large volumes of un-normalized data, and that’s probably a good thing. After all, it seems inevitable that we will have to wring value out of such data at some point.

About the author

Anne Zieger

Anne Zieger is a healthcare journalist who has written about the industry for 30 years. Her work has appeared in all of the leading healthcare industry publications, and she's served as editor in chief of several healthcare B2B sites.

1 Comment

  • The problem is that “data lakes” are not a recent invention. The term is simply an invented name for post-relational data structures that have been around since the 1980s and are core to “Cases” in healthcare (i.e., Cases can accommodate any mix of structured and unstructured data).

    The reason post-relational databases are able to accommodate pretty much “anything”, including documents, images and even video recordings, is an invention called the BLOB (Binary Large Object).

    It is prudent to at least export all data going into an EHR to a generic data exchanger that allows either immediate export/import to a data warehouse OR buffering of the data for possible future export/import to a data warehouse.

    The rationale for having data warehouses is to have data in a format that SQL commands can work with for statistical/tabular reporting.

    Data extraction out of EHRs is more complex in that an agency can have many need-to-know subscribers, each ideally wanting to read data using their own native data element naming conventions and using a data transport format that allows easy import at the target location (some local or remote database).
