When we or our loved ones are diagnosed with a serious illness, we turn to a sophisticated medical community, desperate for information about what is going on and what we can do. One of the most discouraging things we can hear in this situation is that modern medicine “just doesn’t know” enough about the condition to give us definite answers. How can this be? Is this really the first time that medical science has seen this condition? Surely in all the history of medicine, someone has encountered this before. What did we learn from that? What did and didn’t work?
The problem isn’t that this condition hasn’t been seen before—it probably has. The problem is that all that medical data has to be collected and analyzed before we can find the useful patterns that apply to this case. We’ve been collecting medical data systematically for decades now, so what’s the problem? This data is valuable to every person on earth; surely it has been cataloged, cross-referenced, federated, and analyzed in every possible way, right?
Unfortunately, that’s not the case. To get an idea of the current state of medical data, consider the care.data project in the UK, an ambitious effort to create the largest medical research database in history, based on data from the National Health Service. But the project is in danger because of public concern about privacy and consent. In Australia, the telecom giant Telstra has entered the health records space in an attempt to coordinate the collection and use of health records across Australia. In the United States, the HL7 medical records standards organization is over 20 years old, and its standard has recently been updated to a “fast” version called FHIR.
An investigation of the nuances of these efforts could fill a book, but there are a few striking facts that stand out:
- The largest research data set proposed today is made up of medical records from a single country (and not even a particularly large country, at that!)
- Investors see it as a daring move for a large company to attempt to roll up health records—again, in a single country.
- One of the most successful health record standards is getting a facelift after two decades of use.
What’s going on here? Why, in 2014, are we still struggling with this problem? Why haven’t we just standardized all this information, and brought it together from around the globe? There are technical, political, and business challenges that have stood in the way of a global medical research database.
On the political side, concerns about privacy and consent plague these efforts, seen most recently in the care.data case, in which public perception is that big business and the government are using personal data without our consent. Then there are business challenges: private enterprises around the globe provide medical data management systems to clinicians and researchers. A lot of their corporate value is wrapped up in these proprietary systems.
So we’d like all their record formats to interoperate? Fine. But are we willing to pay them the substantial development cost of re-engineering their systems to conform to some new standard? No, we aren’t; after all, isn’t that a cost of doing business that the vendors themselves should bear? But when those same vendors ask for the right to commercialize the data they collect, to recoup those costs, there is a public reaction against having “our” data sold for someone else’s profit, and we’re back to the privacy and consent issues.
Then there are the technical issues. In a field as vast as medicine, a huge amount of technical work is needed to model all the relevant kinds of data. The plethora of medical standards isn’t a reflection of poor quality, but rather of the vast scope of use cases and settings in which the standards are applied. But how do we know where one standard leaves off, and another begins? This isn’t an easy problem, and can leave various standards bodies at odds with one another.
This all makes health care an example of what I like to call the Data Wilderness. There is a vast amount of data, in a wide variety of idiosyncratic forms, in an ever-changing landscape. Even our efforts to tame it seem to add to the confusion. Most technology approaches to the data wilderness (like single, large databases or global data standards) concentrate on the orderly end state that we dream of, after we have cleared the wilderness.
But while we can dream of a day when we have imposed order onto the chaos, the awful truth is that all of us will continue to spend our whole careers in the wilderness. Don’t believe me? Look how long we have been trying to tame the data wilderness in health care, and where we are after all that work. Do you really think this will be sorted out in your lifetime?
So, should we despair of ever being able to utilize our health record data? Not at all. But we do need to change our expectations of data management technology; we need wilderness survival technology. What does that look like? Wilderness survival technology works in an inherently distributed and heterogeneous world. In the wilderness, our queries won’t be simple, and they won’t rely on a single data schema, so our survival technology will have to cope with more complexity and variety than before.
Where do we find our survival tools? There are a lot of new data technologies out there, but for wilderness survival, a graph database hits a lot of the sweet spots. Graphs are a natural way to talk about distributed data, allowing us to describe our data without strict adherence to a fixed schema. A graph database specializes in complex queries, matching combinations of conditions that will inevitably result from the multitude of data formats that we meet in the wilderness.
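To make the idea concrete, here is a minimal sketch in plain Python (not any particular graph database product or its API) of how graph-shaped data lets us query across sources that never agreed on a schema. The patient records, predicate names, and mapping are all hypothetical, invented for illustration:

```python
# Records from two hypothetical sources, expressed as subject-predicate-object
# triples. Each source uses its own vocabulary, and neither had to conform
# to a shared schema up front.
triples = [
    ("patient:1", "hasDiagnosis", "diabetes"),    # source A's vocabulary
    ("patient:1", "prescribed", "metformin"),
    ("patient:2", "diagnosis_code", "diabetes"),  # source B's vocabulary
    ("patient:2", "prescribed", "insulin"),
]

# Mapping both vocabularies onto a shared notion of "diagnosis" is the
# modeling work of understanding how the sources relate to one another.
DIAGNOSIS_PREDICATES = {"hasDiagnosis", "diagnosis_code"}

def patients_with_diagnosis(graph, condition):
    """Match a pattern across both vocabularies, no fixed schema required."""
    return sorted(
        subject
        for (subject, predicate, obj) in graph
        if predicate in DIAGNOSIS_PREDICATES and obj == condition
    )

print(patients_with_diagnosis(triples, "diabetes"))
# -> ['patient:1', 'patient:2']
```

A real graph database generalizes this pattern matching to complex combinations of conditions over billions of triples, but the key point is the same: the data from each source goes in as-is, and the reconciliation lives in the query and the mappings rather than in a single global schema.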
Graph data technology won’t solve the global health record problem on its own. We still need to understand the data sources and how they relate to one another. But it represents a change in how we think about the problem we are facing—from trying to get everyone to agree, to coping with today’s world, in which they don’t.
Dean Allemang, co-author of the bestselling book, Semantic Web for the Working Ontologist, is a consultant, thought-leader, and entrepreneur focusing on industrial applications of distributed data technology. He served nearly a decade as Chief Scientist at TopQuadrant, the world’s leading provider of Semantic Web development tools, producing enterprise solutions for a variety of industries. As part of his drive to see the Semantic Web become an industrial success, he is particularly interested in innovations that move forward the state of the art in distributed data technology. Dean’s current work is concentrated on the life sciences and finance industries, where he currently sees the most promising industrial interest in this technology.