Kirsten Patrick is Deputy Editor at CMAJ; she’s currently attending the 31st International Conference on PharmacoEpidemiology & Therapeutic Risk Management (ICPE) in Boston.


About a year ago I suggested “Big Medical Data” as a potential topic for a CMAJ editorial to our editors’ writing group. I remember receiving some blank looks that sounded a lot like “Weirdo!” In fact, that may well have been upon my return from the last ICPE, or perhaps it was a year before that when I came back from participating in the working group that produced The REporting of Studies Conducted Using Observational Routinely-Collected Health Data (RECORD) Statement. Anyway, there’s something about talking to people who are working with, and developing new ways of crunching, Big Data that gets me all fired up about it. I can see an exciting future full of possibilities and I want to evangelize.

In the first plenary session at ICPE yesterday, entitled “Computer Power and Human Reason: from calculation to judgement”, speakers seemed to be defending the role of the pharmacoepidemiologist now that crunching data with computer programs can tell us just about anything we need to know. What are the virtues of the human operator vs. computer systems? “Is it the pilot or the plane that’s critical for a successful flight?” asked Dr Robert Ball of the FDA’s Office of Surveillance and Epidemiology (OSE). Ball talked about the increasing utility of mining EMR data as a way to conduct routine drug surveillance but pointed out that, while EMR data are useful to complement spontaneous adverse event reporting, data crunching should not replace such reports; ~50% of all FDA’s post-marketing drug label changes are from spontaneous adverse reports.

Marc Berger, Vice President, Real World Data and Analytics at Pfizer, acknowledged that data are ‘exploding’. We now have access to information on a growing number of variables, as well as to more and more advanced analytics (including machine learning) and pattern recognition algorithms that can give us extraordinary insights. However, he said, the biggest problem remains failure to ask a specific question, design a study carefully, and answer the question properly. This was music to my medical editor ears. Essentially, we have more and more sophisticated ways of data mining and yet the principle remains: ask a specific question, state it ‘a priori’ and answer it with credible methods. That’s not to say computer programs aren’t inherently useful. Humans are pattern-recognition machines, said Berger, while computers use algorithms to arrive at an answer. You can’t replace human reason with computers; we still need humans at the front end (to ask the right question) and at the back end (with tacit knowledge and those very skills of pattern-recognition). However, there is a place for quick and dirty data analysis, according to Berger. He said that as a patient presenting with a particular complaint he’d want his physician to have the ability to give him an answer to the following question using electronic medical record (EMR) data: “Looking at the last 1000 or 10,000 men in their early sixties who were treated for [my] condition, what patterns are emerging to guide my own decision making?”

Indeed, Dr Louis Fiore of Veterans Affairs (VA) Boston Healthcare talked about the potential for mining the real world data of the EMR in a later session. Most EMRs were “designed to facilitate one-on-one interactions, not to support analysis of aggregated data as required by many secondary uses”. So, most potentially useful information exists within EMRs as “unstructured free text”. As D’Avolio and Fiore said in a recent article, “Despite the availability of billions of data points, little is done to answer the three critical questions upon which discovery and improvement is predicated: 1) what are we doing; 2) to whom are we doing it and; 3) is it working?”

To get computable data from the EMR is tricky but you can do awesome things with it, said Fiore, from ‘locally selfish knowledge’ or ‘local learning’ (used for quality improvement in the VA network) right up to well-conducted randomized studies run relatively cheaply within routine practice, for example to answer questions about the comparative effectiveness of drugs that we should know the answer to but don’t. Local learning allows eyeballing of data in one health care system or region to answer the question, “What happened to the last ten patients treated in this way or for this condition?” or, “How is ward A comparing with ward B on a particular outcome?” In the context of clinical care it breaks down the silos that doctors are working in and gives patients better tools with which to make decisions. It’s not high-quality information but it’s some information that can guide decision making.
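
For the curious, here is a rough idea of what those two ‘local learning’ questions might look like once the data are computable. This is only a minimal Python sketch under loud assumptions: the records, field names (`ward`, `treatment`, `readmitted`) and both functions are invented for illustration, and nothing here reflects the VA’s actual systems or anything Fiore presented.

```python
from collections import defaultdict

# Made-up, already-structured records, ordered oldest to newest.
# Real EMR data would first need extracting from unstructured free text.
records = [
    {"ward": "A", "treatment": "drug_x", "readmitted": False},
    {"ward": "B", "treatment": "drug_x", "readmitted": True},
    {"ward": "A", "treatment": "drug_y", "readmitted": True},
    {"ward": "B", "treatment": "drug_y", "readmitted": False},
    {"ward": "A", "treatment": "drug_x", "readmitted": False},
    {"ward": "B", "treatment": "drug_x", "readmitted": True},
]


def last_n_treated(records, treatment, n=10):
    """'What happened to the last n patients treated in this way?'"""
    matching = [r for r in records if r["treatment"] == treatment]
    return matching[-n:]  # the n most recent matches


def readmission_rate_by_ward(records):
    """'How is ward A comparing with ward B on a particular outcome?'"""
    counts = defaultdict(lambda: [0, 0])  # ward -> [readmitted, total]
    for r in records:
        counts[r["ward"]][0] += int(r["readmitted"])
        counts[r["ward"]][1] += 1
    return {ward: hits / total for ward, (hits, total) in counts.items()}


recent = last_n_treated(records, "drug_x", n=2)
rates = readmission_rate_by_ward(records)
```

A crude comparison like this takes no account of confounding or case mix, which fits Fiore’s point: it is some information to guide a decision, not high-quality evidence.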

Pharmacoepidemiologists complain, said Berger, that ‘the data are sparse…’ and ‘the data are dirty…’ His reply to that is that ALL data are sparse and all data are dirty but they can be useful if we analyze them properly. And everyone is using these dirty data – Google, Facebook, health care insurers… We need to use them too.

Susan Gruber of the Reagan-Udall Foundation, an independent organization created by the US Congress to advance the mission of the FDA by advancing regulatory science and research, discussed the importance of being able to integrate data from multiple sources or platforms and analyze these to answer important questions (Louis Fiore calls this ‘informatics chicanery’). She said it’s not about looking for a NEEDLE but for relevant pieces/collections of HAY in the haystack, and we need sophisticated computer programs to help with the identification of important linkages. Gruber painted a picture of ideal data: comprehensive, patient-level, collected in real time, geo-tagged, and securely deidentified. It would need to be auto-analysed, she said, but we should also have the ability to customise solutions (and we should not create a false dichotomy there).

Phew! I’m glad that much better brains than mine are working on this stuff. There is clearly huge potential to do game-changing health care research with big data; to recognise patterns of harm early; to optimise and customise treatment for patients based on information from ‘learning systems’; and to save health care dollars. Bring it on.