Harvard and Elsevier explore collaborations in data science
December 15, 2017
By Alison Bert, DMA
For the Harvard Data Science Initiative, colleagues brainstormed about how to support evidence-based policy and advance precision medicine and healthcare
Caption: Ceilyn Boyd, Research Data Program Manager for the Harvard Library, speaks at the roundtable on "Key Factors for Scientific Impact on Policy" as Dr. Michelle Gregory, VP of Content and Innovation at Elsevier (left), and Anita de Waard, VP of Research Data Collaborations at Elsevier, look on. (Photo by Alison Bert)
CAMBRIDGE, Massachusetts — Can we give policymakers tools to gauge the relevance of scientific research?
Do we have the data we need to predict medical outcomes more effectively?
How do socioeconomic factors, like where people live and work, affect their health – and can data help reveal those connections to improve care?
These were just a few of the big questions experts from Harvard and Elsevier tackled in 90-minute multidisciplinary roundtables at the Harvard Data Science Initiative (HDSI) on November 6. They were there to see how they could join forces to address problems of societal importance through data curation and analytics.
“We had a really exciting discussion – it was a little bit like speed dating,” said Olaf Lodbrok, SVP and General Manager of Precision Medicine at Elsevier, who was a moderator.
In exploring the application of data science to evidence-based policy, precision medicine and healthcare, participants talked about their research interests, compared the data they had access to, and suggested projects to collaborate on.
The conversation traversed highly technical realms like causal inference, behavioral models and natural language processing, while delving into issues that affect our daily lives.
The HDSI was launched at Harvard in March 2017 “to unite (data science) efforts across the university, foster collaboration in both research and teaching, and catalyze research that will benefit our society and economy.” The roundtables, representing the first round of meetings between the HDSI and Elsevier, are intended to lay the foundation for future collaborative projects.
Elsevier is making a substantial donation to HDSI over the next five years, while also providing datasets and ontologies when possible along with technological expertise.
Elsevier’s data – about 65 million records – come from a variety of sources, including 420,000 research articles published annually; the Scopus abstract and citation database of peer-reviewed literature; third-party data, such as records of legislative hearings and patent filings; and the tools we develop to enable researchers and clinicians to access the information they need quickly. In addition, as an information analytics company, Elsevier has experts who can potentially collaborate with Harvard faculty across a range of topics, including bibliometrics, data visualization, machine learning and NLP.
Meanwhile, LexisNexis (owned by our parent company, RELX Group) leverages about 65 billion public and proprietary records for its Risk Solutions business, including insurance claims data.
At the roundtables, Elsevier’s involvement in data and analytics came as a surprise to some faculty. “I had no idea that Elsevier was so big into this,” said Dr. Subu Subramanian, Professor of Population Health and Geography and Director of a university-wide Initiative on Applied Quantitative Methods in Social Sciences.
Ceilyn Boyd, Research Data Program Manager for the Harvard Library, said the experience opened her eyes to a new way of working with Elsevier. Now she’s looking at Elsevier as a collaborator in data science and research.
This is an opportunity for us to do something different together. Elsevier is sitting on an incredible store of data; some of it’s scholarship data that our faculty and staff have created, but then there’s other data as well. So it will be very interesting to see what we can do in this space together. I’m looking forward to seeing what shape these collaborations take – and being a part of it.
Roundtable 1: Key factors for scientific impact on policy
The goal of this roundtable was to find ways to support evidence-based policymaking. It was moderated by Dr. David Parkes, Professor of Computer Science and Co-Director of HDSI, and Ann Gabriel, VP of Academic and Research Relations at Elsevier.
Policymakers face growing pressure to justify investments in scientific research and link them to measurable outcomes. At Elsevier, we see this as an opportunity to increase the influence of peer-reviewed science on policy. For example, our Analytical Services team produces reports on the global research landscape to advise policymakers on key topics, including sustainability science, gender diversity and cancer research trends. We use data from scientific publications and citations as well as the institutions and governments we partner with. We also have tools to track scientific mentions in news and social media, and metrics to gauge the impact of research in non-traditional ways.
Still, when it comes to influencing policymakers, there are various challenges to overcome. Two Harvard professors spoke of their own experiences with government.
Dr. Cherry Murray, Professor of Technology and Public Policy and Professor of Physics, oversaw $5.5 billion in research funding as Director of the Department of Energy’s Office of Science from 2015 to 2017. She said it can be difficult to convince policymakers to base their decisions on the evidence and “real impact” of science rather than going with “their gut feeling.” Policymakers tend to seek out science that supports their views, she said, so studies are likely to have far more impact when they fit the bill.
Dr. Cory Zigler, Assistant Professor of Biostatistics, said his experience contributing to environmental health policymaking “has not really been a happy one.” With air pollution, for example, “the quality of science” is not being given priority, “so I’m very curious what could be done about that.”
Dr. Maria De Kleijn-Lloyd, SVP of Analytical Services at Elsevier, said bibliometrics has been evolving to focus more on the real-world impact of the research. She added that visualizations may be “one of the real missing links between good science and actual impact.”
Caption: Dr. Maria De Kleijn-Lloyd, SVP of Analytical Services at Elsevier, talks about the evolving role of bibliometrics at the scientific policy roundtable. Looking on are (left to right): Prof. Cory Zigler; Dr. Bamini Jayabalasingham, Senior Analytical Product Manager at Elsevier; Dr. Rob Faris, Research Director for the Berkman Klein Center at Harvard; and Prof. Cherry Murray. (Photo by Alison Bert)
Participants agreed that science had to be more persuasive, and they suggested projects and tools to help impact policymaking.
At the closing reception, Dr. Parkes said the diversity of backgrounds made for a lively and productive session. While much of the conversation focused on the interaction between science, media and policy, participants talked about another option – one Dr. Parkes referred to as a “technological homerun”:
What if we went straight to the people (with) a smartphone app that somehow helps individuals to understand science? So forget about media and forget about policy … try to do something very disruptive. What if we could develop technology to distinguish between established science and controversial science? What if we could follow the money? Who’s funding science? Follow the social networks of scientists, get at the motivations (and) biases underlying science.
He concluded with a provocative question, quoting Dr. Zirui Song, Assistant Professor of Health Care Policy at Harvard Medical School and an internal medicine physician at Massachusetts General Hospital: “What if truth were a special interest?”
“At the end of the day,” Dr. Parkes said, “if we want to help science have impact on policy, that’s a wonderful way to think about things.”
Roundtable 2: Interpretable models for precision medicine
Caption: Dr. Mauricio Santillana, an Assistant Professor at Harvard Medical School, talks about data for precision medicine as Elsevier colleagues listen (left to right): Theresa Hunt, VP of Global Marketing, Research Reference; Hajo Oltmanns, SVP and General Manager of Integrated Decision Support and Performance Management; Dr. Willian Chen, VP of Product Management for Precision Medicine; and Olaf Lodbrok. (Photo by Alison Bert)
Precision medicine – which involves using a patient’s data to predict their disease progression and select options for treatment or prevention – needs to rely on the right kind of data and on a wide enough sample to be meaningful. Experts build models to customize healthcare to individuals depending on a variety of factors, like their age, gender, genes and overall medical history.
But that kind of data can be elusive. Some of it is in the EHR (electronic health record), which may not be accessible to researchers. Plus EHRs are not standardized in their format or the information they collect; they can vary widely among health facilities and countries.
So Lodbrok, moderator for the session, posed a question: “Who has enough data – large pooled EHR data, with lab values, longitudinal, patient consented, normalized, standardized to one schema?”
The consensus was that this kind of data would be extremely valuable, but no one has access to it today. Plus the system is fraught with complications such as a legacy of public and private systems, new initiatives, and regulatory issues regarding security and privacy. Could access be obtained?
Three of Harvard’s biostatistics professors weighed in on their research and how they are tackling different aspects of precision medicine.
Dr. Sebastien Haneuse uses claims data from Medicare – a federal system of health insurance in the US for people 65 and over – to understand variations in the quality of care at a national level. He proposed combining that data with information from EHRs and public health research.
With complex longitudinal data, Dr. Miguel Hernan combines causal inference methodology and machine learning to evaluate the effectiveness of various interventions.
Dr. Tianxi Cai suggested text mining – extracting medical knowledge from the literature, genetic data, and clinical notes from EHRs. “Ultimately, I’m interested in … how can we use the knowledge to predict a patient’s disease progression or treatment response,” she said.
Whatever information is used, transparency is essential in building an interpretable model for precision medicine, said Dr. Mauricio Santillana, an Assistant Professor at Harvard Medical School and a faculty member in the Computational Health Informatics Program at Boston Children’s Hospital. “We need to have an open and transparent model; then later we can introduce a black box approach,” he explained.
Ultimately, the group expressed the need for a large normalized health data set as a resource for doing research. They also saw potential for follow-up in various areas, including natural language processing of clinical notes and medical literature, using Elsevier’s resources and NLP experts.
Roundtable 3: Social and behavioral determinants of healthcare
To optimize healthcare, we must look beyond biology.
That was the overarching theme of this roundtable on social determinants of healthcare (SDH), which one Harvard researcher referred to as “the fifth vital sign.” Harvard’s experts in healthcare policy, epidemiology and data science brainstormed with RELX leaders on how we could incorporate socioeconomic data to predict a patient’s health outcomes and their likelihood to engage in their own care.
“We’d love to understand how SDH can be harnessed to help clinicians educate, engage and treat their patients,” said Cory Polonetsky, Senior Commercial Director for Patient Engagement at Elsevier. “We don’t want to assume that the same approaches will work with patients who are similar clinically if they are different in terms of SDH.”
The accessibility of alternative data sources is driving healthcare organizations to explore how they can better serve and engage the populations they manage. In the United States, 25 percent of healthcare spending goes to the treatment of diseases or disabilities resulting from potentially changeable behaviors, according to a 2013 report by the US Department of Health and Human Services. The National Quality Forum, Centers for Disease Control and World Health Organization have all acknowledged the impact and importance of addressing social and behavioral determinants of health.
Some organizations are using claims data to determine a person’s willingness to engage in their own care and adhere to care plans. In many cases, however, claims data is not available for individuals. And sometimes key information is not in the claims data. Kathy Mosbaugh, VP of Analytics Solutions for LexisNexis, noted that it’s tough to get that data because it’s deep in the EHR.
Mosbaugh said LexisNexis has data from public records that can help predict health outcomes beyond what’s possible with clinical factors alone. She suggested that area-level variables, such as access to healthy foods or the availability of smoking cessation programs, are critically important factors for improving health.
Dr. Sara Bleich, Professor of Public Health Policy, said it’s crucial to connect social factors with health outcomes. In her own research, she focuses on populations at higher risk for obesity and diabetes, providing evidence to support policy alternatives for prevention and control. Why do certain people develop obesity? In some cases, she said, their housing situation could be an underlying factor.
Dr. Bleich added that she and her colleagues struggle to get large datasets that would help them correlate social factors with health outcomes because they don’t have access to the health outcomes data.
The need for data drove much of the conversation.
As a moderator, Dr. Francesca Dominici, Professor of Biostatistics at the Harvard TH Chan School of Public Health and Co-Director of the Data Science Initiative, summarized the conversation while addressing Elsevier and LexisNexis colleagues:
You want to solve important problems, including public health; you want to maximize impact. And that’s what we want to do, too. … But what I’m hearing is that you guys have an enormous amount of data at an individual level. … So how can we help? How can we work with you? Is there anything that we can bring?
Her questions were answered by Dr. Brad Fenwick, Senior VP for Global Strategic Alliances at Elsevier: “When people start understanding the mountain of data and the activity of accumulating more data and organizing it in the various parts of RELX – it is a tremendous amount, almost overwhelming in terms of what to do with it,” he said.
But he pointed out that the Harvard faculty bring something equally important to the table:
What we (would) benefit from, in many cases, is working with all of you to define the best questions to answer. We have our own questions we’re trying to answer … but they’re within the context of the company and what we’re doing. You can ask more existential questions that we still have the ability to answer.
In reflecting on the event, Polonetsky said: “Everyone seemed excited about the prospect of researching the applications of SDH in a way that no one party could do on its own.”
At the networking reception in Harvard’s Cabot Science Library, Elsevier colleagues displayed posters about their teams’ work in research data management, big data analysis, researcher mobility, fostering research collaboration, text-mining and biomedical visualization, and research integrity. Here’s a sample: