Mercè Crosas is University Research Data Officer at Harvard University and Chief Data Science and Technology Officer at the Institute for Quantitative Social Science. She is involved in several projects on data sharing, data analysis and services, data curation, data science support and issues related to data privacy. Mercè is originally an astrophysicist but now works across many scientific domains, providing her with insights on the commonalities and the differences among these disciplines regarding data issues. We spoke with her to discuss the challenges that she came across in her work and what these mean for the future of big data and data for AI.
Defining big data – creating confusion
She is not too fond of defining big data. Though ‘big data’ is a popular and easy-to-use term, it is also a confusing one. Twenty years ago, when she was still working as an astrophysicist, she and her colleagues already considered the data they had at hand to be big data. The term became popular with the increasing complexity of datasets built from the data that we constantly generate. In such datasets, which are often unstructured, it is important to determine which data is useful and which is noise. The interesting part lies in inferring knowledge from the dataset. This is what the field of data science is all about, and she therefore prefers to speak of data science rather than big data.
Technologies such as AI offer great prospects because they can help us learn from these large datasets. Computational power is increasing, and community contributions, for instance through open-source libraries, are a good thing. It would not be reasonable to set aside the opportunities these technologies provide. At the same time, it is important to keep the attached risks in mind, because large amounts of data do not always yield better results. A well-designed experiment with good-quality data and sound statistical analysis can provide much better answers than a big but low-quality dataset. So the mere fact that we have large amounts of data available does not imply that we are learning something from it; there is still room for large compound errors and wrong assumptions along the way, and this should be kept in mind.
The importance of doing proper data management
It is important to distinguish between open-source software and open data, because these are two different things. Mercè suggests that both open data and open-source software are desirable for transparency, validation, and wide dissemination of science. Where possible, Mercè promotes making data openly accessible, while keeping it restricted when there are privacy issues. In research, data sharing enables validation and reproducibility of scientific findings and maximizes the return on research investments. In organizations, data sharing leads to insights and opportunities to improve goods and services. In some cases, however, there are boundaries to data accessibility and its benefits, such as restrictions imposed by privacy legislation or IP rights. Data access should not be regarded as an all-or-nothing matter, because access can also be provided in tiers. Currently, not enough data is open, but several scientific communities promoting open data have been growing. In sciences that involve more sensitive data, for instance the biomedical sciences, opening up datasets has been more cumbersome due to privacy issues. Sometimes the lack of open data is due to a lack of incentives for sharing it. That is why it is important to give data authors credit for their data through data citation and formal recognition, Mercè says.
Regardless of whether data is open or not, it is important to keep the algorithms that are being used transparent, so that we can document and verify what has been done with the data. Regarding AI, interoperability of data has been, and still is, the hardest part, simply because connecting and merging datasets takes a lot of curation. The main goal is data harmonization, which allows bringing together data that originates from different datasets with varying file formats and metadata schemas. Currently, there are no good tools for connecting datasets, so developing them would be very helpful. Defining and implementing data standards could be part of the solution as well, although achieving such standards will be challenging because technology evolves rapidly, often resulting in new types of data that bring along different (or new) constraints.
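To make the curation burden concrete, here is a minimal sketch of the kind of harmonization step involved, using pandas. The datasets, column names, and the country-name mapping are all hypothetical; the point is only that two sources describing the same entities must be mapped onto one shared schema before they can be merged.

```python
import pandas as pd

# Hypothetical exports from two sources with incompatible schemas:
# one uses ISO country codes and a 'yr' column, the other full
# country names and 'year'. All names here are illustrative.
survey = pd.DataFrame(
    {"country": ["US", "FR"], "yr": [2020, 2020], "respondents": [410, 380]}
)
economic = pd.DataFrame(
    {"nation": ["United States", "France"],
     "year": [2020, 2020],
     "gdp_usd_tn": [20.9, 2.6]}
)

# Harmonize: map both datasets onto one shared schema before merging.
name_to_code = {"United States": "US", "France": "FR"}
economic = economic.assign(
    country=economic["nation"].map(name_to_code)
).rename(columns={"year": "yr"})

merged = survey.merge(
    economic[["country", "yr", "gdp_usd_tn"]], on=["country", "yr"]
)
print(merged)
```

Even this toy case needs a hand-built lookup table; at the scale of real repositories, with thousands of formats and metadata schemas, that is exactly the curation effort Mercè describes.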
Mercè has been involved in the Dataverse project, an open-source web application for sharing, preserving, finding, and citing research data. It allows users to make data available to others and makes it easier to reproduce or reuse the work of others. The Dataverse project is thus aligned with the FAIR guiding principles for data stewardship. The FAIR Data Principles, of which Mercè is one of the co-authors, entail that data must be Findable, Accessible, Interoperable, and Reusable. In addition to supporting the sharing and reuse of data by humans, these principles emphasize enhancing the ability of machines to automatically find and use the data.
Next steps and practical solutions
In summary, the accessibility of data is good for our society because we can learn from it. We need the right policies and regulations in place that push for progress. Take, for instance, tackling climate change, a field where the availability of data is going to be very important. In some cases, the data necessary for progress is not available, and effort still has to be put into obtaining it. Because of this, we cannot rely on standards for data sharing and access alone, as they will not solve every issue. We need regulation in place that helps to enforce good data practices. Regulation created for this purpose should be evidence-based, because we need certainty that it will be effective.
Mercè has also been involved in the DataTags project, which created a system for sharing sensitive data in a safe manner. A datatag is a set of security features and access requirements for file handling. The tag is attached to the metadata of a dataset that contains sensitive data. Its purpose is to reduce the complexity of data sharing and to make it easier to meet the compliance, security, and contractual obligations that come along with the dataset. Datatags are interoperable across datasets and systems, and mutually exclusive in the sense that only one datatag applies to a given dataset. Which datatag applies depends on how the dataset is defined. Different datatags are available, ranging from blue to crimson (see the picture below). Tagged files can be stored in a datatags repository, which stores and shares data files in accordance with a standardized and ordered set of security and access requirements.
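The idea of an ordered set of handling requirements can be sketched in a few lines. The level names blue through crimson come from the DataTags system as described above; the concrete requirements attached to each level below are illustrative assumptions, not the official specification.

```python
from enum import IntEnum

class DataTag(IntEnum):
    """Ordered tag levels, from least to most restricted.

    Level names follow the DataTags system; everything else in
    this sketch is a hypothetical illustration.
    """
    BLUE = 1     # public data
    GREEN = 2
    YELLOW = 3
    ORANGE = 4
    RED = 5
    CRIMSON = 6  # maximally restricted

def handling_requirements(tag: DataTag) -> dict:
    # Requirements tighten monotonically with the tag level
    # (hypothetical thresholds, for illustration only).
    return {
        "encrypt_in_transit": tag >= DataTag.GREEN,
        "encrypt_at_rest": tag >= DataTag.ORANGE,
        "approval_required": tag >= DataTag.RED,
    }

def tag_for(levels: list) -> DataTag:
    # Exactly one tag applies to a dataset: the strictest level
    # demanded by any of its attributes.
    return max(levels)
```

The ordering is what makes tags interoperable across systems: a repository only needs to compare levels, not interpret each dataset's agreements from scratch.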
Currently, Mercè’s team is in the process of implementing datatags for datasets in the Harvard Dataverse repository. This has been a big task due to legal compliance issues, security requirements, and the conditions set by various data agreements. These datasets often contain sensitive information about individuals, so safeguards need to be put in place to protect them. Policies on data sharing play a critical role in balancing the benefits and the risks. The average citizen wants privacy and safety for their data but has little time for data governance. As the number of data-driven products is only expected to increase, so is citizens’ demand for privacy management. It is important to map the data beforehand, because how the relevant regulation attaches to the data depends on the data itself. When regulation changes, the datatags will have to be adapted as well, for instance by providing an updated version of the tag. For these purposes, the team partnered with lawyers who help verify the datatags. More recently, Mercè has been involved as one of the co-PIs of the OpenDP project, an open-source platform for differential privacy libraries. This work allows sensitive datasets to be mined and analyzed while preserving privacy, without researchers ever accessing the data directly. Together, Dataverse, DataTags, and OpenDP will provide a privacy-preserving platform for sharing and analyzing sensitive data.
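OpenDP itself provides vetted implementations with their own API; purely as an illustration of the core idea behind differential privacy, here is a minimal sketch of the Laplace mechanism for a counting query. A count has sensitivity 1 (adding or removing one person changes it by at most 1), so noise drawn from Laplace(1/ε) yields ε-differential privacy. The example data is invented.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def dp_count(records, predicate, epsilon: float) -> float:
    """Noisy count of records matching `predicate`.

    Sensitivity of a count is 1, so Laplace(1/epsilon) noise
    gives epsilon-differential privacy for this single query.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Example: how many people are over 60, without exposing anyone.
random.seed(0)  # reproducible demo only; never seed in production
ages = [34, 71, 65, 52, 80, 29, 61]
noisy = dp_count(ages, lambda age: age > 60, epsilon=1.0)
```

The researcher only ever sees the noisy answer, which is the sense in which the underlying records are "never accessed directly"; a production system would additionally track the privacy budget spent across queries, which this sketch omits.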
Access to data is not the only issue; the data must also be usable. Once data is accessible in a manner that allows computing, we should also ensure the conditions needed to process such large datasets. In some cases a dataset cannot be downloaded because of its size or privacy concerns, and it makes more sense to keep the data in the cloud and share the compute.
To conclude, we need good policies in place for data. At the same time, it is very important that we develop technological tools and methods that help data users, whether researchers or citizens, reap the maximum benefit of big data. Moreover, AI, policies, and tools for ensuring data protection will have to be developed hand in hand. Mercè’s work is a perfect example of such a contribution.
Recommendations for further readings
– Sweeney, L., Crosas, M., Bar-Sinai, M. (2015). Sharing Sensitive Data with Confidence: The Data Tags System. Technology Science, 2015101601, October 16, 2015.
– Wilkinson, M.D., et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3:160018, March 15, 2016.
– Castro, E., Crosas, M., Garnett, A., Sheridan, K., Altman, M. (2017). Evaluating and promoting open data practices in open access journals. Journal of Scholarly Publishing, October 2017.