Reproducibility in Data Integration Through the Looking Glass

Webpage containing information on Delft-Data's paper about reproducibility in data integration

Description

Database research is facing a reproducibility crisis, despite encouragement from multiple initiatives. The lack of open artifacts, most importantly data and source code, obstructs the validation of existing research and hinders our community’s progress. We aim to evaluate the status quo of reproducibility in data integration, focusing on four data integration tasks: schema matching, schema mapping, entity matching, and entity resolution.

We investigated artifacts from 27 papers and measured their reproducibility. In addition to a thorough study of the current situation, we introduce a reproducibility checklist for data integration and a new metric for assessing a research paper’s degree of reproducibility during the review process. The survey shows that fewer than 10% of the papers can be fully reproduced and that the degree of reproducibility differs substantially across data integration tasks. Besides the checklist and the metric, we also propose actionable steps to move forward.

Authors

Andra Ionescu
TU Delft
Marios Fragkoulis
TU Delft
Christoph Lofi
TU Delft
Asterios Katsifodimos
TU Delft

Reproducibility checklist

The checklist captures the minimum information needed to reproduce a paper in data management. We use three categories (method, data, experiment) that are also used in adjacent domains such as Artificial Intelligence and Machine Learning.

Method variables

The variables from this category help a researcher understand the problem solved, the goal of the proposed method, and the methodology to achieve it.

Data variables

The Data variables describe the datasets used in a paper.

Experiment variables

The Experiment variables describe how experiments are conducted, including information about the environment and the source code.

Metric

The proposed metric purely indicates how much detail is available and how much is missing. For each category, compute the percentage of variables with True values and the percentage with False values.
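
As an illustration of the computation described above, here is a minimal Python sketch. It assumes the checklist is stored as a mapping from category to named True/False variables. The variable names in the example (such as `problem_stated` or `dataset_available`) are hypothetical placeholders, not the checklist's actual variables, and the function name `category_scores` is our own.

```python
from typing import Dict

# A checklist maps each category to its variables and their True/False values.
Checklist = Dict[str, Dict[str, bool]]


def category_scores(checklist: Checklist) -> Dict[str, Dict[str, float]]:
    """Return, per category, the percentage of True and of False variables."""
    scores: Dict[str, Dict[str, float]] = {}
    for category, variables in checklist.items():
        if not variables:
            # Skip empty categories to avoid dividing by zero.
            continue
        true_count = sum(1 for value in variables.values() if value)
        true_pct = 100.0 * true_count / len(variables)
        scores[category] = {"true_pct": true_pct, "false_pct": 100.0 - true_pct}
    return scores


# Hypothetical filled-in checklist for a single paper (illustrative names only).
example: Checklist = {
    "method": {
        "problem_stated": True,
        "goal_stated": True,
        "method_described": True,
        "pseudocode_given": False,
    },
    "data": {
        "dataset_available": True,
        "dataset_described": False,
    },
    "experiment": {
        "source_code_available": False,
        "environment_described": False,
        "setup_described": True,
        "parameters_listed": True,
    },
}

scores = category_scores(example)
for category, result in scores.items():
    print(f"{category}: {result['true_pct']:.1f}% True, {result['false_pct']:.1f}% False")
```

Reporting the percentages per category, rather than as one aggregate number, keeps visible which of the three categories is missing detail.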

How to move forward

  1. Create workshops and challenges about reproducibility.
  2. Collaborate with open source software.
  3. Invest effort in hosting the data and artifacts in long-term storage as we do for the papers.
  4. Help change the culture! Use the checklist to publicly show the reproducibility level of your paper, and encourage PhD students to work on making their artifacts openly available and reproducible.