Description
Database research is facing a reproducibility crisis, despite encouragement from multiple initiatives. The lack of open artifacts, most importantly data and source code, obstructs the validation of existing research and hinders our community’s progress. We aim to evaluate the status quo of reproducibility in data integration, focusing on four data integration tasks: schema matching, schema mapping, entity matching and entity resolution.
We investigated artifacts from 27 papers and measured their reproducibility. Apart from a thorough study of the current situation, we introduce a reproducibility checklist for data integration and a new metric for assessing a research paper’s degree of reproducibility during the review process. This survey shows that less than 10% of the papers can be fully reproduced and that there are substantial differences in the degree of reproducibility among data integration tasks. Besides the checklist and the metric, we also propose actionable steps to move forward.
Authors
Four authors, all affiliated with TU Delft.
Reproducibility checklist
The checklist captures the minimum information necessary to reproduce a paper in data management. We use three categories (method, data, experiment) that are also used in adjacent domains such as Artificial Intelligence and Machine Learning.
Method variables
The variables in this category help a researcher understand the problem being solved, the goal of the proposed method, and the methodology used to achieve it.
- Problem = The problem the research aims to solve is specified
- Objective = The goal of the research is indicated
- Contribution/Research Questions = The contributions or research questions are stated
- Research Methodology = The research methodology (the means by which the objective is achieved) is described
- Pseudo-code = Pseudo-code is included in the paper (does not apply if the source code is provided)
- Parameters/Threshold description = The parameters used by the algorithms are described
Data variables
The Data variables describe the datasets used in a paper.
- Experiment Data = At least one openly available dataset is provided, or the means to generate the data are described
- Ground Truth/Gold Standard = The ground truth is given, or the means to create it are indicated
- Data Version(*) = The dataset version is provided
- Parameters to create/export data(*) = The parameters to create/export the data are specified
Experiment variables
The Experiment variables describe how experiments are conducted, including information about the environment and the source code.
- Source Code = The code is open-sourced and available
- External Source Code(*) = The source code is not mentioned in the paper, but external resources are available upon search
- Experiment Source Code = The URL to the open-source experiment code is given, or the experimental pipeline exists in the source code
- Experiment Setup = Details about how the experiments were conducted are indicated, such as which parts of the solution are tested, the configuration used, and the metrics reported
- Parameters/Thresholds Value(*) = The values for the parameters are provided
- Hardware Specification = The description of the machine(s) used for experiments is present
- Runtime = How long the algorithm runs on the specified hardware is indicated
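To make the checklist concrete, the snippet below sketches one possible encoding of a filled-in checklist for a single paper. This is an illustrative Python sketch of our own, not part of the survey: the keys mirror the variable names above, and the True/False values are hypothetical.

```python
# Hypothetical encoding of a filled-in checklist for one paper.
# Keys mirror the variable names above; values are illustrative only.
# Variables marked (*) in the lists above are flagged in comments.
CHECKLIST = {
    "method": {
        "problem": True,
        "objective": True,
        "contribution_research_questions": True,
        "research_methodology": True,
        "pseudo_code": False,  # does not apply if source code is provided
        "parameters_description": True,
    },
    "data": {
        "experiment_data": True,
        "ground_truth": True,
        "data_version": False,              # (*)
        "data_creation_parameters": False,  # (*)
    },
    "experiment": {
        "source_code": True,
        "external_source_code": False,      # (*)
        "experiment_source_code": True,
        "experiment_setup": True,
        "parameter_values": True,           # (*)
        "hardware_specification": False,
        "runtime": False,
    },
}
```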
Metric
The proposed metric is purely an indication of how many details are available and how many are missing. For each category, compute the percentage of variables marked True (present) and False (missing).
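As a minimal sketch of this computation, assuming the hypothetical CHECKLIST encoding above, the per-category percentages could be derived as follows (the function name is ours, not the paper’s):

```python
def reproducibility_scores(checklist):
    """For each category, return the percentage of variables marked True."""
    return {
        category: 100.0 * sum(variables.values()) / len(variables)
        for category, variables in checklist.items()
    }

# For the hypothetical CHECKLIST above this prints
# {'method': 83.3, 'data': 50.0, 'experiment': 57.1}.
scores = reproducibility_scores(CHECKLIST)
print({category: round(score, 1) for category, score in scores.items()})
```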
How to move forward
- Create workshops and challenges about reproducibility.
- Collaborate with open-source software communities.
- Invest effort in hosting data and artifacts in long-term storage, as we do for papers.
- Help change the culture! Use the checklist to publicly show the reproducibility level of your paper, and encourage PhD students to make their artifacts openly available and reproducible.