Valentine: Evaluating Matching Techniques for Dataset Discovery

Webpage containing information on Delft-Data's Valentine paper

Description

Data scientists today search large data lakes to discover and integrate datasets. To bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use. Nowadays, schema matching serves as a building block for indicating and ranking inter-dataset relationships. Surprisingly, although a discovery method’s success relies highly on the quality of the underlying matching algorithms, the latest discovery methods employ existing schema matching algorithms in an ad-hoc fashion due to the lack of openly available datasets with ground truth, reference method implementations, and evaluation metrics.

We aim to rectify the problem of evaluating the effectiveness and efficiency of schema matching methods for the specific needs of dataset discovery. To this end, we propose Valentine, an extensible open-source experiment suite to execute and organize large-scale automated matching experiments on tabular data. Valentine includes implementations of seminal schema matching methods that we either implemented from scratch (due to absence of open source code) or imported from open repositories. The contributions of Valentine are: i) the definition of four schema matching scenarios as encountered in dataset discovery methods, ii) a principled dataset fabrication process tailored to the scope of dataset discovery methods and iii) the most comprehensive evaluation of schema matching techniques to date, offering insight on the strengths and weaknesses of existing techniques, that can serve as a guide for employing schema matching in future dataset discovery methods.

Authors

Christos Koutras
TU Delft
Georgios Siachamis
TU Delft
Andra Ionescu
TU Delft
Kyriakos Psarakis
TU Delft
Jerry Brons
ING
Marios Fragkoulis
TU Delft
Christoph Lofi
TU Delft
Angela Bonifati
Lyon 1 University
Asterios Katsifodimos
TU Delft

Valentine Methods

The schema matching methods included in Valentine are the following:

  1. COMA: Python wrapper around COMA 3.0 Community Edition.
  2. Cupid: Contains the Python implementation of the paper “Generic Schema Matching with Cupid” (VLDB 2001).
  3. Distribution-based: Contains the Python implementation of the paper “Automatic Discovery of Attributes in Relational Databases” (SIGMOD 2011).
  4. EmbDI: Contains the code of EmbDI provided by the authors in their GitLab repository and the paper “Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks” (SIGMOD 2020).
  5. Jaccard Levenshtein: Contains our own baseline that uses Jaccard Similarity between columns to assess their correspondence score, enhanced by Levenshtein Distance.
  6. SemProp: Contains the code of the method discussed in “Seeping semantics: Linking datasets using word embeddings for data discovery” (ICDE 2018), which is provided in the code repository of the paper “Aurum: A Data Discovery System” (ICDE 2018).
  7. Similarity Flooding: Contains the Python implementation of the paper “Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching” (ICDE 2002).
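
To make the idea behind the Jaccard Levenshtein baseline more concrete, here is a minimal sketch of one way to combine the two measures. It is not the code shipped with Valentine: the function names (`levenshtein`, `similar`, `fuzzy_jaccard`, `rank_column_pairs`), the normalized-distance threshold of 0.25, and the toy tables are all assumptions made for illustration. Two values count as equal when their normalized edit distance is at or below the threshold, and the Jaccard score is computed over that relaxed notion of equality.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similar(a: str, b: str, threshold: float) -> bool:
    """Hypothetical helper: values match if normalized edit distance <= threshold."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return levenshtein(a, b) / longest <= threshold


def fuzzy_jaccard(col_a, col_b, threshold=0.25):
    """Jaccard similarity of two columns under fuzzy value equality (assumed scheme)."""
    values_a = {str(v) for v in col_a}
    values_b = {str(v) for v in col_b}
    if not values_a or not values_b:
        return 0.0
    matched_a = sum(any(similar(a, b, threshold) for b in values_b) for a in values_a)
    matched_b = sum(any(similar(a, b, threshold) for a in values_a) for b in values_b)
    # Taking the smaller count keeps the score within [0, 1].
    overlap = min(matched_a, matched_b)
    return overlap / (len(values_a) + len(values_b) - overlap)


def rank_column_pairs(table_a, table_b, threshold=0.25):
    """Score every column pair and return the pairs sorted by descending score."""
    scores = {
        (name_a, name_b): fuzzy_jaccard(col_a, col_b, threshold)
        for name_a, col_a in table_a.items()
        for name_b, col_b in table_b.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy tables (made up for this example): column name -> list of values.
source = {"city": ["Amsterdam", "Delft", "Rotterdam"], "code": ["AMS", "DFT", "RTM"]}
target = {"town": ["Amsterdam", "Delfd", "Utrecht"], "id": ["1", "2", "3"]}
ranking = rank_column_pairs(source, target)
print(ranking[0])
```

With a threshold of 0 the score reduces to plain Jaccard similarity over exact values; raising the threshold lets near-duplicates such as "Delft" and "Delfd" count as a shared value.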

Datasets

Valentine offers a wide spectrum of dataset pairs with ground truth containing valid matches among their corresponding columns. These dataset pairs have been fabricated by Valentine's dataset relatedness scenario generator. In our paper, we classify relatedness of two datasets into the following four categories: i) Unionable datasets, ii) View-Unionable datasets, iii) Joinable datasets, and iv) Semantically-Joinable datasets.

In the table below, we specify the dataset sources and links to the corresponding fabricated dataset pairs, with respect to each relatedness scenario. We also specify the minimum and maximum number of rows and columns of the fabricated datasets.

| Dataset Source | #Pairs | #Rows | #Columns | Links |
| --- | --- | --- | --- | --- |
| TPC-DI | 180 | 7492 - 14983 | 11 - 22 | Unionable, View-Unionable, Joinable, Semantically-Joinable |
| Open Data | 180 | 11628 - 23255 | 26 - 51 | Unionable, View-Unionable, Joinable, Semantically-Joinable |
| ChEMBL | 180 | 7500 - 15000 | 12 - 23 | Unionable, View-Unionable, Joinable, Semantically-Joinable |
| WikiData | 4 | 5423 - 10846 | 13 - 20 | Unionable, View-Unionable, Joinable, Semantically-Joinable |
| Magellan Data | 7 | 864 - 131099 | 3 - 7 | Unionable |
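
Because every dataset pair comes with ground truth listing the valid column matches, a matcher's ranked output can be scored against it. The sketch below shows one simple way to do so, using precision and recall over the top-k ranked pairs. The choice of metric, the function name `precision_recall_at_k`, and the example column names are assumptions for illustration; this page does not specify the metrics or the ground-truth file layout used by Valentine.

```python
def precision_recall_at_k(ranked_matches, ground_truth, k):
    """Precision and recall of the top-k predicted column pairs.

    ranked_matches: list of ((source_column, target_column), score),
                    sorted by descending score.
    ground_truth:   iterable of (source_column, target_column) pairs
                    that are known to be valid matches.
    """
    predicted = {pair for pair, _score in ranked_matches[:k]}
    truth = set(ground_truth)
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Made-up matcher output and ground truth for a single dataset pair.
ranked = [
    (("id", "cust_id"), 0.9),
    (("name", "full_name"), 0.8),
    (("city", "name"), 0.4),
    (("city", "town"), 0.3),
]
truth = [("id", "cust_id"), ("name", "full_name"), ("city", "town")]

# Cutting the ranking at the ground-truth size makes precision and recall coincide.
precision, recall = precision_recall_at_k(ranked, truth, k=len(truth))
print(precision, recall)
```

Setting k to the number of ground-truth matches is a convenient default when comparing methods that return full rankings rather than a thresholded set of matches.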

Filename Conventions

The filename conventions we use for the above datasets are explained as follows:

Repositories

Valentine Paper

Cite Valentine

@misc{koutras2020valentine,
      title={Valentine: Evaluating Matching Techniques for Dataset Discovery}, 
      author={Christos Koutras and George Siachamis and Andra Ionescu and Kyriakos Psarakis and Jerry Brons and Marios Fragkoulis and Christoph Lofi and Angela Bonifati and Asterios Katsifodimos},
      year={2020},
      eprint={2010.07386},
      archivePrefix={arXiv},
      primaryClass={cs.DB}
}