Valentine: Evaluating Matching Techniques for Dataset Discovery

Webpage containing information on Delft-Data's Valentine paper

Description

Valentine is an extensible open-source product for executing and organizing large-scale automated matching processes on tabular data, either for experimentation or for deployment on real-world data. Valentine includes implementations of seminal schema matching methods, which we either implemented from scratch (where no open-source code was available) or imported from open repositories. To enable proper evaluation, Valentine offers a fabricator for creating evaluation dataset pairs that respect specific semantics. It also comes with a GUI that makes it easier to: i) evaluate schema matching methods on dataset pairs that respect specific relatedness semantics (joinable/unionable), and ii) scale state-of-the-art (SotA) methods to holistic matching in big data repositories or data lakes, in order to find relationships among disparate tabular data.

Authors

Christos Koutras
TU Delft
Georgios Siachamis
TU Delft
Andra Ionescu
TU Delft
Kyriakos Psarakis
TU Delft
Jerry Brons
ING
Marios Fragkoulis
TU Delft
Christoph Lofi
TU Delft
Angela Bonifati
Lyon 1 University
Asterios Katsifodimos
TU Delft

Valentine Methods

The schema matching methods included in Valentine are the following:

  1. COMA: Python wrapper around COMA 3.0 Community Edition.
  2. Cupid: Contains the Python implementation of the paper “Generic Schema Matching with Cupid” (VLDB 2001).
  3. Distribution-based: Contains the Python implementation of the paper “Automatic Discovery of Attributes in Relational Databases” (SIGMOD 2011).
  4. EmbDI: Contains the code of EmbDI, as provided by the authors in their GitLab repository and described in the paper “Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks” (SIGMOD 2020).
  5. Jaccard Levenshtein: Contains our own baseline that uses Jaccard Similarity between columns to assess their correspondence score, enhanced by Levenshtein Distance.
  6. SemProp: Contains the code of the method discussed in “Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery” (ICDE 2018), which is provided in the code repository of the paper “Aurum: A Data Discovery System” (ICDE 2018).
  7. Similarity Flooding: Contains the Python implementation of the paper “Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching” (ICDE 2002).
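
To give a feel for the Jaccard Levenshtein baseline (item 5), here is a minimal sketch of one way to combine the two measures. It is not Valentine's implementation: the function names (`levenshtein`, `normalized_distance`, `fuzzy_jaccard`), the `threshold` parameter, and the choice to count two values as "equal" when their normalized edit distance is at most `threshold` are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the length of the longer string."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


def fuzzy_jaccard(col_a, col_b, threshold: float = 0.2) -> float:
    """Jaccard similarity of two columns where values count as shared
    if they are within `threshold` normalized edit distance (assumption)."""
    set_a = {str(v) for v in col_a}
    set_b = {str(v) for v in col_b}
    if not set_a or not set_b:
        return 0.0
    matched_a = sum(
        any(normalized_distance(a, b) <= threshold for b in set_b) for a in set_a
    )
    matched_b = sum(
        any(normalized_distance(b, a) <= threshold for a in set_a) for b in set_b
    )
    # Taking the minimum keeps the score symmetric and bounded by 1.
    intersection = min(matched_a, matched_b)
    union = len(set_a) + len(set_b) - intersection
    return intersection / union


if __name__ == "__main__":
    cities_a = ["Amsterdam", "Rotterdam", "Delft"]
    cities_b = ["amsterdam", "Roterdam", "Utrecht"]
    print(fuzzy_jaccard(cities_a, cities_b))                 # tolerant matching
    print(fuzzy_jaccard(cities_a, cities_b, threshold=0.0))  # exact Jaccard
```

With `threshold=0.0` the score reduces to plain Jaccard similarity over distinct values; raising the threshold lets near-duplicates such as misspellings contribute to the overlap.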

Datasets

Valentine offers a wide spectrum of dataset pairs with ground truth containing valid matches among their corresponding columns. These dataset pairs have been fabricated by Valentine's dataset relatedness scenario generator. In our paper, we classify the relatedness of two datasets into the following four categories: i) Unionable datasets, ii) View-Unionable datasets, iii) Joinable datasets, and iv) Semantically-Joinable datasets.
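
As a rough illustration of how a dataset pair with ground truth could be fabricated from a single table, the sketch below splits a table horizontally (a unionable-style pair that shares all columns) and vertically (a joinable-style pair that shares only key columns). This is not Valentine's fabricator: the function names, the `row_overlap` and `shared_columns` parameters, and the exact splitting rules are assumptions chosen for illustration, and the view-unionable and semantically-joinable scenarios are not covered.

```python
def horizontal_split(rows, row_overlap=0.5):
    """Split rows into two tables that keep every column.

    `row_overlap` (hypothetical parameter) is the fraction of the second
    half's rows that is also copied into the first table.
    Returns (left, right, ground_truth).
    """
    half = len(rows) // 2
    extra = int(half * row_overlap)
    left = rows[: half + extra]
    right = rows[half:]
    columns = list(rows[0]) if rows else []
    # Every column matches its namesake in the other table.
    ground_truth = [(c, c) for c in columns]
    return left, right, ground_truth


def vertical_split(rows, shared_columns):
    """Split columns into two tables that both keep `shared_columns`.

    The remaining columns are divided between the two tables, so only the
    shared (join) columns appear in the ground truth.
    Returns (left, right, ground_truth).
    """
    columns = list(rows[0]) if rows else []
    others = [c for c in columns if c not in shared_columns]
    half = len(others) // 2
    left_cols = list(shared_columns) + others[:half]
    right_cols = list(shared_columns) + others[half:]
    left = [{c: r[c] for c in left_cols} for r in rows]
    right = [{c: r[c] for c in right_cols} for r in rows]
    ground_truth = [(c, c) for c in shared_columns]
    return left, right, ground_truth


if __name__ == "__main__":
    table = [
        {"id": i, "name": f"n{i}", "city": f"c{i}", "age": 20 + i}
        for i in range(10)
    ]
    left, right, truth = horizontal_split(table, row_overlap=0.5)
    print(len(left), len(right), truth)
    left, right, truth = vertical_split(table, ["id"])
    print(list(left[0]), list(right[0]), truth)
```

In both cases the ground truth is a list of column pairs, which is the information a matcher's output would be compared against.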

The datasets used in the paper are hosted on Zenodo with DOI: 10.5281/zenodo.5084605. In the table below, we list the dataset sources and the links to the corresponding fabricated dataset pairs for each relatedness scenario. We also give the minimum and maximum number of rows and columns of the fabricated datasets.

| Dataset Source | #Pairs | #Rows (min - max) | #Columns (min - max) | Links |
|---|---|---|---|---|
| TPC-DI | 180 | 7492 - 14983 | 11 - 22 | Unionable, View-Unionable, Joinable, Semantically-Joinable |
| Open Data | 180 | 11628 - 23255 | 26 - 51 | Unionable, View-Unionable, Joinable, Semantically-Joinable |
| ChEMBL | 180 | 7500 - 15000 | 12 - 23 | Unionable, View-Unionable, Joinable, Semantically-Joinable |
| WikiData | 4 | 5423 - 10846 | 13 - 20 | Unionable, View-Unionable, Joinable, Semantically-Joinable |
| Magellan Data | 7 | 864 - 131099 | 3 - 7 | Unionable |

Filename Conventions

The filename conventions we use for the above datasets are explained as follows:

Repositories

Valentine Papers

ICDE 2021 Presentation Video

ICDE 2021 Presentation

VLDB 2021 Demonstration

Cite Valentine

@inproceedings{koutras2021valentine,
  title={Valentine: Evaluating Matching Techniques for Dataset Discovery},
  author={Koutras, Christos and Siachamis, George and Ionescu, Andra and Psarakis, Kyriakos and Brons, Jerry and Fragkoulis, Marios and Lofi, Christoph and Bonifati, Angela and Katsifodimos, Asterios},
  booktitle={2021 IEEE 37th International Conference on Data Engineering (ICDE)},
  pages={468--479},
  year={2021},
  organization={IEEE}
}