Description
Valentine is an extensible open-source product to execute and organize large-scale automated matching processes on tabular data either for experimentation or deployment in real world data. Valentine includes implementations of seminal schema matching methods that we either implemented from scratch (due to absence of open source code) or imported from open repositories.
To enable proper evaluation, Valentine offers a fabricator for creating evaluation dataset pairs that respect specific semantics.
Finally, Valentine also comes with a GUI that makes it easier than ever to: i) evaluate schema matching methods on dataset pairs that respect specific relatedness semantics (joinable/unionable), and ii) scale SotA methods to holistic matching in big data repositories or data lakes in order to find relationships among disparate tabular data.
Authors
Valentine Methods
The schema matching methods included in Valentine are the following:
- COMA: Python wrapper around COMA 3.0 Community Edition.
- Cupid: Contains the Python implementation of the paper “Generic Schema Matching with Cupid” (VLDB 2001).
- Distribution-based: Contains the python implementation of the paper “Automatic Discovery of Attributes in Relational Databases” (SIGMOD 2011).
- EmbDI: Contains the code of EmbDI provided by the authors in their GitLab repository and the paper “Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks” (SIGMOD 2020).
- Jaccard Levenshtein: Contains our own baseline that uses Jaccard Similarity between columns to assess their correspondence score, enhanced by Levenshtein Distance.
- SemProp: Contains the code of the method discussed in “Seeping semantics: Linking datasets using word embeddings for data discovery” (ICDE 2018), which is provided in the code repository of the paper “Aurum: A Data Discovery System” (ICDE 2018).
- Similarity Flooding: Contains the python implementation of the paper “Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching” (ICDE 2002).
Datasets
Valentine offers a wide spectrum of dataset pairs with ground truth containing valid matches among theri corresponding columns. These dataset pairs have been fabricated by Valentines dataset relatedness scenario generator. In our paper, we classify relatedness of two datasets into the following four categories: i) Unionable datasets, ii) View-Unionable datasets, iii) Joinable datasets, and iv) Semantically-Joinable datasets.
The datasets used in the paper are hosted on Zenodo with DOI: 10.5281/zenodo.5084605. In the table below, we specify the dataset sources and dedicated links to the corresponding fabricated dataset pairs, with respect to each relatedness scenario. We also specify min and max number of rows and columns of the fabricated datasets.
Dataset Source | #Pairs | #Rows | #Columns | Links |
---|---|---|---|---|
TPC-DI | 180 | 7492 - 14983 | 11 - 22 | Unionable, View-Unionable, Joinable, Semantically-Joinable |
Open Data | 180 | 11628 - 23255 | 26 - 51 | Unionable, View-Unionable, Joinable, Semantically-Joinable |
ChEMBL | 180 | 7500 - 15000 | 12 - 23 | Unionable, View-Unionable, Joinable, Semantically-Joinable |
WikiData | 4 | 5423 - 10846 | 13 - 20 | Unionable, View-Unionable, Joinable, Semantically-Joinable |
Magellan Data | 7 | 864 - 131099 | 3 - 7 | Unionable |
Filename Conventions
The filename conventions we use for the above datasets are explained as follows:
- ac and ec mean that the dataset pairs have noisy or verbatim schemata respectively.
- av and ev mean that the dataset pairs have noisy or verbatim instances.
- horizontal_p means that the datasets are derived from a horizontal split of p% row overlap based on the original dataset.
- vertical_p means that the datasets are derived from a vertical split of p% column overlap based on the original dataset.
- both_p1_p2 means that the datasets are derived from both a horizontal slit of p1% row overlap and a vertical split of p2% column overlap based on the original dataset.
Repositories
- https://github.com/delftdata/valentine-system : Contains Valentine system + GUI to easily deploy it for evaluation or holistic matching in data lakes.
- https://github.com/delftdata/valentine : Main repository containing the Valentine framework source code.
- https://github.com/delftdata/valentine-generator : Contains the source code of the dataset generator of Valentine.
- https://github.com/delftdata/valentine-paper-results : Contains detailed experimental results and plots based on the paper’s evaluation.
Valentine Papers
- [ICDE 2021 Proceedings] Valentine: Evaluating Matching Techniques for Dataset Discovery
- [VLDB 2021 Demo] Valentine in Action: Matching Tabular Data at Scale
ICDE 2021 Presentation Video
- ICDE 2021 Presentation by Christos
- VLDB 2021 Demonstration by Kyriakos
Cite Valentine
@inproceedings{koutras2021valentine,
title={Valentine: Evaluating Matching Techniques for Dataset Discovery},
author={Koutras, Christos and Siachamis, George and Ionescu, Andra and Psarakis, Kyriakos and Brons, Jerry and Fragkoulis, Marios and Lofi, Christoph and Bonifati, Angela and Katsifodimos, Asterios},
booktitle={2021 IEEE 37th International Conference on Data Engineering (ICDE)},
pages={468--479},
year={2021},
organization={IEEE}
}