Research¶

Valentine started as a research project at Delft Data, the data-management research group at TU Delft. It was first released alongside the ICDE 2021 paper Valentine: Evaluating Matching Techniques for Dataset Discovery, which introduced both the matching benchmark and the evaluation methodology that the package still implements today.

Overview¶

Valentine is an open-source framework designed to execute large-scale automated matching processes on tabular data. The system implements established schema-matching methodologies and provides tools for evaluation and real-world deployment in data lakes.

The original research project shipped two main capabilities beyond the matching algorithms themselves:

A dataset fabricator — a tool that generates evaluation dataset pairs respecting specific relational semantics (unionable, view-unionable, joinable, semantically-joinable), so that matchers can be compared on workloads with a known ground truth.
A GUI for evaluating schema matching methods — an interactive tool that lets researchers run matchers, inspect results, and compute metrics on the fabricated benchmarks.

Datasets¶

Valentine offers a wide spectrum of dataset pairs with ground truth containing valid matches among their corresponding columns. These dataset pairs have been fabricated by Valentine's dataset relatedness scenario generator. The ICDE 2021 paper classifies relatedness of two datasets into four categories:

Category	Description
Unionable	Tables that describe the same entity and can be stacked vertically.
View-unionable	Tables derived from the same source via different projections/selections — unionable after alignment.
Joinable	Tables that can be combined via a shared key.
Semantically-joinable	Tables whose keys are not literally equal but semantically refer to the same entities.

The datasets used in the paper are hosted on Zenodo with DOI: 10.5281/zenodo.5084605. The table below lists the dataset sources and dedicated links to the corresponding fabricated dataset pairs per relatedness scenario, along with the min/max number of rows and columns of the fabricated datasets.

Dataset Source	#Pairs	#Rows	#Columns	Links
TPC-DI	180	7 492–14 983	11–22	Unionable, View-Unionable, Joinable, Semantically-Joinable
Open Data	180	11 628–23 255	26–51	Unionable, View-Unionable, Joinable, Semantically-Joinable
ChEMBL	180	7 500–15 000	12–23	Unionable, View-Unionable, Joinable, Semantically-Joinable
WikiData	4	5 423–10 846	13–20	Unionable, View-Unionable, Joinable, Semantically-Joinable
Magellan Data	7	864–131 099	3–7	Unionable

Filename conventions¶

The filenames of the fabricated datasets encode the scenario parameters:

ac / ec — dataset pairs with noisy or verbatim schemata, respectively.
av / ev — dataset pairs with noisy or verbatim instances.
horizontal_p — datasets derived from a horizontal split with p% row overlap based on the original dataset.
vertical_p — datasets derived from a vertical split with p% column overlap based on the original dataset.
both_p1_p2 — datasets derived from both a horizontal split (p1% row overlap) and a vertical split (p2% column overlap).

Papers¶

Valentine: Evaluating Matching Techniques for Dataset Discovery¶

The original paper proposes Valentine as an extensible experimental suite for comparing schema matching techniques on dataset-discovery workloads. It formalizes the evaluation protocol (precision, recall, F1 at different cutoffs) and benchmarks COMA, Cupid, Similarity Flooding, Distribution-Based, and Jaccard-based matchers across a range of real-world fabrication scenarios.

Koutras, C., Siachamis, G., Ionescu, A., Psarakis, K., Brons, J., Fragkoulis, M., Lofi, C., Bonifati, A., Katsifodimos, A. Valentine: Evaluating Matching Techniques for Dataset Discovery. ICDE 2021.

Read the paper · ICDE 2021 presentation (Christos Koutras)

BibTeX

@inproceedings{koutras2021valentine,
  title={Valentine: Evaluating Matching Techniques for Dataset Discovery},
  author={Koutras, Christos and Siachamis, George and Ionescu, Andra and
          Psarakis, Kyriakos and Brons, Jerry and Fragkoulis, Marios and
          Lofi, Christoph and Bonifati, Angela and Katsifodimos, Asterios},
  booktitle={2021 IEEE 37th International Conference on Data Engineering (ICDE)},
  pages={468--479},
  year={2021},
  organization={IEEE}
}

Valentine in Action: Matching Tabular Data at Scale¶

A VLDB 2021 demo paper showing Valentine in action on larger, more diverse table collections and introducing the interactive tooling built around the library.

Koutras, C., Psarakis, K., Siachamis, G., Ionescu, A., Fragkoulis, M., Bonifati, A., Katsifodimos, A. Valentine in Action: Matching Tabular Data at Scale. VLDB 2021 (Demo).

Read the paper · VLDB 2021 demonstration (Kyriakos Psarakis)

BibTeX

@article{koutras2021demo,
  title={Valentine in Action: Matching Tabular Data at Scale},
  author={Koutras, Christos and Psarakis, Kyriakos and Siachamis, George and
          Ionescu, Andra and Fragkoulis, Marios and Bonifati, Angela and
          Katsifodimos, Asterios},
  journal={VLDB},
  volume={14},
  number={12},
  pages={2871--2874},
  year={2021},
  publisher={VLDB Endowment}
}

Algorithms & references¶

Valentine ships pure-Python implementations of several well-known schema- matching techniques. The table below links each matcher to the paper it is based on.

Matcher	Paper
`Coma`	Do, H.H., Rahm, E. COMA: A System for Flexible Combination of Schema Matching Approaches. VLDB 2002.
`Cupid`	Madhavan, J., Bernstein, P.A., Rahm, E. Generic Schema Matching with Cupid. VLDB 2001.
`DistributionBased`	Zhang, M., Hadjieleftheriou, M., Ooi, B.C., Procopiuc, C.M., Srivastava, D. Automatic Discovery of Attributes in Relational Databases. SIGMOD 2011.
`JaccardDistanceMatcher`	Baseline using Jaccard similarity over column value sets, with a configurable string distance for element equality.
`SimilarityFlooding`	Melnik, S., Garcia-Molina, H., Rahm, E. Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching. ICDE 2002.

Experimental suite¶

The original experimental suite from the ICDE paper — including the benchmark data generators, the GUI, and the dataset fabricator — is preserved on the v1.1 tag of the repository. Use it if you want to reproduce the paper's numbers exactly; use the current master for new work.

Matchers not in the current package¶

The research suite evaluated seven matching methods in total. Two embedding-based methods were part of the original benchmark but are not maintained in the current Python package. They remain available in the v1.1 snapshot for reproducibility.

Method	Paper
EmbDI	Cappuzzo, R., Papotti, P., Thirumuruganathan, S. Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks. SIGMOD 2020.
SemProp	Fernandez, R.C., Mansour, E., Qahtan, A.A., Elmagarmid, A., Ilyas, I., Madden, S., Ouzzani, M., Stonebraker, M., Tang, N. Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery. ICDE 2018.

Citing Valentine¶

If Valentine is useful in your research, please cite the ICDE paper (and optionally the VLDB demo). The BibTeX entries above are ready to drop into your bibliography.