Valentine is a Python package for capturing potential relationships among columns of different tabular datasets, given as pandas DataFrames. It implements several schema- and instance-based matching algorithms behind a single, uniform API, and ships with evaluation metrics so you can measure match quality against a ground truth.
Installation
Requires Python >=3.10, <3.15.
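If you are unsure whether your interpreter falls inside that range, a small check like the one below can help. It is only a sketch: `is_supported` is a hypothetical helper written for this page, not part of Valentine, and the bounds are copied from the requirement stated above.

```python
import sys


def is_supported(version=sys.version_info):
    """Return True if the (major, minor) version is >=3.10 and <3.15."""
    # Compare only major and minor; patch releases do not affect the range.
    return (3, 10) <= tuple(version[:2]) < (3, 15)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if is_supported() else "outside the supported range"
    print(f"Python {current} is {status}")
```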
A 30-second taste
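import pandas as pd

from valentine import valentine_match
from valentine.algorithms import Coma

# Load the two tables whose columns should be matched
df1 = pd.read_csv("source_candidates.csv")
df2 = pd.read_csv("target_candidates.csv")

# Run the Coma matcher, also taking instance data into account
matches = valentine_match([df1, df2], Coma(use_instances=True))

# Print each candidate column pair with its similarity score
for pair, score in matches.items():
    print(f"{pair.source_column} <-> {pair.target_column}: {score:.3f}")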
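To give a feel for what measuring match quality against a ground truth involves, here is a minimal, library-free sketch of precision and recall over scored column pairs. It does not use Valentine's own metrics API: the column names, scores, threshold, and the `precision_recall` helper are all hypothetical, and plain tuples stand in for the match objects returned above.

```python
# Hypothetical scores keyed by (source_column, target_column).
# Plain tuples stand in for Valentine's match objects, which are not used here.
scores = {
    ("emp_name", "employee_name"): 0.92,
    ("emp_id", "employee_id"): 0.88,
    ("dept", "employee_name"): 0.41,
    ("dept", "department"): 0.35,
}

# Hypothetical ground truth: the column pairs that truly correspond.
ground_truth = {
    ("emp_name", "employee_name"),
    ("emp_id", "employee_id"),
    ("dept", "department"),
}


def precision_recall(scores, ground_truth, threshold):
    """Treat pairs scoring at or above `threshold` as predicted matches."""
    predicted = {pair for pair, score in scores.items() if score >= threshold}
    true_positives = len(predicted & ground_truth)
    # Guard against division by zero when nothing is predicted or expected.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


precision, recall = precision_recall(scores, ground_truth, threshold=0.4)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.67 recall=0.67
```

Raising the threshold in this toy example trades recall for precision: at 0.9 only one pair is predicted, and it is correct, so precision is 1.0 while recall drops to one third.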
Ready for more? Head over to Getting started, or jump straight to the API reference.
Research
Valentine started as a research project at Delft Data and is based on the ICDE 2021 paper. See the Research page for the papers behind the package, the algorithms it implements, and citation info.