Getting started

Installation

Valentine is published on PyPI. It requires Python 3.10 or newer and is tested up to 3.14. Install it with whichever package manager you use:

pip install valentine
uv add valentine
poetry add valentine

For local development, clone the repo and install in editable mode:

git clone https://github.com/delftdata/valentine
cd valentine
pip install -e ".[dev]"

Your first match

The single entry point for matching is valentine_match. It takes an iterable of DataFrames and a matcher instance, and returns a MatcherResults mapping — see the Matcher results guide for everything you can do with it.

import pandas as pd
from valentine import valentine_match
from valentine.algorithms import Coma

df1 = pd.read_csv("source_candidates.csv")
df2 = pd.read_csv("target_candidates.csv")

matcher = Coma(use_instances=True)
matches = valentine_match([df1, df2], matcher)

for pair, score in matches.items():
    print(f"{pair.source_column} <-> {pair.target_column}: {score:.3f}")

Table names

Each ColumnPair key in the results carries both a source_table and a target_table. By default these are "aaa", "bbb", "ccc", …, which are low-similarity names that won't bias schema-based matchers. Pass df_names=["sales", "orders", ...] to set your own.
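To illustrate why low-similarity placeholder names are a safe default, the sketch below compares table names with Python's difflib.SequenceMatcher. This is only a stand-in for a name-based similarity signal and is not how Valentine's matchers score names; the helper name_similarity and the table name "sales_archive" are hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1].

    A stand-in for a name-based signal, not Valentine's own scoring.
    """
    return SequenceMatcher(None, a, b).ratio()


# The default placeholder names share no characters with each other.
placeholder_score = name_similarity("aaa", "bbb")

# Hypothetical user-supplied names that happen to overlap.
overlapping_score = name_similarity("sales", "sales_archive")

print(f"aaa vs bbb: {placeholder_score:.2f}")
print(f"sales vs sales_archive: {overlapping_score:.2f}")
```

Names that overlap, like the second pair, could nudge a schema-based matcher toward pairing the two tables' columns, so choose df_names deliberately when you set them.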

Matching many DataFrames

Pass any iterable of DataFrames — list, tuple, generator — and Valentine computes all unique pairs:

matches = valentine_match(
    [sales_df, orders_df, products_df],
    Coma(),
    df_names=["sales", "orders", "products"],
)

Each matcher decides for itself how to handle the batch. Algorithms that benefit from a holistic view of all tables (Coma's TF-IDF corpus, SimilarityFlooding's IDF weights, DistributionBased's global ranks) override get_matches_batch so their statistics reflect the entire input rather than just the current pair.
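To make "all unique pairs" concrete, here is a minimal sketch using itertools.combinations on the table names from the example. It shows only which pairs exist for a given input; whether Valentine enumerates them in this order, or in this way internally, is an assumption.

```python
from itertools import combinations

# Table names from the example above.
df_names = ["sales", "orders", "products"]

# Every unordered pair of distinct tables appears exactly once.
table_pairs = list(combinations(df_names, 2))

for source_table, target_table in table_pairs:
    print(f"{source_table} <-> {target_table}")

# n tables produce n * (n - 1) / 2 pairs, so the work grows quadratically.
n = len(df_names)
print(f"{n} tables -> {n * (n - 1) // 2} pairs")
```

Because the pair count grows quadratically with the number of tables, large batches can take noticeably longer than a single two-table match.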

Picking a matcher

Valentine ships with five matching algorithms covering both schema- and instance-based matching:

Matcher                 Type               Good at
Coma                    Schema + Instance  General-purpose, interpretable, well-tuned
Cupid                   Schema only        Tree/linguistic similarity
DistributionBased       Instance only      Numeric & categorical value distributions
JaccardDistanceMatcher  Instance only      Exact/fuzzy Jaccard on value sets
SimilarityFlooding      Schema only        Graph-based fixpoint propagation
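As a rough intuition for what "Jaccard on value sets" means, the sketch below computes the exact-match Jaccard similarity between the distinct values of two columns. It is not JaccardDistanceMatcher's implementation: the function name and sample values are hypothetical, and the fuzzy variant is not shown.

```python
def jaccard_similarity(left, right) -> float:
    """Size of the intersection over size of the union of distinct values."""
    left_set, right_set = set(left), set(right)
    union = left_set | right_set
    if not union:
        # Two empty columns: treat as no evidence of a match.
        return 0.0
    return len(left_set & right_set) / len(union)


# Hypothetical column values; duplicates collapse into the set.
source_values = ["NL", "DE", "FR", "NL"]
target_values = ["DE", "FR", "BE"]

score = jaccard_similarity(source_values, target_values)
print(f"Jaccard similarity: {score:.2f}")
```

Instance-only matchers of this kind ignore column names entirely, so they can pair columns whose names differ as long as their values overlap.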

See Matchers for the conceptual guide, or jump straight to the API reference for parameter defaults.

Evaluating a match

If you have a ground truth — a list of expected column pairs — Valentine computes Precision, Recall, F1 and other metrics in one call:

ground_truth = [
    ("emp_id", "employee_number"),
    ("fname",  "first_name"),
    ("lname",  "last_name"),
    ("dept",   "department"),
]

metrics = matches.get_metrics(ground_truth)
print(metrics)

Full details are in Evaluation metrics, with the method signature documented under MatcherResults.get_metrics.
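For intuition about what these metrics measure, here is a minimal sketch that computes Precision, Recall and F1 by hand from sets of column pairs. It uses the textbook definitions and is not the code behind get_metrics; the function name and the predicted list are hypothetical.

```python
def precision_recall_f1(predicted, ground_truth):
    """Textbook set-based metrics over (source_column, target_column) pairs."""
    predicted_set, truth_set = set(predicted), set(ground_truth)
    true_positives = len(predicted_set & truth_set)

    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(truth_set) if truth_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


ground_truth = [
    ("emp_id", "employee_number"),
    ("fname", "first_name"),
    ("lname", "last_name"),
    ("dept", "department"),
]

# Hypothetical matcher output: two correct pairs and one wrong pair.
predicted = [
    ("emp_id", "employee_number"),
    ("fname", "first_name"),
    ("lname", "department"),
]

precision, recall, f1 = precision_recall_f1(predicted, ground_truth)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of three predictions are correct and two of four expected pairs are found, so precision is higher than recall and F1 sits between the two.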