Getting started

Installation

Valentine is published on PyPI and installs with a single pip command. It requires Python >=3.10, <3.15.

pip install valentine
uv add valentine
poetry add valentine
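
Before installing, it can help to confirm that the interpreter falls inside the supported range. The following is a minimal sketch using only the standard library; the helper name `python_supported` is hypothetical and not part of Valentine:

```python
import sys

# Supported range quoted above: Python >=3.10, <3.15.
MIN_VERSION = (3, 10)
MAX_VERSION_EXCLUSIVE = (3, 15)


def python_supported(version=None):
    """Return True if (major, minor) lies inside the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


# Report on the interpreter running this script.
print(sys.version.split()[0], "supported:", python_supported())
```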

Polars support

To use Polars DataFrames, install the optional polars extra:

pip install "valentine[polars]"
uv add "valentine[polars]"
poetry add valentine -E polars
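
To check whether the optional dependency is actually available in the current environment, something like the following should work. It relies only on the standard library's `importlib.util.find_spec`; the helper name `is_installed` is an illustrative assumption, not a Valentine function:

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return find_spec(module_name) is not None


# "polars" is the import name of the Polars package.
print("polars available:", is_installed("polars"))
```

The same check applies to any other optional dependency, as long as you pass its import name rather than its PyPI name.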

Sentence-transformer embeddings

To use the embedding-based variant of JaccardDistanceMatcher, where value "equality" between two columns is decided by cosine similarity of sentence-transformer embeddings instead of a string distance like Levenshtein, install the optional embeddings extra:

pip install "valentine[embeddings]"
uv add "valentine[embeddings]"
poetry add valentine -E embeddings

This pulls in sentence-transformers, which itself depends on torch.
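
To make the idea of embedding-based "equality" concrete, here is a sketch in plain Python. The vectors are toy, hand-written stand-ins for sentence-transformer embeddings, and both the 0.8 threshold and the function names are assumptions for illustration, not Valentine's defaults or API:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_u * norm_v)


def values_match(u, v, threshold=0.8):
    """Treat two values as 'equal' when their embeddings are close enough."""
    return cosine_similarity(u, v) >= threshold


# Toy 3-dimensional "embeddings" (real models produce hundreds of dimensions).
nyc = [0.9, 0.1, 0.0]
new_york = [0.8, 0.2, 0.1]
banana = [0.0, 0.1, 0.9]

print(values_match(nyc, new_york))  # similar direction
print(values_match(nyc, banana))    # nearly orthogonal
```

The contrast with a string distance is that the two values need not share any characters, only a nearby position in embedding space.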

For local development, clone the repo and install in editable mode:

git clone https://github.com/delftdata/valentine
cd valentine
pip install -e ".[dev]"

Your first match

The single entry point for matching is valentine_match. It takes an iterable of DataFrames (pandas or Polars) and a matcher instance, and returns a MatcherResults mapping — see the Matcher results guide for everything you can do with it.

With pandas:

import pandas as pd
from valentine import valentine_match
from valentine.algorithms import Coma

df1 = pd.read_csv("source_candidates.csv")
df2 = pd.read_csv("target_candidates.csv")

matcher = Coma(use_instances=True)
matches = valentine_match([df1, df2], matcher)

for pair, score in matches.items():
    print(f"{pair.source_column} <-> {pair.target_column}: {score:.3f}")

With Polars:

import polars as pl
from valentine import valentine_match
from valentine.algorithms import Coma

df1 = pl.read_csv("source_candidates.csv")
df2 = pl.read_csv("target_candidates.csv")

matcher = Coma(use_instances=True)
matches = valentine_match([df1, df2], matcher)

for pair, score in matches.items():
    print(f"{pair.source_column} <-> {pair.target_column}: {score:.3f}")

Mixing pandas and Polars:

import pandas as pd
import polars as pl
from valentine import valentine_match
from valentine.algorithms import Coma

df_pandas = pd.read_csv("source_candidates.csv")
df_polars = pl.read_csv("target_candidates.csv")

matcher = Coma(use_instances=True)
matches = valentine_match([df_pandas, df_polars], matcher)

for pair, score in matches.items():
    print(f"{pair.source_column} <-> {pair.target_column}: {score:.3f}")
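
Because the result behaves like a mapping from column pairs to scores, ordinary dictionary idioms apply. The sketch below uses a plain `dict` and a `namedtuple` as stand-ins so that it runs without Valentine installed. The four attribute names come from this guide, but the field order, the scores, and the helper `top_matches` are made up for illustration:

```python
from collections import namedtuple

# Stand-in for Valentine's ColumnPair (field order is an assumption).
ColumnPair = namedtuple(
    "ColumnPair",
    ["source_table", "source_column", "target_table", "target_column"],
)

# Stand-in for a MatcherResults mapping, with invented scores.
matches = {
    ColumnPair("aaa", "emp_id", "bbb", "employee_number"): 0.91,
    ColumnPair("aaa", "fname", "bbb", "first_name"): 0.78,
    ColumnPair("aaa", "dept", "bbb", "last_name"): 0.12,
}


def top_matches(results, min_score):
    """Keep pairs scoring at least min_score, best first."""
    kept = [(pair, score) for pair, score in results.items() if score >= min_score]
    return sorted(kept, key=lambda item: item[1], reverse=True)


for pair, score in top_matches(matches, 0.5):
    print(f"{pair.source_column} <-> {pair.target_column}: {score:.3f}")
```

The real results object may already offer richer helpers for this; the Matcher results guide is the place to check.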

Table names

Each ColumnPair key in the results carries both a source_table and a target_table. These default to "aaa", "bbb", "ccc", and so on: low-similarity names that won't bias schema-based matchers. Pass df_names=["sales", "orders", ...] to set your own.
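
The naming pattern can be sketched as follows. This is only an illustration of the "aaa", "bbb", "ccc" sequence described above, not Valentine's implementation; the function name is hypothetical, and the sketch deliberately stops at 26 tables because this guide does not say how the sequence continues after "zzz":

```python
import string


def default_table_names(count):
    """Generate 'aaa', 'bbb', 'ccc', ... for up to 26 tables."""
    if count > len(string.ascii_lowercase):
        raise ValueError("this sketch only covers up to 26 tables")
    return [letter * 3 for letter in string.ascii_lowercase[:count]]


print(default_table_names(3))
```

Names built from a single repeated letter share no characters with one another, which is one plausible reason they score low under name-based similarity.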

Matching many DataFrames

Pass any iterable of DataFrames (pandas, Polars, or mixed) — list, tuple, generator — and Valentine computes all unique pairs:

matches = valentine_match(
    [sales_df, orders_df, products_df],
    Coma(),
    df_names=["sales", "orders", "products"],
)
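
"All unique pairs" means each unordered combination of two tables is considered once, so n tables produce n(n-1)/2 pairs. A small standard-library sketch of that enumeration follows; which table ends up as source and which as target within each pair is an assumption here, not something this guide specifies:

```python
from itertools import combinations


def unique_pairs(names):
    """Every unordered pair of table names, each listed once."""
    return list(combinations(names, 2))


pairs = unique_pairs(["sales", "orders", "products"])
for left, right in pairs:
    print(left, "<->", right)
```

Three tables give three pairs, four give six, and ten give forty-five, so the amount of pairwise work grows quadratically with the number of DataFrames.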

Each matcher decides for itself how to handle the batch. Algorithms that benefit from a holistic view of all tables (Coma's TF-IDF corpus, SimilarityFlooding's IDF weights, DistributionBased's global ranks) override get_matches_batch so their statistics reflect the entire input rather than just the current pair.
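
To see why corpus-wide statistics can differ from per-pair ones, consider a generic inverse document frequency (IDF) computed over column names. This is a textbook illustration with invented tables, not the computation Coma or SimilarityFlooding actually performs:

```python
import math


def idf(token, columns):
    """IDF of a token, treating each column name as one document."""
    containing = sum(1 for name in columns if token in name.split("_"))
    if containing == 0:
        return 0.0
    return math.log(len(columns) / containing)


# Invented column names for three tables.
sales = ["order_id", "amount"]
orders = ["order_id", "order_date"]
products = ["product_id", "name"]

pair_only = sales + orders
whole_batch = sales + orders + products

# "order" occurs in 3 of 4 names within the pair, but 3 of 6 across the batch.
print(round(idf("order", pair_only), 3))
print(round(idf("order", whole_batch), 3))
```

Within the sales/orders pair the token "order" looks almost uninformative, while across the whole batch it is more distinctive. A weight computed per pair would therefore change depending on which two tables happen to be compared.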

Picking a matcher

Valentine ships with five matching algorithms covering both schema- and instance-based matching:

| Matcher | Type | Good at |
| --- | --- | --- |
| Coma | Schema + Instance | General-purpose, interpretable, well-tuned |
| Cupid | Schema only | Tree/linguistic similarity |
| DistributionBased | Instance only | Numeric & categorical value distributions |
| JaccardDistanceMatcher | Instance only | Tversky / Jaccard / containment on value sets; supports exact, fuzzy (rapidfuzz), and embedding-based value matching |
| SimilarityFlooding | Schema only | Graph-based fixpoint propagation |

See Matchers for the conceptual guide, or jump straight to the API reference for parameter defaults.
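
The set measures mentioned for JaccardDistanceMatcher are closely related: Jaccard and containment are both special cases of the Tversky index. The sketch below uses the standard textbook definitions with exact value equality. The parameter names `alpha` and `beta` follow the usual formula and are not necessarily the names Valentine uses:

```python
def tversky(a, b, alpha=1.0, beta=1.0):
    """Tversky index: |A & B| / (|A & B| + alpha*|A - B| + beta*|B - A|)."""
    a, b = set(a), set(b)
    common = len(a & b)
    denominator = common + alpha * len(a - b) + beta * len(b - a)
    return common / denominator if denominator else 0.0


def jaccard(a, b):
    """alpha = beta = 1 gives |A & B| / |A | B|."""
    return tversky(a, b, alpha=1.0, beta=1.0)


def containment(a, b):
    """alpha = 1, beta = 0 gives |A & B| / |A|: how much of A lies in B."""
    return tversky(a, b, alpha=1.0, beta=0.0)


left = {"NL", "DE", "FR"}
right = {"NL", "DE", "FR", "ES", "IT", "PT"}

print(jaccard(left, right))
print(containment(left, right))
print(containment(right, left))
```

Containment is asymmetric, which makes it useful when one column is expected to be a subset of another, such as a foreign key drawn from a larger reference column.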

Evaluating a match

If you have a ground truth — a list of expected column pairs — Valentine computes Precision, Recall, F1 and other metrics in one call:

ground_truth = [
    ("emp_id", "employee_number"),
    ("fname",  "first_name"),
    ("lname",  "last_name"),
    ("dept",   "department"),
]

metrics = matches.get_metrics(ground_truth)
print(metrics)

Full details are in Evaluation metrics, with the method signature documented under MatcherResults.get_metrics.
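
For intuition about what such metrics measure, here are the textbook definitions of Precision, Recall and F1 over sets of column-name pairs. This is a standalone sketch with an invented prediction list; Valentine's own get_metrics may differ in details such as how it filters or ranks matches before scoring:

```python
def precision_recall_f1(predicted, expected):
    """Textbook precision, recall and F1 over two collections of pairs."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


ground_truth = [
    ("emp_id", "employee_number"),
    ("fname", "first_name"),
    ("lname", "last_name"),
    ("dept", "department"),
]

# Invented predictions: two correct pairs and one wrong one.
predicted = [
    ("emp_id", "employee_number"),
    ("fname", "first_name"),
    ("dept", "last_name"),
]

precision, recall, f1 = precision_recall_f1(predicted, ground_truth)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the three predictions are correct (precision 2/3) and two of the four expected pairs are found (recall 1/2), giving an F1 of 4/7.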