Matchers

This page is the conceptual guide to Valentine's five matching algorithms: what each one does, when to reach for it, and the trade-offs involved. For constructor signatures, parameter defaults, and validation rules, head straight to the API reference.

Every matcher in Valentine subclasses BaseMatcher and is compatible with the top-level valentine_match API. All five live in valentine.algorithms:

from valentine.algorithms import (
    Coma,
    Cupid,
    DistributionBased,
    JaccardDistanceMatcher,
    SimilarityFlooding,
)

Matcher	Signals	Best for
`Coma`	Schema + instances (optional)	General-purpose first choice. Strong defaults, informative sub-scores.
`Cupid`	Schema only	Nested schemas where column names and structure matter more than data.
`DistributionBased`	Instances only	Matching by value distributions when names are unreliable.
`JaccardDistanceMatcher`	Instances only	Simple, explainable baseline. Useful for sanity checks.
`SimilarityFlooding`	Schema only	Structure-heavy schemas where graph neighbourhoods carry signal.

Which matcher should I pick?

flowchart TD
    A([Start]) --> B{Do you have<br/>column values?}
    B -- "No, schema only" --> C{Schema is<br/>nested or<br/>graph-shaped?}
    B -- "Yes" --> D{Names are<br/>reliable?}

    C -- "Nested / linguistic" --> E([<b>Cupid</b><br/>tree + linguistic])
    C -- "Graph-heavy" --> F([<b>SimilarityFlooding</b><br/>fixpoint propagation])

    D -- "Yes" --> G([<b>Coma</b><br/>schema + instances])
    D -- "Not really" --> H{Need a quick<br/>baseline?}

    H -- "Yes" --> I([<b>JaccardDistanceMatcher</b><br/>set similarity])
    H -- "No, go heavy" --> J([<b>DistributionBased</b><br/>EMD on histograms])

    classDef pick fill:#fce4ec,stroke:#e91e63,stroke-width:2px,color:#880e4f;
    class E,F,G,I,J pick;

When in doubt, start with Coma — it's the strongest default and the only matcher that ships per-sub-matcher score breakdowns.

`Coma`

Pure-Python implementation of the COMA 3.0 schema matching algorithm. COMA (COmbination of MAtching algorithms) composes multiple sub-matchers — each targeting a different aspect of schema or data similarity — and combines their scores.

Schema matchers (enabled by use_schema=True, the default):

Name — trigram (Dice) similarity on column names
Path — trigram similarity on dot-separated schema paths
Leaves — name similarity across all leaf-level columns
Parents — structural similarity via parent-level leaf comparison
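
To make the Name matcher concrete, here is a minimal sketch of trigram Dice similarity between two column names. The helper names `trigrams` and `dice_similarity` are illustrative, and details such as lower-casing and the handling of very short names are assumptions; Valentine's own implementation may differ.

```python
def trigrams(name):
    """Return the set of character trigrams of a lower-cased name."""
    text = name.lower()
    if len(text) < 3:
        # Assumption: names shorter than three characters form a single gram.
        return {text} if text else set()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def dice_similarity(left, right):
    """Dice coefficient over trigram sets: 2 * |A & B| / (|A| + |B|)."""
    grams_left, grams_right = trigrams(left), trigrams(right)
    if not grams_left or not grams_right:
        return 0.0
    shared = len(grams_left & grams_right)
    return 2.0 * shared / (len(grams_left) + len(grams_right))


print(dice_similarity("name", "names"))  # two shared trigrams out of 2 + 3
```

The same function could be applied to dot-separated schema paths to approximate the Path matcher.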

Instance matcher (enabled by use_instances=True):

TF-IDF cosine similarity — each cell value is treated as a document, a global IDF is computed across all columns of both tables, and per-column similarity is aggregated with a max-matching Dice formula.
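
The instance matcher described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Valentine's code: whitespace tokenization, the smoothed IDF formula, and every function name here are choices made for the example.

```python
import math
from collections import Counter


def tokenize(value):
    """Split one cell value into lower-cased whitespace tokens."""
    return str(value).lower().split()


def build_idf(columns):
    """Smoothed IDF over every cell value (one document each) of all columns."""
    docs = [set(tokenize(v)) for col in columns for v in col]
    doc_freq = Counter(tok for doc in docs for tok in doc)
    return {tok: math.log((1 + len(docs)) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def tfidf_vector(value, idf):
    """Sparse TF-IDF vector for a single cell value."""
    counts = Counter(tokenize(value))
    return {tok: n * idf.get(tok, 0.0) for tok, n in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def column_similarity(col_a, col_b, idf):
    """Max-matching Dice: sum each value's best match in the other column,
    in both directions, and divide by the total number of values."""
    vec_a = [tfidf_vector(v, idf) for v in col_a]
    vec_b = [tfidf_vector(v, idf) for v in col_b]
    if not vec_a or not vec_b:
        return 0.0
    best_a = sum(max(cosine(x, y) for y in vec_b) for x in vec_a)
    best_b = sum(max(cosine(x, y) for x in vec_a) for y in vec_b)
    return (best_a + best_b) / (len(vec_a) + len(vec_b))
```

Building the IDF table once over all columns is what makes rare tokens count for more than tokens that appear everywhere.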

After computing all-pairs similarity scores, a selection step filters results using bidirectional best-match logic (DIR_BOTH) controlled by max_n, delta, and threshold. When matching more than two tables via get_matches_batch, the TF-IDF corpus is built once from all tables.
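
The selection step can be pictured with the sketch below. It assumes that delta is a relative tolerance below each element's best score and that max_n=0 means "no cap"; both are assumptions about the parameters' meaning made for illustration, and the function names are hypothetical.

```python
def select_candidates(scores, max_n=0, delta=0.01, threshold=0.0):
    """Pick candidates for one element from a {candidate: score} dict."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return set()
    best = ranked[0][1]
    kept = [cand for cand, score in ranked
            if score >= threshold and score >= best * (1.0 - delta)]
    if max_n > 0:
        kept = kept[:max_n]
    return set(kept)


def select_both_directions(sim, max_n=0, delta=0.01, threshold=0.0):
    """Keep a (source, target) pair only if each side selects the other."""
    by_source, by_target = {}, {}
    for (source, target), score in sim.items():
        by_source.setdefault(source, {})[target] = score
        by_target.setdefault(target, {})[source] = score
    forward = {(s, t) for s, cands in by_source.items()
               for t in select_candidates(cands, max_n, delta, threshold)}
    backward = {(s, t) for t, cands in by_target.items()
                for s in select_candidates(cands, max_n, delta, threshold)}
    return {pair: sim[pair] for pair in forward & backward}
```

Requiring agreement in both directions drops pairs where one column is merely the least bad option for the other.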

from valentine.algorithms import Coma

matcher = Coma(use_instances=True)

Match explanations

Coma is the only matcher that fills in per-sub-matcher score breakdowns. After running Coma, call matches.get_details(pair) to see how each individual sub-matcher contributed to the final score. See Match details.

Performance. Schema-only mode is dominated by trigram comparisons — roughly O(n_left · n_right · L) in column counts and average name length. Adding use_instances=True builds a TF-IDF corpus over all sampled cell values across all input tables, so cost grows linearly with the instance_sample_size parameter of valentine_match (default 1000) and the total number of columns. Expect sub-second matching for two ~30-column tables, single seconds for ~100 columns, and tens of seconds once you cross a few hundred columns with instances enabled. Memory scales with the size of the TF-IDF vocabulary; lower instance_sample_size in the valentine_match call if you hit a wall.

Full parameter reference →

`Cupid`

Python implementation of Generic Schema Matching with Cupid (Madhavan, Bernstein & Rahm, VLDB 2001). Cupid combines linguistic similarity of column names with structural similarity derived from the shape of the schema tree. Use it when you have deep or nested schemas and the column data is unavailable or unreliable.

from valentine.algorithms import Cupid

matcher = Cupid(w_struct=0.2, leaf_w_struct=0.2, th_accept=0.7)

Performance. Schema-only and independent of row counts, so cost is driven entirely by the size of the schema tree. Expect sub-second matching for typical relational schemas (tens to a few hundred columns). Deeply nested or wide schemas (XML/JSON-shaped) push runtime into seconds because the structural pass propagates similarities across neighbouring nodes.

Full parameter reference →

`DistributionBased`

Python implementation of Automatic Discovery of Attributes in Relational Databases (Zhang et al., SIGMOD 2011). Columns are compared by quantile histograms of their value distributions; Earth Mover's Distance drives the ranking of matches within each cluster. Great for numeric or categorical data where names give you nothing to work with.

from valentine.algorithms import DistributionBased

matcher = DistributionBased(threshold1=0.15, threshold2=0.15)

When you pass more than two tables, Valentine calls get_matches_batch, which DistributionBased overrides to compute global value ranks across every table at once — giving each pair the benefit of the full data landscape.
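
As a rough illustration of the two ideas above (global ranks and Earth Mover's Distance), the sketch below maps values to ranks over all columns and then computes the one-dimensional EMD as the area between the two empirical CDFs. It works on exact empirical distributions rather than quantile histograms, and all function names are hypothetical, so treat it as a simplification of what the matcher does.

```python
from bisect import bisect_right


def global_ranks(columns):
    """Rank every distinct value across all columns in one shared ordering."""
    distinct = sorted({value for col in columns for value in col})
    return {value: rank for rank, value in enumerate(distinct)}


def emd_1d(sample_a, sample_b):
    """1-D Earth Mover's Distance: integral of |CDF_a - CDF_b|."""
    sorted_a, sorted_b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(sorted_a) | set(sorted_b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(sorted_a, left) / len(sorted_a)
        cdf_b = bisect_right(sorted_b, left) / len(sorted_b)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


def rank_emd(col_a, col_b, ranks):
    """EMD between two columns in rank space, scaled to the range [0, 1]."""
    ranks_a = [ranks[v] for v in col_a]
    ranks_b = [ranks[v] for v in col_b]
    return emd_1d(ranks_a, ranks_b) / max(len(ranks) - 1, 1)


columns = [[10, 20, 30], [10, 20, 30], [1000, 2000, 3000]]
ranks = global_ranks(columns)
print(rank_emd(columns[0], columns[1], ranks))  # identical columns
print(rank_emd(columns[0], columns[2], ranks))  # disjoint value ranges
```

Working in rank space means the distance depends on how values interleave across tables, not on their raw magnitudes, which is why ranks computed over every table give each pair more context.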

Performance. The most compute-heavy matcher in the package. Cost is dominated by Earth Mover's Distance computations between column histograms; runtime scales roughly O(n_columns² · sample_size · log sample_size) per pair, so it grows fast with both column count and instance_sample_size. As a rule of thumb, expect single-digit seconds for ~30 columns at the default sample size, and minutes once you cross ~100 columns or use the full DataFrame. Lower instance_sample_size aggressively for exploration runs, then bump it up for the final pass.

Full parameter reference →

`JaccardDistanceMatcher`

An instance-based matcher that compares columns by Jaccard (or Tversky) similarity of their value sets. Element equality is configurable: choose from classic string-distance functions (Levenshtein, Jaro–Winkler, exact, …) or switch to sentence-transformer embeddings for semantic matching. Useful as a sanity check alongside a heavier matcher, or as a fast first pass on clean, short-valued columns.

from valentine.algorithms import JaccardDistanceMatcher
from valentine.algorithms.jaccard_distance import StringDistanceFunction

# Classic string-distance mode (default)
matcher = JaccardDistanceMatcher(
    threshold_dist=0.8,
    distance_fun=StringDistanceFunction.Levenshtein,
)

# Embedding mode — requires pip install valentine[embeddings]
matcher = JaccardDistanceMatcher(
    distance_fun=StringDistanceFunction.Embedding,
    threshold_dist=0.7,          # cosine similarity threshold
    embedding_model="all-MiniLM-L6-v2",
    embedding_device=None,       # auto: CUDA → MPS → CPU
)

The element-equality function is configured with the StringDistanceFunction enum, which exposes Levenshtein, DamerauLevenshtein, Hamming, Jaro, JaroWinkler, Exact, and Embedding.
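
To show what "configurable element equality" means in practice, the sketch below treats two values as equal when their normalized Levenshtein similarity reaches a threshold, then computes a Jaccard-style ratio. How fuzzy matches are counted toward the intersection is an assumption made for this example (the smaller of the two matched counts), and the function names are hypothetical.

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]


def similarity(s, t):
    """Edit distance rescaled to a similarity in [0, 1]."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def fuzzy_jaccard(col_a, col_b, threshold):
    """Jaccard ratio where 'equal' means similarity >= threshold."""
    set_a, set_b = set(map(str, col_a)), set(map(str, col_b))
    if not set_a or not set_b:
        return 0.0
    # This nested scan is the O(|A| * |B|) cross-product over value sets.
    matched_a = sum(any(similarity(x, y) >= threshold for y in set_b) for x in set_a)
    matched_b = sum(any(similarity(x, y) >= threshold for x in set_a) for y in set_b)
    shared = min(matched_a, matched_b)
    return shared / (len(set_a) + len(set_b) - shared)
```

With threshold=1.0 this reduces to ordinary Jaccard on exact string equality; lowering the threshold lets near-duplicates such as typos count as shared values.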

The value-set comparison itself can be generalised from Jaccard to Tversky similarity via the tversky_alpha and tversky_beta parameters (both default 1.0, reproducing Jaccard exactly). Other presets: alpha=beta=0.5 gives Sørensen–Dice; alpha=1, beta=0 gives set containment.
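
For reference, the textbook Tversky index is S(A, B) = |A ∩ B| / (|A ∩ B| + α·|A − B| + β·|B − A|). The sketch below implements that formula directly on Python sets; the function name is hypothetical, and Valentine's internal computation may differ in details such as fuzzy element equality. In this textbook form, alpha = beta = 1 yields Jaccard, alpha = beta = 0.5 yields Sørensen–Dice, and alpha = 1, beta = 0 yields the containment of A in B.

```python
def tversky(a, b, alpha, beta):
    """Textbook Tversky index between two collections of values."""
    set_a, set_b = set(a), set(b)
    shared = len(set_a & set_b)
    denominator = (shared
                   + alpha * len(set_a - set_b)
                   + beta * len(set_b - set_a))
    return shared / denominator if denominator else 0.0


left = {1, 2, 3}
right = {2, 3, 4, 5}
print(tversky(left, right, 1.0, 1.0))  # Jaccard: 2 / 5
print(tversky(left, right, 0.5, 0.5))  # Dice: 4 / 7
print(tversky(left, right, 1.0, 0.0))  # containment of left in right: 2 / 3
```

Asymmetric weights are useful when one column is expected to be a subset of the other, for example a foreign key against its primary key.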

When distance_fun=StringDistanceFunction.Embedding, JaccardDistanceMatcher overrides get_matches_batch to embed every unique column value once across all tables, sharing the forward pass across all pairs.

Performance. With Exact element equality the cost is essentially set-intersection — milliseconds per column pair. Switching to a string-distance function turns each comparison into an O(|A| · |B|) cross-product over column value sets, so it slows down quickly once columns hold more than a few hundred unique values. Embedding mode is the slowest option (one model forward pass per unique value), but the get_matches_batch override amortises embedding cost across all column pairs in a batch — making it relatively more efficient as the number of tables grows.

Full parameter reference →

`SimilarityFlooding`

Python implementation of Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching (Melnik, Garcia-Molina & Rahm, ICDE 2002). Each schema is represented as a labelled graph; an initial element-level similarity is iteratively propagated across the graph until a fixpoint is reached. Shines on structure-heavy schemas where graph neighbourhoods carry signal beyond what pure string matching can pick up.

from valentine.algorithms import (
    Formula,
    Policy,
    SimilarityFlooding,
    StringMatcher,
)

matcher = SimilarityFlooding(
    coeff_policy=Policy.INVERSE_AVERAGE,
    formula=Formula.FORMULA_C,
    string_matcher=StringMatcher.PREFIX_SUFFIX,
)

Behaviour is parameterized by three enums: Policy controls the propagation coefficients, Formula selects the fixpoint iteration formula, and StringMatcher picks the initial element-level similarity function. When you select StringMatcher.PREFIX_SUFFIX_TFIDF and run with more than two tables, Valentine computes a global IDF from every table's schema vocabulary.
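
The fixpoint idea can be sketched in a few dozen lines. The version below builds a pairwise connectivity graph from two edge lists, splits propagation weight equally among same-label edges (a simplification that may not match what Policy.INVERSE_AVERAGE computes), and iterates the paper's formula C, σ(i+1) = normalize(σ0 + σ(i) + φ(σ0 + σ(i))). The graph encoding and function name are assumptions made for the example.

```python
import math
from collections import defaultdict


def similarity_flooding(graph_a, graph_b, sigma0, max_iter=50, eps=1e-4):
    """graph_*: lists of (source, label, target); sigma0: {(a, b): score}."""
    # Pairwise connectivity graph: pair up edges that carry the same label.
    pcg = [((a, b), label, (a2, b2))
           for a, label, a2 in graph_a
           for b, label_b, b2 in graph_b if label == label_b]
    nodes = {node for src, _, dst in pcg for node in (src, dst)}
    if not nodes:
        return {}
    out_count, in_count = defaultdict(int), defaultdict(int)
    for src, label, dst in pcg:
        out_count[(src, label)] += 1
        in_count[(dst, label)] += 1
    # Propagation edges run both ways; weight is split among same-label edges.
    edges = []
    for src, label, dst in pcg:
        edges.append((src, dst, 1.0 / out_count[(src, label)]))
        edges.append((dst, src, 1.0 / in_count[(dst, label)]))
    start = {node: sigma0.get(node, 0.0) for node in nodes}
    sigma = dict(start)
    for _ in range(max_iter):
        base = {node: start[node] + sigma[node] for node in nodes}
        nxt = dict(base)
        for src, dst, weight in edges:
            nxt[dst] += weight * base[src]
        top = max(nxt.values())
        if top > 0:
            nxt = {node: value / top for node, value in nxt.items()}
        residual = math.sqrt(sum((nxt[n] - sigma[n]) ** 2 for n in nodes))
        sigma = nxt
        if residual < eps:  # stop once scores no longer move
            break
    return sigma
```

Each pass costs one sweep over the propagation edges, which is why runtime tracks the number of edges times the number of iterations needed to converge.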

Performance. Schema-only and dominated by the fixpoint iteration over the propagation graph. Each iteration is O(|edges|), and the graph size grows with the number of schema elements (columns + types + labels). Expect sub-second runtime on small relational schemas, single seconds on schemas with hundreds of elements. Convergence is the main variable: pick a tighter Formula if iterations stretch out, and prefer StringMatcher.PREFIX_SUFFIX over PREFIX_SUFFIX_TFIDF when you don't need cross-table corpus statistics.

Full parameter reference →

Writing a custom matcher

Every matcher subclasses BaseMatcher and implements at minimum the get_matches method. Override get_matches_batch if you can exploit a holistic view over every table. Populate match_details from inside your matcher if you want to surface sub-scores to users via MatcherResults.get_details.

Invalid parameters should raise ValueError at construction time — the built-in matchers follow this convention for threshold ranges, negative counts, and mutually-exclusive flags.
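
The construction-time validation convention might look like the sketch below. Because BaseMatcher's exact signature is not shown on this page, the example uses a stand-in abstract base class so that it runs without Valentine installed; the class names, the column-name list inputs, and the returned dictionary shape are all hypothetical and should be adapted to the real BaseMatcher interface.

```python
import os
from abc import ABC, abstractmethod


class MatcherStandIn(ABC):
    """Stand-in for Valentine's BaseMatcher (hypothetical, for illustration)."""

    @abstractmethod
    def get_matches(self, source_columns, target_columns):
        ...


class PrefixMatcher(MatcherStandIn):
    """Scores column pairs by shared-prefix length over the longer name."""

    def __init__(self, threshold=0.5):
        # Validate eagerly so misconfiguration fails before any matching runs.
        if not 0.0 <= threshold <= 1.0:
            raise ValueError(f"threshold must be in [0, 1], got {threshold}")
        self.threshold = threshold

    def get_matches(self, source_columns, target_columns):
        matches = {}
        for source in source_columns:
            for target in target_columns:
                prefix = os.path.commonprefix([source.lower(), target.lower()])
                longest = max(len(source), len(target))
                score = len(prefix) / longest if longest else 0.0
                if score >= self.threshold:
                    matches[(source, target)] = score
        return matches
```

Failing in the constructor keeps errors close to the line that caused them, instead of surfacing midway through a long matching run.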
