
Matchers

This page is the conceptual guide to Valentine's five matching algorithms: what each one does, when to reach for it, and the trade-offs involved. For constructor signatures, parameter defaults, and validation rules, head straight to the API reference.

Every matcher in Valentine subclasses BaseMatcher and is compatible with the top-level valentine_match API. All five live in valentine.algorithms:

from valentine.algorithms import (
    Coma,
    Cupid,
    DistributionBased,
    JaccardDistanceMatcher,
    SimilarityFlooding,
)
| Matcher | Signals | Best for |
| --- | --- | --- |
| Coma | Schema + instances (optional) | General-purpose first choice. Strong defaults, informative sub-scores. |
| Cupid | Schema only | Nested schemas where column names and structure matter more than data. |
| DistributionBased | Instances only | Matching by value distributions when names are unreliable. |
| JaccardDistanceMatcher | Instances only | Simple, explainable baseline. Useful for sanity checks. |
| SimilarityFlooding | Schema only | Structure-heavy schemas where graph neighbourhoods carry signal. |

Which matcher should I pick?

flowchart TD
    A([Start]) --> B{Do you have<br/>column values?}
    B -- "No, schema only" --> C{Schema is<br/>nested or<br/>graph-shaped?}
    B -- "Yes" --> D{Names are<br/>reliable?}

    C -- "Nested / linguistic" --> E([<b>Cupid</b><br/>tree + linguistic])
    C -- "Graph-heavy" --> F([<b>SimilarityFlooding</b><br/>fixpoint propagation])

    D -- "Yes" --> G([<b>Coma</b><br/>schema + instances])
    D -- "Not really" --> H{Need a quick<br/>baseline?}

    H -- "Yes" --> I([<b>JaccardDistanceMatcher</b><br/>set similarity])
    H -- "No, go heavy" --> J([<b>DistributionBased</b><br/>EMD on histograms])

    classDef pick fill:#fce4ec,stroke:#e91e63,stroke-width:2px,color:#880e4f;
    class E,F,G,I,J pick;

When in doubt, start with Coma — it's the strongest default and the only matcher that ships per-sub-matcher score breakdowns.

Coma

Pure-Python implementation of the COMA 3.0 schema matching algorithm. COMA (COmbination of MAtching algorithms) composes multiple sub-matchers — each targeting a different aspect of schema or data similarity — and combines their scores.

Schema matchers (enabled by use_schema=True, the default):

  • Name — trigram (Dice) similarity on column names
  • Path — trigram similarity on dot-separated schema paths
  • Leaves — name similarity across all leaf-level columns
  • Parents — structural similarity via parent-level leaf comparison
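The trigram (Dice) similarity behind the Name matcher can be sketched in a few lines. This is a minimal illustration, not Coma's actual code: the lower-casing, the lack of padding, and the use of sets rather than multisets are all assumptions, and `trigrams` and `dice_similarity` are hypothetical helper names.

```python
def trigrams(name: str) -> set:
    """Return the set of lower-cased character trigrams in `name`.

    Assumption: no padding characters are added; very short names
    fall back to a single token so they can still match themselves.
    """
    text = name.lower()
    if len(text) < 3:
        return {text} if text else set()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def dice_similarity(left: str, right: str) -> float:
    """Dice coefficient over trigram sets: 2 * |A & B| / (|A| + |B|)."""
    a, b = trigrams(left), trigrams(right)
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Two column names sharing the "customer_" prefix score well above zero.
score = dice_similarity("customer_id", "customer_name")
```

Here `customer_id` yields 9 trigrams, `customer_name` yields 11, and 7 are shared, so the score is 14 / 20 = 0.7.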

Instance matcher (enabled by use_instances=True):

  • TF-IDF cosine similarity — each cell value is treated as a document, a global IDF is computed across all columns of both tables, and per-column similarity is aggregated with a max-matching Dice formula.

After computing all-pairs similarity scores, a selection step filters results using bidirectional best-match logic (DIR_BOTH) controlled by max_n, delta, and threshold. When matching more than two tables via get_matches_batch, the TF-IDF corpus is built once from all tables.

from valentine.algorithms import Coma

matcher = Coma(use_instances=True)
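The bidirectional selection step described above can be illustrated with a small sketch. The semantics assumed here are a guess for illustration only: in each direction, a candidate is kept if it scores at least `threshold`, lies within `delta` of the best score for that column, and ranks in the top `max_n`; a pair survives only if both directions keep it. The helper names and default values are hypothetical and may not match Coma's internals.

```python
def select_direction(scores, max_n, delta, threshold):
    """scores: {source: {target: score}} -> set of kept (source, target) pairs."""
    selected = set()
    for source, row in scores.items():
        if not row:
            continue
        best = max(row.values())
        ranked = sorted(row.items(), key=lambda item: item[1], reverse=True)
        # Assumed rule: above threshold AND within delta of the row's best score.
        kept = [t for t, s in ranked if s >= threshold and s >= best - delta]
        if max_n > 0:
            kept = kept[:max_n]
        selected.update((source, target) for target in kept)
    return selected


def select_both(scores, max_n=1, delta=0.0, threshold=0.5):
    """Keep only the pairs that are selected in both directions."""
    forward = select_direction(scores, max_n, delta, threshold)
    # Transpose the score matrix to evaluate the target -> source direction.
    transposed = {}
    for source, row in scores.items():
        for target, value in row.items():
            transposed.setdefault(target, {})[source] = value
    backward = {
        (source, target)
        for target, source in select_direction(transposed, max_n, delta, threshold)
    }
    return forward & backward


scores = {
    "id": {"identifier": 0.90, "name": 0.20},
    "ident": {"identifier": 0.85, "name": 0.10},
    "full_name": {"identifier": 0.30, "name": 0.80},
}
pairs = select_both(scores)
```

In this toy input, `ident` picks `identifier` as its best target, but `identifier` prefers `id`, so the `ident` pair is dropped.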

Match explanations

Coma is the only matcher that fills in per-sub-matcher score breakdowns. After running Coma, call matches.get_details(pair) to see how each individual sub-matcher contributed to the final score. See Match details.

Performance. Schema-only mode is dominated by trigram comparisons — roughly O(n_left · n_right · L) in column counts and average name length. Adding use_instances=True builds a TF-IDF corpus over all sampled cell values across all input tables, so cost grows linearly with instance_sample_size (default 1000) and the total number of columns. Expect sub-second matching for two ~30-column tables, single seconds for ~100 columns, and tens of seconds once you cross a few hundred columns with instances enabled. Memory scales with the size of the TF-IDF vocabulary; lower instance_sample_size if you hit a wall.

Full parameter reference →

Cupid

Python implementation of Generic Schema Matching with Cupid (Madhavan, Bernstein & Rahm, VLDB 2001). Cupid combines linguistic similarity of column names with structural similarity derived from the shape of the schema tree. Use it when you have deep or nested schemas and the column data is unavailable or unreliable.

from valentine.algorithms import Cupid

matcher = Cupid(w_struct=0.2, leaf_w_struct=0.2, th_accept=0.7)
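To make the role of the weights concrete, the sketch below uses the weighted-similarity formula from the Cupid paper, wsim = w_struct · ssim + (1 − w_struct) · lsim, and compares the result against an acceptance threshold. How Valentine applies `w_struct`, `leaf_w_struct`, and `th_accept` internally is an assumption here, and `weighted_similarity` and `is_accepted` are hypothetical helpers.

```python
def weighted_similarity(lsim: float, ssim: float, w_struct: float) -> float:
    """Blend linguistic (lsim) and structural (ssim) similarity."""
    if not 0.0 <= w_struct <= 1.0:
        raise ValueError(f"w_struct must be in [0, 1], got {w_struct}")
    return w_struct * ssim + (1.0 - w_struct) * lsim


def is_accepted(lsim: float, ssim: float, w_struct: float, th_accept: float) -> bool:
    """Assumed rule: a pair is accepted when its weighted similarity reaches th_accept."""
    return weighted_similarity(lsim, ssim, w_struct) >= th_accept


# With w_struct=0.2, names dominate: strong names survive weak structure...
strong_names = is_accepted(lsim=0.9, ssim=0.4, w_struct=0.2, th_accept=0.7)
# ...but strong structure cannot rescue mediocre names.
weak_names = is_accepted(lsim=0.6, ssim=0.9, w_struct=0.2, th_accept=0.7)
```

Raising `w_struct` shifts the balance towards the shape of the schema tree, which is the knob to reach for when names are inconsistent but nesting is similar.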

Performance. Schema-only and independent of row counts, so cost is driven entirely by the size of the schema tree. Expect sub-second matching for typical relational schemas (tens to a few hundred columns). Deeply nested or wide schemas (XML/JSON-shaped) push runtime into seconds because the structural pass propagates similarities across neighbouring nodes.

Full parameter reference →

DistributionBased

Python implementation of Automatic Discovery of Attributes in Relational Databases (Zhang et al., SIGMOD 2011). Columns are compared by quantile histograms of their value distributions; Earth Mover's Distance drives the ranking of matches within each cluster. Great for numeric or categorical data where names give you nothing to work with.

from valentine.algorithms import DistributionBased

matcher = DistributionBased(threshold1=0.15, threshold2=0.15)
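The idea of comparing quantile summaries with Earth Mover's Distance can be shown with a standard-library sketch. It is not DistributionBased's pipeline (which also involves value ranks and clustering); it only relies on the fact that, for two equally weighted one-dimensional point sets of the same size, EMD reduces to the mean absolute difference between the sorted points. `quantile_summary` and `emd_1d` are hypothetical helpers.

```python
from statistics import quantiles


def quantile_summary(values, n_bins=10):
    """Summarise a numeric column by its n_bins - 1 interior quantile cut points."""
    return quantiles(values, n=n_bins)


def emd_1d(left, right, n_bins=10):
    """EMD between two columns, approximated on their quantile summaries.

    Both summaries have the same length and are already sorted, so the
    optimal transport plan simply pairs the i-th quantiles.
    """
    q_left = quantile_summary(left, n_bins)
    q_right = quantile_summary(right, n_bins)
    return sum(abs(a - b) for a, b in zip(q_left, q_right)) / len(q_left)


ages = list(range(18, 78))
ages_shifted = [a + 2 for a in ages]    # same shape, slightly shifted
salaries = [a * 1000 for a in ages]     # very different value range

close = emd_1d(ages, ages_shifted)
far = emd_1d(ages, salaries)
```

A lower distance means the two columns draw values from more similar distributions, so `ages` ranks `ages_shifted` well ahead of `salaries` regardless of the column names.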

When you pass more than two tables, Valentine calls get_matches_batch, which DistributionBased overrides to compute global value ranks across every table at once — giving each pair the benefit of the full data landscape.

Performance. The most compute-heavy matcher in the package. Cost is dominated by Earth Mover's Distance computations between column histograms; runtime scales roughly as O(n_columns² · s · log s) per table pair, where s is instance_sample_size, so it grows fast with both column count and sample size. As a rule of thumb, expect single-digit seconds for ~30 columns at the default sample size, and minutes once you cross ~100 columns or use the full DataFrame. Lower instance_sample_size aggressively for exploration runs, then bump it up for the final pass.

Full parameter reference →

JaccardDistanceMatcher

A simple, explainable instance-based baseline. Columns are compared by Jaccard similarity of their value sets, with element equality decided by a configurable string distance function (Levenshtein, Jaro–Winkler, exact, …). Useful as a sanity check alongside a heavier matcher, or as a fast first pass on clean, short-valued columns.

from valentine.algorithms import JaccardDistanceMatcher
from valentine.algorithms.jaccard_distance import StringDistanceFunction

matcher = JaccardDistanceMatcher(
    threshold_dist=0.8,
    distance_fun=StringDistanceFunction.Levenshtein,
)

The element-equality function is configured with the StringDistanceFunction enum, which exposes Levenshtein, DamerauLevenshtein, Hamming, Jaro, JaroWinkler, and Exact.
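The following sketch shows what "Jaccard similarity with fuzzy element equality" can look like using only the standard library. It is an illustration under assumptions: two values count as equal when their normalised Levenshtein similarity reaches `min_similarity`, a hypothetical parameter whose relationship to `threshold_dist` is not specified on this page. None of the helper names come from Valentine.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalised to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def fuzzy_jaccard(left, right, min_similarity: float) -> float:
    """Jaccard score where 'equal' means similarity >= min_similarity."""
    left_set, right_set = set(map(str, left)), set(map(str, right))
    if not left_set or not right_set:
        return 0.0
    # O(|A| * |B|) cross-product: every left value is checked against the right set.
    matched = sum(
        1 for x in left_set
        if any(similarity(x, y) >= min_similarity for y in right_set)
    )
    matched = min(matched, len(right_set))  # keep the score bounded by 1
    return matched / (len(left_set) + len(right_set) - matched)


cities_a = ["berlin", "paris", "rome"]
cities_b = ["Berlin", "paris", "madrid"]
fuzzy = fuzzy_jaccard(cities_a, cities_b, min_similarity=0.8)
exact = fuzzy_jaccard(cities_a, cities_b, min_similarity=1.0)
```

With fuzzy equality, `berlin` and `Berlin` count as the same element and the score is 2 / 4; with exact equality only `paris` matches and the score drops to 1 / 5.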

Performance. Fast and predictable. With Exact element equality the cost is essentially set-intersection — milliseconds per column pair. Switching to a string-distance function turns each comparison into an O(|A| · |B|) cross-product over column value sets, so it slows down quickly once columns hold more than a few hundred unique values. Use it as a fast first pass, or pair it with StringDistanceFunction.Exact on clean, short-valued columns.

Full parameter reference →

SimilarityFlooding

Python implementation of Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching (Melnik, Garcia-Molina & Rahm, ICDE 2002). Each schema is represented as a labelled graph; an initial element-level similarity is iteratively propagated across the graph until a fixpoint is reached. Shines on structure-heavy schemas where graph neighbourhoods carry signal beyond what pure string matching can pick up.

from valentine.algorithms import (
    Formula,
    Policy,
    SimilarityFlooding,
    StringMatcher,
)

matcher = SimilarityFlooding(
    coeff_policy=Policy.INVERSE_AVERAGE,
    formula=Formula.FORMULA_C,
    string_matcher=StringMatcher.PREFIX_SUFFIX,
)

Behaviour is parameterized by three enums: Policy controls the propagation coefficients, Formula selects the fixpoint iteration formula, and StringMatcher picks the initial element-level similarity function. When you select StringMatcher.PREFIX_SUFFIX_TFIDF and run with more than two tables, Valentine computes a global IDF from every table's schema vocabulary.
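To illustrate the fixpoint iteration, here is a minimal sketch of formula C from the Similarity Flooding paper, σ^(i+1) = normalize(σ^0 + σ^i + φ(σ^0 + σ^i)), run on a hand-built propagation graph. The graph, its edge weights, and the `flood` helper are assumptions made for the example; Valentine builds the graph and the coefficients from the schemas and the chosen Policy.

```python
def flood(initial, edges, max_iterations=100, epsilon=1e-6):
    """Iterate formula C until the similarity vector stops changing.

    initial: {pair: sigma_0}, the starting similarity of each node pair
    edges:   {(source_pair, target_pair): weight}, propagation coefficients
    """
    current = dict(initial)
    for _ in range(max_iterations):
        # sigma_0 + sigma_i
        base = {pair: initial[pair] + current[pair] for pair in initial}
        # ... plus phi(sigma_0 + sigma_i): similarity flowing along the edges
        updated = dict(base)
        for (source, target), weight in edges.items():
            updated[target] += weight * base[source]
        # Normalise by the largest value so scores stay in [0, 1].
        top = max(updated.values())
        if top > 0:
            updated = {pair: value / top for pair, value in updated.items()}
        residual = sum((updated[p] - current[p]) ** 2 for p in updated) ** 0.5
        current = updated
        if residual < epsilon:
            break
    return current


tables = ("orders", "purchases")
id_id = ("orders.id", "purchases.id")
id_total = ("orders.id", "purchases.total")

initial = {tables: 0.3, id_id: 1.0, id_total: 0.1}
# Hypothetical weights: each column pair exchanges similarity with the table pair.
edges = {
    (tables, id_id): 0.5, (id_id, tables): 0.5,
    (tables, id_total): 0.5, (id_total, tables): 0.5,
}
result = flood(initial, edges)
```

The table pair starts with a weak score but is pulled up by its strongly matching columns, which is the neighbourhood effect the algorithm is designed to capture.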

Performance. Schema-only and dominated by the fixpoint iteration over the propagation graph. Each iteration is O(|edges|), and the graph size grows with the number of schema elements (columns + types + labels). Expect sub-second runtime on small relational schemas, single seconds on schemas with hundreds of elements. Convergence is the main variable: pick a tighter Formula if iterations stretch out, and prefer StringMatcher.PREFIX_SUFFIX over PREFIX_SUFFIX_TFIDF when you don't need cross-table corpus statistics.

Full parameter reference →

Writing a custom matcher

Every matcher subclasses BaseMatcher and implements at minimum the get_matches method. Override get_matches_batch if you can exploit a holistic view over every table. Populate match_details from inside your matcher if you want to surface sub-scores to users via MatcherResults.get_details.

Invalid parameters should raise ValueError at construction time — the built-in matchers follow this convention for threshold ranges, negative counts, and mutually-exclusive flags.
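The constructor-validation convention can be sketched as follows. Because this page does not show BaseMatcher's exact method signatures, the example defines a local stand-in base class instead of importing the real one; the class names, the list-of-column-names inputs, and the returned dictionary shape are all hypothetical.

```python
from abc import ABC, abstractmethod


class BaseMatcherStandIn(ABC):
    """Hypothetical stand-in for illustration; NOT Valentine's BaseMatcher."""

    @abstractmethod
    def get_matches(self, source_columns, target_columns):
        """Return {(source_column, target_column): score}."""


class CaseInsensitiveNameMatcher(BaseMatcherStandIn):
    """Toy matcher: two columns match when their names are equal ignoring case."""

    def __init__(self, threshold: float = 1.0):
        # Fail fast: reject invalid parameters at construction time.
        if not 0.0 <= threshold <= 1.0:
            raise ValueError(f"threshold must be in [0, 1], got {threshold}")
        self.threshold = threshold

    def get_matches(self, source_columns, target_columns):
        matches = {}
        for source in source_columns:
            for target in target_columns:
                score = 1.0 if source.lower() == target.lower() else 0.0
                if score > 0.0 and score >= self.threshold:
                    matches[(source, target)] = score
        return matches


matcher = CaseInsensitiveNameMatcher()
found = matcher.get_matches(["ID", "name"], ["id", "city"])
```

Validating in `__init__` means a misconfigured matcher fails before any data is loaded, rather than partway through a long matching run.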