FAQ¶
Common questions and gotchas. If yours isn't here, open an issue on GitHub.
Which matcher should I use?¶
Start with Coma. It is the strongest default, handles
both schema and instance signals, and is the only matcher that ships
per-sub-matcher score breakdowns so
you can tell why two columns matched. Move to a different matcher
only when you have a clear reason — see the
decision diagram on the
matchers page.
How do I match more than two DataFrames?¶
Pass any iterable of DataFrames to valentine_match and
Valentine computes all N * (N - 1) / 2 unique pairs:
matches = valentine_match(
    [sales_df, orders_df, products_df],
    Coma(),
    df_names=["sales", "orders", "products"],
)
Each ColumnPair in the result carries both the
source and target table names, so you can group, filter, or pretty-print
results by table pair.
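To make the pair count concrete, here is a small standard-library sketch that enumerates the unordered table pairs for the three names used above. It only illustrates the N * (N - 1) / 2 arithmetic. It does not call Valentine, and the order in which Valentine visits pairs (or which table it treats as source) is an assumption here.

```python
from itertools import combinations

df_names = ["sales", "orders", "products"]

# Every unordered pair of distinct tables, each pair listed exactly once.
table_pairs = list(combinations(df_names, 2))

n = len(df_names)
# The number of pairs matches the N * (N - 1) / 2 formula.
assert len(table_pairs) == n * (n - 1) // 2

for left, right in table_pairs:
    print(left, "<->", right)
```

With three tables this yields three pairs; with ten tables it would already be 45, which is worth keeping in mind when estimating run time.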
Why is the matcher slow / using a lot of memory?¶
The two biggest dials are how many columns each table has and how
much instance data the matcher sees per column. For instance-based
matchers (Coma(use_instances=True), DistributionBased,
JaccardDistanceMatcher), Valentine samples up to
instance_sample_size rows per column (default 1000). If your tables
are large:
- Lower instance_sample_size to 200–500 for a quick first pass.
- Set instance_sample_size=None to use the full DataFrame only when you need a final, high-quality match.
- Set instance_sample_size=0 to disable instance data entirely and fall back to schema-only matching.
See valentine_match for the full signature,
and the Matchers page for per-matcher performance notes.
What's the difference between instance_sample_size=None and 0?¶
- None: feed the entire column to the matcher. Most accurate, most expensive. Use this for final runs on small/medium tables.
- 0: feed no instance data. Schema-only matching. Use this when the data is sensitive, unavailable, or irrelevant.
- Positive integer n: sample at most n rows per column. The default 1000 is a good speed/accuracy trade-off for most workloads.
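The three settings can be pictured with a small stand-alone sketch. The helper sample_values below is hypothetical and is not part of Valentine; in particular, drawing a uniform random sample with a fixed seed is an assumption, since this page does not say how rows are picked.

```python
import random
from typing import Optional, Sequence


def sample_values(
    values: Sequence,
    instance_sample_size: Optional[int] = 1000,
    seed: int = 0,
) -> list:
    """Illustrative only: mimic the three documented settings for one column."""
    if instance_sample_size is None:
        return list(values)  # None: keep every value
    if instance_sample_size == 0:
        return []  # 0: no instance data, schema-only matching
    if len(values) <= instance_sample_size:
        return list(values)  # column is already within the cap
    # Assumption: uniform random sample; the real strategy may differ.
    return random.Random(seed).sample(list(values), instance_sample_size)


column = list(range(5000))
print(len(sample_values(column, None)))   # 5000
print(len(sample_values(column, 0)))      # 0
print(len(sample_values(column)))         # 1000
print(len(sample_values([1, 2, 3], 200))) # 3
```

The last call shows why the wording is "at most n rows": a column shorter than the cap is passed through unchanged.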
My column names are non-ASCII / contain Unicode. Will it work?¶
Yes. All matchers operate on Python strings and handle Unicode identifiers correctly. Trigram and edit-distance comparisons run on code points, not bytes.
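The code-point versus byte distinction is easy to check in plain Python. The trigrams helper below is a hypothetical illustration written for this FAQ, not Valentine's own tokenizer.

```python
def trigrams(text: str) -> list[str]:
    """Return overlapping 3-character windows, counted in code points."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


name = "straße"

print(len(name))                  # 6 code points
print(len(name.encode("utf-8")))  # 7 bytes, because "ß" takes two bytes in UTF-8
print(trigrams(name))             # ['str', 'tra', 'raß', 'aße']
```

Because Python strings index by code point, the window never splits "ß" in the middle, which is what would happen if the same slicing were applied to the UTF-8 bytes.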
How do I get only the top N matches?¶
MatcherResults is sorted high-to-low and
provides three reduction helpers:
matches.take_top_n(10) # absolute top 10
matches.take_top_percent(5) # top 5%
matches.one_to_one() # bidirectional best matches
All three return a new MatcherResults — the original is immutable.
See Matcher results for the full pipeline.
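As a rough picture of how a top-N cut differs from keeping bidirectional best matches, the sketch below works on a plain dict of scores. It does not use MatcherResults, and the mutual_best function is a hypothetical stand-in: Valentine's one_to_one may apply additional rules (for example thresholds or tie handling) that are not described on this page.

```python
Pair = tuple[str, str]


def mutual_best(scores: dict[Pair, float]) -> dict[Pair, float]:
    """Keep a pair only if each side is the other's highest-scoring partner."""
    best_for_source: dict[str, str] = {}
    best_for_target: dict[str, str] = {}
    for (src, tgt), score in scores.items():
        if src not in best_for_source or score > scores[(src, best_for_source[src])]:
            best_for_source[src] = tgt
        if tgt not in best_for_target or score > scores[(best_for_target[tgt], tgt)]:
            best_for_target[tgt] = src
    return {
        (src, tgt): score
        for (src, tgt), score in scores.items()
        if best_for_source[src] == tgt and best_for_target[tgt] == src
    }


# Made-up scores for illustration.
scores = {
    ("emp_id", "employee_number"): 0.92,
    ("emp_id", "first_name"): 0.20,
    ("fname", "first_name"): 0.85,
    ("fname", "employee_number"): 0.15,
    ("lname", "first_name"): 0.60,
}

# A plain top-3 cut keeps the three highest scores, whatever columns they involve.
top_three = dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3])
one_to_one = mutual_best(scores)

print(top_three)
print(one_to_one)
```

Here the top-3 cut still contains ("lname", "first_name"), while the mutual-best filter drops it because first_name already has a stronger partner in fname.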
How do I evaluate match quality?¶
If you have a ground truth — a list of expected (source_col,
target_col) pairs — call
get_metrics:
ground_truth = [("emp_id", "employee_number"), ("fname", "first_name")]
print(matches.get_metrics(ground_truth))
By default this computes Precision, Recall, and F1. Pass a
metrics={...} set with custom thresholds or your own
Metric subclasses for more detail. See
Evaluation metrics.
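For readers who want to see what the three default numbers mean, here is a minimal stand-alone sketch of precision, recall, and F1 over sets of column pairs. The function precision_recall_f1 is hypothetical and written only for this FAQ; Valentine's own metrics may preprocess the matches (for example by applying a one-to-one filter) before counting, which this sketch does not do.

```python
Pair = tuple[str, str]


def precision_recall_f1(
    predicted: list[Pair], ground_truth: list[Pair]
) -> tuple[float, float, float]:
    """Compare predicted column pairs against the expected ones."""
    predicted_set = set(predicted)
    truth_set = set(ground_truth)
    true_positives = len(predicted_set & truth_set)

    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(truth_set) if truth_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


ground_truth = [("emp_id", "employee_number"), ("fname", "first_name")]
# Made-up prediction: one correct pair, one wrong pair.
predicted = [("emp_id", "employee_number"), ("fname", "last_name")]

print(precision_recall_f1(predicted, ground_truth))  # (0.5, 0.5, 0.5)
```

One of two predictions is correct (precision 0.5) and one of two expected pairs was found (recall 0.5), so the harmonic mean F1 is also 0.5.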
How do I plug in my own data source (not a DataFrame)?¶
Subclass BaseTable and
BaseColumn. The
API reference includes a runnable DictTable example
that wraps a plain Python dict[str, list]. The same example is
exercised by the test suite, so it is guaranteed to stay in sync with
the API.
How do I plug in my own matcher?¶
Subclass BaseMatcher and implement
get_matches. If your matcher benefits from a
holistic view across all input tables, override
get_matches_batch. Populate
match_details from inside your matcher if
you want users to access per-sub-score breakdowns via
get_details. Raise ValueError from
__init__ for invalid parameters — the built-in matchers all do.
I get the same column matched to itself. How do I exclude self-matches?¶
Self-pairs (same table, same column) never appear in
MatcherResults. If you see cross-table
matches with identical column names that you want to exclude, filter
the result mapping yourself:
filtered = {
    pair: score
    for pair, score in matches.items()
    if pair.source_column != pair.target_column
}
Does Valentine require Java?¶
No. Every matcher in v1.x is pure Python — including the COMA implementation. There is no JVM, no subprocess, no temp file shuffling. The Java version of COMA was removed in v1.0.0.
What changed in v1.0.0?¶
The matching API was unified, valentine_match_batch was removed,
results became immutable, and metrics were overhauled. The full story
and a step-by-step migration guide are in the
changelog.