Changelog & migration
This page tracks user-visible changes to Valentine and explains how to port code between releases. The format is based on Keep a Changelog and the project follows Semantic Versioning. For the full commit history, see GitHub releases.
Maintainers: how to update this page
When preparing a release, move the contents of the
Unreleased section below into a new versioned heading
(## vX.Y.Z - YYYY-MM-DD) and reset the Unreleased sub-sections
to empty. Keep sub-section order consistent:
Added · Changed · Deprecated · Removed · Fixed · Security.
v1.0.0 - 2026-05-14
v1.0.0 is a significant redesign of Valentine's public API together with a performance and accuracy overhaul of every matcher. If you are coming from 0.5.x or earlier, the changes below will affect your code.
Headline: 13× to 243× per-matcher speedup on the NYC Open Data benchmark (1,442 s → 19 s total), pure-Python Coma (no JVM), Polars support, embedding-based Jaccard, and Hungarian as the new default 1:1 selector.
Added
- ColumnPair NamedTuple with explicit source_table, source_column, target_table, target_column fields, replacing the previous nested-tuple match keys.
- Sub-matcher score breakdowns exposed via MatcherResults.details and get_details(pair). Currently populated by Coma.
- Ground-truth input accepts table-aware ColumnPair instances in addition to column-name pairs; see Evaluation metrics.
- Top-level instance_sample_size parameter on valentine_match (default 1000) for controlling instance sampling without constructing a custom DataframeTable.
- Predefined metric sets: METRICS_ALL, METRICS_PRECISION_RECALL, and METRICS_PRECISION_INCREASING_N alongside the existing METRICS_CORE; see Predefined metric sets.
- MeanReciprocalRank (MRR) metric, also added to METRICS_ALL and METRICS_CORE. Per-source ranking: for each source column, finds the rank of the first correct target in the column's ranked predictions.
- Polars support. New PolarsTable / PolarsColumn adapters in valentine/data_sources/polars/. valentine_match auto-detects pandas and Polars frames and supports mixing them in a single call. Install with pip install valentine[polars].
- Embedding-based string distance for JaccardDistanceMatcher via StringDistanceFunction.Embedding, using sentence-transformers cosine similarity. Knobs: embedding_model, embedding_device (auto-picks cuda → mps → cpu), embedding_batch_size. One global encode pass per match call. Install with pip install valentine[embeddings]. (Closes #65.)
- Tversky-based set-similarity reduction in JaccardDistanceMatcher (tversky_alpha, tversky_beta). Defaults reproduce Jaccard exactly; α=β=0.5 gives Sørensen-Dice, α=1, β=0 gives containment.
- Three named one-to-one selectors on MatcherResults: one_to_one_hungarian() (new default, globally optimal via scipy.optimize.linear_sum_assignment), one_to_one_greedy() (previous behaviour), and one_to_one_mutual_top(n) (mutual nearest-neighbour filter).
- Pluggable 1:1 algorithm in the metrics API. New one_to_one_method keyword on Metric.apply() and MatcherResults.get_metrics() accepts "hungarian" | "greedy" | "mutual_top". Defaults to "hungarian".
- Configurable instance_weight constructor parameter on Coma (default 1.0).
- Generic abbreviation matching in Coma name similarity: handles prefix and ordered-subsequence forms (dept → department, fname → firstname, st → street).
- Full documentation site with matcher guide, API reference, and migration notes.
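The Tversky special cases listed above can be checked by hand. The sketch below is a standalone illustration, not Valentine's implementation: the function name, the example sets, and the choice to return 0.0 for two empty sets are assumptions made here. It uses the standard Tversky index, for which α=β=1 is the setting that coincides with Jaccard.

```python
def tversky(a: set, b: set, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Tversky index: |A∩B| / (|A∩B| + alpha*|A-B| + beta*|B-A|)."""
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    # Returning 0.0 for two empty sets is a convention chosen for this sketch.
    return common / denom if denom else 0.0


# Made-up column values, purely for illustration.
a = {"nyc", "boston", "austin", "miami"}
b = {"nyc", "boston", "denver"}

# Reference formulas computed directly from the set sizes.
jaccard = len(a & b) / len(a | b)
dice = 2 * len(a & b) / (len(a) + len(b))
containment = len(a & b) / len(a)  # share of a that is also in b

print(tversky(a, b), jaccard)                          # alpha=beta=1
print(tversky(a, b, alpha=0.5, beta=0.5), dice)        # Sørensen-Dice
print(tversky(a, b, alpha=1.0, beta=0.0), containment) # containment
```

Each printed pair should agree, which is the relationship the changelog entry describes.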
Changed
- Unified top-level match API. A single valentine_match now accepts any iterable of DataFrames (list, tuple, generator), replacing the previous valentine_match / valentine_match_batch pair.
- Immutable MatcherResults. The result object is now a Mapping, not a dict subclass. Derived views (e.g. one_to_one_hungarian()) are cached and cannot be silently invalidated.
- Coma is now a pure-Python implementation of COMA 3.0 with no JVM dependency. Constructor signature updated to max_n, use_instances, use_schema, delta, threshold.
- METRICS_ALL is now an explicit set rather than a dynamic scan of Metric.__subclasses__(), so user-defined metrics no longer bleed into the predefined set.
- Parameter validation happens at matcher construction time: invalid thresholds, negative counts, or mutually-exclusive flags raise ValueError immediately rather than failing mid-match.
- 13× to 243× faster per matcher across the NYC benchmark dataset pairs (1,442 s → 19 s total). Coma uses TF-IDF cosine on cached float32 sparse CSR matrices with pair-level memoisation; Cupid caches WordNet synsets and lemma walks; DistributionBased replaces the per-row bucket_binary_search with np.searchsorted + np.bincount over precomputed bound arrays; JaccardDistanceMatcher uses rapidfuzz.process.cdist with a score_cutoff short-circuit. Full per-matcher numbers in the Benchmark page.
- BaseTable.get_data_type treats pandas "str" / "string" dtypes as text (previously misclassified as unknown).
- Cupid datatype compatibility is now binary (same family = 1.0, different = 0.0); a generic family-based classifier handles arbitrary SQL type strings (varchar(255), bigint, …).
- Coma TF-IDF stopwords switched from a 33-word Lucene frozenset to NLTK's 179-word English stopwords for stronger noise filtering.
- The default 1:1 selector for the metrics API is now Hungarian.
Existing callers that relied on greedy selection should pass
one_to_one_method="greedy".
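The construction-time validation mentioned in the list above follows a common fail-fast pattern. The sketch below shows that pattern with a hypothetical ThresholdMatcher class; its name, parameters, and error messages are invented for illustration and are not part of Valentine's API.

```python
class ThresholdMatcher:
    """Hypothetical matcher used only to illustrate fail-fast validation."""

    def __init__(self, threshold: float = 0.5, max_candidates: int = 10):
        # Reject bad settings here, before any (possibly slow) matching starts.
        if not 0.0 <= threshold <= 1.0:
            raise ValueError(f"threshold must be in [0, 1], got {threshold}")
        if max_candidates < 0:
            raise ValueError(f"max_candidates must be >= 0, got {max_candidates}")
        self.threshold = threshold
        self.max_candidates = max_candidates

    def match(self, scores: dict) -> dict:
        # Keep the highest-scoring pairs that clear the threshold.
        kept = [(k, v) for k, v in scores.items() if v >= self.threshold]
        kept.sort(key=lambda kv: kv[1], reverse=True)
        return dict(kept[: self.max_candidates])


errors = []
for bad_kwargs in ({"threshold": 1.5}, {"max_candidates": -1}):
    try:
        ThresholdMatcher(**bad_kwargs)
    except ValueError as exc:
        errors.append(str(exc))  # raised at construction, not mid-match

print(errors)
```

With this shape, a typo in a configuration value surfaces when the matcher object is created rather than partway through a long run.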
Deprecated
- NotAValentineMatcher is kept as an alias for InvalidMatcherError but will be removed in a future release. Update except clauses to use the new name.
Removed
- valentine_match_batch: use valentine_match with an iterable instead.
- The Java-backed COMA wrapper and its JVM dependency.
- Mutable dict semantics on match results (__setitem__, update, pop, …).
- MatcherResults.one_to_one(); use one of the three explicitly named selectors instead: one_to_one_hungarian() (new default), one_to_one_greedy() (previous behaviour), or one_to_one_mutual_top(n).
- Redundant Coma matchers in flat tabular schemas: LEAVES_CM, PARENTS_CM, PATH_CM, SIBLINGS_CM, DATATYPE_MATCHER, and the predefined INSTANCES_CM. These produced constant or duplicate-of-NAME_CM scores on tabular inputs, diluting the signal.
Fixed
- DistributionBased: quantile_emd now returns inf instead of dividing by zero when histogram values sum to zero.
- Coma: TF-IDF cache stores list reference alongside its id() key to detect id() reuse after garbage collection, preventing stale cache hits.
- SimilarityFlooding: NodeID prefix collision fixed (columns named "NodeID*" no longer collide with internal graph nodes); tokeniser now handles snake_case, SCREAMING_SNAKE, hyphens, and embedded digits.
- Data source utilities: get_encoding handles chardet returning None; get_delimiter catches csv.Sniffer failures on malformed input.
- NLTK data downloads are now resilient: retried, atomic, and silent when data is already present.
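The Coma cache fix above guards against a general CPython pitfall: id() values can be reused once an object is garbage-collected, so a cache keyed only by id() may return a value computed for a different object. A minimal sketch of the guard, using a hypothetical IdentityCache class that is not Valentine's code:

```python
class IdentityCache:
    """Cache keyed by id(obj) that also stores obj to verify identity on lookup."""

    def __init__(self):
        self._entries = {}  # id(obj) -> (obj, cached_value)

    def get_or_compute(self, obj, compute):
        entry = self._entries.get(id(obj))
        # A hit counts only if the stored object is the very same object.
        if entry is not None and entry[0] is obj:
            return entry[1]
        value = compute(obj)
        self._entries[id(obj)] = (obj, value)
        return value


calls = []

def count_tokens(tokens):
    calls.append(tokens)  # record each real computation
    return len(tokens)

cache = IdentityCache()
tokens = ["alpha", "beta"]
first = cache.get_or_compute(tokens, count_tokens)   # computed
second = cache.get_or_compute(tokens, count_tokens)  # served from the cache

# Simulate a recycled id: plant an entry for a different list under other's id.
other = ["gamma"]
cache._entries[id(other)] = (["stale"], 99)
fresh = cache.get_or_compute(other, count_tokens)    # identity check fails, recomputed

print(first, second, fresh, len(calls))
```

Holding the reference also keeps the cached object alive, which is a second reason a stored reference makes id()-keyed caches safer.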
Migrating from 0.5.x
1. valentine_match_batch is gone
Before (0.5.x):
from valentine import valentine_match, valentine_match_batch
matches = valentine_match(df1, df2, matcher) # two DataFrames
matches = valentine_match_batch([df1, df2, df3], matcher) # many DataFrames
After (1.0):
from valentine import valentine_match
matches = valentine_match([df1, df2], matcher) # any iterable
matches = valentine_match([df1, df2, df3], matcher)
valentine_match now accepts any iterable of
DataFrames; pairs, lists, tuples, and generators all work the same way.
2. Match keys are ColumnPair instances, not nested tuples
Before, each match key was a nested tuple.
After, each key is a ColumnPair with named fields:
for pair, score in matches.items():
    print(f"{pair.source_column} <-> {pair.target_column}: {score}")
ColumnPair is a NamedTuple, so positional
indexing still works if you really need it, and destructuring into four
names is a simple migration path:
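A minimal sketch of that destructuring, using a stand-in NamedTuple with the four fields named in this changelog. The real ColumnPair ships with Valentine; the plain dict and the scores below are made up for illustration.

```python
from typing import NamedTuple


class ColumnPair(NamedTuple):
    # Stand-in mirroring the field names listed in this changelog.
    source_table: str
    source_column: str
    target_table: str
    target_column: str


# Illustrative scores; a real run would get these from valentine_match.
matches = {
    ColumnPair("hr", "emp_id", "payroll", "employee_number"): 0.92,
    ColumnPair("hr", "dept", "payroll", "department"): 0.81,
}

lines = []
# Unpack each key into four names, much like unpacking a tuple key.
for (src_table, src_col, tgt_table, tgt_col), score in matches.items():
    lines.append(f"{src_table}.{src_col} <-> {tgt_table}.{tgt_col}: {score}")

first = next(iter(matches))
# Positional indexing and attribute access refer to the same field.
same_field = first[1] == first.source_column

print("\n".join(lines))
```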
3. MatcherResults is immutable
Before, match results could be mutated in place through dict methods (__setitem__, update, pop, …).
After, these raise TypeError / AttributeError. Use the
transformation methods instead:
matches = matches.filter(min_score=0.7)
matches = matches.take_top_n(10)
matches = matches.take_top_percent(25)
Each returns a new MatcherResults
instance.
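To see why in-place mutation now fails, consider a read-only container built on collections.abc.Mapping. The FrozenScores class below is a hypothetical stand-in, not Valentine's MatcherResults; only its filter method borrows a name from the calls shown above.

```python
from collections.abc import Mapping


class FrozenScores(Mapping):
    """Hypothetical read-only result container (illustration only)."""

    def __init__(self, scores):
        self._scores = dict(scores)

    def __getitem__(self, key):
        return self._scores[key]

    def __iter__(self):
        return iter(self._scores)

    def __len__(self):
        return len(self._scores)

    def filter(self, min_score):
        # Return a new object; the original is left untouched.
        kept = {k: v for k, v in self._scores.items() if v >= min_score}
        return FrozenScores(kept)


results = FrozenScores({"a": 0.9, "b": 0.4})

try:
    results["c"] = 1.0          # Mapping defines no __setitem__
except TypeError as exc:
    assignment_error = exc

try:
    results.pop("a")            # Mapping defines no pop
except AttributeError as exc:
    pop_error = exc

strong = results.filter(min_score=0.7)
print(len(results), list(strong))
```

Because every transformation returns a fresh object, a cached derived view can never be invalidated by a later mutation.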
4. Ground truth accepts ColumnPair instances
Before: only (col, col) pairs were allowed.
After: both work, and table-aware comparison is now possible for multi-table matching:
from valentine.algorithms import ColumnPair
ground_truth = [
    ColumnPair("hr", "emp_id", "payroll", "employee_number"),
    ...
]
See Evaluation metrics → Ground-truth formats.
5. NotAValentineMatcher is deprecated
The exception raised for bad matcher arguments is now
InvalidMatcherError. The old name is
kept as an alias for backward compatibility but will be removed in a
future release; update your except clauses.
# Before
from valentine import NotAValentineMatcher
# After
from valentine import InvalidMatcherError
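The sketch below shows why existing except clauses keep working during the deprecation window. It assumes the alias is a plain name binding to the same class; the changelog calls it an alias but does not show how it is implemented. Both classes and the run helper are stand-ins defined here.

```python
class InvalidMatcherError(Exception):
    """Stand-in for the renamed exception."""


# Assumed alias style: the old name points at the very same class object.
NotAValentineMatcher = InvalidMatcherError


def run(matcher):
    # Hypothetical check, for illustration only.
    if not callable(matcher):
        raise InvalidMatcherError(f"not a matcher: {matcher!r}")
    return matcher()


caught = []

try:
    run("oops")
except NotAValentineMatcher:   # old name still catches the error
    caught.append("old name")

try:
    run("oops")
except InvalidMatcherError:    # new name, preferred going forward
    caught.append("new name")

print(caught)
```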
6. The Java COMA wrapper has been removed
If you were relying on the previous Java-backed Coma implementation,
you no longer need a JVM: Coma is now pure Python and
ships with the package. The constructor signature has changed slightly;
see the API reference for the new parameters
(max_n, use_instances, use_schema, delta, threshold).
7. one_to_one() is gone: pick a selector
MatcherResults.one_to_one() has been replaced by three explicitly
named selectors:
# Before
filtered = matches.one_to_one()
# After: globally optimal (new default), recommended
filtered = matches.one_to_one_hungarian()
# After: preserve previous greedy behaviour
filtered = matches.one_to_one_greedy()
# After: mutual nearest neighbour
filtered = matches.one_to_one_mutual_top(n=1)
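The three strategies can disagree on the same scores. The sketch below uses made-up scores and standalone functions, not Valentine's implementations. The optimal assignment is found by brute force rather than scipy.optimize.linear_sum_assignment so the example needs no dependencies, and the mutual filter reflects one reading of n=1.

```python
from itertools import permutations

# Made-up similarity scores for two source and two target columns.
scores = {
    ("a", "x"): 0.90, ("a", "y"): 0.85,
    ("b", "x"): 0.80, ("b", "y"): 0.10,
}
sources, targets = ["a", "b"], ["x", "y"]


def greedy(scores):
    """Take the best remaining pair until sources or targets run out."""
    chosen, used_s, used_t = {}, set(), set()
    for (s, t), v in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if s not in used_s and t not in used_t:
            chosen[(s, t)] = v
            used_s.add(s)
            used_t.add(t)
    return chosen


def optimal(scores, sources, targets):
    """Maximise the total score by brute force (fine for tiny inputs only)."""
    best = max(
        permutations(targets, len(sources)),
        key=lambda perm: sum(scores[(s, t)] for s, t in zip(sources, perm)),
    )
    return {(s, t): scores[(s, t)] for s, t in zip(sources, best)}


def mutual_top1(scores):
    """Keep a pair only if each column is the other's best match."""
    best_t, best_s = {}, {}
    for (s, t), v in scores.items():
        if s not in best_t or v > scores[(s, best_t[s])]:
            best_t[s] = t
        if t not in best_s or v > scores[(best_s[t], t)]:
            best_s[t] = s
    return {(s, t): scores[(s, t)] for s, t in best_t.items() if best_s[t] == s}


print(greedy(scores))                     # (a, x) first, leaving b with y
print(optimal(scores, sources, targets))  # (a, y) and (b, x): higher total
print(mutual_top1(scores))                # only (a, x) is mutual
```

Greedy locks in the single best pair and leaves b with a poor match, while the globally optimal assignment gives up a little on a to score much higher overall.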
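The metrics API also takes the algorithm as a per-call argument: pass one_to_one_method ("hungarian", "greedy", or "mutual_top") to Metric.apply() or MatcherResults.get_metrics().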
Custom Metric subclasses that override apply need
to accept the new one_to_one_method keyword (or **kwargs).
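A sketch of what that means for a custom metric. The Metric base class and its apply signature below are stand-ins invented for illustration, since Valentine's real signature is not shown on this page. The point is only that an override with the old signature rejects the new keyword, while one that accepts **kwargs does not.

```python
class Metric:
    """Stand-in base class; the real Metric.apply signature may differ."""

    def apply(self, matches, ground_truth, *, one_to_one_method="hungarian"):
        raise NotImplementedError


class LegacyHitCount(Metric):
    # Old-style override: no room for the new keyword.
    def apply(self, matches, ground_truth):
        return sum(1 for pair in matches if pair in ground_truth)


class HitCount(Metric):
    # Updated override: accepts one_to_one_method and any future keywords.
    def apply(self, matches, ground_truth, **kwargs):
        return sum(1 for pair in matches if pair in ground_truth)


matches = {("emp_id", "employee_number"): 0.9, ("dept", "salary"): 0.3}
ground_truth = [("emp_id", "employee_number")]

try:
    LegacyHitCount().apply(matches, ground_truth, one_to_one_method="greedy")
except TypeError as exc:
    legacy_error = exc  # unexpected keyword argument

hits = HitCount().apply(matches, ground_truth, one_to_one_method="greedy")
print(hits)
```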