Changelog & migration

This page tracks user-visible changes to Valentine and explains how to port code between releases. The format is based on Keep a Changelog and the project follows Semantic Versioning. For the full commit history, see GitHub releases.

Maintainers: how to update this page

When preparing a release, move the contents of the Unreleased section below into a new versioned heading (## vX.Y.Z - YYYY-MM-DD) and reset the Unreleased sub-sections to empty. Keep sub-section order consistent: Added · Changed · Deprecated · Removed · Fixed · Security.

v1.0.0 - 2026-05-14

v1.0.0 is a significant redesign of Valentine's public API together with a performance and accuracy overhaul of every matcher. If you are coming from 0.5.x or earlier, the changes below will affect your code.

Headline: 13×–243× per-matcher speedup on the NYC Open Data benchmark (1,442 s → 19 s total), pure-Python Coma (no JVM), Polars support, embedding-based Jaccard, and Hungarian as the new default 1:1 selector.

Added

  • ColumnPair NamedTuple with explicit source_table, source_column, target_table, target_column fields, replacing the previous nested-tuple match keys.
  • Sub-matcher score breakdowns exposed via MatcherResults.details and get_details(pair). Currently populated by Coma.
  • Ground-truth input accepts table-aware ColumnPair instances in addition to column-name pairs; see Evaluation metrics.
  • Top-level instance_sample_size parameter on valentine_match (default 1000) for controlling instance sampling without constructing a custom DataframeTable.
  • Predefined metric sets: METRICS_ALL, METRICS_PRECISION_RECALL, and METRICS_PRECISION_INCREASING_N alongside the existing METRICS_CORE; see Predefined metric sets.
  • MeanReciprocalRank (MRR) metric, also added to METRICS_ALL and METRICS_CORE. Per-source ranking: for each source column, finds the rank of the first correct target in the column's ranked predictions.
  • Polars support. New PolarsTable / PolarsColumn adapters in valentine/data_sources/polars/. valentine_match auto-detects pandas and Polars frames and supports mixing them in a single call. Install with pip install valentine[polars].
  • Embedding-based string distance for JaccardDistanceMatcher via StringDistanceFunction.Embedding, using sentence-transformers cosine similarity. Knobs: embedding_model, embedding_device (auto-picks cuda → mps → cpu), embedding_batch_size. One global encode pass per match call. Install with pip install valentine[embeddings]. (Closes #65.)
  • Tversky-based set-similarity reduction in JaccardDistanceMatcher (tversky_alpha, tversky_beta). Defaults reproduce Jaccard exactly; α=β=0.5 gives Sørensen-Dice, and α=1, β=0 gives containment.
  • Three named one-to-one selectors on MatcherResults: one_to_one_hungarian() (new default, globally optimal via scipy.optimize.linear_sum_assignment), one_to_one_greedy() (previous behaviour), and one_to_one_mutual_top(n) (mutual nearest-neighbour filter).
  • Pluggable 1:1 algorithm in the metrics API. New one_to_one_method keyword on Metric.apply() and MatcherResults.get_metrics() accepts "hungarian" | "greedy" | "mutual_top". Defaults to "hungarian".
  • Configurable instance_weight constructor parameter on Coma (default 1.0).
  • Generic abbreviation matching in Coma name similarity: handles prefix and ordered-subsequence forms (dept→department, fname→firstname, st→street).
  • Full documentation site with matcher guide, API reference, and migration notes.
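
The Tversky reduction listed above can be illustrated on plain Python sets. This is a minimal sketch of the textbook Tversky index, not the matcher's implementation: the function name `tversky` is hypothetical, and it assumes the defaults are α=β=1, which is the setting that reduces to Jaccard.

```python
def tversky(x: set, y: set, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Tversky index: |X & Y| / (|X & Y| + alpha*|X - Y| + beta*|Y - X|)."""
    common = len(x & y)
    denom = common + alpha * len(x - y) + beta * len(y - x)
    # Zero denominator (e.g. two empty sets): report 0.0, a choice of this sketch.
    return common / denom if denom else 0.0


source = {"a", "b", "c", "d"}
target = {"c", "d", "e"}

jaccard = tversky(source, target)                 # alpha = beta = 1 -> 2/5
dice = tversky(source, target, 0.5, 0.5)          # Sorensen-Dice    -> 4/7
containment = tversky(source, target, 1.0, 0.0)   # |X & Y| / |X|    -> 2/4
```

With α=1 and β=0 only the elements unique to the source set are penalised, which is why the score becomes the fraction of the source contained in the target.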

Changed

  • Unified top-level match API. A single valentine_match now accepts any iterable of DataFrames (list, tuple, generator), replacing the previous valentine_match / valentine_match_batch pair.
  • Immutable MatcherResults. The result object is now a Mapping, not a dict subclass. Derived views (e.g. one_to_one_hungarian()) are cached and cannot be silently invalidated.
  • Coma is now a pure-Python implementation of COMA 3.0, with no JVM dependency. Constructor signature updated to max_n, use_instances, use_schema, delta, threshold.
  • METRICS_ALL is now an explicit set rather than a dynamic scan of Metric.__subclasses__(), so user-defined metrics no longer bleed into the predefined set.
  • Parameter validation happens at matcher construction time: invalid thresholds, negative counts, or mutually-exclusive flags raise ValueError immediately rather than failing mid-match.
  • 13×–243× faster per matcher across the NYC benchmark dataset pairs (1,442 s → 19 s total). Coma uses TF-IDF cosine on cached float32 sparse CSR matrices with pair-level memoisation; Cupid caches WordNet synsets and lemma walks; DistributionBased replaces the per-row bucket_binary_search with np.searchsorted + np.bincount over precomputed bound arrays; JaccardDistanceMatcher uses rapidfuzz.process.cdist with a score_cutoff short-circuit. Full per-matcher numbers in the Benchmark page.
  • BaseTable.get_data_type treats pandas "str" / "string" dtypes as text (previously misclassified as unknown).
  • Cupid datatype compatibility is now binary (same family = 1.0, different = 0.0); a generic family-based classifier handles arbitrary SQL type strings (varchar(255), bigint, …).
  • Coma TF-IDF stopwords switched from a 33-word Lucene frozenset to NLTK's 179-word English stopwords for stronger noise filtering.
  • The default 1:1 selector for the metrics API is now Hungarian. Existing callers that relied on greedy selection should pass one_to_one_method="greedy".
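
The METRICS_ALL change above is easier to see in a toy model. The sketch below uses stand-in classes (`Metric`, `Precision`, `Recall`, `MyCustomMetric` are all hypothetical, not the library's) to show how a dynamic `__subclasses__()` scan picks up a metric defined later by a user, while an explicit set stays fixed.

```python
class Metric:
    """Stand-in base class; not the library's Metric."""


class Precision(Metric):
    pass


class Recall(Metric):
    pass


# Explicit set: fixed at definition time.
METRICS_EXPLICIT = frozenset({Precision, Recall})


def metrics_by_scan() -> set:
    # Dynamic scan: re-evaluated on every call, sees every direct subclass.
    return set(Metric.__subclasses__())


before = metrics_by_scan()


class MyCustomMetric(Metric):  # defined later, e.g. in user code
    pass


after = metrics_by_scan()

leaked = MyCustomMetric in after                 # True: the scan now includes it
isolated = MyCustomMetric not in METRICS_EXPLICIT  # True: the explicit set does not
```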

Deprecated

  • NotAValentineMatcher is kept as an alias for InvalidMatcherError but will be removed in a future release. Update except clauses to use the new name.

Removed

  • valentine_match_batch: use valentine_match with an iterable instead.
  • The Java-backed COMA wrapper and its JVM dependency.
  • Mutable dict semantics on match results (__setitem__, update, pop, …).
  • MatcherResults.one_to_one(): use one of the three explicitly named selectors: one_to_one_hungarian() (new default), one_to_one_greedy() (previous behaviour), or one_to_one_mutual_top(n).
  • Redundant Coma matchers in flat tabular schemas: LEAVES_CM, PARENTS_CM, PATH_CM, SIBLINGS_CM, DATATYPE_MATCHER, and the predefined INSTANCES_CM. These produced constant or duplicate-of-NAME_CM scores on tabular inputs, diluting the signal.

Fixed

  • DistributionBased: quantile_emd now returns inf instead of dividing by zero when histogram values sum to zero.
  • Coma: TF-IDF cache stores list reference alongside its id() key to detect id() reuse after garbage collection, preventing stale cache hits.
  • SimilarityFlooding: NodeID prefix collision fixed (columns named "NodeID*" no longer collide with internal graph nodes); tokeniser now handles snake_case, SCREAMING_SNAKE, hyphens, and embedded digits.
  • Data source utilities: get_encoding handles chardet returning None; get_delimiter catches csv.Sniffer failures on malformed input.
  • NLTK data downloads are now resilient: retried, atomic, and silent when data is already present.
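
The Coma cache fix above follows a general pattern for caches keyed by `id()`. The sketch below is an illustration of that pattern only: the class `IdentityCache` and the helper `token_count` are hypothetical and do not mirror Coma's internals. Each entry stores the object next to its value, and a lookup counts as a hit only when the stored object is the same object.

```python
class IdentityCache:
    """Cache keyed by id(obj) that also keeps the object to verify identity."""

    def __init__(self):
        self._entries = {}  # id(obj) -> (obj, value)

    def get(self, obj, compute):
        entry = self._entries.get(id(obj))
        # A matching id alone is not enough: it could belong to an
        # unrelated object, so the stored reference is compared with `is`.
        if entry is not None and entry[0] is obj:
            return entry[1]
        value = compute(obj)
        self._entries[id(obj)] = (obj, value)
        return value


cache = IdentityCache()
calls = []


def token_count(values):
    calls.append(1)  # record each real computation
    return sum(len(v.split()) for v in values)


column = ["new york", "los angeles"]
first = cache.get(column, token_count)
second = cache.get(column, token_count)   # served from the cache

# Simulate a stale entry: the id of `other` maps to a different list.
other = ["san francisco bay"]
cache._entries[id(other)] = (["stale"], -1)
third = cache.get(other, token_count)     # identity check rejects the stale entry
```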

Migrating from 0.5.x

1. valentine_match_batch is gone

Before (0.5.x):

from valentine import valentine_match, valentine_match_batch

matches = valentine_match(df1, df2, matcher)              # two DataFrames
matches = valentine_match_batch([df1, df2, df3], matcher) # many DataFrames

After (1.0):

from valentine import valentine_match

matches = valentine_match([df1, df2], matcher)            # any iterable
matches = valentine_match([df1, df2, df3], matcher)

valentine_match now accepts any iterable of DataFrames; pairs, lists, tuples, and generators all work the same way.

2. Match keys are ColumnPair instances, not nested tuples

Before:

for ((t1, c1), (t2, c2)), score in matches.items():
    print(f"{c1} <-> {c2}: {score}")

After:

for pair, score in matches.items():
    print(f"{pair.source_column} <-> {pair.target_column}: {score}")

ColumnPair is a NamedTuple, so positional indexing still works if you really need it, and destructuring into four names is a simple migration path:

for (src_table, src_col, tgt_table, tgt_col), score in matches.items():
    ...
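
If old-style keys are stored somewhere (cached results, hand-written fixtures), a small converter may help during migration. The sketch below defines a stand-in `ColumnPair` with the four fields named in this changelog so that it runs without the library installed; `from_legacy_key` is a hypothetical helper, not part of Valentine.

```python
from typing import NamedTuple


class ColumnPair(NamedTuple):
    """Stand-in with the four fields named in the changelog."""
    source_table: str
    source_column: str
    target_table: str
    target_column: str


def from_legacy_key(key) -> ColumnPair:
    """Hypothetical helper: flatten a 0.5.x ((table, col), (table, col)) key."""
    (src_table, src_col), (tgt_table, tgt_col) = key
    return ColumnPair(src_table, src_col, tgt_table, tgt_col)


legacy = {(("hr", "emp_id"), ("payroll", "employee_number")): 0.92}
converted = {from_legacy_key(k): score for k, score in legacy.items()}

pair = next(iter(converted))
by_name = pair.source_column   # attribute access
by_index = pair[3]             # positional access still works on a NamedTuple
```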

3. MatcherResults is immutable

Before:

matches[("t1", "c1"), ("t2", "c2")] = 1.0   # allowed
del matches[some_key]                        # allowed

After: these raise TypeError / AttributeError. Use the transformation methods instead:

matches = matches.filter(min_score=0.7)
matches = matches.take_top_n(10)
matches = matches.take_top_percent(25)

Each returns a new MatcherResults instance.
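
The behaviour described in this step is what any class built on `collections.abc.Mapping` provides. The sketch below is a toy container, not the library's MatcherResults: `FrozenScores` is a hypothetical name, and only its `filter` method is modelled on the methods listed above.

```python
from collections.abc import Mapping


class FrozenScores(Mapping):
    """Hypothetical read-only result container (not the library class)."""

    def __init__(self, scores):
        self._scores = dict(scores)  # private copy of the input

    def __getitem__(self, key):
        return self._scores[key]

    def __iter__(self):
        return iter(self._scores)

    def __len__(self):
        return len(self._scores)

    def filter(self, min_score):
        # Transformations build a new object instead of mutating this one.
        kept = {k: v for k, v in self._scores.items() if v >= min_score}
        return FrozenScores(kept)


scores = FrozenScores({("a", "x"): 0.9, ("b", "y"): 0.4})
strong = scores.filter(min_score=0.7)

try:
    scores[("c", "z")] = 1.0      # Mapping defines no __setitem__
except TypeError as exc:
    assignment_error = type(exc).__name__

try:
    scores.update({})             # Mapping defines no update()
except AttributeError as exc:
    update_error = type(exc).__name__
```

Item assignment and deletion fail with TypeError because the class has no `__setitem__` or `__delitem__`, while dict-only methods such as `update` or `pop` fail with AttributeError because they do not exist on a Mapping.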

4. Ground truth accepts ColumnPair instances

Before, only (col, col) pairs were allowed:

ground_truth = [("emp_id", "employee_number"), ...]

After, both forms work, and table-aware comparison is now possible for multi-table matching:

from valentine.algorithms import ColumnPair

ground_truth = [
    ColumnPair("hr", "emp_id", "payroll", "employee_number"),
    ...
]

See Evaluation metrics → Ground-truth formats.
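
Ground truth in the column-name form is also what a ranking metric such as the new MeanReciprocalRank is scored against. The sketch below follows the per-source description given under Added, with two assumptions of its own: the mean is taken over the source columns that appear in the ground truth, and a source column with no correct target among its predictions contributes 0. The function is a stand-alone illustration, not the library's metric class.

```python
from collections import defaultdict


def mean_reciprocal_rank(scores, ground_truth):
    """scores: {(source_col, target_col): score}; ground_truth: set of such pairs."""
    ranked = defaultdict(list)
    for (src, tgt), score in scores.items():
        ranked[src].append((score, tgt))

    sources = {src for src, _ in ground_truth}
    total = 0.0
    for src in sources:
        # Highest score first; ranks are 1-based.
        targets = [t for _, t in sorted(ranked[src], key=lambda st: -st[0])]
        for rank, tgt in enumerate(targets, start=1):
            if (src, tgt) in ground_truth:
                total += 1.0 / rank
                break
    return total / len(sources) if sources else 0.0


scores = {
    ("emp_id", "employee_number"): 0.9,
    ("emp_id", "emp_name"): 0.6,
    ("name", "employee_number"): 0.8,
    ("name", "emp_name"): 0.7,
}
ground_truth = {("emp_id", "employee_number"), ("name", "emp_name")}

# emp_id finds its target at rank 1, name at rank 2: (1 + 1/2) / 2
mrr = mean_reciprocal_rank(scores, ground_truth)
```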

5. NotAValentineMatcher is deprecated

The exception raised for bad matcher arguments is now InvalidMatcherError. The old name is kept as an alias for backward compatibility but will be removed in a future release, so update your except clauses.

# Before
from valentine import NotAValentineMatcher

# After
from valentine import InvalidMatcherError
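
Why existing except clauses keep working during the deprecation window can be shown with stand-in classes. The sketch assumes the alias is a second name bound to the same class, which is the usual way such aliases are made; the classes and the `run` function here are hypothetical, not imported from Valentine.

```python
class InvalidMatcherError(Exception):
    """Stand-in for the renamed exception."""


# A deprecated alias is just a second name bound to the same class.
NotAValentineMatcher = InvalidMatcherError


def run(matcher):
    # Hypothetical check, only here to raise the exception.
    if not callable(matcher):
        raise InvalidMatcherError(f"not a matcher: {matcher!r}")


try:
    run("not-a-matcher")
except NotAValentineMatcher as exc:   # the old name still catches the new class
    caught = type(exc).__name__
```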

6. The Java COMA wrapper has been removed

If you were relying on the previous Java-backed Coma implementation, you no longer need a JVM: Coma is now pure Python and ships with the package. The constructor signature has changed slightly; see the API reference for the new parameters (max_n, use_instances, use_schema, delta, threshold).

7. one_to_one() is gone: pick a selector

MatcherResults.one_to_one() has been replaced by three explicitly named selectors:

# Before
filtered = matches.one_to_one()

# After, globally optimal (new default, recommended):
filtered = matches.one_to_one_hungarian()

# After, preserving the previous greedy behaviour:
filtered = matches.one_to_one_greedy()

# After, mutual nearest neighbour:
filtered = matches.one_to_one_mutual_top(n=1)
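
The three strategies can disagree on the same scores, which is why the choice matters when migrating. The sketch below is a self-contained illustration on a 2×2 score matrix, not the library's code. The function names are hypothetical, `optimal` uses exhaustive search as a stand-in that finds the same maximum-total assignment the Hungarian algorithm would on such a tiny input, and `mutual_top` models only the n=1 case.

```python
from itertools import permutations


def greedy(scores):
    """Repeatedly take the best remaining pair whose row and column are free."""
    pairs = sorted(
        ((s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)),
        reverse=True,
    )
    used_rows, used_cols, chosen = set(), set(), []
    for _, i, j in pairs:
        if i not in used_rows and j not in used_cols:
            chosen.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return sorted(chosen)


def optimal(scores):
    """Exhaustive search for the highest-total assignment (square matrix)."""
    n = len(scores)
    best = max(
        permutations(range(n)),
        key=lambda p: sum(scores[i][p[i]] for i in range(n)),
    )
    return [(i, best[i]) for i in range(n)]


def mutual_top(scores):
    """Keep (i, j) only when j is i's best column and i is j's best row."""
    n_rows, n_cols = len(scores), len(scores[0])
    best_col = [max(range(n_cols), key=row.__getitem__) for row in scores]
    best_row = [max(range(n_rows), key=lambda i: scores[i][j]) for j in range(n_cols)]
    return [(i, j) for i, j in enumerate(best_col) if best_row[j] == i]


# Rows are source columns, columns are target columns.
scores = [
    [0.90, 0.80],
    [0.85, 0.10],
]

greedy_pairs = greedy(scores)       # [(0, 0), (1, 1)], total 1.00
optimal_pairs = optimal(scores)     # [(0, 1), (1, 0)], total 1.65
mutual_pairs = mutual_top(scores)   # [(0, 0)]
```

Greedy locks in the single best pair and is left with a poor match for the second row, the optimal assignment gives up that pair for a higher total, and the mutual filter keeps only the pair both sides agree on.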

The metrics API also takes the algorithm as a per-call argument:

matches.get_metrics(gt, metrics={F1Score()},
                    one_to_one_method="hungarian")  # default

Custom Metric subclasses that override apply need to accept the new one_to_one_method keyword (or **kwargs).