Benchmark: v0.5.0 → v1.0.0

This page documents the performance comparison between valentine v0.5.0 and v1.0.0 on the NYC dataset collection — 10 real-world table pairs from NYC Open Data spanning city government, education, housing, and transportation domains.

All timings are wall-clock seconds measured on a single Windows machine, with a per-dataset timeout of 120 s. Accuracy metrics are mean F1, mean Recall@GT, and mean MRR across all datasets that completed without error or timeout.

Coma in v0.5.0 vs v1.0.0

In v0.5.0, Coma was the canonical Java-backed implementation — it required a JRE on the host machine and spawned a JVM per call. A pure-Python variant, ComaPy, existed but was considered experimental. In v1.0.0, the Java backend was removed entirely: Coma is now the pure-Python implementation (the former ComaPy, promoted to stable). All Coma comparisons in this document use the canonical class of each version — Coma (Java) for v0.5.0 and Coma (Python) for v1.0.0.


Summary

Speed (total wall-clock time, 10 datasets)

| Matcher | v0.5.0 | v1.0.0 | Speedup |
|---|---|---|---|
| Coma (schema) | 8.31 s | 0.65 s | 13× |
| Coma (instances) | 322.23 s | 4.71 s | 68× |
| Cupid | 163.04 s | 3.55 s | 46× |
| DistributionBased | 164.70 s | 3.94 s | 42× |
| JaccardDistanceMatcher | 730.36 s ⚠ | 3.92 s | 186× |
| SimilarityFlooding | 53.84 s | 3.30 s | 16× |
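
The speedup column is simply the v0.5.0 total divided by the v1.0.0 total, rounded to the nearest integer. A minimal sketch that reproduces it from the timings in the table above (the names `TOTAL_SECONDS`, `speedup`, and `SPEEDUPS` are illustrative and not part of valentine):

```python
# (v0.5.0 seconds, v1.0.0 seconds), copied from the speed table above.
TOTAL_SECONDS = {
    "Coma (schema)": (8.31, 0.65),
    "Coma (instances)": (322.23, 4.71),
    "Cupid": (163.04, 3.55),
    "DistributionBased": (164.70, 3.94),
    "JaccardDistanceMatcher": (730.36, 3.92),
    "SimilarityFlooding": (53.84, 3.30),
}


def speedup(old_seconds: float, new_seconds: float) -> float:
    """Return how many times faster the new run is than the old one."""
    return old_seconds / new_seconds


# Round to the nearest integer, as in the Speedup column.
SPEEDUPS = {
    name: round(speedup(old, new))
    for name, (old, new) in TOTAL_SECONDS.items()
}

for name, factor in SPEEDUPS.items():
    print(f"{name}: {factor}x")
```

Because the v0.5.0 JaccardDistanceMatcher total includes 120 s for every timed-out dataset, its ratio is best read as a lower bound on the real speedup.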

v0.5.0 reliability issues

  • JaccardDistanceMatcher timed out on 5 of 10 datasets; the 730 s total is the runtime of the 5 datasets that completed plus 5 × 120 s (600 s) charged for the timeouts.
  • DistributionBased crashed on the Public Design Commission dataset (min() arg is an empty sequence) due to a missing guard for sparse, text-heavy columns — fixed in v1.0.0.

Accuracy (mean across completed datasets)

| Matcher | v0.5.0 F1 | v1.0.0 F1 | v0.5.0 Recall@GT | v1.0.0 Recall@GT | v0.5.0 MRR | v1.0.0 MRR |
|---|---|---|---|---|---|---|
| Coma (schema) | 0.6582 | 0.6647 | 0.6424 | 0.6507 | 0.3050 | 0.3024 |
| Coma (instances) | 0.7654 § | 0.7717 | 0.8132 § | 0.7631 | 0.3427 § | 0.3384 |
| Cupid | 0.4800 | 0.4848 | 0.4275 | 0.4298 | 0.2452 | 0.2489 |
| DistributionBased | 0.6465 † | 0.6805 | 0.5903 † | 0.6205 | 0.2892 † | 0.3019 |
| JaccardDistanceMatcher | 0.6664 ‡ | 0.6463 | 0.6250 ‡ | 0.5611 | 0.3354 ‡ | 0.2474 |
| SimilarityFlooding | 0.5071 | 0.4929 | 0.5014 | 0.5798 | 0.2853 | 0.3034 |

§ v0.5.0 Coma (instances) mean computed over 9 completed datasets (Housing_Maintenance timed out).
† v0.5.0 DistributionBased excludes the one crashed dataset (Public Design Commission).
‡ v0.5.0 Jaccard computed over 5 completed datasets only (5 timeouts).

Recall@GT

Recall@GT (RecallAtSizeofGroundTruth) measures recall when selecting exactly len(ground_truth) top predictions — i.e., the fraction of correct pairs recovered if you keep as many predictions as there are gold matches.
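
The definition above can be sketched in a few lines of plain Python. This is an illustration of the metric as described here, not valentine's own implementation; the function name, the pair representation, and the example column names are assumptions made for the sketch.

```python
from typing import Hashable, Sequence, Set


def recall_at_ground_truth_size(
    ranked_predictions: Sequence[Hashable],
    ground_truth: Set[Hashable],
) -> float:
    """Recall when keeping exactly len(ground_truth) top-ranked predictions.

    ranked_predictions must already be sorted from most to least confident.
    """
    if not ground_truth:
        return 0.0
    top_k = ranked_predictions[: len(ground_truth)]
    true_positives = sum(1 for pair in top_k if pair in ground_truth)
    return true_positives / len(ground_truth)


# Hypothetical column pairs, ranked by matcher score (best first).
ranked = [
    ("id", "ID"),
    ("name", "addr"),
    ("boro", "borough"),
    ("zip", "zipcode"),
]
gold = {("id", "ID"), ("boro", "borough"), ("zip", "zipcode")}

# Only the first 3 predictions are kept; 2 of them are correct.
score = recall_at_ground_truth_size(ranked, gold)
```

In this example the correct pair ranked fourth does not count, because only the top three predictions (the size of the gold set) are considered, so the score is 2/3.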

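Accuracy is largely preserved across the rewrite: mean F1 changes by roughly 0.02 or less for five of the six matchers, and DistributionBased gains 0.034 (its v0.5.0 mean excludes the crashed dataset). Recall@GT and MRR shift more on some matchers, most visibly for JaccardDistanceMatcher, whose v0.5.0 means cover only 5 datasets. The speed-ups come from implementation improvements, not from accuracy trade-offs.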


Java Coma vs pure-Python Coma

v0.5.0 shipped two Coma variants: Coma (Java-backed, the canonical implementation) and ComaPy (pure Python, experimental). v1.0.0 ships only Coma — the pure-Python implementation, graduated from experimental ComaPy to the new stable default, with the Java backend retired entirely.

| Matcher | v0.5.0 Java | v1.0.0 Python | Speedup | F1 delta | Recall@GT delta | MRR delta |
|---|---|---|---|---|---|---|
| Coma (schema) | 8.31 s | 0.65 s | 13× | +0.007 | +0.008 | −0.003 |
| Coma (instances) | 322.23 s | 4.71 s | 68× | +0.006 | −0.050 | −0.004 |

Java Coma in v0.5.0

Instance-mode Java Coma required manual heap configuration (java_xmx parameter) and still ran out of memory on large datasets even with 8 GB allocated. v1.0.0 eliminates the JVM dependency entirely — no Java installation, no heap tuning, no OOM errors.

The pure-Python rewrite is faster and more reliable, and its mean F1 is marginally higher on this benchmark (+0.007 in schema mode, +0.006 in instance mode), although instance-mode Recall@GT is 0.050 lower.


New in v1.0.0: embedding-based Jaccard

v1.0.0 adds JaccardDistanceMatcher with distance_fun=StringDistanceFunction.Embedding, which uses sentence embeddings instead of character-level string distance.

| Matcher | Time | Mean F1 | Mean Recall@GT | Mean MRR |
|---|---|---|---|---|
| JaccardDistanceMatcher (string) | 3.92 s | 0.6463 | 0.5611 | 0.2474 |
| JaccardDistanceMatcher (embedding) | 48.98 s | 0.6567 | 0.5811 | 0.2514 |

The embedding variant requires sentence-transformers to be installed and trades ~12× more time (48.98 s vs 3.92 s) for a small accuracy gain (+0.01 F1). It performs particularly well on columns with semantically related but lexically dissimilar names.
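
To make the idea of a swappable distance function concrete, here is a self-contained sketch of a Jaccard-style column similarity in which two values count as equal when a pluggable similarity function exceeds a threshold. It is not valentine's code: `fuzzy_jaccard`, `char_similarity`, the 0.8 threshold, and the use of `difflib` as the character-level measure are all assumptions for illustration. An embedding-based variant would pass a function that compares sentence embeddings instead.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (difflib ratio, used as a stand-in)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_jaccard(
    col_a: Iterable[str],
    col_b: Iterable[str],
    similar: Callable[[str, str], float],
    threshold: float = 0.8,  # assumed cut-off, not a valentine default
) -> float:
    """Jaccard similarity where 'equal' means similar(a, b) >= threshold."""
    set_a, set_b = set(col_a), set(col_b)
    if not set_a or not set_b:
        return 0.0
    # Values of A that have at least one close-enough partner in B.
    matched = {a for a in set_a if any(similar(a, b) >= threshold for b in set_b)}
    intersection = len(matched)
    union = len(set_a) + len(set_b) - intersection
    return intersection / union


left = ["Brooklyn", "Queens"]
right = ["brooklyn", "Bronx"]

fuzzy = fuzzy_jaccard(left, right, char_similarity)
exact = fuzzy_jaccard(left, right, lambda a, b: 1.0 if a == b else 0.0)
```

With exact equality nothing matches because of the case difference, while the character-level function pairs "Brooklyn" with "brooklyn". Every value in one column may be compared against every value in the other, so a more expensive similarity function directly increases the runtime.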


Per-dataset results

Each table covers one matcher. Columns show v0.5.0 and v1.0.0 side-by-side so differences are immediately visible. v0.5.0 Coma = Java-backed (canonical).

Coma (schema)

| Dataset | v0.5 F1 | v1.0 F1 | v0.5 Recall@GT | v1.0 Recall@GT |
|---|---|---|---|---|
| Capital_Projects | 0.667 | 0.800 | 0.600 | 0.800 |
| DCM_StreetCenterLine | 0.833 | 0.857 | 0.857 | 0.857 |
| DPR_AthleticFacilities | 0.737 | 0.778 | 0.800 | 0.800 |
| DSNY_Districts | 0.533 | 0.500 | 0.500 | 0.500 |
| NYC_Municipal_Building | 0.800 | 0.545 | 0.800 | 0.600 |
| COVID-19_Free_Meals | 0.444 | 0.444 | 0.400 | 0.400 |
| Housing_Maintenance | 0.762 | 0.750 | 0.667 | 0.750 |
| Public_Design_Commission | 0.417 | 0.417 | 0.600 | 0.400 |
| Swim_for_Life | 0.889 | 0.889 | 0.800 | 0.800 |
| DOT_Resurfacing | 0.500 | 0.667 | 0.400 | 0.600 |
| Mean | 0.658 | 0.665 | 0.642 | 0.651 |

Coma (instances)

v0.5.0 timed out on Housing_Maintenance even at 8 GB heap (131 s); excluded from v0.5.0 mean.

| Dataset | v0.5 F1 | v1.0 F1 | v0.5 Recall@GT | v1.0 Recall@GT |
|---|---|---|---|---|
| Capital_Projects | 0.909 | 0.800 | 1.000 | 0.800 |
| DCM_StreetCenterLine | 0.833 | 0.769 | 0.857 | 0.714 |
| DPR_AthleticFacilities | 0.556 | 0.300 | 0.700 | 0.300 |
| DSNY_Districts | 0.533 | 0.588 | 0.625 | 0.500 |
| NYC_Municipal_Building | 1.000 | 0.889 | 1.000 | 0.800 |
| COVID-19_Free_Meals | 0.667 | 0.750 | 0.600 | 0.800 |
| Housing_Maintenance | TIMEOUT | 0.917 | TIMEOUT | 0.917 |
| Public_Design_Commission | 0.560 | 0.815 | 0.600 | 0.800 |
| Swim_for_Life | 0.889 | 0.889 | 1.000 | 1.000 |
| DOT_Resurfacing | 0.889 | 1.000 | 1.000 | 1.000 |
| Mean | 0.765 § | 0.772 | 0.820 § | 0.763 |

§ v0.5.0 mean over 9 completed datasets.

Cupid

| Dataset | v0.5 F1 | v1.0 F1 | v0.5 Recall@GT | v1.0 Recall@GT |
|---|---|---|---|---|
| Capital_Projects | 0.600 | 0.600 | 0.600 | 0.600 |
| DCM_StreetCenterLine | 0.727 | 0.667 | 0.571 | 0.714 |
| DPR_AthleticFacilities | 0.211 | 0.211 | 0.200 | 0.100 |
| DSNY_Districts | 0.462 | 0.462 | 0.500 | 0.500 |
| NYC_Municipal_Building | 0.333 | 0.333 | 0.200 | 0.200 |
| COVID-19_Free_Meals | 0.286 | 0.286 | 0.200 | 0.200 |
| Housing_Maintenance | 0.583 | 0.609 | 0.583 | 0.583 |
| Public_Design_Commission | 0.182 | 0.182 | 0.200 | 0.200 |
| Swim_for_Life | 0.667 | 0.750 | 0.600 | 0.600 |
| DOT_Resurfacing | 0.750 | 0.750 | 0.600 | 0.600 |
| Mean | 0.480 | 0.485 | 0.427 | 0.430 |

DistributionBased

v0.5.0 crashed on Public_Design_Commission (min() arg is an empty sequence); fixed in v1.0.0.

| Dataset | v0.5 F1 | v1.0 F1 | v0.5 Recall@GT | v1.0 Recall@GT |
|---|---|---|---|---|
| Capital_Projects | 0.500 | 0.571 | 0.400 | 0.400 |
| DCM_StreetCenterLine | 0.667 | 0.500 | 0.571 | 0.571 |
| DPR_AthleticFacilities | 0.320 | 0.333 | 0.100 | 0.200 |
| DSNY_Districts | 0.526 | 0.500 | 0.375 | 0.500 |
| NYC_Municipal_Building | 0.750 | 0.750 | 0.600 | 0.600 |
| COVID-19_Free_Meals | 0.750 | 0.750 | 0.800 | 0.800 |
| Housing_Maintenance | 0.667 | 0.762 | 0.667 | 0.667 |
| Public_Design_Commission | ERROR | 0.750 | ERROR | 0.667 |
| Swim_for_Life | 0.889 | 1.000 | 1.000 | 1.000 |
| DOT_Resurfacing | 0.750 | 0.889 | 0.800 | 0.800 |
| Mean | 0.647 † | 0.681 | 0.590 † | 0.621 |

† v0.5.0 mean over 9 completed datasets.

JaccardDistanceMatcher

v0.5.0 timed out on 5 datasets; mean computed over 5 that completed.

| Dataset | v0.5 F1 | v1.0 F1 | v0.5 Recall@GT | v1.0 Recall@GT |
|---|---|---|---|---|
| Capital_Projects | 0.400 | 0.400 | 0.400 | 0.400 |
| DCM_StreetCenterLine | TIMEOUT | 0.571 | TIMEOUT | 0.286 |
| DPR_AthleticFacilities | TIMEOUT | 0.148 | TIMEOUT | 0.000 |
| DSNY_Districts | 0.316 | 0.333 | 0.125 | 0.125 |
| NYC_Municipal_Building | 0.727 | 0.727 | 0.800 | 0.800 |
| COVID-19_Free_Meals | 0.889 | 0.889 | 0.800 | 0.800 |
| Housing_Maintenance | TIMEOUT | 0.667 | TIMEOUT | 0.667 |
| Public_Design_Commission | TIMEOUT | 0.839 | TIMEOUT | 0.733 |
| Swim_for_Life | 1.000 | 1.000 | 1.000 | 1.000 |
| DOT_Resurfacing | TIMEOUT | 0.889 | TIMEOUT | 0.800 |
| Mean | 0.666 ‡ | 0.646 | 0.625 ‡ | 0.561 |

‡ v0.5.0 mean over 5 completed datasets only.
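
The footnoted means on this page average only the datasets that produced a score. A minimal sketch of that rule, using the v0.5.0 JaccardDistanceMatcher F1 column above with `None` standing in for TIMEOUT (the helper name `mean_over_completed` is illustrative, not part of the benchmark scripts):

```python
from typing import Dict, Optional, Tuple

# v0.5.0 JaccardDistanceMatcher F1 per dataset; None marks a TIMEOUT.
JACCARD_V050_F1: Dict[str, Optional[float]] = {
    "Capital_Projects": 0.400,
    "DCM_StreetCenterLine": None,
    "DPR_AthleticFacilities": None,
    "DSNY_Districts": 0.316,
    "NYC_Municipal_Building": 0.727,
    "COVID-19_Free_Meals": 0.889,
    "Housing_Maintenance": None,
    "Public_Design_Commission": None,
    "Swim_for_Life": 1.000,
    "DOT_Resurfacing": None,
}


def mean_over_completed(
    scores: Dict[str, Optional[float]],
) -> Tuple[Optional[float], int]:
    """Average the non-missing scores; also return how many were averaged."""
    completed = [s for s in scores.values() if s is not None]
    if not completed:
        return None, 0
    return sum(completed) / len(completed), len(completed)


mean_f1, n_completed = mean_over_completed(JACCARD_V050_F1)
print(f"mean F1 = {mean_f1:.4f} over {n_completed} datasets")
```

The result, 0.6664 over 5 datasets, matches the summary table. Because the datasets that timed out are dropped from the v0.5.0 mean but included in the v1.0.0 mean, the two means are not computed over the same set of datasets.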

SimilarityFlooding

| Dataset | v0.5 F1 | v1.0 F1 | v0.5 Recall@GT | v1.0 Recall@GT |
|---|---|---|---|---|
| Capital_Projects | 0.667 | 0.714 | 0.600 | 0.800 |
| DCM_StreetCenterLine | 0.714 | 0.625 | 0.714 | 0.714 |
| DPR_AthleticFacilities | 0.414 | 0.424 | 0.600 | 0.700 |
| DSNY_Districts | 0.333 | 0.476 | 0.250 | 0.500 |
| NYC_Municipal_Building | 0.400 | 0.333 | 0.400 | 0.600 |
| COVID-19_Free_Meals | 0.364 | 0.400 | 0.400 | 0.400 |
| Housing_Maintenance | 0.560 | 0.500 | 0.583 | 0.583 |
| Public_Design_Commission | 0.389 | 0.389 | 0.267 | 0.200 |
| Swim_for_Life | 0.800 | 0.667 | 0.800 | 0.800 |
| DOT_Resurfacing | 0.444 | 0.400 | 0.400 | 0.400 |
| Mean | 0.507 | 0.493 | 0.501 | 0.580 |

ComaPy (v0.5.0 experimental — now the stable Coma in v1.0.0)

| Dataset | schema F1 | instances F1 | schema Recall@GT | instances Recall@GT |
|---|---|---|---|---|
| Capital_Projects | 0.667 | 0.909 | 0.600 | 1.000 |
| DCM_StreetCenterLine | 0.857 | 0.857 | 0.857 | 0.857 |
| DPR_AthleticFacilities | 0.783 | 0.609 | 0.700 | 0.400 |
| DSNY_Districts | 0.533 | 0.667 | 0.500 | 0.625 |
| NYC_Municipal_Building | 0.727 | 0.800 | 0.800 | 0.800 |
| COVID-19_Free_Meals | 0.400 | 0.800 | 0.400 | 0.800 |
| Housing_Maintenance | 0.615 | 0.750 | 0.667 | 0.750 |
| Public_Design_Commission | 0.538 | 0.615 | 0.267 | 0.467 |
| Swim_for_Life | 0.800 | 1.000 | 0.800 | 1.000 |
| DOT_Resurfacing | 0.444 | 0.909 | 0.400 | 1.000 |
| Mean | 0.636 | 0.792 | 0.599 | 0.770 |

JaccardDistanceMatcher (embedding) (v1.0.0 only)

| Dataset | F1 | Recall@GT |
|---|---|---|
| Capital_Projects | 0.400 | 0.400 |
| DCM_StreetCenterLine | 0.571 | 0.286 |
| DPR_AthleticFacilities | 0.138 | 0.000 |
| DSNY_Districts | 0.381 | 0.125 |
| NYC_Municipal_Building | 0.727 | 0.800 |
| COVID-19_Free_Meals | 0.889 | 0.800 |
| Housing_Maintenance | 0.696 | 0.667 |
| Public_Design_Commission | 0.765 | 0.733 |
| Swim_for_Life | 1.000 | 1.000 |
| DOT_Resurfacing | 1.000 | 1.000 |
| Mean | 0.657 | 0.581 |

Methodology

  • Datasets: 10 real-world NYC Open Data table pairs from the NYC schema-matching benchmark, covering city government, education, housing, and transportation domains.
  • Metrics: F1Score and RecallAtSizeofGroundTruth (top-|GT| predictions, TP/|GT|) via matches.get_metrics(); MRR computed manually from ranked match order.
  • Timeout: 120 s per dataset per matcher, enforced via ThreadPoolExecutor with non-blocking shutdown.
  • v0.5.0 Java Coma heap: java_xmx="8192m" (8 GB) — the default 1 GB caused OOM on two large datasets; even 4 GB was insufficient for one.
  • Hardware: Single Windows workstation; timings are wall-clock, single-threaded (process_num=1 for DistributionBased).
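
The timeout mechanism described above can be approximated as follows. This is a sketch under stated assumptions rather than the benchmark's actual harness: `run_with_timeout` and the status strings are illustrative names, and only the standard-library `concurrent.futures` API is used.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn(*args, **kwargs) in a worker thread with a wall-clock limit.

    Returns (status, result, elapsed_seconds), where status is
    "ok", "TIMEOUT", or "ERROR".
    """
    executor = ThreadPoolExecutor(max_workers=1)
    start = time.perf_counter()
    future = executor.submit(fn, *args, **kwargs)
    try:
        result = future.result(timeout=timeout_s)
        status = "ok"
    except FutureTimeout:
        result, status = None, "TIMEOUT"
    except Exception as exc:  # the matcher itself raised
        result, status = repr(exc), "ERROR"
    finally:
        # Non-blocking shutdown: do not wait for a still-running worker.
        executor.shutdown(wait=False)
    elapsed = time.perf_counter() - start
    return status, result, elapsed


ok_status, ok_result, _ = run_with_timeout(lambda: 42, 1.0)
slow_status, _, slow_elapsed = run_with_timeout(time.sleep, 0.05, 0.3)
err_status, err_result, _ = run_with_timeout(min, 1.0, [])
```

A Python thread cannot be killed from outside, so a timed-out call keeps running in the background until it finishes; the non-blocking shutdown only lets the harness move on to the next dataset. That leftover work can compete for CPU with later measurements, which is worth keeping in mind when reading timings recorded after a timeout.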