Benchmark: v0.5.0 → v1.0.0

This page documents the performance comparison between valentine v0.5.0 and v1.0.0 on the NYC dataset collection — 10 real-world table pairs from NYC Open Data spanning city government, education, housing, and transportation domains.

All timings are wall-clock seconds measured on a single Windows machine. Per-dataset timeout: 120 s. Accuracy metrics are mean F1 and mean MRR across all datasets that completed without error or timeout.

Coma in v0.5.0 vs v1.0.0

In v0.5.0, Coma was the canonical Java-backed implementation — it required a JRE on the host machine and spawned a JVM per call. A pure-Python variant, ComaPy, existed but was considered experimental. In v1.0.0, the Java backend was removed entirely: Coma is now the pure-Python implementation (the former ComaPy, promoted to stable). All Coma comparisons in this document use the canonical class of each version — Coma (Java) for v0.5.0 and Coma (Python) for v1.0.0.

Summary

Speed (total wall-clock time, 10 datasets)

Matcher	v0.5.0	v1.0.0	Speedup
Coma (schema)	8.31 s	0.65 s	13×
Coma (instances)	322.23 s	4.71 s	68×
Cupid	163.04 s	3.55 s	46×
DistributionBased	164.70 s	3.94 s	42×
JaccardDistanceMatcher	730.36 s ⚠	3.92 s	186×
SimilarityFlooding	53.84 s	3.30 s	16×
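The Speedup column is the v0.5.0 total divided by the v1.0.0 total, rounded to the nearest whole number. The short check below recomputes it from the timings in the table; the names `TIMINGS`, `speedup`, and `SPEEDUPS` are illustrative and not part of valentine.

```python
# Total wall-clock seconds copied from the speed table: (v0.5.0, v1.0.0).
TIMINGS = {
    "Coma (schema)": (8.31, 0.65),
    "Coma (instances)": (322.23, 4.71),
    "Cupid": (163.04, 3.55),
    "DistributionBased": (164.70, 3.94),
    "JaccardDistanceMatcher": (730.36, 3.92),
    "SimilarityFlooding": (53.84, 3.30),
}


def speedup(old_seconds: float, new_seconds: float) -> int:
    """Return old/new rounded to the nearest whole multiple."""
    return round(old_seconds / new_seconds)


SPEEDUPS = {name: speedup(old, new) for name, (old, new) in TIMINGS.items()}

for name, factor in SPEEDUPS.items():
    print(f"{name}: {factor}x")
```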

v0.5.0 reliability issues

JaccardDistanceMatcher timed out on 5 of 10 datasets; the 730 s total combines the 5 runs that completed with 5 × 120 s for the runs that timed out, so the 186× speedup is a lower bound.
DistributionBased crashed on the Public Design Commission dataset (min() arg is an empty sequence) due to a missing guard for sparse, text-heavy columns — fixed in v1.0.0.

Accuracy (mean across completed datasets)

Matcher	v0.5.0 F1	v1.0.0 F1	v0.5.0 Recall@GT	v1.0.0 Recall@GT	v0.5.0 MRR	v1.0.0 MRR
Coma (schema)	0.6582	0.6647	0.6424	0.6507	0.3050	0.3024
Coma (instances)	0.7654¹	0.7717	0.8132¹	0.7631	0.3427¹	0.3384
Cupid	0.4800	0.4848	0.4275	0.4298	0.2452	0.2489
DistributionBased	0.6465²	0.6805	0.5903²	0.6205	0.2892²	0.3019
JaccardDistanceMatcher	0.6664³	0.6463	0.6250³	0.5611	0.3354³	0.2474
SimilarityFlooding	0.5071	0.4929	0.5014	0.5798	0.2853	0.3034

Recall@GT

Recall@GT (RecallAtSizeofGroundTruth) measures recall when selecting exactly len(ground_truth) top predictions — i.e., the fraction of correct pairs recovered if you keep as many predictions as there are gold matches.
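The definition above can be sketched in a few lines of plain Python. This is an illustration of the metric, not valentine's own implementation; the function name `recall_at_ground_truth` and the column names in the example are hypothetical.

```python
def recall_at_ground_truth(ranked_pairs, ground_truth):
    """Recall over the top len(ground_truth) predictions.

    ranked_pairs: predicted (source_column, target_column) pairs, best first.
    ground_truth: the gold (source_column, target_column) pairs.
    """
    gold = set(ground_truth)
    if not gold:
        return 0.0
    # Keep exactly as many predictions as there are gold matches.
    top_k = ranked_pairs[: len(gold)]
    hits = sum(1 for pair in top_k if pair in gold)
    return hits / len(gold)


# Hypothetical example: 3 gold pairs, so only the top 3 predictions count.
ranked = [
    ("id", "ID"),
    ("name", "title"),      # wrong, but ranked above a correct pair
    ("addr", "address"),
    ("zip", "postcode"),    # correct, but falls outside the top 3
]
gold = [("id", "ID"), ("addr", "address"), ("zip", "postcode")]

score = recall_at_ground_truth(ranked, gold)
print(f"{score:.3f}")
```

In this example two of the three gold pairs appear in the top three predictions, so the score is 2/3. The correct fourth prediction does not count because it falls outside the cut-off.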

Accuracy is essentially preserved across the rewrite: mean F1 moves by about 0.02 or less for every matcher except DistributionBased, which gains 0.034 (its v0.5.0 mean excludes the dataset that crashed). The speed-ups come from implementation improvements, not accuracy trade-offs.

Java Coma vs pure-Python Coma

v0.5.0 shipped two Coma variants: Coma (Java-backed, the canonical implementation) and ComaPy (pure Python, experimental). v1.0.0 ships only Coma — the pure-Python implementation, graduated from experimental ComaPy to the new stable default, with the Java backend retired entirely.

Matcher	v0.5.0 Java	v1.0.0 Python	Speedup	F1 delta	Recall@GT delta	MRR delta
Coma (schema)	8.31 s	0.65 s	13×	+0.007	+0.008	−0.003
Coma (instances)	322.23 s	4.71 s	68×	+0.006	−0.050	−0.004

Java Coma in v0.5.0

Instance-mode Java Coma required manual heap configuration (java_xmx parameter) and still ran out of memory on large datasets even with 8 GB allocated. v1.0.0 eliminates the JVM dependency entirely — no Java installation, no heap tuning, no OOM errors.

The pure-Python rewrite is faster and more reliable, and its mean F1 is marginally higher on this benchmark (+0.007 schema, +0.006 instances), although instance-mode Recall@GT is 0.050 lower.

New in v1.0.0: embedding-based Jaccard

v1.0.0 adds JaccardDistanceMatcher with distance_fun=StringDistanceFunction.Embedding, which uses sentence embeddings instead of character-level string distance.

Matcher	Time	Mean F1	Mean Recall@GT	Mean MRR
JaccardDistanceMatcher (string)	3.92 s	0.6463	0.5611	0.2474
JaccardDistanceMatcher (embedding)	48.98 s	0.6567	0.5811	0.2514

The embedding variant requires sentence-transformers to be installed and takes roughly 12.5× as long (48.98 s vs 3.92 s) for a small accuracy gain (+0.01 F1). It performs particularly well on columns with semantically related but lexically dissimilar names.

Per-dataset results

Each table covers one matcher. Columns show v0.5.0 and v1.0.0 side-by-side so differences are immediately visible. v0.5.0 Coma = Java-backed (canonical).

Coma (schema)

Dataset	v0.5 F1	v1.0 F1	v0.5 Recall@GT	v1.0 Recall@GT
Capital_Projects	0.667	0.800	0.600	0.800
DCM_StreetCenterLine	0.833	0.857	0.857	0.857
DPR_AthleticFacilities	0.737	0.778	0.800	0.800
DSNY_Districts	0.533	0.500	0.500	0.500
NYC_Municipal_Building	0.800	0.545	0.800	0.600
COVID-19_Free_Meals	0.444	0.444	0.400	0.400
Housing_Maintenance	0.762	0.750	0.667	0.750
Public_Design_Commission	0.417	0.417	0.600	0.400
Swim_for_Life	0.889	0.889	0.800	0.800
DOT_Resurfacing	0.500	0.667	0.400	0.600
Mean	0.658	0.665	0.642	0.651

Coma (instances)

v0.5.0 timed out on Housing_Maintenance even at 8 GB heap (131 s); excluded from v0.5.0 mean.

Dataset	v0.5 F1	v1.0 F1	v0.5 Recall@GT	v1.0 Recall@GT
Capital_Projects	0.909	0.800	1.000	0.800
DCM_StreetCenterLine	0.833	0.769	0.857	0.714
DPR_AthleticFacilities	0.556	0.300	0.700	0.300
DSNY_Districts	0.533	0.588	0.625	0.500
NYC_Municipal_Building	1.000	0.889	1.000	0.800
COVID-19_Free_Meals	0.667	0.750	0.600	0.800
Housing_Maintenance	TIMEOUT	0.917	TIMEOUT	0.917
Public_Design_Commission	0.560	0.815	0.600	0.800
Swim_for_Life	0.889	0.889	1.000	1.000
DOT_Resurfacing	0.889	1.000	1.000	1.000
Mean	0.765 §	0.772	0.820 §	0.763

§ v0.5.0 mean over 9 completed datasets.

Cupid

Dataset	v0.5 F1	v1.0 F1	v0.5 Recall@GT	v1.0 Recall@GT
Capital_Projects	0.600	0.600	0.600	0.600
DCM_StreetCenterLine	0.727	0.667	0.571	0.714
DPR_AthleticFacilities	0.211	0.211	0.200	0.100
DSNY_Districts	0.462	0.462	0.500	0.500
NYC_Municipal_Building	0.333	0.333	0.200	0.200
COVID-19_Free_Meals	0.286	0.286	0.200	0.200
Housing_Maintenance	0.583	0.609	0.583	0.583
Public_Design_Commission	0.182	0.182	0.200	0.200
Swim_for_Life	0.667	0.750	0.600	0.600
DOT_Resurfacing	0.750	0.750	0.600	0.600
Mean	0.480	0.485	0.427	0.430

DistributionBased

v0.5.0 crashed on Public_Design_Commission (min() arg is an empty sequence); fixed in v1.0.0.

Dataset	v0.5 F1	v1.0 F1	v0.5 Recall@GT	v1.0 Recall@GT
Capital_Projects	0.500	0.571	0.400	0.400
DCM_StreetCenterLine	0.667	0.500	0.571	0.571
DPR_AthleticFacilities	0.320	0.333	0.100	0.200
DSNY_Districts	0.526	0.500	0.375	0.500
NYC_Municipal_Building	0.750	0.750	0.600	0.600
COVID-19_Free_Meals	0.750	0.750	0.800	0.800
Housing_Maintenance	0.667	0.762	0.667	0.667
Public_Design_Commission	ERROR	0.750	ERROR	0.667
Swim_for_Life	0.889	1.000	1.000	1.000
DOT_Resurfacing	0.750	0.889	0.800	0.800
Mean	0.647 †	0.681	0.590 †	0.621

† v0.5.0 mean over 9 completed datasets.

JaccardDistanceMatcher

v0.5.0 timed out on 5 datasets; the mean is computed over the 5 that completed.

Dataset	v0.5 F1	v1.0 F1	v0.5 Recall@GT	v1.0 Recall@GT
Capital_Projects	0.400	0.400	0.400	0.400
DCM_StreetCenterLine	TIMEOUT	0.571	TIMEOUT	0.286
DPR_AthleticFacilities	TIMEOUT	0.148	TIMEOUT	0.000
DSNY_Districts	0.316	0.333	0.125	0.125
NYC_Municipal_Building	0.727	0.727	0.800	0.800
COVID-19_Free_Meals	0.889	0.889	0.800	0.800
Housing_Maintenance	TIMEOUT	0.667	TIMEOUT	0.667
Public_Design_Commission	TIMEOUT	0.839	TIMEOUT	0.733
Swim_for_Life	1.000	1.000	1.000	1.000
DOT_Resurfacing	TIMEOUT	0.889	TIMEOUT	0.800
Mean	0.666 ‡	0.646	0.625 ‡	0.561

‡ v0.5.0 mean over 5 completed datasets only.
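To make the "completed datasets only" convention concrete, the sketch below recomputes the v0.5.0 mean F1 from the table, skipping the runs that timed out. The helper `mean_over_completed` and the use of `None` as the timeout marker are assumptions made for this illustration, not the benchmark script's actual code.

```python
from statistics import mean

# v0.5.0 JaccardDistanceMatcher F1 per dataset, copied from the table.
# None marks a dataset that hit the 120 s timeout.
V050_JACCARD_F1 = {
    "Capital_Projects": 0.400,
    "DCM_StreetCenterLine": None,
    "DPR_AthleticFacilities": None,
    "DSNY_Districts": 0.316,
    "NYC_Municipal_Building": 0.727,
    "COVID-19_Free_Meals": 0.889,
    "Housing_Maintenance": None,
    "Public_Design_Commission": None,
    "Swim_for_Life": 1.000,
    "DOT_Resurfacing": None,
}


def mean_over_completed(scores):
    """Average only the datasets that produced a score.

    Returns (mean, number_of_completed_datasets). Raises
    statistics.StatisticsError if no dataset completed.
    """
    completed = [s for s in scores.values() if s is not None]
    return mean(completed), len(completed)


avg, n = mean_over_completed(V050_JACCARD_F1)
print(f"mean F1 = {avg:.4f} over {n} datasets")
```

Because the two versions are averaged over different sets of datasets, the v0.5.0 and v1.0.0 means for this matcher are not a like-for-like comparison.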

SimilarityFlooding

Dataset	v0.5 F1	v1.0 F1	v0.5 Recall@GT	v1.0 Recall@GT
Capital_Projects	0.667	0.714	0.600	0.800
DCM_StreetCenterLine	0.714	0.625	0.714	0.714
DPR_AthleticFacilities	0.414	0.424	0.600	0.700
DSNY_Districts	0.333	0.476	0.250	0.500
NYC_Municipal_Building	0.400	0.333	0.400	0.600
COVID-19_Free_Meals	0.364	0.400	0.400	0.400
Housing_Maintenance	0.560	0.500	0.583	0.583
Public_Design_Commission	0.389	0.389	0.267	0.200
Swim_for_Life	0.800	0.667	0.800	0.800
DOT_Resurfacing	0.444	0.400	0.400	0.400
Mean	0.507	0.493	0.501	0.580

ComaPy (v0.5.0 experimental, now the stable Coma in v1.0.0)

Dataset	schema F1	instances F1	schema Recall@GT	instances Recall@GT
Capital_Projects	0.667	0.909	0.600	1.000
DCM_StreetCenterLine	0.857	0.857	0.857	0.857
DPR_AthleticFacilities	0.783	0.609	0.700	0.400
DSNY_Districts	0.533	0.667	0.500	0.625
NYC_Municipal_Building	0.727	0.800	0.800	0.800
COVID-19_Free_Meals	0.400	0.800	0.400	0.800
Housing_Maintenance	0.615	0.750	0.667	0.750
Public_Design_Commission	0.538	0.615	0.267	0.467
Swim_for_Life	0.800	1.000	0.800	1.000
DOT_Resurfacing	0.444	0.909	0.400	1.000
Mean	0.636	0.792	0.599	0.770

JaccardDistanceMatcher (embedding) (v1.0.0 only)

Dataset	F1	Recall@GT
Capital_Projects	0.400	0.400
DCM_StreetCenterLine	0.571	0.286
DPR_AthleticFacilities	0.138	0.000
DSNY_Districts	0.381	0.125
NYC_Municipal_Building	0.727	0.800
COVID-19_Free_Meals	0.889	0.800
Housing_Maintenance	0.696	0.667
Public_Design_Commission	0.765	0.733
Swim_for_Life	1.000	1.000
DOT_Resurfacing	1.000	1.000
Mean	0.657	0.581

Methodology

Datasets: 10 real-world NYC Open Data table pairs from the NYC schema-matching benchmark, covering city government, education, housing, and transportation domains.
Metrics: F1Score and RecallAtSizeofGroundTruth (top-|GT| predictions, TP/|GT|) via matches.get_metrics(); MRR computed manually from ranked match order.
Timeout: 120 s per dataset per matcher, enforced via ThreadPoolExecutor with non-blocking shutdown.
v0.5.0 Java Coma heap: java_xmx="8192m" (8 GB) — the default 1 GB caused OOM on two large datasets; even 4 GB was insufficient for one.
Hardware: Single Windows workstation; timings are wall-clock, single-threaded (process_num=1 for DistributionBased).
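The timeout mechanism described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the benchmark script itself: the helper `run_with_timeout`, its status strings, and the toy tasks are hypothetical, and short timeouts stand in for the 120 s limit.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def run_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn in a worker thread; return (status, result, elapsed_seconds)."""
    executor = ThreadPoolExecutor(max_workers=1)
    start = time.perf_counter()
    future = executor.submit(fn, *args, **kwargs)
    try:
        result = future.result(timeout=timeout_s)
        status = "ok"
    except FutureTimeout:
        result, status = None, "TIMEOUT"
    except Exception as exc:  # e.g. ValueError from min() on an empty sequence
        result, status = exc, "ERROR"
    finally:
        # Non-blocking shutdown: do not wait for a still-running worker.
        executor.shutdown(wait=False)
    return status, result, time.perf_counter() - start


def slow_task():
    time.sleep(0.3)
    return "done"


timed_out = run_with_timeout(slow_task, 0.05)
finished = run_with_timeout(lambda: 42, 1.0)
crashed = run_with_timeout(lambda: min([]), 1.0)
print(timed_out[0], finished[0], crashed[0])
```

A Python thread cannot be killed from outside, so a timed-out worker keeps running in the background until it finishes on its own. The non-blocking shutdown only lets the caller move on to the next dataset, and the leftover worker may still compete for CPU with later measurements.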

¹ v0.5.0 Coma (instances) mean computed over 9 completed datasets (Housing_Maintenance timed out).
² v0.5.0 DistributionBased mean excludes the one crashed dataset (Public Design Commission).
³ v0.5.0 JaccardDistanceMatcher mean computed over 5 completed datasets only (5 timeouts).