
Evaluation metrics

Given a ground truth — a list of expected column matches — Valentine computes Precision, Recall, F1 and related metrics in one call:

metrics = matches.get_metrics(ground_truth)

This page is the how-to guide for using metrics. For the full list of built-in metric classes, their parameters, and the predefined metric sets, see the API reference.

Ground-truth formats

ground_truth can be expressed in two formats, both accepted by MatcherResults.get_metrics.

Column-name pairs (table names ignored):

ground_truth = [
    ("emp_id", "employee_number"),
    ("fname",  "first_name"),
    ("lname",  "last_name"),
]

Full ColumnPair instances (table-aware comparison):

from valentine.algorithms import ColumnPair

ground_truth = [
    ColumnPair("hr", "emp_id", "payroll", "employee_number"),
    ColumnPair("hr", "fname",  "payroll", "first_name"),
]

Use ColumnPair ground truth when you're matching more than two tables, or when source and target tables share column names: without table information, the metric code cannot tell which match corresponds to which pair.
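To see why table information matters, here is a minimal standalone sketch (plain Python tuples, not Valentine's internal representation) of two candidate matches that collapse into one when only column names are kept:

```python
# Two tables ("payroll" and "invoices" are made-up names) both expose an
# "id" column. By column name alone, the two matches are indistinguishable:
matches_by_name = {("id", "id")}

# Table-aware pairs keep them distinct:
matches_by_table = {
    ("hr", "id", "payroll", "id"),
    ("hr", "id", "invoices", "id"),
}

assert len(matches_by_name) == 1   # the two matches collapsed into one
assert len(matches_by_table) == 2  # both matches survive
```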

Built-in metrics

Valentine ships five metrics, all in valentine.metrics:

from valentine.metrics import (
    Precision,
    Recall,
    F1Score,
    PrecisionTopNPercent,
    RecallAtSizeofGroundTruth,
)
Metric                      What it measures
Precision                   TP / (TP + FP).
Recall                      TP / (TP + FN).
F1Score                     Harmonic mean of precision and recall.
PrecisionTopNPercent        Precision restricted to the top n% of matches by score.
RecallAtSizeofGroundTruth   Recall when selecting the top len(ground_truth) matches.

Precision, Recall, F1Score and PrecisionTopNPercent all accept a one_to_one: bool flag that applies MatcherResults.one_to_one() before counting. PrecisionTopNPercent additionally takes n: int for the cutoff, and RecallAtSizeofGroundTruth defaults to one_to_one=False. See the API reference for full defaults.

Default metric set

If you call get_metrics without specifying metrics, Valentine uses METRICS_CORE:

metrics = matches.get_metrics(ground_truth)
# {
#   "Precision": ...,
#   "Recall": ...,
#   "F1Score": ...,
#   "PrecisionTop10Percent": ...,
#   "RecallAtSizeofGroundTruth": ...,
# }

Valentine also ships METRICS_ALL, METRICS_PRECISION_RECALL, and METRICS_PRECISION_INCREASING_N for common experiment shapes.

from valentine.metrics import METRICS_PRECISION_INCREASING_N

metrics = matches.get_metrics(
    ground_truth,
    metrics=METRICS_PRECISION_INCREASING_N,
)

Custom metric selection

Pass any set of metric instances to pick exactly what you want:

from valentine.metrics import F1Score, PrecisionTopNPercent

metrics = matches.get_metrics(
    ground_truth,
    metrics={F1Score(one_to_one=False), PrecisionTopNPercent(n=70)},
)

Each metric is computed independently, and the returned dict is keyed by the metric's name(). For PrecisionTopNPercent, the name substitutes the n value, so the call above produces a PrecisionTop70Percent key in the output.
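Because the key is parameterised, it can be rebuilt programmatically when reading results. A small sketch with hypothetical metric values (the dict shape mirrors the keys shown earlier; the numbers are invented for the example):

```python
# Hypothetical results dict shaped like get_metrics' return value.
results = {"F1Score": 0.82, "PrecisionTop70Percent": 0.91}

n = 70
key = f"PrecisionTop{n}Percent"  # rebuild the parameterised key
print(results[key])              # 0.91
```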

Defining your own metric

Subclass Metric and implement apply:

from dataclasses import dataclass
from valentine.metrics import Metric


@dataclass(eq=True, frozen=True)
class SupportAtK(Metric):
    k: int = 5

    def apply(self, matches, ground_truth):
        top_k = matches.take_top_n(self.k)
        return self.return_format(len(top_k) / self.k)


metrics = matches.get_metrics(ground_truth, metrics={SupportAtK(k=10)})

The dataclass must be frozen=True so metric instances are hashable and comparable, since get_metrics takes a set of metrics.
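The hashability requirement can be verified without Valentine at all; frozen dataclasses with eq=True hash by field values, so equal configurations deduplicate inside a set and instances are immutable. A standalone sketch:

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(eq=True, frozen=True)
class SupportAtK:
    k: int = 5


# Equal configurations collapse: only two distinct metrics remain.
metric_set = {SupportAtK(k=10), SupportAtK(k=10), SupportAtK(k=5)}
print(len(metric_set))  # 2

# frozen=True also makes instances immutable.
try:
    SupportAtK().k = 3
except FrozenInstanceError:
    print("instances are immutable")
```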