API reference¶

This page documents every public-facing class, function, and enum exported by the valentine package. For task-oriented guides see Getting started, Matchers, Matcher results, and Evaluation metrics.

Jump to section

Core · ColumnPair · MatcherResults · InvalidMatcherError · Matchers · Metrics · Data sources

The top-level package exports:

from valentine import (
    valentine_match,      # main entry point
    ColumnPair,           # NamedTuple key for matches
    MatcherResults,       # immutable Mapping returned by valentine_match
    InvalidMatcherError,  # raised for invalid matcher arguments
)

`valentine_match`¶

valentine_match(
    dfs: Iterable[pd.DataFrame | pl.DataFrame],
    matcher: BaseMatcher,
    df_names: list[str] | None = None,
    instance_sample_size: int | None = 1000,
) -> MatcherResults

Match columns across every unique pair of DataFrames. Accepts both pandas and Polars DataFrames, which can be freely mixed within the same call.

Parameters

Name	Type	Default	Description
`dfs`	`Iterable[pd.DataFrame | pl.DataFrame]`	—	Two or more DataFrames to match against each other. Any iterable works (list, tuple, generator). Pandas and Polars frames may be mixed freely.
`matcher`	`BaseMatcher`	—	Matcher instance (e.g. `Coma()`, `Cupid()`).
`df_names`	`list[str] | None`	`None`	Optional names for each DataFrame. When `None`, defaults to `"aaa"`, `"bbb"`, `"ccc"`, … (chosen for minimum string similarity so defaults don't influence schema-based matchers). Limited to 26 unnamed tables.
`instance_sample_size`	`int | None`	`1000`	Cap on the number of non-empty rows sampled per column for instance-based matchers (Coma with `use_instances=True`, `DistributionBased`, `JaccardDistanceMatcher`). Pass `None` to use every row. Pass `0` to skip instance data entirely: schema-only matchers are unaffected, but instance-based matchers will see empty columns.

Returns

A MatcherResults instance — an immutable mapping of ColumnPair to similarity scores, sorted high to low.

Raises

ValueError — fewer than 2 DataFrames, mismatched df_names length, or more than 26 DataFrames without explicit names.
InvalidMatcherError — matcher is not a BaseMatcher instance.

Example

import pandas as pd
from valentine import valentine_match
from valentine.algorithms import Coma

df1 = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
df2 = pd.DataFrame({"user_id": [1, 2], "full_name": ["a", "b"]})

matches = valentine_match(
    [df1, df2],
    matcher=Coma(use_instances=True),
    df_names=["users", "accounts"],
    instance_sample_size=500,
)
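
The "every unique pair" behaviour and the default table names described above can be pictured with a small standard-library sketch. The helpers `default_names` and `unique_pairs` are hypothetical names used only for illustration; they are not part of Valentine, and the real implementation may differ.

```python
from itertools import combinations
from string import ascii_lowercase


def default_names(count: int) -> list:
    """Hypothetical helper: 'aaa', 'bbb', ... for up to 26 unnamed tables."""
    if count > len(ascii_lowercase):
        raise ValueError("more than 26 tables require explicit df_names")
    return [letter * 3 for letter in ascii_lowercase[:count]]


def unique_pairs(names: list) -> list:
    """Every unordered pair of tables, each visited exactly once."""
    return list(combinations(names, 2))


names = default_names(3)
pairs = unique_pairs(names)
print(names)  # ['aaa', 'bbb', 'ccc']
print(pairs)  # [('aaa', 'bbb'), ('aaa', 'ccc'), ('bbb', 'ccc')]
```

With `k` tables this yields `k * (k - 1) / 2` pairs, so the number of pairwise comparisons grows quadratically with the number of DataFrames passed in.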

`ColumnPair`¶

class ColumnPair(NamedTuple):
    source_table: str
    source_column: str
    target_table: str
    target_column: str

Immutable, hashable key identifying a matched pair of columns. Used everywhere a match result or ground truth entry is required.

Attributes

Attribute	Type	Description
`source_table`	`str`	Name of the source table.
`source_column`	`str`	Name of the source column.
`target_table`	`str`	Name of the target table.
`target_column`	`str`	Name of the target column.

Computed properties

Property	Type	Description
`source`	`tuple[str, str]`	`(source_table, source_column)`
`target`	`tuple[str, str]`	`(target_table, target_column)`

Because ColumnPair is a NamedTuple, it also supports positional indexing, iteration, and unpacking:

pair = ColumnPair("orders", "price", "sales", "amount")
st, sc, tt, tc = pair
pair[0]            # "orders"
pair.source        # ("orders", "price")
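
To see why a NamedTuple works well as a result key, here is a self-contained stand-in that mirrors the documented fields and computed properties. `ColumnPairSketch` is a hypothetical class for illustration, not Valentine's own `ColumnPair`.

```python
from typing import NamedTuple, Tuple


class ColumnPairSketch(NamedTuple):
    """Stand-in mirroring the documented fields; not Valentine's class."""

    source_table: str
    source_column: str
    target_table: str
    target_column: str

    @property
    def source(self) -> Tuple[str, str]:
        return (self.source_table, self.source_column)

    @property
    def target(self) -> Tuple[str, str]:
        return (self.target_table, self.target_column)


pair = ColumnPairSketch("orders", "price", "sales", "amount")
scores = {pair: 0.87}  # hashable, so usable as a dict key

# A plain 4-tuple hashes and compares equal to the NamedTuple.
lookup = scores[("orders", "price", "sales", "amount")]
```

Because tuple equality ignores the subclass, plain 4-tuples and the named variant are interchangeable as dictionary keys in this sketch.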

`MatcherResults`¶

class MatcherResults(Mapping[ColumnPair, float]):
    def __init__(
        self,
        matches: dict[ColumnPair, float],
        details: dict[ColumnPair, dict[str, float]] | None = None,
    ): ...

Immutable Mapping returned by valentine_match. Entries are sorted from highest to lowest similarity score on construction. Because the mapping is immutable, derived views (such as the cached result of one_to_one_hungarian) cannot be silently invalidated.

Mapping protocol¶

Operation	Behaviour
`len(results)`	Number of matches.
`iter(results)`	Iterate `ColumnPair` keys in descending score order.
`results[pair]`	Look up the similarity score for a given `ColumnPair`.
`pair in results`	Check membership.
`results.items()`	Yield `(ColumnPair, float)` pairs in descending score order.
`results == other`	Equality with another `MatcherResults` or a plain `dict[ColumnPair, float]`.

MatcherResults is not hashable (__hash__ is None).
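
The protocol in the table above matches what `collections.abc.Mapping` provides for free. The sketch below is a hypothetical minimal class (`SortedScores`), written only to illustrate a read-only, score-sorted mapping; it is not Valentine's implementation.

```python
from collections.abc import Iterator, Mapping


class SortedScores(Mapping):
    """Hypothetical read-only mapping, ordered by descending score."""

    __hash__ = None  # unhashable, like the documented MatcherResults

    def __init__(self, scores: dict) -> None:
        ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        self._scores = dict(ordered)  # dicts preserve insertion order

    def __getitem__(self, key):
        return self._scores[key]

    def __iter__(self) -> Iterator:
        return iter(self._scores)

    def __len__(self) -> int:
        return len(self._scores)


results = SortedScores({("a", "x"): 0.4, ("b", "y"): 0.9, ("c", "z"): 0.7})
ordered_keys = list(results)  # highest score first
```

Since no `__setitem__` is defined, item assignment raises `TypeError`, and `Mapping.__eq__` makes comparison with a plain `dict` work.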

Details¶

`details`¶

@property
def details() -> dict[ColumnPair, dict[str, float]]

Per-pair sub-matcher score breakdowns. Returns an empty dict when the matcher does not provide details. Currently populated by Coma, which exposes scores for its name, path, leaves, parents, and instances sub-matchers.

`get_details`¶

def get_details(key: ColumnPair) -> dict[str, float] | None

Return the sub-matcher breakdown for a single pair, or None if no details are available.

Transformations¶

All transformations return a new MatcherResults instance; the original is left untouched. Sub-matcher details are carried over to the filtered subset.

`one_to_one_hungarian`¶

def one_to_one_hungarian(threshold: float | None = None) -> MatcherResults

Default 1:1 selector. Globally optimal bipartite filter via Hungarian assignment (scipy.optimize.linear_sum_assignment): each source and each target column appears in at most one returned pair, with the assignment chosen to maximise total similarity. Pairs below threshold are discarded.

threshold=None (default) uses the median of unique similarity scores as the cutoff, and the result is cached.
Passing an explicit threshold bypasses the cache.
When the input has fewer than two distinct score values, all entries are returned unchanged.
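
The default cutoff rule can be sketched in a few lines. `default_threshold` is a hypothetical helper, and returning `None` to signal "leave the entries unchanged" is an assumption of this sketch rather than documented behaviour.

```python
from statistics import median
from typing import Optional


def default_threshold(scores: dict) -> Optional[float]:
    """Median of the distinct score values, or None with fewer than two."""
    distinct = sorted(set(scores.values()))
    if len(distinct) < 2:
        return None  # nothing to separate: keep every entry
    return median(distinct)


scores = {
    ("a", "x"): 0.9, ("a", "y"): 0.8,
    ("b", "x"): 0.8, ("b", "y"): 0.1,
}
cutoff = default_threshold(scores)  # distinct values are 0.1, 0.8, 0.9
```

Using distinct values means a score that occurs many times does not pull the median towards itself.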

`one_to_one_greedy`¶

def one_to_one_greedy(threshold: float | None = None) -> MatcherResults

Greedy bipartite filter, kept for backwards compatibility. Starting from the highest-scoring pair, greedily assigns each source and each target column at most one partner. Same threshold semantics as one_to_one_hungarian. Greedy can lock in a locally-best pair that blocks a better global assignment, so prefer the Hungarian variant unless you need the legacy behaviour.
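
A tiny example makes the difference between greedy and globally optimal assignment concrete. The sketch below uses a brute-force search as a stand-in for the Hungarian algorithm (practical only for a handful of columns) and applies no threshold; both function names are hypothetical.

```python
from itertools import permutations

scores = {
    ("a", "x"): 0.9, ("a", "y"): 0.8,
    ("b", "x"): 0.8, ("b", "y"): 0.1,
}


def greedy_one_to_one(scores: dict) -> dict:
    """Take pairs from best to worst, skipping columns already used."""
    used_sources, used_targets, chosen = set(), set(), {}
    for (s, t), score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s not in used_sources and t not in used_targets:
            chosen[(s, t)] = score
            used_sources.add(s)
            used_targets.add(t)
    return chosen


def optimal_one_to_one(scores: dict) -> dict:
    """Brute-force the assignment with the highest total similarity.

    Assumes there are no more sources than targets.
    """
    sources = sorted({s for s, _ in scores})
    targets = sorted({t for _, t in scores})
    best = max(
        (list(zip(sources, perm)) for perm in permutations(targets, len(sources))),
        key=lambda pairs: sum(scores.get(p, 0.0) for p in pairs),
    )
    return {p: scores[p] for p in best if p in scores}


greedy_total = sum(greedy_one_to_one(scores).values())    # picks (a, x) then (b, y)
optimal_total = sum(optimal_one_to_one(scores).values())  # picks (a, y) and (b, x)
```

Greedy commits to `(a, x)` first and is then forced into the weak `(b, y)` pair, while the optimal assignment gives up the single best pair to obtain a higher total.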

`one_to_one_mutual_top`¶

def one_to_one_mutual_top(n: int = 1) -> MatcherResults

Mutual top-n filter: keeps pair (s, t) only if t is in s's top-n targets AND s is in t's top-n sources. With n=1 this is the classic mutual nearest-neighbour filter: high precision, at the cost of dropping one-sided affinities. More restrictive than one_to_one_hungarian.
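
The mutual top-n rule can be sketched on a plain dict of `(source, target)` scores. `mutual_top_n` is a hypothetical helper, and the tie-breaking (by name, via tuple sorting) is an assumption of this sketch.

```python
from collections import defaultdict


def mutual_top_n(scores: dict, n: int = 1) -> dict:
    """Keep (s, t) only if each side ranks the other within its top n."""
    by_source, by_target = defaultdict(list), defaultdict(list)
    for (s, t), score in scores.items():
        by_source[s].append((score, t))
        by_target[t].append((score, s))

    top_targets = {
        s: {t for _, t in sorted(cands, reverse=True)[:n]}
        for s, cands in by_source.items()
    }
    top_sources = {
        t: {s for _, s in sorted(cands, reverse=True)[:n]}
        for t, cands in by_target.items()
    }
    return {
        (s, t): score
        for (s, t), score in scores.items()
        if t in top_targets[s] and s in top_sources[t]
    }


scores = {
    ("a", "x"): 0.9, ("a", "y"): 0.8,
    ("b", "x"): 0.8, ("b", "y"): 0.1,
}
kept = mutual_top_n(scores, n=1)
```

Here only `(a, x)` survives: `b` prefers `x`, but `x` prefers `a`, so the one-sided `(b, x)` affinity is dropped.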

`filter`¶

def filter(min_score: float) -> MatcherResults

Return only matches whose similarity is >= min_score.

`take_top_n`¶

def take_top_n(n: int) -> MatcherResults

Return the top n matches by score.

`take_top_percent`¶

def take_top_percent(percent: int) -> MatcherResults

Return the top percent% (0–100) of matches, rounded up.
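
The three simple selectors above behave like the following plain-dict sketch. The function names are hypothetical; the point of interest is the "rounded up" rule in the percentage variant.

```python
import math


def filter_min(scores: dict, min_score: float) -> dict:
    """Keep entries whose score is at least min_score."""
    return {k: v for k, v in scores.items() if v >= min_score}


def top_n(scores: dict, n: int) -> dict:
    """Keep the n highest-scoring entries."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])


def top_percent(scores: dict, percent: int) -> dict:
    # "Rounded up": 25% of 5 matches keeps ceil(1.25) = 2 entries.
    return top_n(scores, math.ceil(len(scores) * percent / 100))


scores = {"p1": 0.9, "p2": 0.7, "p3": 0.5, "p4": 0.3, "p5": 0.1}
quarter = top_percent(scores, 25)
```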

`get_copy`¶

def get_copy() -> MatcherResults

Return a shallow copy of the instance.

Metrics¶

`get_metrics`¶

def get_metrics(
    ground_truth: list[tuple[str, str]] | list[ColumnPair],
    metrics: set[Metric] = METRICS_CORE,
    one_to_one_method: str = "hungarian",
) -> dict[str, Any]

Compute evaluation metrics against a ground truth. The ground truth can be either:

Column-name pairs — [("src_col", "tgt_col"), …]. Table names are ignored during comparison, which is convenient when you only care about column-level alignment.
ColumnPair instances — full table-aware comparison. Use this when the same column name appears in multiple tables.

Both formats may also be passed as plain 2- or 4-tuples; they are normalized internally. Returns a flat dict keyed by metric name (e.g. {"Precision": 0.9, "Recall": 0.8, "F1Score": 0.85, …}).

Parameters

Name	Type	Default	Description
`ground_truth`	`list[tuple[str, str]] | list[ColumnPair]`	—	Expected column-pair mappings. Column-name pairs are table-agnostic; `ColumnPair` instances are table-aware.
`metrics`	`set[Metric]`	`METRICS_CORE`	Set of `Metric` instances to compute.
`one_to_one_method`	`str`	`"hungarian"`	1:1 selector used by metrics that apply one-to-one filtering. One of `"hungarian"`, `"greedy"`, or `"mutual_top"`.
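
As a rough illustration of the table-agnostic comparison, the sketch below reduces both 2-tuples and 4-tuples to column-name pairs and then computes precision, recall, and F1. It is a simplified stand-in under stated assumptions: no one-to-one filtering is applied, and `normalize` and `precision_recall_f1` are hypothetical helpers, not Valentine functions.

```python
def normalize(pair: tuple) -> tuple:
    """Reduce a 2-tuple or 4-tuple to (source_column, target_column)."""
    if len(pair) == 4:
        _, source_column, _, target_column = pair
        return (source_column, target_column)
    return tuple(pair)


def precision_recall_f1(predicted: list, ground_truth: list) -> dict:
    pred = {normalize(p) for p in predicted}
    gold = {normalize(g) for g in ground_truth}
    tp = len(pred & gold)  # true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return {"Precision": precision, "Recall": recall, "F1Score": f1}


predicted = [
    ("hr", "emp_id", "payroll", "employee_number"),
    ("hr", "fname", "payroll", "last_name"),
]
ground_truth = [("emp_id", "employee_number"), ("fname", "first_name")]
report = precision_recall_f1(predicted, ground_truth)
```

One of the two predictions is correct and one of the two gold pairs is recovered, so all three values come out at 0.5.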

`InvalidMatcherError`¶

class InvalidMatcherError(Exception): ...

Raised by valentine_match when the matcher argument is not a BaseMatcher subclass instance.

Deprecated alias

NotAValentineMatcher is kept as an alias for backward compatibility with pre-1.0 code and will be removed in a future release. New code should catch InvalidMatcherError directly.

`Match` (internal)¶

valentine.algorithms.match.Match is an internal dataclass used by matchers to build up result entries before they are merged into a dict[ColumnPair, float]. It is intentionally not re-exported from the top-level package and should not be used in user code — ColumnPair is the stable, public key type.

Matchers (`valentine.algorithms`)¶

Every matcher extends the abstract BaseMatcher class. The module exports:

from valentine.algorithms import (
    BaseMatcher,
    Coma,
    Cupid,
    DistributionBased,
    JaccardDistanceMatcher,
    SimilarityFlooding,
    # Enums used by the matchers:
    Formula, Policy, StringMatcher,
    # Groupings:
    schema_only_algorithms,
    instance_only_algorithms,
    schema_instance_algorithms,
    all_matchers,
    # Key types:
    ColumnPair,
)

The groupings are plain lists of class names:

Constant	Contents
`schema_only_algorithms`	`["SimilarityFlooding", "Cupid"]`
`instance_only_algorithms`	`["DistributionBased", "JaccardDistanceMatcher"]`
`schema_instance_algorithms`	`["Coma"]`
`all_matchers`	Union of the three lists above.

`BaseMatcher`¶

Abstract base. Subclasses must implement get_matches; get_matches_batch has a default fall-back that calls get_matches on each unique pair.

`get_matches`¶

@abstractmethod
def get_matches(
    source_input: BaseTable,
    target_input: BaseTable,
) -> dict[ColumnPair, float]

Match columns between a single pair of tables. Returns a raw dict, not a MatcherResults.

`get_matches_batch`¶

def get_matches_batch(tables: list[BaseTable]) -> dict[ColumnPair, float]

Match columns across every unique pair of tables. Override this method in subclasses that benefit from a holistic view (e.g. global TF-IDF corpus, global distribution ranks). All three of Coma, DistributionBased, and SimilarityFlooding override it, as does JaccardDistanceMatcher when distance_fun=StringDistanceFunction.Embedding.
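
The relationship between the two methods can be sketched with a stand-in base class. `MatcherSketch` and `SameNameMatcher` are hypothetical, and tables are modelled as simple `(name, columns)` tuples rather than `BaseTable` instances.

```python
from abc import ABC, abstractmethod
from itertools import combinations


class MatcherSketch(ABC):
    """Hypothetical stand-in for the documented BaseMatcher contract."""

    @abstractmethod
    def get_matches(self, source, target) -> dict:
        ...

    def get_matches_batch(self, tables: list) -> dict:
        # Default fall-back: run the pairwise matcher on every unique pair.
        merged = {}
        for source, target in combinations(tables, 2):
            merged.update(self.get_matches(source, target))
        return merged


class SameNameMatcher(MatcherSketch):
    """Toy matcher: score 1.0 when two column names are identical."""

    def get_matches(self, source, target) -> dict:
        (s_name, s_cols), (t_name, t_cols) = source, target
        return {(s_name, c, t_name, c): 1.0 for c in s_cols if c in t_cols}


tables = [("t1", ["id", "name"]), ("t2", ["id", "city"]), ("t3", ["name", "city"])]
matches = SameNameMatcher().get_matches_batch(tables)
```

A subclass that needs corpus-wide statistics would override `get_matches_batch` instead, so it can look at all tables before scoring any pair.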

`match_details`¶

@property
def match_details() -> dict[ColumnPair, dict[str, float]]

Per-pair score breakdowns from the most recent match call. Empty by default; populated by matchers that combine multiple sub-scorers. The contents are propagated into MatcherResults.details by valentine_match.

`Coma`¶

Coma(
    max_n: int = 0,
    use_instances: bool = False,
    use_schema: bool = True,
    delta: float = 0.15,
    threshold: float = 0.0,
    instance_weight: float = 1.0,
)

Pure-Python COMA 3.0 implementation. Combines schema-based matchers (name, path, leaves, parents) with an optional TF-IDF instance matcher and selects results using bidirectional best-match logic.

Parameter	Type	Default	Description
`max_n`	`int`	`0`	Maximum number of matches to keep per column. `0` means unlimited. Must be `>= 0`.
`use_instances`	`bool`	`False`	Enable TF-IDF instance-based matching.
`use_schema`	`bool`	`True`	Enable schema-based matching. At least one of `use_schema` and `use_instances` must be `True`.
`delta`	`float`	`0.15`	Fraction from the best per-column score within which matches are kept (e.g. `0.15` keeps all within 15% of the column's best). Must be in `[0, 1]`.
`threshold`	`float`	`0.0`	Absolute minimum similarity to keep a match. Must be in `[0, 1]`.
`instance_weight`	`float`	`1.0`	Relative weight of the instance matcher score when combining with schema scores. Must be `>= 0`.

Populates MatcherResults.details with {name, path, leaves, parents, instances} sub-scores.
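
One way to read the `delta` parameter is as a relative tolerance around each column's best candidate. The sketch below assumes that relative interpretation (cutoff = best × (1 − delta)); `select_within_delta` is a hypothetical helper and the real selection logic in Coma may differ.

```python
def select_within_delta(candidates: dict, delta: float = 0.15) -> dict:
    """Keep candidates whose score is within a relative delta of the best."""
    if not candidates:
        return {}
    best = max(candidates.values())
    cutoff = best * (1.0 - delta)  # assumption: delta is relative to the best
    return {name: score for name, score in candidates.items() if score >= cutoff}


# Candidate target columns for one source column, with similarity scores.
candidates = {"amount": 0.80, "total": 0.70, "currency": 0.40}
kept = select_within_delta(candidates)
```

With the default of `0.15` the cutoff is about `0.68`, so `amount` and `total` are kept and `currency` is dropped; `delta=0.0` keeps only the best candidate.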

`Cupid`¶

Cupid(
    leaf_w_struct: float = 0.2,
    w_struct: float = 0.2,
    th_accept: float = 0.7,
    th_high: float = 0.6,
    th_low: float = 0.35,
    c_inc: float = 1.2,
    c_dec: float = 0.9,
    th_ns: float = 0.7,
    process_num: int = 1,
)

Python implementation of Cupid (Madhavan, Bernstein & Rahm, VLDB 2001): combines linguistic similarity of column names with structural similarity derived from schema tree shape.

Parameter	Type	Default	Description
`leaf_w_struct`	`float`	`0.2`	Weight of structural similarity at leaf level. Must be in `[0, 1]`.
`w_struct`	`float`	`0.2`	Weight of structural similarity at inner-node level. Must be in `[0, 1]`.
`th_accept`	`float`	`0.7`	Acceptance similarity threshold for the final mapping. Must be in `[0, 1]`.
`th_high`	`float`	`0.6`	High-confidence threshold used during structural propagation. Must be in `[0, 1]`.
`th_low`	`float`	`0.35`	Low-confidence threshold used during structural propagation. Must be in `[0, 1]`.
`c_inc`	`float`	`1.2`	Positive reinforcement coefficient for matching children. Must be `> 0`.
`c_dec`	`float`	`0.9`	Negative reinforcement coefficient for non-matching children. Must be `> 0`.
`th_ns`	`float`	`0.7`	Name-similarity threshold. Must be in `[0, 1]`.
`process_num`	`int`	`1`	Number of worker processes. Must be `>= 1`.
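
The roles of `w_struct`, `th_high`, `th_low`, `c_inc`, and `c_dec` follow the Cupid paper: a weighted similarity blends structural and linguistic scores, and leaf-level structural similarity is reinforced or dampened depending on how an ancestor pair scores. The sketch below illustrates that idea only; the function names are hypothetical, and capping at 1.0 is an assumption of this sketch.

```python
def weighted_similarity(ssim: float, lsim: float, w_struct: float = 0.2) -> float:
    """Blend structural (ssim) and linguistic (lsim) similarity."""
    return w_struct * ssim + (1.0 - w_struct) * lsim


def adjust_leaf_ssim(
    ssim: float,
    wsim: float,
    th_high: float = 0.6,
    th_low: float = 0.35,
    c_inc: float = 1.2,
    c_dec: float = 0.9,
) -> float:
    """Reinforce or dampen a leaf pair based on its ancestors' weighted score."""
    if wsim > th_high:
        return min(1.0, ssim * c_inc)  # cap is an assumption of this sketch
    if wsim < th_low:
        return ssim * c_dec
    return ssim


wsim = weighted_similarity(ssim=0.5, lsim=1.0)  # 0.2 * 0.5 + 0.8 * 1.0
boosted = adjust_leaf_ssim(0.5, wsim)           # wsim is above th_high
```

With small `w_struct` values such as the default `0.2`, name similarity dominates the blended score and structure acts mainly as a tie-breaker.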

`DistributionBased`¶

DistributionBased(
    threshold1: float = 0.15,
    threshold2: float = 0.15,
    quantiles: int = 256,
    process_num: int = 1,
    use_bloom_filters: bool = False,
)

Instance-based matcher from Automatic Discovery of Attributes in Relational Databases (Zhang et al., SIGMOD 2011). Compares quantile histograms with Earth Mover's Distance.

Parameter	Type	Default	Description
`threshold1`	`float`	`0.15`	Distance threshold for phase 1 distribution clustering. Must be in `[0, 1]`.
`threshold2`	`float`	`0.15`	Distance threshold for phase 2 attribute clustering. Must be in `[0, 1]`.
`quantiles`	`int`	`256`	Number of quantiles for histogram summaries. Must be `>= 1`.
`process_num`	`int`	`1`	Number of worker processes. Must be `>= 1`.
`use_bloom_filters`	`bool`	`False`	Use Bloom filters for approximate set intersection in phase 2. Trades a small false-positive rate for lower computational cost.

Overrides get_matches_batch to compute global distribution ranks across all tables.

`JaccardDistanceMatcher`¶

JaccardDistanceMatcher(
    threshold_dist: float = 0.8,
    distance_fun: StringDistanceFunction = StringDistanceFunction.Levenshtein,
    process_num: int = 1,
    embedding_model: str = "all-MiniLM-L6-v2",
    embedding_device: str | None = None,
    embedding_batch_size: int = 64,
    tversky_alpha: float = 0.5,
    tversky_beta: float = 0.5,
)

Instance-based matcher using Jaccard (or Tversky) similarity of column value sets, with configurable string-distance or embedding-based element equality. Overrides get_matches_batch to share a single embedding pass across all column pairs when distance_fun=StringDistanceFunction.Embedding.

Parameter	Type	Default	Description
`threshold_dist`	`float`	`0.8`	Threshold above which two strings are considered equal under `distance_fun`. For `Embedding` mode this is a cosine-similarity threshold. Ignored for `Exact`. Must be in `[0, 1]`.
`distance_fun`	`StringDistanceFunction`	`StringDistanceFunction.Levenshtein`	Element-equality function. See `StringDistanceFunction`.
`process_num`	`int`	`1`	Number of worker processes. Must be `>= 1`.
`embedding_model`	`str`	`"all-MiniLM-L6-v2"`	Sentence-transformer model name. Only used when `distance_fun=Embedding`. Requires `pip install valentine[embeddings]`.
`embedding_device`	`str | None`	`None`	Device for embedding inference (`"cuda"`, `"mps"`, `"cpu"`). `None` auto-selects: CUDA → MPS → CPU.
`embedding_batch_size`	`int`	`64`	Batch size for embedding inference. Only used when `distance_fun=Embedding`.
`tversky_alpha`	`float`	`0.5`	Tversky α weight (false positives). Default `0.5` reproduces Jaccard exactly. Set to `0.0` for set containment; `1.0` for Dice.
`tversky_beta`	`float`	`0.5`	Tversky β weight (false negatives). Default `0.5` reproduces Jaccard exactly.
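
The effect of `threshold_dist` on a set-overlap score can be illustrated with a fuzzy Jaccard sketch. This is a simplified stand-in, not Valentine's algorithm: it uses `difflib.SequenceMatcher` from the standard library in place of the library's distance functions, ignores the Tversky weights, and the helper names are hypothetical.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float) -> bool:
    # Stand-in string similarity from the standard library.
    return SequenceMatcher(None, a, b).ratio() >= threshold


def fuzzy_jaccard(left: set, right: set, threshold: float = 0.8) -> float:
    """Jaccard overlap where 'equal' means 'similar enough'."""
    matched_left = {a for a in left if any(similar(a, b, threshold) for b in right)}
    matched_right = {b for b in right if any(similar(a, b, threshold) for a in left)}
    intersection = min(len(matched_left), len(matched_right))
    union = len(left) + len(right) - intersection
    return intersection / union if union else 0.0


left = {"london", "paris", "berlin"}
right = {"londom", "paris", "rome"}
loose = fuzzy_jaccard(left, right, threshold=0.8)   # "london" ~ "londom" counts
strict = fuzzy_jaccard(left, right, threshold=0.9)  # only "paris" counts
```

Raising the threshold turns near-misses such as the misspelled `londom` back into non-matches, which lowers the overlap score.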

`StringDistanceFunction`¶

Enum of supported element-equality functions for JaccardDistanceMatcher:

Value	Description
`StringDistanceFunction.Levenshtein`	Normalized Levenshtein ratio (default).
`StringDistanceFunction.DamerauLevenshtein`	Normalized Damerau–Levenshtein ratio.
`StringDistanceFunction.Hamming`	Normalized Hamming distance (strings of equal length).
`StringDistanceFunction.Jaro`	Jaro similarity.
`StringDistanceFunction.JaroWinkler`	Jaro–Winkler similarity.
`StringDistanceFunction.Exact`	Exact string equality (forces threshold to 1.0).
`StringDistanceFunction.Embedding`	Cosine similarity of sentence-transformer embeddings. Requires `pip install valentine[embeddings]`.

from valentine.algorithms.jaccard_distance import StringDistanceFunction
from valentine.algorithms import JaccardDistanceMatcher

m = JaccardDistanceMatcher(
    threshold_dist=0.9,
    distance_fun=StringDistanceFunction.JaroWinkler,
)

`SimilarityFlooding`¶

SimilarityFlooding(
    coeff_policy: Policy = Policy.INVERSE_AVERAGE,
    formula: Formula = Formula.FORMULA_C,
    string_matcher: StringMatcher = StringMatcher.PREFIX_SUFFIX,
    tfidf_corpus: list[BaseTable] | None = None,
)

Python implementation of Similarity Flooding (Melnik, Garcia-Molina & Rahm, ICDE 2002). Treats each schema as a labelled graph and iteratively propagates an initial element-level similarity to a fixpoint.

Parameter	Type	Default	Description
`coeff_policy`	`Policy`	`Policy.INVERSE_AVERAGE`	Coefficient policy for the propagation graph.
`formula`	`Formula`	`Formula.FORMULA_C`	Fixpoint iteration formula.
`string_matcher`	`StringMatcher`	`StringMatcher.PREFIX_SUFFIX`	String similarity function for the initial element-level mapping.
`tfidf_corpus`	`list[BaseTable] | None`	`None`	Additional tables to include when computing IDF weights for the `PREFIX_SUFFIX_TFIDF` matcher. Ignored otherwise.

Overrides get_matches_batch to compute a global IDF across all tables when string_matcher=PREFIX_SUFFIX_TFIDF.
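
The fixpoint iteration can be sketched on a tiny propagation graph. The code below follows variant C as given in the Similarity Flooding paper, normalize(σ⁰ + σⁱ + φ(σ⁰ + σⁱ)), with nodes standing for candidate element pairs. The graph, the weights, and the function name are hypothetical, and Valentine's internals may be organised differently.

```python
def flood_formula_c(sigma0: dict, edges: dict, max_iter: int = 100, eps: float = 1e-6) -> dict:
    """Iterate sigma <- normalize(sigma0 + sigma + phi(sigma0 + sigma)).

    edges maps a node to the (neighbour, weight) pairs it receives from.
    """
    sigma = dict(sigma0)
    for _ in range(max_iter):
        base = {n: sigma0[n] + sigma[n] for n in sigma0}
        raw = {
            n: base[n] + sum(w * base[src] for src, w in edges.get(n, ()))
            for n in sigma0
        }
        top = max(raw.values())
        nxt = {n: v / top for n, v in raw.items()} if top else raw
        converged = max(abs(nxt[n] - sigma[n]) for n in sigma0) < eps
        sigma = nxt
        if converged:
            break
    return sigma


tables = ("orders", "sales")
good = ("price", "amount")
weak = ("price", "region")

sigma0 = {tables: 0.5, good: 0.6, weak: 0.1}  # initial string similarities
edges = {
    tables: [(good, 0.5), (weak, 0.5)],  # table pair receives from column pairs
    good: [(tables, 1.0)],               # column pairs receive from table pair
    weak: [(tables, 1.0)],
}
final = flood_formula_c(sigma0, edges)
```

Because the initial similarities are re-added on every round, the stronger column pair keeps its lead over the weaker one after propagation.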

`Policy`¶

Value	Description
`Policy.INVERSE_AVERAGE`	Inverse of the average in-degree (default).
`Policy.INVERSE_PRODUCT`	Inverse of the product of in-degrees.

`Formula`¶

Value	Description
`Formula.BASIC`	Basic fixpoint formula.
`Formula.FORMULA_A`	Variant A from the Similarity Flooding paper.
`Formula.FORMULA_B`	Variant B from the Similarity Flooding paper.
`Formula.FORMULA_C`	Variant C (default in Valentine).

`StringMatcher`¶

Value	Description
`StringMatcher.PREFIX_SUFFIX`	Prefix/suffix trigram matcher (default).
`StringMatcher.PREFIX_SUFFIX_TFIDF`	Prefix/suffix matcher weighted by IDF computed from the corpus.
`StringMatcher.LEVENSHTEIN`	Normalized Levenshtein similarity on node labels.

Metrics (`valentine.metrics`)¶

from valentine.metrics import (
    Metric,                       # abstract base class
    Precision,
    Recall,
    F1Score,
    PrecisionTopNPercent,
    RecallAtSizeofGroundTruth,
    MeanReciprocalRank,
    METRICS_CORE,
    METRICS_ALL,
    METRICS_PRECISION_RECALL,
    METRICS_PRECISION_INCREASING_N,
)

`Metric`¶

Abstract base class (@dataclass(frozen=True)). Subclass to implement custom metrics:

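from dataclasses import dataclass

from valentine.metrics import Metric


@dataclass(eq=True, frozen=True)
class MyMetric(Metric):
    threshold: float = 0.5

    def apply(self, matches, ground_truth):
        # Illustrative score: share of matches at or above the threshold.
        kept = [s for s in matches.values() if s >= self.threshold]
        score = len(kept) / len(matches) if matches else 0.0
        return self.return_format(score)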

`apply`¶

@abstractmethod
def apply(
    matches: MatcherResults,
    ground_truth: list[tuple[str, str]] | list[ColumnPair],
) -> dict[str, Any]

Compute the metric value. ground_truth accepts either column-name pairs (table-agnostic) or full ColumnPair tuples (table-aware).

`name`¶

def name() -> str

Default: the class name. Override to parameterize the reported name (e.g. PrecisionTopNPercent substitutes the current n into its name).

`return_format`¶

@final
def return_format(value: Any) -> dict[str, Any]

Final helper that formats a metric value as {self.name(): value}.

Built-in metrics¶

All built-in metrics are @dataclass(frozen=True) and hashable, so they can live in the predefined metric sets.
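
Frozen dataclasses are what make it possible to store parameterised metrics in a set. The stand-ins below are hypothetical classes that mimic the documented `name` and `return_format` behaviour without importing Valentine.

```python
from dataclasses import dataclass


@dataclass(eq=True, frozen=True)
class MetricSketch:
    """Hypothetical stand-in for the documented Metric base."""

    def name(self) -> str:
        return type(self).__name__

    def return_format(self, value) -> dict:
        return {self.name(): value}


@dataclass(eq=True, frozen=True)
class PrecisionTopNPercentSketch(MetricSketch):
    n: int = 10

    def name(self) -> str:
        # Parameterised name: the current n is substituted into the label.
        return f"PrecisionTop{self.n}Percent"


# Equal parameters hash equally, so duplicates collapse inside a set.
metric_set = {
    PrecisionTopNPercentSketch(n=10),
    PrecisionTopNPercentSketch(n=10),
    PrecisionTopNPercentSketch(n=20),
}
formatted = PrecisionTopNPercentSketch(n=20).return_format(0.75)
```

Two metrics with the same parameters count as one set member, while different `n` values report under distinct names and therefore do not overwrite each other in a flat results dict.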

`Precision`¶

Precision(one_to_one: bool = True)

TP / (TP + FP). When one_to_one=True (default), applies MatcherResults.one_to_one_hungarian() before counting.

`Recall`¶

Recall(one_to_one: bool = True)

TP / (TP + FN). Honors one_to_one the same way as Precision.

`F1Score`¶

F1Score(one_to_one: bool = True)

Harmonic mean of precision and recall. Honors one_to_one.

`PrecisionTopNPercent`¶

PrecisionTopNPercent(one_to_one: bool = True, n: int = 10)

Precision restricted to the top n% of predictions by score. n is clamped to [0, 100]. The reported metric name reflects the chosen percentage (e.g. PrecisionTop10Percent).

`RecallAtSizeofGroundTruth`¶

RecallAtSizeofGroundTruth(one_to_one: bool = False)

Recall at the top len(ground_truth) predictions — i.e. what fraction of gold pairs you recover if you select as many predictions as there are gold matches. One-to-one filtering is off by default here.
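
A minimal sketch of the idea, assuming predictions arrive as a list of column-name pairs already sorted by descending score; `recall_at_ground_truth_size` is a hypothetical helper name.

```python
def recall_at_ground_truth_size(ranked_pairs: list, ground_truth: set) -> float:
    """Recall over the top len(ground_truth) predictions."""
    top = ranked_pairs[: len(ground_truth)]
    hits = sum(1 for pair in top if pair in ground_truth)
    return hits / len(ground_truth) if ground_truth else 0.0


ranked_pairs = [
    ("id", "user_id"),      # correct
    ("name", "city"),       # wrong, but ranked second
    ("name", "full_name"),  # correct, but outside the top 2
]
ground_truth = {("id", "user_id"), ("name", "full_name")}
value = recall_at_ground_truth_size(ranked_pairs, ground_truth)
```

Only the top two predictions are inspected because there are two gold pairs, so the correct third-ranked pair does not count and the result is 0.5.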

`MeanReciprocalRank`¶

MeanReciprocalRank()

For each source column, finds the rank of the first correct target column in the matcher's ranked output and averages the reciprocal of that rank across all source columns — a standard information-retrieval metric that rewards having the right answer near the top rather than merely present somewhere in the list.
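
The computation can be sketched as follows. It assumes predictions are column-name pairs sorted by descending score and that the average runs over the source columns appearing in the ground truth; both are assumptions of this sketch, and `mean_reciprocal_rank` is a hypothetical helper.

```python
def mean_reciprocal_rank(ranked_pairs: list, ground_truth: set) -> float:
    """Average 1/rank of the first correct target per source column."""
    per_source = {}
    for source, target in ranked_pairs:
        per_source.setdefault(source, []).append(target)

    sources = {source for source, _ in ground_truth}
    total = 0.0
    for source in sources:
        for rank, target in enumerate(per_source.get(source, []), start=1):
            if (source, target) in ground_truth:
                total += 1.0 / rank
                break  # only the first correct target counts
    return total / len(sources) if sources else 0.0


ranked_pairs = [
    ("id", "user_id"),      # rank 1 for "id": correct
    ("name", "city"),       # rank 1 for "name": wrong
    ("name", "full_name"),  # rank 2 for "name": correct
]
ground_truth = {("id", "user_id"), ("name", "full_name")}
mrr = mean_reciprocal_rank(ranked_pairs, ground_truth)
```

Here `id` contributes 1/1 and `name` contributes 1/2, giving a mean of 0.75; a source column whose correct target never appears contributes 0.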

Predefined metric sets¶

Set	Contents
`METRICS_CORE`	`Precision`, `Recall`, `F1Score`, `PrecisionTopNPercent`, `RecallAtSizeofGroundTruth`, `MeanReciprocalRank` (defaults).
`METRICS_ALL`	Both `one_to_one=True` and `one_to_one=False` variants of `Precision`, `Recall`, `F1Score`, plus `PrecisionTopNPercent`, `RecallAtSizeofGroundTruth`, and `MeanReciprocalRank`.
`METRICS_PRECISION_RECALL`	`{Precision(), Recall()}`.
`METRICS_PRECISION_INCREASING_N`	`PrecisionTopNPercent` for `n ∈ {10, 20, 30, …, 100}`.

Data sources (`valentine.data_sources`)¶

Valentine wraps each DataFrame in a DataframeTable (pandas) or PolarsTable (Polars) before handing it to a matcher. Most users never touch this layer — valentine_match auto-detects the frame type and builds the tables for you — but the classes are public so that custom matchers and custom data sources can be written against the abstractions.

from valentine.data_sources import (
    BaseTable,
    BaseColumn,
    DataframeTable,
    DataframeColumn,
)

# With the polars extra installed:
from valentine.data_sources import PolarsTable, PolarsColumn

`BaseTable`¶

Abstract base for a table-like data source. Implement this to plug a non-DataFrame backend (e.g. SQL cursor, Parquet file, Arrow table) into Valentine's matchers.

Abstract members (must be provided by subclasses):

Member	Kind	Description
`name`	`property -> str`	Table name. Becomes `source_table`/`target_table` in emitted `ColumnPair`s.
`unique_identifier`	`property -> object`	Stable identifier used internally to key per-table state.
`get_columns()`	`method -> list[BaseColumn]`	All columns in the table.
`get_df()`	`method -> pd.DataFrame`	Full DataFrame view of the table.
`is_empty`	`property -> bool`	Whether the table has zero rows.

Concrete members (provided by BaseTable, override if needed):

Member	Kind	Description
`get_instances_df()`	`method -> pd.DataFrame`	DataFrame used for instance-based sampling. Defaults to `get_df()`.
`get_instances_columns()`	`method -> list[BaseColumn]`	Columns built from the instance-sampled DataFrame. Defaults to `get_columns()`.
`get_guid_column_lookup()`	`method -> dict[str, object]`	`{column_name: column.unique_identifier}` lookup.
`get_data_type(data, d_type)`	`staticmethod -> str`	Normalize a pandas dtype into one of `"varchar"`, `"int"`, `"float"`, or `"date"`.

`BaseColumn`¶

Abstract base for a single column. A BaseColumn knows its name, its values, and its detected data type.

Abstract members:

Member	Kind	Description
`name`	`property -> str`	Column name.
`unique_identifier`	`property -> object`	Stable identifier used internally.
`data_type`	`property -> str`	Detected type: one of `"varchar"`, `"int"`, `"float"`, `"date"`.
`data`	`property -> list`	The column's values.

Concrete members:

Member	Kind	Description
`size`	`property -> int`	Number of elements in `data`.
`is_empty`	`property -> bool`	`True` when `size == 0`.

`DataframeTable`¶

DataframeTable(
    df: pd.DataFrame,
    name: str,
    instance_sample_size: int | None = 1000,
)

BaseTable adapter for a pandas DataFrame — the concrete implementation used by valentine_match.

Parameter	Type	Default	Description
`df`	`pd.DataFrame`	—	The DataFrame to wrap.
`name`	`str`	—	Name of the table. Used as `source_table` / `target_table` in emitted `ColumnPair`s.
`instance_sample_size`	`int | None`	`1000`	Cap on the number of non-empty rows sampled per column. Pass `None` to use the full DataFrame; pass `0` to expose no instance data at all. Must be `>= 0` or `None`; other values raise `ValueError`.

Automatic data-type detection classifies each column as "varchar", "int", "float", or "date" based on the DataFrame's dtype and content.

`DataframeColumn`¶

BaseColumn adapter for a single pandas Series. Constructed internally by DataframeTable; exposes the column name, detected data type, unique identifier, and sampled instance values via the standard BaseColumn interface.

`PolarsTable`¶

PolarsTable(
    df: pl.DataFrame,
    name: str,
    instance_sample_size: int | None = 1000,
)

BaseTable adapter for a Polars DataFrame. Requires the polars extra (pip install valentine[polars]). Has the same interface as DataframeTable.

Parameter	Type	Default	Description
`df`	`pl.DataFrame`	—	The Polars DataFrame to wrap.
`name`	`str`	—	Name of the table.
`instance_sample_size`	`int | None`	`1000`	Cap on the number of non-empty rows sampled per column. Pass `None` to use the full DataFrame; pass `0` to expose no instance data at all.

`PolarsColumn`¶

BaseColumn adapter for a single Polars Series. Constructed internally by PolarsTable; exposes the column name, detected data type, unique identifier, and sampled instance values via the standard BaseColumn interface.

Writing a custom data source¶

If your data doesn't live in a pandas DataFrame, implement BaseTable and BaseColumn directly. A minimal custom source provides the abstract members listed above: a name, a unique identifier, a way to enumerate its columns, a DataFrame view, and an emptiness check:

import uuid
import pandas as pd

from valentine import valentine_match
from valentine.algorithms import Coma
from valentine.data_sources import BaseColumn, BaseTable


class DictColumn(BaseColumn):
    def __init__(self, name: str, data: list, data_type: str = "varchar"):
        self._name = name
        self._data = data
        self._data_type = data_type
        self._guid = str(uuid.uuid4())

    @property
    def name(self) -> str:
        return self._name

    @property
    def unique_identifier(self) -> str:
        return self._guid

    @property
    def data_type(self) -> str:
        return self._data_type

    @property
    def data(self) -> list:
        return self._data


class DictTable(BaseTable):
    def __init__(self, name: str, columns: dict[str, list]):
        self._name = name
        self._guid = str(uuid.uuid4())
        self._columns = [DictColumn(k, v) for k, v in columns.items()]

    @property
    def name(self) -> str:
        return self._name

    @property
    def unique_identifier(self) -> str:
        return self._guid

    def get_columns(self) -> list[BaseColumn]:
        return self._columns

    def get_df(self) -> pd.DataFrame:
        return pd.DataFrame({c.name: c.data for c in self._columns})

    @property
    def is_empty(self) -> bool:
        return all(len(c.data) == 0 for c in self._columns)


# Matchers call get_matches / get_matches_batch directly on BaseTable
# instances, so custom sources bypass valentine_match:
source = DictTable("hr", {"emp_id": [1, 2, 3], "fname": ["a", "b", "c"]})
target = DictTable("payroll", {"employee_number": [1, 2, 3], "first_name": ["a", "b", "c"]})

raw = Coma().get_matches_batch([source, target])

If you want to reuse Valentine's instance-sampling logic, override get_instances_df to return a capped DataFrame. Custom sources are accepted by every built-in matcher — only valentine_match itself is DataFrame-specific.
