Mapping primer#

Mapping is performed by the Mapper class. The general procedure is the same for the Name-to-source and Placeholder mapping processes.

See also

If you haven’t already, consider checking out the Translation primer before continuing.

The mapping procedure consists of two principal steps: Step 1/2: Scoring procedure (compute_scores), followed by Step 2/2: Matching procedure (to_directional_mapping). The Translator and AbstractFetcher classes use Mapper.apply(), which combines these two methods.

Step 1/2: Scoring procedure#

The Mapper first applies Overrides and filtering, after which the actual Score computations are performed.

Figure: Colours mapped by spectral distance (RGB).

Overrides and filtering#

Overrides and filtering adhere to a strict hierarchy, listed in order of precedence below. Overrides take precedence over filters, and runtime overrides take precedence over static overrides.

  1. Runtime overrides (type: UserOverrideFunction); set score=∞ for the chosen candidate, and score=-∞ for others.

  2. Static overrides (type: dict or InheritedKeysDict); set score=∞ for the chosen candidate, and score=-∞ for others.

  3. Filtering (type: FilterFunction); set score=-∞ for undesirable matches only.
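The hierarchy above can be sketched in plain Python. This is an illustration only, not the library's implementation: preassign_scores and the simplified callable signatures are hypothetical stand-ins for UserOverrideFunction and FilterFunction, and the filter is assumed to return the candidates it wants to keep.

```python
import math


def preassign_scores(value, candidates, runtime_override=None,
                     static_overrides=None, filters=()):
    """Return {candidate: score}; None means "still needs a computed score".

    Hypothetical helper: the real Mapper types have their own signatures.
    """
    chosen = None
    if runtime_override is not None:
        chosen = runtime_override(value, candidates)  # 1. runtime override
    if chosen is None and static_overrides:
        chosen = static_overrides.get(value)  # 2. static override
    if chosen is not None:
        # The chosen candidate wins outright; all others are ruled out.
        return {c: math.inf if c == chosen else -math.inf for c in candidates}

    scores = dict.fromkeys(candidates)  # every score starts out undecided
    for keep in filters:  # 3. filtering only ever rules candidates out
        allowed = set(keep(value, candidates))
        for c in candidates:
            if c not in allowed:
                scores[c] = -math.inf
    return scores


candidates = ["store", "category", "film"]
only_ids = lambda value, cands: cands if value.endswith("_id") else []
print(preassign_scores("store_id", candidates, static_overrides={"store_id": "store"}))
print(preassign_scores("rental_date", candidates, filters=[only_ids]))
```

Entries that remain None after this step are the ones a score function would still have to compute.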

Hint

Score-based mapping trades precision for convenience. This may be undesirable, especially for fetching as this may incur additional costs. See the Override-only mapping section for details.

Score computations#

  1. Compute value-candidate match scores (type: ScoreFunction). Higher is better.

  2. If there are any Heuristics (type: HeuristicScore), apply:

    1. Short-circuiting (type: FilterFunction); reinterpret a FilterFunction such that the returned candidates (if any) are treated as overrides.

    2. Aliasing (type: AliasFunction); try to improve ScoreFunction accuracy by applying heuristics to the (value, candidates)-argument pairs.

    3. Finally, select the best score at each stage (from no to all heuristics) for each pair.
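As a rough sketch of the steps above, the snippet below scores with a plain similarity function, then re-scores after each alias heuristic and keeps the best score seen for each pair. The names similarity, strip_id_suffix and heuristic_scores are hypothetical, and the signatures are simplified assumptions rather than the actual ScoreFunction, AliasFunction or HeuristicScore interfaces.

```python
import math
from difflib import SequenceMatcher


def similarity(value, candidates):
    """Plain score function: one score per candidate, higher is better."""
    return [SequenceMatcher(None, value, c).ratio() for c in candidates]


def strip_id_suffix(value, candidates):
    """Alias heuristic: rewrite the arguments before they are scored again."""
    alias = value[:-3] if value.endswith("_id") else value
    return alias, candidates


def heuristic_scores(value, candidates, score_function,
                     short_circuits=(), aliases=()):
    # Short-circuiting: any returned candidates are treated as overrides.
    for short_circuit in short_circuits:
        forced = set(short_circuit(value, candidates))
        if forced:
            return [math.inf if c in forced else -math.inf for c in candidates]

    best = score_function(value, candidates)  # stage 0: no heuristics
    cur_value, cur_candidates = value, list(candidates)
    for alias in aliases:
        # Each stage applies one more alias on top of the previous ones.
        cur_value, cur_candidates = alias(cur_value, cur_candidates)
        stage = score_function(cur_value, cur_candidates)
        best = [max(b, s) for b, s in zip(best, stage)]
    return best


print(heuristic_scores("film_id", ["store", "film"], similarity,
                       aliases=[strip_id_suffix]))
```

Because the best score from every stage is kept, a heuristic in this sketch can only raise a score, never lower it.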

The final output is a ScoreMatrix, which has been converted to an equivalent DataFrame below.

Partial mapping scores for the Sakila DVD Rental Database example.#

               store     category  customer  staff   film
film_id        0.100     0.040     0.040     0.100   1.000 ★
category_id    0.125     1.000 ★   0.222     0.042   0.040
store_id       1.000 ★   0.125     0.042     0.500   0.100
rental_date    -∞        -∞        -∞        -∞      -∞

Every score for the 'rental_date' value is negative infinity because the value was rejected by filtering.

Hint

The Translator.map_scores-method returns Name-to-source mapping scores.

Step 2/2: Matching procedure#

Given precomputed match scores (see the section above), make as many matches as possible given a Cardinality restriction. These may be summarized as:

  • OneToOne = ‘1:1’: Each value and candidate may be used at most once.

  • OneToMany = ‘1:N’: Values have exclusive ownership of matched candidate(s).

  • ManyToOne = ‘N:1’: Ensure that as many values as possible are unambiguously mapped (i.e. to a single candidate). This is the default for new Mapper instances.

  • ManyToMany = ‘M:N’: All matches above the score limit are kept.

In theory, OneToMany and ManyToOne are equally restrictive. During mapping however, the goal is usually to find matches for values, not candidates. With that in mind, the ordering above may be considered strictly decreasing in preciseness.
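The effect of each restriction can be illustrated with a small greedy matcher. This is a sketch under stated assumptions, not the library's algorithm: the match function, its min_score default, and the choice to visit pairs by descending score (ties keep iteration order) are all hypothetical.

```python
def match(scores, cardinality="N:1", min_score=0.9):
    """Greedily pick matches from a {value: {candidate: score}} mapping."""
    pairs = [
        (value, candidate, score)
        for value, row in scores.items()
        for candidate, score in row.items()
        if score >= min_score
    ]
    pairs.sort(key=lambda pair: -pair[2])  # stable: ties keep iteration order

    one_candidate_per_value = cardinality in ("1:1", "N:1")
    one_value_per_candidate = cardinality in ("1:1", "1:N")

    matches, taken = {}, set()
    for value, candidate, _ in pairs:
        if one_candidate_per_value and value in matches:
            continue  # the value already has its single candidate
        if one_value_per_candidate and candidate in taken:
            continue  # the candidate is owned by another value
        matches.setdefault(value, []).append(candidate)
        taken.add(candidate)
    return matches


scores = {"a": {"x": 1.0, "y": 0.95}, "b": {"x": 0.92, "y": 0.2}}
for cardinality in ("1:1", "1:N", "N:1", "M:N"):
    print(cardinality, match(scores, cardinality))
```

With these scores, '1:1' keeps only a → x, '1:N' additionally gives y to a, 'N:1' lets both a and b map to x, and 'M:N' keeps every pair above the limit.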

Conflict resolution#

When a single match out of multiple viable options must be chosen due to cardinality restrictions, priority is determined by the iteration order of values and candidates. The first value will prefer the first candidate, and so on. This logic does not consider future matches.

>>> mapper = Mapper(cardinality='1:1', score_function=lambda value, *_: [1, 0] if value == 'v1' else [1, 1])
>>> mapper.compute_scores(['v0', 'v1'], ['c0', 'c1'])
candidates   c0   c1
values
v0          1.0  1.0
v1          1.0  0.0
>>> mapper.apply(['v0', 'v1'], ['c0', 'c1']).flatten()
{'v0': 'c0'}

Note that v1 was left without a match, even though it could have been assigned to c0 if the equally viable match v0 → c1 had been chosen first.

Note

A score matrix like this will raise AmbiguousScoreError for any cardinality that requires a single candidate (including 1:1).
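To make the "does not consider future matches" behaviour concrete, the sketch below compares an iteration-order greedy pass with an exhaustive search over one-to-one assignments, for the case where v0 accepts either candidate and v1 accepts only c0. Both functions and the MIN_SCORE threshold are hypothetical illustrations; the library is not claimed to offer an exhaustive mode.

```python
from itertools import permutations

MIN_SCORE = 0.9  # hypothetical score limit
scores = {"v0": {"c0": 1.0, "c1": 1.0}, "v1": {"c0": 1.0, "c1": 0.0}}


def greedy_one_to_one(scores):
    """First value takes its first viable candidate, and so on."""
    matches, taken = {}, set()
    for value, row in scores.items():
        for candidate, score in row.items():
            if score >= MIN_SCORE and candidate not in taken:
                matches[value] = candidate
                taken.add(candidate)
                break
    return matches


def exhaustive_one_to_one(scores):
    """Try every assignment and keep the one with the most matches.

    Assumes there are at least as many candidates as values.
    """
    values = list(scores)
    candidates = list(next(iter(scores.values())))
    best = {}
    for assignment in permutations(candidates, len(values)):
        trial = {v: c for v, c in zip(values, assignment)
                 if scores[v][c] >= MIN_SCORE}
        if len(trial) > len(best):
            best = trial
    return best


print(greedy_one_to_one(scores))      # {'v0': 'c0'}
print(exhaustive_one_to_one(scores))  # {'v0': 'c1', 'v1': 'c0'}
```

The exhaustive variant grows factorially with the number of candidates, which is one reason a simple order-based rule may be preferable in practice; reordering the input values is the cheaper way to influence the outcome.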

Troubleshooting#

Unmapped values are allowed by default. If mapping failure is not an acceptable outcome for your application, initialize the Mapper with on_unmapped='raise'. An error is then raised for unmapped values, and more detailed log messages are emitted on the error level.

Verbose logging#

The mapper can emit per-combination mapping scores when matches are made or when values are left without a match. These messages are gated behind ENABLE_VERBOSE_LOGGING.

The messages below are from a test case in a strange world where only one kind of animal (cardinality=1:1) is allowed to have a specific number of legs.

A listing of matches that were rejected in favour of the current match.#
id_translation.mapping.Mapper: Accepted: 'dog' -> '4'; score=inf (short-circuit or
  override). This match supersedes 1 other matches:
    'cat' -> '4'; score=1.000 (superseded on candidate=4).

In the case above, dog was selected over cat because it was given first in the values vector. Matches that would not have been made regardless (e.g. those with a score below min_score) are not shown in the accept message.

Explanation of why a match was not made.#
id_translation.mapping.Mapper: Could not map value='cat'. Rejected matches:
     'cat' -> '4'; score=1.000 (superseded on candidate=4: 'dog' -> '4'; score=inf).
     'cat' -> '0'; score=0.000 < 0.9 (below threshold).
     'cat' -> '2'; score=0.000 < 0.9 (below threshold).
     'cat' -> '3'; score=0.000 < 0.9 (below threshold).

The severity of unmapped values is determined by the Mapper.on_unmapped attribute. The ENABLE_VERBOSE_LOGGING flag also enables detailed output from other loggers in the mapping namespace.

Messages from the scoring procedure.#
id_translation.mapping.HeuristicScore: Heuristics scores for value='name': [
  'last_update': 0.06 -> 0.10 (+0.03), 'first_name': 0.14 -> 0.99 (+0.85),
  'email': 0.12 -> 0.12 (+0.00), 'address_id': -0.00 -> -0.00 (+0.00),
  'create_date': 0.06 -> 0.14 (+0.08), 'last_name': 0.16 -> 0.38 (+0.22),
  'store_id': -0.01 -> -0.01 (+0.00), 'active': 0.08 -> 0.08 (+0.00),
  'customer_id': -0.01 -> -0.01 (+0.00)]
id_translation.mapping.filter_functions.filter_names: Refuse matching for
  name='return_date': Does not match pattern=re.compile('.*_id$', re.IGNORECASE).

The mapping procedure may emit a large number of records in verbose mode.
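Since the logger names shown above all start with id_translation.mapping, the records can be routed with the standard logging module alone. The following is a minimal sketch that assumes the verbose records are emitted at DEBUG level (the level is not stated here) and does not touch the ENABLE_VERBOSE_LOGGING flag itself.

```python
import logging

# Parent of id_translation.mapping.Mapper, id_translation.mapping.HeuristicScore, etc.
mapping_logger = logging.getLogger("id_translation.mapping")
mapping_logger.setLevel(logging.DEBUG)  # assumption: verbose records use DEBUG

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))
mapping_logger.addHandler(handler)

# Keep the (potentially numerous) records out of the root logger's handlers.
mapping_logger.propagate = False
```

Swapping StreamHandler for logging.FileHandler would send the records to a file instead of the console.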

Override-only mapping#

Score-based mapping is a convenient solution, especially for name-to-source mapping since the names (e.g. pandas.DataFrame.columns) that should be translated have a tendency to change.

Note

Identity mappings are always kept (no need for id = "id" overrides). To block these matches, you may create a dummy override such as id = "_" for affected sources.

Names in sources (e.g. SQL table column names), on the other hand, tend to change a lot less. Scoring may then add an unnecessary element of uncertainty. To ensure that mapping is done “manually”, you may use the included score_functions.disabled()-function to disable the scoring logic.

A conservative override-only mapping configuration for an SqlFetcher.#
[fetching.SqlFetcher]
connection_string = "postgresql+pg8000://postgres:Sofia123!@localhost:5002/sakila"
allow_fetch_all = false
whitelist_tables = ["customer", "category", "country"]

[fetching.mapping.score_function.disabled]
strict = true  # raise instead of silently ignoring

[fetching.mapping.overrides.customer]
id = "customer_id"
name = "first_name"
[fetching.mapping.overrides.category]
id = "category_id"
[fetching.mapping.overrides.country]
id = "country_id"
name = "country"

In strict mode (the default), a ScoringDisabledError is raised if any names are left to map once the Overrides and filtering logic and any short-circuiting logic have been applied.

See also

In non-strict mode (strict=False), any name left to map once the scoring phase begins will be silently discarded by returning -∞ for all value/candidate pairs.
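The difference between the two modes can be sketched as follows. The function and exception below are simplified stand-ins written for illustration; they share names with score_functions.disabled() and ScoringDisabledError, but the signature used here is an assumption.

```python
import math


class ScoringDisabledError(RuntimeError):
    """Stand-in for the library's exception of the same name."""


def disabled(value, candidates, strict=True):
    """Stand-in score function that refuses to score anything."""
    if strict:
        # Strict mode: reaching the scoring phase at all is an error.
        raise ScoringDisabledError(
            f"Scoring is disabled and no override matched {value!r}."
        )
    # Non-strict mode: rule out every candidate, so the value stays unmapped.
    return [-math.inf for _ in candidates]


print(disabled("email", ["customer_id", "first_name"], strict=False))
try:
    disabled("email", ["customer_id", "first_name"])
except ScoringDisabledError as error:
    print(error)
```

Strict mode surfaces a missing override immediately, whereas non-strict mode trades that safety for a quieter failure.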