Translation primer#

This document is dedicated to a toy example, translating a “Bite report” from an unfortunate petting zoo, in order to demonstrate some key concepts. Components are presented in the order in which they are used during normal operation.

To keep things simple, we will keep everything in a single folder – the current working directory – for this example. The file structure is as follows:

. # current working directory
├── translation-primer.py
├── biting-victims-2019-05-11.csv
├── biting-victims-2019-05-11-translated.csv
└── sources
    ├── animals.csv
    └── humans.csv

This example uses the API to construct the Translator instance, but the recommended way of creating instances is through Configuration. Condensed versions for creating an equivalent Translator using either the API or TOML configuration are available in the Notebooks section.

Call diagram#

The Translator either performs or coordinates most tasks. A notable exception is the placeholder mapping subprocess, which is handled internally by AbstractFetcher.map_placeholders.

Green indicates a Translator member function.
Red denotes fetcher ownership.
Blue indicates a task that is delegated to an object owned by the Translator.

Figure: Simplified call diagram for a translation task (../_images/translation-flow.drawio.png). Optional paths and error handling are omitted, as well as most details that are internal to the mapping and fetching processes.

A fetching.PandasFetcher is used in the example below, meaning that sources are resolved by searching for files in a given directory, and actually fetching translations means reading files in this directory.

Real applications typically use something like a SQL database instead. Underlying concepts remain the same, no matter how translation data is retrieved.

Translatable data#

The “Bite report” to translate is shown below.

biting-victims-2019-05-11.csv#
human_id	bitten_by
1904	1
1991	0
1991	2
1999	0

The first column indicates who was bitten (a human), the second who bit them (an animal). Since bites are a frequent occurrence, the zoo uses integer IDs instead of plaintext in its bite reports to save space. The Translator doesn’t work on files, so we’ll translate a pandas.DataFrame instead.

from pandas import DataFrame, read_csv
bite_report: DataFrame = read_csv("biting-victims-2019-05-11.csv")

The Translator knows what a DataFrame is, and will assume that the columns are names to translate.

Note

In the language of the Translator, the bite report is a Translatable. The columns 'human_id' and 'bitten_by' are the names that must be translated.

Translation sources#

The zoo provides reference tables which allow us to make sense of the data. These tables are stored as regular CSV files and contain some basic information about the humans and animals that are referenced in the (for now) unintelligible bite report.

Sources#

humans.csv#
id	name	title
1991	Richard	Mr
1999	Sofia	Dr
1904	Fred	Mr

animals.csv#
animal_id	name	species
0	Tarzan	cat
1	Morris	dog
2	Simba	lion

To access these tables, the Translator needs a Fetcher that can read and interpret CSV files. The PandasFetcher is built to perform such tasks.

from id_translation.fetching import PandasFetcher
fetcher = PandasFetcher(
    read_function=read_csv,
    # Look for .csv-files in the 'sources' sub folder of the current working directory
    read_path_format='./sources/{}.csv'
)

This fetcher will look for CSV files in the sources sub folder of the current working directory, using pandas.read_csv() to deserialize them. Source names will be filenames without the .csv-suffix.
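The relationship between the path format and the source names can be illustrated with a small standard-library sketch. This is not how PandasFetcher is implemented; the helper name `discover_sources` is hypothetical and only mimics the naming rule described above, where the `{}` part of the filename becomes the source name.

```python
import os
import tempfile
from pathlib import Path


def discover_sources(read_path_format: str, root: str) -> dict:
    """Hypothetical helper: map source name -> path for files matching the format."""
    directory, file_format = os.path.split(read_path_format)  # './sources', '{}.csv'
    prefix, suffix = file_format.split("{}")  # '', '.csv'
    sources = {}
    for path in sorted(Path(root, directory).iterdir()):
        name = path.name
        if name.startswith(prefix) and name.endswith(suffix):
            # The part that replaces '{}' is the source name.
            sources[name[len(prefix) : len(name) - len(suffix)]] = path
    return sources


# Recreate the example layout in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    sources_dir = Path(root, "sources")
    sources_dir.mkdir()
    for filename in ("animals.csv", "humans.csv", "notes.txt"):
        (sources_dir / filename).touch()
    found = discover_sources("./sources/{}.csv", root)

print(sorted(found))  # ['animals', 'humans']
```

Files that do not fit the format, such as the made-up `notes.txt`, are simply not considered sources in this sketch.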

Note

In the language of the Translator, the CSV files 'animals.csv' and 'humans.csv' are translation sources. All fetching is done through the Fetcher interface.

Name-to-source mapping#

The mapping namespace modules are used to perform name-to-source mapping. By default, names and sources must match exactly, which is rarely the case in practice. In our case, there are two names that should be matched to one source each.

Mapping human_id → humans. Mappings like these are common and may be solved using the built-in like_database_table() heuristic.
```
from id_translation.mapping import HeuristicScore
score_function = HeuristicScore("equality", heuristics=["like_database_table"])
```
Mapping bitten_by → animals. This is an impossible mapping without high-level understanding of the context. Using an override is the best solution in this case.
```
overrides = {"bitten_by": "animals"}
```

We’re now ready to create the Mapper instance.

from id_translation.mapping import Mapper
mapper = Mapper(score_function, overrides=overrides)
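To make the two strategies above concrete, here is a simplified, library-free sketch of name-to-source mapping. The functions `table_like` and `map_names` are hypothetical stand-ins: the real Mapper works with score functions rather than plain lookups, and the built-in heuristic may normalize names differently.

```python
def table_like(name: str) -> str:
    """Hypothetical normalization: make a column name look like a table name."""
    name = name.lower()
    if name.endswith("_id"):
        name = name[: -len("_id")]  # 'human_id' -> 'human'
    return name if name.endswith("s") else name + "s"  # 'human' -> 'humans'


def map_names(names, sources, overrides=None):
    """Map each name to a source: overrides first, then exact match, then heuristic."""
    overrides = overrides or {}
    mapping = {}
    for name in names:
        if name in overrides:
            mapping[name] = overrides[name]
        elif name in sources:
            mapping[name] = name
        elif table_like(name) in sources:
            mapping[name] = table_like(name)
        # Names without a match are left unmapped.
    return mapping


names = ["human_id", "bitten_by"]
sources = ["animals", "humans"]

without_overrides = map_names(names, sources)
with_overrides = map_names(names, sources, overrides={"bitten_by": "animals"})

print(without_overrides)  # {'human_id': 'humans'}
print(with_overrides)  # {'human_id': 'humans', 'bitten_by': 'animals'}
```

Without the override, 'bitten_by' stays unmapped because no amount of string manipulation turns it into 'animals'.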

Note

In the language of the Mapper, names become values and the sources are referred to as the candidates. See the Mapping primer page for more information.

Translation format#

We must now decide what we want our report to look like once it’s translated. First, we note that the first two columns, the ID and 'name', are shared by humans and animals, although the 'animals' source calls its ID column 'animal_id' rather than 'id'. The 'humans' source also has a unique 'title' column (or placeholder). The 'animals' source has a unique 'species' placeholder.

We would like the translations to include as much information as possible, and as such we will use a flexible Format that includes two optional placeholders.

translation_format = "[{title}. ]{name} (id={id})[ the {species}]"

The use of optional blocks (placeholders and string literals surrounded by square brackets [..]) allows us to use the same translation format for humans and animals.

Note

The translation Format specifies how translated IDs should be represented. The elements 'title', 'name', 'id', and 'species' are called placeholders.

The 'name' and 'id' placeholders are required_placeholders; translation will fail if they cannot be retrieved. The others – 'title' and 'species' – are optional_placeholders.
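The behaviour of required and optional placeholders can be sketched with the standard library alone. The `render` function below is a hypothetical, simplified stand-in for the library's Format class: it assumes blocks are never nested and that brackets are never escaped.

```python
import re
from string import Formatter


def render(fmt: str, values: dict) -> str:
    """Fill in fmt, dropping [optional blocks] whose placeholders are missing."""
    # With one capture group, re.split alternates: required, optional, required, ...
    parts = re.split(r"\[(.*?)\]", fmt)
    rendered = []
    for index, part in enumerate(parts):
        is_optional = index % 2 == 1
        placeholders = [name for _, name, _, _ in Formatter().parse(part) if name]
        if is_optional and any(name not in values for name in placeholders):
            continue  # Skip the whole block, including its string literals.
        rendered.append(part.format(**values))  # KeyError if a required one is missing.
    return "".join(rendered)


translation_format = "[{title}. ]{name} (id={id})[ the {species}]"

human = render(translation_format, {"id": 1904, "name": "Fred", "title": "Mr"})
animal = render(translation_format, {"id": 1, "name": "Morris", "species": "dog"})

print(human)  # Mr. Fred (id=1904)
print(animal)  # Morris (id=1) the dog
```

Calling `render` without a 'name' or 'id' value raises a KeyError in this sketch, mirroring the rule that translation fails when required placeholders cannot be retrieved.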

Placeholder mapping#

Analogous to name-to-source mapping, placeholder mapping binds the wanted placeholders of the translation Format to the actual placeholders found in the source.

Note

In the language of the Mapper, wanted placeholders become values and the actual placeholders are referred to as the candidates. The source or file which we are performing mapping for is referred to as the context.

All placeholder names also match exactly, except for the 'animal_id' placeholder in the 'animals' source. The easiest solution is to use an override. However, as this kind of naming is common, a more generic solution makes sense.

A custom AliasFunction heuristic that lets the wanted 'id' placeholder match 'animal_id'.#

def smurf_column_heuristic(value, candidates, context):
    """Heuristic for matching columns that use the "smurf" convention."""
    return (
        # Handles plural form that ends with or without an s.
        f"{context[:-1]}_{value}" if context[-1] == "s" else f"{context}_{value}",
        candidates,  # unchanged
    )

smurf_score = HeuristicScore('equality', heuristics=[smurf_column_heuristic])
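Calling the heuristic directly shows what it produces for each source. The function is repeated here only so that the snippet runs on its own; the column lists are taken from the two reference tables.

```python
def smurf_column_heuristic(value, candidates, context):
    """Heuristic for matching columns that use the "smurf" convention."""
    return (
        f"{context[:-1]}_{value}" if context[-1] == "s" else f"{context}_{value}",
        candidates,  # unchanged
    )


animal_columns = ["animal_id", "name", "species"]
human_columns = ["id", "name", "title"]

animal_alias, _ = smurf_column_heuristic("id", animal_columns, "animals")
human_alias, _ = smurf_column_heuristic("id", human_columns, "humans")

print(animal_alias, animal_alias in animal_columns)  # animal_id True
print(human_alias, human_alias in human_columns)  # human_id False
```

The alias only helps for the 'animals' source. In 'humans', the wanted 'id' placeholder already matches the 'id' column exactly, so the alias is not needed there.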

Placeholder mapping is the responsibility of the Fetcher. The reason for this is that the required mappings are often specific to a single source collection (such as a database). Having separate mappers makes fetching configuration easier to maintain for applications that use multiple fetchers.

# Amend the fetcher we created earlier.
fetcher = PandasFetcher(
    read_function=read_csv,
    # Same path format as before: the 'sources' sub folder of the current working directory.
    read_path_format="./sources/{}.csv",
    mapper=Mapper(smurf_score),  # Add the mapper.
)

With placeholder mapping in place, all that remains is to create the Translator.

Putting it all together#

from id_translation import Translator
translator = Translator(fetcher, fmt=translation_format, mapper=mapper)
translated_bite_report = translator.translate(bite_report)

Unless inplace=True is passed, translate() always returns a copy.

Translated data#

biting-victims-2019-05-11-translated.csv#
human_id	bitten_by
Mr. Fred (id=1904)	Morris (id=1) the dog
Mr. Richard (id=1991)	Tarzan (id=0) the cat
Mr. Richard (id=1991)	Simba (id=2) the lion
Dr. Sofia (id=1999)	Tarzan (id=0) the cat

Staying true to his reputation, Tarzan the cat has claimed the most victims.

Notebooks#

Implementations may be found in the following notebooks: