Translation primer#
This document walks through a toy example, translating a “Bite report” from an unfortunate petting zoo, in order to demonstrate some key concepts. Components are presented in the order in which they are used during normal operation.
To keep things simple, we will keep everything in a single folder – the current working directory – for this example. The file structure is as follows:
. # current working directory
├── translation-primer.py
├── biting-victims-2019-05-11.csv
├── biting-victims-2019-05-11-translated.csv
└── sources
    ├── animals.csv
    └── humans.csv
This example uses the API to construct the Translator instance, but the recommended way of creating instances is through Configuration. Condensed versions that create an equivalent Translator using either the API or TOML configuration are available in the Notebooks section.
Call diagram#
The Translator either performs or coordinates most tasks. A notable exception is the placeholder mapping subprocess, which is handled internally by AbstractFetcher.map_placeholders.
Green indicates a Translator member function.
Red denotes fetcher ownership.
Blue indicates a task that is delegated to an object owned by the Translator.
A fetching.PandasFetcher
is used in the example below, meaning that
sources
are resolved by searching for files in a given directory, and
actually fetching translations
means reading files
in this directory.
Real applications typically use something like a SQL database
instead. Underlying
concepts remain the same, no matter how translation data is retrieved.
Translatable data#
The “Bite report” to translate is shown below.
human_id | bitten_by
---|---
1904 | 1
1991 | 0
1991 | 2
1999 | 0
The first column indicates who was bitten (a human), the second who bit them (an animal). Since bites are a frequent occurrence, the zoo uses integer IDs instead of plaintext in its bite reports to save space. The Translator doesn’t work on files, so we’ll translate a pandas.DataFrame instead.
from pandas import DataFrame, read_csv
bite_report: DataFrame = read_csv("biting-victims-2019-05-11.csv")
The Translator
knows what a DataFrame
is, and will assume that the columns are names to translate.
Note
In the language of the Translator
, the bite report is a Translatable
. The columns
'human_id'
and 'bitten_by'
are the names
that must be
translated
.
Translation sources#
The zoo provides reference tables that allow us to make sense of the data. These tables are stored as regular CSV files and contain some basic information about the humans and animals that are referenced in the (for now) unintelligible bite report.
To access these tables, the Translator
needs a Fetcher
that can read and
interpret CSV files. The PandasFetcher
is built to perform such tasks.
from id_translation.fetching import PandasFetcher
fetcher = PandasFetcher(
    read_function=read_csv,
    # Look for .csv files in the 'sources' subfolder of the current working directory
    read_path_format='./sources/{}.csv'
)
This fetcher will look for CSV files in the sources subfolder of the current working directory, using pandas.read_csv() to deserialize them. Source names are the filenames without the .csv suffix.
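The relationship between source names and file paths can be sketched with plain string formatting. This is an illustration of the idea only, not the PandasFetcher implementation; the helper names path_for_source and source_for_path are hypothetical.

```python
from pathlib import PurePosixPath

# Same pattern as passed to the fetcher above.
read_path_format = "./sources/{}.csv"


def path_for_source(source: str) -> str:
    """Fill the '{}' slot with a source name (hypothetical helper)."""
    return read_path_format.format(source)


def source_for_path(path: str) -> str:
    """Recover the source name: the filename without its suffix (hypothetical helper)."""
    return PurePosixPath(path).stem


print(path_for_source("animals"))               # ./sources/animals.csv
print(source_for_path("./sources/humans.csv"))  # humans
```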
Note
In the language of the Translator
, the CSV files 'animals.csv'
and 'humans.csv'
are translation
sources
. All fetching is done through the
Fetcher
interface.
Name-to-source mapping#
The modules in the mapping namespace are used to perform name-to-source mapping. By default, names and sources must match exactly, which is rarely the case in practice. In our case, there are two names that should each be matched to one source.
Mapping human_id → humans. Mappings like these are common and may be solved using the built-in like_database_table() heuristic.
from id_translation.mapping import HeuristicScore
score_function = HeuristicScore("equality", heuristics=["like_database_table"])
Mapping bitten_by → animals. This mapping is impossible without a high-level understanding of the context. Using an override is the best solution in this case.
overrides = {"bitten_by": "animals"}
We’re now ready to create the Mapper
instance.
from id_translation.mapping import Mapper
mapper = Mapper(score_function, overrides=overrides)
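The combined effect of a heuristic and an override can be sketched without the library. The functions below are hypothetical toy stand-ins; the actual behavior of like_database_table() and of the Mapper is assumed here, not reproduced.

```python
def toy_like_database_table(name: str) -> str:
    """Toy stand-in (assumed behavior): 'human_id' -> 'humans'."""
    base = name[:-3] if name.endswith("_id") else name
    return base if base.endswith("s") else base + "s"


def toy_map_names(names, sources, overrides):
    """Map each name to a source; overrides take precedence over the heuristic."""
    mapping = {}
    for name in names:
        if name in overrides:
            mapping[name] = overrides[name]
            continue
        guess = toy_like_database_table(name)
        if guess in sources:
            mapping[name] = guess
    return mapping


names = ["human_id", "bitten_by"]
sources = ["animals", "humans"]

with_override = toy_map_names(names, sources, {"bitten_by": "animals"})
without_override = toy_map_names(names, sources, {})
print(with_override)
print(without_override)
```

Without the override, 'bitten_by' stays unmapped in this sketch, since no simple rewrite of the name leads to 'animals'.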
Note
In the language of the Mapper
, names become values
and the
sources are referred to as the candidates
. See the
Mapping primer page for more information.
Translation format#
We must now decide what we want our report to look like once it’s translated. First, we note that the first two columns,
'id'
and 'name'
, are the same for humans and animals. The 'humans'
source also has a unique 'title'
column (or placeholder). The 'animals'
source has a unique 'species'
placeholder.
We would like the translations to include as much information as possible, and as such we will use a flexible
Format
that includes two
optional
placeholders.
translation_format = "[{title}. ]{name} (id={id})[ the {species}]"
The use of optional blocks (placeholders and string literals surrounded by square brackets [..]) allows us to use the same translation format for humans and animals.
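How optional blocks behave can be sketched in plain Python. The toy_render function below is a hypothetical illustration, not the library's Format class: an optional block is dropped when any of its placeholders is missing, while a missing required placeholder raises an error.

```python
import re

translation_format = "[{title}. ]{name} (id={id})[ the {species}]"


def toy_render(fmt: str, values: dict) -> str:
    """Render fmt, skipping [optional] blocks whose placeholders are missing."""
    # Splitting on a capture group yields alternating parts:
    # even indices are required text, odd indices are optional block contents.
    parts = re.split(r"\[(.*?)\]", fmt)
    rendered = []
    for index, part in enumerate(parts):
        is_optional = index % 2 == 1
        needed = re.findall(r"\{(\w+)\}", part)
        if is_optional and not all(key in values for key in needed):
            continue  # Drop the whole optional block.
        rendered.append(part.format(**values))  # KeyError if a required key is missing.
    return "".join(rendered)


human = toy_render(translation_format, {"id": 1904, "name": "Fred", "title": "Mr"})
animal = toy_render(translation_format, {"id": 0, "name": "Tarzan", "species": "cat"})
print(human)   # Mr. Fred (id=1904)
print(animal)  # Tarzan (id=0) the cat
```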
Note
The translation Format
specifies how translated IDs should be represented. The
elements 'title'
, 'name'
, 'id'
, and 'species'
are called placeholders
.
The 'name'
and 'id'
placeholders are required_placeholders
;
translation will fail if they cannot be retrieved. The others – 'title'
and 'species'
– are
optional_placeholders
.
Placeholder mapping#
Analogous to name-to-source mapping, placeholder mapping binds the wanted placeholders
of the translation Format
to the actual placeholders
found in the source.
Note
In the language of the Mapper
, wanted placeholders become values
and the actual placeholders are referred to as the candidates
.
The source or file which we are performing mapping for is referred to as the
context
.
The wanted placeholder names match the actual ones exactly, except for 'id', which the 'animals' source stores as 'animal_id'. The easiest solution is to use an override. However, as this kind of naming is common, a more generic solution makes sense.
def smurf_column_heuristic(value, candidates, context):
    """Heuristic for matching columns that use the "smurf" convention."""
    return (
        # Handles plural form that ends with or without an s.
        f"{context[:-1]}_{value}" if context[-1] == "s" else f"{context}_{value}",
        candidates,  # unchanged
    )
smurf_score = HeuristicScore('equality', heuristics=[smurf_column_heuristic])
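To see what the heuristic produces, it can be called directly. The function is repeated so that the snippet runs on its own; the candidate list is an assumption based on the columns of the 'animals' source described in this primer.

```python
def smurf_column_heuristic(value, candidates, context):
    """Heuristic for matching columns that use the "smurf" convention."""
    return (
        # Handles plural form that ends with or without an s.
        f"{context[:-1]}_{value}" if context[-1] == "s" else f"{context}_{value}",
        candidates,  # unchanged
    )


# Assumed columns of the 'animals' source.
animal_columns = ["animal_id", "name", "species"]

new_value, new_candidates = smurf_column_heuristic("id", animal_columns, "animals")
print(new_value)                    # animal_id
print(new_value in new_candidates)  # True
```

The wanted placeholder 'id' is rewritten to 'animal_id', which an exact-equality comparison can then match against the actual column.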
Placeholder mapping is the responsibility of the Fetcher
. The reason for this is that the required mappings are
often specific to a single source collection (such as a database). Having separate
mappers
makes fetching configuration easier to maintain for
applications that use multiple fetchers.
# Amend the fetcher we created earlier.
fetcher = PandasFetcher(
    read_function=read_csv,
    read_path_format="./sources/{}.csv",
    mapper=Mapper(smurf_score),  # Add the mapper.
)
With placeholder mapping in place, all that remains is to create the Translator.
Putting it all together#
from id_translation import Translator
translator = Translator(fetcher, fmt=translation_format, mapper=mapper)
translated_bite_report = translator.translate(bite_report)
Unless inplace=True is passed, translate() always returns a copy.
Translated data#
human_id | bitten_by
---|---
Mr. Fred (id=1904) | Morris (id=1) the dog
Mr. Richard (id=1991) | Tarzan (id=0) the cat
Mr. Richard (id=1991) | Simba (id=2) the lion
Dr. Sofia (id=1999) | Tarzan (id=0) the cat
Staying true to his reputation, Tarzan the cat has claimed the most victims.
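The tally behind that claim can be checked with a few lines of standard-library Python. The IDs and names are copied from the tables in this primer; the variable names are illustrative only.

```python
from collections import Counter

# The 'bitten_by' column of the untranslated bite report.
bitten_by = [1, 0, 2, 0]
# Animal names as they appear in the translated report.
animal_names = {0: "Tarzan", 1: "Morris", 2: "Simba"}

top_id, bites = Counter(bitten_by).most_common(1)[0]
print(f"{animal_names[top_id]}: {bites} bites")  # Tarzan: 2 bites
```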
Notebooks#
Implementations may be found in the following notebooks: