Configuration#
This document describes the TOML format used by the Translator.from_config() method.
Hint
Functions or classes are resolved by name using rics.misc.get_by_full_name(). Unqualified names are assumed to
belong to an appropriate id_translation module. To specify a custom implementation, use 'fully.qualified.names'
(in quotation marks).
Meta configuration#
The metaconf.toml file must be placed next to the main TOML configuration file, and determines how other files are
processed by the factory.
| Top-level section | Description | Details |
|---|---|---|
| | Control environment-variable interpolation; ${VAR} or | |
Note
The metaconf.toml file is always read as-is, without any processing.
Sections#
The only valid top-level keys are translator, unknown_ids, transform, and fetching. Only the fetching section is
required, though it may be left out of the main configuration file if fetching is configured separately. Other top-level
keys will raise a ConfigurationError if present.
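As a rough sketch, a main configuration file built from these sections might be laid out as follows. The file name and the choice of MemoryFetcher are illustrative assumptions; the keys inside each section are omitted here.

```toml
# Hypothetical main configuration file, e.g. translation.toml.

[translator]
# Optional. Keys that control how translated IDs are displayed.

[unknown_ids]
# Optional. Keys that control how untranslated IDs are displayed.

# The only required section. The second-level key selects the fetcher type.
[fetching.MemoryFetcher]
# Implementation-specific keys go here (omitted in this sketch).
```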
Section: Translator#
| Key | Type | Description |
|---|---|---|
| fmt | | Specify how translated IDs are displayed. |
| enable_uuid_heuristics | | Enabling may improve matching when |
Parameters for Name-to-source mapping are specified in a [translator.mapping] subsection. See Subsection: Mapping for
details (context = source).
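A minimal sketch of a [translator] section follows. The format string is an assumed example, and enable_uuid_heuristics is assumed to take a boolean value; consult the API documentation for the accepted values.

```toml
[translator]
# Assumed example format: render each ID followed by its name.
fmt = "{id}:{name}"
# Assumed to be a boolean switch.
enable_uuid_heuristics = true

# Parameters for Name-to-source mapping (context = source).
[translator.mapping]
unmapped_values_action = "warn"  # one of: raise, warn, ignore
```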
Section: Unknown IDs#
| Key | Type | Description | Comments |
|---|---|---|---|
| fmt | | Specify a format for untranslated IDs. | Can be a plain string |
Alternative placeholder values for unknown IDs can be declared in an [unknown_ids.overrides] subsection. See
Subsection: Overrides for details (context = source).
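For illustration, the sketch below sets a plain-string format, so that every untranslated ID is rendered as the same text. The string itself is an arbitrary example.

```toml
[unknown_ids]
# A plain string without placeholders; the text is an arbitrary example.
fmt = "Unknown"
```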
Note
Sources that are translated using default placeholders count as successful translations when using
Translator.translate(maximal_untranslated_fraction != 1).
Section: Transformations#
You may specify one Transformer per source. Subsection keys are passed directly to the __init__ method of the
chosen transformer type. For available transformers, see the API documentation.
Note
You may add [transform.'<source>'] sections either in the main configuration file or in an auxiliary fetcher
configuration. It is a ConfigurationError to specify transformations for the same source more than once.
For example, to configure a BitmaskTransformer, add a section of the form [transform.'<source>'.BitmaskTransformer]
to an appropriate configuration file:
[transform.'<source>'.BitmaskTransformer]
joiner = " AND "
overrides = [
{ id = 0, override = "NOT_SET" },
{ id = 0b1000, override = "OVERFLOW" },
]
This will create a transform that formats bitmasks such as 0b101 in the following way:
>>> translator.translate((0b000, 0b101, 8), name="<source>")
("NOT_SET", "1:name-of-1 AND 4:name-of-4", "OVERFLOW")
Hint
Custom transformers may be initialized by using sections with fully qualified type names.
For example, a [transform.'<source>'.'my.library.SuperTransformer'] section would import and initialize a
SuperTransformer from the my.library module.
Section: Fetching#
The type of the fetcher is determined by the second-level key (other than mapping, which is reserved). For example,
a MemoryFetcher would be created by adding a [fetching.MemoryFetcher] section.
| Key | Type | Description | Comments |
|---|---|---|---|
| allow_fetch_all | | Control access to | Some fetcher types redefine or ignore this key. |
| fetch_all_unmapped_values_action | raise, warn, or ignore | Special action level for | Interacts with selective_fetch_all. |
| selective_fetch_all | | Sources without required keys are not fetched. | Implicit fetch_all_unmapped_values_action='ignore' |
| fetch_all_cache_max_age | | Specified as a string, e.g. '12h' or '30d'. | Set to non-zero value to enable. |
| cache_keys | | Hierarchical identifier for the cache. | Provided automatically if not given. |
| optional | | If | Multi-fetcher mode only. |
The keys listed above are for the AbstractFetcher class, which all fetchers created by TOML configuration must
inherit. Additional parameters vary based on the chosen implementation. See the id_translation.fetching
module for choices.
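The fragment below sketches how some of the shared AbstractFetcher keys might be set for a MemoryFetcher. The value types are assumptions (booleans for the two switches, a duration string for the cache age), and some fetcher types redefine or ignore allow_fetch_all, so treat this as illustrative only.

```toml
[fetching.MemoryFetcher]
allow_fetch_all = true           # assumed boolean
selective_fetch_all = true       # assumed boolean; implies fetch_all_unmapped_values_action='ignore'
fetch_all_cache_max_age = "12h"  # a non-zero value enables caching
```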
The AbstractFetcher uses a Mapper to bind actual placeholder names in sources to desired placeholder names
requested by the calling Translator instance. See Subsection: Mapping for details. For all mapping operations
performed by the AbstractFetcher, context = source.
Hint
Custom fetchers may be initialized by using sections with fully qualified type names in single quotation marks. For
example, a [fetching.'my.library.SuperFetcher'] section would import and initialize a SuperFetcher from the
my.library module.
Under the hood, this will call get_by_full_name() using name="my.library.SuperFetcher".
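A sketch of such a section is shown below. SuperFetcher and my.library are the placeholder names from the hint; some_argument is a hypothetical keyword argument invented for illustration, and allow_fetch_all is assumed to be a boolean.

```toml
# Quoted key: imports SuperFetcher from the my.library module.
[fetching.'my.library.SuperFetcher']
some_argument = "value"  # hypothetical implementation-specific key
allow_fetch_all = false  # shared AbstractFetcher key; assumed boolean
```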
Multiple fetchers#
Complex applications may require multiple fetchers. These may be specified in auxiliary config files, one fetcher per
file. Only the fetching key will be considered in these files. If multiple fetchers are defined, a MultiFetcher
is created. Fetchers defined this way are hierarchical. The input order determines rank, affecting Name-to-source
mapping. For example, for a Translator created by running
>>> from id_translation import Translator
>>> extra_fetchers=["primary-fetcher.toml", "secondary-fetcher.toml"]
>>> Translator.from_config("translation.toml", extra_fetchers=extra_fetchers)
the Translator.map function will first consider the sources of the fetcher defined in translation.toml (if there
is one), then primary-fetcher.toml, and finally secondary-fetcher.toml.
| Key | Type | Description |
|---|---|---|
| max_workers | | Maximum number of individual child fetchers to call in parallel. |
| duplicate_translation_action | raise, warn, or ignore | Action to take when multiple fetchers return translations for the same source. |
| duplicate_source_discovered_action | raise, warn, or ignore | Action to take when multiple fetchers claim the same source. |
The [fetching.MultiFetcher] section is permitted only in the main configuration file.
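As a sketch, the MultiFetcher options could be set in the main configuration file like this. The worker count is an arbitrary example value and is assumed to be an integer.

```toml
# Main configuration file only.
[fetching.MultiFetcher]
max_workers = 2  # arbitrary example value; assumed integer
duplicate_translation_action = "raise"
duplicate_source_discovered_action = "warn"
```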
Subsection: Mapping#
For more information about the mapping procedure, please refer to the Mapping primer page.
| Key | Type | Description | Comments |
|---|---|---|---|
| score_function | | Compute value/candidate-likeness | |
| unmapped_values_action | raise, warn, or ignore | Handle unmatched values. | |
| cardinality | OneToOne or ManyToOne | Determine how many candidates to map a single value to. | |

Score functions that take additional keyword arguments should be specified in a child section, e.g.
[*.mapping.<score-function-name>]. See id_translation.mapping.score_functions for options.

External functions may be used by putting fully qualified names in single quotation marks. Names that do not contain
any dot characters ('.') are assumed to refer to functions in the appropriate id_translation.mapping submodule.
Hint
For difficult matches, consider using overrides instead.
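A sketch of a mapping subsection for the fetcher is shown below. The score function name equality is an assumption about what id_translation.mapping.score_functions provides; substitute any function listed there.

```toml
[fetching.mapping]
score_function = "equality"      # assumed to exist in id_translation.mapping.score_functions
unmapped_values_action = "warn"  # one of: raise, warn, ignore
cardinality = "ManyToOne"        # or "OneToOne"
```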
Filter functions#
Filters are given in [[*.mapping.filter_functions]] list-subsections. These may be used to remove undesirable
matches, for example SQL tables which should not be used or a DataFrame column that should not be translated.
| Key | Type | Description | Comments |
|---|---|---|---|
| function | | Function name. | |
Note
Additional keys depend on the chosen function implementation.
As an example, the next snippet uses a filter_names() filter to ensure that only names ending with an '_id' suffix
will be translated.
[[translator.mapping.filter_functions]]
function = "filter_names"
regex = ".*_id$"
remove = false # This is the default (like the built-in filter).
Score function#
Some ScoreFunctions take additional keyword arguments. These must be declared in a
[*.mapping.<score-function-name>] subsection. See id_translation.mapping.score_functions for options.
Score function heuristics#
Heuristics may be used to help an underlying score_function make more difficult matches. There are two types of
heuristic functions: AliasFunctions and short-circuiting functions (which are really just differently interpreted
FilterFunctions).
Heuristics are given in [[*.mapping.score_function_heuristics]] list-subsections (note the double brackets) and
are applied in the order in which they are given by the HeuristicScore wrapper class.
| Key | Type | Description | Comments |
|---|---|---|---|
| function | | Function name. | |
| mutate | | Keep changes made by function. | Disabled by default. |
Note
Additional keys depend on the chosen function implementation.
As an example, the next snippet lets us match table columns such as animal_id to the id placeholder by using a
value_fstring_alias() heuristic.
[[fetching.mapping.score_function_heuristics]]
function = "value_fstring_alias"
fstring = "{context}_{value}"
Hint
For difficult matches, consider using overrides instead.
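Because heuristics are applied in the order given, the order of the double-bracket entries matters. The sketch below lists two heuristics; force_lower_case is assumed to be available among the library's heuristic functions, and mutate is assumed to be a boolean.

```toml
# Applied first.
[[fetching.mapping.score_function_heuristics]]
function = "force_lower_case"  # assumed function name
mutate = true                  # keep changes made by this function (disabled by default)

# Applied second.
[[fetching.mapping.score_function_heuristics]]
function = "value_fstring_alias"
fstring = "{context}_{value}"
```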
Subsection: Overrides#
Shared or context-specific key-value pairs implemented by the InheritedKeysDict class. When used in config files,
these appear as [*.overrides] sections. Top-level override items are given in the [*.overrides] section, while
context-specific items are specified using a subsection, e.g. [*.overrides.<context-name>].
Note
The type of context is determined by the class that owns the overrides.
This next snippet is from another example. For unknown IDs, the name is set to ‘Name unknown’ for the ‘name_basics’ source and ‘Title unknown’ for the ‘title_basics’ source, respectively. They both inherit the from and to keys, which are set to ‘?’.
[unknown_ids.overrides]
from = "?"
to = "?"
[unknown_ids.overrides.name_basics]
name = "Name unknown"
[unknown_ids.overrides.title_basics]
name = "Title unknown"
Warning
Overrides have no fixed keys. No validation is performed and errors may be silent. The mapping process
provides detailed information in debug mode, which may be used to discover issues.
Hint
Overrides may also be used to prevent mapping certain values.
Preventing unwanted mappings#
For example, let’s assume that a SQL source table called title_basics has two columns, title and name, with
identical contents. We would like to use the format '[{title}. ]{name}' to output translations such as
‘Mr. Astaire’. To avoid output such as ‘Top Hat. Top Hat’ for movies, we may add
[fetching.mapping.overrides.title_basics]
title = "_"
to force the fetcher to inform the Translator that the title placeholder (column) does not exist for the
title_basics source (we used ‘_’ since TOML does not have a null type).