Primer: TOML implementation#
This notebook reconstructs the Translator showcased in the Translation primer using a TOML configuration.
[1]:
import sys
import rics
import id_translation
# Print relevant versions
print(f"{id_translation.__version__=}")
print(f"{sys.version=}")
id_translation.__version__='1.0.1.dev1'
sys.version='3.14.0 (main, Oct 7 2025, 16:05:28) [GCC 13.3.0]'
[2]:
rics.configure_stuff()
👻 Configured some stuff just the way I like it!
Translatable data#
[3]:
import pandas as pd
bite_report = pd.read_csv("biting-victims-2019-05-11.csv")
bite_report
[3]:
| | human_id | bitten_by |
|---|---|---|
| 0 | 1904 | 1 |
| 1 | 1991 | 0 |
| 2 | 1991 | 2 |
| 3 | 1999 | 0 |
Mapping#
Define heuristic function#
This heuristic maps the id placeholder to animal_id when context="animals".
It also rewrites id to human_id for humans.csv, whose id column is already correctly named, but this is not a problem since the best match will be used.
[4]:
def smurf_column_heuristic(value, candidates, context):
    """Heuristic for matching columns that use the "smurf" convention."""
    return (
        # Handles plural form that ends with or without an s.
        f"{context[:-1]}_{value}" if context[-1] == "s" else f"{context}_{value}",
        candidates,  # unchanged
    )
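To see what the heuristic does in isolation, the sketch below calls it directly for both sources. The column lists are assumptions based on the placeholders shown in the fetching log, and the `matches` helper is a hypothetical, simplified stand-in for "best match wins"; it is not how id_translation actually scores candidates.

```python
def smurf_column_heuristic(value, candidates, context):
    """Heuristic for matching columns that use the "smurf" convention."""
    return (
        f"{context[:-1]}_{value}" if context[-1] == "s" else f"{context}_{value}",
        candidates,  # unchanged
    )


# Assumed column names; only animal_id and id are stated in the text.
animals_columns = ["animal_id", "name", "species"]
humans_columns = ["id", "name", "title"]

animals_value, _ = smurf_column_heuristic("id", animals_columns, "animals")
humans_value, _ = smurf_column_heuristic("id", humans_columns, "humans")
print(animals_value)  # animal_id
print(humans_value)  # human_id


def matches(value, candidates, context):
    """Hypothetical simplification: accept the original or the rewritten value."""
    rewritten, _ = smurf_column_heuristic(value, candidates, context)
    return [c for c in candidates if c in (value, rewritten)]


print(matches("id", animals_columns, "animals"))  # ['animal_id']
print(matches("id", humans_columns, "humans"))  # ['id']
```

For humans the rewritten value human_id matches nothing, so the plain id column is still the one selected.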
Moment of truth#
[5]:
from id_translation import Translator
translated_bite_report = Translator.from_config("config.toml").translate(bite_report)
translated_bite_report
2025-12-03T23:24:21.058 [id_translation.fetching:INFO] Finished initialization of 'PandasFetcher' in 3 ms: PandasFetcher(sources=['animals', 'humans'])
2025-12-03T23:24:21.059 [id_translation.Translator.map:INFO] Finished mapping of 2/2 names in 'DataFrame' in 201 μs: {'bitten_by': 'animals', 'human_id': 'humans'}.
2025-12-03T23:24:21.063 [id_translation.fetching:INFO] Finished fetching from 2 sources in 3 ms: ['humans' x ('id', 'name', 'title') x 3/3 IDs], ['animals' x ('id', 'name', 'species') x 3/3 IDs].
2025-12-03T23:24:21.065 [id_translation.Translator:INFO] Finished translation of 6 unique IDs (2 names) in 'DataFrame' in 6 ms.
[5]:
| | human_id | bitten_by |
|---|---|---|
| 0 | Mr. Fred (id=1904) | Morris (id=1) the dog |
| 1 | Mr. Richard (id=1991) | Tarzan (id=0) the cat |
| 2 | Mr. Richard (id=1991) | Simba (id=2) the lion |
| 3 | Dr. Sofia (id=1999) | Tarzan (id=0) the cat |
[6]:
expected = pd.read_csv("biting-victims-2019-05-11-translated.csv")
pd.testing.assert_frame_equal(translated_bite_report, expected)
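The translated values follow the fmt string in config.toml, `[{title}. ]{name} (id={id})[ the {species}]`, where the bracketed blocks appear only when their placeholders are available. The sketch below imitates that behaviour in plain Python to make the output above easier to read. The `render` function is a hypothetical illustration written for this notebook, not the library's own formatting code, and it ignores details such as escaped or nested brackets.

```python
import re

FMT = "[{title}. ]{name} (id={id})[ the {species}]"


def render(fmt, **values):
    """Render fmt, skipping bracketed blocks whose placeholders are missing."""
    # re.split with one capture group alternates: required, optional, required, ...
    parts = re.split(r"\[([^\]]*)\]", fmt)
    rendered = []
    for index, part in enumerate(parts):
        is_optional = index % 2 == 1
        needed = re.findall(r"\{(\w+)\}", part)
        if is_optional and not all(name in values for name in needed):
            continue  # drop the whole optional block
        rendered.append(part.format(**values))
    return "".join(rendered)


# A human has a title but no species; an animal has a species but no title.
print(render(FMT, title="Mr", name="Fred", id=1904))  # Mr. Fred (id=1904)
print(render(FMT, name="Morris", id=1, species="dog"))  # Morris (id=1) the dog
```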
Print the config#
[7]:
!pygmentize config.toml
################################################################################
# For help, see https://id-translation.readthedocs.io #
################################################################################
[translator]
fmt = "[{title}. ]{name} (id={id})[ the {species}]"
# ------------------------------------------------------------------------------
# Name-to-source mapping configuration. Binds names to source, eg 'cute_animals'
# -> 'my_database.animals'. Overrides take precedence over scoring logic.
[translator.mapping]
score_function = "equality"
[[translator.mapping.score_function_heuristics]]
function = "like_database_table"
[translator.mapping.overrides]
bitten_by = "animals"
################################################################################
# Fetching configuration.
################################################################################
[fetching.PandasFetcher]
read_path_format = "./sources/{}.csv"
[fetching.mapping]
# ------------------------------------------------------------------------------
# Placeholder mapping configuration. Binds actual names in sources (such as
# column names in an SQL table) to wanted names, eg id = 'animal_id'.
[[fetching.mapping.score_function_heuristics]]
function = "__main__.smurf_column_heuristic"