Primer: TOML implementation

This notebook reconstructs the Translator showcased in the Translation primer using a TOML configuration.

[1]:
import sys
import rics
import id_translation

# Print relevant versions
print(f"{rics.__version__=}")
print(f"{id_translation.__version__=}")
print(f"{sys.version=}")
rics.__version__='3.2.0'
id_translation.__version__='0.5.1.dev1'
sys.version='3.11.6 (main, Oct 23 2023, 22:48:54) [GCC 11.4.0]'
[2]:
rics.configure_stuff(format="[%(name)s:%(levelname)s] %(message)s")
👻 Configured some stuff just the way I like it!

Translatable data

[3]:
import pandas as pd

bite_report = pd.read_csv("biting-victims-2019-05-11.csv")
bite_report
[3]:
   human_id  bitten_by
0      1904          1
1      1991          0
2      1991          2
3      1999          0

Mapping

Define heuristic function

This heuristic will map id to animal_id when context="animals".

It will remap the correctly named id column in humans.csv as well, but this is not a problem since the best match will be used.

[4]:
def smurf_column_heuristic(value, candidates, context):
    """Heuristic for matching columns that use the "smurf" convention."""
    return (
        # Handles plural form that ends with or without an s.
        f"{context[:-1]}_{value}" if context[-1] == "s" else f"{context}_{value}",
        candidates,  # unchanged
    )
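
To make the effect of the heuristic concrete, the snippet below calls it directly with a few sample inputs. This is only a sketch: the function is repeated so the snippet runs on its own, and the column names and the "sheep" context are hypothetical values chosen for illustration, not taken from the actual source files.

```python
def smurf_column_heuristic(value, candidates, context):
    """Copy of the heuristic above, repeated so this snippet runs on its own."""
    return (
        f"{context[:-1]}_{value}" if context[-1] == "s" else f"{context}_{value}",
        candidates,  # unchanged
    )


# Hypothetical column names, for illustration only.
animal_columns = ["animal_id", "name", "species"]

# The plural context "animals" loses its trailing "s" before being prefixed.
new_value, new_candidates = smurf_column_heuristic("id", animal_columns, context="animals")
assert new_value == "animal_id"
assert new_candidates is animal_columns  # candidates are passed through untouched

# The same rewrite happens for "humans", even though that source already uses "id".
assert smurf_column_heuristic("id", ["id", "name"], context="humans")[0] == "human_id"

# A context that does not end in "s" is used as-is.
assert smurf_column_heuristic("id", ["sheep_id"], context="sheep")[0] == "sheep_id"
```

Note that the function only rewrites the value it is given. Choosing the best match among the candidates is left to the mapping machinery.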

Moment of truth

[5]:
from id_translation import Translator

translated_bite_report = Translator.from_config("config.toml").translate(bite_report)
translated_bite_report
[id_translation.Translator.translate:INFO] Finished translation of 2 names in 'DataFrame'-type data in 7ms, using name-to-source mapping: {'human_id': 'humans', 'bitten_by': 'animals'}.
[5]:
   human_id               bitten_by
0  Mr. Fred (id=1904)     Morris (id=1) the dog
1  Mr. Richard (id=1991)  Tarzan (id=0) the cat
2  Mr. Richard (id=1991)  Simba (id=2) the lion
3  Dr. Sofia (id=1999)    Tarzan (id=0) the cat
[6]:
expected = pd.read_csv("biting-victims-2019-05-11-translated.csv")
pd.testing.assert_frame_equal(translated_bite_report, expected)
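
For readers who want to see what the translated strings amount to, here is a plain-Python sketch that reproduces the output shown above by hand. It is not how id_translation works internally. The lookup tables are reconstructed from the translated output, and the field names (title, name, species) and helper functions are assumptions made for illustration.

```python
# Rows of the bite report shown earlier: (human_id, bitten_by).
bite_rows = [(1904, 1), (1991, 0), (1991, 2), (1999, 0)]

# Hypothetical lookup tables, reconstructed from the translated output above.
humans = {1904: ("Mr", "Fred"), 1991: ("Mr", "Richard"), 1999: ("Dr", "Sofia")}
animals = {0: ("Tarzan", "cat"), 1: ("Morris", "dog"), 2: ("Simba", "lion")}


def format_human(human_id):
    """Render a human ID in the style seen in the translated report."""
    title, name = humans[human_id]
    return f"{title}. {name} (id={human_id})"


def format_animal(animal_id):
    """Render an animal ID in the style seen in the translated report."""
    name, species = animals[animal_id]
    return f"{name} (id={animal_id}) the {species}"


manual = [(format_human(h), format_animal(a)) for h, a in bite_rows]
for row in manual:
    print(row)
```

The Translator does this work from the configuration instead, including the step of deciding which source each column belongs to.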