Fetching data using SqlFetcher#
Translating using a SQL database. This notebook assumes that the Prepare for ``SqlFetcher`` demo-step from the PickleFetcher demo notebook has been completed.
[1]:
import sys
import rics
import id_translation
# Print relevant versions
print(f"{id_translation.__version__=}")
print(f"{sys.version=}")
rics.configure_stuff(id_translation_level="DEBUG")
id_translation.__version__='0.15.0.dev1'
sys.version='3.11.13 (main, Jun 4 2025, 08:57:30) [GCC 13.3.0]'
đź‘» Configured some stuff just the way I like it!
Create translator from config#
Click here to see the file.
[2]:
from id_translation import Translator
translator = Translator.from_config("config.toml")
translator
[2]:
Translator(online=True: fetcher=SqlFetcher('postgresql+pg8000://postgres:***@localhost:5002/imdb', whitelist=['name_basics', 'title_basics']))
Load database#
Using the SqlFetcher engine.
[3]:
from data import load_imdb
for source in ["name.basics", "title.basics"]:
df = load_imdb(source)[0]
df.to_sql(source.replace(".", "_"), translator.fetcher.engine, if_exists="replace")
2025-07-20T16:26:21.354 [rics.utility.misc.get_local_or_remote:INFO] Local processed file path: '/home/dev/.id-translation/notebooks/cache/clean_and_fix_ids/name.basics.tsv.pkl'.
2025-07-20T16:28:42.426 [rics.utility.misc.get_local_or_remote:INFO] Local processed file path: '/home/dev/.id-translation/notebooks/cache/clean_and_fix_ids/title.basics.tsv.pkl'.
Make some data to translate#
[4]:
import pandas as pd
engine = translator._fetcher._engine
def first_title(seed=5, n=1000):
df = pd.read_sql("SELECT * FROM name_basics;", engine).sample(n, random_state=seed)
df["firstTitle"] = df.knownForTitles.str.split(",").str[0]
return df[["nconst", "firstTitle"]]
[5]:
translator.go_offline()
2025-07-20T16:29:26.769 [id_translation.fetching:DEBUG] Metadata for 'postgresql+pg8000://postgres:***@localhost:5002/imdb' created in 38 ms.
2025-07-20T16:29:26.770 [id_translation.fetching:INFO] Finished initialization of 'SqlFetcher' in 39 ms: SqlFetcher('postgresql+pg8000://postgres:***@localhost:5002/imdb', whitelist=['name_basics', 'title_basics'])
2025-07-20T16:29:26.771 [id_translation.Translator:DEBUG] Begin going offline with 2 sources provided by: SqlFetcher('postgresql+pg8000://postgres:***@localhost:5002/imdb', whitelist=['name_basics', 'title_basics'])
2025-07-20T16:29:26.772 [id_translation.fetching:DEBUG] Begin fetching all IDs for placeholders=('id', 'name', 'original_name', 'from', 'to') for 2/2: ['name_basics', 'title_basics'].
2025-07-20T16:29:26.773 [id_translation.fetching:DEBUG] Begin mapping of wanted placeholders={'to', 'name', 'from', 'id'} to actual placeholders={'primaryName', 'index', 'knownForTitles', 'nconst', 'primaryProfession', 'birthYear', 'int_id_nconst', 'deathYear'} for source='name_basics'.
2025-07-20T16:29:26.777 [id_translation.fetching.map:DEBUG] Computed 4x8 match scores in context='name_basics' in 72 ÎĽs:
candidates primaryName index knownForTitles nconst primaryProfession birthYear int_id_nconst deathYear
values
to -inf -inf -inf -inf -inf -inf -inf inf
name inf -inf -inf -inf -inf -inf -inf -inf
from -inf -inf -inf -inf -inf inf -inf -inf
id -inf -inf -inf inf -inf -inf -inf -inf
2025-07-20T16:29:26.778 [id_translation.fetching:DEBUG] Finished placeholder mapping for source='name_basics': {'name': 'primaryName', 'from': 'birthYear', 'to': 'deathYear', 'id': 'nconst'}.
2025-07-20T16:29:26.779 [id_translation.fetching:DEBUG] Begin mapping of wanted placeholders={'to', 'name', 'from', 'id'} to actual placeholders={'tconst', 'primaryTitle', 'startYear', 'runtimeMinutes', 'index', 'endYear', 'isAdult', 'genres', 'titleType', 'originalTitle', 'int_id_tconst'} for source='title_basics'.
2025-07-20T16:29:26.781 [id_translation.fetching.map:DEBUG] Computed 4x11 match scores in context='title_basics' in 54 ÎĽs:
candidates tconst primaryTitle startYear runtimeMinutes index endYear isAdult genres titleType originalTitle int_id_tconst
values
to -inf -inf -inf -inf -inf inf -inf -inf -inf -inf -inf
name -inf inf -inf -inf -inf -inf -inf -inf -inf -inf -inf
from -inf -inf inf -inf -inf -inf -inf -inf -inf -inf -inf
id inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf
2025-07-20T16:29:26.782 [id_translation.fetching:DEBUG] Finished placeholder mapping for source='title_basics': {'name': 'primaryTitle', 'from': 'startYear', 'to': 'endYear', 'id': 'tconst'}.
2025-07-20T16:29:26.783 [id_translation.fetching:DEBUG] Begin mapping of wanted placeholders={'from', 'to', 'original_name', 'id', 'name'} to actual placeholders={'primaryName', 'index', 'knownForTitles', 'nconst', 'primaryProfession', 'birthYear', 'int_id_nconst', 'deathYear'} for source='name_basics'.
2025-07-20T16:29:26.785 [id_translation.fetching.map:DEBUG] Computed 5x8 match scores in context='name_basics' in 92 ÎĽs:
candidates primaryName index knownForTitles nconst primaryProfession birthYear int_id_nconst deathYear
values
from -inf -inf -inf -inf -inf inf -inf -inf
to -inf -inf -inf -inf -inf -inf -inf inf
original_name 0.18 0.02 -0.00 -0.00 0.01 -0.01 0.07 0.02
id -inf -inf -inf inf -inf -inf -inf -inf
name inf -inf -inf -inf -inf -inf -inf -inf
2025-07-20T16:29:26.786 [id_translation.fetching:DEBUG] Finished placeholder mapping for source='name_basics': {'to': 'deathYear', 'name': 'primaryName', 'from': 'birthYear', 'id': 'nconst', 'original_name': None}.
2025-07-20T16:29:26.786 [id_translation.fetching:DEBUG] Begin fetching all IDs from source='name_basics'. Placeholders: ('deathYear', 'birthYear', 'nconst', 'primaryName')
2025-07-20T16:29:27.623 [id_translation.fetching:DEBUG] Finished fetching 199200 IDs from source='name_basics' in 0.84 sec. Placeholders: ('deathYear', 'birthYear', 'nconst', 'primaryName').
2025-07-20T16:29:27.624 [id_translation.fetching:DEBUG] Begin mapping of wanted placeholders={'from', 'to', 'original_name', 'id', 'name'} to actual placeholders={'tconst', 'primaryTitle', 'startYear', 'runtimeMinutes', 'index', 'endYear', 'isAdult', 'genres', 'titleType', 'originalTitle', 'int_id_tconst'} for source='title_basics'.
2025-07-20T16:29:27.626 [id_translation.fetching.map:DEBUG] Computed 5x11 match scores in context='title_basics' in 44 ÎĽs:
candidates tconst primaryTitle startYear runtimeMinutes index endYear isAdult genres titleType originalTitle int_id_tconst
values
from -inf -inf inf -inf -inf -inf -inf -inf -inf -inf -inf
to -inf -inf -inf -inf -inf inf -inf -inf -inf -inf -inf
original_name -inf -inf -inf -inf -inf -inf -inf -inf -inf inf -inf
id inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf
name -inf inf -inf -inf -inf -inf -inf -inf -inf -inf -inf
2025-07-20T16:29:27.627 [id_translation.fetching:DEBUG] Finished placeholder mapping for source='title_basics': {'to': 'endYear', 'from': 'startYear', 'original_name': 'originalTitle', 'id': 'tconst', 'name': 'primaryTitle'}.
2025-07-20T16:29:27.628 [id_translation.fetching:DEBUG] Begin fetching all IDs from source='title_basics'. Placeholders: ('endYear', 'startYear', 'originalTitle', 'tconst', 'primaryTitle')
2025-07-20T16:29:27.935 [id_translation.fetching:DEBUG] Finished fetching 64264 IDs from source='title_basics' in 307 ms. Placeholders: ('endYear', 'startYear', 'originalTitle', 'tconst', 'primaryTitle').
2025-07-20T16:29:27.936 [id_translation.fetching:INFO] Finished fetching all IDs from 2/2 sources in 1.2 sec: ['name_basics' x ('to', 'from', 'id', 'name') x 199200 IDs], ['title_basics' x ('to', 'from', 'original_name', 'id', 'name') x 64264 IDs].
2025-07-20T16:29:27.937 [id_translation.fetching:DEBUG] Dispose 'postgresql+pg8000://postgres:***@localhost:5002/imdb'.
2025-07-20T16:29:27.938 [id_translation.Translator:INFO] Went offline with 2 sources in 1.2 sec: TranslationMap('name_basics': 199200 IDs, 'title_basics': 64264 IDs).
[5]:
Translator(online=False: cache=TranslationMap('name_basics': 199200 IDs, 'title_basics': 64264 IDs))
Get the name and the “first” appearance for actors#
In the IMDb list anyway. I have no idea how they’re ordered in “knownForTitles”.
[6]:
df = first_title()
df.head()
[6]:
| nconst | firstTitle | |
|---|---|---|
| 173580 | nm2552605 | tt35682577 |
| 179836 | nm3225652 | tt0045332 |
| 123452 | nm0891531 | tt0047376 |
| 22918 | nm0154105 | tt0203564 |
| 168112 | nm2121904 | tt0171952 |
Translate#
[7]:
translator.translate(df).head(5)
2025-07-20T16:29:29.510 [id_translation.dio:DEBUG] Using rank-0 (priority=1999) implementation 'id_translation.dio.integration.pandas.PandasIO' for translatable of type='pandas.DataFrame'.
2025-07-20T16:29:29.511 [id_translation.Translator:DEBUG] Begin translation of 'DataFrame'-type data. Names to translate: Derive based on type.
2025-07-20T16:29:29.512 [id_translation.Translator:DEBUG] Name extraction complete. Found names=['nconst', 'firstTitle'] for 'DataFrame'-type data.
2025-07-20T16:29:29.512 [id_translation.Translator.map:DEBUG] Begin name-to-source mapping of names=['nconst', 'firstTitle'] in 'DataFrame' against sources=['name_basics', 'title_basics'].
2025-07-20T16:29:29.514 [id_translation.Translator.map:DEBUG] Computed 2x2 match scores in context=None in 32 ÎĽs:
candidates name_basics title_basics
values
nconst inf -inf
firstTitle -inf inf
2025-07-20T16:29:29.514 [id_translation.Translator.map:INFO] Finished mapping of 2/2 names in 'DataFrame' in 2 ms: {'nconst': 'name_basics', 'firstTitle': 'title_basics'}.
2025-07-20T16:29:29.736 [id_translation.Translator:DEBUG] Failed to translate 874/1000 (87.40% <= max_fails=100.00%) of IDs for name='firstTitle' using source='title_basics'. Sample IDs: ['tt35682577', 'tt0045332', 'tt0047376', 'tt0203564', 'tt0171952', 'tt3972722', 'tt0078465', 'tt1116026', 'tt0203019', 'tt2993774'].
2025-07-20T16:29:29.972 [id_translation.Translator:INFO] Finished translation of 'DataFrame' in 461 ms.
[7]:
| nconst | firstTitle | |
|---|---|---|
| 173580 | nm2552605:Guy Corneau *1951†2016 | tt35682577 not translated; default name=Title unknown |
| 179836 | nm3225652:Aimee Alexander *1893†1979 | tt0045332 not translated; default name=Title unknown |
| 123452 | nm0891531:John Veale *1922†2006 | tt0047376 not translated; default name=Title unknown |
| 22918 | nm0154105:Arthur Chatterdon *1885†1947 | tt0203564 not translated; default name=Title unknown |
| 168112 | nm2121904:Martin Liska *1924†1998 | tt0171952 not translated; default name=Title unknown |
[ ]: