Fetching data using SqlFetcher#

Translating using a SQL database. This notebook assumes that the Prepare for ``SqlFetcher`` demo-step from the PickleFetcher demo notebook has been completed.

[1]:
import sys

import rics

import id_translation

# Print relevant versions
print(f"{id_translation.__version__=}")
print(f"{sys.version=}")
rics.configure_stuff(id_translation_level="DEBUG")
id_translation.__version__='0.15.0.dev1'
sys.version='3.11.13 (main, Jun  4 2025, 08:57:30) [GCC 13.3.0]'
đź‘» Configured some stuff just the way I like it!

Create translator from config#

Click here to see the file.

[2]:
from id_translation import Translator

translator = Translator.from_config("config.toml")
translator
[2]:
Translator(online=True: fetcher=SqlFetcher('postgresql+pg8000://postgres:***@localhost:5002/imdb', whitelist=['name_basics', 'title_basics']))

Load database#

Using the SqlFetcher engine.

[3]:
from data import load_imdb


for source in ["name.basics", "title.basics"]:
    df = load_imdb(source)[0]
    df.to_sql(source.replace(".", "_"), translator.fetcher.engine, if_exists="replace")
2025-07-20T16:26:21.354 [rics.utility.misc.get_local_or_remote:INFO] Local processed file path: '/home/dev/.id-translation/notebooks/cache/clean_and_fix_ids/name.basics.tsv.pkl'.
2025-07-20T16:28:42.426 [rics.utility.misc.get_local_or_remote:INFO] Local processed file path: '/home/dev/.id-translation/notebooks/cache/clean_and_fix_ids/title.basics.tsv.pkl'.

Make some data to translate#

[4]:
import pandas as pd

engine = translator._fetcher._engine


def first_title(seed=5, n=1000):
    df = pd.read_sql("SELECT * FROM name_basics;", engine).sample(n, random_state=seed)
    df["firstTitle"] = df.knownForTitles.str.split(",").str[0]
    return df[["nconst", "firstTitle"]]
[5]:
translator.go_offline()
2025-07-20T16:29:26.769 [id_translation.fetching:DEBUG] Metadata for 'postgresql+pg8000://postgres:***@localhost:5002/imdb' created in 38 ms.
2025-07-20T16:29:26.770 [id_translation.fetching:INFO] Finished initialization of 'SqlFetcher' in 39 ms: SqlFetcher('postgresql+pg8000://postgres:***@localhost:5002/imdb', whitelist=['name_basics', 'title_basics'])
2025-07-20T16:29:26.771 [id_translation.Translator:DEBUG] Begin going offline with 2 sources provided by: SqlFetcher('postgresql+pg8000://postgres:***@localhost:5002/imdb', whitelist=['name_basics', 'title_basics'])
2025-07-20T16:29:26.772 [id_translation.fetching:DEBUG] Begin fetching all IDs for placeholders=('id', 'name', 'original_name', 'from', 'to') for 2/2: ['name_basics', 'title_basics'].
2025-07-20T16:29:26.773 [id_translation.fetching:DEBUG] Begin mapping of wanted placeholders={'to', 'name', 'from', 'id'} to actual placeholders={'primaryName', 'index', 'knownForTitles', 'nconst', 'primaryProfession', 'birthYear', 'int_id_nconst', 'deathYear'} for source='name_basics'.
2025-07-20T16:29:26.777 [id_translation.fetching.map:DEBUG] Computed 4x8 match scores in context='name_basics' in 72 ÎĽs:
candidates  primaryName  index  knownForTitles  nconst  primaryProfession  birthYear  int_id_nconst  deathYear
values
to                 -inf   -inf            -inf    -inf               -inf       -inf           -inf        inf
name                inf   -inf            -inf    -inf               -inf       -inf           -inf       -inf
from               -inf   -inf            -inf    -inf               -inf        inf           -inf       -inf
id                 -inf   -inf            -inf     inf               -inf       -inf           -inf       -inf
2025-07-20T16:29:26.778 [id_translation.fetching:DEBUG] Finished placeholder mapping for source='name_basics': {'name': 'primaryName', 'from': 'birthYear', 'to': 'deathYear', 'id': 'nconst'}.
2025-07-20T16:29:26.779 [id_translation.fetching:DEBUG] Begin mapping of wanted placeholders={'to', 'name', 'from', 'id'} to actual placeholders={'tconst', 'primaryTitle', 'startYear', 'runtimeMinutes', 'index', 'endYear', 'isAdult', 'genres', 'titleType', 'originalTitle', 'int_id_tconst'} for source='title_basics'.
2025-07-20T16:29:26.781 [id_translation.fetching.map:DEBUG] Computed 4x11 match scores in context='title_basics' in 54 ÎĽs:
candidates  tconst  primaryTitle  startYear  runtimeMinutes  index  endYear  isAdult  genres  titleType  originalTitle  int_id_tconst
values
to            -inf          -inf       -inf            -inf   -inf      inf     -inf    -inf       -inf           -inf           -inf
name          -inf           inf       -inf            -inf   -inf     -inf     -inf    -inf       -inf           -inf           -inf
from          -inf          -inf        inf            -inf   -inf     -inf     -inf    -inf       -inf           -inf           -inf
id             inf          -inf       -inf            -inf   -inf     -inf     -inf    -inf       -inf           -inf           -inf
2025-07-20T16:29:26.782 [id_translation.fetching:DEBUG] Finished placeholder mapping for source='title_basics': {'name': 'primaryTitle', 'from': 'startYear', 'to': 'endYear', 'id': 'tconst'}.
2025-07-20T16:29:26.783 [id_translation.fetching:DEBUG] Begin mapping of wanted placeholders={'from', 'to', 'original_name', 'id', 'name'} to actual placeholders={'primaryName', 'index', 'knownForTitles', 'nconst', 'primaryProfession', 'birthYear', 'int_id_nconst', 'deathYear'} for source='name_basics'.
2025-07-20T16:29:26.785 [id_translation.fetching.map:DEBUG] Computed 5x8 match scores in context='name_basics' in 92 ÎĽs:
candidates     primaryName  index  knownForTitles  nconst  primaryProfession  birthYear  int_id_nconst  deathYear
values
from                  -inf   -inf            -inf    -inf               -inf        inf           -inf       -inf
to                    -inf   -inf            -inf    -inf               -inf       -inf           -inf        inf
original_name         0.18   0.02           -0.00   -0.00               0.01      -0.01           0.07       0.02
id                    -inf   -inf            -inf     inf               -inf       -inf           -inf       -inf
name                   inf   -inf            -inf    -inf               -inf       -inf           -inf       -inf
2025-07-20T16:29:26.786 [id_translation.fetching:DEBUG] Finished placeholder mapping for source='name_basics': {'to': 'deathYear', 'name': 'primaryName', 'from': 'birthYear', 'id': 'nconst', 'original_name': None}.
2025-07-20T16:29:26.786 [id_translation.fetching:DEBUG] Begin fetching all IDs from source='name_basics'. Placeholders: ('deathYear', 'birthYear', 'nconst', 'primaryName')
2025-07-20T16:29:27.623 [id_translation.fetching:DEBUG] Finished fetching 199200 IDs from source='name_basics' in 0.84 sec. Placeholders: ('deathYear', 'birthYear', 'nconst', 'primaryName').
2025-07-20T16:29:27.624 [id_translation.fetching:DEBUG] Begin mapping of wanted placeholders={'from', 'to', 'original_name', 'id', 'name'} to actual placeholders={'tconst', 'primaryTitle', 'startYear', 'runtimeMinutes', 'index', 'endYear', 'isAdult', 'genres', 'titleType', 'originalTitle', 'int_id_tconst'} for source='title_basics'.
2025-07-20T16:29:27.626 [id_translation.fetching.map:DEBUG] Computed 5x11 match scores in context='title_basics' in 44 ÎĽs:
candidates     tconst  primaryTitle  startYear  runtimeMinutes  index  endYear  isAdult  genres  titleType  originalTitle  int_id_tconst
values
from             -inf          -inf        inf            -inf   -inf     -inf     -inf    -inf       -inf           -inf           -inf
to               -inf          -inf       -inf            -inf   -inf      inf     -inf    -inf       -inf           -inf           -inf
original_name    -inf          -inf       -inf            -inf   -inf     -inf     -inf    -inf       -inf            inf           -inf
id                inf          -inf       -inf            -inf   -inf     -inf     -inf    -inf       -inf           -inf           -inf
name             -inf           inf       -inf            -inf   -inf     -inf     -inf    -inf       -inf           -inf           -inf
2025-07-20T16:29:27.627 [id_translation.fetching:DEBUG] Finished placeholder mapping for source='title_basics': {'to': 'endYear', 'from': 'startYear', 'original_name': 'originalTitle', 'id': 'tconst', 'name': 'primaryTitle'}.
2025-07-20T16:29:27.628 [id_translation.fetching:DEBUG] Begin fetching all IDs from source='title_basics'. Placeholders: ('endYear', 'startYear', 'originalTitle', 'tconst', 'primaryTitle')
2025-07-20T16:29:27.935 [id_translation.fetching:DEBUG] Finished fetching 64264 IDs from source='title_basics' in 307 ms. Placeholders: ('endYear', 'startYear', 'originalTitle', 'tconst', 'primaryTitle').
2025-07-20T16:29:27.936 [id_translation.fetching:INFO] Finished fetching all IDs from 2/2 sources in 1.2 sec: ['name_basics' x ('to', 'from', 'id', 'name') x 199200 IDs], ['title_basics' x ('to', 'from', 'original_name', 'id', 'name') x 64264 IDs].
2025-07-20T16:29:27.937 [id_translation.fetching:DEBUG] Dispose 'postgresql+pg8000://postgres:***@localhost:5002/imdb'.
2025-07-20T16:29:27.938 [id_translation.Translator:INFO] Went offline with 2 sources in 1.2 sec: TranslationMap('name_basics': 199200 IDs, 'title_basics': 64264 IDs).
[5]:
Translator(online=False: cache=TranslationMap('name_basics': 199200 IDs, 'title_basics': 64264 IDs))

Get the name and the “first” appearance for actors#

In the IMDb list anyway. I have no idea how they’re ordered in “knownForTitles”.

[6]:
df = first_title()
df.head()
[6]:
nconst firstTitle
173580 nm2552605 tt35682577
179836 nm3225652 tt0045332
123452 nm0891531 tt0047376
22918 nm0154105 tt0203564
168112 nm2121904 tt0171952

Translate#

[7]:
translator.translate(df).head(5)
2025-07-20T16:29:29.510 [id_translation.dio:DEBUG] Using rank-0 (priority=1999) implementation 'id_translation.dio.integration.pandas.PandasIO' for translatable of type='pandas.DataFrame'.
2025-07-20T16:29:29.511 [id_translation.Translator:DEBUG] Begin translation of 'DataFrame'-type data. Names to translate: Derive based on type.
2025-07-20T16:29:29.512 [id_translation.Translator:DEBUG] Name extraction complete. Found names=['nconst', 'firstTitle'] for 'DataFrame'-type data.
2025-07-20T16:29:29.512 [id_translation.Translator.map:DEBUG] Begin name-to-source mapping of names=['nconst', 'firstTitle'] in 'DataFrame' against sources=['name_basics', 'title_basics'].
2025-07-20T16:29:29.514 [id_translation.Translator.map:DEBUG] Computed 2x2 match scores in context=None in 32 ÎĽs:
candidates  name_basics  title_basics
values
nconst              inf          -inf
firstTitle         -inf           inf
2025-07-20T16:29:29.514 [id_translation.Translator.map:INFO] Finished mapping of 2/2 names in 'DataFrame' in 2 ms: {'nconst': 'name_basics', 'firstTitle': 'title_basics'}.
2025-07-20T16:29:29.736 [id_translation.Translator:DEBUG] Failed to translate 874/1000 (87.40% <= max_fails=100.00%) of IDs for name='firstTitle' using source='title_basics'. Sample IDs: ['tt35682577', 'tt0045332', 'tt0047376', 'tt0203564', 'tt0171952', 'tt3972722', 'tt0078465', 'tt1116026', 'tt0203019', 'tt2993774'].
2025-07-20T16:29:29.972 [id_translation.Translator:INFO] Finished translation of 'DataFrame' in 461 ms.
[7]:
nconst firstTitle
173580 nm2552605:Guy Corneau *1951†2016 tt35682577 not translated; default name=Title unknown
179836 nm3225652:Aimee Alexander *1893†1979 tt0045332 not translated; default name=Title unknown
123452 nm0891531:John Veale *1922†2006 tt0047376 not translated; default name=Title unknown
22918 nm0154105:Arthur Chatterdon *1885†1947 tt0203564 not translated; default name=Title unknown
168112 nm2121904:Martin Liska *1924†1998 tt0171952 not translated; default name=Title unknown
[ ]: