Fetching data using SqlFetcher#

Translating using a SQL database. This notebook assumes that the Prepare for ``SqlFetcher`` demo-step from the PickleFetcher demo notebook has been completed.

[1]:
import sys
import rics
import id_translation

# Print relevant versions
print(f"{rics.__version__=}")
print(f"{id_translation.__version__=}")
print(f"{sys.version=}")
rics.configure_stuff(rics_level="DEBUG", id_translation_level="DEBUG")
!git log --pretty=oneline --abbrev-commit -1
rics.__version__='3.0.0'
id_translation.__version__='0.3.1.dev1'
sys.version='3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]'
đź‘» Configured some stuff just the way I like it!
d2093a4 (HEAD, origin/main, origin/HEAD, main) Update help link in TOML files

Load database#

[2]:
import tomli

with open("config.toml", "rb") as f:
    connection_string = tomli.load(f)["fetching"]["SqlFetcher"]["connection_string"]
    connection_string = connection_string.format(password="your_password")
    print(f"{connection_string=}")
connection_string='postgresql+pg8000://postgres:your_password@localhost:5432/imdb'
[3]:
import sqlalchemy
import tomli
from data import load_imdb


engine = sqlalchemy.create_engine(connection_string)

for source in ["name.basics", "title.basics"]:
    df = load_imdb(source)[0]
    df.to_sql(source.replace(".", "_"), engine, if_exists="replace")
2023-03-25T11:35:34.942 [rics.utility.misc.get_local_or_remote:DEBUG] Local file path: '/home/dev/git/id-translation/jupyterlab/id-translation/data-cache/name.basics.tsv.gz'.
2023-03-25T11:35:34.943 [rics.utility.misc.get_local_or_remote:DEBUG] Remote file path: 'https://datasets.imdbws.com/name.basics.tsv.gz'.
2023-03-25T11:35:34.943 [rics.utility.misc.get_local_or_remote:INFO] Local processed file path: '/home/dev/git/id-translation/jupyterlab/id-translation/data-cache/clean_and_fix_ids/name.basics.tsv.pkl'.
2023-03-25T11:36:11.181 [rics.utility.misc.get_local_or_remote:DEBUG] Local file path: '/home/dev/git/id-translation/jupyterlab/id-translation/data-cache/title.basics.tsv.gz'.
2023-03-25T11:36:11.182 [rics.utility.misc.get_local_or_remote:DEBUG] Remote file path: 'https://datasets.imdbws.com/title.basics.tsv.gz'.
2023-03-25T11:36:11.183 [rics.utility.misc.get_local_or_remote:INFO] Local processed file path: '/home/dev/git/id-translation/jupyterlab/id-translation/data-cache/clean_and_fix_ids/title.basics.tsv.pkl'.

Create translator from config#

Click here to see the file.

[4]:
from id_translation import Translator

translator = Translator.from_config("config.toml")
translator
2023-03-25T11:36:21.816 [id_translation.fetching.config-toml.sql.discovery:DEBUG] Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb): Metadata created in 0.0342428 sec.
2023-03-25T11:36:21.817 [id_translation.fetching.config-toml:DEBUG] Begin wanted-to-actual placeholder mapping of placeholders={'id'} to actual placeholders={'isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'} for source='title_basics'.
2023-03-25T11:36:21.818 [id_translation.mapping.placeholders.config-toml:DEBUG] Begin computing match scores in context='title_basics' for ['id']x['isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'] using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2023-03-25T11:36:21.819 [id_translation.mapping.placeholders.config-toml:DEBUG] All values mapped by overrides. Applied 2 overrides, and found 1 matches={'id': 'tconst'} in the given values=['id'].
2023-03-25T11:36:21.821 [id_translation.fetching.config-toml:DEBUG] Finished wanted-to-actual placeholder mapping of placeholders={'id'} to actual placeholders={'isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'} for source='title_basics': {'id': ('tconst',)}.
2023-03-25T11:36:21.829 [id_translation.fetching.config-toml.sql:DEBUG] Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb): Counted 48979 rows of table 'title_basics' in 0.00790173 sec.
2023-03-25T11:36:21.830 [id_translation.fetching.config-toml:DEBUG] Begin wanted-to-actual placeholder mapping of placeholders={'id'} to actual placeholders={'int_id_nconst', 'nconst', 'knownForTitles', 'primaryName', 'deathYear', 'primaryProfession', 'index', 'birthYear'} for source='name_basics'.
2023-03-25T11:36:21.830 [id_translation.mapping.placeholders.config-toml:DEBUG] Begin computing match scores in context='name_basics' for ['id']x['int_id_nconst', 'nconst', 'knownForTitles', 'primaryName', 'deathYear', 'primaryProfession', 'index', 'birthYear'] using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2023-03-25T11:36:21.831 [id_translation.mapping.placeholders.config-toml:DEBUG] All values mapped by overrides. Applied 2 overrides, and found 1 matches={'id': 'nconst'} in the given values=['id'].
2023-03-25T11:36:21.832 [id_translation.fetching.config-toml:DEBUG] Finished wanted-to-actual placeholder mapping of placeholders={'id'} to actual placeholders={'int_id_nconst', 'nconst', 'knownForTitles', 'primaryName', 'deathYear', 'primaryProfession', 'index', 'birthYear'} for source='name_basics': {'id': ('nconst',)}.
2023-03-25T11:36:21.843 [id_translation.fetching.config-toml.sql:DEBUG] Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb): Counted 172326 rows of table 'name_basics' in 0.00998208 sec.
[4]:
Translator(online=True: fetcher=SqlFetcher(Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb), tables=['title_basics', 'name_basics']))

Make some data to translate#

[5]:
import pandas as pd

engine = translator._fetcher._engine


def first_title(seed=None, n=1000):
    df = pd.read_sql("SELECT * FROM name_basics;", engine).sample(n, random_state=seed)
    df["firstTitle"] = df.knownForTitles.str.split(",").str[0]
    return df[["nconst", "firstTitle"]]
[6]:
translator.store().cache
2023-03-25T11:36:21.853 [id_translation.fetching.config-toml:DEBUG] Begin wanted-to-actual placeholder mapping of placeholders={'to', 'name', 'from'} to actual placeholders={'isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'} for source='title_basics'.
2023-03-25T11:36:21.854 [id_translation.mapping.placeholders.config-toml:DEBUG] Begin computing match scores in context='title_basics' for ['to', 'name', 'from']x['isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'] using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2023-03-25T11:36:21.855 [id_translation.mapping.placeholders.config-toml:DEBUG] All values mapped by overrides. Applied 2 overrides, and found 3 matches={'to': 'endYear', 'name': 'primaryTitle', 'from': 'startYear'} in the given values=['to', 'name', 'from'].
2023-03-25T11:36:21.857 [id_translation.fetching.config-toml:DEBUG] Finished wanted-to-actual placeholder mapping of placeholders={'to', 'name', 'from'} to actual placeholders={'isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'} for source='title_basics': {'to': ('endYear',), 'name': ('primaryTitle',), 'from': ('startYear',)}.
2023-03-25T11:36:21.858 [id_translation.fetching.config-toml:DEBUG] Begin wanted-to-actual placeholder mapping of placeholders={'to', 'name', 'from'} to actual placeholders={'int_id_nconst', 'nconst', 'knownForTitles', 'primaryName', 'deathYear', 'primaryProfession', 'index', 'birthYear'} for source='name_basics'.
2023-03-25T11:36:21.859 [id_translation.mapping.placeholders.config-toml:DEBUG] Begin computing match scores in context='name_basics' for ['to', 'name', 'from']x['int_id_nconst', 'nconst', 'knownForTitles', 'primaryName', 'deathYear', 'primaryProfession', 'index', 'birthYear'] using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2023-03-25T11:36:21.859 [id_translation.mapping.placeholders.config-toml:DEBUG] All values mapped by overrides. Applied 2 overrides, and found 3 matches={'to': 'deathYear', 'name': 'primaryName', 'from': 'birthYear'} in the given values=['to', 'name', 'from'].
2023-03-25T11:36:21.861 [id_translation.fetching.config-toml:DEBUG] Finished wanted-to-actual placeholder mapping of placeholders={'to', 'name', 'from'} to actual placeholders={'int_id_nconst', 'nconst', 'knownForTitles', 'primaryName', 'deathYear', 'primaryProfession', 'index', 'birthYear'} for source='name_basics': {'to': ('deathYear',), 'name': ('primaryName',), 'from': ('birthYear',)}.
2023-03-25T11:36:21.861 [id_translation.fetching.config-toml:DEBUG] Begin wanted-to-actual placeholder mapping of placeholders={'original_name'} to actual placeholders={'isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'} for source='title_basics'.
2023-03-25T11:36:21.862 [id_translation.mapping.placeholders.config-toml:DEBUG] Begin computing match scores in context='title_basics' for ['original_name']x['isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'] using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2023-03-25T11:36:21.863 [id_translation.mapping.placeholders.config-toml:DEBUG] All values mapped by overrides. Applied 2 overrides, and found 1 matches={'original_name': 'originalTitle'} in the given values=['original_name'].
2023-03-25T11:36:21.864 [id_translation.fetching.config-toml:DEBUG] Finished wanted-to-actual placeholder mapping of placeholders={'original_name'} to actual placeholders={'isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'} for source='title_basics': {'original_name': ('originalTitle',)}.
2023-03-25T11:36:21.864 [id_translation.fetching.config-toml:DEBUG] Begin fetching placeholders=('id', 'name', 'original_name', 'from', 'to') from source='title_basics' for all IDs.
2023-03-25T11:36:22.068 [id_translation.fetching.config-toml:DEBUG] Finished fetching placeholders=('index', 'tconst', 'titleType', 'primaryTitle', 'originalTitle', 'isAdult', 'startYear', 'endYear', 'runtimeMinutes', 'genres', 'int_id_tconst') for 48979 IDs from source 'title_basics' in 0.203781 sec using SqlFetcher(Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb), tables=['title_basics', 'name_basics']).
2023-03-25T11:36:22.069 [id_translation.fetching.config-toml:DEBUG] Begin wanted-to-actual placeholder mapping of placeholders={'original_name'} to actual placeholders={'int_id_nconst', 'nconst', 'knownForTitles', 'primaryName', 'deathYear', 'primaryProfession', 'index', 'birthYear'} for source='name_basics'.
2023-03-25T11:36:22.069 [id_translation.mapping.placeholders.config-toml:DEBUG] Begin computing match scores in context='name_basics' for ['original_name']x['int_id_nconst', 'nconst', 'knownForTitles', 'primaryName', 'deathYear', 'primaryProfession', 'index', 'birthYear'] using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2023-03-25T11:36:22.070 [id_translation.mapping.placeholders.config-toml:DEBUG] Applied 2 overrides, but none were a match for the given values=['original_name'].
2023-03-25T11:36:22.071 [id_translation.fetching.config-toml:DEBUG] Finished wanted-to-actual placeholder mapping of placeholders={'original_name'} to actual placeholders={'int_id_nconst', 'nconst', 'knownForTitles', 'primaryName', 'deathYear', 'primaryProfession', 'index', 'birthYear'} for source='name_basics': {}.
2023-03-25T11:36:22.072 [id_translation.fetching.config-toml:DEBUG] Begin fetching placeholders=('id', 'name', 'original_name', 'from', 'to') from source='name_basics' for all IDs.
2023-03-25T11:36:22.749 [id_translation.fetching.config-toml:DEBUG] Finished fetching placeholders=('index', 'nconst', 'primaryName', 'birthYear', 'deathYear', 'primaryProfession', 'knownForTitles', 'int_id_nconst') for 172326 IDs from source 'name_basics' in 0.677103 sec using SqlFetcher(Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb), tables=['title_basics', 'name_basics']).
2023-03-25T11:36:22.750 [id_translation.fetching.config-toml.sql:DEBUG] Dispose Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb)
2023-03-25T11:36:22.750 [id_translation.Translator:INFO] Created Translator(online=False: cache=TranslationMap('name_basics': 172326 IDs, 'title_basics': 48979 IDs)) in 0.897753 sec.
[6]:
TranslationMap('name_basics': 172326 IDs, 'title_basics': 48979 IDs)

Get the name and the “first” appearance for actors#

In the IMDb list anyway. I have no idea how they’re ordered in “knownForTitles”.

[7]:
df = first_title(seed=5)
df.head()
[7]:
nconst firstTitle
102407 nm0807090 tt0063794
105345 nm0831760 tt0019348
1893 nm0008280 tt0012541
107276 nm0845966 tt0066501
170646 nm8380982 tt5990574

Translate#

[8]:
translator.translate(df).head(5)
2023-03-25T11:36:23.814 [id_translation.Translator:DEBUG] Begin translation of 'DataFrame' using sources=['title_basics', 'name_basics']. Names to translate: Will be derived based on 'DataFrame'.
2023-03-25T11:36:23.815 [id_translation.Translator:DEBUG] Begin name-to-source mapping of names=['nconst', 'firstTitle'] in DataFrame against sources=['title_basics', 'name_basics'].
2023-03-25T11:36:23.816 [id_translation.mapping.name-to-source:DEBUG] Begin computing match scores for ['nconst', 'firstTitle']x['title_basics', 'name_basics'] using HeuristicScore([like_database_table()] -> modified_hamming).
2023-03-25T11:36:23.817 [id_translation.mapping.name-to-source:DEBUG] All values mapped by overrides. Applied 3 overrides, and found 2 matches={'nconst': 'name_basics', 'firstTitle': 'title_basics'} in the given values=['nconst', 'firstTitle'].
2023-03-25T11:36:23.818 [id_translation.Translator:DEBUG] Finished name-to-source mapping of names=['nconst', 'firstTitle'] in DataFrame against sources=['title_basics', 'name_basics']: {'nconst': ('name_basics',), 'firstTitle': ('title_basics',)}.
2023-03-25T11:36:23.883 [id_translation.Translator:DEBUG] Failed to translate 884/1000 (88.400%) of IDs for name='firstTitle' using source='title_basics'.
2023-03-25T11:36:23.883 [id_translation.Translator:DEBUG] Failed to translate 884/2000 (44.200%) of IDs extracted from 2 different names.
2023-03-25T11:36:24.018 [id_translation.Translator:DEBUG] Finished translation of 'DataFrame' in 0.203546 sec. Returning a translated copy since inplace=False.
[8]:
nconst firstTitle
102407 nm0807090:Zoya Smirnova-Nemirovich *1909†1986 tt0063794 not translated; default name=Title unknown
105345 nm0831760:Billy Stone *1884†1931 tt0019348 not translated; default name=Title unknown
1893 nm0008280:Achmed Abdullah *1881†1945 tt0012541 not translated; default name=Title unknown
107276 nm0845966:André Tabet *1902†1981 tt0066501 not translated; default name=Title unknown
170646 nm8380982:Lee Botts *1928†2019 tt5990574 not translated; default name=Title unknown
[9]:
translator.translate(df, inplace=True)  # returns None
df.head(5)
2023-03-25T11:36:24.024 [id_translation.Translator:DEBUG] Begin translation of 'DataFrame' using sources=['title_basics', 'name_basics']. Names to translate: Will be derived based on 'DataFrame'.
2023-03-25T11:36:24.026 [id_translation.Translator:DEBUG] Begin name-to-source mapping of names=['nconst', 'firstTitle'] in DataFrame against sources=['title_basics', 'name_basics'].
2023-03-25T11:36:24.027 [id_translation.mapping.name-to-source:DEBUG] Begin computing match scores for ['nconst', 'firstTitle']x['title_basics', 'name_basics'] using HeuristicScore([like_database_table()] -> modified_hamming).
2023-03-25T11:36:24.029 [id_translation.mapping.name-to-source:DEBUG] All values mapped by overrides. Applied 3 overrides, and found 2 matches={'nconst': 'name_basics', 'firstTitle': 'title_basics'} in the given values=['nconst', 'firstTitle'].
2023-03-25T11:36:24.030 [id_translation.Translator:DEBUG] Finished name-to-source mapping of names=['nconst', 'firstTitle'] in DataFrame against sources=['title_basics', 'name_basics']: {'nconst': ('name_basics',), 'firstTitle': ('title_basics',)}.
2023-03-25T11:36:24.093 [id_translation.Translator:DEBUG] Failed to translate 884/1000 (88.400%) of IDs for name='firstTitle' using source='title_basics'.
2023-03-25T11:36:24.094 [id_translation.Translator:DEBUG] Failed to translate 884/2000 (44.200%) of IDs extracted from 2 different names.
2023-03-25T11:36:24.231 [id_translation.Translator:DEBUG] Finished translation of 'DataFrame' in 0.206744 sec. Values in 'DataFrame' have been replaced  since inplace=True.
[9]:
nconst firstTitle
102407 nm0807090:Zoya Smirnova-Nemirovich *1909†1986 tt0063794 not translated; default name=Title unknown
105345 nm0831760:Billy Stone *1884†1931 tt0019348 not translated; default name=Title unknown
1893 nm0008280:Achmed Abdullah *1881†1945 tt0012541 not translated; default name=Title unknown
107276 nm0845966:André Tabet *1902†1981 tt0066501 not translated; default name=Title unknown
170646 nm8380982:Lee Botts *1928†2019 tt5990574 not translated; default name=Title unknown
[ ]: