Fetching data using SqlFetcher#
Translating using a SQL database. This notebook assumes that the Prepare for ``SqlFetcher`` demo-step from the PickleFetcher demo notebook has been completed.
[1]:
import sys
import rics
import id_translation
# Print relevant versions
print(f"{rics.__version__=}")
print(f"{id_translation.__version__=}")
print(f"{sys.version=}")
rics.configure_stuff(rics_level="DEBUG", id_translation_level="DEBUG")
!git log --pretty=oneline --abbrev-commit -1
rics.__version__='3.0.0'
id_translation.__version__='0.3.1.dev1'
sys.version='3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]'
đź‘» Configured some stuff just the way I like it!
d2093a4 (HEAD, origin/main, origin/HEAD, main) Update help link in TOML files
Load database#
[2]:
import tomli
with open("config.toml", "rb") as f:
connection_string = tomli.load(f)["fetching"]["SqlFetcher"]["connection_string"]
connection_string = connection_string.format(password="your_password")
print(f"{connection_string=}")
connection_string='postgresql+pg8000://postgres:your_password@localhost:5432/imdb'
[3]:
import sqlalchemy
import tomli
from data import load_imdb
engine = sqlalchemy.create_engine(connection_string)
for source in ["name.basics", "title.basics"]:
df = load_imdb(source)[0]
df.to_sql(source.replace(".", "_"), engine, if_exists="replace")
2023-03-25T11:35:34.942 [rics.utility.misc.get_local_or_remote:DEBUG] Local file path: '/home/dev/git/id-translation/jupyterlab/id-translation/data-cache/name.basics.tsv.gz'.
2023-03-25T11:35:34.943 [rics.utility.misc.get_local_or_remote:DEBUG] Remote file path: 'https://datasets.imdbws.com/name.basics.tsv.gz'.
2023-03-25T11:35:34.943 [rics.utility.misc.get_local_or_remote:INFO] Local processed file path: '/home/dev/git/id-translation/jupyterlab/id-translation/data-cache/clean_and_fix_ids/name.basics.tsv.pkl'.
2023-03-25T11:36:11.181 [rics.utility.misc.get_local_or_remote:DEBUG] Local file path: '/home/dev/git/id-translation/jupyterlab/id-translation/data-cache/title.basics.tsv.gz'.
2023-03-25T11:36:11.182 [rics.utility.misc.get_local_or_remote:DEBUG] Remote file path: 'https://datasets.imdbws.com/title.basics.tsv.gz'.
2023-03-25T11:36:11.183 [rics.utility.misc.get_local_or_remote:INFO] Local processed file path: '/home/dev/git/id-translation/jupyterlab/id-translation/data-cache/clean_and_fix_ids/title.basics.tsv.pkl'.
Create translator from config#
Click here to see the file.
[4]:
from id_translation import Translator
translator = Translator.from_config("config.toml")
translator
2023-03-25T11:36:21.816 [id_translation.fetching.config-toml.sql.discovery:DEBUG] Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb): Metadata created in 0.0342428 sec.
2023-03-25T11:36:21.817 [id_translation.fetching.config-toml:DEBUG] Begin wanted-to-actual placeholder mapping of placeholders={'id'} to actual placeholders={'isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'} for source='title_basics'.
2023-03-25T11:36:21.818 [id_translation.mapping.placeholders.config-toml:DEBUG] Begin computing match scores in context='title_basics' for ['id']x['isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'] using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2023-03-25T11:36:21.819 [id_translation.mapping.placeholders.config-toml:DEBUG] All values mapped by overrides. Applied 2 overrides, and found 1 matches={'id': 'tconst'} in the given values=['id'].
2023-03-25T11:36:21.821 [id_translation.fetching.config-toml:DEBUG] Finished wanted-to-actual placeholder mapping of placeholders={'id'} to actual placeholders={'isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'} for source='title_basics': {'id': ('tconst',)}.
2023-03-25T11:36:21.829 [id_translation.fetching.config-toml.sql:DEBUG] Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb): Counted 48979 rows of table 'title_basics' in 0.00790173 sec.
2023-03-25T11:36:21.830 [id_translation.fetching.config-toml:DEBUG] Begin wanted-to-actual placeholder mapping of placeholders={'id'} to actual placeholders={'int_id_nconst', 'nconst', 'knownForTitles', 'primaryName', 'deathYear', 'primaryProfession', 'index', 'birthYear'} for source='name_basics'.
2023-03-25T11:36:21.830 [id_translation.mapping.placeholders.config-toml:DEBUG] Begin computing match scores in context='name_basics' for ['id']x['int_id_nconst', 'nconst', 'knownForTitles', 'primaryName', 'deathYear', 'primaryProfession', 'index', 'birthYear'] using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2023-03-25T11:36:21.831 [id_translation.mapping.placeholders.config-toml:DEBUG] All values mapped by overrides. Applied 2 overrides, and found 1 matches={'id': 'nconst'} in the given values=['id'].
2023-03-25T11:36:21.832 [id_translation.fetching.config-toml:DEBUG] Finished wanted-to-actual placeholder mapping of placeholders={'id'} to actual placeholders={'int_id_nconst', 'nconst', 'knownForTitles', 'primaryName', 'deathYear', 'primaryProfession', 'index', 'birthYear'} for source='name_basics': {'id': ('nconst',)}.
2023-03-25T11:36:21.843 [id_translation.fetching.config-toml.sql:DEBUG] Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb): Counted 172326 rows of table 'name_basics' in 0.00998208 sec.
[4]:
Translator(online=True: fetcher=SqlFetcher(Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb), tables=['title_basics', 'name_basics']))
Make some data to translate#
[5]:
import pandas as pd
engine = translator._fetcher._engine
def first_title(seed=None, n=1000):
df = pd.read_sql("SELECT * FROM name_basics;", engine).sample(n, random_state=seed)
df["firstTitle"] = df.knownForTitles.str.split(",").str[0]
return df[["nconst", "firstTitle"]]
[6]:
translator.store().cache
2023-03-25T11:36:21.853 [id_translation.fetching.config-toml:DEBUG] Begin wanted-to-actual placeholder mapping of placeholders={'to', 'name', 'from'} to actual placeholders={'isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'} for source='title_basics'.
2023-03-25T11:36:21.854 [id_translation.mapping.placeholders.config-toml:DEBUG] Begin computing match scores in context='title_basics' for ['to', 'name', 'from']x['isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'] using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2023-03-25T11:36:21.855 [id_translation.mapping.placeholders.config-toml:DEBUG] All values mapped by overrides. Applied 2 overrides, and found 3 matches={'to': 'endYear', 'name': 'primaryTitle', 'from': 'startYear'} in the given values=['to', 'name', 'from'].
2023-03-25T11:36:21.857 [id_translation.fetching.config-toml:DEBUG] Finished wanted-to-actual placeholder mapping of placeholders={'to', 'name', 'from'} to actual placeholders={'isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'} for source='title_basics': {'to': ('endYear',), 'name': ('primaryTitle',), 'from': ('startYear',)}.
2023-03-25T11:36:21.858 [id_translation.fetching.config-toml:DEBUG] Begin wanted-to-actual placeholder mapping of placeholders={'to', 'name', 'from'} to actual placeholders={'int_id_nconst', 'nconst', 'knownForTitles', 'primaryName', 'deathYear', 'primaryProfession', 'index', 'birthYear'} for source='name_basics'.
2023-03-25T11:36:21.859 [id_translation.mapping.placeholders.config-toml:DEBUG] Begin computing match scores in context='name_basics' for ['to', 'name', 'from']x['int_id_nconst', 'nconst', 'knownForTitles', 'primaryName', 'deathYear', 'primaryProfession', 'index', 'birthYear'] using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2023-03-25T11:36:21.859 [id_translation.mapping.placeholders.config-toml:DEBUG] All values mapped by overrides. Applied 2 overrides, and found 3 matches={'to': 'deathYear', 'name': 'primaryName', 'from': 'birthYear'} in the given values=['to', 'name', 'from'].
2023-03-25T11:36:21.861 [id_translation.fetching.config-toml:DEBUG] Finished wanted-to-actual placeholder mapping of placeholders={'to', 'name', 'from'} to actual placeholders={'int_id_nconst', 'nconst', 'knownForTitles', 'primaryName', 'deathYear', 'primaryProfession', 'index', 'birthYear'} for source='name_basics': {'to': ('deathYear',), 'name': ('primaryName',), 'from': ('birthYear',)}.
2023-03-25T11:36:21.861 [id_translation.fetching.config-toml:DEBUG] Begin wanted-to-actual placeholder mapping of placeholders={'original_name'} to actual placeholders={'isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'} for source='title_basics'.
2023-03-25T11:36:21.862 [id_translation.mapping.placeholders.config-toml:DEBUG] Begin computing match scores in context='title_basics' for ['original_name']x['isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'] using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2023-03-25T11:36:21.863 [id_translation.mapping.placeholders.config-toml:DEBUG] All values mapped by overrides. Applied 2 overrides, and found 1 matches={'original_name': 'originalTitle'} in the given values=['original_name'].
2023-03-25T11:36:21.864 [id_translation.fetching.config-toml:DEBUG] Finished wanted-to-actual placeholder mapping of placeholders={'original_name'} to actual placeholders={'isAdult', 'primaryTitle', 'int_id_tconst', 'titleType', 'originalTitle', 'startYear', 'endYear', 'runtimeMinutes', 'tconst', 'index', 'genres'} for source='title_basics': {'original_name': ('originalTitle',)}.
2023-03-25T11:36:21.864 [id_translation.fetching.config-toml:DEBUG] Begin fetching placeholders=('id', 'name', 'original_name', 'from', 'to') from source='title_basics' for all IDs.
2023-03-25T11:36:22.068 [id_translation.fetching.config-toml:DEBUG] Finished fetching placeholders=('index', 'tconst', 'titleType', 'primaryTitle', 'originalTitle', 'isAdult', 'startYear', 'endYear', 'runtimeMinutes', 'genres', 'int_id_tconst') for 48979 IDs from source 'title_basics' in 0.203781 sec using SqlFetcher(Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb), tables=['title_basics', 'name_basics']).
2023-03-25T11:36:22.069 [id_translation.fetching.config-toml:DEBUG] Begin wanted-to-actual placeholder mapping of placeholders={'original_name'} to actual placeholders={'int_id_nconst', 'nconst', 'knownForTitles', 'primaryName', 'deathYear', 'primaryProfession', 'index', 'birthYear'} for source='name_basics'.
2023-03-25T11:36:22.069 [id_translation.mapping.placeholders.config-toml:DEBUG] Begin computing match scores in context='name_basics' for ['original_name']x['int_id_nconst', 'nconst', 'knownForTitles', 'primaryName', 'deathYear', 'primaryProfession', 'index', 'birthYear'] using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2023-03-25T11:36:22.070 [id_translation.mapping.placeholders.config-toml:DEBUG] Applied 2 overrides, but none were a match for the given values=['original_name'].
2023-03-25T11:36:22.071 [id_translation.fetching.config-toml:DEBUG] Finished wanted-to-actual placeholder mapping of placeholders={'original_name'} to actual placeholders={'int_id_nconst', 'nconst', 'knownForTitles', 'primaryName', 'deathYear', 'primaryProfession', 'index', 'birthYear'} for source='name_basics': {}.
2023-03-25T11:36:22.072 [id_translation.fetching.config-toml:DEBUG] Begin fetching placeholders=('id', 'name', 'original_name', 'from', 'to') from source='name_basics' for all IDs.
2023-03-25T11:36:22.749 [id_translation.fetching.config-toml:DEBUG] Finished fetching placeholders=('index', 'nconst', 'primaryName', 'birthYear', 'deathYear', 'primaryProfession', 'knownForTitles', 'int_id_nconst') for 172326 IDs from source 'name_basics' in 0.677103 sec using SqlFetcher(Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb), tables=['title_basics', 'name_basics']).
2023-03-25T11:36:22.750 [id_translation.fetching.config-toml.sql:DEBUG] Dispose Engine(postgresql+pg8000://postgres:***@localhost:5432/imdb)
2023-03-25T11:36:22.750 [id_translation.Translator:INFO] Created Translator(online=False: cache=TranslationMap('name_basics': 172326 IDs, 'title_basics': 48979 IDs)) in 0.897753 sec.
[6]:
TranslationMap('name_basics': 172326 IDs, 'title_basics': 48979 IDs)
Get the name and the “first” appearance for actors#
In the IMDb list anyway. I have no idea how they’re ordered in “knownForTitles”.
[7]:
df = first_title(seed=5)
df.head()
[7]:
| nconst | firstTitle | |
|---|---|---|
| 102407 | nm0807090 | tt0063794 |
| 105345 | nm0831760 | tt0019348 |
| 1893 | nm0008280 | tt0012541 |
| 107276 | nm0845966 | tt0066501 |
| 170646 | nm8380982 | tt5990574 |
Translate#
[8]:
translator.translate(df).head(5)
2023-03-25T11:36:23.814 [id_translation.Translator:DEBUG] Begin translation of 'DataFrame' using sources=['title_basics', 'name_basics']. Names to translate: Will be derived based on 'DataFrame'.
2023-03-25T11:36:23.815 [id_translation.Translator:DEBUG] Begin name-to-source mapping of names=['nconst', 'firstTitle'] in DataFrame against sources=['title_basics', 'name_basics'].
2023-03-25T11:36:23.816 [id_translation.mapping.name-to-source:DEBUG] Begin computing match scores for ['nconst', 'firstTitle']x['title_basics', 'name_basics'] using HeuristicScore([like_database_table()] -> modified_hamming).
2023-03-25T11:36:23.817 [id_translation.mapping.name-to-source:DEBUG] All values mapped by overrides. Applied 3 overrides, and found 2 matches={'nconst': 'name_basics', 'firstTitle': 'title_basics'} in the given values=['nconst', 'firstTitle'].
2023-03-25T11:36:23.818 [id_translation.Translator:DEBUG] Finished name-to-source mapping of names=['nconst', 'firstTitle'] in DataFrame against sources=['title_basics', 'name_basics']: {'nconst': ('name_basics',), 'firstTitle': ('title_basics',)}.
2023-03-25T11:36:23.883 [id_translation.Translator:DEBUG] Failed to translate 884/1000 (88.400%) of IDs for name='firstTitle' using source='title_basics'.
2023-03-25T11:36:23.883 [id_translation.Translator:DEBUG] Failed to translate 884/2000 (44.200%) of IDs extracted from 2 different names.
2023-03-25T11:36:24.018 [id_translation.Translator:DEBUG] Finished translation of 'DataFrame' in 0.203546 sec. Returning a translated copy since inplace=False.
[8]:
| nconst | firstTitle | |
|---|---|---|
| 102407 | nm0807090:Zoya Smirnova-Nemirovich *1909†1986 | tt0063794 not translated; default name=Title unknown |
| 105345 | nm0831760:Billy Stone *1884†1931 | tt0019348 not translated; default name=Title unknown |
| 1893 | nm0008280:Achmed Abdullah *1881†1945 | tt0012541 not translated; default name=Title unknown |
| 107276 | nm0845966:André Tabet *1902†1981 | tt0066501 not translated; default name=Title unknown |
| 170646 | nm8380982:Lee Botts *1928†2019 | tt5990574 not translated; default name=Title unknown |
[9]:
translator.translate(df, inplace=True) # returns None
df.head(5)
2023-03-25T11:36:24.024 [id_translation.Translator:DEBUG] Begin translation of 'DataFrame' using sources=['title_basics', 'name_basics']. Names to translate: Will be derived based on 'DataFrame'.
2023-03-25T11:36:24.026 [id_translation.Translator:DEBUG] Begin name-to-source mapping of names=['nconst', 'firstTitle'] in DataFrame against sources=['title_basics', 'name_basics'].
2023-03-25T11:36:24.027 [id_translation.mapping.name-to-source:DEBUG] Begin computing match scores for ['nconst', 'firstTitle']x['title_basics', 'name_basics'] using HeuristicScore([like_database_table()] -> modified_hamming).
2023-03-25T11:36:24.029 [id_translation.mapping.name-to-source:DEBUG] All values mapped by overrides. Applied 3 overrides, and found 2 matches={'nconst': 'name_basics', 'firstTitle': 'title_basics'} in the given values=['nconst', 'firstTitle'].
2023-03-25T11:36:24.030 [id_translation.Translator:DEBUG] Finished name-to-source mapping of names=['nconst', 'firstTitle'] in DataFrame against sources=['title_basics', 'name_basics']: {'nconst': ('name_basics',), 'firstTitle': ('title_basics',)}.
2023-03-25T11:36:24.093 [id_translation.Translator:DEBUG] Failed to translate 884/1000 (88.400%) of IDs for name='firstTitle' using source='title_basics'.
2023-03-25T11:36:24.094 [id_translation.Translator:DEBUG] Failed to translate 884/2000 (44.200%) of IDs extracted from 2 different names.
2023-03-25T11:36:24.231 [id_translation.Translator:DEBUG] Finished translation of 'DataFrame' in 0.206744 sec. Values in 'DataFrame' have been replaced since inplace=True.
[9]:
| nconst | firstTitle | |
|---|---|---|
| 102407 | nm0807090:Zoya Smirnova-Nemirovich *1909†1986 | tt0063794 not translated; default name=Title unknown |
| 105345 | nm0831760:Billy Stone *1884†1931 | tt0019348 not translated; default name=Title unknown |
| 1893 | nm0008280:Achmed Abdullah *1881†1945 | tt0012541 not translated; default name=Title unknown |
| 107276 | nm0845966:André Tabet *1902†1981 | tt0066501 not translated; default name=Title unknown |
| 170646 | nm8380982:Lee Botts *1928†2019 | tt5990574 not translated; default name=Title unknown |
[ ]: