Fetching data using PandasFetcher

Translating using pickle files.

[1]:
import sys
import rics
import id_translation

# Print relevant versions
print(f"{rics.__version__=}")
print(f"{id_translation.__version__=}")
print(f"{sys.version=}")
rics.configure_stuff(rics_level="DEBUG", id_translation_level="DEBUG")
!git log --pretty=oneline --abbrev-commit -1
rics.__version__='3.0.0'
id_translation.__version__='0.3.1.dev1'
sys.version='3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]'
👻 Configured some stuff just the way I like it!
d2093a4 (HEAD, origin/main, origin/HEAD, main) Update help link in TOML files

Make local Pickle files

We'll download data from https://datasets.imdbws.com and clean it so that every value is present. In practice this means that only actors who have died and titles that have stopped airing are kept.
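
The cleaning itself happens in data.py, which is not shown here. As a rough sketch of the idea only: the IMDb files are tab-separated and mark missing values with \N, so keeping only fully populated rows could look something like the code below. The helper name drop_incomplete_rows and the last sample row are made up for illustration, and the standard library csv module is used instead of pandas to keep the example self-contained.

```python
import csv
import io

# Tiny stand-in for name.basics.tsv; the last row is made up for illustration.
SAMPLE_TSV = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\n"
    "nm0000001\tFred Astaire\t1899\t1987\n"
    "nm0000002\tLauren Bacall\t1924\t2014\n"
    "nm9999999\tExample Person\t1950\t\\N\n"
)

MISSING = "\\N"  # IMDb's marker for a missing value


def drop_incomplete_rows(tsv_text: str) -> list[dict[str, str]]:
    """Return only the rows in which every column has a value (hypothetical helper)."""
    # QUOTE_NONE: treat quote characters as ordinary text instead of field delimiters.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if MISSING not in row.values()]


complete = drop_incomplete_rows(SAMPLE_TSV)
print([row["nconst"] for row in complete])
```

The row with a missing deathYear is dropped, which is why the translations further down only contain people with both a birth and a death year.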

[2]:
sources = ["name.basics", "title.basics"]
[3]:
from data import load_imdb

for dataset in sources:
    load_imdb(dataset)
2023-03-25T11:23:35.590 [rics.utility.misc.get_local_or_remote:DEBUG] Local file path: '/home/dev/git/id-translation/jupyterlab/id-translation/data-cache/name.basics.tsv.gz'.
2023-03-25T11:23:35.591 [rics.utility.misc.get_local_or_remote:DEBUG] Remote file path: 'https://datasets.imdbws.com/name.basics.tsv.gz'.
2023-03-25T11:23:35.614 [rics.utility.misc.get_local_or_remote:INFO] Fetching data from 'https://datasets.imdbws.com/name.basics.tsv.gz'..
2023-03-25T11:23:39.196 [rics.utility.misc.get_local_or_remote:INFO] Local processed file path: '/home/dev/git/id-translation/jupyterlab/id-translation/data-cache/clean_and_fix_ids/name.basics.tsv.pkl'.
2023-03-25T11:23:39.197 [rics.utility.misc.get_local_or_remote:INFO] Running clean_and_fix_ids..
2023-03-25T11:23:57.322 [rics.utility.misc.get_local_or_remote:INFO] Serializing processed data to '/home/dev/git/id-translation/jupyterlab/id-translation/data-cache/clean_and_fix_ids/name.basics.tsv.pkl'..
2023-03-25T11:23:57.433 [rics.utility.misc.get_local_or_remote:DEBUG] Local file path: '/home/dev/git/id-translation/jupyterlab/id-translation/data-cache/title.basics.tsv.gz'.
2023-03-25T11:23:57.434 [rics.utility.misc.get_local_or_remote:DEBUG] Remote file path: 'https://datasets.imdbws.com/title.basics.tsv.gz'.
2023-03-25T11:23:57.434 [rics.utility.misc.get_local_or_remote:INFO] Fetching data from 'https://datasets.imdbws.com/title.basics.tsv.gz'..
2023-03-25T11:23:59.305 [rics.utility.misc.get_local_or_remote:INFO] Local processed file path: '/home/dev/git/id-translation/jupyterlab/id-translation/data-cache/clean_and_fix_ids/title.basics.tsv.pkl'.
2023-03-25T11:23:59.306 [rics.utility.misc.get_local_or_remote:INFO] Running clean_and_fix_ids..
/home/dev/git/id-translation/jupyterlab/id-translation/data.py:36: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv(input_path, sep="\t", header=0, engine="c")
2023-03-25T11:24:16.107 [rics.utility.misc.get_local_or_remote:INFO] Serializing processed data to '/home/dev/git/id-translation/jupyterlab/id-translation/data-cache/clean_and_fix_ids/title.basics.tsv.pkl'..
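
The log above follows a cache-then-process pattern: resolve a local path, fetch the remote file, run the cleaning function, and serialize the result as a pickle file, presumably so that later runs can reuse it. A minimal sketch of that pattern is shown below. It is not the rics implementation; load_or_process, fake_clean and the file names are hypothetical, and the download step is left out.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_process(raw_path: Path, processed_path: Path, process: Callable[[Path], Any]) -> Any:
    """Return the processed data, computing and pickling it on first use (hypothetical helper)."""
    if processed_path.exists():
        # Cache hit: skip the expensive processing step.
        with processed_path.open("rb") as f:
            return pickle.load(f)

    result = process(raw_path)
    processed_path.parent.mkdir(parents=True, exist_ok=True)
    with processed_path.open("wb") as f:
        pickle.dump(result, f)
    return result


calls = []  # records how often the processing function actually runs


def fake_clean(path: Path) -> list[str]:
    """Stand-in for the real cleaning function."""
    calls.append(path.name)
    return path.read_text().split()


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw.txt"
    raw.write_text("a b c")
    processed = Path(tmp) / "processed" / "raw.pkl"

    first = load_or_process(raw, processed, fake_clean)
    second = load_or_process(raw, processed, fake_clean)  # served from the pickle

print(first == second, len(calls))
```

The second call never reaches fake_clean, which is the point of writing the processed file to disk.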

Create translator from config

The translator is configured by the file config.toml.

[4]:
from id_translation import Translator

translator = Translator.from_config("config.toml")
translator
2023-03-25T11:24:16.151 [id_translation.fetching.config-toml.pandas.discovery:DEBUG] Sources initialized: ['name.basics', 'title.basics']
[4]:
Translator(online=True: fetcher=PandasFetcher(sources=['name.basics', 'title.basics']))
[5]:
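# Fetch translations for all IDs and keep them in memory;
# `cache` is the resulting TranslationMap.
tmap = translator.store().cache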
2023-03-25T11:24:16.262 [id_translation.fetching.config-toml:DEBUG] Begin wanted-to-actual placeholder mapping of placeholders={'id', 'to', 'from', 'name'} to actual placeholders={'nconst', 'int_id_nconst', 'knownForTitles', 'deathYear', 'primaryName', 'primaryProfession', 'birthYear'} for source='name.basics'.
2023-03-25T11:24:16.263 [id_translation.mapping.placeholders.config-toml:DEBUG] Begin computing match scores in context='name.basics' for ['id', 'to', 'from', 'name']x['nconst', 'int_id_nconst', 'knownForTitles', 'deathYear', 'primaryName', 'primaryProfession', 'birthYear'] using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2023-03-25T11:24:16.264 [id_translation.mapping.placeholders.config-toml:DEBUG] All values mapped by overrides. Applied 2 overrides, and found 4 matches={'id': 'nconst', 'to': 'deathYear', 'from': 'birthYear', 'name': 'primaryName'} in the given values=['id', 'to', 'from', 'name'].
2023-03-25T11:24:16.266 [id_translation.fetching.config-toml:DEBUG] Finished wanted-to-actual placeholder mapping of placeholders={'id', 'to', 'from', 'name'} to actual placeholders={'nconst', 'int_id_nconst', 'knownForTitles', 'deathYear', 'primaryName', 'primaryProfession', 'birthYear'} for source='name.basics': {'id': ('nconst',), 'to': ('deathYear',), 'from': ('birthYear',), 'name': ('primaryName',)}.
2023-03-25T11:24:16.266 [id_translation.fetching.config-toml:DEBUG] Begin wanted-to-actual placeholder mapping of placeholders={'id', 'to', 'from', 'name'} to actual placeholders={'startYear', 'titleType', 'int_id_tconst', 'tconst', 'runtimeMinutes', 'isAdult', 'primaryTitle', 'endYear', 'genres', 'originalTitle'} for source='title.basics'.
2023-03-25T11:24:16.267 [id_translation.mapping.placeholders.config-toml:DEBUG] Begin computing match scores in context='title.basics' for ['id', 'to', 'from', 'name']x['startYear', 'titleType', 'int_id_tconst', 'tconst', 'runtimeMinutes', 'isAdult', 'primaryTitle', 'endYear', 'genres', 'originalTitle'] using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2023-03-25T11:24:16.268 [id_translation.mapping.placeholders.config-toml:DEBUG] All values mapped by overrides. Applied 2 overrides, and found 4 matches={'id': 'tconst', 'to': 'endYear', 'from': 'startYear', 'name': 'primaryTitle'} in the given values=['id', 'to', 'from', 'name'].
2023-03-25T11:24:16.270 [id_translation.fetching.config-toml:DEBUG] Finished wanted-to-actual placeholder mapping of placeholders={'id', 'to', 'from', 'name'} to actual placeholders={'startYear', 'titleType', 'int_id_tconst', 'tconst', 'runtimeMinutes', 'isAdult', 'primaryTitle', 'endYear', 'genres', 'originalTitle'} for source='title.basics': {'id': ('tconst',), 'to': ('endYear',), 'from': ('startYear',), 'name': ('primaryTitle',)}.
2023-03-25T11:24:16.270 [id_translation.fetching.config-toml:DEBUG] Begin wanted-to-actual placeholder mapping of placeholders={'original_name'} to actual placeholders={'nconst', 'int_id_nconst', 'knownForTitles', 'deathYear', 'primaryName', 'primaryProfession', 'birthYear'} for source='name.basics'.
2023-03-25T11:24:16.270 [id_translation.mapping.placeholders.config-toml:DEBUG] Begin computing match scores in context='name.basics' for ['original_name']x['nconst', 'int_id_nconst', 'knownForTitles', 'deathYear', 'primaryName', 'primaryProfession', 'birthYear'] using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2023-03-25T11:24:16.271 [id_translation.mapping.placeholders.config-toml:DEBUG] Applied 2 overrides, but none were a match for the given values=['original_name'].
2023-03-25T11:24:16.273 [id_translation.fetching.config-toml:DEBUG] Finished wanted-to-actual placeholder mapping of placeholders={'original_name'} to actual placeholders={'nconst', 'int_id_nconst', 'knownForTitles', 'deathYear', 'primaryName', 'primaryProfession', 'birthYear'} for source='name.basics': {}.
2023-03-25T11:24:16.273 [id_translation.fetching.config-toml:DEBUG] Begin fetching placeholders=('id', 'name', 'original_name', 'from', 'to') from source='name.basics' for all IDs.
2023-03-25T11:24:16.411 [id_translation.fetching.config-toml:DEBUG] Finished fetching placeholders=('nconst', 'primaryName', 'birthYear', 'deathYear', 'primaryProfession', 'knownForTitles', 'int_id_nconst') for 172326 IDs from source 'name.basics' in 0.13715 sec using PandasFetcher(sources=['name.basics', 'title.basics']).
2023-03-25T11:24:16.412 [id_translation.fetching.config-toml:DEBUG] Begin wanted-to-actual placeholder mapping of placeholders={'original_name'} to actual placeholders={'startYear', 'titleType', 'int_id_tconst', 'tconst', 'runtimeMinutes', 'isAdult', 'primaryTitle', 'endYear', 'genres', 'originalTitle'} for source='title.basics'.
2023-03-25T11:24:16.412 [id_translation.mapping.placeholders.config-toml:DEBUG] Begin computing match scores in context='title.basics' for ['original_name']x['startYear', 'titleType', 'int_id_tconst', 'tconst', 'runtimeMinutes', 'isAdult', 'primaryTitle', 'endYear', 'genres', 'originalTitle'] using HeuristicScore([force_lower_case()] -> AbstractFetcher.default_score_function).
2023-03-25T11:24:16.413 [id_translation.mapping.placeholders.config-toml:DEBUG] All values mapped by overrides. Applied 2 overrides, and found 1 matches={'original_name': 'originalTitle'} in the given values=['original_name'].
2023-03-25T11:24:16.415 [id_translation.fetching.config-toml:DEBUG] Finished wanted-to-actual placeholder mapping of placeholders={'original_name'} to actual placeholders={'startYear', 'titleType', 'int_id_tconst', 'tconst', 'runtimeMinutes', 'isAdult', 'primaryTitle', 'endYear', 'genres', 'originalTitle'} for source='title.basics': {'original_name': ('originalTitle',)}.
2023-03-25T11:24:16.416 [id_translation.fetching.config-toml:DEBUG] Begin fetching placeholders=('id', 'name', 'original_name', 'from', 'to') from source='title.basics' for all IDs.
2023-03-25T11:24:16.488 [id_translation.fetching.config-toml:DEBUG] Finished fetching placeholders=('tconst', 'titleType', 'primaryTitle', 'originalTitle', 'isAdult', 'startYear', 'endYear', 'runtimeMinutes', 'genres', 'int_id_tconst') for 48979 IDs from source 'title.basics' in 0.0710693 sec using PandasFetcher(sources=['name.basics', 'title.basics']).
2023-03-25T11:24:16.488 [id_translation.Translator:INFO] Created Translator(online=False: cache=TranslationMap('title.basics': 48979 IDs, 'name.basics': 172326 IDs)) in 0.267461 sec.
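
The log shows each wanted placeholder (id, name, from, to, original_name) being mapped to an actual column of the source, with the matches reported as coming from overrides. The following is a simplified illustration of override-based mapping using plain dictionaries. It is not the id_translation mapping algorithm (which also computes match scores), and map_placeholders and the layout of OVERRIDES are hypothetical. The override values are copied from the matches in the log.

```python
# Per-source overrides, copied from the matches reported in the log output.
OVERRIDES = {
    "name.basics": {"id": "nconst", "name": "primaryName", "from": "birthYear", "to": "deathYear"},
    "title.basics": {
        "id": "tconst",
        "name": "primaryTitle",
        "original_name": "originalTitle",
        "from": "startYear",
        "to": "endYear",
    },
}


def map_placeholders(source: str, wanted: list[str], actual: set[str]) -> dict[str, str]:
    """Map wanted placeholders to actual columns; unmatched ones are left out."""
    overrides = OVERRIDES.get(source, {})
    mapping = {}
    for placeholder in wanted:
        column = overrides.get(placeholder)
        if column in actual:  # also False when there is no override (column is None)
            mapping[placeholder] = column
    return mapping


name_columns = {
    "nconst", "int_id_nconst", "knownForTitles", "deathYear",
    "primaryName", "primaryProfession", "birthYear",
}
title_columns = {
    "startYear", "titleType", "int_id_tconst", "tconst", "runtimeMinutes",
    "isAdult", "primaryTitle", "endYear", "genres", "originalTitle",
}

print(map_placeholders("name.basics", ["id", "name", "original_name"], name_columns))
print(map_placeholders("title.basics", ["original_name"], title_columns))
```

As in the log, original_name has no counterpart in name.basics and is simply absent from that mapping, while title.basics maps it to originalTitle.
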
[6]:
for source in tmap:
    translations = tmap[source]
    print(f"Translations for {source=};")
    # Show only the first three translations per source.
    for i, (idx, translation) in enumerate(translations.items()):
        print(f"    {idx!r} -> {translation!r}")
        if i == 2:
            break
Translations for source='title.basics';
    'tt0035803' -> 'tt0035803:The German Weekly Review (original: Die Deutsche Wochenschau) *1940†1945'
    'tt0038276' -> 'tt0038276:You Are an Artist (original: You Are an Artist) *1946†1955'
    'tt0039120' -> 'tt0039120:Americana (original: Americana) *1947†1949'
Translations for source='name.basics';
    'nm0000001' -> 'nm0000001:Fred Astaire *1899†1987'
    'nm0000002' -> 'nm0000002:Lauren Bacall *1924†2014'
    'nm0000004' -> 'nm0000004:John Belushi *1949†1982'
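
The translated strings combine the fetched placeholders, and the "(original: ...)" part appears only for title.basics, where original_name could be mapped to a column. A plain-Python sketch of formatting with such an optional part is shown below. It does not use the id_translation format machinery, and format_translation is a hypothetical helper. The values come from the output above.

```python
from typing import Optional


def format_translation(id_: str, name: str, from_: int, to: int, original_name: Optional[str] = None) -> str:
    """Build a translation string; the original name is included only when given."""
    original = f" (original: {original_name})" if original_name is not None else ""
    return f"{id_}:{name}{original} *{from_}†{to}"


person = format_translation("nm0000001", "Fred Astaire", 1899, 1987)
title = format_translation("tt0035803", "The German Weekly Review", 1940, 1945, "Die Deutsche Wochenschau")
print(person)
print(title)
```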