id_translation.fetching#

Fetching of translation data.

Composite:
  • MultiFetcher: Solution for using multiple simple fetchers, e.g. multiple databases or file-system locations. Or a combination thereof!

Simple fetchers:
  • SqlFetcher: Fetching from a single SQL database or schema.

  • PandasFetcher: File-system fetching based on pandas read-functions. Valid URL schemes include http, ftp, s3, gs, and file.

  • MemoryFetcher: In-memory solution, used primarily for testing.

Base fetchers:

Classes

AbstractFetcher([mapper, allow_fetch_all, ...])

Base class for retrieving translations from an external source.

CacheAccess(max_age, metadata)

Utility class for managing the FETCH_ALL-cache.

CacheMetadata(*, cache_keys, placeholders, ...)

Metadata pertaining to fetcher caching logic.

Fetcher()

Interface for fetching translations from an external source.

MemoryFetcher([data, return_all])

Fetch from memory.

MultiFetcher(*children[, max_workers, ...])

Fetcher which combines the results of other fetchers.

PandasFetcher([read_function, ...])

Fetcher implementation using pandas DataFrame s as the data format.

SqlFetcher(connection_string[, password, ...])

Fetch data from a SQL source.

class AbstractFetcher(mapper=None, allow_fetch_all=True, fetch_all_unmapped_values_action=None, selective_fetch_all=True, fetch_all_cache_max_age=None, cache_keys=None, optional=False)[source]#

Bases: Fetcher[SourceType, IdType]

Base class for retrieving translations from an external source.

Hint

Parameters:
  • mapper – A Mapper instance used to adapt placeholder names in sources to wanted names, i.e. the names of the placeholders that are in the translation Format being used.

  • allow_fetch_all – If False, an error will be raised when fetch_all() is called.

  • fetch_all_unmapped_values_action – A temporary value to use for Mapper.unmapped_values_action while fetch_all() is executing. Setting fetch_all_unmapped_values_action='raise' is mutually exclusive with selective_fetch_all=True.

  • selective_fetch_all – If True, fetch only from those sources that contain the required placeholders (after mapping). May also reduce the number of placeholders retrieved.

  • fetch_all_cache_max_age – If given, determines validity lifetime of data cached when fetch_all()-calls are made. The regular fetch function will draw from this cache as well, but only fetch_all will update the cache. Furthermore, caching will never be used (read or write) if online is False.

  • cache_keys – A collection of hierarchical cache-key elements, see CacheMetadata. If given, element zero of the cache_keys is added to the logger name for the fetcher.

  • optional – If True, this fetcher may be discarded if source/placeholder-enumeration fails in multi-fetcher mode.

Raises:
property allow_fetch_all#

Flag indicating whether the fetch_all() operation is permitted.

assert_online()[source]#

Raise an error if offline.

Raises:

ConnectionStatusError – If not online.

property cache_enabled#

Return the caching status for the fetcher.

clear_cache(reason)[source]#
classmethod default_mapper_kwargs()[source]#

Return default Mapper arguments for AbstractFetcher implementations.

classmethod default_score_function(value, candidates, context)[source]#

Compute score for candidates.

fetch(ids_to_fetch, placeholders=(), required=(), task_id=None, enable_uuid_heuristics=False)[source]#

Retrieve placeholder translations from the source.

Parameters:
  • ids_to_fetch – An iterable of IdsToFetch.

  • placeholders – All desired placeholders in preferred order.

  • required – Placeholders that must be included in the response.

  • task_id – Used for logging.

  • enable_uuid_heuristics – If set, apply heuristics to improve matching with UUID-like IDs.

Returns:

A mapping {source: PlaceholderTranslations} of translation elements.

Raises:

Notes

Placeholders are usually columns in relational database applications. These are the components which are combined to create ID translations. See Format documentation for details.

fetch_all(placeholders=(), *, required=(), task_id=None, enable_uuid_heuristics=False)[source]#

Fetch as much data as possible.

Parameters:
  • placeholders – All desired placeholders in preferred order.

  • required – Placeholders that must be included in the response.

  • task_id – Used for logging.

  • enable_uuid_heuristics – If set, apply heuristics to improve matching with UUID-like IDs.

Returns:

A mapping {source: PlaceholderTranslations} of translation elements.

Raises:
abstract fetch_translations(instr)[source]#

Retrieve placeholder translations from the source.

Parameters:

instr – A single FetchInstruction for IDs to fetch. If IDs is None, the fetcher should retrieve data for as many IDs as possible.

Returns:

Placeholder translation elements.

Raises:

UnknownPlaceholderError – If the placeholder is unknown to the fetcher.

See also

🔑 This is a key event method. See Key Event Records for details.

get_placeholders(source)[source]#

Get placeholders for source.

id_column(source, *, candidates=None, task_id=None)[source]#

Return the ID column for source.

final initialize_sources(task_id=-1, *, force=False)[source]#

Perform source discovery.

Parameters:
  • task_id – Used for logging.

  • force – If True, perform full discovery even if sources are already known.

property logger#

Return the Logger that is used by this instance.

map_placeholders(source, placeholders, *, candidates=None, clear_cache=False, task_id=None)[source]#

Map placeholder names to the actual names seen in source.

This method calls Mapper.apply(values=placeholders, candidates=candidates, context=source) using this fetchers AbstractFetcher.mapper instance. It is assumed that names in sources rarely change, so mappings are cached until the fetcher is recreated or until this method is called with clear_cache=True.

Placeholder mapping caching should not be confused with FETCH_ALL data caching.

Parameters:
  • source – The source to map placeholders for.

  • placeholders – Desired placeholders.

  • candidates – A subset of candidates (placeholder names) in source to map with placeholders.

  • clear_cache – If True, force a full remap.

  • task_id – Used for logging purposes.

Returns:

A dict {wanted_placeholder_name: actual_placeholder_name_in_source}, where actual_placeholder_name_in_source will be None if the wanted placeholder could not be mapped to any of the candidates available for the source.

Raises:

UnknownPlaceholderError – If any of required_placeholders are incorrectly mapped, or not mapped at all.

See also

🔑 This is a key event method. See Key Event Records for details.

property mapper#

Return the Mapper instance used for placeholder name mapping.

property online#

Return connectivity status. If False, no new translations may be fetched.

property optional#

Return True if this fetcher has been marked as optional.

In multi-fetcher mode, optional fetchers may be discarded if sources cannot be resolved (raises an exception). Default value is False.

Returns:

Optionality status.

property placeholders#

Placeholders for all known Source names, such as id or name.

These are the (possibly unmapped) placeholders that may be used for translation.

Returns:

A dict {source: [placeholders..]}.

property selective_fetch_all#

If set, reduce the amount of data fetched by fetch_all().

property sources#

A list of known Source names, such as cities or languages.

class CacheAccess(max_age, metadata)[source]#

Bases: Generic[SourceType, IdType]

Utility class for managing the FETCH_ALL-cache.

Parameters:
  • max_age – Cache timeout.

  • metadata – Metadata object used to determine cache validity.

BASE_CACHE_PATH = PosixPath('/home/docs/.cache/id-translation/cached-fetcher-data')#

Top-level cache dir for all fetchers managed by any CacheAccess-instance.

CLEAR_CACHE_EXCEPTION_TYPES = (<class '_pickle.UnpicklingError'>,)#

Error types which trigger cache deletion

classmethod base_cache_dir_for_all_fetchers()[source]#

Top-level cache dir for all fetchers managed by any CacheAccess-instance.

property cache_dir#

Get the cache directory used by this CacheAccess.

Created from BASE_CACHE_PATH and the first value of CacheMetadata.cache_keys.

Returns:

Cache dir for a single fetcher.

clear(reason, log_level=10, *, exc_info=False)[source]#

Remove cached data for the current instance.

classmethod clear_all_cache_data()[source]#

Remove the entire cache directory tree for ALL instances.

property data_dir#
property metadata_path#
read_cache(source)[source]#

Read cached translation data for source.

source_path(source)[source]#

Get the data path for source.

write_cache(data)[source]#

Overwrite the current cache (if it exists) with new and update metadata.

class CacheMetadata(*, cache_keys, placeholders, **kwargs)[source]#

Bases: BaseMetadata, Generic[SourceType, IdType], HasSources[SourceType]

Metadata pertaining to fetcher caching logic.

Parameters:
  • cache_keys – Hierarchical identifiers for the cache. The first key is used to determine storage location, while the rest are used to detect configuration changes (which invalidate the cache). A typical key would be [config-file-name, config-file-sha].

  • placeholders – A Source-to-placeholder dict.

  • **kwargs – Forwarded to base classes.

property cache_keys#

Yields hierarchical cache keys for this metadata.

property placeholders#

Placeholders for all known Source names, such as id or name.

These are the (possibly unmapped) placeholders that may be used for translation.

Returns:

A dict {source: [placeholders..]}.

class Fetcher[source]#

Bases: Generic[SourceType, IdType], HasSources[SourceType]

Interface for fetching translations from an external source.

abstract property allow_fetch_all#

Flag indicating whether the fetch_all() operation is permitted.

close()[source]#

Close the Fetcher. Does nothing by default.

abstract fetch(ids_to_fetch, placeholders=(), *, required=(), task_id=None, enable_uuid_heuristics=False)[source]#

Retrieve placeholder translations from the source.

Parameters:
  • ids_to_fetch – An iterable of IdsToFetch.

  • placeholders – All desired placeholders in preferred order.

  • required – Placeholders that must be included in the response.

  • task_id – Used for logging.

  • enable_uuid_heuristics – If set, apply heuristics to improve matching with UUID-like IDs.

Returns:

A mapping {source: PlaceholderTranslations} of translation elements.

Raises:

Notes

Placeholders are usually columns in relational database applications. These are the components which are combined to create ID translations. See Format documentation for details.

abstract fetch_all(placeholders=(), *, required=(), task_id=None, enable_uuid_heuristics=False)[source]#

Fetch as much data as possible.

Parameters:
  • placeholders – All desired placeholders in preferred order.

  • required – Placeholders that must be included in the response.

  • task_id – Used for logging.

  • enable_uuid_heuristics – If set, apply heuristics to improve matching with UUID-like IDs.

Returns:

A mapping {source: PlaceholderTranslations} of translation elements.

Raises:
abstract initialize_sources(task_id=-1, *, force=False)[source]#

Perform source discovery.

Parameters:
  • task_id – Used for logging.

  • force – If True, perform full discovery even if sources are already known.

abstract property online#

Return connectivity status. If False, no new translations may be fetched.

property optional#

Return True if this fetcher has been marked as optional.

In multi-fetcher mode, optional fetchers may be discarded if sources cannot be resolved (raises an exception). Default value is False.

Returns:

Optionality status.

class MemoryFetcher(data=None, return_all=True, **kwargs)[source]#

Bases: AbstractFetcher[SourceType, IdType]

Fetch from memory.

Parameters:
  • data – A dict {source: PlaceholderTranslations} to fetch from.

  • return_all – If False, return only the requested IDs and placeholders.

fetch_translations(instr)[source]#

Retrieve placeholder translations from the source.

Parameters:

instr – A single FetchInstruction for IDs to fetch. If IDs is None, the fetcher should retrieve data for as many IDs as possible.

Returns:

Placeholder translation elements.

Raises:

UnknownPlaceholderError – If the placeholder is unknown to the fetcher.

See also

🔑 This is a key event method. See Key Event Records for details.

class MultiFetcher(*children, max_workers=1, duplicate_translation_action=ActionLevel.WARN, duplicate_source_discovered_action=ActionLevel.WARN, optional_fetcher_discarded_log_level='DEBUG')[source]#

Bases: Fetcher[SourceType, IdType]

Fetcher which combines the results of other fetchers.

Parameters:
  • *children – Fetchers to wrap.

  • max_workers – Number of threads to use for fetching. Fetch instructions will be dispatched using a ThreadPoolExecutor. Individual fetchers will be called at most once per fetch() or fetch_all() call made with the MultiFetcher.

  • duplicate_translation_action – Action to take when multiple fetchers return translations for the same source.

  • duplicate_source_discovered_action – Action to take when multiple fetchers claim the same source.

  • optional_fetcher_discarded_log_level – Log level used when discarding optional fetchers for any reason.

property allow_fetch_all#

Flag indicating whether the fetch_all() operation is permitted.

property children#

Return child fetchers.

property duplicate_source_discovered_action#

Return action to take when multiple fetchers claim the same source.

property duplicate_translation_action#

Return action to take when multiple fetchers return translations for the same source.

fetch(ids_to_fetch, placeholders=(), *, required=(), task_id=None, enable_uuid_heuristics=False)[source]#

Retrieve placeholder translations from the source.

Parameters:
  • ids_to_fetch – An iterable of IdsToFetch.

  • placeholders – All desired placeholders in preferred order.

  • required – Placeholders that must be included in the response.

  • task_id – Used for logging.

  • enable_uuid_heuristics – If set, apply heuristics to improve matching with UUID-like IDs.

Returns:

A mapping {source: PlaceholderTranslations} of translation elements.

Raises:

Notes

Placeholders are usually columns in relational database applications. These are the components which are combined to create ID translations. See Format documentation for details.

fetch_all(placeholders=(), *, required=(), task_id=None, enable_uuid_heuristics=False)[source]#

Fetch as much data as possible.

Parameters:
  • placeholders – All desired placeholders in preferred order.

  • required – Placeholders that must be included in the response.

  • task_id – Used for logging.

  • enable_uuid_heuristics – If set, apply heuristics to improve matching with UUID-like IDs.

Returns:

A mapping {source: PlaceholderTranslations} of translation elements.

Raises:
initialize_sources(task_id=-1, *, force=False)[source]#

Perform source discovery.

Parameters:
  • task_id – Used for logging.

  • force – If True, perform full discovery even if sources are already known.

property online#

Return connectivity status. If False, no new translations may be fetched.

property placeholders#

Placeholders for all known Source names, such as id or name.

These are the (possibly unmapped) placeholders that may be used for translation.

Returns:

A dict {source: [placeholders..]}.

class PandasFetcher(read_function='read_csv', read_path_format='data/{}.csv', read_function_kwargs=None, online=False, **kwargs)[source]#

Bases: AbstractFetcher[str, IdType]

Fetcher implementation using pandas DataFrame s as the data format.

Fetch data from serialized DataFrame s. How this is done is determined by the read_function. This is typically a Pandas function such as pandas.read_csv() or pandas.read_parquet(), but any function that accepts a string source as the first argument and returns a data frame can be used.

Hint

When using remote file systems, sources are resolved using AbstractFileSystem.glob(). If resolution fails, consider overriding the find_sources()-method.

Parameters:
  • read_function – A Pandas read-function. If a string is given, the function is resolved using rics.misc.get_by_full_name(). Unqualified names are assumed to belong to the pandas namespace.

  • read_path_format – A string on the form protocol://path/to/sources/{}.ext, or a callable to apply to a source before passing them to read_function.

  • read_function_kwargs – Additional keyword arguments for read_function.

  • online – Setting online=False typically indicates that files are hosted at a location where there are access limitations, e.g. through data transfer fees.

See also

The official Pandas IO documentation

fetch_translations(instr)[source]#

Retrieve placeholder translations from the source.

Parameters:

instr – A single FetchInstruction for IDs to fetch. If IDs is None, the fetcher should retrieve data for as many IDs as possible.

Returns:

Placeholder translation elements.

Raises:

UnknownPlaceholderError – If the placeholder is unknown to the fetcher.

See also

🔑 This is a key event method. See Key Event Records for details.

find_sources(task_id=-1)[source]#

Resolve sources and their associated paths.

Parameters:

task_id – Used for logging.

Sources are resolved in three steps:

  1. Create glob pattern by calling format_source() with source='*'.

  2. Glob files using AbstractFileSystem.glob() (requires fsspec) or Path.glob().

  3. Strip the directory and file suffix from the globbed paths to create source names.

Returns:

A dict {source: path}.

format_source(source)[source]#

Get the path for source.

property online#

Return connectivity status. If False, no new translations may be fetched.

read(source_path)[source]#

Read a DataFrame from a source path.

Parameters:

source_path – Path to serialized DataFrame.

Returns:

A deserialized DataFrame.

class SqlFetcher(connection_string, password=None, whitelist_tables=None, blacklist_tables=(), schema=None, include_views=False, engine_kwargs=None, **kwargs)[source]#

Bases: AbstractFetcher[str, IdType]

Fetch data from a SQL source.

Parameters:
  • connection_string – A SQLAlchemy connection string.

  • password – Password to insert into the connection string. Will be escaped to allow for special characters. If given, the connection string must contain a password key, eg; dialect://user:{password}@host:port.

  • whitelist_tables – The only tables the fetcher may access.

  • blacklist_tables – The only tables the fetcher may not access.

  • schema – Database schema to use. Typically needed only if schema is not the default schema for the user specified in the connection string.

  • include_views – If True, the fetcher will discover and query views as well.

  • engine_kwargs – A dict of keyword arguments for sqlalchemy.create_engine().

  • **kwargs – See AbstractFetcher.

Raises:

ValueError – If both whitelist_tables and blacklist_tables are given.

Notes

Inheriting classes may override on or more of the following methods to further customize operation.

Overriding should be done with care, as methods may call each other internally.

class TableSummary(name, columns, fetch_all_permitted, id_column)#

Bases: Generic[IdType]

Brief description of a known table.

columns#

A flag indicating that the FETCH_ALL-operation is permitted for this table.

fetch_all_permitted#

A flag indicating that the FETCH_ALL-operation is permitted for this table.

id_column#

The ID column of the table.

name#

Name of the table.

property allow_fetch_all#

Flag indicating whether the fetch_all() operation is permitted.

cast_id_column_to_uuid(id_column, *, ids_are_uuid_like)[source]#

Apply UUID heuristics to the ID column.

This function attempts cast the id_column to a suitable type by looking at the type of the column and the ids_are_uuid_like-flag.

If the column is already UUID-like (as determined by get_metadata()), the column is always returned as-is.

Parameters:
  • id_column – The ID sqlalchemy.sql.Column of the table.

  • ids_are_uuid_like – One of True and 'unknown' (never False). The latter typically means that fetch_all() was called, but could also be a normal “translation” call without IDs.

Returns:

The id_column with or without a cast applied.

close()[source]#

Close the fetcher, discarding the engine.

classmethod create_engine(connection_string, password, engine_kwargs)[source]#

Factory method used by __init__.

For a more detailed description of the arguments and the behaviour of this function, see the class docstring.

Parameters:
  • connection_string – A SQLAlchemy connection string.

  • password – Password to insert into the connection string.

  • engine_kwargs – A dict of keyword arguments for sqlalchemy.create_engine().

Returns:

A new Engine.

property engine#

The Engine used by this fetcher.

fetch_translations(instr)[source]#

Retrieve placeholder translations from the source.

Parameters:

instr – A single FetchInstruction for IDs to fetch. If IDs is None, the fetcher should retrieve data for as many IDs as possible.

Returns:

Placeholder translation elements.

Raises:

UnknownPlaceholderError – If the placeholder is unknown to the fetcher.

See also

🔑 This is a key event method. See Key Event Records for details.

get_metadata()[source]#

Create a populated metadata object.

make_table_summary(table, id_column)[source]#

Create a table summary.

This function is called as a part of the fetcher initialization process.

Parameters:
  • table – The table (source) which is currently being processed.

  • id_column – The ID column of table

Returns:

A summary object for table.

property online#

Return connectivity status. If False, no new translations may be fetched.

classmethod parse_connection_string(connection_string, password)[source]#

Parse a connection string.

classmethod select_where(select, *, ids, id_column, table)[source]#

Add WHERE clause(s) to an ID select statement.

Warning

When overriding, keep in mind that returning the select statement as-is will perform an unfiltered select.

Parameters:
  • select – A sqlalchemy.sql.Select element. If returned as-is, all IDs in the table will be fetched.

  • ids – Set of IDs to fetch. Will be None if fetch_all() was called.

  • id_column – The ID sqlalchemy.sql.Column of the table, from which ids are fetched.

  • table – Table to select from.

Returns:

The final statement object to use.

uuid_like(id_column, ids)[source]#

Determine whether id_column should be passed to cast_id_column_to_uuid().

Note

Return values:
Parameters:
  • id_column – The ID sqlalchemy.sql.Column of the table.

  • ids – Set of IDs to fetch. Will be None if fetch_all() was called.

Returns:

One of True, False and None. See above for explanation.

Modules

id_translation.fetching.exceptions

Errors and warnings related to fetching.

id_translation.fetching.types

Types related to translation fetching.