id_translation.fetching#

Fetching of translation data.

Composite:
  • MultiFetcher: Solution for using multiple simple fetchers, e.g. multiple databases or file-system locations, or a combination thereof.

Simple fetchers:
  • SqlFetcher: Fetching from a single SQL database or schema.

  • PandasFetcher: File-system fetching based on pandas read-functions. Valid URL schemes include http, ftp, s3, gs, and file.

  • MemoryFetcher: In-memory solution, used primarily for testing.

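Base fetchers:
  • AbstractFetcher: Base class for retrieving translations from an external source.

  • Fetcher: Interface for fetching translations from an external source.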

Fetchers may have additional dependencies.

Classes

AbstractFetcher(*[, mapper, ...])

Base class for retrieving translations from an external source.

CacheAccess()

Interface for user-managed caching.

Fetcher()

Interface for fetching translations from an external source.

MemoryFetcher(data[, return_all])

Fetch from memory.

MultiFetcher(*children[, max_workers, ...])

Fetcher which combines the results of other fetchers.

PandasFetcher([read_function, ...])

Fetcher implementation using pandas.DataFrame as the data format.

SqlFetcher(connection_string[, password, ...])

Fetch data from a SQL source.

class AbstractFetcher(*, mapper=None, allow_fetch_all=True, selective_fetch_all=True, identifiers=None, optional=False, cache_access=None)[source]#

Bases: Fetcher[SourceType, IdType]

Base class for retrieving translations from an external source.

Parameters:
  • mapper – A Mapper instance used to adapt placeholder names in sources to wanted names, i.e. the names of the placeholders that are in the translation Format being used.

  • allow_fetch_all – If False, an error will be raised when fetch_all() is called.

  • selective_fetch_all – If True, fetch only from those sources that contain the required placeholders (after mapping). May reduce the number of sources retrieved.

  • identifiers – A collection of hierarchical identifiers. If given, element zero of the identifiers is added to the logger name for the fetcher.

  • optional – If True, this fetcher may be discarded if source/placeholder-enumeration fails in multi-fetcher mode.

  • cache_access – A CacheAccess instance. Defaults to a NOOP-implementation (i.e. always fetch new data).

property allow_fetch_all#

Flag indicating whether the fetch_all() operation is permitted.

assert_online()[source]#

Raise an error if offline.

Raises:

ConnectionStatusError – If not online.

property cache_access#

Return the CacheAccess for this fetcher.

classmethod default_mapper_kwargs()[source]#

Return default Mapper arguments for AbstractFetcher implementations.

classmethod default_score_function(value, candidates, context)[source]#

Compute score for candidates.

fetch(ids_to_fetch, placeholders=(), *, required=(), task_id=None, enable_uuid_heuristics=False)[source]#

Retrieve placeholder translations from the source.

Parameters:
  • ids_to_fetch – An iterable of IdsToFetch.

  • placeholders – All desired placeholders in preferred order.

  • required – Placeholders that must be included in the response.

  • task_id – Used for logging.

  • enable_uuid_heuristics – Improves matching when UUID-like IDs are in use.

Returns:

A mapping {source: PlaceholderTranslations} of translation elements.

Raises:

See also

🔑 This is a key event method. See Key Event Records for details.

Notes

Placeholders are usually columns in relational database applications. These are the components which are combined to create ID translations. See Format documentation for details.

fetch_all(placeholders=(), *, required=(), sources=None, task_id=None, enable_uuid_heuristics=False)[source]#

Fetch as much data as possible.

Parameters:
  • placeholders – All desired placeholders in preferred order.

  • required – Placeholders that must be included in the response.

  • sources – A subset of sources to fetch. Unknown sources are ignored. Set to None to fetch all sources.

  • task_id – Used for logging.

  • enable_uuid_heuristics – Improves matching when UUID-like IDs are in use.

Returns:

A mapping {source: PlaceholderTranslations} of translation elements.

See also

🔑 This is a key event method. See Key Event Records for details.

Raises:

abstract fetch_translations(instr)[source]#

Retrieve placeholder translations from the source.

Parameters:

instr – A single FetchInstruction for IDs to fetch. If IDs is None, the fetcher should retrieve data for as many IDs as possible.

Returns:

Placeholder translation elements.

Raises:

UnknownPlaceholderError – If the placeholder is unknown to the fetcher.

See also

🔑 This is a key event method. See Key Event Records for details.

id_column(source, *, candidates, task_id=None)[source]#

Return the ID column for source.

property identifiers#

A collection of hierarchical identifiers for this fetcher.

final initialize_sources(task_id=None, *, force=False)[source]#

Perform source discovery.

Parameters:
  • task_id – Used for logging.

  • force – If True, perform full discovery even if sources are already known.

See also

🔑 This is a key event method. See Key Event Records for details.

Notes

This function is called implicitly before every translation task. Result should be cached.

property logger#

Return the Logger that is used by this instance.

map_placeholders(source, placeholders, *, candidates=None, task_id=None)[source]#

Map placeholder names to the actual names seen in source.

This method calls Mapper.apply(values=placeholders, candidates=candidates, context=source) using the local AbstractFetcher.mapper instance.

Parameters:
  • source – The source to map placeholders for.

  • placeholders – Desired placeholders.

  • candidates – A subset of candidates (placeholder names) in source to map with placeholders.

  • task_id – Used for logging.

Returns:

A dict {wanted_placeholder_name: actual_placeholder_name_in_source}, where actual_placeholder_name_in_source will be None if the wanted placeholder could not be mapped to any of the candidates available for the source.

Raises:

UnknownSourceError – If source is not in sources.

See also

🔑 This is a key event method. See Key Event Records for details.
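The shape of the returned dict can be illustrated with a small standalone sketch. The helper map_names and its case-insensitive matching are assumptions made for illustration only; the real Mapper scores candidates in a configurable way and is not reproduced here.

```python
def map_names(wanted, candidates):
    """Map wanted placeholder names to candidate names, ignoring case.

    Hypothetical helper: names without a matching candidate map to None,
    mirroring the documented return value of map_placeholders().
    """
    by_lower = {candidate.lower(): candidate for candidate in candidates}
    return {name: by_lower.get(name.lower()) for name in wanted}


mapping = map_names(
    ["id", "name", "nickname"],
    candidates=["ID", "Name", "Population"],
)
print(mapping)
```

Here nickname has no counterpart among the candidates, so it maps to None, just as an unmappable placeholder would in the documented return value.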

property mapper#

Return the Mapper instance used for placeholder name mapping.

property online#

Return connectivity status. If False, no new translations may be fetched.

property optional#

Return True if this fetcher has been marked as optional.

In multi-fetcher mode, optional fetchers may be discarded if their sources cannot be resolved (i.e. source discovery raises an exception). Default value is False.

Returns:

Optionality status.

property placeholders#

Placeholders for all known Source names, such as id or name.

These are the (possibly unmapped) placeholders that may be used for translation.

Returns:

A dict {source: [placeholders..]}.

property selective_fetch_all#

If set, reduce the amount of data fetched by fetch_all().
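As a rough illustration of the selective behaviour, the sketch below keeps only those sources whose placeholders cover the required ones. The function select_sources and the sample data are hypothetical, and placeholder mapping is skipped for brevity; this is not the library's implementation.

```python
def select_sources(placeholders_by_source, required):
    """Return the sources that contain every required placeholder."""
    required = set(required)
    return [
        source
        for source, placeholders in placeholders_by_source.items()
        if required.issubset(placeholders)
    ]


# Hypothetical data on the form {source: [placeholders..]}.
known = {
    "cities": ["id", "name", "population"],
    "languages": ["id", "code"],
}
selected = select_sources(known, required=["id", "name"])
print(selected)
```

Only cities contains both id and name, so languages would be skipped and never retrieved.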

property sources#

A list of known Source names, such as cities or languages.

class CacheAccess[source]#

Bases: ABC, Generic[SourceType, IdType]

Interface for user-managed caching.

To enable caching, implement the abstract methods of the CacheAccess interface and pass it to the fetcher. See the 🚀 examples page to get started.

property enabled#

Return the enabled status for this CacheAccess.

Returns True by default. If this property is False, no other methods will be called.

abstract load(instr)[source]#

Load cached translations.

If this method returns None, the AbstractFetcher will use fetch_translations() instead. The fetcher will then call store() using instr and the newly fetched translations.

Parameters:

instr – A FetchInstruction.

Returns:

Cached PlaceholderTranslations or None.
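The call order described above can be sketched without the library. DictCache and fetch_with_cache are hypothetical stand-ins, and a plain string is used in place of a real FetchInstruction; a real implementation would subclass CacheAccess instead.

```python
class DictCache:
    """Hypothetical stand-in that mimics the load()/store() contract."""

    def __init__(self):
        self._data = {}
        self.enabled = True

    def load(self, instr):
        # Returning None signals a cache miss.
        return self._data.get(instr)

    def store(self, instr, translations):
        self._data[instr] = translations


def fetch_with_cache(cache, instr, fetch_translations):
    """Load first; on a miss, fetch and then store the new translations."""
    if cache.enabled:
        cached = cache.load(instr)
        if cached is not None:
            return cached  # store() is not called for cached data
    translations = fetch_translations(instr)
    if cache.enabled:
        cache.store(instr, translations)
    return translations


calls = []


def slow_fetch(instr):
    """Pretend to hit an external source, recording each call."""
    calls.append(instr)
    return {"id": [1, 2], "name": ["a", "b"]}


cache = DictCache()
first = fetch_with_cache(cache, "cities", slow_fetch)
second = fetch_with_cache(cache, "cities", slow_fetch)
print(len(calls))
```

The second call is served from the cache, so the external source is contacted only once.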

property parent#

Parent Fetcher instance.

The owner, typically an AbstractFetcher, should call set_parent() during initialization.

Returns:

The fetcher that owns this CacheAccess.

Raises:

RuntimeError – If called before the parent is set.

set_parent(parent)[source]#

Set parent instance.

Parameters:

parent – A Fetcher.

Raises:

RuntimeError – If a parent is already set.

abstract store(instr, translations)[source]#

Store fetched translations.

Note

This method will never be called with translations that were returned by load().

In other words, this method will only be called if CacheAccess.load(instr) returns None.

Hint

The CacheAccess is under no obligation to actually store translations.

For example, implementations may choose only to cache data when the FetchInstruction.fetch_all-property of the given instr is True.

Parameters:

class Fetcher[source]#

Bases: Generic[SourceType, IdType], HasSources[SourceType]

Interface for fetching translations from an external source.

abstract property allow_fetch_all#

Flag indicating whether the fetch_all() operation is permitted.

close()[source]#

Close the Fetcher. Does nothing by default.

abstract fetch(ids_to_fetch, placeholders=(), *, required=(), task_id=None, enable_uuid_heuristics=False)[source]#

Retrieve placeholder translations from the source.

Parameters:
  • ids_to_fetch – An iterable of IdsToFetch.

  • placeholders – All desired placeholders in preferred order.

  • required – Placeholders that must be included in the response.

  • task_id – Used for logging.

  • enable_uuid_heuristics – Improves matching when UUID-like IDs are in use.

Returns:

A mapping {source: PlaceholderTranslations} of translation elements.

Raises:

See also

🔑 This is a key event method. See Key Event Records for details.

Notes

Placeholders are usually columns in relational database applications. These are the components which are combined to create ID translations. See Format documentation for details.

abstract fetch_all(placeholders=(), *, required=(), sources=None, task_id=None, enable_uuid_heuristics=False)[source]#

Fetch as much data as possible.

Parameters:
  • placeholders – All desired placeholders in preferred order.

  • required – Placeholders that must be included in the response.

  • sources – A subset of sources to fetch. Unknown sources are ignored. Set to None to fetch all sources.

  • task_id – Used for logging.

  • enable_uuid_heuristics – Improves matching when UUID-like IDs are in use.

Returns:

A mapping {source: PlaceholderTranslations} of translation elements.

See also

🔑 This is a key event method. See Key Event Records for details.

Raises:

abstract initialize_sources(task_id=None, *, force=False)[source]#

Perform source discovery.

Parameters:
  • task_id – Used for logging.

  • force – If True, perform full discovery even if sources are already known.

See also

🔑 This is a key event method. See Key Event Records for details.

Notes

This function is called implicitly before every translation task. Result should be cached.

abstract property online#

Return connectivity status. If False, no new translations may be fetched.

property optional#

Return True if this fetcher has been marked as optional.

In multi-fetcher mode, optional fetchers may be discarded if their sources cannot be resolved (i.e. source discovery raises an exception). Default value is False.

Returns:

Optionality status.

class MemoryFetcher(data, return_all=True, **kwargs)[source]#

Bases: AbstractFetcher[SourceType, IdType]

Fetch from memory.

This is essentially a thin wrapper for the PlaceholderTranslations class.

Parameters:
  • data – A dict {source: PlaceholderTranslations} to fetch from.

  • return_all – If False, return only the requested IDs and placeholders.

  • **kwargs – See AbstractFetcher.
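The effect of return_all can be pictured with a standalone sketch. The function fetch_from_memory and the list-of-dicts layout are assumptions made for illustration; the real class works on PlaceholderTranslations objects.

```python
def fetch_from_memory(rows, ids, placeholders, return_all=True):
    """Return all rows as-is, or only the requested IDs and placeholders."""
    if return_all:
        return rows
    wanted = None if ids is None else set(ids)
    return [
        {key: row[key] for key in placeholders}
        for row in rows
        if wanted is None or row["id"] in wanted
    ]


# Hypothetical in-memory data for a single source.
rows = [
    {"id": 1, "name": "first", "extra": "x"},
    {"id": 2, "name": "second", "extra": "y"},
]
everything = fetch_from_memory(rows, ids=[2], placeholders=["id", "name"])
subset = fetch_from_memory(rows, ids=[2], placeholders=["id", "name"], return_all=False)
print(subset)
```

With return_all=True the request is ignored and all data comes back unchanged, which is usually the cheaper option when the data already lives in memory.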

fetch_translations(instr)[source]#

Retrieve placeholder translations from the source.

Parameters:

instr – A single FetchInstruction for IDs to fetch. If IDs is None, the fetcher should retrieve data for as many IDs as possible.

Returns:

Placeholder translation elements.

Raises:

UnknownPlaceholderError – If the placeholder is unknown to the fetcher.

See also

🔑 This is a key event method. See Key Event Records for details.

property return_all#

If False, fetch_translations() will filter by ID.

class MultiFetcher(*children, max_workers=1, on_source_conflict='raise', fetcher_discarded_log_level='DEBUG')[source]#

Bases: Fetcher[SourceType, IdType]

Fetcher which combines the results of other fetchers.

Parameters:
  • *children – Fetchers to wrap.

  • max_workers – Number of threads to use for fetching. Fetch instructions will be dispatched using a ThreadPoolExecutor. Individual fetchers will be called at most once per fetch() or fetch_all() call made with the MultiFetcher.

  • on_source_conflict – Action to take when multiple fetchers claim the same source.

  • fetcher_discarded_log_level – Level used when discarding optional fetchers.
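The dispatch model described for max_workers can be sketched with the standard library alone. The function fetch_from_children and the callables standing in for child fetchers are hypothetical; the sketch only shows one task per child being submitted to a ThreadPoolExecutor and the results being merged.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_from_children(children, sources_by_child, max_workers=1):
    """Call each child once with all of its sources and merge the results."""

    def fetch_one(name):
        return children[name](sources_by_child[name])

    merged = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # One task per child, so no child is called more than once.
        for result in executor.map(fetch_one, sources_by_child):
            merged.update(result)
    return merged


# Hypothetical children: callables returning {source: translations}.
children = {
    "sql": lambda sources: {s: f"sql:{s}" for s in sources},
    "files": lambda sources: {s: f"file:{s}" for s in sources},
}
result = fetch_from_children(
    children,
    {"sql": ["cities"], "files": ["languages"]},
    max_workers=2,
)
print(result)
```

Grouping sources per child before dispatch is what keeps the number of calls per child at one, regardless of how many sources it owns.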

property allow_fetch_all#

Flag indicating whether the fetch_all() operation is permitted.

property children#

Return child fetchers sorted by rank.

close()[source]#

Close all child fetchers.

fetch(ids_to_fetch, placeholders=(), *, required=(), task_id=None, enable_uuid_heuristics=False)[source]#

Retrieve placeholder translations from the source.

Parameters:
  • ids_to_fetch – An iterable of IdsToFetch.

  • placeholders – All desired placeholders in preferred order.

  • required – Placeholders that must be included in the response.

  • task_id – Used for logging.

  • enable_uuid_heuristics – Improves matching when UUID-like IDs are in use.

Returns:

A mapping {source: PlaceholderTranslations} of translation elements.

Raises:

See also

🔑 This is a key event method. See Key Event Records for details.

Notes

Placeholders are usually columns in relational database applications. These are the components which are combined to create ID translations. See Format documentation for details.

fetch_all(placeholders=(), *, required=(), sources=None, task_id=None, enable_uuid_heuristics=False)[source]#

Fetch as much data as possible.

Parameters:
  • placeholders – All desired placeholders in preferred order.

  • required – Placeholders that must be included in the response.

  • sources – A subset of sources to fetch. Unknown sources are ignored. Set to None to fetch all sources.

  • task_id – Used for logging.

  • enable_uuid_heuristics – Improves matching when UUID-like IDs are in use.

Returns:

A mapping {source: PlaceholderTranslations} of translation elements.

See also

🔑 This is a key event method. See Key Event Records for details.

Raises:

format_child(fetcher)[source]#

Format a managed fetcher with rank and hex ID.

get_child(source)[source]#

Return child fetcher for the given source.

get_sources(child)[source]#

Return sources for the given child.

initialize_sources(task_id=None, *, force=False)[source]#

Perform source discovery.

Perform source discovery for all children, discarding optional children that raise or do not return any sources when their respective Fetcher.initialize_sources() methods are called.

Parameters:
  • task_id – Used for logging.

  • force – If True, perform full discovery even if sources are already known.

See also

🔑 This is a key event method. See Key Event Records for details.

Notes

Calling this method multiple times will not recover previously discarded optional child fetchers.
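A minimal sketch of the discard rule follows, assuming children are represented as (discovery function, optional flag) pairs. The names are hypothetical, and keeping a required child that returns no sources is an assumption, since the text only specifies the behaviour for optional children.

```python
def discover(children):
    """Return {child_name: sources}, dropping optional children that fail."""
    kept = {}
    for name, (get_sources, optional) in children.items():
        try:
            sources = get_sources()
        except Exception:
            if optional:
                continue  # Optional child raised: discard it.
            raise  # Required child raised: propagate the error.
        if optional and not sources:
            continue  # Optional child without sources: discard it.
        kept[name] = sources
    return kept


def broken():
    """Simulate a child whose source discovery fails."""
    raise ConnectionError("database unreachable")


children = {
    "sql": (broken, True),
    "files": (lambda: ["cities", "languages"], False),
    "empty": (lambda: [], True),
}
kept = discover(children)
print(kept)
```

Both optional children are dropped here: sql because discovery raises, and empty because it returns no sources.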

property on_source_conflict#

Action to take when multiple fetchers claim the same source.

property online#

Return connectivity status. If False, no new translations may be fetched.

property placeholders#

Placeholders for all known Source names, such as id or name.

These are the (possibly unmapped) placeholders that may be used for translation.

Returns:

A dict {source: [placeholders..]}.

class PandasFetcher(read_function=None, read_path_format='data/{}.csv', read_function_kwargs=None, **kwargs)[source]#

Bases: AbstractFetcher[str, IdType]

Fetcher implementation using pandas.DataFrame as the data format.

Fetch data from serialized frames. How this is done is determined by the read_function. This is typically a Pandas function such as pandas.read_csv() or pandas.read_parquet(), but any function that accepts a string source as the first argument and returns a pandas.DataFrame can be used.

Hint

When using remote file systems, sources are resolved using AbstractFileSystem.glob(). If resolution fails, consider overriding the find_sources()-method.

Parameters:
  • read_function – A function (str) -> DataFrame. Derived from read_path_format if None. Strings are resolved by get_by_full_name() (with default_module=pandas).

  • read_path_format – A string of the form protocol://path/to/sources/{}.<ext>, or a callable to apply to a source before passing it to read_function.

  • read_function_kwargs – Additional keyword arguments for read_function.

  • **kwargs – See AbstractFetcher.

See also

The official Pandas IO documentation

fetch_translations(instr)[source]#

Retrieve placeholder translations from the source.

Parameters:

instr – A single FetchInstruction for IDs to fetch. If IDs is None, the fetcher should retrieve data for as many IDs as possible.

Returns:

Placeholder translation elements.

Raises:

UnknownPlaceholderError – If the placeholder is unknown to the fetcher.

See also

🔑 This is a key event method. See Key Event Records for details.

find_sources(task_id=None)[source]#

Resolve sources and their associated paths.

Parameters:

task_id – Used for logging.

Sources are resolved in three steps:

  1. Create glob pattern by calling format_source() with source='*'.

  2. Glob files using AbstractFileSystem.glob() (requires fsspec) or Path.glob().

  3. Strip the directory and file suffix from the globbed paths to create source names.

Returns:

A dict {source: path}.
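The three steps above can be sketched for the local file-system case using Path.glob() only. This standalone function assumes a string read_path_format (not a callable) and does not cover remote file systems or fsspec; it is an illustration, not the method's actual code.

```python
import tempfile
from pathlib import Path


def find_sources(read_path_format):
    """Resolve {source: path} for a local format such as 'data/{}.csv'."""
    pattern = Path(read_path_format.format("*"))  # Step 1: glob pattern.
    paths = sorted(pattern.parent.glob(pattern.name))  # Step 2: glob files.
    return {path.stem: str(path) for path in paths}  # Step 3: strip dir/suffix.


with tempfile.TemporaryDirectory() as root:
    # Create two sources and one unrelated file in a scratch directory.
    for name in ("cities", "languages"):
        Path(root, f"{name}.csv").write_text("id,name\n")
    Path(root, "notes.txt").write_text("not a source\n")

    sources = find_sources(f"{root}/{{}}.csv")

print(sorted(sources))
```

The unrelated notes.txt file does not match the pattern, so only cities and languages are reported as sources.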

format_source(source)[source]#

Get the path for source.

read(source_path)[source]#

Read a DataFrame from a source path.

Parameters:

source_path – Path to serialized DataFrame.

Returns:

A deserialized DataFrame.

class SqlFetcher(connection_string, password=None, whitelist_tables=None, blacklist_tables=(), schema=None, include_views=False, engine_kwargs=None, **kwargs)[source]#

Bases: AbstractFetcher[str, IdType]

Fetch data from a SQL source.

Parameters:
  • connection_string – A SQLAlchemy connection string.

  • password – Password to insert into the connection string. Will be escaped to allow for special characters. If given, the connection string must contain a password key, e.g. dialect://user:{password}@host:port.

  • whitelist_tables – The only tables the fetcher may access.

  • blacklist_tables – Tables the fetcher may not access.

  • schema – Database schema to use. Typically needed only if schema is not the default schema for the user specified in the connection string.

  • include_views – If True, the fetcher will discover and query views as well.

  • engine_kwargs – A dict of keyword arguments for sqlalchemy.create_engine().

  • **kwargs – See AbstractFetcher.

Raises:

ValueError – If both whitelist_tables and blacklist_tables are given.

Notes

Inheriting classes may override one or more of the following methods to further customize operation.

Overriding should be done with care, as methods may call each other internally.

class TableSummary(name, columns, fetch_all_permitted, id_column)#

Bases: Generic[IdType]

Brief description of a known table.

columns#

Columns of the table.

fetch_all_permitted#

A flag indicating that the FETCH_ALL-operation is permitted for this table.

id_column#

The ID column of the table.

name#

Name of the table.

property allow_fetch_all#

Flag indicating whether the fetch_all() operation is permitted.

cast_id_column_to_uuid(id_column, *, ids_are_uuid_like)[source]#

Apply UUID heuristics to the ID column.

This function attempts to cast the id_column to a suitable type by looking at the type of the column and the ids_are_uuid_like-flag.

If the column is already UUID-like (as determined by get_metadata()), the column is always returned as-is.

Parameters:
  • id_column – The ID sqlalchemy.sql.Column of the table.

  • ids_are_uuid_like – One of True and 'unknown' (never False). The value 'unknown' typically means that fetch_all() was called, but could also be a normal “translation” call without IDs.

Returns:

The id_column with or without a cast applied.

close()[source]#

Close the fetcher, discarding the engine.

classmethod create_engine(connection_string, password, engine_kwargs)[source]#

Factory method used by __init__.

For a more detailed description of the arguments and the behaviour of this function, see the class docstring.

Parameters:
  • connection_string – A SQLAlchemy connection string.

  • password – Password to insert into the connection string.

  • engine_kwargs – A dict of keyword arguments for sqlalchemy.create_engine().

Returns:

A new Engine.

property engine#

The Engine used by this fetcher.

fetch_translations(instr)[source]#

Retrieve placeholder translations from the source.

Parameters:

instr – A single FetchInstruction for IDs to fetch. If IDs is None, the fetcher should retrieve data for as many IDs as possible.

Returns:

Placeholder translation elements.

Raises:

UnknownPlaceholderError – If the placeholder is unknown to the fetcher.

See also

🔑 This is a key event method. See Key Event Records for details.

get_metadata()[source]#

Create a populated metadata object.

make_table_summary(table, id_column)[source]#

Create a table summary.

This function is called as a part of the fetcher initialization process.

Parameters:
  • table – The table (source) which is currently being processed.

  • id_column – The ID column of table.

Returns:

A summary object for table.

property online#

Return connectivity status. If False, no new translations may be fetched.

classmethod parse_connection_string(connection_string, password)[source]#

Parse a connection string.
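One plausible way to handle the password key described in the class parameters is sketched below. The function insert_password is hypothetical, and escaping with urllib.parse.quote_plus is an assumption based on common SQLAlchemy practice rather than a statement about what this method does.

```python
from urllib.parse import quote_plus


def insert_password(connection_string, password):
    """Insert a URL-escaped password into the '{password}' key, if given."""
    if password is None:
        return connection_string
    if "{password}" not in connection_string:
        raise ValueError("connection string must contain a '{password}' key")
    # Escape characters such as '@' and '/' that would break URL parsing.
    return connection_string.format(password=quote_plus(password))


url = insert_password(
    "postgresql://user:{password}@localhost:5432/db",
    "p@ss/word",
)
print(url)
```

Without escaping, the @ in the password would be read as the separator between credentials and host.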

select_where(select, *, ids, id_column, table)[source]#

User method for modifying SELECT statements.

The default implementation returns select as-is. Selection based on IDs is done before this method is called. Users may override this method to change which data is returned, e.g. by adding WHERE-clauses.

Parameters:
  • select – A sqlalchemy.sql.Select element. If returned as-is, all IDs in the table will be fetched.

  • ids – Set of IDs to fetch. Will be None if fetch_all() was called.

  • id_column – The ID sqlalchemy.sql.Column of the table, from which ids are fetched.

  • table – Table to select from.

Returns:

The final statement object to use.

uuid_like(id_column, ids)[source]#

Determine whether id_column should be passed to cast_id_column_to_uuid().

Note

Return values:

Parameters:
  • id_column – The ID sqlalchemy.sql.Column of the table.

  • ids – Set of IDs to fetch. Will be None if fetch_all() was called.

Returns:

One of True, False and 'unknown'. See above for explanation.

Modules

exceptions

Errors and warnings related to fetching.

types

Types related to translation fetching.