id_translation.fetching#
Fetching of translation data.
- Composite:
MultiFetcher
: Solution for using multiple simple fetchers, e.g. multiple databases or file-system locations. Or a combination thereof!
- Simple fetchers:
SqlFetcher
: Fetching from a single SQL database or schema.
PandasFetcher
: File-system fetching based on pandas read-functions. Valid URL schemes include http, ftp, s3, gs, and file.
MemoryFetcher
: In-memory solution, used primarily for testing.
- Base fetchers:
Fetcher
: Top-level interface definition. Base for all fetching implementations.
AbstractFetcher
: Implements high-level operations such as placeholder mapping.
Classes
| AbstractFetcher | Base class for retrieving translations from an external source. |
| CacheAccess | Utility class for managing the FETCH_ALL-cache. |
| CacheMetadata | Metadata pertaining to fetcher caching logic. |
| Fetcher | Interface for fetching translations from an external source. |
| MemoryFetcher | Fetch from memory. |
| MultiFetcher | Fetcher which combines the results of other fetchers. |
| PandasFetcher | Fetcher implementation using pandas DataFrames as the data format. |
| SqlFetcher | Fetch data from a SQL source. |
- class AbstractFetcher(mapper=None, allow_fetch_all=True, fetch_all_unmapped_values_action=None, selective_fetch_all=True, fetch_all_cache_max_age=None, cache_keys=None, optional=False)[source]#
Bases: Fetcher[SourceType, IdType]
Base class for retrieving translations from an external source.
Hint
Clear caches with CacheAccess.clear_all_cache_data(). Change the cache root directory with CacheAccess.BASE_CACHE_PATH.
- Parameters:
mapper – A Mapper instance used to adapt placeholder names in sources to wanted names, i.e. the names of the placeholders that are in the translation Format being used.
allow_fetch_all – If False, an error will be raised when fetch_all() is called.
fetch_all_unmapped_values_action – A temporary value to use for Mapper.unmapped_values_action while fetch_all() is executing. Setting fetch_all_unmapped_values_action='raise' is mutually exclusive with selective_fetch_all=True.
selective_fetch_all – If True, fetch only from those sources that contain the required placeholders (after mapping). May also reduce the number of placeholders retrieved.
fetch_all_cache_max_age – If given, determines the validity lifetime of data cached when fetch_all() calls are made. The regular fetch function will draw from this cache as well, but only fetch_all will update the cache. Furthermore, caching will never be used (read or write) if online is False.
cache_keys – A collection of hierarchical cache-key elements, see CacheMetadata. If given, element zero of the cache_keys is added to the logger name for the fetcher.
optional – If True, this fetcher may be discarded if source/placeholder-enumeration fails in multi-fetcher mode.
- Raises:
rics.action_level.BadActionLevelError – If selective_fetch_all is True and fetch_all_unmapped_values_action is 'raise'.
ValueError – If only one of fetch_all_cache_max_age and cache_keys is given.
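As a rough illustration of what selective_fetch_all=True implies, the sketch below keeps only the sources that offer every required placeholder. It is a standalone toy and not the library's implementation: select_sources and the sample data are hypothetical, and the real fetcher operates on mapped placeholder names.

```python
from typing import Dict, Iterable, List


def select_sources(
    placeholders_by_source: Dict[str, List[str]],
    required: Iterable[str],
) -> List[str]:
    """Return the sources whose placeholders include every required placeholder."""
    required_set = set(required)
    return [
        source
        for source, placeholders in placeholders_by_source.items()
        if required_set.issubset(placeholders)
    ]


# Hypothetical sources and placeholders, for illustration only.
known = {
    "cities": ["id", "name", "country"],
    "languages": ["id", "name"],
    "events": ["id", "timestamp"],
}
selected = select_sources(known, required=["id", "name"])
print(selected)
```

With no required placeholders, every source qualifies, which corresponds to an unrestricted fetch.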
- property allow_fetch_all#
Flag indicating whether the fetch_all() operation is permitted.
- assert_online()[source]#
Raise an error if offline.
- Raises:
ConnectionStatusError – If not online.
- property cache_enabled#
Return the caching status for the fetcher.
- classmethod default_mapper_kwargs()[source]#
Return default Mapper arguments for AbstractFetcher implementations.
- classmethod default_score_function(value, candidates, context)[source]#
Compute score for candidates.
- fetch(ids_to_fetch, placeholders=(), required=(), task_id=None, enable_uuid_heuristics=False)[source]#
Retrieve placeholder translations from the source.
- Parameters:
ids_to_fetch – An iterable of IdsToFetch.
placeholders – All desired placeholders in preferred order.
required – Placeholders that must be included in the response.
task_id – Used for logging.
enable_uuid_heuristics – If set, apply heuristics to improve matching with UUID-like IDs.
- Returns:
A mapping {source: PlaceholderTranslations} of translation elements.
- Raises:
UnknownPlaceholderError – For placeholder(s) that are unknown to the Fetcher.
UnknownSourceError – For source(s) that are unknown to the Fetcher.
ForbiddenOperationError – If trying to fetch all IDs when not possible or permitted.
ImplementationError – For errors made by the inheriting implementation.
Notes
Placeholders are usually columns in relational database applications. These are the components which are combined to create ID translations. See Format documentation for details.
- fetch_all(placeholders=(), *, required=(), task_id=None, enable_uuid_heuristics=False)[source]#
Fetch as much data as possible.
- Parameters:
placeholders – All desired placeholders in preferred order.
required – Placeholders that must be included in the response.
task_id – Used for logging.
enable_uuid_heuristics – If set, apply heuristics to improve matching with UUID-like IDs.
- Returns:
A mapping {source: PlaceholderTranslations} of translation elements.
- Raises:
ForbiddenOperationError – If fetching all IDs is not possible or permitted.
UnknownPlaceholderError – For placeholder(s) that are unknown to the Fetcher.
ImplementationError – For errors made by the inheriting implementation.
- abstract fetch_translations(instr)[source]#
Retrieve placeholder translations from the source.
- Parameters:
instr – A single FetchInstruction for IDs to fetch. If IDs is None, the fetcher should retrieve data for as many IDs as possible.
- Returns:
Placeholder translation elements.
- Raises:
UnknownPlaceholderError – If the placeholder is unknown to the fetcher.
See also
🔑 This is a key event method. See Key Event Records for details.
- final initialize_sources(task_id=-1, *, force=False)[source]#
Perform source discovery.
- Parameters:
task_id – Used for logging.
force – If True, perform full discovery even if sources are already known.
- property logger#
Return the Logger that is used by this instance.
- map_placeholders(source, placeholders, *, candidates=None, clear_cache=False, task_id=None)[source]#
Map placeholder names to the actual names seen in source.
This method calls Mapper.apply(values=placeholders, candidates=candidates, context=source) using this fetcher's AbstractFetcher.mapper instance. It is assumed that names in sources rarely change, so mappings are cached until the fetcher is recreated or until this method is called with clear_cache=True.
Placeholder mapping caching should not be confused with FETCH_ALL data caching.
- Parameters:
source – The source to map placeholders for.
placeholders – Desired placeholders.
candidates – A subset of candidates (placeholder names) in source to map with placeholders.
clear_cache – If True, force a full remap.
task_id – Used for logging purposes.
- Returns:
A dict {wanted_placeholder_name: actual_placeholder_name_in_source}, where actual_placeholder_name_in_source will be None if the wanted placeholder could not be mapped to any of the candidates available for the source.
- Raises:
UnknownPlaceholderError – If any of required_placeholders are incorrectly mapped, or not mapped at all.
See also
🔑 This is a key event method. See Key Event Records for details.
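The cached mapping behaviour of map_placeholders() can be pictured with the toy class below. It is only a sketch under stated assumptions: PlaceholderMappingCache is hypothetical, and a case-insensitive name comparison stands in for the real Mapper scoring logic. It shows the documented return shape (unmapped placeholders become None) and that a repeated call reuses cached results until clear_cache=True is passed.

```python
from typing import Dict, Iterable, Optional, Tuple


class PlaceholderMappingCache:
    """Toy stand-in for per-source placeholder mapping with caching."""

    def __init__(self) -> None:
        self._cache: Dict[Tuple[str, str], Optional[str]] = {}
        self.computed = 0  # number of mappings actually computed (cache misses)

    def map_placeholders(
        self,
        source: str,
        placeholders: Iterable[str],
        candidates: Iterable[str],
        *,
        clear_cache: bool = False,
    ) -> Dict[str, Optional[str]]:
        if clear_cache:
            # Drop cached mappings for this source only, forcing a full remap.
            self._cache = {k: v for k, v in self._cache.items() if k[0] != source}
        # Assumption: match names case-insensitively instead of using a Mapper.
        by_lowercase = {candidate.lower(): candidate for candidate in candidates}
        result: Dict[str, Optional[str]] = {}
        for wanted in placeholders:
            key = (source, wanted)
            if key not in self._cache:
                self.computed += 1
                self._cache[key] = by_lowercase.get(wanted.lower())
            result[wanted] = self._cache[key]
        return result


cache = PlaceholderMappingCache()
first = cache.map_placeholders("cities", ["id", "name", "population"], ["ID", "Name"])
second = cache.map_placeholders("cities", ["id", "name", "population"], ["ID", "Name"])
computed_before_clear = cache.computed
cache.map_placeholders("cities", ["id"], ["ID", "Name"], clear_cache=True)
```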
- property mapper#
Return the Mapper instance used for placeholder name mapping.
- property online#
Return connectivity status. If False, no new translations may be fetched.
- property optional#
Return True if this fetcher has been marked as optional.
In multi-fetcher mode, optional fetchers may be discarded if sources cannot be resolved (raises an exception). Default value is False.
- Returns:
Optionality status.
- property placeholders#
Placeholders for all known Source names, such as id or name.
These are the (possibly unmapped) placeholders that may be used for translation.
- Returns:
A dict {source: [placeholders..]}.
- property selective_fetch_all#
If set, reduce the amount of data fetched by fetch_all().
- property sources#
A list of known Source names, such as cities or languages.
- class CacheAccess(max_age, metadata)[source]#
Bases: Generic[SourceType, IdType]
Utility class for managing the FETCH_ALL-cache.
- Parameters:
max_age – Cache timeout.
metadata – Metadata object used to determine cache validity.
- BASE_CACHE_PATH = PosixPath('/home/docs/.cache/id-translation/cached-fetcher-data')#
Top-level cache dir for all fetchers managed by any CacheAccess instance.
- CLEAR_CACHE_EXCEPTION_TYPES = (<class '_pickle.UnpicklingError'>,)#
Error types which trigger cache deletion
- classmethod base_cache_dir_for_all_fetchers()[source]#
Top-level cache dir for all fetchers managed by any CacheAccess instance.
- property cache_dir#
Get the cache directory used by this CacheAccess.
Created from BASE_CACHE_PATH and the first value of CacheMetadata.cache_keys.
- Returns:
Cache dir for a single fetcher.
- clear(reason, log_level=10, *, exc_info=False)[source]#
Remove cached data for the current instance.
- classmethod clear_all_cache_data()[source]#
Remove the entire cache directory tree for ALL instances.
- property data_dir#
- property metadata_path#
- class CacheMetadata(*, cache_keys, placeholders, **kwargs)[source]#
Bases: BaseMetadata, Generic[SourceType, IdType], HasSources[SourceType]
Metadata pertaining to fetcher caching logic.
- Parameters:
cache_keys – Hierarchical identifiers for the cache. The first key is used to determine storage location, while the rest are used to detect configuration changes (which invalidate the cache). A typical key would be [config-file-name, config-file-sha].
placeholders – A Source-to-placeholder dict.
**kwargs – Forwarded to base classes.
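The two roles of the hierarchical cache keys can be sketched as follows. This is a minimal illustration, assuming a simple list comparison; cache_location, config_changed, the directory name and the key values are all hypothetical and do not reflect how CacheMetadata is implemented.

```python
from pathlib import Path
from typing import Sequence


def cache_location(base: Path, cache_keys: Sequence[str]) -> Path:
    """The first key alone decides where cached data is stored."""
    return base / cache_keys[0]


def config_changed(stored_keys: Sequence[str], current_keys: Sequence[str]) -> bool:
    """Any difference in the remaining keys signals a changed configuration."""
    return list(stored_keys[1:]) != list(current_keys[1:])


base = Path("cached-fetcher-data")  # placeholder root, not the real default
old_keys = ["fetcher.toml", "sha-1111"]  # hypothetical [config-file-name, config-file-sha]
new_keys = ["fetcher.toml", "sha-2222"]

same_location = cache_location(base, old_keys) == cache_location(base, new_keys)
stale = config_changed(old_keys, new_keys)
```

Here the edited configuration file maps to the same storage location, but the changed hash marks the previously cached data as invalid.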
- property cache_keys#
Yields hierarchical cache keys for this metadata.
- property placeholders#
Placeholders for all known Source names, such as id or name.
These are the (possibly unmapped) placeholders that may be used for translation.
- Returns:
A dict {source: [placeholders..]}.
- class Fetcher[source]#
Bases: Generic[SourceType, IdType], HasSources[SourceType]
Interface for fetching translations from an external source.
- abstract property allow_fetch_all#
Flag indicating whether the fetch_all() operation is permitted.
- abstract fetch(ids_to_fetch, placeholders=(), *, required=(), task_id=None, enable_uuid_heuristics=False)[source]#
Retrieve placeholder translations from the source.
- Parameters:
ids_to_fetch – An iterable of IdsToFetch.
placeholders – All desired placeholders in preferred order.
required – Placeholders that must be included in the response.
task_id – Used for logging.
enable_uuid_heuristics – If set, apply heuristics to improve matching with UUID-like IDs.
- Returns:
A mapping {source: PlaceholderTranslations} of translation elements.
- Raises:
UnknownPlaceholderError – For placeholder(s) that are unknown to the Fetcher.
UnknownSourceError – For source(s) that are unknown to the Fetcher.
ForbiddenOperationError – If trying to fetch all IDs when not possible or permitted.
ImplementationError – For errors made by the inheriting implementation.
Notes
Placeholders are usually columns in relational database applications. These are the components which are combined to create ID translations. See Format documentation for details.
- abstract fetch_all(placeholders=(), *, required=(), task_id=None, enable_uuid_heuristics=False)[source]#
Fetch as much data as possible.
- Parameters:
placeholders – All desired placeholders in preferred order.
required – Placeholders that must be included in the response.
task_id – Used for logging.
enable_uuid_heuristics – If set, apply heuristics to improve matching with UUID-like IDs.
- Returns:
A mapping {source: PlaceholderTranslations} of translation elements.
- Raises:
ForbiddenOperationError – If fetching all IDs is not possible or permitted.
UnknownPlaceholderError – For placeholder(s) that are unknown to the Fetcher.
ImplementationError – For errors made by the inheriting implementation.
- abstract initialize_sources(task_id=-1, *, force=False)[source]#
Perform source discovery.
- Parameters:
task_id – Used for logging.
force – If True, perform full discovery even if sources are already known.
- abstract property online#
Return connectivity status. If False, no new translations may be fetched.
- class MemoryFetcher(data=None, return_all=True, **kwargs)[source]#
Bases: AbstractFetcher[SourceType, IdType]
Fetch from memory.
- Parameters:
data – A dict {source: PlaceholderTranslations} to fetch from.
return_all – If False, return only the requested IDs and placeholders.
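The effect of return_all can be approximated with plain Python data. This is a sketch only: fetch_from_memory is a hypothetical helper, and a list of dicts stands in for the PlaceholderTranslations type, which is not modelled here.

```python
from typing import Dict, Iterable, List

Record = Dict[str, object]


def fetch_from_memory(
    records: List[Record],
    ids: Iterable[object],
    placeholders: Iterable[str],
    return_all: bool = True,
) -> List[Record]:
    """Return everything, or only the requested IDs and placeholders."""
    if return_all:
        return list(records)
    wanted_ids = set(ids)
    wanted_placeholders = list(placeholders)
    return [
        {name: record[name] for name in wanted_placeholders}
        for record in records
        if record["id"] in wanted_ids
    ]


# Hypothetical in-memory data for a single source.
cities = [
    {"id": 1, "name": "Stockholm", "country": "SE"},
    {"id": 2, "name": "Oslo", "country": "NO"},
]
everything = fetch_from_memory(cities, ids=[2], placeholders=["id", "name"])
trimmed = fetch_from_memory(cities, ids=[2], placeholders=["id", "name"], return_all=False)
```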
- fetch_translations(instr)[source]#
Retrieve placeholder translations from the source.
- Parameters:
instr – A single FetchInstruction for IDs to fetch. If IDs is None, the fetcher should retrieve data for as many IDs as possible.
- Returns:
Placeholder translation elements.
- Raises:
UnknownPlaceholderError – If the placeholder is unknown to the fetcher.
See also
🔑 This is a key event method. See Key Event Records for details.
- class MultiFetcher(*children, max_workers=1, duplicate_translation_action=ActionLevel.WARN, duplicate_source_discovered_action=ActionLevel.WARN, optional_fetcher_discarded_log_level='DEBUG')[source]#
Bases: Fetcher[SourceType, IdType]
Fetcher which combines the results of other fetchers.
- Parameters:
*children – Fetchers to wrap.
max_workers – Number of threads to use for fetching. Fetch instructions will be dispatched using a ThreadPoolExecutor. Individual fetchers will be called at most once per fetch() or fetch_all() call made with the MultiFetcher.
duplicate_translation_action – Action to take when multiple fetchers return translations for the same source.
duplicate_source_discovered_action – Action to take when multiple fetchers claim the same source.
optional_fetcher_discarded_log_level – Log level used when discarding optional fetchers for any reason.
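The dispatch pattern described for max_workers can be sketched with the standard library alone. The code below is an assumption-laden toy, not MultiFetcher itself: children are plain callables, fetch_from_children is hypothetical, and "first result wins" stands in for whatever duplicate_translation_action would do.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Child = Callable[[], Dict[str, List[str]]]


def fetch_from_children(children: List[Child], max_workers: int = 1) -> Dict[str, List[str]]:
    """Call each child exactly once and combine the per-source results."""
    combined: Dict[str, List[str]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(child) for child in children]
        # Collect in submission order so the outcome is deterministic.
        for future in futures:
            for source, translations in future.result().items():
                if source in combined:
                    # Stand-in for duplicate handling: keep the first result.
                    continue
                combined[source] = translations
    return combined


def sql_child() -> Dict[str, List[str]]:
    return {"cities": ["1:Stockholm"]}


def file_child() -> Dict[str, List[str]]:
    return {"languages": ["sv:Swedish"], "cities": ["1:Duplicate"]}


result = fetch_from_children([sql_child, file_child], max_workers=2)
```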
- property allow_fetch_all#
Flag indicating whether the fetch_all() operation is permitted.
- property children#
Return child fetchers.
- property duplicate_source_discovered_action#
Return action to take when multiple fetchers claim the same source.
- property duplicate_translation_action#
Return action to take when multiple fetchers return translations for the same source.
- fetch(ids_to_fetch, placeholders=(), *, required=(), task_id=None, enable_uuid_heuristics=False)[source]#
Retrieve placeholder translations from the source.
- Parameters:
ids_to_fetch – An iterable of IdsToFetch.
placeholders – All desired placeholders in preferred order.
required – Placeholders that must be included in the response.
task_id – Used for logging.
enable_uuid_heuristics – If set, apply heuristics to improve matching with UUID-like IDs.
- Returns:
A mapping {source: PlaceholderTranslations} of translation elements.
- Raises:
UnknownPlaceholderError – For placeholder(s) that are unknown to the Fetcher.
UnknownSourceError – For source(s) that are unknown to the Fetcher.
ForbiddenOperationError – If trying to fetch all IDs when not possible or permitted.
ImplementationError – For errors made by the inheriting implementation.
Notes
Placeholders are usually columns in relational database applications. These are the components which are combined to create ID translations. See Format documentation for details.
- fetch_all(placeholders=(), *, required=(), task_id=None, enable_uuid_heuristics=False)[source]#
Fetch as much data as possible.
- Parameters:
placeholders – All desired placeholders in preferred order.
required – Placeholders that must be included in the response.
task_id – Used for logging.
enable_uuid_heuristics – If set, apply heuristics to improve matching with UUID-like IDs.
- Returns:
A mapping {source: PlaceholderTranslations} of translation elements.
- Raises:
ForbiddenOperationError – If fetching all IDs is not possible or permitted.
UnknownPlaceholderError – For placeholder(s) that are unknown to the Fetcher.
ImplementationError – For errors made by the inheriting implementation.
- initialize_sources(task_id=-1, *, force=False)[source]#
Perform source discovery.
- Parameters:
task_id – Used for logging.
force – If True, perform full discovery even if sources are already known.
- property online#
Return connectivity status. If False, no new translations may be fetched.
- property placeholders#
Placeholders for all known Source names, such as id or name.
These are the (possibly unmapped) placeholders that may be used for translation.
- Returns:
A dict {source: [placeholders..]}.
- class PandasFetcher(read_function='read_csv', read_path_format='data/{}.csv', read_function_kwargs=None, online=False, **kwargs)[source]#
Bases: AbstractFetcher[str, IdType]
Fetcher implementation using pandas DataFrames as the data format.
Fetch data from serialized DataFrames. How this is done is determined by the read_function. This is typically a Pandas function such as pandas.read_csv() or pandas.read_parquet(), but any function that accepts a string source as the first argument and returns a data frame can be used.
Hint
When using remote file systems, sources are resolved using AbstractFileSystem.glob(). If resolution fails, consider overriding the find_sources() method.
- Parameters:
read_function – A Pandas read-function. If a string is given, the function is resolved using rics.misc.get_by_full_name(). Unqualified names are assumed to belong to the pandas namespace.
read_path_format – A string of the form protocol://path/to/sources/{}.ext, or a callable to apply to a source before passing it to read_function.
read_function_kwargs – Additional keyword arguments for read_function.
online – Setting online=False typically indicates that files are hosted at a location where there are access limitations, e.g. through data transfer fees.
See also
The official Pandas IO documentation
- fetch_translations(instr)[source]#
Retrieve placeholder translations from the source.
- Parameters:
instr – A single FetchInstruction for IDs to fetch. If IDs is None, the fetcher should retrieve data for as many IDs as possible.
- Returns:
Placeholder translation elements.
- Raises:
UnknownPlaceholderError – If the placeholder is unknown to the fetcher.
See also
🔑 This is a key event method. See Key Event Records for details.
- find_sources(task_id=-1)[source]#
Resolve sources and their associated paths.
- Parameters:
task_id – Used for logging.
Sources are resolved in three steps:
1. Create a glob pattern by calling format_source() with source='*'.
2. Glob files using AbstractFileSystem.glob() (requires fsspec) or Path.glob().
3. Strip the directory and file suffix from the globbed paths to create source names.
- Returns:
A dict {source: path}.
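For local files, the three steps above can be sketched with pathlib only. This is an illustrative approximation, assuming a string read_path_format and a local directory; the find_sources function below is a hypothetical free function, not the method on PandasFetcher, and remote file systems via fsspec are not covered.

```python
import tempfile
from pathlib import Path
from typing import Dict


def find_sources(read_path_format: str) -> Dict[str, str]:
    """Resolve {source: path} for files matching a '{}'-style path format."""
    # Step 1: create the glob pattern by formatting with '*'.
    pattern = Path(read_path_format.format("*"))
    # Step 2: glob files in the directory part of the pattern.
    paths = sorted(pattern.parent.glob(pattern.name))
    # Step 3: strip directory and suffix to obtain the source names.
    return {path.stem: str(path) for path in paths}


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical source files; only the .csv files should be discovered.
    for name in ("cities", "languages"):
        (Path(tmp) / f"{name}.csv").write_text("id,name\n")
    (Path(tmp) / "notes.txt").write_text("not a source\n")
    sources = find_sources(f"{tmp}/{{}}.csv")
    print(sorted(sources))
```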
- property online#
Return connectivity status. If False, no new translations may be fetched.
- class SqlFetcher(connection_string, password=None, whitelist_tables=None, blacklist_tables=(), schema=None, include_views=False, engine_kwargs=None, **kwargs)[source]#
Bases: AbstractFetcher[str, IdType]
Fetch data from a SQL source.
- Parameters:
connection_string – A SQLAlchemy connection string.
password – Password to insert into the connection string. Will be escaped to allow for special characters. If given, the connection string must contain a password key, e.g. dialect://user:{password}@host:port.
whitelist_tables – The only tables the fetcher may access.
blacklist_tables – Tables the fetcher may not access.
schema – Database schema to use. Typically needed only if schema is not the default schema for the user specified in the connection string.
include_views – If True, the fetcher will discover and query views as well.
engine_kwargs – A dict of keyword arguments for sqlalchemy.create_engine().
**kwargs – See AbstractFetcher.
- Raises:
ValueError – If both whitelist_tables and blacklist_tables are given.
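Inserting an escaped password into a connection string might look roughly like the sketch below. It assumes URL-style escaping with urllib.parse.quote_plus, which is an assumption about the escaping scheme rather than a statement about what SqlFetcher does internally; insert_password and the sample credentials are hypothetical.

```python
from typing import Optional
from urllib.parse import quote_plus


def insert_password(connection_string: str, password: Optional[str]) -> str:
    """Substitute an escaped password for the '{password}' key, if given."""
    if password is None:
        return connection_string
    if "{password}" not in connection_string:
        raise ValueError("connection string must contain a '{password}' key")
    # Escape characters such as '@' and '/' that would otherwise break the URL.
    return connection_string.replace("{password}", quote_plus(password))


# Hypothetical connection string and password, for illustration only.
url = insert_password("postgresql://user:{password}@localhost:5432/db", "p@ss/word!")
print(url)
```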
Notes
Inheriting classes may override one or more of the following methods to further customize operation.
- create_engine(); initializes the SQLAlchemy engine. Calls parse_connection_string.
- parse_connection_string(); does basic URL encoding. Called by create_engine.
- select_where(); filter values on the id_column (from cast_id_column_to_uuid) of the current table.
- make_table_summary(); creates TableSummary instances.
- uuid_like(); determine if casting (with cast_id_column_to_uuid) is needed.
- cast_id_column_to_uuid(); attempt to cast the id_column to UUID.
Overriding should be done with care, as methods may call each other internally.
- class TableSummary(name, columns, fetch_all_permitted, id_column)#
Brief description of a known table.
- columns#
The columns of the table.
- fetch_all_permitted#
A flag indicating that the FETCH_ALL-operation is permitted for this table.
- id_column#
The ID column of the table.
- name#
Name of the table.
- property allow_fetch_all#
Flag indicating whether the fetch_all() operation is permitted.
- cast_id_column_to_uuid(id_column, *, ids_are_uuid_like)[source]#
Apply UUID heuristics to the ID column.
This function attempts to cast the id_column to a suitable type by looking at the type of the column and the ids_are_uuid_like flag.
If the column is already UUID-like (as determined by get_metadata()), the column is always returned as-is.
- Parameters:
id_column – The ID sqlalchemy.sql.Column of the table.
ids_are_uuid_like – One of True and 'unknown' (never False). The latter typically means that fetch_all() was called, but could also be a normal “translation” call without IDs.
- Returns:
The id_column with or without a cast applied.
- classmethod create_engine(connection_string, password, engine_kwargs)[source]#
Factory method used by __init__.
For a more detailed description of the arguments and the behaviour of this function, see the class docstring.
- Parameters:
connection_string – A SQLAlchemy connection string.
password – Password to insert into the connection string.
engine_kwargs – A dict of keyword arguments for sqlalchemy.create_engine().
- Returns:
A new Engine.
- fetch_translations(instr)[source]#
Retrieve placeholder translations from the source.
- Parameters:
instr – A single FetchInstruction for IDs to fetch. If IDs is None, the fetcher should retrieve data for as many IDs as possible.
- Returns:
Placeholder translation elements.
- Raises:
UnknownPlaceholderError – If the placeholder is unknown to the fetcher.
See also
🔑 This is a key event method. See Key Event Records for details.
- make_table_summary(table, id_column)[source]#
Create a table summary.
This function is called as a part of the fetcher initialization process.
- Parameters:
table – The table (source) which is currently being processed.
id_column – The ID column of table.
- Returns:
A summary object for table.
- property online#
Return connectivity status. If False, no new translations may be fetched.
- classmethod parse_connection_string(connection_string, password)[source]#
Parse a connection string.
- classmethod select_where(select, *, ids, id_column, table)[source]#
Add WHERE clause(s) to an ID select statement.
Warning
When overriding, keep in mind that returning the select statement as-is will perform an unfiltered select.
- Parameters:
select – A sqlalchemy.sql.Select element. If returned as-is, all IDs in the table will be fetched.
ids – Set of IDs to fetch. Will be None if fetch_all() was called.
id_column – The ID sqlalchemy.sql.Column of the table, from which ids are fetched.
table – Table to select from.
- Returns:
The final statement object to use.
- uuid_like(id_column, ids)[source]#
Determine whether id_column should be passed to cast_id_column_to_uuid().
Note
Will not be called unless Translator.enable_uuid_heuristics is True.
Only False will bypass calling cast_id_column_to_uuid().
- Return values:
True: Attempt to cast using cast_id_column_to_uuid() with ids_are_uuid_like=True.
False: Do not cast; cast_id_column_to_uuid() will not be called.
None: Attempt to cast using cast_id_column_to_uuid() with ids_are_uuid_like='unknown'.
- Parameters:
id_column – The ID sqlalchemy.sql.Column of the table.
ids – Set of IDs to fetch. Will be None if fetch_all() was called.
- Returns:
One of True, False and None. See above for explanation.
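The three return values listed above amount to a small decision table. The sketch below restates it as a function; plan_cast is a hypothetical helper written for illustration, and it only models which ids_are_uuid_like argument would be used, not the cast itself.

```python
from typing import Optional, Union


def plan_cast(uuid_like_result: Optional[bool]) -> Optional[Union[bool, str]]:
    """Return the ids_are_uuid_like argument to use, or None to skip casting."""
    if uuid_like_result is False:
        # Only False bypasses cast_id_column_to_uuid() entirely.
        return None
    if uuid_like_result is True:
        return True
    # A result of None means the IDs could not be classified.
    return "unknown"


decisions = {str(value): plan_cast(value) for value in (True, False, None)}
```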
Modules
- Errors and warnings related to fetching.
- Types related to translation fetching.