A CacheAccess implementation#

A CacheAccess solution that stores data locally on disk. Click here to download the full script.

Design goals#

We’ve arbitrarily decided on the following requirements:

  1. Data should only be cached if the fetcher is performing a fetch_all operation.

  2. Cached data should be stored on disk using the feather format.

  3. Cached data should have a timeout (TTL), measured in seconds.

We’ll create a new class, MyCacheAccess, to meet these requirements.

Implementation#

The new class needs to know where to store data and how long to keep it.

The __init__ method.#
def __init__(self, root: str, ttl: int) -> None:
    super().__init__()
    self._root = Path(root)
    self._ttl = ttl  # In seconds

    self._root.mkdir(parents=True, exist_ok=True)

We can now start implementing the abstract methods in CacheAccess. We’ll start with CacheAccess.store():

The MyCacheAccess.store() method.#
def store(
    self,
    instr: FetchInstruction[SourceType, IdType],
    translations: PlaceholderTranslations[SourceType],
) -> None:
    if not instr.fetch_all:
        print(
            f"Refuse caching of source={instr.source!r}"
            " since FetchInstruction.fetch_all=False."
        )
        return

    df = translations.to_pandas()
    path = self._root / f"{translations.source}.ftr"
    print(f"Store cache at path='{path}'.")
    df.to_feather(path)

Requirement 1: If FetchInstruction.fetch_all is False, data should not be stored.

Otherwise, we use the source as the file name and convert the translations to a DataFrame using PlaceholderTranslations.to_pandas(). Requirement 2: The frame is written to disk using pandas.DataFrame.to_feather().

We’re now ready to implement CacheAccess.load(), which will read, verify, and convert the stored data.

The MyCacheAccess.load() method.#
def load(
    self,
    instr: FetchInstruction[SourceType, IdType],
) -> PlaceholderTranslations[SourceType] | None:
    path = self._root / f"{instr.source}.ftr"

    if not path.exists():
        print(f"Cache at path='{path}' does not exist.")
        return None

    age = self.age_in_seconds(path)

    if age > self._ttl:
        print(f"Reject cache ({age=} > ttl={self._ttl}) at path='{path}'.")
        return None

    print(f"Load cache (age={age} <= {self._ttl}=ttl) at path='{path}'.")
    df = pd.read_feather(path)
    return PlaceholderTranslations.from_dataframe(instr.source, df)

As per Requirement 3, we should only return data that is at most ttl seconds old. To compute the age, we’ll use the modification time of the serialized data, as reported by the operating system.

The MyCacheAccess.age_in_seconds() method.#
@staticmethod
def age_in_seconds(path: Path) -> int:
    timestamp = path.stat().st_mtime
    modified = datetime.fromtimestamp(timestamp)
    seconds = (datetime.now() - modified).total_seconds()
    return round(seconds)

If the data is stale, we return None.

Hint

Returning None signals to the caller that data should be retrieved some other way; typically by using AbstractFetcher.fetch_translations() instead.

The data is read using pandas.read_feather(), then converted using PlaceholderTranslations.from_dataframe().
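The age check behind Requirement 3 can be tried in isolation. The following is a minimal sketch using only the standard library; the is_fresh() helper is hypothetical and not part of MyCacheAccess. It compares epoch timestamps directly instead of subtracting naive datetime objects, which may be off by an hour across a daylight-saving transition. The file is backdated with os.utime() to simulate a stale cache.

```python
import os
import tempfile
import time
from pathlib import Path


def age_in_seconds(path: Path) -> float:
    """Seconds since `path` was last modified, based on epoch timestamps."""
    return time.time() - path.stat().st_mtime


def is_fresh(path: Path, ttl: float) -> bool:
    """Hypothetical helper: True if `path` exists and is at most `ttl` seconds old."""
    return path.exists() and age_in_seconds(path) <= ttl


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "people.ftr"
    path.write_bytes(b"placeholder")  # Stand-in for real feather data.
    fresh_now = is_fresh(path, ttl=3600)

    # Backdate the file by two hours to simulate a stale cache.
    two_hours_ago = time.time() - 2 * 3600
    os.utime(path, (two_hours_ago, two_hours_ago))
    fresh_later = is_fresh(path, ttl=3600)

print(fresh_now, fresh_later)  # True False
```

Backdating the modification time is also a convenient way to test TTL handling without waiting for the timeout to pass.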

Creating a cached fetcher#

All AbstractFetcher implementations accept an optional cache_access keyword argument.

Creating a Translator with a cached fetcher.#
def create() -> Translator[str, str, int]:
    cache_access = MyCacheAccess(root="./cache/", ttl=3600)
    fetcher = MemoryFetcher(
        data={"people": {1904: "Fred"}},
        cache_access=cache_access,
    )
    return Translator(fetcher)

Using a CacheAccess with a MemoryFetcher doesn’t make much sense, but the caching procedure works just the same as it would for e.g. a SqlFetcher.

Hint

To configure caching using TOML, add a [fetching.cache] section.

The type key is required. Other keys are determined by the implementation.

Equivalent caching section of a TOML fetcher config.#
[fetching.cache]
type = "__main__.MyCacheAccess"
root = "./cache/"
ttl = 3600

See the Configuration page for more information.

Caching in action#

We’ll use the create() function defined above to initialize new Translator instances.

Step 1#
translator = create()
print("person=", translator.translate(1904, "people"))

Initial creation. Data is retrieved from the source. There’s only one ID in the fetcher, but the cache implementation doesn’t know that. It refuses to store the data as per Requirement 1.

Output#
Cache at path='cache/people.ftr' does not exist.
Refuse caching of source='people' since FetchInstruction.fetch_all=False.
person= 1904:Fred

Using Translator.go_offline() without any explicit IDs will call fetch_all.

Step 2#
translator.go_offline()
print("person=", translator.translate(1904, "people"))

When going offline, the Translator will store translation data in memory as a TranslationMap.

Output#
Cache at path='cache/people.ftr' does not exist.
Store cache at path='cache/people.ftr'.
person= 1904:Fred

By definition, a translator that is offline does not have a fetcher attached. The effects of this can be seen above: the cache was updated, but it wasn’t loaded again for the translate() call. There is no way to reconnect an offline Translator, so this instance will be limited to using its cache until it is destroyed.

Of course, deleting the MyCacheAccess instance doesn’t remove the files on disk.

Step 3#
print("person=", create().translate(1904, "people"))

If we create a new Translator and use it right away (or within ttl = 3600 seconds = 1 hour), the cached data will be used.

Output#
Load cache (age=0 <= 3600=ttl) at path='cache/people.ftr'.
person= 1904:Fred
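The three steps can also be reproduced without any dependencies. The sketch below uses a hypothetical ToyCache class that follows the same three rules as MyCacheAccess, but it is not the CacheAccess API: it stores plain dicts as JSON instead of feather files, and the fetch_all flag is passed directly rather than through a FetchInstruction.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Optional


class ToyCache:
    """Hypothetical stand-in for MyCacheAccess; not the CacheAccess API."""

    def __init__(self, root: str, ttl: int) -> None:
        self._root = Path(root)
        self._ttl = ttl  # In seconds
        self._root.mkdir(parents=True, exist_ok=True)

    def store(self, source: str, data: dict[str, str], fetch_all: bool) -> bool:
        if not fetch_all:
            return False  # Requirement 1: only cache complete fetches.
        # JSON replaces feather (Requirement 2) to avoid extra dependencies.
        (self._root / f"{source}.json").write_text(json.dumps(data))
        return True

    def load(self, source: str) -> Optional[dict[str, str]]:
        path = self._root / f"{source}.json"
        if not path.exists():
            return None
        if time.time() - path.stat().st_mtime > self._ttl:
            return None  # Requirement 3: stale data is rejected.
        return json.loads(path.read_text())


with tempfile.TemporaryDirectory() as root:
    people = {"1904": "Fred"}  # JSON object keys must be strings.

    cache = ToyCache(root, ttl=3600)
    # Step 1: cache miss, and a partial fetch is refused.
    step_1 = (cache.load("people"), cache.store("people", people, fetch_all=False))
    # Step 2: still a miss, but the complete fetch is stored.
    step_2 = (cache.load("people"), cache.store("people", people, fetch_all=True))
    # Step 3: a new instance finds the files written by the old one.
    step_3 = ToyCache(root, ttl=3600).load("people")

print(step_1)  # (None, False)
print(step_2)  # (None, True)
print(step_3)  # {'1904': 'Fred'}
```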

This concludes the example.