Cache does not detect stale files #251

@arthurlorenzi

Problem

When a file is updated on the DataSUS FTP server, PySUS silently returns the outdated local copy if a file with the same name already exists in the cache. There is no staleness check of any kind; the only test is whether a file with the same name exists:

if existing.exists():

Proposed Solution

The FileInfo metadata fetched from the FTP server already contains the file's modification date and size. These can be compared against cached metadata to determine staleness.

A safe approach is to store the server's modification timestamp and file size in a separate sidecar metadata file at download time:

import json

# on download, write a sidecar next to the cached file
meta = {
    "last_update": self.__info["last_update"].isoformat(),
    "size": int(self.__info["size"]),
}
with open(str(filepath) + ".pysus_meta.json", "w", encoding="utf-8") as f:
    json.dump(meta, f)

Then, on cache check, compare the server metadata against the values stored in the sidecar. Something like:

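import json
import pathlib
from datetime import datetime

# inside the download method, where `existing` is the cached file path
meta_path = pathlib.Path(str(existing) + ".pysus_meta.json")
if meta_path.exists():
    meta = json.loads(meta_path.read_text())
    cached_modify = datetime.fromisoformat(meta["last_update"])
    # fresh only if the cached timestamp is not older and the size matches
    if cached_modify >= self.__info["last_update"] and meta["size"] == int(self.__info["size"]):
        return Data(str(existing))
# otherwise (no sidecar, older timestamp, or size mismatch), re-download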

No additional FTP requests are needed. The metadata is already available at download time.
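To make the idea easier to review, here is a minimal standalone sketch of the sidecar write and the freshness check. The function names `write_sidecar` and `is_cache_fresh` are hypothetical and not part of PySUS; the sketch assumes the server metadata is available as a `datetime` and an integer size, and that both timestamps are either naive or timezone-aware. A missing or unreadable sidecar is treated as stale, which is my assumption about the safest default.

```python
import json
from datetime import datetime
from pathlib import Path

META_SUFFIX = ".pysus_meta.json"  # sidecar suffix used in the proposal


def write_sidecar(filepath: Path, last_update: datetime, size: int) -> Path:
    """Record the server's metadata next to the downloaded file (hypothetical helper)."""
    meta_path = Path(str(filepath) + META_SUFFIX)
    meta = {"last_update": last_update.isoformat(), "size": int(size)}
    meta_path.write_text(json.dumps(meta), encoding="utf-8")
    return meta_path


def is_cache_fresh(filepath: Path, last_update: datetime, size: int) -> bool:
    """Return True only if the cached file and its sidecar match the server metadata."""
    meta_path = Path(str(filepath) + META_SUFFIX)
    if not (filepath.exists() and meta_path.exists()):
        return False  # no file or no sidecar: treat as stale
    try:
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        cached_update = datetime.fromisoformat(meta["last_update"])
        cached_size = int(meta["size"])
        # fresh when the cached timestamp is not older and the size is identical
        return cached_update >= last_update and cached_size == int(size)
    except (ValueError, KeyError, TypeError):
        return False  # corrupt or incomplete sidecar: re-download
```

Files cached by an older version would have no sidecar, so under this sketch they would be downloaded once more and then tracked normally.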

Possible API Extension

Beyond staleness detection, it would be worth exposing a use_cache parameter to allow users to bypass the cache entirely:

file.download(use_cache=False)
file.async_download(use_cache=False)

This is particularly useful for pipelines that need guaranteed fresh data regardless of cache state, without having to manually delete cached files.
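As a rough illustration of how `use_cache` could interact with a staleness check, here is a minimal sketch. It does not reflect the current PySUS internals: the `download` function below and its `fetch` and `is_fresh` callables are hypothetical stand-ins for the real FTP transfer and metadata comparison.

```python
from pathlib import Path
from typing import Callable


def download(
    name: str,
    local_dir: Path,
    fetch: Callable[[Path], None],
    is_fresh: Callable[[Path], bool],
    use_cache: bool = True,
) -> Path:
    """Return the local path, re-fetching when the cache is bypassed, missing, or stale."""
    target = Path(local_dir) / name
    if use_cache and target.exists() and is_fresh(target):
        return target  # cache hit: nothing is transferred
    fetch(target)  # use_cache=False, missing file, or stale copy
    return target
```

With `use_cache=True` as the default, existing callers would keep their current behavior apart from the added staleness check.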

Context

This was identified while building a platform that periodically monitors DataSUS sources and downloads files when they are new or updated. The end goal is to compile a collection of epidemiological indicators, so stale data returned from the cache could end up in those indicators unnoticed. We currently wrap every download in a temporary directory, but that is just a workaround:

with tempfile.TemporaryDirectory() as tmp:
    data = sinan.download(file, local_dir=tmp)

If you are OK with the proposed changes, I can write the PR.
