Skip to content

Updating s3 backup scripts to work with boto3 INFRA-448#368

Open
ScottMillard wants to merge 13 commits into
masterfrom
scott/backup_scripts
Open

Updating s3 backup scripts to work with boto3 INFRA-448#368
ScottMillard wants to merge 13 commits into
masterfrom
scott/backup_scripts

Conversation

@ScottMillard

@ScottMillard ScottMillard commented Apr 9, 2026

Copy link
Copy Markdown

This updates the s3 backup scripts to work with boto3 so that they can take advantage of short lived credentials. It removes any codified reference to the AWS access/secret key and instead relies on boto's built in method to figure out how to authenticate, which will use many different methods which will allow this script to work in many ways for users that want to authenticate differently.

This will require boto3 to be installed in the environment it is running in.

For the redis backup, it removes the dependency for s3cmd and instead relies on boto for the upload. The redis backup did not previously use smart_open to stream direct to S3 but it has been updated to do so to match the others.

I tested this locally, results are below - I had to install python in each container and authorized to AWS by exporting short lived AWS credentials to the shell prior to running. I validated the file was in the bucket afterwards and deleted them.

I'm not a python dev so copilot did most the heavy lifting 💪

I also added the backups.md file with some information on how to enable/disable backups.

redis:

Backing up to "redis/yearly/redis-7.2.12-kobo.local-20260506_205501.gz"...
Wrote 1 chunks; 262.1 MB
Finished! Wrote 1 chunks; 262.1 MB
Backup `redis/yearly/redis-7.2.12-kobo.local-20260506_205501.gz` successfully sent to S3.
Done!

postgres:

acking up to "postgres/yearly/postgres-kc-14-kobo.local-20260506_210001.pg_dump"...
Backing up to "postgres/yearly/postgres-kpi-14-kobo.local-20260506_210001.pg_dump"...
Wrote 1 chunks; 262.1 MB
Finished! Wrote 1 chunks; 262.1 MB
Wrote 1 chunks; 262.1 MB
Finished! Wrote 1 chunks; 262.1 MB
Backup `postgres/yearly/postgres-kc-14-kobo.local-20260506_210001.pg_dump` successfully sent to S3.
Backup `postgres/yearly/postgres-kpi-14-kobo.local-20260506_210001.pg_dump` successfully sent to S3.
Done!

mongo:

Backing up to "mongo/yearly/mongo-5.0-kobo.local-20260506_203501.gz"...
Wrote 1 chunks; 262.1 MB
Finished! Wrote 1 chunks; 262.1 MB
Backup `mongo/yearly/mongo-5.0-kobo.local-20260506_203501.gz` successfully sent to S3.
Done!

@ScottMillard ScottMillard changed the title [DRAFT] Updating s3 backup scripts to work with boto3 Updating s3 backup scripts to work with boto3 INFRA-448 Apr 9, 2026
@ScottMillard ScottMillard marked this pull request as ready for review April 9, 2026 19:07
@ScottMillard ScottMillard requested a review from Copilot April 9, 2026 19:07

@jnm jnm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python dependency stuff

I had to install python [and boto3] in each container

@ScottMillard, there is a toggle-backup-activation.sh for each of the 3 databases that takes care of Python and Python dependency installation:

pip install --quiet boto

pip install --quiet boto

Those scripts are also still looking for the AWS_… access key environment variables to determine whether or not S3 is being used. You mention that you now use "boto's built in method to figure out how to authenticate"; maybe that could also be used to determine (in conjunction with BACKUP_AWS_STORAGE_BUCKET_NAME?) whether S3 backups are configured, dropping the reliance on AWS_… env. vars.

Streaming directly to S3

We did a fair amount of work to stream backups directly to S3 instead of writing them out completely to a local disk prior to upload, since that can require a very large amount of space to be available in /tmp/ or wherever. Can that still be done with boto3?

@ScottMillard

Copy link
Copy Markdown
Author

Python dependency stuff

@ScottMillard, there is a toggle-backup-activation.sh for each of the 3 databases that takes care of Python and Python dependency installation:

pip install --quiet boto

pip install --quiet boto

Those scripts are also still looking for the AWS_… access key environment variables to determine whether or not S3 is being used. You mention that you now use "boto's built in method to figure out how to authenticate"; maybe that could also be used to determine (in conjunction with BACKUP_AWS_STORAGE_BUCKET_NAME?) whether S3 backups are configured, dropping the reliance on AWS_… env. vars.

I've updated the files to install boto3 and tested them on my local kobo-install/kobo-docker. I've removed USE_S3 and now only use whether the bucket name was passed in or not.

Streaming directly to S3

We did a fair amount of work to stream backups directly to S3 instead of writing them out completely to a local disk prior to upload, since that can require a very large amount of space to be available in /tmp/ or wherever. Can that still be done with boto3?

Thank you for this John - I've updated the PR and added back in using smart_open to stream directly to S3 - I also added it for the redis backup which did not previously use it.

@ScottMillard ScottMillard requested a review from jnm May 6, 2026 21:23

@jnm jnm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ScottMillard sorry I haven't re-reviewed this for real, but here's Claude's take with light revisions just to avoid delays. I think the first two issues at least are valid:

Redis S3 backup is broken

The redis backup-to-s3.py uses Popen(..., stdout=subprocess.PIPE) to stream into S3 via smart_open. But BACKUP_COMMAND calls backup-to-disk.bash, which writes the gzipped file to /srv/backups/ and only prints a success message to stdout. The streaming code will capture that message text, not the actual backup data. The uploaded S3 "backup" will be empty/trivial.

Fix: either have the bash script output gzip data to stdout, or upload the resulting file from /srv/backups/ after the script completes.

Postgres timezone mismatch will crash

The postgres backup-to-s3.py uses naive datetime.datetime.now() but compares against backup.last_modified from boto3, which is timezone-aware (UTC). This will raise TypeError: can't subtract offset-naive and offset-aware datetimes when cleanup runs. The mongo and redis scripts correctly use datetime.datetime.now(datetime.timezone.utc) — postgres was missed.

Medium: No subprocess return code check

After streaming from process.stdout, none of the scripts call process.wait() or check process.returncode. If mongodump or pg_dump fails mid-stream, a truncated/corrupt file gets silently uploaded to S3 and counted as a valid backup by retention logic.

Low: smart_open unpinned

smart-open[s3] is installed without a version constraint. The smart_open.smart_open() function used in the scripts is deprecated (replaced by smart_open.open() in v5+). Not urgent but could break on a future pip install.

Pre-existing: keeps vs days semantic confusion in cleanup

The cleanup logic compares delta.days > keeps where keeps is a retention count (e.g., 2 yearly backups). This conflates "how many to keep" with "age in days." This is carried over from the old code, not introduced by this PR.


Backwards compatibility with kobo-install

Good news here — fully compatible:

  • AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from kobo-install's aws.txt.tpl are still passed through to cron and picked up by boto3's credential chain
  • They're just no longer required, so IAM roles work too
  • No docker-compose, volume, service name, or network changes
  • kobo-install doesn't need changes for existing key-based users

If kobo-install wants to support IAM-role-only deployments (no keys), it would need a minor update to make access keys optional in its config flow, but that's a separate concern.


Verdict

Two critical bugs block merge: the redis backup producing empty files and the postgres timezone crash. The subprocess return code check is a strong "should fix" to prevent silent data loss. The rest are low-priority.

@ScottMillard

Copy link
Copy Markdown
Author

Redis S3 backup is broken

The redis backup-to-s3.py uses Popen(..., stdout=subprocess.PIPE) to stream into S3 via smart_open. But BACKUP_COMMAND calls backup-to-disk.bash, which writes the gzipped file to /srv/backups/ and only prints a success message to stdout. The streaming code will capture that message text, not the actual backup data. The uploaded S3 "backup" will be empty/trivial.
Fix: either have the bash script output gzip data to stdout, or upload the resulting file from /srv/backups/ after the script completes.

I've fixed this, very good catch. Makes sense that previously the redis didn't use smart_open to stream to S3, but I left it in place anyway. I validated the file sent to S3 contains the actual rdb backup.

Postgres timezone mismatch will crash

The postgres backup-to-s3.py uses naive datetime.datetime.now() but compares against backup.last_modified from boto3, which is timezone-aware (UTC). This will raise TypeError: can't subtract offset-naive and offset-aware datetimes when cleanup runs. The mongo and redis scripts correctly use datetime.datetime.now(datetime.timezone.utc) — postgres was missed.

This is resolved also.

@ScottMillard ScottMillard requested a review from jnm May 20, 2026 18:50
@jnm

jnm commented Jun 9, 2026

Copy link
Copy Markdown
Member

@greptileai welcome to this repo. Please review!

@greptile-apps

greptile-apps Bot commented Jun 9, 2026

Copy link
Copy Markdown

Greptile Summary

This PR migrates all three backup scripts (mongo, postgres, redis) from the legacy boto library to boto3, enabling short-lived/instance-profile credential support via boto3's standard credential chain. The toggle shell scripts are also updated to install boto3 + smart-open[s3] in a stdlib venv instead of the old virtualenv, and a new doc/backups.md is added.

  • All three Python backup scripts drop S3Connection/parse_ts from boto and replace them with boto3.resource('s3') + direct ObjectSummary.last_modified comparisons; smart_open.open() (v2+ API) now constructs the S3 URI directly.
  • Redis backup is refactored from an os.system one-liner into structured run()/cleanup() functions matching the mongo/postgres pattern, with an explicit local-file cleanup step after the S3 upload.
  • Toggle scripts drop the USE_S3 flag that required both AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to be present before triggering the S3 path; the gate is now simply whether BACKUP_AWS_STORAGE_BUCKET_NAME is set.

Confidence Score: 4/5

The boto3 migration is functionally correct and tested end-to-end by the author; the credential-chain approach is a clear improvement over hardcoded key requirements.

All three backup scripts correctly replace the legacy boto S3Connection/parse_ts calls with boto3 resource objects and native timezone-aware datetime comparisons. The delete path (l.delete() on ObjectSummary) is valid in boto3. The redis refactor follows the same run/cleanup pattern as mongo/postgres and adds proper local-file cleanup. No logic bugs were found in the changed paths; the only open item is the absence of version pinning for the newly installed packages.

The three toggle scripts (mongo, postgres, redis) each install smart-open[s3] and boto3 without version pins — worth a second look before merging if reproducible installs matter.

Sequence Diagram

sequenceDiagram
    participant C as cron
    participant T as toggle-backup-activation.sh
    participant P as backup-to-s3.py
    participant B3 as boto3 credential chain
    participant S3 as AWS S3

    C->>T: BACKUP_SCHEDULE set
    T->>T: write env vars to crontab
    T->>T: python3 -m venv /tmp/backup-virtualenv
    T->>T: pip install boto3 smart-open[s3]

    Note over C,P: At scheduled time
    C->>P: invoke via python interpreter

    P->>B3: boto3.resource('s3') — resolve credentials
    Note over B3: env vars → ~/.aws → instance profile → …
    B3-->>P: authenticated session

    P->>S3: "bucket.objects.filter(Prefix=...) — find latest backup"
    S3-->>P: ObjectSummary list

    P->>P: determine target directory (yearly/monthly/weekly/daily)

    alt mongo / postgres
        P->>P: "subprocess.Popen(dump_cmd, stdout=PIPE)"
        P->>S3: smart_open.open(s3://bucket/key, wb) — stream chunks
    else redis
        P->>P: subprocess.check_call(backup-to-disk.bash)
        P->>P: open local file /srv/backups/DUMPFILE
        P->>S3: smart_open.open(s3://bucket/key, wb) — stream chunks
        P->>P: os.remove(local file)
    end

    P->>S3: cleanup() — delete objects older than retention window
Loading

Reviews (2): Last reviewed commit: "Feedback" | Re-trigger Greptile

Comment thread redis/backup-to-s3.py
Comment thread postgres/scripts/backup-to-s3.py
Comment thread redis/backup-to-s3.py
Comment thread doc/backups.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants