Shrink flextool repo history (remove accidental / legacy blobs) #325

@jkiviluo

Problem

.git/ currently weighs ~259 MB on disk (88 MB packed, 166 MB loose + pack),
dominated by blobs that are no longer in HEAD. Recent additions are clean
(no sqlite added to HEAD in the last 90 days — only one PNG in that window),
so the bloat is entirely legacy.

Largest blobs observed in history, listed with:

    git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)'

| Blob (history only, not in HEAD) | Size | Occurrences | Notes |
|---|---|---|---|
| flextool.lp | 310 MB | | One-off solver LP dump, never intended to be committed. Pure accident. |
| Input_data.sqlite | 22 MB | 20+ | Old test/dev data, superseded by fixtures in HEAD. |
| .spinetoolbox/items/flextool3_test_data/FlexTool3_data.sqlite | 22 MB | | Same story: old test data. |
| notebooks/RETO | 5.7 MB | | Binary never intended for git. |
| notebooks/.ipynb_checkpoints/RETO | 5.7 MB | | Checkpoint cache of the above. |
| Older highs.exe / libstdc++-6.dll versions | 18 MB + 28 MB each | a few | Solver binaries that were rolled over when new versions landed. |
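
The listing command above prints every object in no particular order. A minimal sketch of a ranked view follows; the function name largest_blobs and the default of 20 rows are assumptions for illustration, not part of the repo.

```shell
# Hypothetical helper: print the N largest blobs reachable from any ref,
# one per line as "blob <object-id> <size-in-bytes> <path>", largest first.
largest_blobs() {
    local n="${1:-20}"
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
        awk '$1 == "blob"' |
        sort -k3,3nr |
        head -n "$n"
}
```

Run it from inside the repository, for example largest_blobs 10. Sizes are uncompressed object sizes, so they will not add up to the on-disk pack size.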

Goals

  1. Remove from history the blobs that were never part of any build contract
    (pure accidents, clear to cut).
  2. Decide whether to also drop legacy solver binaries from history. This
    is a soft break for old-commit checkouts: running flextool at those
    commits would require providing a solver externally.
  3. Do not break anything currently in HEAD.

Scope

Must remove from history

  • flextool.lp
  • notebooks/RETO
  • notebooks/.ipynb_checkpoints/RETO

These were never load-bearing. Zero break risk.

Probably also remove

  • All 20+ historical Input_data.sqlite blobs (22 MB each).
  • Both .spinetoolbox/items/flextool3_test_data/FlexTool3_data.sqlite blobs.

These are old test/dev snapshots. Removing them from history cannot affect
current HEAD or any running tests — the fixtures those tests need live at
new paths under HEAD. The only impact is that someone doing git archaeology
on very old commits won't find them.
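
The "not in HEAD" claim for each removal candidate can be checked mechanically before anything is rewritten. The sketch below is one way to do it; path_status and check_removal_candidates are hypothetical names, and the paths are assumed to be relative to the repository root exactly as written in this issue.

```shell
# Hypothetical helper: report whether a path is present in the HEAD tree and
# how many commits (on any ref) git log reports for it.
# Run from the repository root; paths are relative to it.
path_status() {
    local path="$1" in_head=no commits
    if git cat-file -e "HEAD:$path" 2>/dev/null; then
        in_head=yes
    fi
    commits=$(git log --all --format=%H -- "$path" | wc -l | tr -d ' ')
    echo "$path in_head=$in_head commits=$commits"
}

# Run the check for the removal candidates named in this issue.
check_removal_candidates() {
    local p
    for p in \
        flextool.lp \
        notebooks/RETO \
        notebooks/.ipynb_checkpoints/RETO \
        Input_data.sqlite \
        .spinetoolbox/items/flextool3_test_data/FlexTool3_data.sqlite
    do
        path_status "$p"
    done
}
```

Every line printed by check_removal_candidates should read in_head=no. A line with commits=0 suggests the blob lives at a different path than the one listed here.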

Deliberately leave alone (for this pass)

  • Older bin/highs.exe, libstdc++-6.dll, libopenblas*, etc. versions
    that are no longer in HEAD but were bundled with earlier flextool
    releases. The tradeoff:
    • Pro: removing saves ~90 MB from history.
    • Con: git checkout <old-hash> won't have the solver binary in
      the tree anymore. Running flextool at that commit requires
      pip-installing highspy or dropping in a current binary manually.
      Any tag/release built strictly from old commits would become
      incomplete. git bisect through those old commits may fail to run.
    • Verdict: solver binaries are external tools; skipping bisect on
      them is usually fine. But handle their removal in a separate,
      announced repo-maintenance pass rather than folding it into this
      cleanup, so the scope is obvious to collaborators.

Also in scope for a later revisit

  • how to example databases/*.sqlite (8 files, ~6.2 MB total in HEAD).
  • templates/examples.sqlite, templates/time_settings_only.sqlite
    (~1.5 MB in HEAD).
  • User has green-lit keeping these for now, but flagged them for
    reconsideration once alternatives are in place (e.g. generating them
    on demand from rivendell-to-flextool-style helper packages, matching
    the pattern now used for the continental benchmark).

Prerequisites

  • git-filter-repo is not currently installed:

    pip install git-filter-repo
    # or: apt install git-filter-repo

  • Remote is shared (origin: git@github.com:irena-flextool/flextool.git),
    so the rewrite requires a coordinated force-push window.

  • Current local branches that will be rewritten (all refs):
    bind-intraperiod-blocks, constraint-capacity-coeffs,
    db-api-use-fixing, dc-power-flow, delay-fix,
    new-outputs (current), etc. Each rewritten branch gets new SHAs.

  • Any open PRs on GitHub will be invalidated — must be closed/reopened
    or rebased against new HEAD after the push.
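
Since every branch gets new SHAs, it helps to record the old ones before the rewrite. The following is a minimal pre-flight sketch; the function names and the default file name refs-before-rewrite.txt are assumptions for illustration.

```shell
# Hypothetical pre-flight helpers; names and default file name are assumptions.

# Fail early if git cannot find the filter-repo subcommand.
require_filter_repo() {
    git filter-repo --version >/dev/null 2>&1 || {
        echo "git-filter-repo not found; install it first" >&2
        return 1
    }
}

# Record every ref with its current commit so that old SHAs can still be
# looked up after the rewrite.
snapshot_refs() {
    local out="${1:-refs-before-rewrite.txt}"
    git for-each-ref --format='%(objectname) %(refname)' > "$out"
}
```

For example: require_filter_repo && snapshot_refs ~/flextool-refs-before.txt. Writing the snapshot outside the working tree keeps it from showing up as an untracked file.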

Plan

  1. Back up .git/ before touching anything:

    cp -r .git .git.backup-$(date -u +%Y%m%d)

  2. Install the tool and take an analysis snapshot:

    pip install git-filter-repo
    git filter-repo --analyze   # writes .git/filter-repo/analysis/

    Confirm the blob list matches expectations.

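  3. Rewrite history (dry-run first, then execute). Keep the scope tight:

    # Dry-run: leaves the repository unchanged and saves the original and
    # filtered fast-export streams for comparison.
    # --path matches from the repository root.
    git filter-repo --invert-paths \
        --path flextool.lp \
        --path notebooks/RETO \
        --path notebooks/.ipynb_checkpoints/RETO \
        --path Input_data.sqlite \
        --path .spinetoolbox/items/flextool3_test_data/FlexTool3_data.sqlite \
        --dry-run   # inspect the output, then re-run without this flag

    (Note: a git filter-repo rewrite cannot be undone without the
    .git.backup-* copy from step 1. Don't run it for real until the
    --analyze report and the dry-run output have been checked. The real
    run also needs --force unless it is done in a fresh clone, because
    git filter-repo refuses to rewrite a repository that is not one.)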

  4. Verify HEAD is unchanged — diff git ls-files and checksums of top
    tracked files against a fresh clone of the old origin.

  5. Verify size reduction:

    git gc --aggressive --prune=now
    git count-objects -vH

    Expect pack size to drop noticeably (order of ~50-100 MB based on
    current blob catalogue).

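  6. Announce the force-push window; then, for each branch:

    git push --force-with-lease origin <branch>

    By default git filter-repo removes the origin remote once it has
    rewritten history, so re-add it (and git fetch, so that
    --force-with-lease has remote-tracking refs to compare against)
    before pushing. All collaborators must then
    git fetch && git reset --hard origin/<branch> or re-clone. Any open
    PRs need to be rebased on the new HEAD.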
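
The per-branch push in step 6 can be looped over all local branches. This is a sketch under assumptions: push_all_branches is a hypothetical name, the remote is assumed to be configured and fetched already, and the optional second argument is passed straight through to git push.

```shell
# Hypothetical helper: push every local branch with --force-with-lease.
# Usage: push_all_branches [remote] [extra-git-push-option]
# Pass --dry-run as the second argument to preview without updating the remote.
push_all_branches() {
    local remote="${1:-origin}" extra="${2:-}" branch
    for branch in $(git for-each-ref --format='%(refname:short)' refs/heads); do
        # $extra is intentionally unquoted so that an empty value disappears.
        git push --force-with-lease $extra "$remote" "$branch" || return 1
    done
}
```

A cautious sequence is push_all_branches origin --dry-run first, then push_all_branches origin inside the announced window. The loop stops at the first rejected branch rather than continuing with a partially pushed set.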

Done when

  • .git/ size-pack is noticeably smaller (target: under 50 MB).
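  • git log --all --oneline | wc -l is unchanged, except for commits that
    touched only the removed paths: git filter-repo prunes commits that
    become empty by default (pass --prune-empty never to keep them).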
  • git diff <old-HEAD> <new-HEAD> is empty for every branch (HEAD
    trees unchanged).
  • tests/ pass on the rewritten HEAD.
  • All open PRs rebased; all collaborators on the new refs.
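
Two of the checks above lend themselves to scripting. The sketch below assumes the .git.backup-* copy from step 1 of the plan is still available; compare_branch_trees and pack_under_limit are hypothetical names. Comparing root tree hashes is equivalent to an empty git diff, because a tree hash covers every path and blob in the snapshot.

```shell
# Hypothetical helpers for two of the "Done when" checks.

# Compare the root tree of every local branch with the same branch in the
# pre-rewrite backup. Equal tree hashes mean identical file content.
compare_branch_trees() {
    local backup="$1" status=0 ref old new
    for ref in $(git for-each-ref --format='%(refname)' refs/heads); do
        old=$(git --git-dir="$backup" rev-parse --verify -q "${ref}^{tree}")
        new=$(git rev-parse --verify -q "${ref}^{tree}")
        if [ -n "$old" ] && [ "$old" = "$new" ]; then
            echo "OK   $ref"
        else
            echo "DIFF $ref"
            status=1
        fi
    done
    return "$status"
}

# Succeed only if size-pack from `git count-objects -v` (reported in KiB)
# is below the limit, given in MiB.
pack_under_limit() {
    local limit_mib="${1:-50}" pack_kib
    pack_kib=$(git count-objects -v | awk '$1 == "size-pack:" { print $2 }')
    echo "size-pack: ${pack_kib} KiB (limit: ${limit_mib} MiB)"
    [ "$pack_kib" -lt $((limit_mib * 1024)) ]
}
```

For example, after the gc in step 5: compare_branch_trees .git.backup-<date> && pack_under_limit 50. Both functions return non-zero on failure, so they can gate the force-push.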

Owner

@jkiviluo (coordinated force-push window, collaborator notifications).

Related

  • Standalone generator repo Rivendell_to_FlexTool is now the canonical
    location for the Rivendell generator; flextool/rivendell/ was
    removed from the working tree in the same session that produced this
    spec.
  • benchmarks/scaling/scenarios/continental/generate.py now rebuilds
    its input.sqlite from rivendell-to-flextool on demand into
    ~/.cache/rivendell_to_flextool/ — example of the pattern for
    keeping dev sqlites out of the flextool tree going forward.

Labels: enhancement
