Hi @mrdevrobot,
I have finally reproduced the recurring "Not enough space" exception.
After auditing our production logs against the engine's source code, we found that the core issue (the in-memory FSI drifting out of sync with the physical pages) has additional architectural triggers that PR #61 did not fully cover.
Here is a breakdown of the remaining architectural blind spots.
1. Flaw A: Cross-Collection FSI Poisoning in Single-File Mode (The Crash Trigger)
This is the true root cause of our production crash. In Single-File (embedded) mode, data pages are physically shared, but the FreeSpaceIndex cache is instantiated per-collection.
The Log Evidence:
Need 615, Have 251 | PageId=301 | FSI=646
Need 328, Have 125 | PageId=485 | FSI=797
Notice the deltas between the cached FSI value and the space the page actually has: 646 - 251 = 395 and 797 - 125 = 672. These match the exact sizes of documents I had just inserted into a different collection within the same transaction!
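For anyone who wants to check the arithmetic against their own logs, here is a small sketch that parses the two lines above and computes how far the cached FSI value overstates the real free space. The regex and helper names are mine and assume the log format is exactly as quoted; this is not part of the engine.

```python
import re

# The two log lines quoted above, copied verbatim.
LOG_LINES = [
    "Need 615, Have 251 | PageId=301 | FSI=646",
    "Need 328, Have 125 | PageId=485 | FSI=797",
]

# Assumed format: "Need N, Have H | PageId=P | FSI=F"
PATTERN = re.compile(r"Need (\d+), Have (\d+) \| PageId=(\d+) \| FSI=(\d+)")


def parse(line):
    """Return the four integers of one log line as a dict."""
    match = PATTERN.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    need, have, page_id, fsi = map(int, match.groups())
    return {"need": need, "have": have, "page_id": page_id, "fsi": fsi}


def stale_delta(entry):
    """Bytes by which the cached FSI overstates the page's real free space."""
    return entry["fsi"] - entry["have"]


entries = [parse(line) for line in LOG_LINES]
deltas = [stale_delta(entry) for entry in entries]
print(deltas)
```

In both lines the cached FSI value is large enough for the requested size while the real free space is not, which is exactly the condition that routes the insert to a page that cannot hold it.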
The Mechanism:
- On startup, RebuildFreeSpaceIndex() runs for every collection. Collection A and Collection B both scan Page X and load its free space into their isolated, per-collection _fsi instances.
- Collection A executes a BulkInsert, reducing the shared physical space. Collection A updates its own _fsi.
- Collection B attempts an insert. It checks its own _fsi, which still holds the stale value from startup (unaware of Collection A's consumption).
- Collection B routes the document to the shared Page X, sees the true physical space is insufficient, and fatally crashes.
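To make the sequence above concrete, here is a toy model of it. This is not the engine's code: `Page`, `Collection`, and `NotEnoughSpace` are hypothetical stand-ins, and the starting free space (646) and Collection A's insert size (395) are reconstructed from the first log line, assuming the delta is a single insert.

```python
class NotEnoughSpace(Exception):
    """Raised when the physical page cannot hold the document."""


class Page:
    """A shared physical page; free_bytes is the ground truth."""

    def __init__(self, page_id, free_bytes):
        self.page_id = page_id
        self.free_bytes = free_bytes


class Collection:
    """Toy collection with its own private free-space cache."""

    def __init__(self, name, pages):
        self.name = name
        self.pages = pages  # shared dict: page_id -> Page
        self.fsi = {}       # private cache: page_id -> believed free bytes

    def rebuild_free_space_index(self):
        # Startup scan: every collection caches every shared page.
        self.fsi = {pid: page.free_bytes for pid, page in self.pages.items()}

    def insert(self, size):
        # Route using the private cache only.
        page_id = next(
            (pid for pid, free in self.fsi.items() if free >= size), None
        )
        if page_id is None:
            raise NotEnoughSpace("no candidate page in FSI")
        page = self.pages[page_id]
        if page.free_bytes < size:
            # The cache said yes, the physical page says no.
            raise NotEnoughSpace(
                f"Need {size}, Have {page.free_bytes} | "
                f"PageId={page_id} | FSI={self.fsi[page_id]}"
            )
        page.free_bytes -= size
        self.fsi[page_id] = page.free_bytes  # only this collection's cache
        return page_id


def reproduce():
    pages = {301: Page(301, 646)}
    a = Collection("A", pages)
    b = Collection("B", pages)
    a.rebuild_free_space_index()
    b.rebuild_free_space_index()
    a.insert(395)          # A consumes shared space and updates only its own cache
    try:
        b.insert(615)      # B still believes page 301 has 646 bytes free
    except NotEnoughSpace as exc:
        return str(exc)
    return None


print(reproduce())
```

In this model the failure message has the same shape as the first production log line, because Collection B's cache never learns about Collection A's write.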
100% Reproducible MRE for Flaw A:
MRE
The Grand Unification (Connection to Issue #65):
Interestingly, this architectural blind spot is also the root cause of Issue #65 (Phantom Objects)! Because physical pages lack a CollectionId tag in single-file mode, they suffer from both read leakage (Issue #65: deserializing documents that belong to another collection) and write collisions (this FSI crash).
Suggested Fix: Add a CollectionId at the SlottedPageHeader or SlotEntry level so that RebuildFreeSpaceIndex tracks only the pages a collection exclusively owns. That single refactoring should resolve both critical issues.
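A rough sketch of what an ownership-aware rebuild could look like, as a toy model only. The `owner` field stands in for the proposed CollectionId tag, page 512 is made up for illustration, and nothing here reflects the engine's actual page layout.

```python
from dataclasses import dataclass


@dataclass
class Page:
    page_id: int
    owner: str       # hypothetical CollectionId stored in the page header
    free_bytes: int


def rebuild_free_space_index(pages, collection_id):
    """Cache free space only for pages owned by collection_id."""
    return {
        page.page_id: page.free_bytes
        for page in pages
        if page.owner == collection_id
    }


# Pages 301 and 485 come from the logs; page 512 is invented for the example.
pages = [Page(301, "A", 646), Page(485, "B", 797), Page(512, "A", 120)]

fsi_a = rebuild_free_space_index(pages, "A")
fsi_b = rebuild_free_space_index(pages, "B")
print(fsi_a, fsi_b)
```

With disjoint page sets, no collection can hold a cached value for a page that another collection writes to, so the stale-cache condition cannot arise across collections.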
2. Flaw B: Missing SnapshotForTransaction in Insert Paths (The Space Leak)
While analyzing the code, we noticed another regression path. PR #61 brilliantly added FSI snapshots for DeleteCore. However, InsertIntoPage, InsertWithOverflow, and AllocateNewDataPage update the FSI but never call _fsi.SnapshotForTransaction().
If a transaction performs an insert (reducing the FSI value) and is subsequently rolled back, the physical page reverts via the WAL to its larger free space, but the in-memory FSI stays stuck at the smaller post-insert value. This doesn't cause a crash (an FSI value below the actual free space is merely pessimistic), but the usable page space stays leaked until the database is restarted.
Suggested Fix: Add _fsi.SnapshotForTransaction() inside the insert and allocation paths before modifying the FSI, ensuring rollbacks perfectly restore the spatial state for all operations.
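Here is a minimal model of the difference the snapshot makes on rollback. The `FreeSpaceIndex` class, its method signatures, and the transaction ids are all assumptions for illustration; the real `SnapshotForTransaction` may take different arguments and store its state differently.

```python
class FreeSpaceIndex:
    """Toy in-memory FSI with per-transaction snapshots (hypothetical API)."""

    def __init__(self):
        self._free = {}       # page_id -> believed free bytes
        self._snapshots = {}  # txn_id -> {page_id: value before first change}

    def snapshot_for_transaction(self, txn_id, page_id):
        saved = self._snapshots.setdefault(txn_id, {})
        # Keep only the first (pre-transaction) value for each page.
        saved.setdefault(page_id, self._free.get(page_id))

    def update(self, page_id, free_bytes):
        self._free[page_id] = free_bytes

    def get(self, page_id):
        return self._free.get(page_id)

    def rollback(self, txn_id):
        # Restore every page touched by the transaction to its saved value.
        for page_id, old in self._snapshots.pop(txn_id, {}).items():
            if old is None:
                self._free.pop(page_id, None)
            else:
                self._free[page_id] = old

    def commit(self, txn_id):
        self._snapshots.pop(txn_id, None)


def insert(fsi, txn_id, page_id, size, take_snapshot):
    """Model of an insert path, with or without the snapshot call."""
    if take_snapshot:
        fsi.snapshot_for_transaction(txn_id, page_id)
    fsi.update(page_id, fsi.get(page_id) - size)


def run(take_snapshot):
    fsi = FreeSpaceIndex()
    fsi.update(301, 646)
    insert(fsi, txn_id=1, page_id=301, size=395, take_snapshot=take_snapshot)
    fsi.rollback(1)  # the physical page is assumed to be back at 646 via the WAL
    return fsi.get(301)


print(run(take_snapshot=False), run(take_snapshot=True))
```

Without the snapshot, the model keeps reporting 251 free bytes for a page that physically has 646 again. With it, the rollback restores the cached value to match the page.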
Originally posted by @LeoYang06 in #58