Skip to content

chore: tighten TOCTOU windows in Asset write paths #146

@t-a-y-l-o-r

Description

@t-a-y-l-o-r

Summary

Tighten TOCTOU (time-of-check-to-time-of-use) windows in the Asset write paths in blueflow/views/asset.py. None of these races corrupt data — the DB-level unique=True and unique_together constraints catch every integrity violation that matters — but several let concurrent writes silently lose updates or surface as 500s instead of clean 4xx responses.

This is correctness work, not a security bug. The audit was triggered during PR #144 review; no exploitation in the wild is known.

Background

Each finding follows the same structural pattern: a Django ORM read (get/filter/first) followed by a Python-side mutation followed by a write (save/create/update), with no row-level locking between the read and the write. Concurrent transactions can change the row in that window. The DB constraints are the only thing keeping the result valid; the application logic is racing.

The TOCTOU mitigations Django offers are well-trodden:

  • select_for_update() — Postgres SELECT ... FOR UPDATE on the read side, blocks concurrent transactions from modifying the row until commit.
  • update_or_create() — Django's atomic upsert primitive; uses SELECT ... FOR UPDATE internally.
  • transaction.atomic() — wraps the read-then-write in a single transaction so locks are held coherently.
  • DB-level array operators (e.g. open_ports_tcp || ARRAY[...]) — atomic at the SQL layer, no app-level read needed.

Findings

M1 (Medium) — upsert() lost updates on concurrent same-MAC writes

File: blueflow/views/asset.py:956-963

asset = Asset.objects.get(mac_address=mac_address)   # read (no lock)
# ... mutate fields in Python ...
asset.save()                                         # write

Two scanners hitting the same MAC race: each gets the same snapshot, mutates locally, calls save(). Last write wins; the intermediate update is silently lost.

The upsert endpoint is explicitly designed for passive scanners (Tapirx, etc.) per its docstring — multiple scanners racing on the same MAC is the common case during a discovery sweep, not an exotic scenario.

Fix: wrap in transaction.atomic() + select_for_update(), or replace the entire get/mutate/save block with Asset.objects.update_or_create(mac_address=..., defaults={...}).


M2 (Medium) — bulk_update() silently swallows concurrent deletes

File: blueflow/views/asset.py:873-910

Phase 1 reads all the assets outside any atomic block. Phase 2 writes them inside transaction.atomic(). Between the two, a concurrent transaction can DELETE one of the assets; the corresponding serializer.save() issues an UPDATE that affects 0 rows (Postgres returns success on UPDATE-of-deleted-row). The endpoint responds 200 with phase-1 data, claiming success. The asset is gone.

Fix: move the phase-1 reads inside the existing atomic block and use select_for_update() so locks are held until phase 2 commits.


M3 (Medium) — upsert() open-ports merge uses stale read

File: blueflow/views/asset.py:957-960

if new_ports:
    merged = sorted(set(asset.open_ports_tcp) | set(new_ports))
    ...

The asset.open_ports_tcp snapshot is from the M1 read above. A concurrent transaction adding a port observation between the read and the eventual save() is silently overwritten by the merge.

Port discovery is exactly the kind of incremental, multi-source data this endpoint exists to collect. Two scanners observing different open ports on the same device should both contribute; this race drops one.

Fix: same shape as M1 (atomic + lock), OR replace with a single SQL update that uses Postgres array union semantics directly, eliminating the read entirely.


L1 (Low) — Hostname conflict check leaks IntegrityError as 500

File: blueflow/views/asset.py:935-970 (introduced by PR #144)

The application-level conflict check at line 936-940 can return "no conflict" while a third transaction inserts a colliding hostname before line 966 runs Asset.objects.create(...). The create raises IntegrityError, which propagates as 500.

Severity rationale: Data integrity preserved (the DB unique index catches it — this is by design, see PR #144 commentary). Only the response shape is wrong: race winner sees 400 from the application check; race loser sees 500 instead of 400.

Fix: wrap the create in try/except IntegrityError and return a 400 with the same shape as the application-level conflict response. Symmetric with the existing application-layer check.


L2 (Low) — tags() action fallback fetch may DoesNotExist on concurrent delete

File: blueflow/views/asset.py:719-735

try:
    asset_tag.save()
except IntegrityError:
    asset_tag = AssetTag.objects.get(asset=asset, tag=tag)   # may DoesNotExist

If a concurrent transaction deletes the AssetTag in the microsecond window between the IntegrityError catch and the fallback fetch, the .get() raises DoesNotExist and propagates as 500.

Severity: Very low — purely theoretical. Defer until this races in practice.


Patterns done correctly (kept for reference)

Pattern Where Why it's safe
Try-create-then-catch tags() POST handler (asset.py:717-728) Relies on the unique_together constraint on AssetTag to atomically reject duplicates at the DB layer.
All-or-nothing batch writes bulk_update() phase 2 (asset.py:904) The atomic block ensures partial writes don't commit. Read-side is the gap (M2 above).
DB-level identity guarantees unique=True on mac_address and hostname; unique_together on join tables These are the actual contract. Every TOCTOU finding above relies on them as the backstop.

Recommended fix order

  1. M1 first — most user-visible, most common in scanner-heavy traffic, single-method change. ~10 lines.
  2. M3 alongside M1 — same handler, same fix shape. Doing them together costs ~zero extra.
  3. M2 — second priority. ~5 lines (move reads inside the existing atomic block + add select_for_update()).
  4. L1 — small follow-up; convert IntegrityError to 400 for symmetry with the application-level check. ~5 lines.
  5. L2 — defer indefinitely. File a separate issue if a real race ever surfaces in production logs.

Tasks

  • M1 — Use select_for_update() (or update_or_create()) for the upsert read/save in views/asset.py:956-963.
  • M3 — Eliminate the stale-read open-ports merge in views/asset.py:957-960 (covered by the same fix as M1 if select_for_update() is chosen).
  • M2 — Move bulk_update() phase-1 reads inside the existing atomic block at views/asset.py:904; add select_for_update().
  • L1 — Add try/except IntegrityError around the upsert create (views/asset.py:966) returning a 400 in the same shape as the application-level conflict.
  • Tests — for each fix, add a test that simulates the race using transaction.on_commit callbacks or threading; at minimum, document the TOCTOU window in a comment so the lock isn't accidentally removed by a future refactor.

Acceptance criteria

  • All M-severity races have either a select_for_update() lock, an update_or_create() call, or an SQL-atomic update operator covering them.
  • L1 returns 400 (not 500) when the IntegrityError race fires.
  • The existing test suite is at the same pass/fail count as develop baseline (no regressions).
  • New regression tests exist for at least M1 and M2 to prevent the locks from being removed without surfacing the race.

Out of scope

  • TOCTOU in non-Asset views (groups, networks, custom fields, etc.) — separate audit if anyone wants it.
  • Permission-check-then-act races — Django's check_object_permissions runs per-request and isn't a real TOCTOU surface unless permissions are swapped mid-request, which doesn't happen.
  • Switching the deployment to ATOMIC_REQUESTS=True in settings — that's a different conversation about default request semantics, not a TOCTOU fix.

Related

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions