Skip to content

import(csv,parquet): replace DuckDB-on-temp-file introspection with in-process row streaming via ArrowBuffer #472

@xe-nvdk

Description

@xe-nvdk

Summary

The CSV and Parquet import handlers (internal/api/import.go#handleCSVImport, handleParquetImport) currently introspect the uploaded temp file by issuing DuckDB queries against read_csv(...) / read_parquet(...). This is architecturally inconsistent with the LP and TLE import handlers (which parse rows in-process and push through ArrowBuffer), and it's fragile under the new DuckDB sandbox model from #442.

The immediate symptom — surfaced while building arcctl PR #3 — is that CSV imports fail with a DuckDB parser error before any data lands:

arc: failed to read file: query failed: Parser Error: syntax error at or near "("

LINE 1: SELECT column_name FROM (DESCRIBE read_csv('/Users/nacho/dev/basekick-labs/arc/.tmp/arc-uploads/arc-import-XXXX/import.csv', auto_detect=true, header=true))
                                                  ^

Came back as HTTP 422 from POST /api/v1/import/csv against a tiny 91-byte CSV with valid headers + 3 rows + every flag at defaults. Same failure against a 0-byte CSV (errors during introspection, before the empty-file check fires).

Same handler also fails for Parquet (untested in PR #3 but uses the identical getColumnsread_parquet(...) path).

Architecture problem (the bigger fix)

Per @xe-nvdk's clarification on PR #3:

"You don't query the CSV, you query the data ingested through /api/v1/query. There is no CSV in arc when the data is ingested."

That's the canonical contract: arc serves queries over its own Parquet partitions, not over arbitrary uploaded files. The current CSV/Parquet import handlers violate this by routing the file through DuckDB's read_csv / read_parquet directly:

// internal/api/import.go (current)

readExpr := h.buildReadExpression(filePath, opts)  // → read_csv('/tmp/.../import.csv', ...)

// 1. schema introspection — direct DuckDB query against temp file
columns, err := h.getColumns(readExpr)              // SELECT column_name FROM (DESCRIBE read_csv(...))

// 3. stats — another direct DuckDB query against temp file
statsQuery := fmt.Sprintf("SELECT COUNT(*), MIN(%s)::VARCHAR, MAX(%s)::VARCHAR FROM %s", timeCast, timeCast, readExpr)

This works only because the upload directory is in the sandbox allowed_directories list (added in #442). It's still:

  • Inconsistent with LP/TLE — both use arrowBuffer.WriteColumnarRecord(...) and never issue DuckDB queries against the temp file. LP parses in-process via ingest.NewLineProtocolParser; TLE has its own parser.
  • Fragile against DuckDB version drift — the SELECT column_name FROM (DESCRIBE ...) SQL is rejected by the DuckDB version arc ships today (duckdb_version=v1.5.1 per the smoke). DESCRIBE can't always be used as a derived table.
  • A new sandbox hole every time a temp dir moves — every directory the import handler writes to has to be allowlisted explicitly. fix(security): sandbox DuckDB user queries via enable_external_access=false + allowed_directories #442 kept the list intentionally minimal; this pattern keeps adding to it.

Proposed fix

Mirror the LP/TLE handler pattern in CSV and Parquet:

  1. Parse rows in-process (not via DuckDB).

    • CSV: use encoding/csv or an Arrow-aware reader. The header row provides column names directly — no DESCRIBE needed.
    • Parquet: use github.com/apache/arrow-go/v18/parquet (or whatever arc already uses for compaction). Schema is in the file metadata; no DESCRIBE needed.
  2. Push rows through arrowBuffer.WriteColumnarRecord(...) — same code path as LP. This means:

    • Time-column validation moves to the row parser (fail-fast per row, not a separate up-front query).
    • Time-format casting happens in the parser, not via a DuckDB SQL expression.
    • Partition layout, hot/cold tier handling, replication, compaction triggers — all automatic via the ArrowBuffer pipeline.
  3. Post-ingest stats come from the query layer over the ingested partitions, if the response needs them at all. The current ImportResult includes time_range_min / time_range_max and partitions_created; those can be computed by the ArrowBuffer (which already knows what it wrote) without any after-the-fact SELECT against read_csv.

  4. Drop the upload directory from the sandbox allowlist once nothing in the import path issues DuckDB queries against /tmp/arc-uploads/.... Pure-Go parsing means the temp file is read by os.Open + csv.Reader (or the parquet equivalent) and deleted before any query runs.

Out of scope

  • LP and TLE imports — already use this pattern.
  • Patching getColumns SQL syntax in isolation (e.g. swap SELECT ... FROM (DESCRIBE ...) for pragma table_info(...)) — would fix the symptom without fixing the architecture. Don't.
  • Audit of every other DuckDB query that touches external files — separate issue; this one is about the import path specifically.

Severity

Medium. Production CSV/Parquet imports are broken today against the current DuckDB version + sandbox config. Workaround: convert source data to line protocol and use import lp, which works. But that's a 1-format workaround for an API that ostensibly supports 3 formats.

Test plan (for the eventual fix)

  • CSV import against a 3-row file with explicit --time-column + --time-format produces the same ImportResult shape as today (rows_imported, partitions_created, time_range_min/max, columns, duration_ms).
  • CSV import against a 0-byte file fails with "file is empty" (the same 400 LP returns today), not a parser error.
  • CSV with delimiter=';' and skip_rows=1 ingests correctly.
  • Parquet import against a basic file with time column produces matching ImportResult.
  • Imported data is queryable via arcctl query (or POST /api/v1/query) against the same database/measurement.
  • Sandbox allowed_directories no longer contains the upload temp dir, and CSV/Parquet imports still succeed (proves the temp file isn't being read via DuckDB anymore).
  • LP and TLE imports continue to work unchanged.

Discovered while

Building arcctl import csv in Basekick-Labs/arcctl#3. arcctl correctly surfaces the verbatim 422 from the server; this is purely an arc-side fix.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions