You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The CSV and Parquet import handlers (internal/api/import.go#handleCSVImport, handleParquetImport) currently introspect the uploaded temp file by issuing DuckDB queries against read_csv(...) / read_parquet(...). This is architecturally inconsistent with the LP and TLE import handlers (which parse rows in-process and push through ArrowBuffer), and it's fragile under the new DuckDB sandbox model from #442.
The immediate symptom — surfaced while building arcctl PR #3 — is that CSV imports fail with a DuckDB parser error before any data lands:
arc: failed to read file: query failed: Parser Error: syntax error at or near "("
LINE 1: SELECT column_name FROM (DESCRIBE read_csv('/Users/nacho/dev/basekick-labs/arc/.tmp/arc-uploads/arc-import-XXXX/import.csv', auto_detect=true, header=true))
^
Came back as HTTP 422 from POST /api/v1/import/csv against a tiny 91-byte CSV with valid headers + 3 rows + every flag at defaults. Same failure against a 0-byte CSV (errors during introspection, before the empty-file check fires).
Same handler also fails for Parquet (untested in PR #3 but uses the identical getColumns → read_parquet(...) path).
"You don't query the CSV, you query the data ingested through /api/v1/query. There is no CSV in arc when the data is ingested."
That's the canonical contract: arc serves queries over its own Parquet partitions, not over arbitrary uploaded files. The current CSV/Parquet import handlers violate this by routing the file through DuckDB's read_csv / read_parquet directly:
// internal/api/import.go (current)readExpr:=h.buildReadExpression(filePath, opts) // → read_csv('/tmp/.../import.csv', ...)// 1. schema introspection — direct DuckDB query against temp filecolumns, err:=h.getColumns(readExpr) // SELECT column_name FROM (DESCRIBE read_csv(...))// 3. stats — another direct DuckDB query against temp filestatsQuery:=fmt.Sprintf("SELECT COUNT(*), MIN(%s)::VARCHAR, MAX(%s)::VARCHAR FROM %s", timeCast, timeCast, readExpr)
This works only because the upload directory is in the sandbox allowed_directories list (added in #442). It's still:
Inconsistent with LP/TLE — both use arrowBuffer.WriteColumnarRecord(...) and never issue DuckDB queries against the temp file. LP parses in-process via ingest.NewLineProtocolParser; TLE has its own parser.
Fragile against DuckDB version drift — the SELECT column_name FROM (DESCRIBE ...) SQL is rejected by the DuckDB version arc ships today (duckdb_version=v1.5.1 per the smoke). DESCRIBE can't always be used as a derived table.
Mirror the LP/TLE handler pattern in CSV and Parquet:
Parse rows in-process (not via DuckDB).
CSV: use encoding/csv or an Arrow-aware reader. The header row provides column names directly — no DESCRIBE needed.
Parquet: use github.com/apache/arrow-go/v18/parquet (or whatever arc already uses for compaction). Schema is in the file metadata; no DESCRIBE needed.
Push rows through arrowBuffer.WriteColumnarRecord(...) — same code path as LP. This means:
Time-column validation moves to the row parser (fail-fast per row, not a separate up-front query).
Time-format casting happens in the parser, not via a DuckDB SQL expression.
Partition layout, hot/cold tier handling, replication, compaction triggers — all automatic via the ArrowBuffer pipeline.
Post-ingest stats come from the query layer over the ingested partitions, if the response needs them at all. The current ImportResult includes time_range_min / time_range_max and partitions_created; those can be computed by the ArrowBuffer (which already knows what it wrote) without any after-the-fact SELECT against read_csv.
Drop the upload directory from the sandbox allowlist once nothing in the import path issues DuckDB queries against /tmp/arc-uploads/.... Pure-Go parsing means the temp file is read by os.Open + csv.Reader (or the parquet equivalent) and deleted before any query runs.
Out of scope
LP and TLE imports — already use this pattern.
Patching getColumns SQL syntax in isolation (e.g. swap SELECT ... FROM (DESCRIBE ...) for pragma table_info(...)) — would fix the symptom without fixing the architecture. Don't.
Audit of every other DuckDB query that touches external files — separate issue; this one is about the import path specifically.
Severity
Medium. Production CSV/Parquet imports are broken today against the current DuckDB version + sandbox config. Workaround: convert source data to line protocol and use import lp, which works. But that's a 1-format workaround for an API that ostensibly supports 3 formats.
Test plan (for the eventual fix)
CSV import against a 3-row file with explicit --time-column + --time-format produces the same ImportResult shape as today (rows_imported, partitions_created, time_range_min/max, columns, duration_ms).
CSV import against a 0-byte file fails with "file is empty" (the same 400 LP returns today), not a parser error.
CSV with delimiter=';' and skip_rows=1 ingests correctly.
Parquet import against a basic file with time column produces matching ImportResult.
Imported data is queryable via arcctl query (or POST /api/v1/query) against the same database/measurement.
Sandbox allowed_directories no longer contains the upload temp dir, and CSV/Parquet imports still succeed (proves the temp file isn't being read via DuckDB anymore).
LP and TLE imports continue to work unchanged.
Discovered while
Building arcctl import csv in Basekick-Labs/arcctl#3. arcctl correctly surfaces the verbatim 422 from the server; this is purely an arc-side fix.
Summary
The CSV and Parquet import handlers (
internal/api/import.go#handleCSVImport,handleParquetImport) currently introspect the uploaded temp file by issuing DuckDB queries againstread_csv(...)/read_parquet(...). This is architecturally inconsistent with the LP and TLE import handlers (which parse rows in-process and push throughArrowBuffer), and it's fragile under the new DuckDB sandbox model from #442.The immediate symptom — surfaced while building arcctl PR #3 — is that CSV imports fail with a DuckDB parser error before any data lands:
Came back as HTTP 422 from
POST /api/v1/import/csvagainst a tiny 91-byte CSV with valid headers + 3 rows + every flag at defaults. Same failure against a 0-byte CSV (errors during introspection, before the empty-file check fires).Same handler also fails for Parquet (untested in PR #3 but uses the identical
getColumns→read_parquet(...)path).Architecture problem (the bigger fix)
Per @xe-nvdk's clarification on PR #3:
That's the canonical contract: arc serves queries over its own Parquet partitions, not over arbitrary uploaded files. The current CSV/Parquet import handlers violate this by routing the file through DuckDB's
read_csv/read_parquetdirectly:This works only because the upload directory is in the sandbox
allowed_directorieslist (added in #442). It's still:arrowBuffer.WriteColumnarRecord(...)and never issue DuckDB queries against the temp file. LP parses in-process viaingest.NewLineProtocolParser; TLE has its own parser.SELECT column_name FROM (DESCRIBE ...)SQL is rejected by the DuckDB version arc ships today (duckdb_version=v1.5.1per the smoke).DESCRIBEcan't always be used as a derived table.Proposed fix
Mirror the LP/TLE handler pattern in CSV and Parquet:
Parse rows in-process (not via DuckDB).
encoding/csvor an Arrow-aware reader. The header row provides column names directly — noDESCRIBEneeded.github.com/apache/arrow-go/v18/parquet(or whatever arc already uses for compaction). Schema is in the file metadata; noDESCRIBEneeded.Push rows through
arrowBuffer.WriteColumnarRecord(...)— same code path as LP. This means:Post-ingest stats come from the query layer over the ingested partitions, if the response needs them at all. The current
ImportResultincludestime_range_min/time_range_maxandpartitions_created; those can be computed by the ArrowBuffer (which already knows what it wrote) without any after-the-factSELECTagainstread_csv.Drop the upload directory from the sandbox allowlist once nothing in the import path issues DuckDB queries against
/tmp/arc-uploads/.... Pure-Go parsing means the temp file is read byos.Open+csv.Reader(or the parquet equivalent) and deleted before any query runs.Out of scope
getColumnsSQL syntax in isolation (e.g. swapSELECT ... FROM (DESCRIBE ...)forpragma table_info(...)) — would fix the symptom without fixing the architecture. Don't.Severity
Medium. Production CSV/Parquet imports are broken today against the current DuckDB version + sandbox config. Workaround: convert source data to line protocol and use
import lp, which works. But that's a 1-format workaround for an API that ostensibly supports 3 formats.Test plan (for the eventual fix)
--time-column+--time-formatproduces the sameImportResultshape as today (rows_imported, partitions_created, time_range_min/max, columns, duration_ms)."file is empty"(the same 400 LP returns today), not a parser error.delimiter=';'andskip_rows=1ingests correctly.timecolumn produces matchingImportResult.arcctl query(orPOST /api/v1/query) against the same database/measurement.allowed_directoriesno longer contains the upload temp dir, and CSV/Parquet imports still succeed (proves the temp file isn't being read via DuckDB anymore).Discovered while
Building
arcctl import csvin Basekick-Labs/arcctl#3. arcctl correctly surfaces the verbatim 422 from the server; this is purely an arc-side fix.