| 1 | +# Zerobus - File Mode |
| 2 | + |
| 3 | +A lightweight, no‑code file ingestion workflow. Configure a set of tables, get a volume path for each, and drop files into those paths; your data then lands in Unity Catalog tables via Auto Loader and a Lakeflow pipeline. |
| 4 | + |
| 5 | +## Table of Contents |
| 6 | +- [Quick Start](#quick-start) |
| 7 | +  - [Step 1. Configure tables](#step-1-configure-tables) |
| 8 | +  - [Step 2. Deploy & set up](#step-2-deploy--set-up) |
| 9 | +  - [Step 3. Retrieve endpoint & push files](#step-3-retrieve-endpoint--push-files) |
| 10 | +- [Debug Table Issues](#debug-table-issues) |
| 11 | +  - [Step 1. Configure tables to debug](#step-1-configure-tables-to-debug) |
| 12 | +  - [Step 2. Deploy & set up in dev mode](#step-2-deploy--set-up-in-dev-mode) |
| 13 | +  - [Step 3. Retrieve endpoint & push files to debug](#step-3-retrieve-endpoint--push-files-to-debug) |
| 14 | +  - [Step 4. Debug table configs](#step-4-debug-table-configs) |
| 15 | +  - [Step 5. Fix the table configs in production](#step-5-fix-the-table-configs-in-production) |
| 16 | + |
| 17 | +--- |
| 18 | + |
| 19 | +## Quick Start |
| 20 | + |
| 21 | +### Step 1. Configure tables |
| 22 | +Edit table configs in `./src/configs/tables.json`. Only `name` and `format` are required. |
| 23 | + |
| 24 | +Currently supported formats are `csv`, `json`, `avro` and `parquet`. |
| 25 | + |
| 26 | +For supported `format_options`, see the [Auto Loader options](https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/options). Not all options are supported here. If unsure, specify only `name` and `format`, or follow [Debug Table Issues](#debug-table-issues) to discover the correct options. |
| 27 | + |
| 28 | +```json |
| 29 | +[ |
| 30 | +  { |
| 31 | +    "name": "table1", |
| 32 | +    "format": "csv", |
| 33 | +    "format_options": |
| 34 | +    { |
| 35 | +      "escape": "\"" |
| 36 | +    }, |
| 37 | +    "schema_hints": "id int, name string" |
| 38 | +  }, |
| 39 | +  { |
| 40 | +    "name": "table2", |
| 41 | +    "format": "json" |
| 42 | +  } |
| 43 | +] |
| 44 | +``` |
| 45 | + |
| 46 | +> **Tip:** Keep `schema_hints` minimal; Auto Loader can evolve the schema as new columns appear. |
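
A quick local check before deploying can catch a missing key or a mistyped format. The sketch below is not part of the bundle; `validate_tables` and `EXAMPLE` are hypothetical names, and the only rules it enforces are the ones stated above (`name` and `format` are required, and `format` must be one of the four listed values).

```python
import json

# Formats listed as supported in this README.
SUPPORTED_FORMATS = {"csv", "json", "avro", "parquet"}
REQUIRED_KEYS = ("name", "format")


def validate_tables(tables):
    """Return a list of problems found in a parsed tables.json; empty means OK."""
    if not isinstance(tables, list):
        return ["top-level value must be a JSON array of table objects"]
    problems = []
    for index, table in enumerate(tables):
        if not isinstance(table, dict):
            problems.append(f"entry {index}: expected an object")
            continue
        # Both required keys must be present and non-empty.
        for key in REQUIRED_KEYS:
            if not table.get(key):
                problems.append(f"entry {index}: missing required key '{key}'")
        fmt = table.get("format")
        if fmt and fmt not in SUPPORTED_FORMATS:
            problems.append(f"entry {index}: unsupported format '{fmt}'")
    return problems


# Same content as the example config above (hypothetical inline copy).
EXAMPLE = r'''
[
  {"name": "table1", "format": "csv",
   "format_options": {"escape": "\""},
   "schema_hints": "id int, name string"},
  {"name": "table2", "format": "json"}
]
'''

print(validate_tables(json.loads(EXAMPLE)))  # prints []
```

To check the real file, load it with `json.load(open("./src/configs/tables.json"))` and pass the result to `validate_tables`. The check does not validate `format_options`, since the set of supported options is not listed here.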
| 47 | + |
| 48 | +### Step 2. Deploy & set up |
| 49 | + |
| 50 | +```bash |
| 51 | +databricks bundle deploy |
| 52 | +databricks bundle run configuration_job |
| 53 | +``` |
| 54 | + |
| 55 | +Wait for the configuration job to finish before moving on. |
| 56 | + |
| 57 | +### Step 3. Retrieve endpoint & push files |
| 58 | +First, open the volume in the workspace and grant write permission to the client that will push files: |
| 59 | + |
| 60 | +```bash |
| 61 | +databricks bundle open filepush_volume |
| 62 | +``` |
| 63 | + |
| 64 | +Fetch the volume path for uploading files to a specific table (example: `table1`): |
| 65 | + |
| 66 | +```bash |
| 67 | +databricks tables get {{.catalog_name}}.{{.schema_name}}.table1 --output json \ |
| 68 | + | jq -r '.properties["filepush.table_volume_path_data"]' |
| 69 | +``` |
| 70 | + |
| 71 | +Example output: |
| 72 | + |
| 73 | +```text |
| 74 | +/Volumes/{{.catalog_name}}/{{.schema_name}}/{{.catalog_name}}_{{.schema_name}}_filepush_volume/data/table1 |
| 75 | +``` |
| 76 | + |
| 77 | +Upload files to the path above using any of the [Volumes file APIs](https://docs.databricks.com/aws/en/volumes/volume-files#methods-for-managing-files-in-volumes). |
| 78 | + |
| 79 | +**Databricks CLI example** (destination uses the `dbfs:` scheme): |
| 80 | + |
| 81 | +```bash |
| 82 | +databricks fs cp /local/file/path/datafile1.csv \ |
| 83 | + dbfs:/Volumes/{{.catalog_name}}/{{.schema_name}}/{{.catalog_name}}_{{.schema_name}}_filepush_volume/data/table1 |
| 84 | +``` |
| 85 | + |
| 86 | +**REST API example**: |
| 87 | + |
| 88 | +```bash |
| 89 | +# Prerequisites: export DATABRICKS_HOST and DATABRICKS_TOKEN (a personal access token) |
| 90 | +curl -X PUT "$DATABRICKS_HOST/api/2.0/fs/files/Volumes/{{.catalog_name}}/{{.schema_name}}/{{.catalog_name}}_{{.schema_name}}_filepush_volume/data/table1/datafile1.csv" \ |
| 91 | + -H "Authorization: Bearer $DATABRICKS_TOKEN" \ |
| 92 | + -H "Content-Type: application/octet-stream" \ |
| 93 | + --data-binary @"/local/file/path/datafile1.csv" |
| 94 | +``` |
| 95 | + |
| 96 | +Within about a minute, the data should appear in the table `{{.catalog_name}}.{{.schema_name}}.table1`. |
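
For clients that push files from a script, the same `PUT` request can be issued from Python's standard library. This is a minimal sketch mirroring the `curl` call above, not an official client; `build_upload_url` and `upload_file` are hypothetical helper names, and the host, catalog, and schema values in the usage note are placeholders.

```python
import os
import posixpath
import urllib.parse
import urllib.request


def build_upload_url(host, volume_dir, filename):
    """Build the Files API URL for `filename` inside the table's volume directory."""
    target = posixpath.join(volume_dir, filename)
    # quote() percent-encodes spaces and similar characters but keeps "/" intact.
    return host.rstrip("/") + "/api/2.0/fs/files" + urllib.parse.quote(target)


def upload_file(host, token, volume_dir, local_path):
    """PUT one local file into the volume directory and return the HTTP status."""
    url = build_upload_url(host, volume_dir, os.path.basename(local_path))
    with open(local_path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/octet-stream",
        },
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A call would look like `upload_file(os.environ["DATABRICKS_HOST"], os.environ["DATABRICKS_TOKEN"], "/Volumes/<catalog>/<schema>/<catalog>_<schema>_filepush_volume/data/table1", "/local/file/path/datafile1.csv")`. The sketch reads the whole file into memory, so it suits small files; the CLI is probably the better choice for large ones.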
| 97 | + |
| 98 | +--- |
| 99 | + |
| 100 | +## Debug Table Issues |
| 101 | +If data isn’t parsed as expected, use **dev mode** to iterate on table options safely. |
| 102 | + |
| 103 | +### Step 1. Configure tables to debug |
| 104 | +Configure tables as in [Step 1 of Quick Start](#step-1-configure-tables). |
| 105 | + |
| 106 | +### Step 2. Deploy & set up in **dev mode** |
| 107 | + |
| 108 | +```bash |
| 109 | +databricks bundle deploy -t dev |
| 110 | +databricks bundle run configuration_job -t dev |
| 111 | +``` |
| 112 | + |
| 113 | +Wait for the configuration job to finish. Example output: |
| 114 | + |
| 115 | +```text |
| 116 | +2025-09-23 22:03:04,938 [INFO] initialization - ========== |
| 117 | +catalog_name: {{.catalog_name}} |
| 118 | +schema_name: dev_first_last_{{.schema_name}} |
| 119 | +volume_path_root: /Volumes/{{.catalog_name}}/dev_first_last_{{.schema_name}}/{{.catalog_name}}_{{.schema_name}}_filepush_volume |
| 120 | +volume_path_data: /Volumes/{{.catalog_name}}/dev_first_last_{{.schema_name}}/{{.catalog_name}}_{{.schema_name}}_filepush_volume/data |
| 121 | +volume_path_archive: /Volumes/{{.catalog_name}}/dev_first_last_{{.schema_name}}/{{.catalog_name}}_{{.schema_name}}_filepush_volume/archive |
| 122 | +========== |
| 123 | +``` |
| 124 | + |
| 125 | +> **Note:** In **dev mode**, the schema name is **prefixed**. Use the printed schema name for the remaining steps. |
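
If the remaining steps are scripted, the prefixed schema name can be pulled out of the job output instead of being copied by hand. The sketch below assumes the output keeps the `key: value` layout shown in the example above, which may change between versions; `parse_initialization_block` is a hypothetical helper, and the catalog and schema values in `SAMPLE` are made up.

```python
import re

# Matches "key: value" lines such as "schema_name: dev_first_last_ingest".
# The timestamped header line does not match because "2025" is followed by "-".
_KEY_VALUE = re.compile(r"^(\w+): (\S.*)$")


def parse_initialization_block(log_text):
    """Collect the key/value settings printed by the configuration job."""
    settings = {}
    for line in log_text.splitlines():
        match = _KEY_VALUE.match(line.strip())
        if match:
            settings[match.group(1)] = match.group(2).strip()
    return settings


# Hypothetical sample with the template placeholders filled in.
SAMPLE = """\
2025-09-23 22:03:04,938 [INFO] initialization - ==========
catalog_name: main
schema_name: dev_first_last_ingest
volume_path_root: /Volumes/main/dev_first_last_ingest/main_ingest_filepush_volume
==========
"""

settings = parse_initialization_block(SAMPLE)
print(settings["schema_name"])  # prints dev_first_last_ingest
```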
| 126 | + |
| 127 | +### Step 3. Retrieve endpoint & push files to debug |
| 128 | + |
| 129 | +Get the dev volume path (note the **prefixed schema**): |
| 130 | + |
| 131 | +```bash |
| 132 | +databricks tables get {{.catalog_name}}.dev_first_last_{{.schema_name}}.table1 --output json \ |
| 133 | + | jq -r '.properties["filepush.table_volume_path_data"]' |
| 134 | +``` |
| 135 | + |
| 136 | +Example output: |
| 137 | + |
| 138 | +```text |
| 139 | +/Volumes/{{.catalog_name}}/dev_first_last_{{.schema_name}}/{{.catalog_name}}_{{.schema_name}}_filepush_volume/data/table1 |
| 140 | +``` |
| 141 | + |
| 142 | +Then follow the upload instructions from [Quick Start → Step 3](#step-3-retrieve-endpoint--push-files) to send test files. |
| 143 | + |
| 144 | +### Step 4. Debug table configs |
| 145 | +Open the pipeline in the workspace: |
| 146 | + |
| 147 | +```bash |
| 148 | +databricks bundle open refresh_pipeline -t dev |
| 149 | +``` |
| 150 | + |
| 151 | +Click **Edit pipeline** to launch the development UI. Open the `debug_table_config` notebook and follow its guidance to refine the table options. When satisfied, copy the final config back to `./src/configs/tables.json`. |
| 152 | + |
| 153 | +### Step 5. Fix the table configs in production |
| 154 | +Redeploy the updated config and run a full refresh to correct existing data for an affected table: |
| 155 | + |
| 156 | +```bash |
| 157 | +databricks bundle deploy |
| 158 | +databricks bundle run refresh_pipeline --full-refresh table1 |
| 159 | +``` |
| 160 | + |
| 161 | +--- |
| 162 | + |
| 163 | +**That’s it!** You now have a managed, push-based file ingestion workflow with debuggable table configs and repeatable deployments! |
| 164 | + |