
Commit 152cd6b

ZeroBus - File Mode Prototype DAB template (#112)
This is a DAB template that deploys resources to the customer's workspace and invokes script jobs to set up the file push endpoints. No new API or SQL syntax is introduced.
1 parent a8ae912 commit 152cd6b

15 files changed

Lines changed: 925 additions & 0 deletions


Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# Zerobus - File Mode

This is an (experimental) template for creating a file push pipeline with Databricks Asset Bundles.

Install it using

```
databricks bundle init --template-dir contrib/templates/file-push https://github.com/databricks/bundle-examples
```

and follow the generated README.md to get started.
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
{
  "welcome_message": "\nWelcome to the file-push template for Databricks Asset Bundles!\n\nA workspace was selected based on your current profile. For information about how to change this, see https://docs.databricks.com/dev-tools/cli/profiles.html.\nworkspace_host: {{workspace_host}}",
  "properties": {
    "catalog_name": {
      "type": "string",
      "description": "\nPlease provide the name of an EXISTING UC catalog with default storage enabled.\nCatalog Name",
      "order": 1,
      "default": "main",
      "pattern": "^[a-z_][a-z0-9_]{0,254}$",
      "pattern_match_failure_message": "Name must consist only of lowercase letters, numbers, and underscores, and must not start with a number."
    },
    "schema_name": {
      "type": "string",
      "description": "\nPlease provide a NEW schema name where the pipelines and tables will land.\nSchema Name",
      "order": 2,
      "default": "filepushschema",
      "pattern": "^[a-z_][a-z0-9_]{0,254}$",
      "pattern_match_failure_message": "Name must consist only of lowercase letters, numbers, and underscores, and must not start with a number."
    }
  },
  "success_message": "\nBundle folder '{{.catalog_name}}.{{.schema_name}}' has been created. Please refer to the README.md for next steps."
}
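
The `pattern` used for both prompts can be tried outside the CLI to see which names it accepts. The following is a minimal Python sketch; the helper name `is_valid_name` is hypothetical, and only the regular expression itself is taken from the schema above.

```python
import re

# Pattern copied verbatim from the template schema above.
NAME_PATTERN = re.compile(r"^[a-z_][a-z0-9_]{0,254}$")


def is_valid_name(name: str) -> bool:
    """Return True if `name` satisfies the catalog/schema name pattern."""
    # fullmatch guards against a trailing newline slipping past `$`.
    return NAME_PATTERN.fullmatch(name) is not None


# Lowercase letters, digits and underscores are accepted; uppercase letters,
# dashes and a leading digit are rejected.
for candidate in ["main", "filepushschema", "My_Schema", "file-push", "1st"]:
    print(f"{candidate!r}: {is_valid_name(candidate)}")
```

Note that the pattern allows at most 255 characters in total (one leading character plus up to 254 more).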
Lines changed: 164 additions & 0 deletions
@@ -0,0 +1,164 @@
# Zerobus - File Mode

A lightweight, no-code file ingestion workflow. Configure a set of tables, get a volume path for each, and drop files into those paths. Your data lands in Unity Catalog tables via Auto Loader and a Lakeflow pipeline.

## Table of Contents
- [Quick Start](#quick-start)
  - [Step 1. Configure tables](#step-1-configure-tables)
  - [Step 2. Deploy & set up](#step-2-deploy--set-up)
  - [Step 3. Retrieve endpoint & push files](#step-3-retrieve-endpoint--push-files)
- [Debug Table Issues](#debug-table-issues)
  - [Step 1. Configure tables to debug](#step-1-configure-tables-to-debug)
  - [Step 2. Deploy & set up in dev mode](#step-2-deploy--set-up-in-dev-mode)
  - [Step 3. Retrieve endpoint & push files to debug](#step-3-retrieve-endpoint--push-files-to-debug)
  - [Step 4. Debug table configs](#step-4-debug-table-configs)
  - [Step 5. Fix the table configs in production](#step-5-fix-the-table-configs-in-production)

---

## Quick Start

### Step 1. Configure tables
Edit table configs in `./src/configs/tables.json`. Only `name` and `format` are required.

Currently supported formats are `csv`, `json`, `avro` and `parquet`.

For supported `format_options`, see the [Auto Loader options](https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/options). Not all options are supported here. If unsure, specify only `name` and `format`, or follow [Debug Table Issues](#debug-table-issues) to discover the correct options.

```json
[
  {
    "name": "table1",
    "format": "csv",
    "format_options": {
      "escape": "\""
    },
    "schema_hints": "id int, name string"
  },
  {
    "name": "table2",
    "format": "json"
  }
]
```

> **Tip:** Keep `schema_hints` minimal; Auto Loader can evolve the schema as new columns appear.

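
Before deploying, it can help to sanity-check `tables.json` against the two rules stated in Step 1 (required `name` and `format`, and the list of supported formats). The sketch below is illustrative only and is not part of the bundle; the function name `validate_tables` and the duplicate-name check are assumptions added for the example.

```python
import json

# Formats listed as supported in Step 1.
SUPPORTED_FORMATS = {"csv", "json", "avro", "parquet"}


def validate_tables(config_text: str) -> list[str]:
    """Return a list of problems found in a tables.json document (empty if none)."""
    tables = json.loads(config_text)
    if not isinstance(tables, list):
        return ["top-level value must be a JSON array of table objects"]

    problems = []
    seen_names = set()
    for i, table in enumerate(tables):
        if not isinstance(table, dict):
            problems.append(f"entry {i}: must be a JSON object")
            continue
        name = table.get("name")
        fmt = table.get("format")
        if not isinstance(name, str) or not name:
            problems.append(f"entry {i}: missing required field 'name'")
        elif name in seen_names:
            # Assumption: table names should be unique within one config.
            problems.append(f"entry {i}: duplicate table name '{name}'")
        else:
            seen_names.add(name)
        if not isinstance(fmt, str) or not fmt:
            problems.append(f"entry {i}: missing required field 'format'")
        elif fmt not in SUPPORTED_FORMATS:
            problems.append(f"entry {i}: unsupported format '{fmt}'")
    return problems
```

A typical use would be `validate_tables(open("src/configs/tables.json").read())`, treating an empty list as "ready to deploy". Optional keys such as `format_options` and `schema_hints` are deliberately not checked, because their accepted values are not spelled out here.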
### Step 2. Deploy & set up

```bash
databricks bundle deploy
databricks bundle run configuration_job
```

Wait for the configuration job to finish before moving on.

### Step 3. Retrieve endpoint & push files
First, open the volume in the workspace and grant write permissions on it to the client that will push files:

```bash
databricks bundle open filepush_volume
```

Fetch the volume path for uploading files to a specific table (example: `table1`):

```bash
databricks tables get {{.catalog_name}}.{{.schema_name}}.table1 --output json \
  | jq -r '.properties["filepush.table_volume_path_data"]'
```

Example output:

```text
/Volumes/{{.catalog_name}}/{{.schema_name}}/{{.catalog_name}}_{{.schema_name}}_filepush_volume/data/table1
```
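
If `jq` is not available, the same lookup can be done in Python. This is a minimal sketch that assumes the JSON shape implied by the `jq` filter above (a top-level `properties` object); the function names are hypothetical, and `fetch_table_json` simply shells out to the same CLI command shown earlier.

```python
import json
import subprocess

# Table property read by the jq filter above.
PATH_PROPERTY = "filepush.table_volume_path_data"


def volume_path_from_table_json(table_json: str) -> str:
    """Extract the upload path from `databricks tables get ... --output json` output."""
    table = json.loads(table_json)
    properties = table.get("properties") or {}
    if PATH_PROPERTY not in properties:
        raise KeyError(
            f"table has no '{PATH_PROPERTY}' property; "
            "has the configuration job finished?"
        )
    return properties[PATH_PROPERTY]


def fetch_table_json(full_table_name: str) -> str:
    """Run the Databricks CLI (must be installed and authenticated) and return its stdout."""
    result = subprocess.run(
        ["databricks", "tables", "get", full_table_name, "--output", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Combining the two, `volume_path_from_table_json(fetch_table_json("<catalog>.<schema>.table1"))` should return the same path as the shell pipeline.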

Upload files to the path above using any of the [Volumes file APIs](https://docs.databricks.com/aws/en/volumes/volume-files#methods-for-managing-files-in-volumes).

**Databricks CLI example** (destination uses the `dbfs:` scheme):

```bash
databricks fs cp /local/file/path/datafile1.csv \
  dbfs:/Volumes/{{.catalog_name}}/{{.schema_name}}/{{.catalog_name}}_{{.schema_name}}_filepush_volume/data/table1
```

**REST API example**:

```bash
# Prerequisites: export DATABRICKS_HOST and DATABRICKS_TOKEN (a personal access token)
curl -X PUT "$DATABRICKS_HOST/api/2.0/fs/files/Volumes/{{.catalog_name}}/{{.schema_name}}/{{.catalog_name}}_{{.schema_name}}_filepush_volume/data/table1/datafile1.csv" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/octet-stream" \
  --data-binary @"/local/file/path/datafile1.csv"
```
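
The same REST call can be made from Python using only the standard library. The sketch below mirrors the `curl` example (same endpoint, method and headers); the helper name `build_upload_request` is hypothetical, and it assumes `volume_dir` is an absolute `/Volumes/...` path as returned in the previous step.

```python
import urllib.parse
import urllib.request


def build_upload_request(
    host: str, token: str, volume_dir: str, filename: str, data: bytes
) -> urllib.request.Request:
    """Build (but do not send) the PUT request used to upload one file."""
    # Percent-encode the target path while keeping "/" separators intact.
    target = urllib.parse.quote(f"{volume_dir.rstrip('/')}/{filename}", safe="/")
    url = f"{host.rstrip('/')}/api/2.0/fs/files{target}"
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/octet-stream",
        },
    )
```

To actually send a file, read its bytes, build the request with the values of `DATABRICKS_HOST` and `DATABRICKS_TOKEN`, and pass it to `urllib.request.urlopen(request)`. Building the request separately from sending it keeps the URL construction easy to inspect and test.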

Within about a minute, the data should appear in the table `{{.catalog_name}}.{{.schema_name}}.table1`.

---

## Debug Table Issues
If data isn’t parsed as expected, use **dev mode** to iterate on table options safely.

### Step 1. Configure tables to debug
Configure tables as in [Step 1 of Quick Start](#step-1-configure-tables).

### Step 2. Deploy & set up in **dev mode**

```bash
databricks bundle deploy -t dev
databricks bundle run configuration_job -t dev
```

Wait for the configuration job to finish. Example output:

```text
2025-09-23 22:03:04,938 [INFO] initialization - ==========
catalog_name: {{.catalog_name}}
schema_name: dev_first_last_{{.schema_name}}
volume_path_root: /Volumes/{{.catalog_name}}/dev_first_last_{{.schema_name}}/{{.catalog_name}}_{{.schema_name}}_filepush_volume
volume_path_data: /Volumes/{{.catalog_name}}/dev_first_last_{{.schema_name}}/{{.catalog_name}}_{{.schema_name}}_filepush_volume/data
volume_path_archive: /Volumes/{{.catalog_name}}/dev_first_last_{{.schema_name}}/{{.catalog_name}}_{{.schema_name}}_filepush_volume/archive
==========
```

> **Note:** In **dev mode**, the schema name is **prefixed** (`dev_first_last_` in the example output above). Use the printed schema name for the remaining steps.

### Step 3. Retrieve endpoint & push files to debug

Get the dev volume path (note the **prefixed schema**):

```bash
databricks tables get {{.catalog_name}}.dev_first_last_{{.schema_name}}.table1 --output json \
  | jq -r '.properties["filepush.table_volume_path_data"]'
```

Example output:

```text
/Volumes/{{.catalog_name}}/dev_first_last_{{.schema_name}}/{{.catalog_name}}_{{.schema_name}}_filepush_volume/data/table1
```
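
The production and dev example outputs differ only in the schema segment of the path; the volume name keeps the unprefixed `<catalog>_<schema>_` form. A small Python sketch of that layout, inferred purely from the example outputs in this README (the function name `table_data_path` is hypothetical, and the table property remains the authoritative source for the real path):

```python
def table_data_path(catalog: str, schema: str, table: str, schema_prefix: str = "") -> str:
    """Compose the per-table upload path following the layout in the example outputs.

    `schema_prefix` is the dev-mode prefix (for example "dev_first_last_"). It is
    applied to the schema segment only, not to the volume name.
    """
    volume = f"{catalog}_{schema}_filepush_volume"
    return f"/Volumes/{catalog}/{schema_prefix}{schema}/{volume}/data/{table}"


# Production layout versus dev layout for the same table.
print(table_data_path("main", "filepushschema", "table1"))
print(table_data_path("main", "filepushschema", "table1", schema_prefix="dev_first_last_"))
```

This is mainly useful for spotting a mismatch between the path you expect and the one the table property reports.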

Then follow the upload instructions from [Quick Start → Step 3](#step-3-retrieve-endpoint--push-files) to send test files.

### Step 4. Debug table configs
Open the pipeline in the workspace:

```bash
databricks bundle open refresh_pipeline -t dev
```

Click **Edit pipeline** to launch the development UI. Open the `debug_table_config` notebook and follow its guidance to refine the table options. When satisfied, copy the final config back to `./src/configs/tables.json`.

### Step 5. Fix the table configs in production
Redeploy the updated config and run a full refresh to correct existing data for an affected table:

```bash
databricks bundle deploy
databricks bundle run refresh_pipeline --full-refresh table1
```

---

**That’s it!** You now have a managed, push-based file ingestion workflow with debuggable table configs and repeatable deployments!
Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,37 @@
1+
# databricks.yml
2+
# This is the configuration for the file push DAB dab.
3+
4+
bundle:
5+
name: {{.schema_name}}
6+
uuid: {{bundle_uuid}}
7+
8+
include:
9+
- resources/*.yml
10+
11+
targets:
12+
# The deployment targets. See https://docs.databricks.com/en/dev-tools/bundles/deployment-modes.html
13+
dev:
14+
mode: development
15+
workspace:
16+
host: {{workspace_host}}
17+
18+
prod:
19+
mode: production
20+
default: true
21+
workspace:
22+
host: {{workspace_host}}
23+
root_path: /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target}
24+
permissions:
25+
- user_name: ${workspace.current_user.userName}
26+
level: CAN_MANAGE
27+
28+
variables:
29+
catalog_name:
30+
description: The existing catalog where the NEW schema will be created.
31+
default: {{.catalog_name}}
32+
schema_name:
33+
description: The name of the NEW schema where the tables will be created.
34+
default: {{.schema_name}}
35+
resource_name_prefix:
36+
description: The prefix for the resource names.
37+
default: ${var.catalog_name}_${var.schema_name}_
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
# The main jobs for schema dab
# filetrigger_job triggers the refresh pipeline; configuration_job runs the setup script

resources:
  jobs:
    filetrigger_job:
      name: ${var.resource_name_prefix}filetrigger_job
      tasks:
        - task_key: pipeline_refresh
          pipeline_task:
            pipeline_id: ${resources.pipelines.refresh_pipeline.id}
      trigger:
        file_arrival:
          url: ${resources.volumes.filepush_volume.volume_path}/data/
    configuration_job:
      name: ${var.resource_name_prefix}configuration_job
      tasks:
        - task_key: initialization
          spark_python_task:
            python_file: ../src/utils/initialization.py
            parameters:
              - "--catalog_name"
              - "{{job.parameters.catalog_name}}"
              - "--schema_name"
              - "{{job.parameters.schema_name}}"
              - "--volume_path_root"
              - "{{job.parameters.volume_path_root}}"
              - "--logging_level"
              - "${bundle.target}"
          environment_key: serverless
        - task_key: trigger_refresh
          run_job_task:
            job_id: ${resources.jobs.filetrigger_job.id}
          depends_on:
            - task_key: initialization
      environments:
        - environment_key: serverless
          spec:
            client: "3"
      parameters:
        - name: catalog_name
          default: ${var.catalog_name}
        - name: schema_name
          default: ${resources.schemas.main_schema.name}
        - name: volume_path_root
          default: ${resources.volumes.filepush_volume.volume_path}
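
The `initialization` task receives its inputs as command-line flags. The actual `initialization.py` is not shown in this part of the diff, so the following is only a sketch of how such a script might consume them, assuming `argparse`. The flag names come from the `parameters` list above and the `data`/`archive` subpaths from the example log output in the README; the function names and the mapping from target name to log level are assumptions.

```python
import argparse
import logging


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the flags that configuration_job passes to the initialization task."""
    parser = argparse.ArgumentParser(description="File push setup (illustrative sketch)")
    parser.add_argument("--catalog_name", required=True)
    parser.add_argument("--schema_name", required=True)
    parser.add_argument("--volume_path_root", required=True)
    # The job passes ${bundle.target} here, i.e. the target name ("dev" or "prod").
    parser.add_argument("--logging_level", default="prod")
    return parser.parse_args(argv)


def logging_level_for(target: str) -> int:
    """Assumption: the dev target logs verbosely, every other target at INFO."""
    return logging.DEBUG if target == "dev" else logging.INFO


def derived_paths(volume_path_root: str) -> dict[str, str]:
    """Derive the data and archive directories printed in the example log output."""
    root = volume_path_root.rstrip("/")
    return {
        "volume_path_root": root,
        "volume_path_data": f"{root}/data",
        "volume_path_archive": f"{root}/archive",
    }
```

Note that `--logging_level` carries a target name rather than a standard logging level, so some translation of this kind is needed before it can be handed to the `logging` module.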
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
# The table refresh pipeline for schema dab

resources:
  pipelines:
    refresh_pipeline:
      name: ${var.resource_name_prefix}refresh_pipeline
      catalog: ${var.catalog_name}
      schema: ${resources.schemas.main_schema.name}
      serverless: true
      libraries:
        - file:
            path: ../src/ingestion.py
      root_path: ../src
      configuration:
        lakeflow.experimantal.filepush.version: 0.1
        filepush.volume_path_root: ${resources.volumes.filepush_volume.volume_path}
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
# The schema dab

resources:
  schemas:
    main_schema:
      name: ${var.schema_name}
      catalog_name: ${var.catalog_name}
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
# The file staging volume for schema dab

resources:
  volumes:
    filepush_volume:
      name: ${var.resource_name_prefix}filepush_volume
      catalog_name: ${var.catalog_name}
      schema_name: ${var.schema_name}
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
[
  {
    "name": "example_table_csv",
    "format": "csv"
  },
  {
    "name": "example_table_json",
    "format": "json"
  },
  {
    "name": "example_table_avro",
    "format": "avro"
  },
  {
    "name": "example_table_parquet",
    "format": "parquet"
  }
]
