Open
27 commits
ea9fac3
docs: add base food catalog crawler design spec
OrellBuehler May 18, 2026
3d82f18
docs: revise base food catalog spec after design review
OrellBuehler May 18, 2026
7cfa238
docs: add base food catalog foundation + integration plan
OrellBuehler May 21, 2026
a733fde
feat: add catalog schema (datasets/foods/access) + pg_trgm index
OrellBuehler May 21, 2026
2b86fa4
test: tighten catalog drift-guard + register catalog tables in migrat…
OrellBuehler May 21, 2026
a4fb1f5
feat: add catalog dataset JSONL Zod schema
OrellBuehler May 21, 2026
2994835
feat: add catalog:import CLI (validated JSONL, batched replace, GIN r…
OrellBuehler May 21, 2026
01fe672
feat: add catalog:grant/revoke/list CLI subcommands
OrellBuehler May 21, 2026
0576077
chore: gitignore + prek guard against committing catalog datasets
OrellBuehler May 21, 2026
4335b28
refactor: extract shared nutrient-extraction helper from openfoodfacts
OrellBuehler May 21, 2026
dcad285
feat: add access-gated catalog query layer (search/barcode/instantiate)
OrellBuehler May 21, 2026
2abfbc9
refactor: thread optional db override through createFood; route catal…
OrellBuehler May 21, 2026
b6f2669
feat: add /api/catalog search/barcode/save endpoints + openapi
OrellBuehler May 21, 2026
00439de
feat: add i18n strings for catalog picker
OrellBuehler May 21, 2026
232a290
feat: surface online catalog results with source badge in FoodPicker
OrellBuehler May 21, 2026
a46f22d
feat: catalog pick instantiates a personal food then logs (copy-on-use)
OrellBuehler May 21, 2026
c88a38a
feat: barcode scan checks catalog before Open Food Facts fallback
OrellBuehler May 21, 2026
7a7e0b3
chore: catalog UX/OpenAPI polish — toast for save error; uuid format …
OrellBuehler May 21, 2026
5f87752
feat(crawler): scaffold Bun package + shared lib (normalize, jsonl, h…
OrellBuehler May 29, 2026
d2bb95b
feat(crawler): OFF bulk-dump adapter (Swiss/food filter + streaming c…
OrellBuehler May 29, 2026
7c636c8
feat(crawler): Migros adapter (normalizer, crawl loop, migros-api-wra…
OrellBuehler May 29, 2026
33d957f
feat(crawler): CLI entrypoint (off|migros) + end-to-end test
OrellBuehler May 29, 2026
0a40900
fix(crawler): address review — writer try/finally, checkpoint-after-y…
OrellBuehler May 29, 2026
276cfb9
chore: ignore entire .claude directory
OrellBuehler May 30, 2026
e183530
Merge pull request #252 from OrellBuehler/feat/catalog-crawler
OrellBuehler May 30, 2026
8fadab6
feat: surface Open Food Facts search in FoodPicker with copy-on-use save
OrellBuehler May 30, 2026
46325b2
fix: rate limiter returns 429 instead of 500
OrellBuehler May 30, 2026
8 changes: 7 additions & 1 deletion .gitignore
@@ -30,7 +30,9 @@ vite.config.ts.timestamp-*
 
 # Worktrees
 .worktrees/
-.claude/worktrees/
 
+# Claude Code
+.claude/
+
 # Superpowers
 .superpowers/
@@ -61,3 +63,7 @@ mobile/iosApp/*.xcodeproj
 
 # OpenAPI generator temp output
 .openapi-gen-tmp/
+
+# Crawled catalog datasets — never commit (public repo; private data)
+data/catalog/
+tests/fixtures/catalog/bad.jsonl
8 changes: 8 additions & 0 deletions .pre-commit-config.yaml
@@ -43,3 +43,11 @@ repos:
         entry: gitleaks protect --staged --verbose --redact
         language: system
         pass_filenames: false
+
+      - id: no-catalog-data
+        name: no committed catalog datasets
+        entry: >-
+          bash -c 'if git diff --cached --name-only | grep -qE "^data/catalog/.*\.jsonl$";
+          then echo "ERROR - catalog dataset files must not be committed (public repo)"; exit 1; fi'
+        language: system
+        pass_filenames: false
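
Review note: to see which staged paths the `no-catalog-data` pattern would catch, the same regular expression can be exercised in isolation. This is a minimal sketch, not part of the PR; `isBlockedByHook` and the sample paths (including the dated file names) are hypothetical. It suggests that a `.jsonl.gz` file under `data/catalog/` and the ignored fixture `tests/fixtures/catalog/bad.jsonl` would not trip the hook and rely on `.gitignore` alone.

```typescript
// Same pattern as the hook's `grep -qE "^data/catalog/.*\.jsonl$"`.
const CATALOG_DATASET = /^data\/catalog\/.*\.jsonl$/;

// Hypothetical helper: true when the hook's pattern would reject this staged path.
export function isBlockedByHook(stagedPath: string): boolean {
  return CATALOG_DATASET.test(stagedPath);
}

// Hypothetical staged paths, for illustration only.
export const samplePaths = [
  'data/catalog/migros-2026-05-29.jsonl',
  'data/catalog/off-ch-2026-05-29.jsonl.gz',
  'tests/fixtures/catalog/bad.jsonl',
  'crawler/README.md'
];

export function describe(paths: string[]): string[] {
  return paths.map((p) => `${p} -> ${isBlockedByHook(p) ? 'blocked' : 'allowed'}`);
}
```

If compressed datasets should also be rejected, one option would be to widen the pattern (for example to everything under `data/catalog/`); whether that is desirable is a call for the author.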
67 changes: 67 additions & 0 deletions crawler/README.md
@@ -0,0 +1,67 @@
# Bissbilanz Catalog Crawler

Offline tool that builds **catalog datasets** (normalized JSONL) for the access-gated
base food catalog. It is **not part of the SvelteKit app**, its build, or `bun run security`
scope — nothing under `src/` imports it.

## Legal posture

- **Private use, no redistribution.** Crawler _code_ ships in this repo; crawled _data_ never
does. Datasets are written under `data/catalog/` which is git-ignored and rejected by a
pre-commit hook (`no-catalog-data`).
- Output is imported only into this app's database and surfaced only to its authenticated,
individually access-granted users. It is not rehosted or redistributed.
- Retailer images are referenced by source URL only — never rehosted.
- Sources are accessed politely: fixed-delay throttling, on-disk response caching, descriptive
User-Agent, exponential-backoff retry.
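
The throttling and retry behaviour described in the last bullet could look roughly like the sketch below. It is an illustration under assumptions, not the crawler's actual helper: `politeCall`, its option names, and the default delays are hypothetical, and response caching and the User-Agent header are left out.

```typescript
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export type PoliteOptions = {
  delayMs?: number; // fixed pause before every request (assumed default)
  retries?: number; // number of retries after the first failed attempt
  baseBackoffMs?: number; // first backoff step; doubles on each retry
  sleep?: Sleep; // injectable so tests do not wait in real time
};

export async function politeCall<T>(fn: () => Promise<T>, opts: PoliteOptions = {}): Promise<T> {
  const { delayMs = 1000, retries = 3, baseBackoffMs = 500, sleep = realSleep } = opts;
  await sleep(delayMs); // fixed-delay throttle
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // retries exhausted: surface the last error
      await sleep(baseBackoffMs * 2 ** attempt); // 500, 1000, 2000, ...
    }
  }
}
```

Injecting `sleep` mirrors the `sleep: async () => {}` option the crawl tests pass, which keeps such a helper testable without real waiting.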

## Dataset format

One JSONL file per dataset. Line 1 is a `{ "_dataset": { ... } }` header; every following line
holds exactly one product. The contract is the shared Zod schema
`src/lib/server/catalog/dataset-schema.ts`. The crawler validates every emitted row against it,
so a produced file always imports cleanly (`catalog:import` is fail-closed).
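
As a rough illustration of the header-plus-rows layout, the sketch below reads a dataset from a string and rejects the whole file on the first bad line. It is not the real importer: the actual contract is the Zod schema named above, `parseDataset` is a hypothetical name, and the product fields in the sample (`name`, `barcode`) are assumptions.

```typescript
export type ParsedDataset = { header: Record<string, unknown>; products: unknown[] };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Fail-closed: any malformed line aborts the whole parse instead of being skipped.
export function parseDataset(text: string): ParsedDataset {
  const [headerLine, ...rows] = text.split('\n').filter((line) => line.trim() !== '');
  if (headerLine === undefined) throw new Error('empty dataset');

  const first: unknown = JSON.parse(headerLine);
  if (!isRecord(first) || !isRecord(first._dataset)) {
    throw new Error('line 1: expected a { "_dataset": { ... } } header');
  }

  const products = rows.map((row, i) => {
    try {
      return JSON.parse(row) as unknown;
    } catch {
      throw new Error(`line ${i + 2}: invalid JSON`);
    }
  });
  return { header: first._dataset, products };
}

// Hypothetical two-product dataset.
export const sample = [
  '{"_dataset":{"key":"migros"}}',
  '{"name":"M-Classic Vollmilch UHT","barcode":"7610200000001"}',
  '{"name":"B","barcode":"7610200000002"}'
].join('\n');
```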

## Usage

```bash
cd crawler
bun install # installs migros-api-wrapper (Migros source only)

# Open Food Facts — from a downloaded ODbL bulk dump (.jsonl or .jsonl.gz):
# download once from https://world.openfoodfacts.org/data (openfoodfacts-products.jsonl.gz)
bun run crawl off /path/to/openfoodfacts-products.jsonl.gz
# → writes data/catalog/off-ch-<date>.jsonl (Swiss products with full core macros)

# Migros — live API (polite, throttled):
bun run crawl migros
# → writes data/catalog/migros-<date>.jsonl
```

The OFF dump is large (tens of GB uncompressed); the crawler streams it (gunzip + line split),
never loading it into memory. The Migros crawl is live and rate-limited — expect it to take a
while; it checkpoints progress.
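
The "gunzip + line split" streaming can be sketched as follows. This is a minimal sketch under assumptions rather than the crawler's actual code: `openDump` and `splitLines` are hypothetical names, and error propagation between the piped streams is omitted for brevity.

```typescript
import { createReadStream } from 'node:fs';
import { createGunzip } from 'node:zlib';

// Yields complete, non-empty lines while holding only the current partial line in memory.
export async function* splitLines(chunks: AsyncIterable<Uint8Array>): AsyncGenerator<string> {
  const decoder = new TextDecoder();
  let pending = '';
  for await (const chunk of chunks) {
    // `stream: true` keeps multi-byte characters intact across chunk boundaries.
    pending += decoder.decode(chunk, { stream: true });
    const parts = pending.split('\n');
    pending = parts.pop() ?? ''; // last piece may be an incomplete line
    for (const line of parts) if (line.trim() !== '') yield line;
  }
  pending += decoder.decode(); // flush any buffered bytes
  if (pending.trim() !== '') yield pending;
}

// Opens a .jsonl or .jsonl.gz dump as a byte stream (decompressing on the fly for .gz).
export function openDump(path: string): AsyncIterable<Uint8Array> {
  const file = createReadStream(path);
  return path.endsWith('.gz') ? file.pipe(createGunzip()) : file;
}
```

A caller would then iterate `for await (const line of splitLines(openDump(path)))` and `JSON.parse` each line, so memory use stays bounded by the longest single line rather than the dump size.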

## Importing on the server host

The CLI that loads a dataset into Postgres runs **on the server host** (production Postgres is
Docker-internal), not from the crawler:

```bash
scp data/catalog/migros-<date>.jsonl server:/tmp/
ssh server
docker compose exec -T app bun run catalog:import /tmp/migros-<date>.jsonl
docker compose exec -T app bun run catalog:grant <userEmail> migros
```

Re-importing the same dataset `key` fully replaces its rows and preserves access grants.

## Testing

```bash
cd crawler && bun test
```

All tests are fixture-driven — no live network. Adapters split a pure, tested normalizer from
thin live-fetch glue; the glue (`createMigrosClient`, dump download) is exercised only by the
maintainer during a real crawl.
39 changes: 39 additions & 0 deletions crawler/adapters/migros/client.test.ts
@@ -0,0 +1,39 @@
import { test, expect } from 'bun:test';
import { join } from 'node:path';
import { mapProductDetail, extractProductIds, pickDetail } from './client';

test('mapProductDetail reduces a Migros API product-detail to MigrosProductDetail', async () => {
  const raw = await Bun.file(
    join(import.meta.dir, '../../fixtures/migros-product-detail.json')
  ).json();
  const d = mapProductDetail(raw);
  expect(d).not.toBeNull();
  expect(d!.id).toBe('100001');
  expect(d!.name).toBe('M-Classic Vollmilch UHT');
  expect(d!.gtins).toEqual(['7610200000001']);
  expect(d!.productUrl).toContain('100001');
  expect(d!.nutrition.basis).toBe('100g');
  expect(d!.nutrition.energyKcal).toBe(64);
  expect(d!.nutrition.sugar).toBe(4.8);
  expect(d!.nutrition.saturatedFat).toBe(2.1);
  expect(d!.nutrition.salt).toBe(0.1);
});

test('mapProductDetail returns null when id or name is missing', () => {
  expect(mapProductDetail({ name: 'no id' })).toBeNull();
  expect(mapProductDetail({ productId: '1' })).toBeNull();
});

test('extractProductIds reads productIds or products[].id/uid', () => {
  expect(extractProductIds({ productIds: ['a', 'b'] })).toEqual(['a', 'b']);
  expect(extractProductIds({ products: [{ id: 'x' }, { uid: 'y' }] })).toEqual(['x', 'y']);
  expect(extractProductIds({})).toEqual([]);
  expect(extractProductIds(null)).toEqual([]);
});

test('pickDetail selects a single product from array/products/object shapes', () => {
  expect(pickDetail([{ productId: '1' }])?.productId).toBe('1');
  expect(pickDetail({ products: [{ productId: '2' }] })?.productId).toBe('2');
  expect(pickDetail({ productId: '3' })?.productId).toBe('3');
  expect(pickDetail(null)).toBeNull();
});
138 changes: 138 additions & 0 deletions crawler/adapters/migros/client.ts
@@ -0,0 +1,138 @@
import type { MigrosClient, MigrosProductDetail, MigrosNutrition } from './types';

type RawNutrientValue = { code?: string; value?: number | string };
type RawProductDetail = {
  productId?: string;
  name?: string;
  brand?: string;
  gtins?: string[];
  productUrls?: Record<string, string>;
  image?: { original?: string };
  ingredients?: string;
  nutrients?: { referenceValue?: string; values?: RawNutrientValue[] };
};

type MigrosNumericKey = Exclude<keyof MigrosNutrition, 'basis'>;

const NUTRIENT_CODE: Record<string, MigrosNumericKey> = {
  energy_kcal: 'energyKcal',
  protein: 'protein',
  carbohydrate: 'carbohydrate',
  of_which_sugars: 'sugar',
  fat: 'fat',
  of_which_saturated: 'saturatedFat',
  dietary_fiber: 'fiber',
  salt: 'salt'
};

function num(v: number | string | undefined): number | null {
  if (v == null) return null;
  const n = typeof v === 'string' ? parseFloat(v) : v;
  return Number.isNaN(n) ? null : n;
}

export function mapProductDetail(raw: RawProductDetail): MigrosProductDetail | null {
  const id = raw.productId;
  const name = raw.name;
  if (!id || !name) return null;
  const nutrition: MigrosNutrition = { basis: raw.nutrients?.referenceValue ?? '100g' };
  for (const entry of raw.nutrients?.values ?? []) {
    const key = entry.code ? NUTRIENT_CODE[entry.code] : undefined;
    if (key) nutrition[key] = num(entry.value);
  }
  return {
    id,
    name,
    brand: raw.brand ?? null,
    gtins: (raw.gtins ?? []).filter((g) => !!g),
    productUrl: raw.productUrls?.de ?? Object.values(raw.productUrls ?? {})[0] ?? null,
    imageUrl: raw.image?.original ?? null,
    ingredients: raw.ingredients ?? null,
    nutrition
  };
}

export type MigrosClientConfig = {
  /** Food category search terms or category ids to page through (host-confirmed). */
  categories: string[];
  pageSize?: number;
  maxPagesPerCategory?: number;
};

/** Best-effort extraction of product ids from a (loosely-typed) search response. */
export function extractProductIds(res: unknown): string[] {
  const r = res as { productIds?: string[]; products?: Array<{ id?: string; uid?: string }> };
  if (Array.isArray(r?.productIds)) return r.productIds.filter((id): id is string => !!id);
  if (Array.isArray(r?.products)) {
    return r.products.map((p) => p.id ?? p.uid).filter((id): id is string => !!id);
  }
  return [];
}

/** Best-effort selection of the single product object from a product-detail response. */
export function pickDetail(res: unknown): RawProductDetail | null {
  if (!res) return null;
  if (Array.isArray(res)) return (res[0] as RawProductDetail) ?? null;
  const r = res as { products?: RawProductDetail[] };
  if (Array.isArray(r.products)) return r.products[0] ?? null;
  return res as RawProductDetail;
}

/**
 * Live client backed by `migros-api-wrapper` (`MigrosAPI`: guest token → product search →
 * product-detail). NOT unit-tested — no live network in CI (spec §12). The dependency is
 * imported dynamically so the tested core type-checks/runs without loading axios/cheerio/pino.
 *
 * The wrapper's instance methods return `any` and some option types are inconsistent, so the
 * call boundary is navigated through a narrow facade. The exact category ids/pagination params
 * and the product-detail response field paths consumed by `mapProductDetail`/`extractProductIds`
 * are verified against a live response on the server host during the first crawl (spec §13).
 */
export async function createMigrosClient(config: MigrosClientConfig): Promise<MigrosClient> {
  const { MigrosAPI } = await import('migros-api-wrapper');
  const api = new MigrosAPI();
  // Guest token — public product data needs no login.
  const token = (await api.account.oauth2.loginGuestToken()) as string;

  const products = api.products as unknown as {
    productSearch: {
      searchProduct: (
        body: { query: string; [k: string]: unknown },
        options?: Record<string, unknown>,
        token?: string
      ) => Promise<unknown>;
    };
    productDisplay: {
      getProductDetails: (
        options: { uids: string | string[]; [k: string]: unknown },
        token?: string
      ) => Promise<unknown>;
    };
  };

  const pageSize = config.pageSize ?? 24;
  const maxPages = config.maxPagesPerCategory ?? 1000;

  return {
    async *listProductIds({ resume }) {
      for (const category of config.categories) {
        let page = resume && resume.category === category ? resume.page : 0;
        for (; page < maxPages; page++) {
          const res = await products.productSearch.searchProduct(
            { query: category },
            { from: page * pageSize, hitsPerPage: pageSize },
            token
          );
          const ids = extractProductIds(res);
          for (const id of ids) yield { id, cursor: { category, page } };
          if (ids.length < pageSize) break;
        }
      }
    },
    async getProduct(id) {
      const res = await products.productDisplay.getProductDetails({ uids: id }, token);
      const raw = pickDetail(res);
      return raw ? mapProductDetail(raw) : null;
    }
  };
}
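
Review note: the paging loop in `listProductIds` can be studied without the live wrapper by swapping the search call for a stub. The sketch below copies the loop structure from the diff; `listIds` and `SearchPage` are hypothetical names and the stubbed search function stands in for `searchProduct`. Two properties become visible: a category ends on the first short page (so a result count that is an exact multiple of `pageSize` costs one extra, empty request), and `resume` only changes the starting page of the matching category, while categories listed before it still start from page 0.

```typescript
export type Cursor = { category: string; page: number };

// Hypothetical stand-in for the wrapper's search call: returns the ids of one page.
export type SearchPage = (category: string, from: number, size: number) => Promise<string[]>;

export async function* listIds(
  categories: string[],
  search: SearchPage,
  opts: { pageSize?: number; maxPages?: number; resume?: Cursor } = {}
): AsyncGenerator<{ id: string; cursor: Cursor }> {
  const { pageSize = 24, maxPages = 1000, resume } = opts;
  for (const category of categories) {
    // Only the category named in the resume cursor starts mid-way.
    let page = resume && resume.category === category ? resume.page : 0;
    for (; page < maxPages; page++) {
      const ids = await search(category, page * pageSize, pageSize);
      for (const id of ids) yield { id, cursor: { category, page } };
      if (ids.length < pageSize) break; // short (or empty) page: category exhausted
    }
  }
}
```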
78 changes: 78 additions & 0 deletions crawler/adapters/migros/crawl-migros.test.ts
@@ -0,0 +1,78 @@
import { test, expect } from 'bun:test';
import { crawlMigros } from './crawl-migros';
import { newStats } from '../../types';
import type { MigrosClient, MigrosProductDetail } from './types';

function makeClient(
  products: Record<string, MigrosProductDetail | null>,
  ids: string[]
): MigrosClient {
  return {
    async *listProductIds() {
      let page = 0;
      for (const id of ids) yield { id, cursor: { category: 'all', page: page++ } };
    },
    async getProduct(id) {
      return products[id] ?? null;
    }
  };
}

const base: MigrosProductDetail = {
  id: '1',
  name: 'A',
  gtins: ['7610200000001'],
  productUrl: 'https://m/1',
  nutrition: { basis: '100g', energyKcal: 64, protein: 3.3, carbohydrate: 4.8, fat: 3.5, fiber: 0 }
};

test('emits normalized products and dedupes repeated ids and barcodes', async () => {
  const client = makeClient(
    {
      '1': base,
      '2': { ...base, id: '2', name: 'B', gtins: ['7610200000002'] },
      '3': { ...base, id: '3', name: 'A-dup', gtins: ['7610200000001'] } // dup barcode
    },
    ['1', '2', '2', '3'] // '2' listed twice
  );
  const stats = newStats();
  const out = [];
  for await (const p of crawlMigros(client, { stats, sleep: async () => {} })) out.push(p);
  expect(out.map((p) => p.name).sort()).toEqual(['A', 'B']);
  expect(stats.emitted).toBe(2);
  expect(stats.dropReasons['dup']).toBe(2); // one dup id + one dup barcode
});

test('skips ids whose product detail is null', async () => {
  const client = makeClient({ '1': base, '9': null }, ['1', '9']);
  const out = [];
  for await (const p of crawlMigros(client, { sleep: async () => {} })) out.push(p);
  expect(out.length).toBe(1);
});

test('respects the limit option', async () => {
  const client = makeClient({ '1': base, '2': { ...base, id: '2', gtins: ['7610200000002'] } }, [
    '1',
    '2'
  ]);
  const out = [];
  for await (const p of crawlMigros(client, { limit: 1, sleep: async () => {} })) out.push(p);
  expect(out.length).toBe(1);
});

test('checkpoints only emitted products (after yield), not dropped ones', async () => {
  const client = makeClient(
    { '1': base, '2': { ...base, id: '2', name: 'B', gtins: ['7610200000002'] }, '9': null },
    ['1', '9', '2']
  );
  const cursors: Array<{ category: string; page: number }> = [];
  const out = [];
  for await (const p of crawlMigros(client, {
    sleep: async () => {},
    onCheckpoint: (c) => void cursors.push(c)
  }))
    out.push(p);
  // '9' has no detail (dropped) → not checkpointed; only the two emitted products are.
  expect(out.length).toBe(2);
  expect(cursors.length).toBe(2);
});
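
Review note: `crawl-migros.ts` itself is not visible in this part of the diff, so the sketch below is only an inference from the tests above of what a loop with these properties might look like (dedupe by id and barcode, counting both as `dup`, a `limit`, and checkpointing after the yield). `crawlSketch`, the simplified `Detail` type, the `no-detail` drop reason, and the 1000 ms delay are assumptions, and normalization is omitted.

```typescript
type Cursor = { category: string; page: number };
type Detail = { id: string; name: string; gtins: string[] };
type Client = {
  listProductIds(): AsyncIterable<{ id: string; cursor: Cursor }>;
  getProduct(id: string): Promise<Detail | null>;
};
type Stats = { emitted: number; dropReasons: Record<string, number> };
type CrawlOptions = {
  stats?: Stats;
  limit?: number;
  sleep?: (ms: number) => Promise<void>;
  onCheckpoint?: (cursor: Cursor) => void;
};

export async function* crawlSketch(client: Client, opts: CrawlOptions = {}): AsyncGenerator<Detail> {
  const stats = opts.stats ?? { emitted: 0, dropReasons: {} };
  const sleep = opts.sleep ?? ((ms: number) => new Promise<void>((r) => setTimeout(r, ms)));
  const drop = (reason: string) => {
    stats.dropReasons[reason] = (stats.dropReasons[reason] ?? 0) + 1;
  };
  const seenIds = new Set<string>();
  const seenBarcodes = new Set<string>();

  for await (const { id, cursor } of client.listProductIds()) {
    if (opts.limit !== undefined && stats.emitted >= opts.limit) return;
    if (seenIds.has(id)) { drop('dup'); continue; }
    seenIds.add(id);

    await sleep(1000); // assumed fixed delay between detail requests
    const detail = await client.getProduct(id);
    if (!detail) { drop('no-detail'); continue; } // hypothetical drop reason
    if (detail.gtins.some((g) => seenBarcodes.has(g))) { drop('dup'); continue; }
    for (const g of detail.gtins) seenBarcodes.add(g);

    stats.emitted++;
    yield detail;
    // Runs only once the consumer asks for the next item, i.e. after it has
    // finished handling (for example, writing) the product just yielded.
    opts.onCheckpoint?.(cursor);
  }
}
```

Placing the checkpoint after the `yield` means a crash between fetching and writing a product cannot record that product as done, which appears to be what the "checkpoint-after-y…" review fix in the commit list is about.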