Feature/consolidate crawlers #41

Closed
rangarius wants to merge 11 commits into main from feature/consolidate-crawlers

@rangarius (Contributor) commented Jan 19, 2026

Restructuring of the existing crawler system to speed up the build process

Simplified scraper container for website crawling (generic-scraper)
Simplified scraper container for website crawling (nested - generic-scraper)
Existing images for special configurations
Unified registry for containers (crawlers.yaml)
Working local testing options (Test Config and Test Local Container)
Implementation in GitHub workflows

Fixes:
-> Default output for "Erster Freitag"
-> Fix GIF in Bodenwasser container
-> Fix Talsperren rendering (#39)

rangarius and others added 11 commits January 16, 2026 13:49
This major refactoring introduces a generic scraper system that reduces
the number of container images needed from 24 to ~12 by using YAML
configuration files instead of individual Python scripts.

Changes:
- Add generic_scraper base image with config-driven scraping engine
- Migrate 9 crawlers to YAML configs (002_gz, 041-054)
- Add optimized CI/CD workflows (single-platform testing, conditional builds)
- Add local development scripts (test-scraper-local.sh, test-all-configs.sh)
- Add compose.consolidated.yaml for new architecture
- Add MIGRATION.md documentation

Benefits:
- ~50% reduction in container images to build
- ~50% faster CI/CD testing (single-platform)
- Conditional base image builds (only when changed)
- Config changes don't require container rebuilds
- Easier to add new crawlers (just add YAML file)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
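
The config-driven idea in the commit above (one generic engine, one small config per crawler) can be illustrated with a minimal sketch. It is not the actual `generic_scraper` implementation: the config keys `tag`, `class`, and `limit` are hypothetical, and the config is shown as an already-parsed dictionary rather than a YAML file so the example needs only the standard library.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text of every element matching a tag name and CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0  # > 0 while the parser is inside a matching element
        self.items = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested elements of the same tag so the end tag pairs up.
            if tag == self.tag:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.depth = 1
            self.items.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.items[-1] += data


def scrape(config, html):
    """Apply one crawler config (hypothetical keys) to an HTML document."""
    parser = TagTextExtractor(config["tag"], config["class"])
    parser.feed(html)
    texts = [text.strip() for text in parser.items]
    return texts[: config.get("limit")]  # limit=None keeps everything


# Hypothetical config, standing in for one parsed YAML file.
EXAMPLE_CONFIG = {"tag": "li", "class": "event"}
EXAMPLE_HTML = (
    '<ul><li class="event">A</li>'
    '<li class="other">B</li>'
    '<li class="event big"> C </li></ul>'
)
```

With this shape, adding a crawler means adding another config rather than another script, which is the property the commit message relies on.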
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove 9 docker_instances/* directories that are now replaced by
YAML configs in crawler_configs/:
- 002_gz, 041_immenrode, 044_wiedelah, 048_jerstedt
- 050_tschuessschule_studium, 051_vhs, 052_vhs_kinderuni
- 053_tschuessschule_praktikum, 054_tschuessschule_ausbildung

Update compose.dev.yaml to remove references to deleted containers.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fixes:
- Use correct action: docker/setup-buildx-action@v3
- Use correct health monitor port: 5015
- Pull pre-built images instead of building locally
- Use compose.yaml with production images
- Update upload-artifact to v4
- Add proper error handling
- Document .gitignore file

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add scripts/dev.sh: Main development CLI with setup, up, down, logs, etc.
- Add scripts/build-base-images-local.sh: Build base images locally (no registry)
- Add CONTRIBUTING.md: Clear setup and contribution instructions

New workflow:
  ./scripts/dev.sh setup  # First-time setup
  ./scripts/dev.sh up     # Start containers
  ./scripts/dev.sh test   # Test configs without Docker

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add gs_generic_scraper_simple and gs_generic_scraper_tschuessschule
  to compose.dev.yaml for local development
- Add generic scrapers to health monitor dependencies
- Update schema.md with clarification about run_on_start
- Fix talsperren script text color from white to black

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Introduces crawlers.yaml as the central registry for all crawler definitions.
This eliminates the need to maintain configuration in multiple places.

Key changes:
- Add crawlers.yaml with all 25 crawler definitions
- Add scripts/generate-compose.py to generate compose files from registry
- Add scripts/generate-readme.py to update README tables from registry
- Add scripts/generate-all.sh convenience script
- Update health monitor to read crawler definitions from registry
- Update README.md with auto-generated crawler tables

Workflow for adding/modifying crawlers:
1. Edit crawlers.yaml
2. Run ./scripts/generate-all.sh
3. Commit changes

The health monitor falls back to hardcoded definitions if the registry
file is not mounted (backwards compatibility).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
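
The registry workflow and the health monitor fallback described in the commit above could look roughly like the sketch below. This is an illustration under assumptions, not the code in `scripts/generate-compose.py`: the top-level `crawlers` key, the `gs_` service prefix, the `example.invalid` image prefix, and the fallback contents are all hypothetical. The YAML parser is passed in as `parse` (for example PyYAML's `yaml.safe_load`, if that library is installed) so the sketch itself depends only on the standard library.

```python
import os

# Hypothetical hardcoded definitions, used when no registry file is mounted.
FALLBACK_CRAWLERS = {"002_gz": {"type": "config"}}


def load_crawlers(path, parse):
    """Return crawler definitions from the registry, or the fallback.

    `parse` turns the file's text into a dict, e.g. yaml.safe_load.
    """
    if not os.path.isfile(path):
        return FALLBACK_CRAWLERS
    with open(path, encoding="utf-8") as handle:
        return parse(handle.read())["crawlers"]


def build_compose(crawlers, image_prefix="example.invalid/crawlers"):
    """Build a compose-style mapping with one service per registry entry."""
    services = {}
    for name, spec in sorted(crawlers.items()):
        services[f"gs_{name}"] = {
            # An explicit image in the registry wins over the derived name.
            "image": spec.get("image", f"{image_prefix}/{name}:latest"),
            "restart": "unless-stopped",
        }
    return {"services": services}
```

Because both the compose file and the health monitor read the same definitions, editing the registry and regenerating is enough to keep them in step.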
- Add scripts/test-crawler.py as unified Python test runner
- Update test-scraper-local.sh to be wrapper around test-crawler.py
- Update test-all-configs.sh to be wrapper around test-crawler.py

New features:
- Test both config-driven and custom container crawlers
- Auto-detect base image requirements from Dockerfile
- Create venvs automatically for custom crawlers
- Filter by --config, --custom, or --category
- List all crawlers with --list

Usage:
  ./scripts/test-scraper-local.sh              # List crawlers
  ./scripts/test-scraper-local.sh 002_gz       # Test single crawler
  ./scripts/test-all-configs.sh                # Test all crawlers
  ./scripts/test-all-configs.sh --config       # Test only config-driven

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
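
As a rough illustration of the filter options listed in the commit above, the following sketch wires `--config`, `--custom`, `--category`, and `--list` into an `argparse` parser. It is not the real `scripts/test-crawler.py`: the in-memory `CRAWLERS` list, its `kind` and `category` fields, the category names, and the `talsperren` entry name are hypothetical, and how the real script combines `--config` with `--custom` is an assumption (here they form a union).

```python
import argparse

# Hypothetical registry entries; the real data would come from crawlers.yaml.
CRAWLERS = [
    {"name": "002_gz", "kind": "config", "category": "news"},
    {"name": "051_vhs", "kind": "config", "category": "education"},
    {"name": "talsperren", "kind": "custom", "category": "environment"},
]


def build_parser():
    parser = argparse.ArgumentParser(description="Test crawlers locally")
    parser.add_argument("name", nargs="?", help="test only this crawler")
    parser.add_argument("--config", action="store_true",
                        help="only config-driven crawlers")
    parser.add_argument("--custom", action="store_true",
                        help="only custom container crawlers")
    parser.add_argument("--category", help="only crawlers in this category")
    parser.add_argument("--list", action="store_true",
                        help="list crawlers instead of testing them")
    return parser


def select(crawlers, args):
    """Return the names of the crawlers matching the parsed arguments."""
    kinds = set()
    if args.config:
        kinds.add("config")
    if args.custom:
        kinds.add("custom")
    chosen = []
    for crawler in crawlers:
        if args.name and crawler["name"] != args.name:
            continue
        if kinds and crawler["kind"] not in kinds:
            continue
        if args.category and crawler["category"] != args.category:
            continue
        chosen.append(crawler["name"])
    return chosen


if __name__ == "__main__":
    cli_args = build_parser().parse_args()
    for crawler_name in select(CRAWLERS, cli_args):
        print(crawler_name)
```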
…mands to ADD; enhance error handling in Python scripts; update development script to rebuild containers.
- Add scripts/generate-workflow-matrix.py to extract data from registry
- Add load-registry job to both test and deploy workflows
- Replace hardcoded base image lists with dynamic registry lookups
- Trigger workflows on crawlers.yaml changes
- Use registry for full container list during deploy-all scenarios
- Include registry data in workflow summary outputs

Both workflows now use the single source of truth (crawlers.yaml) for
container and base image lists, eliminating duplicate maintenance.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
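
The dynamic registry lookup described in the commit above can be sketched as follows. This is an assumption about the general shape, not the contents of `scripts/generate-workflow-matrix.py`: the `base_image` field and the output names `containers` and `base_images` are hypothetical. The `name=value` lines follow the format GitHub Actions expects in the file referenced by `GITHUB_OUTPUT`, where a later job can decode the JSON with `fromJSON`.

```python
import json


def build_matrix(crawlers):
    """Derive container and base image lists from parsed registry entries."""
    base_images = sorted(
        {spec["base_image"] for spec in crawlers.values() if spec.get("base_image")}
    )
    return {"containers": sorted(crawlers), "base_images": base_images}


def to_github_output(matrix):
    """Render the matrix as name=value lines with JSON-encoded values."""
    return "\n".join(f"{key}={json.dumps(value)}" for key, value in matrix.items())


# Hypothetical registry excerpt; the field name base_image is an assumption.
EXAMPLE_REGISTRY = {
    "051_vhs": {"base_image": "generic_scraper"},
    "002_gz": {"base_image": "generic_scraper"},
    "custom_one": {},
}
```

Deriving both lists from one structure is what removes the duplicate maintenance mentioned in the commit: the workflows never carry their own copy of the container names.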
@rangarius rangarius closed this Jan 19, 2026