Feature/consolidate crawlers #42
This major refactoring introduces a generic scraper system that reduces the number of container images needed from 24 to ~12 by using YAML configuration files instead of individual Python scripts.

Changes:
- Add generic_scraper base image with config-driven scraping engine
- Migrate 9 crawlers to YAML configs (002_gz, 041-054)
- Add optimized CI/CD workflows (single-platform testing, conditional builds)
- Add local development scripts (test-scraper-local.sh, test-all-configs.sh)
- Add compose.consolidated.yaml for new architecture
- Add MIGRATION.md documentation

Benefits:
- ~50% reduction in container images to build
- ~50% faster CI/CD testing (single-platform)
- Conditional base image builds (only when changed)
- Config changes don't require container rebuilds
- Easier to add new crawlers (just add YAML file)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove 9 docker_instances/* directories that are now replaced by YAML configs in crawler_configs/:
- 002_gz, 041_immenrode, 044_wiedelah, 048_jerstedt
- 050_tschuessschule_studium, 051_vhs, 052_vhs_kinderuni
- 053_tschuessschule_praktikum, 054_tschuessschule_ausbildung

Update compose.dev.yaml to remove references to deleted containers.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fixes:
- Use correct action: docker/setup-buildx-action@v3
- Use correct health monitor port: 5015
- Pull pre-built images instead of building locally
- Use compose.yaml with production images
- Update upload-artifact to v4
- Add proper error handling
- Document .gitignore file

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add scripts/dev.sh: Main development CLI with setup, up, down, logs, etc.
- Add scripts/build-base-images-local.sh: Build base images locally (no registry)
- Add CONTRIBUTING.md: Clear setup and contribution instructions

New workflow:
  ./scripts/dev.sh setup   # First-time setup
  ./scripts/dev.sh up      # Start containers
  ./scripts/dev.sh test    # Test configs without Docker

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add gs_generic_scraper_simple and gs_generic_scraper_tschuessschule to compose.dev.yaml for local development
- Add generic scrapers to health monitor dependencies
- Update schema.md with clarification about run_on_start
- Fix talsperren script text color from white to black

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Introduces crawlers.yaml as the central registry for all crawler definitions. This eliminates the need to maintain configuration in multiple places.

Key changes:
- Add crawlers.yaml with all 25 crawler definitions
- Add scripts/generate-compose.py to generate compose files from registry
- Add scripts/generate-readme.py to update README tables from registry
- Add scripts/generate-all.sh convenience script
- Update health monitor to read crawler definitions from registry
- Update README.md with auto-generated crawler tables

Workflow for adding/modifying crawlers:
1. Edit crawlers.yaml
2. Run ./scripts/generate-all.sh
3. Commit changes

The health monitor falls back to hardcoded definitions if the registry file is not mounted (backwards compatibility).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
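The fallback behaviour described in this commit could look roughly like the sketch below. It is not the health monitor's actual code: `load_registry`, `FALLBACK_CRAWLERS`, and the registry layout are assumptions made for illustration, and the parser is passed in so the example runs without third-party packages.

```python
import json
from pathlib import Path

# Hypothetical hardcoded fallback; the real defaults in app.py are not shown in this PR.
FALLBACK_CRAWLERS = {"crawlers": {"002_gz": {"type": "config"}}}


def load_registry(registry_path, parse):
    """Return the parsed registry, or the hardcoded fallback if the file is not mounted."""
    path = Path(registry_path)
    if not path.is_file():
        # Registry not mounted into the container: keep the old behaviour.
        return FALLBACK_CRAWLERS
    return parse(path.read_text(encoding="utf-8"))


# json.loads stands in for a YAML parser here so the sketch has no dependencies.
registry = load_registry("/path/that/is/not/mounted/crawlers.yaml", json.loads)
```

With PyYAML installed, `yaml.safe_load` could be passed as `parse` to read the real `crawlers.yaml`.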
- Add scripts/test-crawler.py as unified Python test runner
- Update test-scraper-local.sh to be wrapper around test-crawler.py
- Update test-all-configs.sh to be wrapper around test-crawler.py

New features:
- Test both config-driven and custom container crawlers
- Auto-detect base image requirements from Dockerfile
- Create venvs automatically for custom crawlers
- Filter by --config, --custom, or --category
- List all crawlers with --list

Usage:
  ./scripts/test-scraper-local.sh            # List crawlers
  ./scripts/test-scraper-local.sh 002_gz     # Test single crawler
  ./scripts/test-all-configs.sh              # Test all crawlers
  ./scripts/test-all-configs.sh --config     # Test only config-driven

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…mands to ADD; enhance error handling in Python scripts; update development script to rebuild containers.
- Add scripts/generate-workflow-matrix.py to extract data from registry
- Add load-registry job to both test and deploy workflows
- Replace hardcoded base image lists with dynamic registry lookups
- Trigger workflows on crawlers.yaml changes
- Use registry for full container list during deploy-all scenarios
- Include registry data in workflow summary outputs

Both workflows now use the single source of truth (crawlers.yaml) for container and base image lists, eliminating duplicate maintenance.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
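As an illustration of deriving workflow inputs from the registry, the sketch below turns a registry dictionary into the container and base image lists a workflow could consume as JSON. The registry schema (`crawlers`, `type`, `base_image`), the image names, and the function name `build_matrix` are assumptions; the actual scripts/generate-workflow-matrix.py may be organised differently.

```python
import json


def build_matrix(registry):
    """Collect custom containers and the distinct base images they need."""
    crawlers = registry.get("crawlers", {})
    containers = sorted(
        name for name, spec in crawlers.items() if spec.get("type") == "custom"
    )
    base_images = sorted(
        {spec["base_image"] for spec in crawlers.values() if spec.get("base_image")}
    )
    return {"containers": containers, "base_images": base_images}


# Hypothetical registry content, for illustration only.
example_registry = {
    "crawlers": {
        "035_talsperren": {"type": "custom", "base_image": "python_selenium"},
        "047_bodenwasser": {"type": "custom", "base_image": "python_basic"},
        "002_gz": {"type": "config", "base_image": "generic_scraper"},
    }
}
matrix_json = json.dumps(build_matrix(example_registry))
```

A job could write `matrix_json` to its outputs and a later job could read it with the `fromJSON` expression function.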
When containers are deleted, git diff still shows their files as changed. The workflows would then try to build/test containers that no longer exist. Added check to verify container directory exists before adding to the changed-containers list in both test and deploy workflows. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
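The existence check can be sketched as follows. The workflows themselves most likely do this in shell; this Python version only illustrates the logic, and `changed_containers` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def changed_containers(changed_files, repo_root):
    """Map changed file paths to container names, skipping containers whose directory is gone."""
    root = Path(repo_root)
    found = []
    for file_path in changed_files:
        parts = Path(file_path).parts
        if len(parts) < 2 or parts[0] != "docker_instances":
            continue  # not a container file
        name = parts[1]
        if name in found:
            continue  # already recorded
        if (root / "docker_instances" / name).is_dir():
            found.append(name)  # directory still exists, so it can be built and tested
    return found


# Demo: 002_gz was deleted in this PR, 035_talsperren still exists.
with tempfile.TemporaryDirectory() as repo:
    (Path(repo) / "docker_instances" / "035_talsperren").mkdir(parents=True)
    result = changed_containers(
        [
            "docker_instances/002_gz/script.py",
            "docker_instances/035_talsperren/script.py",
            "README.md",
        ],
        repo,
    )
```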
The test workflow was failing because container Dockerfiles reference base images from ghcr.io which require authentication.

Changes:
- Add step to extract base image from Dockerfile and build it locally
- Tag local build with the exact ghcr.io name the Dockerfile expects
- Use --pull=never to prevent Docker from trying to pull from registry

This allows container tests to run without registry authentication by building all required base images locally first.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
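Extracting the base image from a Dockerfile amounts to reading its `FROM` lines. The sketch below is one way to do that, assuming simple Dockerfiles without `ARG` substitution in the image name; the function name and the example image are hypothetical and not taken from the workflow.

```python
def base_images(dockerfile_text):
    """Return external images named in FROM lines, ignoring references to earlier build stages."""
    images = []
    stages = set()
    for line in dockerfile_text.splitlines():
        tokens = line.strip().split()
        if len(tokens) < 2 or tokens[0].upper() != "FROM":
            continue
        # Drop flags such as --platform=linux/amd64.
        args = [t for t in tokens[1:] if not t.startswith("--")]
        if not args:
            continue
        image = args[0]
        if image != "scratch" and image not in stages:
            images.append(image)
        # "FROM image AS name" defines a stage that later FROM lines may reference.
        if len(args) >= 3 and args[1].upper() == "AS":
            stages.add(args[2])
    return images


# Hypothetical Dockerfile content, for illustration only.
example = """\
# FROM commented/out:1
FROM ghcr.io/example/python_basic:latest AS build
RUN pip install requests
FROM build
"""
needed = base_images(example)
```

Each returned name could then be used as the tag for a local build, so that the container's `FROM` line resolves to the locally built image.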
The --pull flag expects a boolean value, not "never". Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The --pull flag is a presence-only flag that forces pulling. To use local images, simply omit the flag - Docker will use cached images by default. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The docker/setup-buildx-action creates a separate builder using the docker-container driver, which has its own build cache isolated from the Docker daemon's image cache. This meant locally built base images weren't available to the buildx builder. Since we're only doing single-platform test builds with regular 'docker build', we don't need buildx. Regular docker build uses the daemon's cache directly, so locally built base images are available. Reference: https://github.com/docker/setup-buildx-action Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pull request overview
This pull request implements a major architectural consolidation of the GS Crawler system, reducing the number of container images from 24 to approximately 12 by introducing a generic scraper engine driven by YAML configurations.
Changes:
- Introduced a centralized `crawlers.yaml` registry as a single source of truth for all crawlers
- Created a generic scraper engine that supports YAML-based configuration files for simple crawlers
- Migrated 9 crawlers from custom Docker containers to config-driven implementations
- Added comprehensive development and testing scripts for local development without Docker
- Implemented auto-generation scripts for Docker Compose files, README documentation, and GitHub workflow matrices
- Fixed bugs in existing custom crawlers (default output for "Erster Freitag", GIF URL fix for Bodenwasser, text color fix for Talsperren display)
Reviewed changes
Copilot reviewed 75 out of 78 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `scripts/test-scraper-local.sh` | Wrapper script for testing crawlers locally without Docker |
| `scripts/test-crawler.py` | Main Python script for testing individual or batched crawlers |
| `scripts/generate-*.py` | Scripts to generate compose files, README, and workflow matrices from crawlers.yaml |
| `scripts/dev.sh` | Development helper script for managing local Docker environment |
| `base_images/generic_scraper/*` | New generic scraper base image with config-driven scraping engine |
| `crawler_configs/**/*.yaml` | YAML configuration files for migrated crawlers |
| `crawlers.yaml` | Central registry defining all crawlers in the system |
| `docker_instances/027_erster_freitag/script.py` | Added default output generation when no events found |
| `docker_instances/033_goslar24-7/get_and_store_images.py` | Added database initialization logic |
| `docker_instances/035_talsperren/script.py` | Changed text color from white to black for better visibility |
| `docker_instances/047_bodenwasser/script.py` | Fixed GIF URL and removed unused import |
| `docker_instances/000_health_monitor/app.py` | Updated to read crawler definitions from crawlers.yaml registry |
| `compose.yaml`, `compose.dev.yaml` | Regenerated Docker Compose files from central registry |
| `README.md` | Updated with new architecture and auto-generated crawler tables |
| `.github/workflows/*` | Optimized CI/CD workflows using registry-aware matrix generation |
    start_div = soup.find("div", id="1-freitag-goslar")
    if not start_div:
        print("❌ Start-DIV nicht gefunden.")
        generateDefault("Start-Div mit ID '1-freitag-goslar' nicht gefunden.")
The function call generateDefault is missing the required savemepath and filename parameters. This will cause a TypeError at runtime since the function signature at line 9 expects three parameters: message, savemepath, and filename.
    else:
        print("❌ Keine passenden Einträge gefunden.")
        generateDefault("Keine passenden Einträge gefunden.")
The function call generateDefault is missing the required savemepath and filename parameters. This will cause a TypeError at runtime since the function signature at line 9 expects three parameters: message, savemepath, and filename.
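The failure mode in both flagged calls, and one possible fix, can be reproduced with a stand-in function. Only the three-parameter signature comes from the review comment; the function body, the output directory, and the file name `default.txt` are hypothetical.

```python
import os
import tempfile


def generateDefault(message, savemepath, filename):
    """Stand-in with the reviewed signature; the real function's body is not shown in the PR."""
    os.makedirs(savemepath, exist_ok=True)
    target = os.path.join(savemepath, filename)
    with open(target, "w", encoding="utf-8") as handle:
        handle.write(message)
    return target


# Calling with only the message, as in the diff, raises TypeError.
try:
    generateDefault("Keine passenden Einträge gefunden.")
    raised = False
except TypeError:
    raised = True

# Passing all three arguments succeeds (path and file name are illustrative).
out_dir = tempfile.mkdtemp()
written = generateDefault("Keine passenden Einträge gefunden.", out_dir, "default.txt")
```

An alternative would be to give `savemepath` and `filename` default values in the function definition, which would leave the two call sites unchanged.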
    if sqlite3.connect(DB_PATH) is None:
        logging.info("Datenbankverbindung konnte nicht hergestellt werden.")
        init_db.init_db()
The condition sqlite3.connect(DB_PATH) is None will never be True. The sqlite3.connect() function returns a Connection object on success or raises an exception on failure. This check should be restructured to use a try-except block instead.
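A try-except version of the check might look like the sketch below. The helper name `database_is_reachable` and the demo paths are assumptions; in the crawler, the caller would invoke `init_db.init_db()` when the helper returns `False`.

```python
import logging
import os
import sqlite3
import tempfile


def database_is_reachable(db_path):
    """Return True if a connection can be opened, False if sqlite3 raises an error."""
    try:
        conn = sqlite3.connect(db_path)
    except sqlite3.Error as exc:
        logging.info("Datenbankverbindung konnte nicht hergestellt werden: %s", exc)
        return False
    conn.close()
    return True


# Demo: an in-memory database opens fine; a file inside a missing directory does not.
ok = database_is_reachable(":memory:")
missing = os.path.join(tempfile.mkdtemp(), "no_such_dir", "images.db")
bad = database_is_reachable(missing)
```

Note that `sqlite3.connect()` creates a missing database file when its directory exists, so a successful connection does not prove the expected tables are present.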
* feat: Add consolidated config-driven scraper architecture
* chore: Fix .gitignore formatting and add common excludes
* chore: Remove migrated container directories
* fix: Restore and fix daily-health-check workflow
* fix: Add missing 068_altstadtfest to compose.dev.yaml
* feat: Add local development workflow for contributors
* feat: Add generic scraper to dev compose and minor fixes
* feat: Add single-source-of-truth registry system
* feat: Rework test scripts to use crawlers.yaml registry
* Remove obsolete CI workflow and scripts; refactor Dockerfile COPY commands to ADD; enhance error handling in Python scripts; update development script to rebuild containers.
* feat: Integrate crawlers.yaml registry into GitHub workflows
* fix: Skip deleted containers in workflow change detection
* fix: Build base images locally for container testing
* fix: Use --pull=false instead of --pull=never
* fix: Remove invalid --pull flag from docker build
* fix: Remove buildx setup from test-containers job
* remove old lock files in generic_crawler deployment action

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Rework of the existing crawler system to speed up the build process
Simplified scraper container for website crawling (generic-scraper)
Simplified scraper container for website crawling (nested - generic-scraper)
Existing images for special configurations
Unified registry for containers (crawlers.yaml)
Working local test options (Test Config and Test Local Container)
Implementation in GitHub workflows
Fixes:
-> Default output for "Erster Freitag"
-> Fix GIF in Bodenwasser container
-> Fix Talsperren display (#39)