Feature/consolidate crawlers#42

Merged
rangarius merged 16 commits into main from feature/consolidate-crawlers on Jan 19, 2026

Conversation

@rangarius
Contributor

Rework of the existing crawler system to speed up the build process

Simplified scraper container for website crawling (generic-scraper)
Simplified scraper container for website crawling (nested - generic-scraper)
Existing images kept for special configurations
Unified registry for containers (crawlers.yaml)
Working local testing options (Test Config and Test Local Container)
Integration into GitHub workflows
Fixes:
-> Default output for "Erster Freitag"
-> Fix GIF in the Bodenwasser container
-> Fix Talsperren display (#39)

rangarius and others added 12 commits January 16, 2026 13:49
This major refactoring introduces a generic scraper system that reduces
the number of container images needed from 24 to ~12 by using YAML
configuration files instead of individual Python scripts.

Changes:
- Add generic_scraper base image with config-driven scraping engine
- Migrate 9 crawlers to YAML configs (002_gz, 041-054)
- Add optimized CI/CD workflows (single-platform testing, conditional builds)
- Add local development scripts (test-scraper-local.sh, test-all-configs.sh)
- Add compose.consolidated.yaml for new architecture
- Add MIGRATION.md documentation

Benefits:
- ~50% reduction in container images to build
- ~50% faster CI/CD testing (single-platform)
- Conditional base image builds (only when changed)
- Config changes don't require container rebuilds
- Easier to add new crawlers (just add YAML file)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
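
To illustrate the idea of replacing per-crawler scripts with configuration, here is a minimal Python sketch. It is not the repository's engine: the config keys (`name`, `tag`, `css_class`) and the function `scrape` are hypothetical, and a plain dict stands in for a parsed YAML file so that only the standard library is needed.

```python
from html.parser import HTMLParser


class _ClassTextCollector(HTMLParser):
    """Collects the text of every <tag> element carrying css_class."""

    def __init__(self, tag: str, css_class: str) -> None:
        super().__init__()
        self.tag, self.css_class = tag, css_class
        self.depth = 0  # > 0 while inside a matching element
        self.items: list[str] = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            if tag == self.tag:
                self.depth += 1  # nested element with the same tag name
        elif tag == self.tag and self.css_class in classes:
            self.depth = 1
            self.items.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.items[-1] += data


def scrape(html: str, config: dict) -> list[str]:
    """Apply one (hypothetical) crawler config to an HTML document."""
    collector = _ClassTextCollector(config["tag"], config["css_class"])
    collector.feed(html)
    return [item.strip() for item in collector.items if item.strip()]


# Hypothetical config; in the PR this information would live in a YAML file.
CONFIG = {"name": "example", "tag": "li", "css_class": "event"}
HTML = (
    '<ul><li class="event">Market</li>'
    '<li class="other">x</li>'
    '<li class="event"> Concert </li></ul>'
)
print(scrape(HTML, CONFIG))
```

With this shape, adding a crawler means adding another config entry rather than another script, which is the benefit the commit message describes.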
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove 9 docker_instances/* directories that are now replaced by
YAML configs in crawler_configs/:
- 002_gz, 041_immenrode, 044_wiedelah, 048_jerstedt
- 050_tschuessschule_studium, 051_vhs, 052_vhs_kinderuni
- 053_tschuessschule_praktikum, 054_tschuessschule_ausbildung

Update compose.dev.yaml to remove references to deleted containers.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fixes:
- Use correct action: docker/setup-buildx-action@v3
- Use correct health monitor port: 5015
- Pull pre-built images instead of building locally
- Use compose.yaml with production images
- Update upload-artifact to v4
- Add proper error handling
- Document .gitignore file

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add scripts/dev.sh: Main development CLI with setup, up, down, logs, etc.
- Add scripts/build-base-images-local.sh: Build base images locally (no registry)
- Add CONTRIBUTING.md: Clear setup and contribution instructions

New workflow:
  ./scripts/dev.sh setup  # First-time setup
  ./scripts/dev.sh up     # Start containers
  ./scripts/dev.sh test   # Test configs without Docker

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add gs_generic_scraper_simple and gs_generic_scraper_tschuessschule
  to compose.dev.yaml for local development
- Add generic scrapers to health monitor dependencies
- Update schema.md with clarification about run_on_start
- Fix talsperren script text color from white to black

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Introduces crawlers.yaml as the central registry for all crawler definitions.
This eliminates the need to maintain configuration in multiple places.

Key changes:
- Add crawlers.yaml with all 25 crawler definitions
- Add scripts/generate-compose.py to generate compose files from registry
- Add scripts/generate-readme.py to update README tables from registry
- Add scripts/generate-all.sh convenience script
- Update health monitor to read crawler definitions from registry
- Update README.md with auto-generated crawler tables

Workflow for adding/modifying crawlers:
1. Edit crawlers.yaml
2. Run ./scripts/generate-all.sh
3. Commit changes

The health monitor falls back to hardcoded definitions if the registry
file is not mounted (backwards compatibility).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
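
The registry-to-compose step can be sketched roughly as follows. This is an illustration under assumptions, not the contents of scripts/generate-compose.py: the registry fields (`name`, `type`), the image name default, and the mount path are hypothetical, and a dict stands in for the parsed crawlers.yaml.

```python
import json

# Hypothetical registry shape; the real crawlers.yaml schema may differ.
REGISTRY = {
    "crawlers": [
        {"name": "002_gz", "type": "config"},
        {"name": "035_talsperren", "type": "custom"},
    ]
}


def build_compose(registry: dict, generic_image: str = "generic_scraper") -> dict:
    """Derive a compose-style service mapping from the registry."""
    services = {}
    for crawler in registry["crawlers"]:
        name = crawler["name"]
        if crawler["type"] == "config":
            # Config-driven crawlers share one image and mount their config.
            services[name] = {
                "image": generic_image,
                "volumes": [f"./crawler_configs/{name}.yaml:/config.yaml:ro"],
            }
        else:
            # Custom crawlers keep their own build context.
            services[name] = {"build": f"./docker_instances/{name}"}
    return {"services": services}


print(json.dumps(build_compose(REGISTRY), indent=2))
```

Because every generated file is derived from the same input, editing the registry and re-running the generator keeps compose files, README tables, and the health monitor in step.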
- Add scripts/test-crawler.py as unified Python test runner
- Update test-scraper-local.sh to be wrapper around test-crawler.py
- Update test-all-configs.sh to be wrapper around test-crawler.py

New features:
- Test both config-driven and custom container crawlers
- Auto-detect base image requirements from Dockerfile
- Create venvs automatically for custom crawlers
- Filter by --config, --custom, or --category
- List all crawlers with --list

Usage:
  ./scripts/test-scraper-local.sh              # List crawlers
  ./scripts/test-scraper-local.sh 002_gz       # Test single crawler
  ./scripts/test-all-configs.sh                # Test all crawlers
  ./scripts/test-all-configs.sh --config       # Test only config-driven

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
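
A rough sketch of how such filters might be wired up with `argparse` is shown below. The flag names come from the commit message; the crawler entries, the `category` values, and the helper names `build_parser` and `select` are assumptions made for illustration and may not match scripts/test-crawler.py.

```python
import argparse

# Hypothetical registry entries; categories are invented for the example.
CRAWLERS = [
    {"name": "002_gz", "type": "config", "category": "news"},
    {"name": "051_vhs", "type": "config", "category": "education"},
    {"name": "035_talsperren", "type": "custom", "category": "environment"},
]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Select crawlers to test")
    parser.add_argument("name", nargs="?", help="test a single crawler by name")
    parser.add_argument("--config", action="store_true", help="only config-driven")
    parser.add_argument("--custom", action="store_true", help="only custom containers")
    parser.add_argument("--category", help="only crawlers in this category")
    parser.add_argument("--list", action="store_true", help="list crawlers and exit")
    return parser


def select(crawlers: list[dict], args: argparse.Namespace) -> list[str]:
    """Return the names of the crawlers matching the given filters."""
    selected = crawlers
    if args.name:
        selected = [c for c in selected if c["name"] == args.name]
    if args.config:
        selected = [c for c in selected if c["type"] == "config"]
    if args.custom:
        selected = [c for c in selected if c["type"] == "custom"]
    if args.category:
        selected = [c for c in selected if c["category"] == args.category]
    return [c["name"] for c in selected]


print(select(CRAWLERS, build_parser().parse_args(["--config"])))
```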
…mands to ADD; enhance error handling in Python scripts; update development script to rebuild containers.
- Add scripts/generate-workflow-matrix.py to extract data from registry
- Add load-registry job to both test and deploy workflows
- Replace hardcoded base image lists with dynamic registry lookups
- Trigger workflows on crawlers.yaml changes
- Use registry for full container list during deploy-all scenarios
- Include registry data in workflow summary outputs

Both workflows now use the single source of truth (crawlers.yaml) for
container and base image lists, eliminating duplicate maintenance.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
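
As an illustration of the registry lookup, the following sketch derives container and base image lists and formats them as `key=value` lines with JSON values, the form a workflow step could append to the file named by `GITHUB_OUTPUT`. The registry fields (`type`, `base_image`) and the image names are hypothetical; the actual scripts/generate-workflow-matrix.py may work differently.

```python
import json

# Hypothetical registry content standing in for the parsed crawlers.yaml.
REGISTRY = {
    "crawlers": [
        {"name": "035_talsperren", "type": "custom", "base_image": "base_a"},
        {"name": "047_bodenwasser", "type": "custom", "base_image": "base_b"},
        {"name": "027_erster_freitag", "type": "custom", "base_image": "base_a"},
        {"name": "002_gz", "type": "config"},
    ]
}


def build_matrix(registry: dict) -> dict:
    """Collect custom containers and the distinct base images they need."""
    custom = [c for c in registry["crawlers"] if c["type"] == "custom"]
    return {
        "containers": sorted(c["name"] for c in custom),
        "base_images": sorted({c["base_image"] for c in custom if c.get("base_image")}),
    }


def as_output_lines(matrix: dict) -> list[str]:
    """Format each list as one `key=<json>` line for a workflow output."""
    return [f"{key}={json.dumps(value)}" for key, value in matrix.items()]


for line in as_output_lines(build_matrix(REGISTRY)):
    print(line)
```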
When containers are deleted, git diff still shows their files as changed.
The workflows would then try to build/test containers that no longer exist.

Added check to verify container directory exists before adding to the
changed-containers list in both test and deploy workflows.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
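
The existence check can be sketched as below. The workflows themselves presumably do this in shell; this Python version only illustrates the logic, and the function name `changed_containers` and the demo file names are assumptions.

```python
import tempfile
from pathlib import Path, PurePosixPath


def changed_containers(
    changed_files: list[str], repo_root: str, containers_dir: str = "docker_instances"
) -> list[str]:
    """Map changed paths to container names, skipping deleted directories."""
    found = set()
    for changed in changed_files:
        parts = PurePosixPath(changed).parts  # git reports paths with '/'
        if len(parts) >= 2 and parts[0] == containers_dir:
            if (Path(repo_root) / containers_dir / parts[1]).is_dir():
                found.add(parts[1])
    return sorted(found)


with tempfile.TemporaryDirectory() as repo:
    # Only 035_talsperren still exists; 002_gz was removed in this PR.
    (Path(repo) / "docker_instances" / "035_talsperren").mkdir(parents=True)
    diff = [
        "docker_instances/002_gz/script.py",
        "docker_instances/035_talsperren/script.py",
        "README.md",
    ]
    print(changed_containers(diff, repo))
```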
@rangarius rangarius self-assigned this Jan 19, 2026
rangarius and others added 4 commits January 19, 2026 21:48
The test workflow was failing because container Dockerfiles reference
base images from ghcr.io which require authentication.

Changes:
- Add step to extract base image from Dockerfile and build it locally
- Tag local build with the exact ghcr.io name the Dockerfile expects
- Use --pull=never to prevent Docker from trying to pull from registry

This allows container tests to run without registry authentication by
building all required base images locally first.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
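
A minimal sketch of the extraction step: read the `FROM` lines of a Dockerfile and keep the ones pointing at ghcr.io, so each can be built locally and tagged with exactly that name. The image name and the base image directory in the example are hypothetical, and the helper names are not taken from the workflow.

```python
import re

# Matches "FROM [--platform=...] <image>" at the start of a line.
FROM_RE = re.compile(
    r"^[ \t]*FROM[ \t]+(?:--platform=\S+[ \t]+)?(\S+)",
    re.IGNORECASE | re.MULTILINE,
)


def base_images(dockerfile_text: str, registry_prefix: str = "ghcr.io/") -> list[str]:
    """Return the base images in a Dockerfile that come from the given registry."""
    return [
        image
        for image in FROM_RE.findall(dockerfile_text)
        if image.startswith(registry_prefix)
    ]


def local_build_command(image: str, context_dir: str) -> list[str]:
    """Build the base image locally under the name the Dockerfile expects."""
    return ["docker", "build", "-t", image, context_dir]


# Hypothetical Dockerfile content and build context.
DOCKERFILE = "FROM ghcr.io/example-org/example-base:latest\nCOPY . /app\n"
for image in base_images(DOCKERFILE):
    print(" ".join(local_build_command(image, "base_images/example_base")))
```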
The --pull flag expects a boolean value, not "never".

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The --pull flag is a presence-only flag that forces pulling.
To use local images, simply omit the flag - Docker will use
cached images by default.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The docker/setup-buildx-action creates a separate builder using the
docker-container driver, which has its own build cache isolated from
the Docker daemon's image cache. This meant locally built base images
weren't available to the buildx builder.

Since we're only doing single-platform test builds with regular
'docker build', we don't need buildx. Regular docker build uses the
daemon's cache directly, so locally built base images are available.

Reference: https://github.com/docker/setup-buildx-action

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@rangarius rangarius marked this pull request as ready for review January 19, 2026 21:01
Copilot AI review requested due to automatic review settings January 19, 2026 21:01
@rangarius rangarius merged commit 35c2566 into main Jan 19, 2026
16 of 18 checks passed
@rangarius rangarius deleted the feature/consolidate-crawlers branch January 19, 2026 21:02
Contributor

Copilot AI left a comment

Pull request overview

This pull request implements a major architectural consolidation of the GS Crawler system, reducing the number of container images from 24 to approximately 12 by introducing a generic scraper engine driven by YAML configurations.

Changes:

  • Introduced a centralized crawlers.yaml registry as a single source of truth for all crawlers
  • Created a generic scraper engine that supports YAML-based configuration files for simple crawlers
  • Migrated 9 crawlers from custom Docker containers to config-driven implementations
  • Added comprehensive development and testing scripts for local development without Docker
  • Implemented auto-generation scripts for Docker Compose files, README documentation, and GitHub workflow matrices
  • Fixed bugs in existing custom crawlers (default output for "Erster Freitag", GIF URL fix for Bodenwasser, text color fix for Talsperren display)

Reviewed changes

Copilot reviewed 75 out of 78 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| scripts/test-scraper-local.sh | Wrapper script for testing crawlers locally without Docker |
| scripts/test-crawler.py | Main Python script for testing individual or batched crawlers |
| scripts/generate-*.py | Scripts to generate compose files, README, and workflow matrices from crawlers.yaml |
| scripts/dev.sh | Development helper script for managing local Docker environment |
| base_images/generic_scraper/* | New generic scraper base image with config-driven scraping engine |
| crawler_configs/**/*.yaml | YAML configuration files for migrated crawlers |
| crawlers.yaml | Central registry defining all crawlers in the system |
| docker_instances/027_erster_freitag/script.py | Added default output generation when no events found |
| docker_instances/033_goslar24-7/get_and_store_images.py | Added database initialization logic |
| docker_instances/035_talsperren/script.py | Changed text color from white to black for better visibility |
| docker_instances/047_bodenwasser/script.py | Fixed GIF URL and removed unused import |
| docker_instances/000_health_monitor/app.py | Updated to read crawler definitions from crawlers.yaml registry |
| compose.yaml, compose.dev.yaml | Regenerated Docker Compose files from central registry |
| README.md | Updated with new architecture and auto-generated crawler tables |
| .github/workflows/* | Optimized CI/CD workflows using registry-aware matrix generation |

start_div = soup.find("div", id="1-freitag-goslar")
if not start_div:
    print("❌ Start-DIV nicht gefunden.")
    generateDefault("Start-Div mit ID '1-freitag-goslar' nicht gefunden.")

Copilot AI Jan 19, 2026

The function call generateDefault is missing the required savemepath and filename parameters. This will cause a TypeError at runtime since the function signature at line 9 expects three parameters: message, savemepath, and filename.
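
The failure mode and one possible fix can be sketched as follows. `generate_default` here is a hypothetical stand-in with the three-parameter shape described in the comment; what the real `generateDefault` writes is not shown in this thread, so the body (writing a text file) is purely an assumption.

```python
import tempfile
from pathlib import Path


def generate_default(message: str, savemepath: str, filename: str) -> Path:
    """Hypothetical stand-in for generateDefault: write a fallback file."""
    target = Path(savemepath) / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(message, encoding="utf-8")
    return target


with tempfile.TemporaryDirectory() as tmp:
    try:
        # Mirrors the reviewed call: only the message is passed.
        generate_default("Keine passenden Einträge gefunden.")
    except TypeError as exc:
        print("TypeError:", exc)

    # Passing all three arguments avoids the error.
    out = generate_default("Keine passenden Einträge gefunden.", tmp, "default.txt")
    print(out.name)
```

An alternative would be to give `savemepath` and `filename` default values in the function signature, which keeps the existing one-argument call sites valid.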

else:
    print("❌ Keine passenden Einträge gefunden.")
    generateDefault("Keine passenden Einträge gefunden.")

Copilot AI Jan 19, 2026

The function call generateDefault is missing the required savemepath and filename parameters. This will cause a TypeError at runtime since the function signature at line 9 expects three parameters: message, savemepath, and filename.

Comment on lines +21 to +23
if sqlite3.connect(DB_PATH) is None:
    logging.info("Datenbankverbindung konnte nicht hergestellt werden.")
    init_db.init_db()

Copilot AI Jan 19, 2026

The condition sqlite3.connect(DB_PATH) is None will never be True. The sqlite3.connect() function returns a Connection object on success or raises an exception on failure. This check should be restructured to use a try-except block instead.
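
One way to restructure the check with try-except is sketched below. Note that a plain `sqlite3.connect(path)` creates a missing file instead of raising, so this sketch opens the database through a URI with `mode=rw`, which refuses to create it. The helper name `open_or_init`, the callback, and the demo table are assumptions for illustration, not code from the PR.

```python
import logging
import sqlite3
import tempfile
from pathlib import Path
from typing import Callable


def open_or_init(db_path: str, init_db: Callable[[], None]) -> sqlite3.Connection:
    """Open an existing database, or initialize it first if it cannot be opened."""
    uri = Path(db_path).resolve().as_uri() + "?mode=rw"
    try:
        # mode=rw fails with OperationalError when the file does not exist.
        return sqlite3.connect(uri, uri=True)
    except sqlite3.OperationalError:
        logging.info("Database could not be opened; initializing it.")
        init_db()
        return sqlite3.connect(db_path)


with tempfile.TemporaryDirectory() as tmp:
    db_path = str(Path(tmp) / "demo.db")

    def init_demo_db() -> None:
        # Hypothetical initializer standing in for init_db.init_db().
        setup = sqlite3.connect(db_path)
        setup.execute("CREATE TABLE images (url TEXT)")
        setup.commit()
        setup.close()

    conn = open_or_init(db_path, init_demo_db)
    print(conn.execute("SELECT name FROM sqlite_master").fetchall())
    conn.close()
```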

rangarius added a commit that referenced this pull request Jan 19, 2026
* feat: Add consolidated config-driven scraper architecture

This major refactoring introduces a generic scraper system that reduces
the number of container images needed from 24 to ~12 by using YAML
configuration files instead of individual Python scripts.

Changes:
- Add generic_scraper base image with config-driven scraping engine
- Migrate 9 crawlers to YAML configs (002_gz, 041-054)
- Add optimized CI/CD workflows (single-platform testing, conditional builds)
- Add local development scripts (test-scraper-local.sh, test-all-configs.sh)
- Add compose.consolidated.yaml for new architecture
- Add MIGRATION.md documentation

Benefits:
- ~50% reduction in container images to build
- ~50% faster CI/CD testing (single-platform)
- Conditional base image builds (only when changed)
- Config changes don't require container rebuilds
- Easier to add new crawlers (just add YAML file)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* chore: Fix .gitignore formatting and add common excludes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* chore: Remove migrated container directories

Remove 9 docker_instances/* directories that are now replaced by
YAML configs in crawler_configs/:
- 002_gz, 041_immenrode, 044_wiedelah, 048_jerstedt
- 050_tschuessschule_studium, 051_vhs, 052_vhs_kinderuni
- 053_tschuessschule_praktikum, 054_tschuessschule_ausbildung

Update compose.dev.yaml to remove references to deleted containers.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: Restore and fix daily-health-check workflow

Fixes:
- Use correct action: docker/setup-buildx-action@v3
- Use correct health monitor port: 5015
- Pull pre-built images instead of building locally
- Use compose.yaml with production images
- Update upload-artifact to v4
- Add proper error handling
- Document .gitignore file

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: Add missing 068_altstadtfest to compose.dev.yaml

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: Add local development workflow for contributors

- Add scripts/dev.sh: Main development CLI with setup, up, down, logs, etc.
- Add scripts/build-base-images-local.sh: Build base images locally (no registry)
- Add CONTRIBUTING.md: Clear setup and contribution instructions

New workflow:
  ./scripts/dev.sh setup  # First-time setup
  ./scripts/dev.sh up     # Start containers
  ./scripts/dev.sh test   # Test configs without Docker

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: Add generic scraper to dev compose and minor fixes

- Add gs_generic_scraper_simple and gs_generic_scraper_tschuessschule
  to compose.dev.yaml for local development
- Add generic scrapers to health monitor dependencies
- Update schema.md with clarification about run_on_start
- Fix talsperren script text color from white to black

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: Add single-source-of-truth registry system

Introduces crawlers.yaml as the central registry for all crawler definitions.
This eliminates the need to maintain configuration in multiple places.

Key changes:
- Add crawlers.yaml with all 25 crawler definitions
- Add scripts/generate-compose.py to generate compose files from registry
- Add scripts/generate-readme.py to update README tables from registry
- Add scripts/generate-all.sh convenience script
- Update health monitor to read crawler definitions from registry
- Update README.md with auto-generated crawler tables

Workflow for adding/modifying crawlers:
1. Edit crawlers.yaml
2. Run ./scripts/generate-all.sh
3. Commit changes

The health monitor falls back to hardcoded definitions if the registry
file is not mounted (backwards compatibility).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: Rework test scripts to use crawlers.yaml registry

- Add scripts/test-crawler.py as unified Python test runner
- Update test-scraper-local.sh to be wrapper around test-crawler.py
- Update test-all-configs.sh to be wrapper around test-crawler.py

New features:
- Test both config-driven and custom container crawlers
- Auto-detect base image requirements from Dockerfile
- Create venvs automatically for custom crawlers
- Filter by --config, --custom, or --category
- List all crawlers with --list

Usage:
  ./scripts/test-scraper-local.sh              # List crawlers
  ./scripts/test-scraper-local.sh 002_gz       # Test single crawler
  ./scripts/test-all-configs.sh                # Test all crawlers
  ./scripts/test-all-configs.sh --config       # Test only config-driven

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Remove obsolete CI workflow and scripts; refactor Dockerfile COPY commands to ADD; enhance error handling in Python scripts; update development script to rebuild containers.

* feat: Integrate crawlers.yaml registry into GitHub workflows

- Add scripts/generate-workflow-matrix.py to extract data from registry
- Add load-registry job to both test and deploy workflows
- Replace hardcoded base image lists with dynamic registry lookups
- Trigger workflows on crawlers.yaml changes
- Use registry for full container list during deploy-all scenarios
- Include registry data in workflow summary outputs

Both workflows now use the single source of truth (crawlers.yaml) for
container and base image lists, eliminating duplicate maintenance.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: Skip deleted containers in workflow change detection

When containers are deleted, git diff still shows their files as changed.
The workflows would then try to build/test containers that no longer exist.

Added check to verify container directory exists before adding to the
changed-containers list in both test and deploy workflows.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: Build base images locally for container testing

The test workflow was failing because container Dockerfiles reference
base images from ghcr.io which require authentication.

Changes:
- Add step to extract base image from Dockerfile and build it locally
- Tag local build with the exact ghcr.io name the Dockerfile expects
- Use --pull=never to prevent Docker from trying to pull from registry

This allows container tests to run without registry authentication by
building all required base images locally first.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: Use --pull=false instead of --pull=never

The --pull flag expects a boolean value, not "never".

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: Remove invalid --pull flag from docker build

The --pull flag is a presence-only flag that forces pulling.
To use local images, simply omit the flag - Docker will use
cached images by default.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: Remove buildx setup from test-containers job

The docker/setup-buildx-action creates a separate builder using the
docker-container driver, which has its own build cache isolated from
the Docker daemon's image cache. This meant locally built base images
weren't available to the buildx builder.

Since we're only doing single-platform test builds with regular
'docker build', we don't need buildx. Regular docker build uses the
daemon's cache directly, so locally built base images are available.

Reference: https://github.com/docker/setup-buildx-action

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* remove old lock files in generic_crawler deployment action

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>