Feature/consolidate crawlers #41
Closed
rangarius wants to merge 11 commits into
This major refactoring introduces a generic scraper system that reduces the number of container images needed from 24 to ~12 by using YAML configuration files instead of individual Python scripts.

Changes:
- Add generic_scraper base image with config-driven scraping engine
- Migrate 9 crawlers to YAML configs (002_gz, 041-054)
- Add optimized CI/CD workflows (single-platform testing, conditional builds)
- Add local development scripts (test-scraper-local.sh, test-all-configs.sh)
- Add compose.consolidated.yaml for new architecture
- Add MIGRATION.md documentation

Benefits:
- ~50% reduction in container images to build
- ~50% faster CI/CD testing (single-platform)
- Conditional base image builds (only when changed)
- Config changes don't require container rebuilds
- Easier to add new crawlers (just add YAML file)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
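As a rough illustration of the config-driven idea in this commit, the sketch below shows one engine function driven by a per-crawler config. It is not the PR's generic_scraper code: the config keys (`name`, `url`, `item_pattern`), the function `run_scraper`, and the stand-in `fake_fetch` are all assumptions made for the example, and the config is written as the dict a YAML loader would return.

```python
import re
from typing import Callable

# Hypothetical config, written as the dict a YAML loader would produce.
# None of these keys are taken from the PR; they are assumptions.
EXAMPLE_CONFIG = {
    "name": "example_crawler",
    "url": "https://example.org/events",
    "item_pattern": r'<li class="event">(?P<title>[^<]+)</li>',
}


def run_scraper(config: dict, fetch: Callable[[str], str]) -> list[dict]:
    """Fetch the configured page and return one record per pattern match."""
    html = fetch(config["url"])
    pattern = re.compile(config["item_pattern"])
    return [
        {"source": config["name"], **match.groupdict()}
        for match in pattern.finditer(html)
    ]


def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP request so the sketch runs offline."""
    return '<ul><li class="event">Market</li><li class="event">Concert</li></ul>'


if __name__ == "__main__":
    for record in run_scraper(EXAMPLE_CONFIG, fake_fetch):
        print(record)
```

With this shape, adding a crawler means adding another config entry rather than another script and image, which is the benefit the commit message describes.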
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove 9 docker_instances/* directories that are now replaced by YAML configs in crawler_configs/:
- 002_gz, 041_immenrode, 044_wiedelah, 048_jerstedt
- 050_tschuessschule_studium, 051_vhs, 052_vhs_kinderuni
- 053_tschuessschule_praktikum, 054_tschuessschule_ausbildung

Update compose.dev.yaml to remove references to deleted containers.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fixes:
- Use correct action: docker/setup-buildx-action@v3
- Use correct health monitor port: 5015
- Pull pre-built images instead of building locally
- Use compose.yaml with production images
- Update upload-artifact to v4
- Add proper error handling
- Document .gitignore file

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add scripts/dev.sh: Main development CLI with setup, up, down, logs, etc.
- Add scripts/build-base-images-local.sh: Build base images locally (no registry)
- Add CONTRIBUTING.md: Clear setup and contribution instructions

New workflow:
  ./scripts/dev.sh setup   # First-time setup
  ./scripts/dev.sh up      # Start containers
  ./scripts/dev.sh test    # Test configs without Docker

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add gs_generic_scraper_simple and gs_generic_scraper_tschuessschule to compose.dev.yaml for local development
- Add generic scrapers to health monitor dependencies
- Update schema.md with clarification about run_on_start
- Fix talsperren script text color from white to black

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Introduces crawlers.yaml as the central registry for all crawler definitions. This eliminates the need to maintain configuration in multiple places.

Key changes:
- Add crawlers.yaml with all 25 crawler definitions
- Add scripts/generate-compose.py to generate compose files from registry
- Add scripts/generate-readme.py to update README tables from registry
- Add scripts/generate-all.sh convenience script
- Update health monitor to read crawler definitions from registry
- Update README.md with auto-generated crawler tables

Workflow for adding/modifying crawlers:
1. Edit crawlers.yaml
2. Run ./scripts/generate-all.sh
3. Commit changes

The health monitor falls back to hardcoded definitions if the registry file is not mounted (backwards compatibility).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
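The registry-plus-fallback pattern described in this commit could look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of scripts/generate-compose.py or the health monitor: the key names (`crawlers`, `base_image`), the `registry.example` host, and the service naming are hypothetical, and the file is read as JSON only to keep the sketch within the standard library (the real registry is YAML and would need a YAML parser).

```python
import json
from pathlib import Path

# Hypothetical fallback, standing in for the health monitor's hardcoded definitions.
FALLBACK_CRAWLERS = {"002_gz": {"type": "config", "base_image": "generic_scraper"}}


def load_registry(path: str) -> dict:
    """Return crawler definitions from the registry file, or the fallback if it is absent."""
    registry_file = Path(path)
    if not registry_file.is_file():
        return FALLBACK_CRAWLERS
    # Assumption: JSON keeps the sketch stdlib-only; the real registry is YAML.
    return json.loads(registry_file.read_text(encoding="utf-8"))["crawlers"]


def compose_services(crawlers: dict, registry_host: str = "registry.example") -> dict:
    """Build a Compose-style `services` mapping with one entry per crawler."""
    return {
        f"crawler_{name}": {
            "image": f"{registry_host}/{spec['base_image']}:latest",
            "environment": {"CRAWLER_NAME": name},
        }
        for name, spec in sorted(crawlers.items())
    }


if __name__ == "__main__":
    crawlers = load_registry("crawlers.json")  # a missing file yields the fallback
    print(json.dumps({"services": compose_services(crawlers)}, indent=2))
```

Because every generated artifact derives from the same mapping, a crawler added to the registry appears in the compose output and the health monitor without further edits.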
- Add scripts/test-crawler.py as unified Python test runner
- Update test-scraper-local.sh to be wrapper around test-crawler.py
- Update test-all-configs.sh to be wrapper around test-crawler.py

New features:
- Test both config-driven and custom container crawlers
- Auto-detect base image requirements from Dockerfile
- Create venvs automatically for custom crawlers
- Filter by --config, --custom, or --category
- List all crawlers with --list

Usage:
  ./scripts/test-scraper-local.sh          # List crawlers
  ./scripts/test-scraper-local.sh 002_gz   # Test single crawler
  ./scripts/test-all-configs.sh            # Test all crawlers
  ./scripts/test-all-configs.sh --config   # Test only config-driven

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
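The filtering flags listed above might be wired up along the following lines. This is a minimal sketch, not the actual scripts/test-crawler.py: only the flag names (`--config`, `--custom`, `--category`, `--list`) come from the commit message, while the crawler entries, their `type` and `category` keys, and the category values are hypothetical.

```python
import argparse

# Hypothetical crawler entries; the "type" and "category" keys are assumptions.
CRAWLERS = [
    {"name": "002_gz", "type": "config", "category": "news"},
    {"name": "051_vhs", "type": "config", "category": "education"},
    {"name": "custom_example", "type": "custom", "category": "news"},
]


def build_parser() -> argparse.ArgumentParser:
    """Define the positional crawler name and the filter flags."""
    parser = argparse.ArgumentParser(description="Select crawlers to test")
    parser.add_argument("name", nargs="?", help="test a single crawler by name")
    parser.add_argument("--config", action="store_true", help="only config-driven crawlers")
    parser.add_argument("--custom", action="store_true", help="only custom container crawlers")
    parser.add_argument("--category", help="only crawlers in this category")
    parser.add_argument("--list", action="store_true", help="list crawlers instead of testing")
    return parser


def select_crawlers(crawlers: list, args: argparse.Namespace) -> list:
    """Apply each requested filter in turn and return the remaining crawlers."""
    selected = crawlers
    if args.name:
        selected = [c for c in selected if c["name"] == args.name]
    if args.config:
        selected = [c for c in selected if c["type"] == "config"]
    if args.custom:
        selected = [c for c in selected if c["type"] == "custom"]
    if args.category:
        selected = [c for c in selected if c["category"] == args.category]
    return selected


if __name__ == "__main__":
    args = build_parser().parse_args()
    for crawler in select_crawlers(CRAWLERS, args):
        print(crawler["name"] if args.list else f"would test {crawler['name']}")
```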
…mands to ADD; enhance error handling in Python scripts; update development script to rebuild containers.
- Add scripts/generate-workflow-matrix.py to extract data from registry
- Add load-registry job to both test and deploy workflows
- Replace hardcoded base image lists with dynamic registry lookups
- Trigger workflows on crawlers.yaml changes
- Use registry for full container list during deploy-all scenarios
- Include registry data in workflow summary outputs

Both workflows now use the single source of truth (crawlers.yaml) for container and base image lists, eliminating duplicate maintenance.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
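One way a script could turn registry data into workflow inputs is sketched below. It is not the PR's scripts/generate-workflow-matrix.py: the registry excerpt, its key names, and the `python_selenium` base image are hypothetical. The sketch emits `key=value` lines with JSON values, which a workflow step could append to `$GITHUB_OUTPUT` and a later job could read with `fromJSON`.

```python
import json

# Hypothetical registry excerpt; key names and the second base image are assumptions.
REGISTRY = {
    "crawlers": {
        "002_gz": {"base_image": "generic_scraper"},
        "051_vhs": {"base_image": "generic_scraper"},
        "custom_example": {"base_image": "python_selenium"},
    }
}


def build_matrix(registry: dict) -> dict:
    """Collect the container list and the de-duplicated base image list."""
    crawlers = registry["crawlers"]
    return {
        "containers": sorted(crawlers),
        "base_images": sorted({spec["base_image"] for spec in crawlers.values()}),
    }


if __name__ == "__main__":
    # One key=value line per output, with the value serialized as JSON.
    for key, values in build_matrix(REGISTRY).items():
        print(f"{key}={json.dumps(values)}")
```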
Restructure the existing crawler system to speed up the build process.
Fixes:
-> Default output for "Erster Freitag"
-> Fix GIF in the Bodenwasser container
-> Fix Talsperren display (#39)