Astropy caching #1190

Open: rcboufleur wants to merge 19 commits into main from astropy_caching

@rcboufleur
Collaborator

No description provided.

- Add benchmark.py: Section-based timing for identifying bottlenecks
- Add resource_monitor.py: CPU, memory, I/O, and load monitoring
- Instrument occ_path_coeff.py with benchmark and resource monitoring
- Enable with BENCHMARK_ENABLED=1 and RESOURCE_MONITOR=1 env vars
- All monitoring is non-intrusive and easily removable (marked with comments)
- Reverted experimental optimizations that compromised scientific accuracy
- Benchmark and ResourceMonitor now accept 'enabled' parameter
- occ_path_coeff.py reads debug flag from obj_data or obj_data.predict_occultation
- Environment variables still work as fallback (BENCHMARK_ENABLED, RESOURCE_MONITOR)
- Set debug=True in job configuration to enable monitoring on cluster
- Added set_debug() method to Asteroid class
- Call a.set_debug(DEBUG) in submit_tasks after setting job_id
- This propagates debug=True from job to each asteroid's JSON
- Enables benchmarking and resource monitoring on cluster workers
- Removed 'enabled' parameter from Benchmark and ResourceMonitor
- BENCHMARK_ENABLED=1 enables benchmarking
- RESOURCE_MONITOR=1 enables resource monitoring
- Reverted asteroid.set_debug() and run_pred_occ changes
- Set env vars in cluster.sh for cluster runs
- MAX_NODES: controls max_blocks (default: 20)
- MAX_WORKERS_PER_NODE: controls max_workers per node (default: 28)
- BENCHMARK_ENABLED/RESOURCE_MONITOR: passed to cluster workers
- Reduces I/O contention by limiting concurrent workers
- init_blocks still calculated dynamically in run_pred_occ.py
- Update parsl_config.py to properly detect and export BENCHMARK_ENABLED and RESOURCE_MONITOR
- Use 'in os.environ' check to reliably detect variables from docker-compose environment section
- Export variables in worker_init script to ensure they reach all worker processes
- Fixes issue where variables from .env file were not reaching Parsl workers
- Fix import order in run_pred_occ.py: move astropy_cache_config before astropy.config
  This ensures XDG_CACHE_HOME is set before astropy initializes its cache directory
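
For reference, the section-based timing mentioned at the top of this list could look roughly like the sketch below. `SectionTimer` and its `section()` method are hypothetical names, not the contents of benchmark.py; the only detail taken from the notes above is that `BENCHMARK_ENABLED=1` switches timing on.

```python
import os
import time
from contextlib import contextmanager


class SectionTimer:
    """Hypothetical stand-in for a section-based benchmark helper."""

    def __init__(self):
        # Timing is active only when BENCHMARK_ENABLED is exactly "1".
        self.enabled = os.environ.get("BENCHMARK_ENABLED") == "1"
        # Accumulated wall-clock seconds per section name.
        self.totals = {}

    @contextmanager
    def section(self, name):
        if not self.enabled:
            # Disabled: run the wrapped code without measuring anything.
            yield
            return
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the wrapped code raised.
            elapsed = time.perf_counter() - start
            self.totals[name] = self.totals.get(name, 0.0) + elapsed
```

Wrapping a block in `with timer.section("load_ephemeris"):` would add its elapsed time to `timer.totals`. When the variable is unset the context manager does nothing, which keeps the instrumentation non-intrusive.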

- Configure astropy cache in env.sh for linea environment on lead node
  Sets XDG_CACHE_HOME based on PREDICT_INPUTS parent directory for shared filesystem cache

- Export BENCHMARK_ENABLED and RESOURCE_MONITOR in env.sh for lead node
  Ensures variables are available when get_config() runs and can be passed to workers

- Export BENCHMARK_ENABLED and RESOURCE_MONITOR in cluster.sh for workers
  Variables are passed via Parsl envs dict but need explicit export for Python processes
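
A minimal sketch of the propagation described above, assuming the flags are turned into `export` lines that get appended to the worker init script. `build_worker_exports` is a hypothetical helper, not the code in parsl_config.py; the `in` check mirrors the detection approach named in the commit notes.

```python
import os
import shlex

# Flags named in this PR that should reach every worker process.
FLAGS = ("BENCHMARK_ENABLED", "RESOURCE_MONITOR")


def build_worker_exports(environ=None):
    """Return shell `export` lines for each flag present in the environment."""
    env = os.environ if environ is None else environ
    lines = []
    for name in FLAGS:
        # `in` detects the variable even when its value is empty or "0",
        # which a truthiness test on env.get(name) would miss.
        if name in env:
            lines.append(f"export {name}={shlex.quote(env[name])}")
    return "\n".join(lines)
```

The returned string can be concatenated onto whatever shell snippet the workers already run at startup, so flags that are absent on the lead node stay unset on the workers.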

Fixes:
- Astropy cache now uses shared filesystem location instead of /home/user/.astropy/cache
- Environment variables from .env properly propagated to cluster workers
- Works for both daemon.sh and rerun.sh execution paths

This fixes the hanging issue during ingestion worker startup. When
astropy.config was imported before astropy_cache_config, astropy would
initialize with the default cache directory (~/.astropy/cache) instead
of the shared filesystem cache. This caused IERS data lookups to fail
and trigger download attempts, leading to hangs when the network was
unavailable or slow.

By importing astropy_cache_config first, XDG_CACHE_HOME is set before
astropy initializes, ensuring it uses the correct shared cache location
where IERS data already exists.
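
The ordering constraint can be illustrated with a small sketch. `configure_astropy_cache` and the `cache` subdirectory are assumptions made for illustration, not the contents of astropy_cache_config. The sketch also creates the `astropy` subdirectory, because to my understanding astropy only honors `XDG_CACHE_HOME` when `$XDG_CACHE_HOME/astropy` already exists.

```python
import os
from pathlib import Path


def configure_astropy_cache(shared_root):
    """Point XDG_CACHE_HOME at a shared directory (hypothetical helper)."""
    # Assumed layout: <shared_root>/cache/astropy on the shared filesystem.
    cache_home = Path(shared_root) / "cache"
    astropy_dir = cache_home / "astropy"
    # Create the astropy subdirectory up front so the location is usable.
    astropy_dir.mkdir(parents=True, exist_ok=True)
    os.environ["XDG_CACHE_HOME"] = str(cache_home)
    return astropy_dir


# This call must run before the first `import astropy...` in the process,
# for example at the top of the entry-point module:
#
#     configure_astropy_cache("/path/to/shared")   # placeholder path
#     import astropy.config                        # now sees XDG_CACHE_HOME
```

Running the configuration as an import side effect, as the PR does, works for the same reason: the environment variable is in place by the time astropy resolves its cache directory.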

The pandas to_sql() method with a custom upsert function processes data
in chunks (default ~1000 rows). Each chunk executes a separate
INSERT ... ON CONFLICT statement, which checks the hash_id unique
constraint.

When multiple asteroids finish simultaneously, this creates many
concurrent INSERT operations all checking the same unique constraint,
causing database lock contention and slowdowns.

By increasing chunksize to 5000, we reduce the number of INSERT
statements by 5x, significantly reducing lock contention on the
hash_id constraint while still maintaining reasonable transaction sizes.

This should improve ingestion performance when multiple asteroids
complete processing at the same time.
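
The arithmetic behind the 5x figure can be checked with a short sketch. It uses no database: `iter_chunks` and `statement_count` are illustrative helpers, and the 10,000-row batch is an arbitrary example size, not a measurement from this pipeline.

```python
import math
from itertools import islice


def iter_chunks(rows, chunksize):
    """Yield successive lists of at most `chunksize` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk


def statement_count(n_rows, chunksize):
    """Number of INSERT statements if each chunk becomes one statement."""
    return math.ceil(n_rows / chunksize)


# Example: 10,000 rows written with two different chunk sizes.
small = statement_count(10_000, 1_000)   # 10 statements
large = statement_count(10_000, 5_000)   # 2 statements
```

Fewer statements means fewer separate checks against the unique constraint, at the cost of larger individual statements.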

- Fix UnboundLocalError in predict_occ.py: Initialize occultation_file before try block and add guard check
- Add null value handling in consolidate_results() and ingest_predictions(): Filter out rows with null closest_approach to prevent NOT NULL constraint violations
- Revert chunksize optimization in occultation.py: Restore original ingestion method without chunksize parameter
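
The first two fixes in this list follow common patterns, sketched below under stated assumptions. Both functions are hypothetical: the real code in predict_occ.py, consolidate_results() and ingest_predictions() is not shown in this PR text, and plain dicts stand in for whatever row structure the pipeline actually uses.

```python
def read_occultation_table(open_file):
    """Read a table, tolerating a failure while opening the file."""
    # Bind the name before the try block so the cleanup below can never
    # raise UnboundLocalError when open_file() itself fails.
    occultation_file = None
    try:
        occultation_file = open_file()
        return occultation_file.read()
    except OSError:
        return None
    finally:
        # Guard check: only close the file if it was actually opened.
        if occultation_file is not None:
            occultation_file.close()


def drop_null_closest_approach(rows):
    """Keep only rows whose closest_approach value is present."""
    # Rows with a missing or None value would violate a NOT NULL column.
    return [row for row in rows if row.get("closest_approach") is not None]
```

Filtering before ingestion keeps one incomplete prediction from failing the insert for the rest of its batch.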