Epic: ccpm-hardening-v0_1 #1

@t3chn

Epic: CCPM v0.1 Hardening

Overview

Implement reliability and safety improvements to Claude Code PM through four core infrastructure enhancements: YAML frontmatter validation, file-based locking for parallel operations, rate-limit-aware GitHub API access, and hook-based logging. All improvements maintain existing UX and command signatures while preventing common failure scenarios.

Architecture Decisions

1. Shell-Based Implementation

  • Decision: Use POSIX shell scripts for maximum compatibility
  • Rationale: Existing system is shell-based; maintains consistency and minimal dependencies
  • Pattern: Place all scripts in .claude/scripts/pm/ directory

2. File-Based Locking with TTL

  • Decision: Simple filesystem locks with automatic expiration
  • Rationale: Avoids complex locking mechanisms; self-healing through TTL
  • Implementation: Lock files in .claude/locks/ with timestamp-based expiration

3. Non-Invasive Integration

  • Decision: Add validation calls to existing commands without changing signatures
  • Rationale: Maintains backward compatibility and existing workflows
  • Pattern: Prepend validation calls to existing command implementations

4. Hook-Based Logging

  • Decision: Use Claude Code's hook system for transparent logging
  • Rationale: Zero impact on command behavior; pure observability enhancement
  • Format: NDJSON for structured log analysis

Technical Approach

Core Components

1. Frontmatter Validation (validate-frontmatter.sh)

  • Purpose: Validate YAML frontmatter and internal links before GitHub sync
  • Input: Directory path for recursive validation
  • Validation Rules:
    • Presence of YAML frontmatter block
    • Required title field exists
    • Files referenced in depends_on exist within same epic
  • Exit Codes: 0 (OK), 2 (FAILED)
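
The rules above could be checked roughly as follows. This is a minimal sketch, not the actual validate-frontmatter.sh: the function names are hypothetical, and it assumes depends_on is written as an inline list of file names (for example depends_on: [001.md, 002.md]) that resolve relative to the task file's own directory.

```shell
#!/bin/sh
# Hypothetical helper names; the epic only specifies the rules and exit codes.

# Succeeds when the file opens with '---' and has a closing '---' line.
has_frontmatter() {
  awk 'NR == 1 && $0 != "---" { exit }
       NR > 1 && $0 == "---" { found = 1; exit }
       END { exit !found }' "$1"
}

# Prints the lines between the opening and closing '---'.
frontmatter() {
  awk 'NR == 1 { next } $0 == "---" { exit } { print }' "$1"
}

# Returns 0 when the file passes all rules, 2 otherwise.
validate_file() {
  file=$1
  dir=$(dirname "$file")
  if ! has_frontmatter "$file"; then
    echo "FAILED: $file: missing frontmatter block" >&2
    return 2
  fi
  fm=$(frontmatter "$file")
  if ! printf '%s\n' "$fm" | grep -q '^title:[[:space:]]*[^[:space:]]'; then
    echo "FAILED: $file: missing title" >&2
    return 2
  fi
  # Assumption: dependencies are an inline list of file names.
  deps=$(printf '%s\n' "$fm" |
    sed -n 's/^depends_on:[[:space:]]*\[\(.*\)][[:space:]]*$/\1/p' | tr ',' ' ')
  for dep in $deps; do
    if [ ! -f "$dir/$dep" ]; then
      echo "FAILED: $file: depends_on target not found: $dep" >&2
      return 2
    fi
  done
  return 0
}

# Validates every .md file under a directory; returns 2 if any file fails.
validate_dir() {
  find "$1" -type f -name '*.md' | (
    status=0
    while IFS= read -r f; do
      validate_file "$f" || status=2
    done
    exit "$status"
  )
}
```

A caller such as /pm:epic-sync could run validate_dir on the epic directory and stop when the result is non-zero. A real implementation would likely need a proper YAML parser to handle multi-line lists and quoted values.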

2. File Locking System (lock)

  • Purpose: Prevent concurrent access to same resources
  • Interface: lock acquire|release <name> [ttl_s]
  • Default TTL: 1800 seconds (30 minutes)
  • Lock Storage: .claude/locks/<name>.lock with timestamp
  • Cleanup: Automatic expiration via TTL check
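
One way the acquire/release interface with TTL could work is sketched below. The function names and the LOCK_DIR variable are hypothetical, the lock file is assumed to hold its expiry time as a Unix timestamp, and date +%s is assumed to be available (it is common on GNU and BSD systems but not strictly POSIX).

```shell
#!/bin/sh
# Hypothetical variable; the epic places locks in .claude/locks/.
LOCK_DIR=${LOCK_DIR:-.claude/locks}

# lock_acquire <name> [ttl_s]: returns 0 on success, 1 if the lock is held.
lock_acquire() {
  name=$1
  ttl=${2:-1800}
  file="$LOCK_DIR/$name.lock"
  mkdir -p "$LOCK_DIR"
  now=$(date +%s)
  if [ -f "$file" ]; then
    expires=$(cat "$file" 2>/dev/null)
    # Remove a lock whose TTL has passed so it cannot block forever.
    if [ -n "$expires" ] && [ "$now" -ge "$expires" ] 2>/dev/null; then
      rm -f "$file"
    fi
  fi
  # With noclobber (set -C), '>' fails if the file already exists,
  # so only one process can create the lock file.
  if ( set -C; echo "$((now + ttl))" > "$file" ) 2>/dev/null; then
    return 0
  fi
  echo "lock '$name' is already held" >&2
  return 1
}

# lock_release <name>: removes the lock file if present.
lock_release() {
  rm -f "$LOCK_DIR/$1.lock"
}
```

The stale-lock cleanup in this sketch is not itself atomic: two processes that both see an expired lock could race between the rm and the create. For sessions started by hand that window is small, but a production version may want a stricter scheme.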

3. Rate-Limit-Aware GitHub Wrapper (gh_safe)

  • Purpose: Prevent GitHub API rate limit failures
  • Interface: Drop-in replacement for gh command
  • Behavior: Check rate limit before each call; sleep until reset if needed
  • Integration: Replace bulk GitHub operations in sync commands
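
The check-then-sleep behavior might look like the sketch below. It assumes the wrapper reads the core REST quota from gh api rate_limit; the GH_SAFE_MIN_REMAINING threshold is a hypothetical knob that does not appear in the epic.

```shell
#!/bin/sh
# Hypothetical threshold: wait for the reset when fewer calls than this remain.
GH_SAFE_MIN_REMAINING=${GH_SAFE_MIN_REMAINING:-50}

gh_safe() {
  # One query returns "<remaining> <reset-epoch>" for the core REST quota.
  limits=$(gh api rate_limit \
    --jq '.resources.core | "\(.remaining) \(.reset)"' 2>/dev/null) || limits=
  remaining=${limits%% *}
  reset=${limits##* }
  if [ -n "$remaining" ] && [ "$remaining" -lt "$GH_SAFE_MIN_REMAINING" ]; then
    wait_s=$((reset - $(date +%s) + 1))
    if [ "$wait_s" -gt 0 ]; then
      echo "gh_safe: $remaining calls left, sleeping ${wait_s}s" >&2
      sleep "$wait_s"
    fi
  fi
  # Pass every argument through unchanged, so it stays a drop-in for gh.
  gh "$@"
}
```

Usage is the same as plain gh, for example gh_safe issue create --title "..." in a sync loop. If the rate limit query itself fails, this sketch skips the check and runs the command anyway, which is a design choice rather than a requirement from the epic.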

4. Hook-Based Logging (pre_tool_use.sh)

  • Purpose: Log all tool usage for observability
  • Hook Type: PreToolUse in Claude Code settings
  • Output: .claude/logs/hooks.ndjson with structured events
  • Fields: timestamp, tool_name, command, working_directory
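
A hook body producing those four fields could be sketched as below. It assumes the hook receives a JSON event on stdin containing tool_name, tool_input.command, and cwd (worth checking against the Claude Code hooks documentation), and it assumes jq is installed, which the epic does not list as a dependency. The function name and the HOOK_LOG variable are hypothetical.

```shell
#!/bin/sh
# Reads one hook event (JSON) from stdin and appends one NDJSON line.
log_tool_use() {
  log_file=${HOOK_LOG:-.claude/logs/hooks.ndjson}
  mkdir -p "$(dirname "$log_file")"
  jq -c --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" '{
      timestamp: $ts,
      tool_name: .tool_name,
      command: (.tool_input.command // null),
      working_directory: .cwd
    }' >> "$log_file" 2>/dev/null
  # Always succeed: a logging failure should never affect the tool call.
  return 0
}

# In pre_tool_use.sh itself, the last line would simply be: log_tool_use
```

Returning 0 unconditionally keeps the hook purely observational. Tools without a command in their input are logged with command set to null.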

Infrastructure Components

Lock Directory Structure

.claude/locks/
├── issue-1234.lock    # Issue-specific locks
├── epic-sync.lock     # Epic synchronization lock
└── bulk-github.lock   # GitHub API bulk operations lock

Log Directory Structure

.claude/logs/
├── hooks.ndjson       # Tool usage events
└── gh_safe.log        # Rate limit wait events (optional)

Implementation Strategy

Development Phases

Phase 1: Core Scripts (2-3 hours)

  • Implement validate-frontmatter.sh with comprehensive validation
  • Implement lock script with acquire/release/TTL functionality
  • Implement gh_safe wrapper with rate limit checking
  • Create pre_tool_use.sh hook for logging

Phase 2: Integration (1-2 hours)

  • Add validation calls to /pm:epic-sync, /pm:sync, /pm:validate
  • Wrap /pm:issue-start with locking mechanism
  • Replace gh calls with gh_safe in bulk operations
  • Configure hook in .claude/settings.json
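
Wrapping a command such as /pm:issue-start with the lock might follow the pattern below. This is a sketch under the assumption that the lock script from Phase 1 is on PATH and follows the lock acquire|release <name> [ttl_s] interface; with_lock is a hypothetical helper name.

```shell
#!/bin/sh
# with_lock <name> <command...>: runs the command only while holding the lock.
with_lock() {
  name=$1
  shift
  # Refuse to start if another session already holds the lock.
  lock acquire "$name" || return 1
  "$@"
  status=$?
  # Release on both success and failure of the wrapped command.
  lock release "$name"
  return "$status"
}
```

For example, with_lock issue-1234 some-start-script would serialize two sessions starting the same issue. If the session is killed before release runs, the TTL on the lock file is what eventually frees it.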

Phase 3: Testing & Validation (1 hour)

  • Test positive/negative validation scenarios
  • Test concurrent lock acquisition
  • Test rate limit handling with forced low limits
  • Verify NDJSON log generation

Risk Mitigation

  • Script Permissions: Document chmod +x requirements clearly
  • Backward Compatibility: All existing commands maintain identical interfaces
  • Rollback Strategy: Remove the prepended validation calls (prolog lines) and the hook configuration
  • Lock Deadlocks: TTL-based automatic cleanup prevents permanent locks

Testing Approach

Unit Testing:

  • Individual script testing with edge cases
  • Mock GitHub API responses for rate limit testing
  • Filesystem permission testing

Integration Testing:

  • End-to-end command flows with validation
  • Concurrent session testing for locking
  • Bulk operation testing with rate limits

Task Breakdown Preview

High-level task categories that will be created:

  • Validation Infrastructure: Implement validate-frontmatter.sh with YAML parsing and link checking
  • Locking System: Implement file-based locking with TTL and cleanup mechanisms
  • GitHub Rate Limiting: Implement gh_safe wrapper with intelligent wait logic
  • Logging Infrastructure: Implement hook-based logging with NDJSON output
  • Command Integration: Wire validation and locking into existing PM commands
  • Configuration Setup: Update settings and ensure proper permissions
  • Testing & Documentation: Comprehensive testing and usage documentation

Dependencies

External Dependencies

  • GitHub CLI: Already present in system
  • POSIX Shell: Standard Unix environment
  • Claude Code Hook System: For logging integration

Internal Dependencies

  • Existing PM Commands: Integration points for validation and locking
  • Epic Directory Structure: For frontmatter validation scope
  • Settings Configuration: For hook system integration

Prerequisite Work

  • None - builds on existing CCPM infrastructure

Success Criteria (Technical)

Functionality

  • /pm:validate fails appropriately on invalid frontmatter or broken links
  • Concurrent /pm:issue-start operations are safely serialized
  • Bulk GitHub API operations wait for the rate limit to reset instead of failing
  • All tool usage is logged without affecting command behavior

Performance

  • Validation adds <2 seconds to sync operations
  • Lock operations complete in <100ms
  • Rate limit checks add minimal overhead to GitHub calls
  • Logging adds no noticeable delay to command execution

Reliability

  • No false positives in frontmatter validation
  • Locks are automatically cleaned up after TTL expiration
  • Rate limit detection is accurate and responsive
  • Log files rotate or remain bounded in size

Estimated Effort

Overall Timeline

  • Total Development: 4-6 hours
  • Testing & Integration: 2 hours
  • Documentation: 1 hour
  • Total: 7-9 hours (can be completed in 1 development session)

Resource Requirements

  • Single developer with shell scripting experience
  • Access to GitHub API for testing rate limit scenarios
  • Local CCPM installation for integration testing

Critical Path Items

  1. validate-frontmatter.sh implementation (foundation for other components)
  2. Integration with existing commands (affects all workflows)
  3. Permission and configuration setup (required for deployment)

This epic delivers significant reliability improvements while maintaining the existing user experience and can be implemented as a focused, single-session development effort.

Stats

Total tasks: 6
Parallel tasks: 4 (can be worked on simultaneously)
Sequential tasks: 2 (have dependencies)
Estimated total effort: 11 hours
