Epic: ccpm-hardening-v0_1


# Epic: CCPM v0.1 Hardening

## Overview

Implement reliability and safety improvements to Claude Code PM through four core infrastructure enhancements: YAML frontmatter validation, file-based locking for parallel operations, rate-limit-aware GitHub API access, and hook-based logging. All improvements maintain existing UX and command signatures while preventing common failure scenarios.

## Architecture Decisions

### 1. Shell-Based Implementation
- **Decision**: Use POSIX shell scripts for maximum compatibility
- **Rationale**: Existing system is shell-based; maintains consistency and minimal dependencies
- **Pattern**: Place all scripts in `.claude/scripts/pm/` directory

### 2. File-Based Locking with TTL
- **Decision**: Simple filesystem locks with automatic expiration
- **Rationale**: Avoids complex locking mechanisms; self-healing through TTL
- **Implementation**: Lock files in `.claude/locks/` with timestamp-based expiration

### 3. Non-Invasive Integration
- **Decision**: Add validation calls to existing commands without changing signatures
- **Rationale**: Maintains backward compatibility and existing workflows
- **Pattern**: Prepend validation calls to existing command implementations

### 4. Hook-Based Logging
- **Decision**: Use Claude Code's hook system for transparent logging
- **Rationale**: Zero impact on command behavior; pure observability enhancement
- **Format**: NDJSON for structured log analysis

## Technical Approach

### Core Components

#### 1. Frontmatter Validation (`validate-frontmatter.sh`)
- **Purpose**: Validate YAML frontmatter and internal links before GitHub sync
- **Input**: Directory path for recursive validation
- **Validation Rules**:
  - Presence of YAML frontmatter block
  - Required `title` field exists
  - Files referenced in `depends_on` exist within same epic
- **Exit Codes**: 0 (OK), 2 (FAILED)
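
The rules above can be sketched in POSIX shell roughly as follows. This is a minimal illustration, not the epic's actual script: the function names (`frontmatter`, `validate_file`, `validate_dir`) are hypothetical, and it assumes `depends_on` is written as an unquoted inline list such as `depends_on: [001, 002]` whose entries name sibling files, with or without the `.md` extension.

```shell
# Print the frontmatter body (the lines between the opening and closing '---').
# Fails if the file does not start with '---' or the block is never closed.
frontmatter() {
    awk '
        NR == 1     { if ($0 != "---") exit 1; next }
        $0 == "---" { closed = 1; exit }
                    { print }
        END         { if (!closed) exit 1 }
    ' "$1"
}

# Validate one file; returns 0 (OK) or 2 (FAILED), matching the exit codes above.
validate_file() {
    file=$1
    dir=$(dirname "$file")

    if ! fm=$(frontmatter "$file"); then
        echo "FAIL $file: missing or unterminated frontmatter block" >&2
        return 2
    fi

    if ! printf '%s\n' "$fm" | grep -Eq '^title:[[:space:]]*[^[:space:]]'; then
        echo "FAIL $file: missing required 'title' field" >&2
        return 2
    fi

    # ASSUMPTION: inline list syntax, e.g. "depends_on: [001, 002]".
    deps=$(printf '%s\n' "$fm" |
        sed -n 's/^depends_on:[[:space:]]*\[\(.*\)].*$/\1/p' | tr ',' ' ')
    rc=0
    for dep in $deps; do
        if [ ! -f "$dir/$dep" ] && [ ! -f "$dir/$dep.md" ]; then
            echo "FAIL $file: depends_on target '$dep' not found in $dir" >&2
            rc=2
        fi
    done
    return $rc
}

# Validate every Markdown file under a directory, reporting all failures.
# File names containing newlines are not supported by this sketch.
validate_dir() {
    find "$1" -type f -name '*.md' | {
        failed=0
        while IFS= read -r f; do
            validate_file "$f" || failed=2
        done
        [ "$failed" -eq 0 ]
    } || return 2
    return 0
}
```

A script built on this sketch would end with something like `validate_dir "$1"; exit $?`. Because the sketch resolves dependencies relative to the directory of the file being checked, a reference that points outside the epic is reported as missing.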

#### 2. File Locking System (`lock`)
- **Purpose**: Prevent concurrent access to same resources
- **Interface**: `lock acquire|release <name> [ttl_s]`
- **Default TTL**: 1800 seconds (30 minutes)
- **Lock Storage**: `.claude/locks/<name>.lock` with timestamp
- **Cleanup**: Automatic expiration via TTL check
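
One possible shape for the `lock` interface is sketched below. It is written as shell functions for brevity and rests on several assumptions that the epic does not specify: the lock file stores its expiry time as a Unix epoch (acquisition time plus TTL), `date +%s` is available (common, though not strictly POSIX), and the `LOCK_DIR` variable and the `lock_acquire`/`lock_release` helper names are hypothetical.

```shell
# Directory holding the lock files; overridable for testing.
LOCK_DIR="${LOCK_DIR:-.claude/locks}"

# lock_acquire <name> [ttl_s]  ->  0 if acquired, 1 if held by someone else
lock_acquire() {
    name=$1
    ttl=${2:-1800}
    [ -n "$name" ] || return 64
    file="$LOCK_DIR/$name.lock"
    now=$(date +%s)
    mkdir -p "$LOCK_DIR" || return 1

    # Remove the lock if its stored expiry time has passed.
    # Unreadable or non-numeric content is conservatively treated as "held".
    if [ -f "$file" ]; then
        expires=$(cat "$file" 2>/dev/null)
        case $expires in
            ''|*[!0-9]*) ;;
            *) [ "$now" -ge "$expires" ] && rm -f "$file" ;;
        esac
    fi

    # With noclobber (set -C), '>' fails if the file already exists,
    # so only one caller can create the lock file.
    if ( set -C; echo "$((now + ttl))" > "$file" ) 2>/dev/null; then
        return 0
    fi
    echo "lock: '$name' is already held" >&2
    return 1
}

# lock_release <name>
lock_release() {
    [ -n "$1" ] || return 64
    rm -f "$LOCK_DIR/$1.lock"
}

# Dispatcher mirroring: lock acquire|release <name> [ttl_s]
lock() {
    case $1 in
        acquire) shift; lock_acquire "$@" ;;
        release) shift; lock_release "$@" ;;
        *) echo "usage: lock acquire|release <name> [ttl_s]" >&2; return 64 ;;
    esac
}
```

In this sketch, creating the lock file is atomic but the expire-then-create sequence is not: two callers that both see an expired lock can race, so it suits occasional contention between sessions rather than tight loops. Release also does not check ownership, so any caller can remove any lock.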

#### 3. Rate-Limit-Aware GitHub Wrapper (`gh_safe`)
- **Purpose**: Prevent GitHub API rate limit failures
- **Interface**: Drop-in replacement for `gh` command
- **Behavior**: Check rate limit before each call; sleep until reset if needed
- **Integration**: Replace bulk GitHub operations in sync commands
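
The check-then-sleep behavior could look roughly like the sketch below. It queries `gh api rate_limit` with the `--jq` flag to read the `core` resource, then forwards all arguments to `gh` unchanged. The `GH_SAFE_MIN_REMAINING` threshold and the `is_uint` helper are hypothetical names introduced for illustration, and only the core REST limit is checked here; GraphQL and search limits are tracked separately by GitHub and are ignored.

```shell
# ASSUMPTION: wait when fewer than this many core requests remain.
GH_SAFE_MIN_REMAINING="${GH_SAFE_MIN_REMAINING:-1}"

# True if $1 is a non-empty string of digits.
is_uint() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
    esac
}

gh_safe() {
    # One call returns "<remaining> <reset-epoch>" for the core REST limit.
    limits=$(gh api rate_limit \
        --jq '.resources.core | "\(.remaining) \(.reset)"' 2>/dev/null) || limits=''
    remaining=${limits%% *}
    reset=${limits##* }

    # If the limit cannot be determined, fall through and let gh report errors.
    if is_uint "$remaining" && is_uint "$reset" &&
       [ "$remaining" -lt "$GH_SAFE_MIN_REMAINING" ]; then
        wait_s=$((reset - $(date +%s) + 1))
        if [ "$wait_s" -gt 0 ]; then
            echo "gh_safe: rate limit exhausted, sleeping ${wait_s}s" >&2
            sleep "$wait_s"
        fi
    fi

    # Forward the original arguments untouched (drop-in replacement).
    gh "$@"
}
```

With this in place, a bulk operation would call, for example, `gh_safe issue list` where it previously called `gh issue list`. The wait message goes to stderr in this sketch; redirecting it to `.claude/logs/gh_safe.log` would be a small change.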

#### 4. Hook-Based Logging (`pre_tool_use.sh`)
- **Purpose**: Log all tool usage for observability
- **Hook Type**: PreToolUse in Claude Code settings
- **Output**: `.claude/logs/hooks.ndjson` with structured events
- **Fields**: timestamp, tool_name, command, working_directory
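
A dependency-free sketch of the logging step follows. It assumes the hook receives its payload as a JSON object on stdin and that the payload already carries the tool name, command, and working directory. Rather than extracting those fields, which would need a JSON parser, the sketch wraps the raw payload in an `event` key next to a UTC timestamp. The `HOOK_LOG` variable and the `log_tool_event` function name are hypothetical.

```shell
# Log file location; overridable for testing.
HOOK_LOG="${HOOK_LOG:-.claude/logs/hooks.ndjson}"

log_tool_event() {
    # Valid JSON never needs raw newlines, so stripping them turns a
    # pretty-printed payload into a single NDJSON-compatible line.
    payload=$(tr -d '\n\r')
    [ -n "$payload" ] || payload=null

    ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)

    # Append one JSON object per line; ignore failures so that logging
    # problems are never reported back as a hook failure.
    mkdir -p "$(dirname "$HOOK_LOG")" 2>/dev/null &&
        printf '{"timestamp":"%s","event":%s}\n' "$ts" "$payload" >> "$HOOK_LOG"
    return 0
}
```

A `pre_tool_use.sh` built on this sketch would consist of the function plus a final `log_tool_event` call, so the hook always exits 0. The sketch does not validate the payload, so malformed input on stdin would produce a malformed log line.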

### Infrastructure Components

#### Lock Directory Structure
```
.claude/locks/
├── issue-1234.lock    # Issue-specific locks
├── epic-sync.lock     # Epic synchronization lock
└── bulk-github.lock   # GitHub API bulk operations lock
```

#### Log Directory Structure
```
.claude/logs/
├── hooks.ndjson       # Tool usage events
└── gh_safe.log        # Rate limit wait events (optional)
```

## Implementation Strategy

### Development Phases

**Phase 1: Core Scripts (2-3 hours)**
- Implement `validate-frontmatter.sh` with comprehensive validation
- Implement `lock` script with acquire/release/TTL functionality
- Implement `gh_safe` wrapper with rate limit checking
- Create `pre_tool_use.sh` hook for logging

**Phase 2: Integration (1-2 hours)**
- Add validation calls to `/pm:epic-sync`, `/pm:sync`, `/pm:validate`
- Wrap `/pm:issue-start` with locking mechanism
- Replace `gh` calls with `gh_safe` in bulk operations
- Configure hook in `.claude/settings.json`
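
As an illustration of the locking integration, a command such as `/pm:issue-start` could run its existing body through a small helper like the one below. The helper name `run_locked` and the `LOCK_CMD` variable are hypothetical; the default path assumes the `lock` script lives in `.claude/scripts/pm/` as described earlier.

```shell
# Path to the lock script; overridable for testing.
LOCK_CMD="${LOCK_CMD:-.claude/scripts/pm/lock}"

# run_locked <lock-name> <command> [args...]
# Acquires the named lock, runs the command, releases the lock, and
# returns the command's exit status. Returns 1 without running the
# command if the lock cannot be acquired.
run_locked() {
    lock_name=$1
    shift
    "$LOCK_CMD" acquire "$lock_name" || return 1
    "$@"
    cmd_status=$?
    "$LOCK_CMD" release "$lock_name"
    return $cmd_status
}
```

For example, `run_locked issue-1234 ./start-issue.sh 1234` would serialize concurrent starts of the same issue (the script name is a placeholder). If the process is killed between acquire and release, the lock simply remains until its TTL expires.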

**Phase 3: Testing & Validation (1 hour)**
- Test positive/negative validation scenarios
- Test concurrent lock acquisition
- Test rate limit handling with forced low limits
- Verify NDJSON log generation

### Risk Mitigation

- **Script Permissions**: Document `chmod +x` requirements clearly
- **Backward Compatibility**: All existing commands maintain identical interfaces
- **Rollback Strategy**: Simple removal of prolog lines and hook configuration
- **Lock Deadlocks**: TTL-based automatic cleanup prevents permanent locks

### Testing Approach

**Unit Testing**:
- Individual script testing with edge cases
- Mock GitHub API responses for rate limit testing
- Filesystem permission testing

**Integration Testing**:
- End-to-end command flows with validation
- Concurrent session testing for locking
- Bulk operation testing with rate limits

## Task Breakdown Preview

High-level task categories that will be created:

- [ ] **Validation Infrastructure**: Implement `validate-frontmatter.sh` with YAML parsing and link checking
- [ ] **Locking System**: Implement file-based locking with TTL and cleanup mechanisms  
- [ ] **GitHub Rate Limiting**: Implement `gh_safe` wrapper with intelligent wait logic
- [ ] **Logging Infrastructure**: Implement hook-based logging with NDJSON output
- [ ] **Command Integration**: Wire validation and locking into existing PM commands
- [ ] **Configuration Setup**: Update settings and ensure proper permissions
- [ ] **Testing & Documentation**: Comprehensive testing and usage documentation

## Dependencies

### External Dependencies
- **GitHub CLI**: Already present in system
- **POSIX Shell**: Standard Unix environment
- **Claude Code Hook System**: For logging integration

### Internal Dependencies
- **Existing PM Commands**: Integration points for validation and locking
- **Epic Directory Structure**: For frontmatter validation scope
- **Settings Configuration**: For hook system integration

### Prerequisite Work
- None - builds on existing CCPM infrastructure

## Success Criteria (Technical)

### Functionality
- `/pm:validate` fails appropriately on invalid frontmatter or broken links
- Concurrent `/pm:issue-start` operations are safely serialized
- GitHub API operations never fail due to rate limits
- All tool usage is logged without affecting command behavior

### Performance
- Validation adds <2 seconds to sync operations
- Lock operations complete in <100ms
- Rate limit checks add minimal overhead to GitHub calls
- Logging adds no noticeable delay to command execution

### Reliability
- No false positives in frontmatter validation
- Locks are automatically cleaned up after TTL expiration
- Rate limit detection is accurate and responsive
- Log files rotate or remain bounded in size

## Estimated Effort

### Overall Timeline
- **Total Development**: 4-6 hours
- **Testing & Integration**: 2 hours  
- **Documentation**: 1 hour
- **Total**: 7-9 hours (can be completed in 1 development session)

### Resource Requirements
- Single developer with shell scripting experience
- Access to GitHub API for testing rate limit scenarios
- Local CCPM installation for integration testing

### Critical Path Items
1. `validate-frontmatter.sh` implementation (foundation for other components)
2. Integration with existing commands (affects all workflows)
3. Permission and configuration setup (required for deployment)

This epic delivers significant reliability improvements while maintaining the existing user experience and can be implemented as a focused, single-session development effort.

## Stats

- Total tasks: 6
- Parallel tasks: 4 (can be worked on simultaneously)
- Sequential tasks: 2 (have dependencies)
- Estimated total effort: 11 hours

