6 Commits

Author SHA1 Message Date
5d23b747ff fix(orchestrator): P5b first-line agent header strict + supplement throttle
Bug discovered during #24 IMP-24 K6 Stage 2 (2026-05-20):
- Codex r1, r2, r3 started with '=== IMPLEMENTATION_UNITS ===' on first line
  (not '[Codex #N] ...'), so detect_agent (P0-1 strict, first-line only)
  returned None.
- For non-audit issues, the P5 supplement guard was audit-only gated → silent
  loop until Codex r4 happened to use correct format. 4 rounds wasted.

Verified that #21 Stage 4 had the same latent silent loop pattern
('## [Codex #1]' first line) — orchestrator looped through ~10 Claude rounds
before random recovery. P5b fix addresses this long-standing bug.

Patch (defensive parser-contract hardening; does not assume single root cause):

1. RULES global gets explicit "FIRST non-empty line MUST be [Claude #N] /
   [Codex #N]" rule that OVERRIDES any stage-specific "body MUST contain"
   constraint.

2. COMPACT_PLAN_RULE wording clarified: "body" begins AFTER the first-line
   agent header. The 'body MUST contain ONLY' set no longer accidentally
   permits '=== IMPLEMENTATION_UNITS ===' on line 1.

3. is_codex None supplement guard:
   - audit-only gate REMOVED → fires for all issues (#24 latent loop fixed)
   - Throttle: max 2 supplements per stage; on 3rd violation, orchestrator
     hard-stops the issue with explicit "user action required" message
     and exits run_stage cleanly
   - Supplement message names both Claude AND Codex (Claude's first-line
     violation also breaks downstream via Codex mimicry)
   - Body-head 80 chars logged on detection failure (debugging aid)

4. Regression tests (+5 cases in test_orchestrator_core.py):
   - TestDetectAgent: '=== IMPLEMENTATION_UNITS ===' first line → None
   - TestDetectAgent: [Codex #N] first line + units after → 'codex' OK
   - TestDetectAgent: '## ', '📌 **', '**' prefix all → None
   - TestRulesAndCompactPlanFirstLineContract: RULES wording has FIRST/OVERRIDES
   - TestRulesAndCompactPlanFirstLineContract: COMPACT_PLAN_RULE has carve-out
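
The throttle in item 3 can be sketched roughly as below. This is a minimal illustration, not the orchestrator's code: the class name, method name, and return strings are hypothetical, and only the limit of 2 supplements per stage comes from this message.

```python
from dataclasses import dataclass, field

# From the commit text: "max 2 supplements per stage".
MAX_SUPPLEMENTS_PER_STAGE = 2


@dataclass
class SupplementThrottle:
    """Hypothetical helper; the real orchestrator may track this differently."""

    counts: dict = field(default_factory=dict)

    def on_detection_failure(self, stage: int) -> str:
        """Return 'supplement' for the first two violations, 'hard_stop' after."""
        seen = self.counts.get(stage, 0)
        if seen >= MAX_SUPPLEMENTS_PER_STAGE:
            # Third violation in this stage: stop and ask for user action.
            return "hard_stop"
        self.counts[stage] = seen + 1
        return "supplement"


def body_head(body: str, limit: int = 80) -> str:
    """First `limit` characters of a comment body, for the debug log line."""
    return body[:limit]
```

Counting per stage rather than per issue means a new stage starts with a fresh budget, which is one reading of "per stage" above.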

Cosmetic side effect (accepted): Claude's '📌 **[Claude #N] ...**' or
'## [Codex #N] ...' decoration prefixes will fail detect_agent. Agents
will drop decorations from line 1; line 2+ can still use them.
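
For illustration only, a first-line-strict detector consistent with the cases listed above might look like this. The regex and the function name detect_agent_sketch are assumptions; the real detect_agent is unchanged by this commit and is not reproduced here.

```python
import re
from typing import Optional

# Assumed header shape: "[Claude #N]" or "[Codex #N]" at the very start of the line.
_HEADER = re.compile(r"\[(Claude|Codex) #\d+\]")


def detect_agent_sketch(comment: str) -> Optional[str]:
    """Inspect only the first non-empty line; any prefix before '[' fails."""
    for line in comment.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        match = _HEADER.match(stripped)
        return match.group(1).lower() if match else None
    return None
```

Because only the first non-empty line is examined, a '[Claude #N]' citation deeper in a Codex body cannot change the result, which is the property P0-1 was introduced to protect.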

Out of scope (NOT included to keep regression risk low):
- detect_agent function logic UNCHANGED (P0-1 strict preserved)
- consensus parser UNCHANGED
- stage loop structure UNCHANGED
- git/Gitea retrieval logic UNCHANGED
- audit-only mode P4/P4a guards UNCHANGED
- pre-post comment validation (future axis, larger refactor)

Total: 131/131 pytest pass (126 prior + 5 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 17:01:24 +09:00
134f52d3d3 feat(#58): L3 dormant trigger guard -- DORMANT-TRIGGERS.yaml + checker + orchestrator hook
P5-1 docs/architecture/DORMANT-TRIGGERS.yaml -- 5 entries (IMP-16/17/18/19 active + IMP-20 followup-linked #55).
P5-2 scripts/check_dormant_triggers.py -- standalone, reads registry, scans tree + diff, writes .orchestrator/dormant_alerts.json, exit 0 always.
P5-3 orchestrator.py -- _check_dormant_triggers() helper + Stage 4->5 informational alert branch (skips audit-only, never blocks).
P5-4 tests/orchestrator_unit/test_dormant_triggers.py -- 30 cases (yaml schema, registry contents, checker matching, false-positive guards, manual-evidence skip, orchestrator branch, audit bypass, governance ref).
P5-5 PROJECT-INTENT-AND-GOVERNANCE.md -- single anti-patterns row referencing the L3 registry as binding contract surface.

Tests: pytest -q tests = 337 passed (baseline 307 + 30 new).
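
A rough sketch of the "never blocks" checker shape described in P5-2. The registry is passed in as plain dicts to keep the example stdlib-only, and the entry fields 'id' and 'patterns' are assumed for illustration; the actual DORMANT-TRIGGERS.yaml schema is not shown in this message.

```python
import json
from fnmatch import fnmatch
from pathlib import Path


def find_alerts(registry: list, changed_paths: list) -> list:
    """Match changed paths against each entry's glob patterns (assumed schema)."""
    alerts = []
    for entry in registry:
        hits = sorted(
            p for p in changed_paths
            if any(fnmatch(p, pat) for pat in entry["patterns"])
        )
        if hits:
            alerts.append({"id": entry["id"], "paths": hits})
    return alerts


def run_check(registry: list, changed_paths: list, out_path: Path) -> int:
    """Write alerts as JSON and return 0 regardless, so the check stays informational."""
    try:
        alerts = find_alerts(registry, changed_paths)
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(json.dumps({"alerts": alerts}, indent=2), encoding="utf-8")
    except Exception as exc:  # report and keep going; never fail the caller
        print(f"dormant-trigger check skipped: {exc}")
    return 0
```

Returning 0 even on a malformed registry mirrors "exit 0 always": a broken checker should surface as a missing alert file, not as a blocked stage.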

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:43:14 +09:00
9389b8425b fix(orchestrator): P5 audit-anchor-first-line regression guard
Bug discovered during #56 INTEGRATION-AUDIT-02 execution (2026-05-20):
- Both Claude and Codex put "Audit anchor: ..." as the FIRST line of every
  Gitea comment per the #56 issue body instruction "cite anchor at start
  of every stage".
- detect_agent (P0-1 strict, first-line only) then returns None for these
  comments because the first line is "Audit anchor:..." not "[Codex #N]"
  or "[Claude #N]".
- Result: orchestrator's "is_codex" check (line ~1288) flips false →
  logs "Codex response not detected, continuing" → infinite Stage 4 loop.
  #56 reached Round #14 (>300 comments, ~2 hours of wasted tokens).

Fix path (NOT relaxing detect_agent — that would revive the original #45
pre-P0-1 bug where [Claude #N] citations inside Codex bodies caused
mis-detection):

1. AUDIT_ONLY_NOTE updated to enforce comment format:
   - FIRST non-empty line MUST be `[Claude #N] <stage>` or `[Codex #N] <stage>`
   - Audit anchor / banners / prefaces MUST appear line 2 or later
   - Concrete CORRECT example included
   - Explicit warning that violation breaks stage advance

2. is_codex None guard auto-supplements:
   - When _audit_mode(title) AND detect_agent returns None, orchestrator
     posts a Gitea supplement comment requesting the correct format
   - Next round's Claude/Codex see the supplement and correct
   - Breaks the infinite loop automatically (no manual ctrl-C needed)

3. Regression tests in TestDetectAgent (test_orchestrator_core.py):
   - test_audit_anchor_preface_breaks_detection: confirms P0-1 strict
     correctly returns None when anchor is first line
   - test_audit_anchor_after_header_works: correct format passes
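
As a sketch of the comment layout that item 1 enforces, a small formatter could guarantee the ordering. The helper name and parameters are hypothetical; only the "[Agent #N] <stage>" header on line 1 and the "Audit anchor: ..." line on line 2 or later come from this message.

```python
def format_comment(agent: str, round_no: int, stage: str, body: str, anchor: str = "") -> str:
    """Build a comment whose first line is the agent header (hypothetical helper)."""
    if agent not in ("Claude", "Codex"):
        raise ValueError(f"unknown agent: {agent!r}")
    lines = [f"[{agent} #{round_no}] {stage}"]
    if anchor:
        # The anchor goes on line 2 or later, never on line 1.
        lines.append(f"Audit anchor: {anchor}")
    lines.append(body)
    return "\n".join(lines)
```

Building the header in code rather than relying on prompt wording is a design alternative, not what this commit does; the commit keeps the contract in AUDIT_ONLY_NOTE and recovers via supplements.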

Total: 96/96 pytest pass (94 prior + 2 P5 regression).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:03:12 +09:00
e32f632464 fix(orchestrator): P4a baseline-diff guard + Stage 5 commit scope
P4 had two production issues blocking #50 integration audit deployment:

1. Stage 3 guard had no baseline awareness — flagged ALL forbidden-path
   changes including pre-existing dirty WIP. Empirical: 328 such files
   already in current working tree (tests/matching/ artifacts etc).
   #50 would have hit reject loops immediately without Claude doing
   anything wrong.

2. Stage 5 had no commit-scope guard — if Claude ran `git add -A` and
   committed the user's existing WIP, the audit commit would be polluted
   with unrelated production changes.

P4a additions:
- _audit_baseline_path / _ensure_audit_baseline / _load_audit_baseline:
  snapshot working-tree dirty paths at run_issue entry for audit issues.
  Resumed runs preserve existing baseline (no overwrite).
- _check_audit_only_violations(baseline=None): accept baseline set,
  subtract from violations — only flags NEW forbidden changes introduced
  after audit start.
- _check_audit_commit_scope: verify HEAD commit's file list matches
  AUDIT_ALLOWED_COMMIT_GLOBS (INTEGRATION-AUDIT-*.md, BACKLOG.md).
- run_issue: save baseline on audit-mode entry only — no impact on
  normal issues.
- Stage 5 (commit-push) YES gate: new guard rejects on out-of-scope
  files with remediation prompt (git reset --soft + force-with-lease).
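
The baseline subtraction can be illustrated with plain set arithmetic. This is a minimal sketch under assumed names (normalize, new_violations); the real _check_audit_only_violations signature beyond baseline=None is not shown here.

```python
from typing import Iterable, Optional, Set


def normalize(path: str) -> str:
    """Use forward slashes so Windows and POSIX spellings compare equal."""
    return path.replace("\\", "/")


def new_violations(
    violations: Iterable[str],
    baseline: Optional[Iterable[str]] = None,
) -> Set[str]:
    """Return forbidden-path changes that were not already dirty at audit start.

    baseline=None keeps every violation; an empty baseline filters nothing either.
    """
    current = {normalize(p) for p in violations}
    if baseline is None:
        return current
    return current - {normalize(p) for p in baseline}
```

Normalizing both sides before subtracting matters on Windows, where the snapshot and the later status check may report the same file with different separators.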

19 new tests:
- baseline subtraction (5): pre-existing removed, None=keep-all,
  empty-set=catch-all, full-coverage filter, Windows path normalize.
- baseline persist (5): roundtrip, no-overwrite on resume, missing
  fallback, corrupt JSON fallback, non-list fallback.
- commit scope detection (7): report-only allowed, backlog allowed,
  src/ rejected, unrelated docs rejected, git error fail-open,
  Windows backslash, empty commit pass.
- allowed globs sanity (2): every glob has audit marker, all under
  docs/architecture/.
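
A commit-scope check along these lines could be written with glob matching. The glob values below are assumptions pieced together from this message (INTEGRATION-AUDIT-*.md, BACKLOG.md, all under docs/architecture/); the real AUDIT_ALLOWED_COMMIT_GLOBS may differ, and out_of_scope is a hypothetical name.

```python
from fnmatch import fnmatchcase

# Assumed values; the real constant is not reproduced in the commit message.
ALLOWED_GLOBS = (
    "docs/architecture/INTEGRATION-AUDIT-*.md",
    "docs/architecture/BACKLOG.md",
)


def out_of_scope(committed_files):
    """Return committed paths that match none of the allowed globs."""
    bad = []
    for raw in committed_files:
        path = raw.replace("\\", "/")  # tolerate Windows separators
        if not any(fnmatchcase(path, glob) for glob in ALLOWED_GLOBS):
            bad.append(path)
    return bad
```

An empty file list yields an empty result, matching the "empty commit pass" case in the test list above.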

Total: 94/94 pytest pass (75 prior + 19 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 10:29:15 +09:00
4289a500b6 feat(orchestrator): P3 wrapper input/encoding fix + P4 audit-only mode
P3 hotfix (2026-05-18 — verified during #46 retry attempt):
- _run_with_tree_kill: encode input only when Popen is in binary mode.
  Previously force-encoded str→bytes even with encoding= set, breaking
  text-mode stdin pipes with: write() argument must be str, not bytes.
- run_claude path was the only affected call site.
- 3 new C7 regression tests (input+encoding / bytes+binary / auto-encode).
- C3/C6 test fixtures hardened with DEVNULL stdio isolation.
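
The encoding rule in the first bullet can be sketched as a small helper. The name prepare_input and the text_mode flag are hypothetical, and decoding bytes in text mode is an extra assumption beyond what the hotfix describes.

```python
from typing import Union


def prepare_input(
    data: Union[str, bytes, None],
    text_mode: bool,
    encoding: str = "utf-8",
) -> Union[str, bytes, None]:
    """Match the stdin payload type to the pipe mode.

    text_mode=True means Popen was opened with encoding= or text=True,
    so its stdin expects str; otherwise it expects bytes.
    """
    if data is None:
        return None
    if text_mode:
        # Assumption: decode stray bytes instead of passing them through.
        return data.decode(encoding) if isinstance(data, bytes) else data
    # Binary mode: encode only when the caller handed us a str.
    return data.encode(encoding) if isinstance(data, str) else data
```

The three branches line up with the three C7 cases named above: str input with encoding set, bytes input in binary mode, and str input auto-encoded for a binary pipe.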

P4 audit-only mode (2026-05-19, prep for #50 integration audit):
- _is_audit_issue: title-based detection for [INTEGRATION-AUDIT*],
  [AUDIT-ONLY], or "integration audit" phrase.
- _audit_mode + --audit-only CLI flag: manual override regardless of title.
- AUDIT_ONLY_NOTE injected into context pack across all stages/rounds.
- Stage 3 (code-edit) YES gate: deterministic git status check.
  Changes touching src/**, templates/**, tests/** auto-reject Stage 3 YES
  and post a supplement-request comment. LLM-independent enforcement.
- 26 new audit-mode tests (title detection, CLI override, forbidden
  prefix detection, allowed paths pass, Windows backslash normalization,
  quoted paths with spaces, git error fail-open, constants sanity).
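
A minimal sketch of the two deterministic checks, assuming case-insensitive title matching and `git status --porcelain` (v1) output as input. The function names and the regex are illustrative, not the orchestrator's own; the three forbidden prefixes come from this message.

```python
import re

# Assumed pattern covering [INTEGRATION-AUDIT*], [AUDIT-ONLY], and the phrase.
_TITLE = re.compile(
    r"\[INTEGRATION-AUDIT[^\]]*\]|\[AUDIT-ONLY\]|integration audit",
    re.IGNORECASE,
)
FORBIDDEN_PREFIXES = ("src/", "templates/", "tests/")


def is_audit_title(title: str) -> bool:
    """Title-based audit detection (illustrative)."""
    return bool(_TITLE.search(title))


def forbidden_changes(porcelain: str) -> list:
    """Extract forbidden paths from porcelain v1 lines of the form 'XY path'."""
    hits = []
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue
        path = line[3:]
        if " -> " in path:  # rename entry: keep the destination
            path = path.split(" -> ", 1)[1]
        path = path.strip().strip('"').replace("\\", "/")
        if path.startswith(FORBIDDEN_PREFIXES):
            hits.append(path)
    return hits
```

Because the gate reads git state rather than the agent's own claim, a Stage 3 YES can be rejected even when the LLM reports that nothing forbidden was touched.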

Total: 75/75 pytest pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 10:18:28 +09:00
f3bff898fb feat(orchestrator): initial orchestrator + subprocess cleanup hardening
Pre-existing P0+P1 fixes (verified via #45 pilot 2026-05-18):
- P0-1: detect_agent first-line only (fixes #45 infinite loop)
- P0-2: stage_start_count sanity reset on external comment delete
- P0-3: 32 pytest cases for parse/detect regressions
- P1-4: execution-issue mode prompt (compact scope-tight)
- P1-5: Stage 2 COMPACT_PLAN_RULE (size budget, no code snippets)
- P1-6: tests:[] orchestrator-level enforcement at Stage 2 YES guard
- P1-7: dual-write CRLF/trailing-whitespace normalize
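
The normalization in P1-7 might look roughly like this. It is a sketch of the normalize step only; the function name is hypothetical and the dual-write mechanics are not shown in this message.

```python
def normalize_text(text: str) -> str:
    """Convert CRLF/CR line endings to LF and strip trailing whitespace per line."""
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    return "\n".join(line.rstrip() for line in unified.split("\n"))
```

Splitting on "\n" rather than using splitlines() keeps a final newline intact, so two copies that differ only in line endings or trailing spaces compare equal after normalization.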

P3 subprocess cleanup (PID 2780 orphan grandchild regression):
- (pid, create_time) signature tracking — Windows PID reuse safe
- _kill_process_tree: parent-alive traversal path
- _kill_tracked: parent-dead orphan path
- _run_with_tree_kill: 1s monitor thread captures descendants live
- atexit + SIGINT safety net via _SPAWNED set
- 4 subprocess.run sites switched to wrapper (compaction/exit_report/
  run_claude/run_codex)
- 12 cleanup pytest cases incl. C6 PID 2780 regression test
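
The (pid, create_time) signature idea can be sketched as follows. SpawnTracker is a hypothetical class, and the start-time lookup is injected as a callable; in practice it could be backed by something like psutil.Process(pid).create_time(), which is an assumption since this message does not name the library.

```python
from typing import Callable, Optional, Set, Tuple

Signature = Tuple[int, float]


class SpawnTracker:
    """Track processes by (pid, create_time) so a reused PID is not mistaken
    for a process we started. Illustrative only."""

    def __init__(self, create_time_of: Callable[[int], Optional[float]]):
        # create_time_of returns the start time of a live PID, or None if gone.
        self._create_time_of = create_time_of
        self._spawned: Set[Signature] = set()

    def track(self, pid: int) -> None:
        """Record the signature of a live process; ignore PIDs already gone."""
        started = self._create_time_of(pid)
        if started is not None:
            self._spawned.add((pid, started))

    def still_ours(self) -> Set[Signature]:
        """Signatures whose PID is alive and still has the recorded start time."""
        return {
            (pid, started)
            for pid, started in self._spawned
            if self._create_time_of(pid) == started
        }
```

Only signatures returned by still_ours() would be safe to kill: a PID that now reports a different start time belongs to an unrelated process.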

Selenium boundary unchanged — driver.quit() in phase_z2_pipeline.py
and slide_measurer.py already protected by try/finally.

Total: 44/44 pytest pass (32 core + 12 cleanup).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:56:06 +09:00