feat(#64): IMP-35 details_popup_escalation u11 baseline-red invariance gate

Add a test-only invariance gate that locks the pre-existing four-test red
baseline so IMP-35 cannot silently grow the red surface while in-flight.
u11 does NOT fix the four reds — Stage 2 follow_up_candidates tracks the
actual repair as a separate issue. u1~u10 production work remains in the
worktree and is explicitly out of this commit per Stage 3 R7 carve-out.

Frozen registry (IMP35_BASELINE_RED_NODE_IDS, set semantics):
  1. tests/test_imp47b_step12_ai_wiring.py
       ::test_mixed_units_classified_by_route_and_provisional_flag
  2. tests/test_imp47b_step12_ai_wiring.py
       ::test_reject_provisional_unit_reaches_router_short_circuit
  3. tests/test_imp47b_step12_ai_wiring.py
       ::test_step12_ai_repair_artifact_writes_json_serialisable_records
  4. tests/test_phase_z2_ai_fallback_config.py
       ::test_ai_fallback_master_flag_default_off

Gate semantics (subprocess pytest, set comparison):
  - All 4 node ids resolve to collectible pytest items
    (rename / delete is caught up front).
  - Broader baseline-area sweep across the two registry files yields
    EXACTLY 4 FAILED and 0 ERROR, with FAILED set ≡ registry.
  - A new red in the baseline area flips count above 4 OR introduces a
    FAILED id outside the registry; either branch fails the gate.
  - Cross-lock test ensures registry node ids cannot point outside the
    declared area-files inventory.
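
The set comparison above can be sketched roughly as follows. This is an
illustration only, not the gate itself: `REGISTRY`, `SAMPLE_STDOUT`, and
`check_drift` are hypothetical names, the registry is shortened to two
made-up node ids, and the sample output is hand-written to resemble
pytest's short summary lines.

```python
import re

# Hypothetical two-entry registry for illustration (the real one has four ids).
REGISTRY = frozenset({
    "tests/test_a.py::test_one",
    "tests/test_a.py::test_two",
})

# Same shape as the gate's FAILED-line pattern: capture the bare node id and
# drop any trailing " - <failure detail>".
FAILED_LINE_RE = re.compile(r"^FAILED\s+(\S+?)(?:\s+-\s+.*)?$", re.MULTILINE)


def check_drift(stdout: str, registry: frozenset[str]) -> tuple[set[str], set[str]]:
    """Return (unexpected new reds, unexpectedly green) for a pytest stdout."""
    failed = {m.group(1) for m in FAILED_LINE_RE.finditer(stdout)}
    return failed - registry, set(registry) - failed


# Hand-written sample: two registered reds plus one unregistered red.
SAMPLE_STDOUT = """\
FAILED tests/test_a.py::test_one - AssertionError: boom
FAILED tests/test_a.py::test_two
FAILED tests/test_a.py::test_three - KeyError: 'x'
3 failed, 6 passed in 0.42s
"""

new_reds, gone_green = check_drift(SAMPLE_STDOUT, REGISTRY)
print("unexpected new reds:", sorted(new_reds))
print("unexpectedly green: ", sorted(gone_green))
```

Comparing as sets means either direction of drift (a new red, or a
registered red that went green) shows up as a non-empty difference.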

AI isolation contract (feedback_ai_isolation_contract):
  Gate body uses stdlib only (subprocess + re + ast). An AST self-verify
  test rejects `anthropic` imports and `route_ai_fallback` references in
  this file, structurally preventing AI routing inside the gate.
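
A minimal sketch of the AST-based import check, assuming only the stdlib
`ast` module. `forbidden_imports`, `FORBIDDEN_ROOT`, and the two sample
sources are hypothetical names for illustration; the gate's own test
additionally inspects call sites.

```python
import ast

FORBIDDEN_ROOT = "anthropic"  # top-level package the gate must never import


def forbidden_imports(source: str) -> list[str]:
    """List imported module names whose top-level package is FORBIDDEN_ROOT."""
    hits: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module is not None:
            names = [node.module]
        else:
            continue
        hits.extend(n for n in names if n.split(".", 1)[0] == FORBIDDEN_ROOT)
    return hits


# A string literal that merely mentions the package is not an import node.
clean = 'import re\nMESSAGE = "must not import anthropic"\n'
dirty = "import re\nfrom anthropic.types import Message\n"

print(forbidden_imports(clean))
print(forbidden_imports(dirty))
```

Walking the parsed tree rather than searching the raw text is what lets
the gate name the forbidden tokens in its own assertion messages without
tripping itself.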

Stage 4 verification (HEAD c1df656 pre-commit):
  pytest -q tests/phase_z2/test_imp35_baseline_red_invariance.py
    → 7 passed in 15.26s.
  Baseline area sweep
    (tests/test_imp47b_step12_ai_wiring.py +
     tests/test_phase_z2_ai_fallback_config.py)
    → 4 failed / 6 passed / 0 errors; FAILED set ≡ registry (identity).
  pytest --collect-only on the 4 registered node ids → all 4 resolve.
  py_compile clean. Codex R1 = YES (independent verify).

Guardrails honored:
  - Scope-locked: test-only file; zero production code in this commit.
  - 1 commit = 1 decision unit (u11 only).
  - No hardcoding: registry = Stage 2 contract frozen tuple, not
    sample-specific literal; gate body has zero magic constants.
  - AI isolation: stdlib-only gate, AST self-verify locks isolation.
  - Repair of the four baseline-red test bodies = separate follow-up
    issue, not u11 scope.

source_comment_ids: Stage 1 problem-review; Stage 2 plan R2 + Codex R2
YES; Stage 3 Claude #30 + Codex #31 R7 YES; Stage 4 Claude #32 + Codex
#33 R1 YES.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 04:13:54 +09:00
parent c1df656312
commit 7c93031f9b


@@ -0,0 +1,339 @@
"""IMP-35 (#64) u11 — baseline-red invariance gate.

Stage 2 binding contract (unit u11):
    IMP-35 inherits a four-test red baseline from prior phases that is
    explicitly OUT OF SCOPE for this issue:

      1. tests/test_imp47b_step12_ai_wiring.py
           ::test_mixed_units_classified_by_route_and_provisional_flag
      2. tests/test_imp47b_step12_ai_wiring.py
           ::test_reject_provisional_unit_reaches_router_short_circuit
      3. tests/test_imp47b_step12_ai_wiring.py
           ::test_step12_ai_repair_artifact_writes_json_serialisable_records
      4. tests/test_phase_z2_ai_fallback_config.py
           ::test_ai_fallback_master_flag_default_off

    u11 does NOT fix these. u11 LOCKS the count + identity of the
    baseline-red set so that IMP-35 cannot silently grow the red surface
    while the issue is in-flight. A follow-up issue (Stage 2 plan
    `follow_up_candidates`) tracks the actual repair.

Invariance semantics:
    - The exact four baseline-red node ids resolve to real, collectible
      pytest items (a rename / delete is caught up front; the gate cannot
      be defeated by silently removing the failing test).
    - Running pytest on the BROADER baseline-area files
      (``tests/test_imp47b_step12_ai_wiring.py`` +
      ``tests/test_phase_z2_ai_fallback_config.py``) yields EXACTLY four
      FAILED node ids and zero ERROR node ids; the FAILED set is exactly
      the documented baseline-red set.
    - A NEW red introduced by IMP-35 in the baseline area flips the
      FAILED count above four AND/OR introduces an extra FAILED node id
      that is not in the baseline set; either branch fails this gate.

AI isolation contract (`feedback_ai_isolation_contract`):
    The invariance gate runs pytest in a child process and parses stdout.
    It must NOT import the Anthropic SDK and must NOT route through
    ``route_ai_fallback``. The structural import test below locks this.

Stage 2 plan source: Stage 2 exit report u11: "u11 acknowledges the
current four red baseline tests as pre-existing and adds an invariance
gate so IMP-35 cannot worsen them."
"""
from __future__ import annotations
import ast
import re
import subprocess
import sys
from pathlib import Path
# === BASELINE-RED REGISTRY (frozen by Stage 2 u11 contract) ===
#
# Order is informational only; the gate compares as a set. Each entry
# is a fully-qualified pytest node id resolvable from the repo root.
IMP35_BASELINE_RED_NODE_IDS: tuple[str, ...] = (
    "tests/test_imp47b_step12_ai_wiring.py"
    "::test_mixed_units_classified_by_route_and_provisional_flag",
    "tests/test_imp47b_step12_ai_wiring.py"
    "::test_reject_provisional_unit_reaches_router_short_circuit",
    "tests/test_imp47b_step12_ai_wiring.py"
    "::test_step12_ai_repair_artifact_writes_json_serialisable_records",
    "tests/test_phase_z2_ai_fallback_config.py"
    "::test_ai_fallback_master_flag_default_off",
)
# Files that own the baseline-red set. The "no-new-red in baseline area"
# axis runs pytest on this set and checks that ONLY the registry above
# fails.
IMP35_BASELINE_RED_AREA_FILES: tuple[str, ...] = (
    "tests/test_imp47b_step12_ai_wiring.py",
    "tests/test_phase_z2_ai_fallback_config.py",
)
# === Repo root resolution (subprocess CWD anchor) ===
# tests/phase_z2/<this file>.py -> parents[2] = repo root.
_REPO_ROOT: Path = Path(__file__).resolve().parents[2]
# === pytest stdout parsers ===
# Matches lines like:
# FAILED tests/test_imp47b_step12_ai_wiring.py::test_xxx
# and:
# FAILED tests/test_imp47b_step12_ai_wiring.py::test_xxx - AssertionError: ...
# The capture group is the bare node id (no trailing failure detail).
_FAILED_LINE_RE = re.compile(r"^FAILED\s+(\S+?)(?:\s+-\s+.*)?$", re.MULTILINE)
# Matches lines like:
# ERROR tests/test_xxx.py::test_yyy
_ERROR_LINE_RE = re.compile(r"^ERROR\s+(\S+?)(?:\s+-\s+.*)?$", re.MULTILINE)
# Matches the pytest tail summary line (sub-second timing field varies):
# 4 failed, 6 passed in 2.27s
_TAIL_SUMMARY_RE = re.compile(
    r"^(?P<body>.*?)\s+in\s+\d+(?:\.\d+)?s\s*$", re.MULTILINE
)


def _run_pytest_collect_only(node_ids: tuple[str, ...]) -> subprocess.CompletedProcess:
    """Run ``pytest --collect-only -q`` against the supplied node ids.

    Used to confirm the baseline-red registry resolves to real, currently
    collectible tests. If a test is renamed / moved / deleted out from
    under the registry, pytest's collection failure is the signal.
    """
    return subprocess.run(
        [
            sys.executable,
            "-m",
            "pytest",
            "--collect-only",
            "-q",
            *node_ids,
        ],
        cwd=_REPO_ROOT,
        capture_output=True,
        text=True,
        check=False,
    )


def _run_pytest_quiet(targets: tuple[str, ...]) -> subprocess.CompletedProcess:
    """Run ``pytest -q --tb=no -p no:cacheprovider`` against ``targets``.

    ``-p no:cacheprovider`` keeps the gate hermetic across reruns; the
    parent pytest invocation that triggers this child process must not
    poison or be poisoned by the child's cache state.
    """
    return subprocess.run(
        [
            sys.executable,
            "-m",
            "pytest",
            "-q",
            "--tb=no",
            "-p",
            "no:cacheprovider",
            *targets,
        ],
        cwd=_REPO_ROOT,
        capture_output=True,
        text=True,
        check=False,
    )


def _parse_failed_node_ids(stdout: str) -> set[str]:
    """Extract the set of FAILED node ids from pytest's ``--tb=no -q`` stdout."""
    return {match.group(1) for match in _FAILED_LINE_RE.finditer(stdout)}


def _parse_error_node_ids(stdout: str) -> set[str]:
    """Extract the set of ERROR node ids from pytest's ``--tb=no -q`` stdout."""
    return {match.group(1) for match in _ERROR_LINE_RE.finditer(stdout)}
# === Tests ===


def test_imp35_baseline_red_registry_has_exactly_four_node_ids() -> None:
    """The baseline-red registry is a frozen four-tuple (Stage 2 u11 lock)."""
    assert len(IMP35_BASELINE_RED_NODE_IDS) == 4
    assert len(set(IMP35_BASELINE_RED_NODE_IDS)) == 4, (
        "IMP-35 baseline-red registry must not contain duplicate node ids; "
        "duplicates would silently weaken the invariance gate."
    )


def test_imp35_baseline_red_registry_node_ids_are_well_formed() -> None:
    """Each baseline-red node id must look like ``tests/<file>.py::<test>``."""
    for node_id in IMP35_BASELINE_RED_NODE_IDS:
        assert node_id.startswith("tests/"), (
            f"IMP-35 baseline-red registry node id {node_id!r} must live "
            "under tests/; registry entries point at repo-rooted node ids."
        )
        assert ".py::" in node_id, (
            f"IMP-35 baseline-red registry node id {node_id!r} must use the "
            "<file>.py::<test_name> pytest node id grammar."
        )


def test_imp35_baseline_red_registry_files_match_area_inventory() -> None:
    """Registry node ids must all live in declared baseline-area files.

    Locks the cross-axis link between :data:`IMP35_BASELINE_RED_NODE_IDS`
    and :data:`IMP35_BASELINE_RED_AREA_FILES`: adding a registry entry
    without expanding the area sweep (or vice versa) is the kind of
    half-wiring that would silently let the gate miss new reds.
    """
    declared_files = set(IMP35_BASELINE_RED_AREA_FILES)
    for node_id in IMP35_BASELINE_RED_NODE_IDS:
        file_part, _, _ = node_id.partition("::")
        assert file_part in declared_files, (
            f"IMP-35 baseline-red registry entry {node_id!r} references "
            f"{file_part!r}, which is not in IMP35_BASELINE_RED_AREA_FILES. "
            "Update both lists together or the area sweep will miss new reds."
        )


def test_imp35_baseline_red_node_ids_resolve_to_collectible_tests() -> None:
    """``pytest --collect-only`` must resolve every baseline-red node id.

    A failure here means a baseline-red test was renamed / deleted /
    moved out from under the gate; the registry must be updated in the
    same commit (or, if the test was fixed, the follow-up issue must
    deregister it).
    """
    result = _run_pytest_collect_only(IMP35_BASELINE_RED_NODE_IDS)
    # ``pytest --collect-only`` exits 0 only when every requested node id
    # is collected. Any non-zero code fails the gate: 2 = run interrupted
    # (e.g. collection errors), 4 = usage error (typically how a node id
    # that is "not found" is reported), 5 = no tests collected.
    assert result.returncode == 0, (
        "pytest --collect-only failed for the IMP-35 baseline-red "
        f"registry (rc={result.returncode}).\n"
        f"STDOUT:\n{result.stdout}\n"
        f"STDERR:\n{result.stderr}"
    )


def test_imp35_baseline_red_invariance_gate_failed_set_matches_registry() -> None:
    """Running pytest on the baseline area must FAIL EXACTLY the registry.

    This is the core invariance contract. If IMP-35 work breaks a 5th
    test in the baseline area, the FAILED set diverges from the registry
    and this gate trips. If IMP-35 accidentally fixes one of the four,
    the FAILED set shrinks below four and this gate also trips, at
    which point the fixed test is removed from the registry (the
    follow-up issue deregisters it) and the gate is re-locked.
    """
    result = _run_pytest_quiet(IMP35_BASELINE_RED_AREA_FILES)
    # The baseline area is currently red: pytest MUST exit non-zero. A
    # zero return code here would mean the baseline magically went green
    # (or the subprocess never reached the failing tests); both branches
    # require human review before the registry is updated.
    assert result.returncode != 0, (
        "IMP-35 baseline-red area is expected to fail (4 known reds). "
        "A clean pytest exit means either the baseline was unexpectedly "
        "fixed (deregister via follow-up issue) or the gate's subprocess "
        "did not reach the failing tests.\n"
        f"STDOUT:\n{result.stdout}\n"
        f"STDERR:\n{result.stderr}"
    )
    failed_ids = _parse_failed_node_ids(result.stdout)
    error_ids = _parse_error_node_ids(result.stdout)
    expected = set(IMP35_BASELINE_RED_NODE_IDS)
    assert error_ids == set(), (
        "IMP-35 baseline-red invariance gate found ERROR-state tests "
        f"in the baseline area (expected zero): {sorted(error_ids)}.\n"
        f"STDOUT:\n{result.stdout}"
    )
    assert failed_ids == expected, (
        "IMP-35 baseline-red invariance gate detected drift between the "
        "registered baseline-red set and the actual pytest FAILED set.\n"
        f"  registered (expected): {sorted(expected)}\n"
        f"  actual (observed):     {sorted(failed_ids)}\n"
        f"  unexpected new reds:   {sorted(failed_ids - expected)}\n"
        f"  unexpectedly green:    {sorted(expected - failed_ids)}\n"
        "If new reds appear above, IMP-35 has silently grown the red "
        "surface (u11 contract violation). If reds are unexpectedly "
        "green, the follow-up issue must deregister them.\n"
        f"STDOUT:\n{result.stdout}"
    )


def test_imp35_baseline_red_invariance_gate_failed_count_is_exactly_four() -> None:
    """Count-only assertion: the baseline area has exactly four FAILED nodes.

    Complements the identity check above. Even if a parser bug or
    output-format change ever weakens the identity check, the bare count
    still catches the "did a new red sneak in?" failure mode.
    """
    result = _run_pytest_quiet(IMP35_BASELINE_RED_AREA_FILES)
    failed_ids = _parse_failed_node_ids(result.stdout)
    assert len(failed_ids) == 4, (
        "IMP-35 baseline-red invariance gate expected exactly 4 FAILED "
        f"node ids in the baseline area; observed {len(failed_ids)}: "
        f"{sorted(failed_ids)}.\n"
        f"STDOUT:\n{result.stdout}"
    )


def test_imp35_baseline_red_invariance_module_has_no_ai_imports() -> None:
    """AI isolation contract: u11 invariance gate must stay pure stdlib.

    Mirrors the structural import lock used by u6 / u7 / u10. The gate
    is deterministic-with-data (subprocess pytest + regex parse); any
    Anthropic SDK import or route through the AI fallback router would
    violate the ``feedback_ai_isolation_contract`` lock.

    The check is AST-based so the assertion bodies (which reference
    forbidden tokens by name) do not self-trigger a string-substring
    false positive.
    """
    forbidden_module_prefix = "anthropic"
    forbidden_attr_substring = "route_ai_fallback"
    module_source = Path(__file__).read_text(encoding="utf-8")
    tree = ast.parse(module_source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # Plain ``import x.y``: check the top-level package name.
            for alias in node.names:
                root = alias.name.split(".", 1)[0]
                assert root != forbidden_module_prefix, (
                    "IMP-35 u11 invariance gate must not import the "
                    f"Anthropic SDK (found ``import {alias.name}``)."
                )
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (``from . import x``) have no module name.
            if node.module is None:
                continue
            root = node.module.split(".", 1)[0]
            assert root != forbidden_module_prefix, (
                "IMP-35 u11 invariance gate must not import from the "
                f"Anthropic SDK (found ``from {node.module} import ...``)."
            )
            for alias in node.names:
                assert forbidden_attr_substring not in alias.name, (
                    "IMP-35 u11 invariance gate must not route through the "
                    "AI fallback router (found "
                    f"``from {node.module} import {alias.name}``)."
                )
        elif isinstance(node, ast.Call):
            # Call sites: bare ``name(...)`` and attribute ``obj.name(...)``.
            func = node.func
            if isinstance(func, ast.Name):
                assert forbidden_attr_substring not in func.id, (
                    "IMP-35 u11 invariance gate must not call into the "
                    f"AI fallback router (found call to ``{func.id}``)."
                )
            elif isinstance(func, ast.Attribute):
                assert forbidden_attr_substring not in func.attr, (
                    "IMP-35 u11 invariance gate must not call into the "
                    f"AI fallback router (found call to ``.{func.attr}``)."
                )