feat(#64): IMP-35 details_popup_escalation u11 baseline-red invariance gate
Add a test-only invariance gate that locks the pre-existing four-test red
baseline so IMP-35 cannot silently grow the red surface while in-flight.
u11 does NOT fix the four reds — Stage 2 follow_up_candidates tracks the
actual repair as a separate issue. u1~u10 production work remains in the
worktree and is explicitly out of this commit per Stage 3 R7 carve-out.
Frozen registry (IMP35_BASELINE_RED_NODE_IDS, set semantics):
1. tests/test_imp47b_step12_ai_wiring.py
::test_mixed_units_classified_by_route_and_provisional_flag
2. tests/test_imp47b_step12_ai_wiring.py
::test_reject_provisional_unit_reaches_router_short_circuit
3. tests/test_imp47b_step12_ai_wiring.py
::test_step12_ai_repair_artifact_writes_json_serialisable_records
4. tests/test_phase_z2_ai_fallback_config.py
::test_ai_fallback_master_flag_default_off
Gate semantics (subprocess pytest, set comparison):
- All 4 node ids resolve to collectible pytest items
(rename / delete is caught up front).
- Broader baseline-area sweep across the two registry files yields
EXACTLY 4 FAILED and 0 ERROR, with FAILED set ≡ registry.
- A new red in the baseline area flips count above 4 OR introduces a
FAILED id outside the registry; either branch fails the gate.
- Cross-lock test ensures registry node ids cannot point outside the
declared area-files inventory.
AI isolation contract (feedback_ai_isolation_contract):
Gate body uses stdlib only (subprocess + re + ast). An AST self-verify
test rejects `anthropic` imports and `route_ai_fallback` references in
this file, structurally preventing AI routing inside the gate.
Stage 4 verification (HEAD c1df656 pre-commit):
pytest -q tests/phase_z2/test_imp35_baseline_red_invariance.py
→ 7 passed in 15.26s.
Baseline area sweep
(tests/test_imp47b_step12_ai_wiring.py +
tests/test_phase_z2_ai_fallback_config.py)
→ 4 failed / 6 passed / 0 errors; FAILED set ≡ registry (identity).
pytest --collect-only on the 4 registered node ids → all 4 resolve.
py_compile clean. Codex R1 = YES (independent verify).
Guardrails honored:
- Scope-locked: test-only file; zero production code in this commit.
- 1 commit = 1 decision unit (u11 only).
- No hardcoding: registry = Stage 2 contract frozen tuple, not
sample-specific literal; gate body has zero magic constants.
- AI isolation: stdlib-only gate, AST self-verify locks isolation.
- baseline-red 4 body repair = separate follow-up issue, not u11 scope.
source_comment_ids: Stage 1 problem-review; Stage 2 plan R2 + Codex R2
YES; Stage 3 Claude #30 + Codex #31 R7 YES; Stage 4 Claude #32 + Codex
#33 R1 YES.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
339
tests/phase_z2/test_imp35_baseline_red_invariance.py
Normal file
339
tests/phase_z2/test_imp35_baseline_red_invariance.py
Normal file
@@ -0,0 +1,339 @@
|
|||||||
|
"""IMP-35 (#64) u11 — baseline-red invariance gate.
|
||||||
|
|
||||||
|
Stage 2 binding contract (unit u11):
|
||||||
|
IMP-35 inherits a four-test red baseline from prior phases that is
|
||||||
|
explicitly OUT OF SCOPE for this issue:
|
||||||
|
|
||||||
|
1. tests/test_imp47b_step12_ai_wiring.py
|
||||||
|
::test_mixed_units_classified_by_route_and_provisional_flag
|
||||||
|
2. tests/test_imp47b_step12_ai_wiring.py
|
||||||
|
::test_reject_provisional_unit_reaches_router_short_circuit
|
||||||
|
3. tests/test_imp47b_step12_ai_wiring.py
|
||||||
|
::test_step12_ai_repair_artifact_writes_json_serialisable_records
|
||||||
|
4. tests/test_phase_z2_ai_fallback_config.py
|
||||||
|
::test_ai_fallback_master_flag_default_off
|
||||||
|
|
||||||
|
u11 does NOT fix these. u11 LOCKS the count + identity of the
|
||||||
|
baseline-red set so that IMP-35 cannot silently grow the red surface
|
||||||
|
while the issue is in-flight. A follow-up issue (Stage 2 plan
|
||||||
|
`follow_up_candidates`) tracks the actual repair.
|
||||||
|
|
||||||
|
Invariance semantics:
|
||||||
|
- The exact four baseline-red node ids resolve to real, collectible
|
||||||
|
pytest items (a rename / delete is caught up front; the gate cannot
|
||||||
|
be defeated by silently removing the failing test).
|
||||||
|
- Running pytest on the BROADER baseline-area files
|
||||||
|
(``tests/test_imp47b_step12_ai_wiring.py`` +
|
||||||
|
``tests/test_phase_z2_ai_fallback_config.py``) yields EXACTLY four
|
||||||
|
FAILED node ids and zero ERROR node ids; the FAILED set is exactly
|
||||||
|
the documented baseline-red set.
|
||||||
|
- A NEW red introduced by IMP-35 in the baseline area flips the
|
||||||
|
FAILED count above four AND/OR introduces an extra FAILED node id
|
||||||
|
that is not in the baseline set; either branch fails this gate.
|
||||||
|
|
||||||
|
AI isolation contract (`feedback_ai_isolation_contract`):
|
||||||
|
The invariance gate runs pytest in a child process and parses stdout.
|
||||||
|
It must NOT import the Anthropic SDK and must NOT route through
|
||||||
|
``route_ai_fallback``. The structural import test below locks this.
|
||||||
|
|
||||||
|
Stage 2 plan source: Stage 2 exit report u11 — "u11 acknowledges the
|
||||||
|
current four red baseline tests as pre-existing and adds an invariance
|
||||||
|
gate so IMP-35 cannot worsen them."
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import ast
|
||||||
|
import re
|
||||||
|
import subprocess
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
|
||||||
|
# === BASELINE-RED REGISTRY (frozen by Stage 2 u11 contract) ===
|
||||||
|
#
|
||||||
|
# Order is informational only; the gate compares as a set. Each entry
|
||||||
|
# is a fully-qualified pytest node id resolvable from the repo root.
|
||||||
|
IMP35_BASELINE_RED_NODE_IDS: tuple[str, ...] = (
|
||||||
|
"tests/test_imp47b_step12_ai_wiring.py"
|
||||||
|
"::test_mixed_units_classified_by_route_and_provisional_flag",
|
||||||
|
"tests/test_imp47b_step12_ai_wiring.py"
|
||||||
|
"::test_reject_provisional_unit_reaches_router_short_circuit",
|
||||||
|
"tests/test_imp47b_step12_ai_wiring.py"
|
||||||
|
"::test_step12_ai_repair_artifact_writes_json_serialisable_records",
|
||||||
|
"tests/test_phase_z2_ai_fallback_config.py"
|
||||||
|
"::test_ai_fallback_master_flag_default_off",
|
||||||
|
)
|
||||||
|
|
||||||
|
# Files that own the baseline-red set. The "no-new-red in baseline area"
|
||||||
|
# axis runs pytest on this set and checks that ONLY the registry above
|
||||||
|
# fails.
|
||||||
|
IMP35_BASELINE_RED_AREA_FILES: tuple[str, ...] = (
|
||||||
|
"tests/test_imp47b_step12_ai_wiring.py",
|
||||||
|
"tests/test_phase_z2_ai_fallback_config.py",
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
# === Repo root resolution (subprocess CWD anchor) ===
|
||||||
|
|
||||||
|
# tests/phase_z2/<this file>.py -> parents[2] = repo root.
|
||||||
|
_REPO_ROOT: Path = Path(__file__).resolve().parents[2]
|
||||||
|
|
||||||
|
|
||||||
|
# === pytest stdout parsers ===
|
||||||
|
|
||||||
|
# Matches lines like:
|
||||||
|
# FAILED tests/test_imp47b_step12_ai_wiring.py::test_xxx
|
||||||
|
# and:
|
||||||
|
# FAILED tests/test_imp47b_step12_ai_wiring.py::test_xxx - AssertionError: ...
|
||||||
|
# The capture group is the bare node id (no trailing failure detail).
|
||||||
|
_FAILED_LINE_RE = re.compile(r"^FAILED\s+(\S+?)(?:\s+-\s+.*)?$", re.MULTILINE)
|
||||||
|
|
||||||
|
# Matches lines like:
|
||||||
|
# ERROR tests/test_xxx.py::test_yyy
|
||||||
|
_ERROR_LINE_RE = re.compile(r"^ERROR\s+(\S+?)(?:\s+-\s+.*)?$", re.MULTILINE)
|
||||||
|
|
||||||
|
# Matches the pytest tail summary line (sub-second timing field varies):
|
||||||
|
# 4 failed, 6 passed in 2.27s
|
||||||
|
_TAIL_SUMMARY_RE = re.compile(
|
||||||
|
r"^(?P<body>.*?)\s+in\s+\d+(?:\.\d+)?s\s*$", re.MULTILINE
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _run_pytest_collect_only(node_ids: tuple[str, ...]) -> subprocess.CompletedProcess:
|
||||||
|
"""Run ``pytest --collect-only -q`` against the supplied node ids.
|
||||||
|
|
||||||
|
Used to confirm the baseline-red registry resolves to real, currently
|
||||||
|
collectible tests. If a test is renamed / moved / deleted out from
|
||||||
|
under the registry, pytest's collection failure is the signal.
|
||||||
|
"""
|
||||||
|
return subprocess.run(
|
||||||
|
[
|
||||||
|
sys.executable,
|
||||||
|
"-m",
|
||||||
|
"pytest",
|
||||||
|
"--collect-only",
|
||||||
|
"-q",
|
||||||
|
*node_ids,
|
||||||
|
],
|
||||||
|
cwd=_REPO_ROOT,
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
check=False,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _run_pytest_quiet(targets: tuple[str, ...]) -> subprocess.CompletedProcess:
|
||||||
|
"""Run ``pytest -q --tb=no -p no:cacheprovider`` against ``targets``.
|
||||||
|
|
||||||
|
``-p no:cacheprovider`` keeps the gate hermetic across reruns; the
|
||||||
|
parent pytest invocation that triggers this child process must not
|
||||||
|
poison or be poisoned by the child's cache state.
|
||||||
|
"""
|
||||||
|
return subprocess.run(
|
||||||
|
[
|
||||||
|
sys.executable,
|
||||||
|
"-m",
|
||||||
|
"pytest",
|
||||||
|
"-q",
|
||||||
|
"--tb=no",
|
||||||
|
"-p",
|
||||||
|
"no:cacheprovider",
|
||||||
|
*targets,
|
||||||
|
],
|
||||||
|
cwd=_REPO_ROOT,
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
check=False,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _parse_failed_node_ids(stdout: str) -> set[str]:
|
||||||
|
"""Extract the set of FAILED node ids from pytest's ``--tb=no -q`` stdout."""
|
||||||
|
return {match.group(1) for match in _FAILED_LINE_RE.finditer(stdout)}
|
||||||
|
|
||||||
|
|
||||||
|
def _parse_error_node_ids(stdout: str) -> set[str]:
|
||||||
|
"""Extract the set of ERROR node ids from pytest's ``--tb=no -q`` stdout."""
|
||||||
|
return {match.group(1) for match in _ERROR_LINE_RE.finditer(stdout)}
|
||||||
|
|
||||||
|
|
||||||
|
# === Tests ===
|
||||||
|
|
||||||
|
|
||||||
|
def test_imp35_baseline_red_registry_has_exactly_four_node_ids() -> None:
|
||||||
|
"""The baseline-red registry is a frozen four-tuple (Stage 2 u11 lock)."""
|
||||||
|
assert len(IMP35_BASELINE_RED_NODE_IDS) == 4
|
||||||
|
assert len(set(IMP35_BASELINE_RED_NODE_IDS)) == 4, (
|
||||||
|
"IMP-35 baseline-red registry must not contain duplicate node ids; "
|
||||||
|
"duplicates would silently weaken the invariance gate."
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def test_imp35_baseline_red_registry_node_ids_are_well_formed() -> None:
|
||||||
|
"""Each baseline-red node id must look like ``tests/<file>.py::<test>``."""
|
||||||
|
for node_id in IMP35_BASELINE_RED_NODE_IDS:
|
||||||
|
assert node_id.startswith("tests/"), (
|
||||||
|
f"IMP-35 baseline-red registry node id {node_id!r} must live "
|
||||||
|
"under tests/ — registry entries point at repo-rooted node ids."
|
||||||
|
)
|
||||||
|
assert ".py::" in node_id, (
|
||||||
|
f"IMP-35 baseline-red registry node id {node_id!r} must use the "
|
||||||
|
"<file>.py::<test_name> pytest node id grammar."
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def test_imp35_baseline_red_registry_files_match_area_inventory() -> None:
|
||||||
|
"""Registry node ids must all live in declared baseline-area files.
|
||||||
|
|
||||||
|
Locks the cross-axis link between :data:`IMP35_BASELINE_RED_NODE_IDS`
|
||||||
|
and :data:`IMP35_BASELINE_RED_AREA_FILES` — adding a registry entry
|
||||||
|
without expanding the area sweep (or vice versa) is the kind of
|
||||||
|
half-wiring that would silently let the gate miss new reds.
|
||||||
|
"""
|
||||||
|
declared_files = set(IMP35_BASELINE_RED_AREA_FILES)
|
||||||
|
for node_id in IMP35_BASELINE_RED_NODE_IDS:
|
||||||
|
file_part, _, _ = node_id.partition("::")
|
||||||
|
assert file_part in declared_files, (
|
||||||
|
f"IMP-35 baseline-red registry entry {node_id!r} references "
|
||||||
|
f"{file_part!r}, which is not in IMP35_BASELINE_RED_AREA_FILES. "
|
||||||
|
"Update both lists together or the area sweep will miss new reds."
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def test_imp35_baseline_red_node_ids_resolve_to_collectible_tests() -> None:
|
||||||
|
"""``pytest --collect-only`` must resolve every baseline-red node id.
|
||||||
|
|
||||||
|
A failure here means a baseline-red test was renamed / deleted /
|
||||||
|
moved out from under the gate; the registry must be updated in the
|
||||||
|
same commit (or, if the test was fixed, the follow-up issue must
|
||||||
|
deregister it).
|
||||||
|
"""
|
||||||
|
result = _run_pytest_collect_only(IMP35_BASELINE_RED_NODE_IDS)
|
||||||
|
# ``pytest --collect-only`` exits 0 on full collection, 2/4/5 on
|
||||||
|
# collection errors. Exit code 5 = no tests collected ("not found").
|
||||||
|
assert result.returncode in (0,), (
|
||||||
|
"pytest --collect-only failed for the IMP-35 baseline-red "
|
||||||
|
f"registry (rc={result.returncode}).\n"
|
||||||
|
f"STDOUT:\n{result.stdout}\n"
|
||||||
|
f"STDERR:\n{result.stderr}"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def test_imp35_baseline_red_invariance_gate_failed_set_matches_registry() -> None:
|
||||||
|
"""Running pytest on the baseline area must FAIL EXACTLY the registry.
|
||||||
|
|
||||||
|
This is the core invariance contract. If IMP-35 work breaks a 5th
|
||||||
|
test in the baseline area, the FAILED set diverges from the registry
|
||||||
|
and this gate trips. If IMP-35 accidentally fixes one of the four,
|
||||||
|
the FAILED set shrinks below four and this gate also trips — at
|
||||||
|
which point the registry is removed from the failing test (the
|
||||||
|
follow-up issue deregisters it) and the gate is re-locked.
|
||||||
|
"""
|
||||||
|
result = _run_pytest_quiet(IMP35_BASELINE_RED_AREA_FILES)
|
||||||
|
|
||||||
|
# The baseline area is currently red: pytest MUST exit non-zero. A
|
||||||
|
# zero return code here would mean the baseline magically went green
|
||||||
|
# (or the parser missed the failures); both branches require human
|
||||||
|
# review before the registry is updated.
|
||||||
|
assert result.returncode != 0, (
|
||||||
|
"IMP-35 baseline-red area is expected to fail (4 known reds). "
|
||||||
|
"A clean pytest exit means either the baseline was unexpectedly "
|
||||||
|
"fixed (deregister via follow-up issue) or the gate's subprocess "
|
||||||
|
"did not reach the failing tests.\n"
|
||||||
|
f"STDOUT:\n{result.stdout}\n"
|
||||||
|
f"STDERR:\n{result.stderr}"
|
||||||
|
)
|
||||||
|
|
||||||
|
failed_ids = _parse_failed_node_ids(result.stdout)
|
||||||
|
error_ids = _parse_error_node_ids(result.stdout)
|
||||||
|
expected = set(IMP35_BASELINE_RED_NODE_IDS)
|
||||||
|
|
||||||
|
assert error_ids == set(), (
|
||||||
|
"IMP-35 baseline-red invariance gate found ERROR-state tests "
|
||||||
|
f"in the baseline area (expected zero): {sorted(error_ids)}.\n"
|
||||||
|
f"STDOUT:\n{result.stdout}"
|
||||||
|
)
|
||||||
|
|
||||||
|
assert failed_ids == expected, (
|
||||||
|
"IMP-35 baseline-red invariance gate detected drift between the "
|
||||||
|
"registered baseline-red set and the actual pytest FAILED set.\n"
|
||||||
|
f" registered (expected): {sorted(expected)}\n"
|
||||||
|
f" actual (observed): {sorted(failed_ids)}\n"
|
||||||
|
f" unexpected new reds: {sorted(failed_ids - expected)}\n"
|
||||||
|
f" unexpectedly green: {sorted(expected - failed_ids)}\n"
|
||||||
|
"If new reds appear above, IMP-35 has silently grown the red "
|
||||||
|
"surface (u11 contract violation). If reds are unexpectedly "
|
||||||
|
"green, the follow-up issue must deregister them.\n"
|
||||||
|
f"STDOUT:\n{result.stdout}"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def test_imp35_baseline_red_invariance_gate_failed_count_is_exactly_four() -> None:
|
||||||
|
"""Count-only assertion: the baseline area has exactly four FAILED nodes.
|
||||||
|
|
||||||
|
Complements the identity check above. Even if a parser bug or
|
||||||
|
output-format change ever weakens the identity check, the bare count
|
||||||
|
still catches the "did a new red sneak in?" failure mode.
|
||||||
|
"""
|
||||||
|
result = _run_pytest_quiet(IMP35_BASELINE_RED_AREA_FILES)
|
||||||
|
failed_ids = _parse_failed_node_ids(result.stdout)
|
||||||
|
assert len(failed_ids) == 4, (
|
||||||
|
"IMP-35 baseline-red invariance gate expected exactly 4 FAILED "
|
||||||
|
f"node ids in the baseline area; observed {len(failed_ids)}: "
|
||||||
|
f"{sorted(failed_ids)}.\n"
|
||||||
|
f"STDOUT:\n{result.stdout}"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def test_imp35_baseline_red_invariance_module_has_no_ai_imports() -> None:
|
||||||
|
"""AI isolation contract — u11 invariance gate must stay pure stdlib.
|
||||||
|
|
||||||
|
Mirrors the structural import lock used by u6 / u7 / u10. The gate
|
||||||
|
is deterministic-with-data (subprocess pytest + regex parse); any
|
||||||
|
Anthropic SDK import or route through the AI fallback router would
|
||||||
|
violate the ``feedback_ai_isolation_contract`` lock.
|
||||||
|
|
||||||
|
The check is AST-based so the assertion bodies (which reference
|
||||||
|
forbidden tokens by name) do not self-trigger a string-substring
|
||||||
|
false positive.
|
||||||
|
"""
|
||||||
|
forbidden_module_prefix = "anthropic"
|
||||||
|
forbidden_attr_substring = "route_ai_fallback"
|
||||||
|
|
||||||
|
module_source = Path(__file__).read_text(encoding="utf-8")
|
||||||
|
tree = ast.parse(module_source)
|
||||||
|
|
||||||
|
for node in ast.walk(tree):
|
||||||
|
if isinstance(node, ast.Import):
|
||||||
|
for alias in node.names:
|
||||||
|
root = alias.name.split(".", 1)[0]
|
||||||
|
assert root != forbidden_module_prefix, (
|
||||||
|
"IMP-35 u11 invariance gate must not import the "
|
||||||
|
f"Anthropic SDK (found ``import {alias.name}``)."
|
||||||
|
)
|
||||||
|
elif isinstance(node, ast.ImportFrom):
|
||||||
|
if node.module is None:
|
||||||
|
continue
|
||||||
|
root = node.module.split(".", 1)[0]
|
||||||
|
assert root != forbidden_module_prefix, (
|
||||||
|
"IMP-35 u11 invariance gate must not import from the "
|
||||||
|
f"Anthropic SDK (found ``from {node.module} import ...``)."
|
||||||
|
)
|
||||||
|
for alias in node.names:
|
||||||
|
assert forbidden_attr_substring not in alias.name, (
|
||||||
|
"IMP-35 u11 invariance gate must not route through the "
|
||||||
|
"AI fallback router (found "
|
||||||
|
f"``from {node.module} import {alias.name}``)."
|
||||||
|
)
|
||||||
|
elif isinstance(node, ast.Call):
|
||||||
|
func = node.func
|
||||||
|
if isinstance(func, ast.Name):
|
||||||
|
assert forbidden_attr_substring not in func.id, (
|
||||||
|
"IMP-35 u11 invariance gate must not call into the "
|
||||||
|
f"AI fallback router (found call to ``{func.id}``)."
|
||||||
|
)
|
||||||
|
elif isinstance(func, ast.Attribute):
|
||||||
|
assert forbidden_attr_substring not in func.attr, (
|
||||||
|
"IMP-35 u11 invariance gate must not call into the "
|
||||||
|
f"AI fallback router (found call to ``.{func.attr}``)."
|
||||||
|
)
|
||||||
Reference in New Issue
Block a user