feat(step2+step3): slide-level rich ContentObject trace (IMP-03 #3)

- Add extract_rich_content_objects(normalized_assets, mdx_id) in
  phase_z2_content_extractor.py emitting slide-level rich ContentObjects
  for SPEC v1 §1.2 types: details (popups), image, table
- Extend ContentObject dataclass with optional scope/mdx_id/section_id
  metadata fields (additive, default None — v0 unchanged)
- _stage0_chained_adapter() returns 5-tuple adding normalized_assets
  ({popups, images, tables}); empty on env=OFF / hard fallback
- Step 2 artifact gains additive stage0_normalized_assets nested field
  (env=OFF / fallback → empty lists). Existing 7 fields preserved.
- Step 3 emits root-level rich_content_objects once at slide scope
  with rich_content_objects_enabled / scope / source / disabled_reason /
  skips / invariant_warnings. per_zone list still references v0 only.
- PHASE_Z_STEP3_RICH_OBJECTS_ENABLED env flag, default OFF (canary,
  matches PHASE_Z_STAGE0_ADAPTER_ENABLED / PHASE_Z_B4_*). Enable
  requires flag=1 AND non-empty normalized_assets; otherwise records
  disabled_reason = FLAG_OFF or NO_NORMALIZED_ASSETS.
- transform_table dedup: arrow glyph detection in normalized table
  rows/headers → skip with reason=skipped_transform_table_duplicate.
  v0 _capture_3col_transform_table remains the sole transform_table
  source; generic table only for non-transform tables.
- ID pattern {mdx_id}.{details,image,table}-N (slide-level namespace).
- plan_placement() input unchanged (v0 content_objects only) — rich
  list never feeds placement/region planning in this issue.
- self-test extended with 5 rich extractor cases (popup/image/table
  /arrow-skip/empty); v0 self-test unchanged and still PASS.
- mapper / V4 / composition / Step 6+ / AI/Kei / pipeline_path_connected
  unchanged. trace fidelity only.
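
The extractor body lives in phase_z2_content_extractor.py and is not shown in this view, so the following is only a minimal sketch of the behavior the bullets above describe (slide-level IDs, scope metadata, arrow-glyph dedup). Everything here is an assumption made for illustration: the `RichObject` stand-in for ContentObject, the `headers`/`rows` table shape, the glyph set, 1-based numbering, the shape of a skip entry, and the function name `extract_rich_sketch`.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: illustrative glyph set; the real detector may use a different one.
ARROW_GLYPHS = ("→", "⇒", "➔", "➜")


@dataclass
class RichObject:
    """Hypothetical stand-in for ContentObject with the optional metadata fields."""
    id: str
    type: str
    payload: dict
    scope: Optional[str] = None
    mdx_id: Optional[str] = None
    section_id: Optional[str] = None  # stays None at slide scope


def _has_arrow(table: dict) -> bool:
    # Assumption: a normalized table is a dict with "headers" and "rows".
    cells = list(table.get("headers") or [])
    for row in table.get("rows") or []:
        cells.extend(row)
    return any(glyph in str(cell) for cell in cells for glyph in ARROW_GLYPHS)


def extract_rich_sketch(assets: dict, mdx_id: str):
    """Return (objects, skips) for popups, images, and non-transform tables."""
    objects, skips = [], []
    kinds = (("popups", "details"), ("images", "image"), ("tables", "table"))
    for key, type_name in kinds:
        n = 0  # Assumption: N is 1-based and skipped tables do not consume a number.
        for index, item in enumerate(assets.get(key) or []):
            if type_name == "table" and _has_arrow(item):
                # Left to the v0 transform_table capture; record the skip only.
                skips.append({
                    "type": "table",
                    "index": index,
                    "reason": "skipped_transform_table_duplicate",
                })
                continue
            n += 1
            objects.append(RichObject(
                id=f"{mdx_id}.{type_name}-{n}",
                type=type_name,
                payload=item,
                scope="slide",
                mdx_id=mdx_id,
            ))
    return objects, skips
```

Under these assumptions, a slide with one popup, one arrow table, and one plain table would yield two objects and one skip entry, with the arrow table left to _capture_3col_transform_table.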

env OFF + rich OFF: legacy PASS, no regression
env OFF + rich=1   : disabled_reason=NO_NORMALIZED_ASSETS, rich list empty
env=1   + rich=1   : Step 2 stage0_normalized_assets populated (1 table on
                     MDX 03; list length matches adapter_counts). Step 3 write
                     blocked by inherited IMP-02 composition_planner abort
                     (downstream gap, not IMP-03 scope).
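
The three rows above follow from the gating and the soft invariant check in the Step 3 hunk. As a rough, standalone restatement of that logic (the commit inlines it in run_phase_z2_mvp1; the helper names, the injectable `env` argument, and the example file name are assumptions added here so the rules can be exercised in isolation):

```python
import os
import re
from pathlib import Path
from typing import Mapping, Optional

ASSET_KEYS = ("popups", "images", "tables")
RICH_FLAG = "PHASE_Z_STEP3_RICH_OBJECTS_ENABLED"


def rich_disabled_reason(assets: dict, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return None when rich extraction should run, else the recorded reason."""
    env = os.environ if env is None else env  # env injection is a test convenience
    flag = env.get(RICH_FLAG, "").strip().lower()
    if flag not in {"1", "true", "yes"}:
        return "FLAG_OFF"
    if sum(len(assets.get(key) or []) for key in ASSET_KEYS) == 0:
        return "NO_NORMALIZED_ASSETS"
    return None


def invariant_warnings(adapter_counts: Optional[dict], assets: dict) -> list:
    """Soft check: report count mismatches, never raise."""
    warnings = []
    for key in ASSET_KEYS:
        expected = (adapter_counts or {}).get(key)
        actual = len(assets.get(key) or [])
        if expected is not None and expected != actual:
            warnings.append({
                "field": key,
                "adapter_counts": expected,
                "stage0_normalized_assets_len": actual,
            })
    return warnings


def mdx_id_from_path(mdx_path: Path) -> str:
    """Leading digits of the file stem, zero-padded to two; "00" if none."""
    match = re.match(r"(\d+)", mdx_path.stem)
    return match.group(1).zfill(2) if match else "00"


if __name__ == "__main__":
    empty = {"popups": [], "images": [], "tables": []}
    print(rich_disabled_reason(empty, {}))              # flag unset
    print(rich_disabled_reason(empty, {RICH_FLAG: "1"}))  # flag on, no assets
    print(mdx_id_from_path(Path("3_example.mdx")))      # hypothetical file name
```

Note that the flag accepts "1", "true", or "yes" (case-insensitive), and that a count mismatch only adds an entry to the warnings list rather than failing the step.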

Refs Gitea #3 (IMP-03 A-1 popup/image/table trace)
2026-05-13 01:18:25 +09:00
parent bac13c09c4
commit fc3f7d8826
2 changed files with 346 additions and 14 deletions


@@ -64,7 +64,7 @@ from phase_z2_failure_router import enrich_retry_trace_with_failure_classificati
# trace-only runtime wiring v0: B1 → B4 chain.
# No effect on final.html / mapper / render path. Only debug_zones[i].placement_trace is recorded.
from phase_z2_content_extractor import extract_content_objects
from phase_z2_content_extractor import extract_content_objects, extract_rich_content_objects
from phase_z2_placement_planner import plan_placement
@@ -223,7 +223,7 @@ def _stage0_chained_adapter(
legacy_slide_title: str,
legacy_sections: list[MdxSection],
legacy_footer: Optional[str],
) -> tuple[str, list[MdxSection], Optional[str], dict]:
) -> tuple[str, list[MdxSection], Optional[str], dict, dict]:
"""IMP-02 — chained adapter for Stage 0 normalize → Phase Z Step 2 input.
Chain: mdx_normalizer.normalize_mdx_content + section_parser.extract_major_sections
@@ -233,7 +233,9 @@ def _stage0_chained_adapter(
output with diagnostics indicating disabled. When ON, runs adapter chain; on any
hard contract failure or exception, falls back to legacy and records fallback_reason.
Returns (slide_title, sections, footer, diagnostics).
Returns (slide_title, sections, footer, diagnostics, normalized_assets).
normalized_assets = {"popups": [...], "images": [...], "tables": [...]}
is the IMP-03 Step 3 handoff; all lists are empty when env=OFF or on hard fallback.
"""
diagnostics: dict = {
"enabled": False,
@@ -243,12 +245,14 @@ def _stage0_chained_adapter(
"adapter_counts": None,
"legacy_counts": {"sections": len(legacy_sections)},
}
# IMP-03: Step 3 handoff. All lists are empty when env=OFF or on fallback.
normalized_assets: dict = {"popups": [], "images": [], "tables": []}
raw_flag = os.environ.get("PHASE_Z_STAGE0_ADAPTER_ENABLED", "").strip().lower()
enabled = raw_flag in {"1", "true", "yes"}
diagnostics["enabled"] = enabled
if not enabled:
return legacy_slide_title, legacy_sections, legacy_footer, diagnostics
return legacy_slide_title, legacy_sections, legacy_footer, diagnostics, normalized_assets
try:
# Defer imports — legacy path must not depend on these modules.
@@ -259,12 +263,12 @@ def _stage0_chained_adapter(
normalized = normalize_mdx_content(raw_mdx)
if not isinstance(normalized, dict) or not isinstance(normalized.get("sections"), list):
diagnostics["fallback_reason"] = "MISSING_INVALID_IDS"
return legacy_slide_title, legacy_sections, legacy_footer, diagnostics
return legacy_slide_title, legacy_sections, legacy_footer, diagnostics, normalized_assets
majors = extract_major_sections(normalized["sections"])
if not majors:
diagnostics["fallback_reason"] = "NO_USABLE_SECTIONS"
return legacy_slide_title, legacy_sections, legacy_footer, diagnostics
return legacy_slide_title, legacy_sections, legacy_footer, diagnostics, normalized_assets
adapter_title = (normalized.get("title") or "").strip() or legacy_slide_title
conclusion = extract_conclusion_text(raw_mdx)
@@ -304,10 +308,10 @@ def _stage0_chained_adapter(
if section_num <= 0:
diagnostics["fallback_reason"] = "NON_POSITIVE_SECTION_NUM"
return legacy_slide_title, legacy_sections, legacy_footer, diagnostics
return legacy_slide_title, legacy_sections, legacy_footer, diagnostics, normalized_assets
if section_num in used_nums:
diagnostics["fallback_reason"] = "DUPLICATE_IDS"
return legacy_slide_title, legacy_sections, legacy_footer, diagnostics
return legacy_slide_title, legacy_sections, legacy_footer, diagnostics, normalized_assets
used_nums.add(section_num)
diagnostics["id_reconstruction_log"].append({
@@ -337,12 +341,19 @@ def _stage0_chained_adapter(
"footer_match": adapter_footer == legacy_footer,
}
diagnostics["used"] = True
return adapter_title, adapter_sections, adapter_footer, diagnostics
# IMP-03 — populate Step 3 handoff (success path only).
# All fallback paths leave normalized_assets as empty lists (defined at fn top).
normalized_assets = {
"popups": normalized.get("popups", []) or [],
"images": normalized.get("images", []) or [],
"tables": normalized.get("tables", []) or [],
}
return adapter_title, adapter_sections, adapter_footer, diagnostics, normalized_assets
except Exception as exc: # noqa: BLE001 — adapter must never break legacy path
diagnostics["fallback_reason"] = "ADAPTER_EXCEPTION"
diagnostics["exception"] = repr(exc)
return legacy_slide_title, legacy_sections, legacy_footer, diagnostics
return legacy_slide_title, legacy_sections, legacy_footer, diagnostics, normalized_assets
# ─── V4 lookup ──────────────────────────────────────────────────
@@ -1480,7 +1491,14 @@ def run_phase_z2_mvp1(
# (mdx_normalizer + section_parser) replaces legacy parse_mdx output;
# on any contract failure or exception, falls back to legacy with
# fallback_reason recorded in stage0_adapter_diagnostics.
slide_title, sections, slide_footer, stage0_adapter_diagnostics = _stage0_chained_adapter(
# IMP-03 — 5-tuple return adds stage0_normalized_assets (Step 3 handoff).
(
slide_title,
sections,
slide_footer,
stage0_adapter_diagnostics,
stage0_normalized_assets,
) = _stage0_chained_adapter(
mdx_path, legacy_slide_title, legacy_sections, legacy_footer,
)
_adapter_tag = (
@@ -1516,6 +1534,10 @@ def run_phase_z2_mvp1(
# IMP-02 — additive only. enabled/used/fallback_reason + id reconstruction
# trace + count diff. Out of scope: V4 / align / composition.
"stage0_adapter_diagnostics": stage0_adapter_diagnostics,
# IMP-03 — Step 3 handoff (slide-level rich asset list).
# All lists are empty when env=OFF or on fallback. consumer = Step 3
# rich extractor (PHASE_Z_STEP3_RICH_OBJECTS_ENABLED canary).
"stage0_normalized_assets": stage0_normalized_assets,
},
step_status="partial",
pipeline_path_connected=True,
@@ -1525,7 +1547,8 @@ def run_phase_z2_mvp1(
"parse_mdx result: title / sections / footer split + raw_content preserved. "
"heading tree not generated, orphan / details detection incomplete (Step 2 ⚠ partial, separate axis). "
"orphans / details fields are schema-locked: even an empty array is a 'detection not performed' marker. "
"stage0_adapter_diagnostics = IMP-02 chained adapter trace (default OFF canary)."
"stage0_adapter_diagnostics = IMP-02 chained adapter trace (default OFF canary). "
"stage0_normalized_assets = IMP-03 Step 3 slide-level handoff (popups/images/tables list)."
),
)
@@ -1891,6 +1914,49 @@ def run_phase_z2_mvp1(
})
# ─── Step 3: Content Object extraction (B1, trace-only) ───
# IMP-03: slide-level rich ContentObject extraction (default OFF canary).
# scope-lock, 16 conditions (Gitea #3):
# - separate function (extract_rich_content_objects); v0 extract_content_objects unchanged
# - slide-level: section_id=None, id=`{mdx_id}.{type}-N`, scope='slide'
# - root-level once (no per-zone duplication)
# - plan_placement() receives only the v0 list (no B4 regression); the rich result is artifact only
# - transform_table dedup: skip when an arrow row is detected
rich_flag = os.environ.get("PHASE_Z_STEP3_RICH_OBJECTS_ENABLED", "").strip().lower()
rich_enabled_flag = rich_flag in {"1", "true", "yes"}
_assets_total = (
len(stage0_normalized_assets.get("popups") or [])
+ len(stage0_normalized_assets.get("images") or [])
+ len(stage0_normalized_assets.get("tables") or [])
)
rich_disabled_reason: Optional[str] = None
if not rich_enabled_flag:
rich_disabled_reason = "FLAG_OFF"
elif _assets_total == 0:
rich_disabled_reason = "NO_NORMALIZED_ASSETS"
rich_objects: list = []
rich_skips: list = []
if rich_disabled_reason is None:
mdx_num_match = re.match(r"(\d+)", mdx_path.stem)
rich_mdx_id = mdx_num_match.group(1).zfill(2) if mdx_num_match else "00"
rich_objects, rich_skips = extract_rich_content_objects(
stage0_normalized_assets, mdx_id=rich_mdx_id,
)
# Count/list invariant check (IMP-02 ↔ IMP-03 chain) — soft warning, no fail.
invariant_warnings: list[dict] = []
_adapter_counts = (stage0_adapter_diagnostics or {}).get("adapter_counts") or {}
if _adapter_counts:
for key in ("popups", "images", "tables"):
expected = _adapter_counts.get(key)
actual = len(stage0_normalized_assets.get(key) or [])
if expected is not None and expected != actual:
invariant_warnings.append({
"field": key,
"adapter_counts": expected,
"stage0_normalized_assets_len": actual,
})
_write_step_artifact(
run_dir, 3, "content_objects",
data={
@@ -1902,12 +1968,25 @@ def run_phase_z2_mvp1(
}
for dz in debug_zones
],
# IMP-03 — slide-level rich trace (additive, trace-only).
"rich_content_objects": [asdict(o) for o in rich_objects],
"rich_content_objects_enabled": rich_disabled_reason is None,
"rich_content_objects_scope": "slide",
"rich_content_objects_source": "stage0_normalized_assets",
"rich_content_objects_disabled_reason": rich_disabled_reason,
"rich_content_objects_skips": rich_skips,
"rich_content_objects_invariant_warnings": invariant_warnings,
},
step_status="trace-only",
pipeline_path_connected=False,
inputs=["step02_normalized.json"],
outputs=["step03_content_objects.json"],
note="Currently recorded as trace, but does not build the render payload directly. mapper.py parses the MDX directly on its own.",
note=(
"Currently recorded as trace, but does not build the render payload directly. "
"mapper.py parses the MDX directly on its own. "
"IMP-03 rich_content_objects = slide-level popup/image/table trace "
"(PHASE_Z_STEP3_RICH_OBJECTS_ENABLED canary, default OFF)."
),
)
# ─── Step 4: Section Internal Composition (B2, trace-only) ───