- Add extract_rich_content_objects(normalized_assets, mdx_id) in
phase_z2_content_extractor.py emitting slide-level rich ContentObjects
for SPEC v1 §1.2 types: details (popups), image, table
- Extend ContentObject dataclass with optional scope/mdx_id/section_id
metadata fields (additive, default None — v0 unchanged)
- _stage0_chained_adapter() returns 5-tuple adding normalized_assets
({popups, images, tables}); empty on env=OFF / hard fallback
- Step 2 artifact gains additive stage0_normalized_assets nested field
(env=OFF / fallback → empty lists). Existing 7 fields preserved.
- Step 3 emits root-level rich_content_objects once at slide scope
with rich_content_objects_enabled / scope / source / disabled_reason /
skips / invariant_warnings. per_zone list still references v0 only.
- PHASE_Z_STEP3_RICH_OBJECTS_ENABLED env flag, default OFF (canary,
matches PHASE_Z_STAGE0_ADAPTER_ENABLED / PHASE_Z_B4_*). Enable
requires flag=1 AND non-empty normalized_assets; otherwise records
disabled_reason = FLAG_OFF or NO_NORMALIZED_ASSETS.
- transform_table dedup: arrow glyph detection in normalized table
rows/headers → skip with reason=skipped_transform_table_duplicate.
v0 _capture_3col_transform_table remains the sole transform_table
source; generic table only for non-transform tables.
- ID pattern {mdx_id}.{details,image,table}-N (slide-level namespace).
- plan_placement() input unchanged (v0 content_objects only) — rich
list never feeds placement/region planning in this issue.
- self-test extended with 5 rich extractor cases (popup / image / table /
arrow-skip / empty); v0 self-test unchanged and still PASS.
- mapper / V4 / composition / Step 6+ / AI/Kei / pipeline_path_connected
unchanged. trace fidelity only.
env OFF + rich OFF: legacy PASS, no regression
env OFF + rich=1  : disabled_reason=NO_NORMALIZED_ASSETS, rich list empty
env=1 + rich=1    : Step 2 stage0_normalized_assets populated (1 table on
MDX 03, invariant matches adapter_counts). Step 3 write is
blocked by the inherited IMP-02 composition_planner abort
(downstream gap, not in IMP-03 scope).
Refs Gitea #3 (IMP-03 A-1 popup/image/table trace)
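
The enable rule and the three verification rows above can be sketched as a small gate function. This is only an illustration under stated assumptions: `resolve_rich_gate` and its `env` parameter are hypothetical names, the flag is assumed to be read from the process environment, and "non-empty normalized_assets" is assumed to mean that at least one of the three lists has an entry. The real Step 3 code may differ.

```python
import os

# Flag name taken from the commit message; everything else here is illustrative.
FLAG_NAME = "PHASE_Z_STEP3_RICH_OBJECTS_ENABLED"


def resolve_rich_gate(normalized_assets, env=None):
    """Return (enabled, disabled_reason) for the rich-object trace.

    Hypothetical helper: enabled only when the flag is "1" AND at least one
    of popups / images / tables is non-empty (assumed reading of "non-empty").
    """
    env = os.environ if env is None else env
    if env.get(FLAG_NAME, "0") != "1":
        return False, "FLAG_OFF"
    assets = normalized_assets or {}
    if not any(assets.get(key) for key in ("popups", "images", "tables")):
        return False, "NO_NORMALIZED_ASSETS"
    return True, None


if __name__ == "__main__":
    empty = {"popups": [], "images": [], "tables": []}
    one_table = {"popups": [], "images": [], "tables": [{"headers": ["a"], "rows": [["1"]]}]}
    print(resolve_rich_gate(one_table, env={}))
    print(resolve_rich_gate(empty, env={FLAG_NAME: "1"}))
    print(resolve_rich_gate(one_table, env={FLAG_NAME: "1"}))
```

Passing `env` explicitly keeps the sketch testable without mutating `os.environ`; the flag is checked first so that an OFF flag is reported even when no assets exist.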
612 lines
24 KiB
Python
"""Phase Z-2 Content Object extractor (B1 v0, dormant module).

Dedicated extractor that satisfies the typed content_object schema of SPEC v1 §1.

v0 minimal scope:
  - Supported types: text_block and transform_table only
    (table / image / diagram / details are excluded).
  - role: always "summary" (v0 default; refining roles is a separate axis).
  - Dormant: not connected to the runtime path (pipeline / composition / mapper untouched).
  - The mapper is not modified; no existing helper is moved, promoted, or copied.
  - transform_table is implemented as a B1 local helper so that the arrow column is preserved.
    (Part of the regex / parsing resembles the mapper helper, but the mapper helper discards
    the arrow. A future helper promotion / unification refactor is a separate axis.)

v0 flow:
  section.raw_content
    → detect a 3-column markdown table (containing an arrow glyph) → transform_table
    → remaining content → text_block (format / bullet_count / has_emphasis analysis)
    → list[ContentObject]

Verification:
  - dormancy: MDX 03 final.html SHA stays canonical (runtime path not connected)
  - correctness: __main__ self-test (1 text_block case + 1 transform_table case)
"""

from __future__ import annotations

import re
from dataclasses import dataclass, field
from typing import Optional

# Honest record of how B1 v0 handles helpers (correcting an earlier report, 2026-04-30):
# - `phase_z2_mapper` is not modified. No existing mapper helper (`_extract_markdown_table`,
#   etc.) is moved, promoted, or copied.
# - However, the SPEC v1 §1.2 schema transform_table.rows = [{from, arrow, to}] is not
#   compatible with the mapper helper output (from/to only, arrow discarded).
# - The transform_table extraction that must preserve the arrow column is therefore
#   implemented separately as this module's layer-agnostic local helper
#   (`_capture_3col_transform_table`).
# - Part of the regex / parsing resembles the mapper helper; a future promotion / unification
#   refactor is a separate axis (a layer-agnostic helper module can be considered once B1
#   is stable).


# ─── ContentObject schema (SPEC v1 §1.1) ────────────────────────


@dataclass
class ContentObject:
    """SPEC v1 §1.1 base schema.

    v0 emits only text_block + transform_table; details / image / table come
    from the IMP-03 rich extractor.

    Fields:
        id                 : id unique within the section (e.g. '03-2.transform-1' / '03-2.text-1')
        type               : "text_block" | "transform_table" | "details" | "image" | "table"
        role               : v0 uses "summary" only (refinement is a separate axis)
        raw_payload        : original markdown (never truncated or altered; source-preservation rule)
        size_estimate      : per-type estimate (line_count / rows, etc.)
        type_specific      : per-type detail (SPEC v1 §1.2)
        source_shape_index : positional index within source_shape (Option 1, optional)
        source_shape_kind  : "top_bullets" | "h3_subsections" | ... (Option 1, optional)
        scope              : "slide" for IMP-03 rich objects; v0 section-level objects leave it None
        mdx_id             : 2-digit MDX id (e.g. '03'), used by slide-level rich objects
        section_id         : section mapping; None for slide-level rich objects
    """

    id: str
    type: str
    role: str
    raw_payload: str
    size_estimate: dict = field(default_factory=dict)
    type_specific: dict = field(default_factory=dict)
    source_shape_index: Optional[int] = None
    source_shape_kind: Optional[str] = None
    scope: Optional[str] = None
    mdx_id: Optional[str] = None
    section_id: Optional[str] = None


# ─── Transform table extraction ─────────────────────────────────


_ARROW_GLYPHS = ("➜", "➠", "→", "->", "=>")

_TABLE_PATTERN = re.compile(
    r"(^[ \t]*\|[^\n]+\|\n[ \t]*\|[\s\-:|]+\|\n(?:[ \t]*\|[^\n]+\|\n?)+)",
    re.MULTILINE,
)


def _capture_3col_transform_table(content: str) -> tuple[dict | None, str]:
    """Capture (from / arrow / to) from a 3-column markdown table → transform_table.

    This function is the layer-agnostic extractor helper of B1 v0. Part of its
    regex / parsing resembles the mapper's `_extract_markdown_table`, but the
    mapper helper discards the arrow column (it extracts from/to only), so it
    cannot directly satisfy the SPEC v1 §1.2 schema
    `transform_table.rows = [{from, arrow, to}]`. Because the arrow column must
    be preserved, the logic is implemented separately inside this module and
    the mapper stays unmodified.

    A future helper promotion / unification refactor is a separate axis: once
    B1 is stable, merging with the mapper into a layer-agnostic helper module
    can be considered.

    The table counts as a transform only if the arrow column contains an arrow glyph.

    Returns:
        ({"type_specific": ..., "raw_payload": table markdown}, content_without_table),
        or (None, original_content) when no transform pattern is detected.
    """
    m = _TABLE_PATTERN.search(content)
    if not m:
        return None, content

    raw_lines = [r.strip() for r in m.group(1).strip().splitlines() if r.strip()]
    if len(raw_lines) < 3:  # header + separator + ≥1 data row
        return None, content

    data_rows = raw_lines[2:]  # skip header + separator
    pairs: list[dict] = []
    arrow_glyph = ""
    for r in data_rows:
        cells = [c.strip() for c in r.strip("|").split("|")]
        if len(cells) < 3:
            continue
        # Strip **bold** markers from each cell.
        f = re.sub(r"\*\*(.+?)\*\*", r"\1", cells[0])
        a = re.sub(r"\*\*(.+?)\*\*", r"\1", cells[1])
        t = re.sub(r"\*\*(.+?)\*\*", r"\1", cells[2])
        # Record the first arrow glyph seen in the arrow column.
        if not arrow_glyph:
            for g in _ARROW_GLYPHS:
                if g in a:
                    arrow_glyph = g
                    break
        pairs.append({"from": f, "arrow": a, "to": t})

    if not pairs:
        return None, content

    # Verify this is a transform: an arrow glyph must appear in at least one row.
    has_arrow = any(any(g in p["arrow"] for g in _ARROW_GLYPHS) for p in pairs)
    if not has_arrow:
        return None, content

    type_specific = {
        "pair_count": len(pairs),
        "arrow_glyph": arrow_glyph,
        "rows": pairs,
    }
    raw_table = m.group(1)
    remaining = content[: m.start()] + content[m.end() :]
    return ({"type_specific": type_specific, "raw_payload": raw_table}, remaining)


# ─── Text block extraction ──────────────────────────────────────


def _detect_text_block_specific(content: str) -> tuple[dict, int]:
    """Extract type_specific + line_count for a text_block.

    Format decision:
      - no top-level bullet → paragraph
      - top-level bullets, no nested bullets → bullet_list
      - top-level + nested bullets → nested_list

    Returns:
        (type_specific dict, line_count)
    """
    lines = content.splitlines()

    top_bullets = sum(1 for line in lines if re.match(r"^[\*\-]\s", line))
    nested_bullets = sum(1 for line in lines if re.match(r"^\s+[\*\-]\s", line))

    # max_indent_level, counted in units of 2-space indents
    max_indent = 0
    for line in lines:
        mm = re.match(r"^( *)[\*\-]\s", line)
        if mm:
            level = len(mm.group(1)) // 2
            max_indent = max(max_indent, level)

    if top_bullets == 0:
        fmt = "paragraph"
    elif nested_bullets > 0:
        fmt = "nested_list"
    else:
        fmt = "bullet_list"

    # **bold** or *italic* anywhere in the content
    has_emphasis = bool(
        re.search(r"\*\*[^*\n]+\*\*", content)
        or re.search(r"(?<!\*)\*[^*\n]+\*(?!\*)", content)
    )

    line_count = sum(1 for line in lines if line.strip())

    type_specific = {
        "format": fmt,
        "bullet_count": top_bullets,
        "max_indent_level": max_indent,
        "has_emphasis": has_emphasis,
    }
    return type_specific, line_count


# ─── Public entry ───────────────────────────────────────────────


def extract_content_objects(section, source_shape: Optional[str] = None) -> list[ContentObject]:
    """MDX section.raw_content → typed content_object list (SPEC v1 §1).

    v0 minimal:
      - 1 section → 1 to 2 ContentObjects (transform_table + text_block, or text_block only)
      - role = "summary" for all (v0 default)
      - unsupported types (table / image / diagram / details) are ignored (separate axis)
      - the source text (raw_payload) is never truncated or altered (source-preservation rule)

    Option 1 (source_shape-aware):
      - source_shape="top_bullets": split raw_content into N units with mapper.split_source,
        then emit one text_block ContentObject per unit with source_shape_index=i and
        source_shape_kind="top_bullets"
      - source_shape=None or an unsupported value (h3_subsections, etc.): existing legacy behavior

    Args:
        section      : MdxSection-like object (needs section_id and raw_content fields)
        source_shape : "top_bullets" selects the source_shape-aware branch; None means legacy.

    Returns:
        list[ContentObject]: 0 to 2 objects on the legacy path, N (bullet count) for top_bullets
    """
    content = section.raw_content
    section_id = section.section_id

    if source_shape == "top_bullets":
        # Imported lazily so the legacy path does not depend on the mapper module.
        from phase_z2_mapper import split_source

        units = split_source("top_bullets", content)
        objects: list[ContentObject] = []
        for i, unit in enumerate(units):
            unit_text = unit if isinstance(unit, str) else str(unit)
            if not unit_text.strip():
                continue
            text_specific, line_count = _detect_text_block_specific(unit_text)
            objects.append(
                ContentObject(
                    id=f"{section_id}.text-{i + 1}",
                    type="text_block",
                    role="summary",
                    raw_payload=unit_text.strip(),
                    size_estimate={"line_count": line_count},
                    type_specific=text_specific,
                    source_shape_index=i,
                    source_shape_kind="top_bullets",
                )
            )
        return objects

    # legacy path (source_shape=None or an unsupported value)
    objects: list[ContentObject] = []

    # 1. Try to extract a transform_table (3 columns with an arrow)
    transform_result, remaining = _capture_3col_transform_table(content)
    if transform_result is not None:
        objects.append(
            ContentObject(
                id=f"{section_id}.transform-1",
                type="transform_table",
                role="summary",
                raw_payload=transform_result["raw_payload"],
                size_estimate={"rows": transform_result["type_specific"]["pair_count"]},
                type_specific=transform_result["type_specific"],
            )
        )

    # 2. Extract the text_block (content left after the transform, or everything if none)
    text_remainder = remaining if transform_result is not None else content
    if text_remainder.strip():
        text_specific, line_count = _detect_text_block_specific(text_remainder)
        objects.append(
            ContentObject(
                id=f"{section_id}.text-1",
                type="text_block",
                role="summary",
                raw_payload=text_remainder.strip(),
                size_estimate={"line_count": line_count},
                type_specific=text_specific,
            )
        )

    return objects


# ─── IMP-03 (Step 3): rich ContentObject extractor (slide-level) ─

# scope-lock, 16 conditions (Gitea #3):
# - Add the 3 SPEC v1 §1.2 types table / image / details (diagram excluded)
# - Separate function: the v0 `extract_content_objects` signature/behavior is untouched
# - Slide-level attribution: no section param, id = `{mdx_id}.{type}-N`,
#   ContentObject.scope='slide' / mdx_id set to the MDX id / section_id=None
# - transform_table dedup: skip when an arrow row is detected (v0 is the sole source)
# - Asset row shape contract (mdx_normalizer is the source of truth):
#     popup = {title:str, content:str}
#     image = {alt:str, path:str}
#     table = {headers:list[str], rows:list[list[str]]}
# - Not connected to the render path: Step 3 artifact trace only
# - plan_placement() receives the v0 list only (no B4 regression)


def _looks_like_transform_table(table: dict) -> bool:
    """Detect whether a normalize_mdx_content table has the AS-IS / arrow / TO-BE structure.

    If an arrow glyph appears at least once in any cell of any row or header,
    the table is classified as a transform (handled by v0).

    Args:
        table : {"headers": list[str], "rows": list[list[str]]}
    Returns:
        True  = transform_table candidate (the rich extractor skips it)
        False = generic table
    """
    rows = table.get("rows") or []
    for row in rows:
        for cell in row:
            cell_s = str(cell) if cell is not None else ""
            if any(g in cell_s for g in _ARROW_GLYPHS):
                return True
    headers = table.get("headers") or []
    for h in headers:
        h_s = str(h) if h is not None else ""
        if any(g in h_s for g in _ARROW_GLYPHS):
            return True
    return False


def _reconstruct_markdown_table(headers: list, rows: list) -> str:
    """headers / rows → markdown table string (used for raw_md / raw_payload)."""
    if not headers and not rows:
        return ""
    out_lines: list[str] = []
    if headers:
        out_lines.append("| " + " | ".join(str(h) for h in headers) + " |")
        out_lines.append("|" + "|".join("---" for _ in headers) + "|")
    for row in rows:
        out_lines.append("| " + " | ".join(str(c) for c in row) + " |")
    return "\n".join(out_lines)


def extract_rich_content_objects(
    normalized_assets: Optional[dict],
    mdx_id: str,
) -> tuple[list[ContentObject], list[dict]]:
    """IMP-03: slide-level rich ContentObject extractor.

    Consumes mdx_normalizer's flat popup/image/table lists (via
    `stage0_normalized_assets`) and emits typed ContentObjects with
    slide-level attribution (`scope='slide'`, `section_id=None`).

    transform_table dedup: a table containing an arrow glyph is skipped, so
    v0 `_capture_3col_transform_table()` stays the sole transform_table source.
    Each skip returns a diagnostic entry (reason `skipped_transform_table_duplicate`).

    Args:
        normalized_assets : {popups: [{title, content}], images: [{alt, path}],
                             tables: [{headers, rows}]} or None
        mdx_id            : 2-digit MDX id (e.g. '03')

    Returns:
        (rich_objects, skip_diagnostics)
          rich_objects     : list[ContentObject], slide-level
          skip_diagnostics : list[dict], one entry per skip (index, reason)
    """
    if not normalized_assets:
        return [], []

    out: list[ContentObject] = []
    skips: list[dict] = []

    # details (popups), numbered 1..N
    for i, p in enumerate(normalized_assets.get("popups") or [], start=1):
        title = (p.get("title") or "").strip() if isinstance(p, dict) else ""
        body = (p.get("content") or "").strip() if isinstance(p, dict) else ""
        line_count = body.count("\n") + (1 if body else 0)
        out.append(ContentObject(
            id=f"{mdx_id}.details-{i}",
            type="details",
            role="summary",
            raw_payload=body,
            size_estimate={"line_count": line_count, "bytes": len(body)},
            type_specific={
                "summary": title,
                "body_raw": body,
                "display_hint": "popup",
            },
            scope="slide",
            mdx_id=mdx_id,
            section_id=None,
        ))

    # image
    for i, img in enumerate(normalized_assets.get("images") or [], start=1):
        src = (img.get("path") or "").strip() if isinstance(img, dict) else ""
        alt = (img.get("alt") or "").strip() if isinstance(img, dict) else ""
        out.append(ContentObject(
            id=f"{mdx_id}.image-{i}",
            type="image",
            role="summary",
            raw_payload=src,
            size_estimate={"bytes": len(src)},
            type_specific={
                "src": src,
                "alt": alt,
                "aspect_ratio": None,
                "intrinsic_width_px": None,
                "intrinsic_height_px": None,
            },
            scope="slide",
            mdx_id=mdx_id,
            section_id=None,
        ))

    # table: skipped when an arrow glyph is detected
    for i, t in enumerate(normalized_assets.get("tables") or [], start=1):
        if not isinstance(t, dict):
            skips.append({"index": i, "reason": "invalid_table_shape"})
            continue
        if _looks_like_transform_table(t):
            skips.append({
                "index": i,
                "reason": "skipped_transform_table_duplicate",
                "headers": t.get("headers") or [],
            })
            continue
        headers = t.get("headers") or []
        rows = t.get("rows") or []
        raw_md = _reconstruct_markdown_table(headers, rows)
        out.append(ContentObject(
            id=f"{mdx_id}.table-{i}",
            type="table",
            role="summary",
            raw_payload=raw_md,
            size_estimate={"rows": len(rows), "bytes": len(raw_md)},
            type_specific={
                "rows": len(rows),
                "cols": len(headers),
                "header_present": bool(headers),
                "is_transform": False,
                "raw_md": raw_md,
            },
            scope="slide",
            mdx_id=mdx_id,
            section_id=None,
        ))

    return out, skips


# ─── Self-test (B1 v0 correctness check) ────────────────────────


def _run_self_test():
    """v0 unit test: 1 text_block case + 1 transform_table case.

    Covers scope-lock verification (b) correctness, i.e. extractor accuracy.
    Uses fixed inputs only; MDX 01/02/04 are not used.
    """

    class MockSection:
        def __init__(self, section_id: str, raw_content: str):
            self.section_id = section_id
            self.raw_content = raw_content

    # ─── Test 1 : text_block (nested_list shape, F13 style) ───────
    text_section = MockSection(
        "test-1",
        "* **기술 부족**\n"
        "  * 디지털 도구 미숙\n"
        "  * BIM 활용 제한\n"
        "* **인력 부족**\n"
        "  * 전문가 부재\n"
        "* **자연 환경**\n"
        "  * 지역적 제약\n",
    )
    objs1 = extract_content_objects(text_section)
    assert len(objs1) == 1, f"text-only section: expected 1 obj, got {len(objs1)}"
    o = objs1[0]
    assert o.type == "text_block", f"expected type=text_block, got {o.type}"
    assert o.role == "summary"
    assert o.id == "test-1.text-1"
    assert o.type_specific["format"] == "nested_list", f"expected format=nested_list, got {o.type_specific['format']}"
    assert o.type_specific["bullet_count"] == 3, f"expected 3 top bullets, got {o.type_specific['bullet_count']}"
    assert o.type_specific["max_indent_level"] >= 1, "nested bullets present, so max_indent must be ≥ 1"
    assert o.type_specific["has_emphasis"] is True, "**bold** present, so has_emphasis=True"
    assert o.size_estimate["line_count"] >= 6
    assert "기술 부족" in o.raw_payload, "source preserved: '기술 부족' must remain"
    print("[OK] Test 1 (text_block) passed.")

    # ─── Test 2 : transform_table (3-col, with arrow) + leftover text ─
    transform_section = MockSection(
        "test-2",
        "**프로세스 변환**\n"
        "\n"
        "| AS-IS | ➜ | TO-BE |\n"
        "|---|---|---|\n"
        "| 도면 중심 | ➜ | BIM 모델 중심 |\n"
        "| 단계별 분리 | ➜ | 통합 협업 |\n"
        "| 사후 검토 | ➜ | 실시간 검증 |\n"
        "\n"
        "추가 설명 : 위 변환이 핵심.\n",
    )
    objs2 = extract_content_objects(transform_section)
    assert len(objs2) == 2, f"transform+text: expected 2 objs, got {len(objs2)}"

    # transform_table checks
    t = objs2[0]
    assert t.type == "transform_table", f"expected first obj=transform_table, got {t.type}"
    assert t.role == "summary"
    assert t.id == "test-2.transform-1"
    assert t.type_specific["pair_count"] == 3, f"expected pair_count=3, got {t.type_specific['pair_count']}"
    assert t.type_specific["arrow_glyph"] == "➜", f"expected arrow_glyph=➜, got {t.type_specific['arrow_glyph']}"
    assert len(t.type_specific["rows"]) == 3
    assert t.type_specific["rows"][0]["from"] == "도면 중심"
    assert t.type_specific["rows"][0]["to"] == "BIM 모델 중심"
    assert t.size_estimate["rows"] == 3
    assert "도면 중심" in t.raw_payload, "raw_payload must keep the original table"

    # text_block checks (content left after removing the transform)
    tb = objs2[1]
    assert tb.type == "text_block", f"expected second obj=text_block, got {tb.type}"
    assert tb.id == "test-2.text-1"
    assert "프로세스 변환" in tb.raw_payload, "text before the transform must be preserved: '프로세스 변환'"
    assert "추가 설명" in tb.raw_payload, "text after the transform must be preserved: '추가 설명'"
    print("[OK] Test 2 (transform_table + text_block) passed.")

    print("\n=== B1 v0 self-test PASS ===")


def _run_rich_self_test():
    """IMP-03 (Step 3): rich ContentObject extractor self-test, 5 cases.

    cases:
      1. popup → details ContentObject
      2. image → image ContentObject
      3. table (non-transform) → table ContentObject
      4. table (arrow) → skip (transform_table dedup)
      5. empty normalized_assets (None / {}) → empty result
    """

    # ─── Test 1 : popup → details ───
    assets1 = {
        "popups": [{"title": "F13 안", "content": "정책 사례 정리 ...\n2 번째 줄"}],
        "images": [],
        "tables": [],
    }
    rich1, skips1 = extract_rich_content_objects(assets1, mdx_id="03")
    assert len(rich1) == 1 and not skips1, f"popup → 1 obj, got rich={len(rich1)} skips={len(skips1)}"
    o = rich1[0]
    assert o.id == "03.details-1"
    assert o.type == "details" and o.role == "summary"
    assert o.scope == "slide" and o.mdx_id == "03" and o.section_id is None
    assert o.type_specific["summary"] == "F13 안"
    assert o.type_specific["display_hint"] == "popup"
    assert "정책 사례" in o.type_specific["body_raw"]
    print("[OK] Rich Test 1 (popup → details) passed.")

    # ─── Test 2 : image ───
    assets2 = {
        "popups": [],
        "images": [{"alt": "BIM 모델", "path": "img/bim.png"}],
        "tables": [],
    }
    rich2, skips2 = extract_rich_content_objects(assets2, mdx_id="03")
    assert len(rich2) == 1 and not skips2
    o = rich2[0]
    assert o.id == "03.image-1"
    assert o.type == "image" and o.scope == "slide"
    assert o.type_specific["src"] == "img/bim.png"
    assert o.type_specific["alt"] == "BIM 모델"
    assert o.type_specific["aspect_ratio"] is None
    print("[OK] Rich Test 2 (image) passed.")

    # ─── Test 3 : non-transform table ───
    assets3 = {
        "popups": [],
        "images": [],
        "tables": [{"headers": ["분류", "내용"], "rows": [["기술", "BIM"], ["인력", "전문가"]]}],
    }
    rich3, skips3 = extract_rich_content_objects(assets3, mdx_id="03")
    assert len(rich3) == 1 and not skips3
    o = rich3[0]
    assert o.id == "03.table-1"
    assert o.type == "table" and o.scope == "slide"
    assert o.type_specific["rows"] == 2 and o.type_specific["cols"] == 2
    assert o.type_specific["header_present"] is True
    assert o.type_specific["is_transform"] is False
    assert "기술" in o.type_specific["raw_md"]
    print("[OK] Rich Test 3 (non-transform table) passed.")

    # ─── Test 4 : arrow transform table → skip ───
    assets4 = {
        "popups": [],
        "images": [],
        "tables": [{"headers": ["AS-IS", "➜", "TO-BE"],
                    "rows": [["도면 중심", "➜", "BIM 중심"]]}],
    }
    rich4, skips4 = extract_rich_content_objects(assets4, mdx_id="03")
    assert len(rich4) == 0, f"arrow table: expected 0 rich objs, got {len(rich4)}"
    assert len(skips4) == 1
    assert skips4[0]["reason"] == "skipped_transform_table_duplicate"
    print("[OK] Rich Test 4 (arrow table → skip) passed.")

    # ─── Test 5 : empty normalized_assets → empty ───
    rich5, skips5 = extract_rich_content_objects(None, mdx_id="03")
    assert rich5 == [] and skips5 == []
    rich6, skips6 = extract_rich_content_objects({}, mdx_id="03")
    assert rich6 == [] and skips6 == []
    print("[OK] Rich Test 5 (empty) passed.")

    print("\n=== IMP-03 rich extractor self-test PASS ===")


if __name__ == "__main__":
    _run_self_test()
    _run_rich_self_test()