diff --git a/docs/architecture/IMP-16-U2-WIRING-DESIGN.md b/docs/architecture/IMP-16-U2-WIRING-DESIGN.md new file mode 100644 index 0000000..23ef253 --- /dev/null +++ b/docs/architecture/IMP-16-U2-WIRING-DESIGN.md @@ -0,0 +1,75 @@ +# IMP-16-U2 — Phase Z verification wiring design (design-only) + +**Status**: design-only contract. **No runtime wiring lands in this issue.** All wiring is gated behind IMP-07 reverse-path activation (B-2 main). When IMP-07 lands, this doc becomes the binding contract for the Step 1 / 2 / 14 / 21 / 22 changes that consume the IMP-16-U1 surface in `src/phase_z2_verification_utils.py`. + +**Source anchors** +- IMP-16 backlog row — [`docs/architecture/PHASE-Z-IMPLEMENTATION-ISSUE-BACKLOG.md`](PHASE-Z-IMPLEMENTATION-ISSUE-BACKLOG.md):67 (priority ↓ low, hard link IMP-07, source §3 H3 Reference Only). +- IMP-07 backlog row — same doc line 51 (status `pending`). +- 22-step pipeline anchor — [`PHASE-Z-PIPELINE-OVERVIEW.md`](PHASE-Z-PIPELINE-OVERVIEW.md) Steps 1 / 2 / 14 / 21 / 22. +- U1 module — `src/phase_z2_verification_utils.py` (u1~u10 ports). +- Phase Q reference H3 (Reference Only — do not import) — `src/content_verifier.py`. + +## Gate (hard block — do not merge wiring before this clears) + +- IMP-07 status MUST be `implemented` and `verified` before any code change listed below lands. +- Repo grep `html_to_slide_mdx | edited_html_to_mdx | reverse_path` MUST return at least one runtime hit in a non-test module under `src/`. +- The reverse-path entry point MUST emit (a) a normalized re-entry MDX string and (b) the upstream generated HTML string, both as deterministic outputs accessible to Step 2 and Step 14 callers. + +## Per-step wiring contract + +### Step 1 — MDX upload (re-entered MDX validation) +- Caller : the reverse-path adapter introduced by IMP-07, immediately after it produces a re-entry MDX. 
+- Surface used : u6 `split_into_sentences` (validate that the reverse-path MDX yields at least one sentence after meta-strip + bullet-marker strip). +- Behavior : if `split_into_sentences(reentry_mdx)` returns an empty list, the reverse-path adapter MUST raise a deterministic input error before Step 2 starts. No silent fallback. No AI call. No content rewrite. +- Trace : `debug.json["step01"]["reentry_sentence_count"]` (additive integer field). + +### Step 2 — MDX normalize (text preservation cross-check) +- Caller : `parse_mdx` / `align_sections_to_v4_granularity` post-normalize hook (added only when the input came through the IMP-07 reverse path; original-upload path is unchanged). +- Surface used : u8 `verify_text_preservation(reentry_mdx, upstream_generated_html, area_name="reentry_mdx_vs_upstream_html")`. +- Threshold : the U1 module default (`_TEXT_PRESERVATION_DEFAULT_THRESHOLD = 0.70`, ported verbatim from Phase Q). Do not redesign in U2. +- Behavior : `VerificationResult.passed == False` → adapter aborts the re-entry with the result's `errors` list surfaced; auto pipeline does NOT silently continue. Per `feedback_auto_pipeline_first`, no `review_required` / `review_queue` is inserted — adapter abort is the deterministic outcome. +- Trace : `debug.json["step02"]["reentry_text_preservation"] = {passed, score, area_name, missing_count}` (additive; missing sentences themselves NOT serialised, per privacy-by-default). + +### Step 14 — Selenium visual runtime check (invented-text guard) +- Caller : the `run_overflow_check` post-render path, ONLY when the run was triggered from the reverse-path re-entry. Original-upload path keeps current Step 14 behavior unchanged (this is NOT an enhancement of Step 14 image/table coverage — that axis belongs to IMP-15). +- Surface used : u9 `detect_invented_text(reentry_mdx, final_html)` against the just-rendered `final.html`. +- Behavior : the returned `list[str]` is purely *telemetry*. 
It does NOT change render outcome and does NOT change `compute_slide_status` (Step 20). The reverse-path may consult the list to decide whether to surface a warning at Step 22 — but auto pipeline does not gate on it (per `feedback_auto_pipeline_first` + AI-isolation contract). +- Trace : `debug.json["step14"]["reentry_invented_text_fragments"] = list[str]` (additive; already truncated by u9's `_INVENTED_TEXT_TRUNCATE_LEN = 80`). + +### Step 21 — Debug / trace recording (additive only) +- Surface used : none (Step 21 consumes the additive fields written by Step 1 / 2 / 14 above). +- Behavior : `write_debug_json` MUST treat the new fields as additive — no rename, no removal, no schema regression of existing keys. Missing fields (original-upload path) MUST be absent rather than null, so downstream consumers can distinguish "original upload" from "reverse-path re-entry". +- Trace contract : the three additive fields above + a single new flag `debug.json["pipeline"]["reverse_path_reentry"] = bool` (the only schema field that gates the existence of the other three). + +### Step 22 — User confirmation / export (surface, no AI) +- Surface used : none directly (Step 22 is UI scope, currently CLI-only — see PHASE-Z-PIPELINE-OVERVIEW Step 22). +- Behavior contract for whoever lands Step 22 UI : Step 22 MAY render the additive Step 2 / Step 14 fields read-only. No write-back. No AI call. No content rewrite. + +## Redesigned frame-contract pattern dict (reserved, NOT delivered in U2) + +- Phase Q `REQUIRED_PATTERNS` (Phase Q reference: `src/content_verifier.py:382`) is `body_bg / core / sidebar / footer` — these are Phase Q *area* names, not Phase Z entities. 
**Values are NOT reused.** +- Phase Z replacement will be keyed on (frame_id, frame_slot_id) per the canonical hierarchy `Slide → Zone → Internal Region → Frame → Frame Slot → Content` ([`PHASE-Z-PIPELINE-OVERVIEW.md`](PHASE-Z-PIPELINE-OVERVIEW.md) §Operating Principles), and will be sourced from `templates/phase_z2/catalog/frame_contracts.yaml` (Step 0 / Step 10). +- **Out of scope for IMP-16-U2.** This belongs to IMP-20 (H2 frame contract validation — same backlog doc line 71). U2 must not ship a pattern dict; U2 must not import or wrap Phase Q `verify_structure` / `verify_area` / `verify_all_areas`. + +## Guardrails (binding) + +- **AI isolation contract** — all wiring above is deterministic. No LLM / Kei / httpx / SSE call on any path. (per `feedback_ai_isolation_contract` + `PZ-1: AI=0 normal`.) +- **No-hardcoding** — U2 ports the algorithm. The only literal values reused are the Phase Q H3 thresholds already lifted to named constants in u7 / u8 / u9. No sample-specific value (MDX 03 / 04 / 05) enters U2. +- **No `src.content_verifier` import** — under any condition. The U1 module is the sole Phase Z surface. +- **No FORBIDDEN_KEI_MEMOS / `generate_with_retry` port** — these are H4 / H5 archive markers and remain out of scope. +- **Schema additive only** — debug.json keys listed above are new; no existing key is renamed, removed, or repurposed. (per `feedback_artifact_status_naming` — final.html is not the same axis as preservation / invented-text telemetry.) +- **Spacing direction** — N/A for this axis (this is verification, not layout). No common CSS / padding / tolerance shrinking is introduced. +- **Status semantics** — Step 20 `compute_slide_status` is NOT changed by U2. Preservation / invented-text fields are *telemetry*; they do not flip `PASS` → `RENDERED_WITH_VISUAL_REGRESSION` on their own. 
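A minimal sketch of the additive trace contract from Step 21, offered as an illustration rather than the binding implementation. The helper name `merge_reentry_trace` is hypothetical and not part of this diff. The sketch assumes the `pipeline.reverse_path_reentry` flag is written on both paths, and it assumes the caller passes the Step 2 summary as a plain dict, since `missing_count` is not a `VerificationResult` field in U1.

```python
from __future__ import annotations

import copy
from typing import Any

# Keys allowed in the Step 2 trace record; missing sentences are never serialised.
_PRESERVATION_KEYS = ("passed", "score", "area_name", "missing_count")


def merge_reentry_trace(
    debug: dict[str, Any],
    *,
    reverse_path_reentry: bool,
    sentence_count: int = 0,
    preservation: dict[str, Any] | None = None,
    invented_fragments: list[str] | None = None,
) -> dict[str, Any]:
    """Return a copy of ``debug`` with the additive re-entry fields merged in.

    Hypothetical helper: existing keys are never renamed or removed, and the
    three step fields are absent (not null) on the original-upload path.
    """
    merged = copy.deepcopy(debug)
    # Assumption: the flag is recorded on both paths so consumers can branch on it.
    merged.setdefault("pipeline", {})["reverse_path_reentry"] = reverse_path_reentry
    if not reverse_path_reentry:
        return merged

    merged.setdefault("step01", {})["reentry_sentence_count"] = sentence_count
    if preservation is not None:
        merged.setdefault("step02", {})["reentry_text_preservation"] = {
            key: preservation[key] for key in _PRESERVATION_KEYS
        }
    if invented_fragments is not None:
        merged.setdefault("step14", {})["reentry_invented_text_fragments"] = list(
            invented_fragments
        )
    return merged
```

Because the helper only adds keys under the flag, reverting the reverse-path call sites leaves `debug.json` in its pre-U2 shape, which is the property the rollback plan relies on.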
+ +## Rollback + +- All changes are additive: the Step 1 input-error path, the Step 2 post-normalize hook, the Step 14 telemetry call, the four new `debug.json` keys. +- Rollback = revert the IMP-07 reverse-path entry's call sites; no schema migration needed because the four debug.json keys are gated on `pipeline.reverse_path_reentry`. + +## Open items deferred until IMP-07 lands + +- Exact module path of the IMP-07 reverse-path adapter (TBD by IMP-07). +- Whether Step 2's preservation cross-check needs a per-section variant or only a whole-MDX variant — depends on whether IMP-07 emits a single re-entry MDX or per-section MDX fragments. +- Whether Step 14's invented-text telemetry should be emitted per `area_name` or only once globally — depends on whether IMP-07's reverse-path produces area-tagged HTML. + +These are NOT resolved here. They are resolved at IMP-07 land time, in a follow-up update to this doc. diff --git a/src/phase_z2_verification_utils.py b/src/phase_z2_verification_utils.py new file mode 100644 index 0000000..e6ba237 --- /dev/null +++ b/src/phase_z2_verification_utils.py @@ -0,0 +1,335 @@ +"""Phase Z2 deterministic verification utilities (IMP-16-U1 port). + +Ports the H3 deterministic subset of src/content_verifier.py into a +Phase Z-owned module so the Phase Z pipeline never imports the Phase Q +reference-only module (which co-hosts H4/H5 Kei/AI assets). + +Scope: deterministic, pure, no I/O, no LLM call, no httpx/SSE. +Wiring into Step 1/2/14/21/22 is gated behind IMP-07 (see +docs/architecture/IMP-16-U2-WIRING-DESIGN.md when u11 lands). +""" +from __future__ import annotations + +import re +from dataclasses import dataclass, field +from difflib import SequenceMatcher +from html.parser import HTMLParser + + +@dataclass +class VerificationResult: + """Single-axis deterministic verification outcome. 
+ + Mirrors the Phase Q VerificationResult shape so callers ported from + that surface keep their field access; the value semantics are + Phase Z-owned (no Phase Q area defaults baked in). + """ + + passed: bool + area_name: str + checks: dict[str, bool] = field(default_factory=dict) + score: float = 0.0 + errors: list[str] = field(default_factory=list) + warnings: list[str] = field(default_factory=list) + + +class _TextExtractor(HTMLParser): + """Extract visible text only. Skips " + "" + "

visible

" + ) + out = extract_text_from_html(html) + assert "visible" in out + joined = " ".join(out) + assert "color: red" not in joined + assert "keep_out" not in joined + + +def test_extract_drops_whitespace_only_chunks_and_strips_survivors(): + from src.phase_z2_verification_utils import extract_text_from_html + + html = "
\n\n
hello
world\t" + out = extract_text_from_html(html) + assert out == ["hello", "world"] + + +def test_extract_preserves_korean_and_inline_markup_text(): + from src.phase_z2_verification_utils import extract_text_from_html + + html = "

설계 방식의 왜곡

" + out = extract_text_from_html(html) + assert out == ["설계", "방식", "의 왜곡"] + + +def test_extract_empty_input_returns_empty_list(): + from src.phase_z2_verification_utils import extract_text_from_html + + assert extract_text_from_html("") == [] diff --git a/tests/phase_z2/test_pz2_vu_integration.py b/tests/phase_z2/test_pz2_vu_integration.py new file mode 100644 index 0000000..b8953fe --- /dev/null +++ b/tests/phase_z2/test_pz2_vu_integration.py @@ -0,0 +1,106 @@ +"""Tests for IMP-16-U1 unit u10: sample-backed smoke without pipeline import. + +End-to-end smoke of the deterministic chain (extract_text_from_html ∘ +normalize_for_comparison ∘ split_into_sentences ∘ _sentence_matches_html +→ verify_text_preservation / detect_invented_text) on a real +``samples/mdx_batch`` MDX file. Per Stage 2 rationale: smoke coverage +uses the sample but does NOT hardcode a sample-specific pass. + +Also locks the AI-isolation contract for the verification axis: this +test and the production module MUST NOT import orchestrator / +phase_z2_pipeline / Phase Q content_verifier / Kei client. 
+""" +from __future__ import annotations + +import ast +from pathlib import Path + +from src.phase_z2_verification_utils import ( + VerificationResult, + detect_invented_text, + verify_text_preservation, +) + +_REPO_ROOT = Path(__file__).resolve().parents[2] +_SAMPLE_MDX_PATH = _REPO_ROOT / "samples" / "mdx_batch" / "02.mdx" +_FORBIDDEN_IMPORT_ROOTS = ( + "orchestrator", + "src.phase_z2_pipeline", + "src.content_verifier", + "src.kei_client", +) + + +def _module_imports(path: Path) -> set[str]: + tree = ast.parse(path.read_text(encoding="utf-8")) + names: set[str] = set() + for node in ast.walk(tree): + if isinstance(node, ast.Import): + for alias in node.names: + names.add(alias.name) + elif isinstance(node, ast.ImportFrom) and node.module: + names.add(node.module) + return names + + +def test_integration_sample_mdx_exists(): + # Smoke fixture availability gate; explicit so a missing sample + # surfaces as a fixture problem, not a downstream assertion failure. + assert _SAMPLE_MDX_PATH.exists(), f"sample missing: {_SAMPLE_MDX_PATH}" + + +def test_integration_full_chain_runs_on_real_sample(): + # Locks API contract over the full chain on a real MDX: returns a + # VerificationResult, area_name passthrough works, score within + # [0.0, 1.0], and detect_invented_text returns a list. No assertion + # is made about a specific score so the sample is not hardcoded as + # the pipeline's pass rule (Stage 2 u10 rationale). + mdx = _SAMPLE_MDX_PATH.read_text(encoding="utf-8") + html = f"
{mdx}
" + result = verify_text_preservation(mdx, html, "smoke") + assert isinstance(result, VerificationResult) + assert result.area_name == "smoke" + assert 0.0 <= result.score <= 1.0 + assert isinstance(detect_invented_text(mdx, html), list) + + +def test_integration_mirrored_html_passes_default_threshold(): + # When the HTML side mirrors the MDX text verbatim, the deterministic + # preservation check must pass the Phase Q-default threshold (0.70). + # This is the integration-level guarantee for the B-2 reverse path: + # round-tripped HTML that preserves the MDX text must verify. + mdx = _SAMPLE_MDX_PATH.read_text(encoding="utf-8") + html = f"
{mdx}
" + result = verify_text_preservation(mdx, html, "smoke") + assert result.passed is True + + +def test_integration_fabricated_html_flags_invented_text(): + # Locks the hallucination-guard end-to-end: HTML text that has no + # keyword anchor in the source MDX must be flagged. Synthetic + # sentence chosen so its keywords (완전히, 만들어낸, 원본, 등장 …) + # do not appear in samples/mdx_batch/02.mdx. + mdx = _SAMPLE_MDX_PATH.read_text(encoding="utf-8") + fabricated_html = ( + "

완전히 새로 만들어낸 문장으로 원본에는 전혀 등장하지 않는 내용입니다.

" + ) + invented = detect_invented_text(mdx, fabricated_html) + assert isinstance(invented, list) + assert len(invented) >= 1 + + +def test_integration_no_forbidden_imports(): + # AI-isolation + Phase Z scope-lock guard. Production module and + # this test file must not import orchestrator / phase_z2_pipeline / + # Phase Q content_verifier / Kei client. AST scan of the on-disk + # source (not the imported module) so re-exports cannot mask a leak. + for path in ( + _REPO_ROOT / "src" / "phase_z2_verification_utils.py", + Path(__file__).resolve(), + ): + modules = _module_imports(path) + for module in modules: + for forbidden in _FORBIDDEN_IMPORT_ROOTS: + assert not (module == forbidden or module.startswith(forbidden + ".")), ( + f"{path.name} imports forbidden module: {module}" + ) diff --git a/tests/phase_z2/test_pz2_vu_invented.py b/tests/phase_z2/test_pz2_vu_invented.py new file mode 100644 index 0000000..5d6d623 --- /dev/null +++ b/tests/phase_z2/test_pz2_vu_invented.py @@ -0,0 +1,84 @@ +"""Tests for IMP-16-U1 unit u9: ``detect_invented_text``. + +Locks the Phase Z port of the deterministic hallucination guard +(Phase Q reference: ``src/content_verifier.py:276-315``). The function +is pure and composes u2 (extract_text_from_html), u3 +(normalize_for_comparison), and u4 (extract_keywords). No Phase Q +import is exercised. 
+""" +from __future__ import annotations + +from src.phase_z2_verification_utils import ( + _INVENTED_TEXT_ALLOWED_LABELS, + _INVENTED_TEXT_CSS_NUMBER_PATTERN, + _INVENTED_TEXT_KEYWORD_THRESHOLD, + _INVENTED_TEXT_MIN_LENGTH, + _INVENTED_TEXT_TRUNCATE_LEN, + detect_invented_text, +) + + +def test_detect_invented_text_constants_locked() -> None: + """Lock the five named module constants ported from Phase Q literals.""" + assert _INVENTED_TEXT_MIN_LENGTH == 15 + assert _INVENTED_TEXT_ALLOWED_LABELS == frozenset( + {"용어 정의", "핵심 메시지", "상세 비교"} + ) + assert _INVENTED_TEXT_CSS_NUMBER_PATTERN.pattern == r"^[\d\s.,%px#rgb()]+$" + assert _INVENTED_TEXT_KEYWORD_THRESHOLD == 0.4 + assert _INVENTED_TEXT_TRUNCATE_LEN == 80 + + +def test_detect_invented_text_returns_empty_when_html_is_in_mdx() -> None: + """Text whose keywords fully appear in MDX is NOT flagged.""" + mdx = "원본 콘텐츠는 분석에 관한 것입니다." + html = "

원본 콘텐츠는 분석에 관한 것입니다.

" + assert detect_invented_text(mdx, html) == [] + + +def test_detect_invented_text_flags_text_with_low_keyword_overlap() -> None: + """Text whose keywords do not appear in MDX is flagged as invented.""" + mdx = "원본 콘텐츠는 분석에 관한 것입니다." + html = "

완전히 다른 발명된 텍스트가 여기 있습니다 일반적이지 않은

" + result = detect_invented_text(mdx, html) + assert len(result) == 1 + assert "발명된" in result[0] + + +def test_detect_invented_text_skips_short_text() -> None: + """Text shorter than ``min_length`` is not even considered.""" + mdx = "원본 콘텐츠" + html = "

짧은 텍스트

" + assert detect_invented_text(mdx, html) == [] + + +def test_detect_invented_text_skips_allowed_structural_labels() -> None: + """Allowed labels are skipped even when keyword overlap is zero. + + Phase Q default ``min_length=15`` makes the allowed-label gate + unreachable for the bundled labels (all < 15 chars). The Phase Z + port preserves the gate verbatim — exercised here with + ``min_length=0`` so the structural-label short-circuit is + actually observable. + """ + mdx = "원본 콘텐츠" + html = "

용어 정의

핵심 메시지

상세 비교

" + assert detect_invented_text(mdx, html, min_length=0) == [] + + +def test_detect_invented_text_skips_css_number_pattern_fragments() -> None: + """CSS/numeric fragments (e.g. ``100px 200px 300px``) are skipped.""" + mdx = "원본 콘텐츠" + html = "
100px 200px 300px
" + assert detect_invented_text(mdx, html) == [] + + +def test_detect_invented_text_truncates_flagged_value_to_80_chars() -> None: + """A flagged fragment longer than 80 chars is truncated for reporting.""" + mdx = "원본 콘텐츠" + invented = "발명" * 50 + html = f"

{invented}

" + result = detect_invented_text(mdx, html) + assert len(result) == 1 + assert len(result[0]) == 80 + assert result[0] == invented[:80] diff --git a/tests/phase_z2/test_pz2_vu_keywords.py b/tests/phase_z2/test_pz2_vu_keywords.py new file mode 100644 index 0000000..f14bebe --- /dev/null +++ b/tests/phase_z2/test_pz2_vu_keywords.py @@ -0,0 +1,52 @@ +"""Tests for Phase Z2 IMP-16-U1 unit u4: extract_keywords. + +Locks the deterministic surface: 3+ character tokens on the Phase Z H3 +character class, longest-match trailing particle strip with a length>=2 +stem guard, and no Phase Q content_verifier import. +""" +from __future__ import annotations + +from src.phase_z2_verification_utils import _PARTICLES, extract_keywords + + +def test_extract_keywords_drops_short_tokens() -> None: + # "AI" (2 chars) and "X" (1 char) are dropped; "기술" (2 chars) is dropped too. + # "데이터" (3 chars) survives; "분석함" (3 chars) survives. + assert extract_keywords("AI 기술 X 데이터 분석함") == ["데이터", "분석함"] + + +def test_extract_keywords_strips_trailing_particle_when_stem_ge_2() -> None: + # "설계의" (3 chars) → particle "의" stripped, stem "설계" (2 chars) kept. + # "방식은" → particle "은" stripped → "방식". + assert extract_keywords("설계의 방식은") == ["설계", "방식"] + + +def test_extract_keywords_keeps_token_when_stem_would_be_too_short() -> None: + # "에서" guard: a 3-char token whose 2-char suffix is a particle + # but whose stem (1 char) is < 2 must keep the original token. + # "안에서" → suffix "에서" len 2, stem "안" len 1 → guard fires, + # falls through, then next particle "서" is NOT in _PARTICLES, + # so the whole token "안에서" remains. + assert extract_keywords("안에서") == ["안에서"] + + +def test_extract_keywords_longest_match_particle_wins() -> None: + # "_PARTICLES" is sorted longest-first, so "에서" wins over "서"/"에". + # "현장에서" → "에서" stripped → "현장". 
+ assert "에서" in _PARTICLES + assert extract_keywords("현장에서") == ["현장"] + + +def test_extract_keywords_tokenises_korean_alnum_and_parens() -> None: + # The Phase Z H3 character class is [가-힣a-zA-Z0-9()]+. + # "프로젝트(2024)" is one token; "Hello!" splits into "Hello" only. + # Punctuation outside the class acts as a delimiter. + result = extract_keywords("프로젝트(2024) Hello! World123") + assert "프로젝트(2024)" in result + assert "Hello" in result + assert "World123" in result + assert "!" not in "".join(result) + + +def test_extract_keywords_empty_returns_empty() -> None: + assert extract_keywords("") == [] diff --git a/tests/phase_z2/test_pz2_vu_match_helper.py b/tests/phase_z2/test_pz2_vu_match_helper.py new file mode 100644 index 0000000..17442d9 --- /dev/null +++ b/tests/phase_z2/test_pz2_vu_match_helper.py @@ -0,0 +1,66 @@ +"""Tests for IMP-16-U1 unit u7: ``_sentence_matches_html``. + +Locks the Phase Z port of the deterministic per-sentence match +helper (Phase Q reference: inline body of ``verify_text_preservation`` +at src/content_verifier.py:232-251). The helper is pure; no Phase Q +import is exercised. Thresholds are locked as named constants so the +0.6 / 0.65 surface cannot drift silently. +""" +from __future__ import annotations + +from src.phase_z2_verification_utils import ( + _SENTENCE_KEYWORD_MATCH_THRESHOLD, + _SENTENCE_SEQUENCE_MATCH_THRESHOLD, + _sentence_matches_html, +) + + +def test_match_helper_thresholds_locked(): + assert _SENTENCE_KEYWORD_MATCH_THRESHOLD == 0.6 + assert _SENTENCE_SEQUENCE_MATCH_THRESHOLD == 0.65 + + +def test_match_helper_returns_true_when_no_keywords(): + # "AI" tokenises to a single 2-char token which extract_keywords drops + # (len < 3 gate). Empty keyword list -> helper returns True regardless + # of HTML side. Phase Q parity: matched += 1; continue on empty keywords. 
+ assert _sentence_matches_html("AI", "", []) is True + + +def test_match_helper_keyword_ratio_meets_threshold(): + # Sentence "데이터 분석의 핵심" -> keywords = ["데이터", "분석"]: + # "데이터" (len 3, no particle ending) kept; + # "분석의" (len 3, ends with "의", stem "분석" len 2) -> "분석" kept; + # "핵심" (len 2 < 3) dropped. + # Both keywords are substrings of the html_combined string, so + # kw_ratio = 2 / 2 = 1.0 >= 0.6 -> True via keyword axis. + assert _sentence_matches_html( + "데이터 분석의 핵심", + "데이터 분석을 수행합니다", + ["데이터 분석을 수행합니다"], + ) is True + + +def test_match_helper_sequence_ratio_fallback(): + # Sentence "데이터 분석" -> keywords = ["데이터"] (the 2-char "분석" + # is dropped by the len<3 gate). "데이터" is NOT in html_combined, + # so kw_ratio = 0. The SequenceMatcher fallback compares the + # normalized sentence against each normalized html_text; the second + # fragment matches verbatim, yielding ratio 1.0 >= 0.65 -> True. + assert _sentence_matches_html( + "데이터 분석", + "abc xyz", + ["abc xyz", "데이터 분석"], + ) is True + + +def test_match_helper_below_both_thresholds_returns_false(): + # No keyword overlap and no high-similarity html fragment: + # kw_ratio = 0, best SequenceMatcher ratio is far below 0.65. + # Helper must return False so verify_text_preservation (u8) + # records the sentence as missing. + assert _sentence_matches_html( + "데이터 분석", + "abc xyz", + ["abc xyz"], + ) is False diff --git a/tests/phase_z2/test_pz2_vu_meta_strip.py b/tests/phase_z2/test_pz2_vu_meta_strip.py new file mode 100644 index 0000000..9ce1308 --- /dev/null +++ b/tests/phase_z2/test_pz2_vu_meta_strip.py @@ -0,0 +1,73 @@ +"""u5 — meta-line stripping surface (IMP-16-U1). 
+ +Locks the deterministic meta-line filter contract: + - lines whose stripped form starts with any ``_META_PREFIXES`` entry + are dropped (8 prefix surface); + - lines containing any ``_META_INLINE_FRAGMENTS`` entry are dropped + (3 inline fragment surface); + - other lines pass through with original whitespace preserved; + - empty input returns the empty string; + - no import of src.content_verifier. +""" +from __future__ import annotations + + +def test_strip_meta_lines_drops_prefix_lines(): + from src.phase_z2_verification_utils import _META_PREFIXES, strip_meta_lines + + # Exactly the 8-prefix Phase Z surface — locks both content and size. + assert _META_PREFIXES == [ + "제목 라벨:", + "표현 의도:", + "슬라이드 주인공", + "가장 큰 시각적 비중", + "시각적으로", + "간결하게 제기", + "개별 증거로 제시", + "계층적으로 시각화", + ] + text = "제목 라벨: 어떤 제목\n본문 한 줄\n표현 의도: 강조" + assert strip_meta_lines(text) == "본문 한 줄" + + +def test_strip_meta_lines_matches_prefix_on_stripped_line(): + from src.phase_z2_verification_utils import strip_meta_lines + + # Leading whitespace must not protect a meta-prefix line. + text = " 제목 라벨: indented meta\n실제 본문" + assert strip_meta_lines(text) == "실제 본문" + + +def test_strip_meta_lines_drops_inline_fragment_lines(): + from src.phase_z2_verification_utils import ( + _META_INLINE_FRAGMENTS, + strip_meta_lines, + ) + + # Phase Z inline-fragment surface is exactly these three. + assert _META_INLINE_FRAGMENTS == ( + "현상-문제 인과관계", + "상위-하위 포함 관계", + "독립적 나열", + ) + text = ( + "구조: 현상-문제 인과관계 로 설계\n" + "유형: 상위-하위 포함 관계\n" + "패턴: 독립적 나열 형태\n" + "그래서 결론은 한 줄" + ) + assert strip_meta_lines(text) == "그래서 결론은 한 줄" + + +def test_strip_meta_lines_keeps_unrelated_lines_verbatim(): + from src.phase_z2_verification_utils import strip_meta_lines + + # Non-meta lines must pass through with original whitespace preserved. 
+ text = " 본문 한 줄\n\n다른 줄" + assert strip_meta_lines(text) == " 본문 한 줄\n\n다른 줄" + + +def test_strip_meta_lines_empty_input_returns_empty_string(): + from src.phase_z2_verification_utils import strip_meta_lines + + assert strip_meta_lines("") == "" diff --git a/tests/phase_z2/test_pz2_vu_normalize.py b/tests/phase_z2/test_pz2_vu_normalize.py new file mode 100644 index 0000000..f366744 --- /dev/null +++ b/tests/phase_z2/test_pz2_vu_normalize.py @@ -0,0 +1,64 @@ +"""u3 — Korean text normalization surface (IMP-16-U1). + +Locks the deterministic text-normalization contract: + - whitespace runs collapse + strip; + - bullet markers from the Phase Q surface set are removed; + - the small HTML-entity set used by the reverse path is decoded; + - a single trailing 개조식 ending is folded to its 서술형 form; + - particle list is sorted longest-first (matching the Phase Q surface + so downstream keyword stripping is greedy); + - no import of src.content_verifier. +""" +from __future__ import annotations + + +def test_normalize_collapses_whitespace_and_strips(): + from src.phase_z2_verification_utils import normalize_for_comparison + + assert normalize_for_comparison(" hello\n\n world\t") == "hello world" + + +def test_normalize_removes_bullet_markers(): + from src.phase_z2_verification_utils import normalize_for_comparison + + # Each marker from the Phase Q surface set must be stripped. 
+ for marker in ["•", "◦", "·", "-", "▪", "▸", "►"]: + assert normalize_for_comparison(f"{marker} 항목") == "항목" + + +def test_normalize_decodes_html_entities(): + from src.phase_z2_verification_utils import normalize_for_comparison + + text = "A & B <tag>   'q' "d"" + assert normalize_for_comparison(text) == "A & B 'q' \"d\"" + + +def test_normalize_folds_trailing_gaejo_endings(): + from src.phase_z2_verification_utils import normalize_for_comparison + + assert normalize_for_comparison("적용함") == "적용한다" + assert normalize_for_comparison("필요됨") == "필요된다" + assert normalize_for_comparison("값이 있음") == "값이 있다" + assert normalize_for_comparison("자료 없음") == "자료 없다" + assert normalize_for_comparison("결과임") == "결과이다" + assert normalize_for_comparison("적용되었음") == "적용되었다" + assert normalize_for_comparison("적용되었음.") == "적용되었음." # trailing punct blocks fold + + +def test_normalize_only_folds_one_ending_and_only_at_end(): + from src.phase_z2_verification_utils import normalize_for_comparison + + # 'break' after first match: only the suffix is folded, mid-string '함' is left alone. + assert normalize_for_comparison("함수를 적용함") == "함수를 적용한다" + # No fold when the ending is not the last token. + assert normalize_for_comparison("적용함 그리고 종료") == "적용함 그리고 종료" + + +def test_particles_sorted_longest_first(): + from src.phase_z2_verification_utils import _PARTICLES + + lengths = [len(p) for p in _PARTICLES] + assert lengths == sorted(lengths, reverse=True) + # Phase Q surface size guard (no values reused from REQUIRED_PATTERNS; + # this is the Korean-locale particle inventory). + assert "에서" in _PARTICLES and "는" in _PARTICLES diff --git a/tests/phase_z2/test_pz2_vu_preservation.py b/tests/phase_z2/test_pz2_vu_preservation.py new file mode 100644 index 0000000..c6883c4 --- /dev/null +++ b/tests/phase_z2/test_pz2_vu_preservation.py @@ -0,0 +1,119 @@ +"""Tests for IMP-16-U1 unit u8: ``verify_text_preservation``. 
+ +Locks the Phase Z port of the deterministic text-preservation check +(Phase Q reference: ``src/content_verifier.py:206-273``). The function +is pure and composes u2 (extract_text_from_html), u3 +(normalize_for_comparison), u6 (split_into_sentences), and u7 +(_sentence_matches_html). No Phase Q import is exercised. +""" +from __future__ import annotations + +from src.phase_z2_verification_utils import ( + VerificationResult, + _MISSING_SENTENCE_REPORT_LIMIT, + _MISSING_SENTENCE_TRUNCATE_LEN, + _TEXT_PRESERVATION_DEFAULT_THRESHOLD, + verify_text_preservation, +) + + +def test_verify_text_preservation_defaults_locked(): + # Locks the Phase Q caller convention: threshold default = 0.70, + # missing-list report cap = 5, per-item truncate length = 60. + assert _TEXT_PRESERVATION_DEFAULT_THRESHOLD == 0.70 + assert _MISSING_SENTENCE_REPORT_LIMIT == 5 + assert _MISSING_SENTENCE_TRUNCATE_LEN == 60 + + +def test_verify_text_preservation_empty_sentences_returns_passed(): + # MDX that reduces to zero sentences after split_into_sentences + # (e.g. headers only) must return passed=True with score 1.0 and + # an empty errors/warnings surface. Phase Q parity: early return + # before any HTML extraction. + result = verify_text_preservation("# header only", "

anything

", "core") + assert isinstance(result, VerificationResult) + assert result.passed is True + assert result.area_name == "core" + assert result.checks == {"text_preservation": True} + assert result.score == 1.0 + assert result.errors == [] + assert result.warnings == [] + + +def test_verify_text_preservation_full_match_passes(): + # All MDX sentences preserved in HTML -> score 1.0, passed True, + # no warnings (warnings only attached when score < 1.0), no errors. + mdx = "데이터 분석은 핵심 과정입니다. 시각화로 의사 결정을 지원합니다." + html = ( + "

데이터 분석은 핵심 과정입니다.

" + "

시각화로 의사 결정을 지원합니다.

" + ) + result = verify_text_preservation(mdx, html, "body") + assert result.passed is True + assert result.score == 1.0 + assert result.warnings == [] + assert result.errors == [] + + +def test_verify_text_preservation_below_threshold_reports_errors(): + # Only one of two MDX sentences appears in the HTML -> score 0.5, + # below default threshold 0.70 -> passed False, errors list opens + # with the "누락 문장 (1/2):" header followed by quoted missing + # sentences (truncation gate not crossed). + mdx = ( + "데이터 분석은 핵심 과정입니다.\n" + "전혀 다른 문맥의 두 번째 문장입니다." + ) + html = "

데이터 분석은 핵심 과정입니다.

" + result = verify_text_preservation(mdx, html, "core") + assert result.passed is False + assert result.score == 0.5 + assert result.checks == {"text_preservation": False} + assert result.errors[0] == "누락 문장 (1/2):" + assert any("두 번째 문장" in line for line in result.errors[1:]) + assert result.warnings == ["보존율: 50% (1/2 문장)"] + + +def test_verify_text_preservation_truncates_long_missing_sentence(): + # A missing sentence longer than 60 chars must be rendered with + # the "...\"" tail. Phase Z surface lifts the 60 constant to a + # named module value (_MISSING_SENTENCE_TRUNCATE_LEN) so the gate + # is auditable. + long_sentence = "엄청나게 긴 문장이 들어가서 절단 동작을 검증합니다." + ("끝" * 60) + mdx = long_sentence + "." + html = "

관련 없는 문구

" + result = verify_text_preservation(mdx, html, "footer", threshold=0.99) + assert result.passed is False + # Header + at least one missing-line entry; the entry must end with `..."`. + assert len(result.errors) >= 2 + assert result.errors[-1].endswith("...\"") + truncated_body = result.errors[-1].split('"', 2)[1].rstrip(".") + assert len(truncated_body) == _MISSING_SENTENCE_TRUNCATE_LEN + + +def test_verify_text_preservation_caps_missing_report_at_limit(): + # Generate seven MDX-only sentences with no HTML coverage. + # passed=False, errors list = 1 header + at most 5 missing entries + # (_MISSING_SENTENCE_REPORT_LIMIT). The header reports the true + # missing/total counts even though only 5 are surfaced. + mdx_lines = [f"전혀 다른 문맥의 문장 번호 {i} 입니다." for i in range(7)] + mdx = "\n".join(mdx_lines) + html = "

관련 없는 문구

" + result = verify_text_preservation(mdx, html, "core") + assert result.passed is False + assert result.errors[0] == "누락 문장 (7/7):" + assert len(result.errors) == 1 + _MISSING_SENTENCE_REPORT_LIMIT + + +def test_verify_text_preservation_custom_threshold_passes_at_50_percent(): + # Lowering the threshold to 0.50 makes a 50% preservation pass. + mdx = ( + "데이터 분석은 핵심 과정입니다.\n" + "전혀 다른 문맥의 두 번째 문장입니다." + ) + html = "

데이터 분석은 핵심 과정입니다.

" + result = verify_text_preservation(mdx, html, "core", threshold=0.50) + assert result.passed is True + assert result.score == 0.5 + # Score < 1.0 so the 보존율 warning is still attached for trace surface. + assert result.warnings == ["보존율: 50% (1/2 문장)"] diff --git a/tests/phase_z2/test_pz2_vu_sentence_split.py b/tests/phase_z2/test_pz2_vu_sentence_split.py new file mode 100644 index 0000000..1ce84d0 --- /dev/null +++ b/tests/phase_z2/test_pz2_vu_sentence_split.py @@ -0,0 +1,69 @@ +"""Tests for IMP-16-U1 unit u6: split_into_sentences. + +Locks the Phase Z port of the H3 deterministic sentence-splitter +surface (Phase Q reference: src/content_verifier.py:174-199). The +function is deterministic, pure, and composes ``strip_meta_lines``; +no Phase Q import is exercised. +""" +from __future__ import annotations + +from src.phase_z2_verification_utils import ( + _BULLET_MARKER_PATTERN, + _MIN_SENTENCE_LEN, + _SENTENCE_SPLIT_PATTERN, + split_into_sentences, +) + + +def test_split_into_sentences_applies_strip_meta_lines_first(): + text = ( + "제목 라벨: 설계 방식의 왜곡\n" + "본문 첫 문장입니다.\n" + "본문 둘째 문장입니다." + ) + result = split_into_sentences(text) + assert result == ["본문 첫 문장입니다.", "본문 둘째 문장입니다."] + + +def test_split_into_sentences_skips_empty_and_header_lines(): + text = "\n# 대목차\n## 소목차\n실제 본문 문장입니다.\n" + assert split_into_sentences(text) == ["실제 본문 문장입니다."] + + +def test_split_into_sentences_strips_numeric_and_punctuated_markers(): + assert _BULLET_MARKER_PATTERN.match("1. 첫 단계입니다.") + assert _BULLET_MARKER_PATTERN.match("2) 둘째 단계입니다.") + assert _BULLET_MARKER_PATTERN.match("-. 첫 항목입니다.") + assert _BULLET_MARKER_PATTERN.match("•. 둘째 항목입니다.") + text = ( + "1. 첫 단계입니다.\n" + "2) 둘째 단계입니다.\n" + "-. 셋째 항목입니다." + ) + assert split_into_sentences(text) == [ + "첫 단계입니다.", + "둘째 단계입니다.", + "셋째 항목입니다.", + ] + + +def test_split_into_sentences_keeps_bare_dash_bullet_unstripped(): + assert _BULLET_MARKER_PATTERN.match("- 항목 하나입니다.") is None + text = "- 항목 하나입니다." 
+ assert split_into_sentences(text) == ["- 항목 하나입니다."] + + +def test_split_into_sentences_splits_on_period_boundary(): + assert _SENTENCE_SPLIT_PATTERN.pattern == r"(?<=\.)\s+" + text = "첫 문장입니다. 둘째 문장입니다. 셋째 문장입니다." + assert split_into_sentences(text) == [ + "첫 문장입니다.", + "둘째 문장입니다.", + "셋째 문장입니다.", + ] + + +def test_split_into_sentences_drops_parts_shorter_than_min_len(): + assert _MIN_SENTENCE_LEN == 5 + text = "OK. 충분히 긴 문장입니다." + assert split_into_sentences(text) == ["충분히 긴 문장입니다."]
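
The sentence-split tests above pin down observable behaviour only. The following is a minimal stand-in that satisfies those assertions, not the source of `src/phase_z2_verification_utils.py`. The names ending in `_sketch` and the bullet-marker regex are assumptions chosen to agree with the tested cases (`1.`, `2)`, `-.`, `•.` are stripped; a bare `- ` is not). The prefix list, inline fragments, split pattern, and minimum length are copied from the test assertions.

```python
from __future__ import annotations

import re

# Copied from the u5 test assertions (meta-line surface).
META_PREFIXES = (
    "제목 라벨:", "표현 의도:", "슬라이드 주인공", "가장 큰 시각적 비중",
    "시각적으로", "간결하게 제기", "개별 증거로 제시", "계층적으로 시각화",
)
META_INLINE_FRAGMENTS = ("현상-문제 인과관계", "상위-하위 포함 관계", "독립적 나열")

# Assumed pattern: a number or bullet glyph followed by "." or ")".
BULLET_MARKER = re.compile(r"^(?:\d+|[-•])[.)]\s*")
# Copied from the u6 test assertions.
SENTENCE_SPLIT = re.compile(r"(?<=\.)\s+")
MIN_SENTENCE_LEN = 5


def strip_meta_lines_sketch(text: str) -> str:
    """Drop meta lines; keep every other line with its whitespace intact."""
    kept = []
    for line in text.split("\n"):
        if line.strip().startswith(META_PREFIXES):
            continue
        if any(fragment in line for fragment in META_INLINE_FRAGMENTS):
            continue
        kept.append(line)
    return "\n".join(kept)


def split_into_sentences_sketch(text: str) -> list[str]:
    """Split body text into sentences after removing meta and header lines."""
    sentences: list[str] = []
    for raw_line in strip_meta_lines_sketch(text).split("\n"):
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and Markdown headers carry no sentence
        line = BULLET_MARKER.sub("", line)
        for part in SENTENCE_SPLIT.split(line):
            part = part.strip()
            if len(part) >= MIN_SENTENCE_LEN:
                sentences.append(part)
    return sentences
```

A caller implementing the Step 1 gate would treat an empty return value as a deterministic input error, as the wiring design requires.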