
Testing methodology.

Loupe makes one promise that everything else hangs on: every claim a brief makes resolves to a labelled citation. The testing strategy below exists to defend that promise — across every layer, under every adversarial input we could think of, with the actual on-device model that ships in production.

Numbers below are reproducible on any Apple Silicon Mac with Apple Intelligence enabled. The full test corpus is checked into Tests/LoupeTests/.

Reliability metrics

Metric	Value	How to reproduce
Total automated tests	500 passing, 1 skipped	`xcodebuild test -scheme Loupe`. The skipped test is the real-model E2E suite, which only runs when Apple Intelligence is available on the test machine.
Logic-surface code coverage	87.58%	`xcodebuild test -scheme Loupe -enableCodeCoverage YES` followed by `xcrun xccov view --report`. Measured across `Models/`, `Parsers/`, `Services/`, `Utilities/`. SwiftUI views, the app entry point, and framework-integration paths (security-scoped bookmarks + LanguageModelSession glue) are excluded by design because they need an NSWindow or the Apple Intelligence runtime to test properly.
Full unit-test suite runtime	~38 seconds	On a 2024 Apple Silicon Mac. Excludes the 6 real-model scenarios, which add ~3 minutes when AI is available.
Real-model end-to-end scenarios	6 scenarios, all passing	Hand-crafted golden log fixtures at `Tests/LoupeTests/Fixtures/Scenarios/`. Scenarios listed below.
Hammer test on the narrator	100/100 SHIP, 0 correctness failures	Sequential narrator runs against the canonical DB-outage scenario, asserting consensus + citation correctness. Run on macOS 26.5.
Determinism-asserted layers	8 layers	RCA Markdown, manifest, hashes, CSVs, IODEF XML, HTML, redactor, audit log — each has a byte-identity test that locks output stability.
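A byte-identity test of the kind listed in the last row can be sketched as follows. This is a minimal illustration, not Loupe's actual test: the `Manifest` type, its fields, and the helper names are hypothetical, and JSON stands in for whatever format the real manifest uses. The idea is to render the same input several times and require identical bytes.

```swift
import Foundation

// Hypothetical manifest type; Loupe's real manifest schema is not shown here.
struct Manifest: Codable {
    var caseID: String
    var fileHashes: [String: String]
}

// Render with sorted keys so dictionary iteration order cannot leak into the output.
func renderManifest(_ manifest: Manifest) throws -> Data {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys, .prettyPrinted]
    return try encoder.encode(manifest)
}

// Byte-identity check: the same input rendered repeatedly must give identical bytes.
func isByteStable(_ manifest: Manifest, runs: Int = 10) throws -> Bool {
    precondition(runs >= 2, "need at least two renders to compare")
    let first = try renderManifest(manifest)
    for _ in 1..<runs {
        if try renderManifest(manifest) != first { return false }
    }
    return true
}

let sample = Manifest(caseID: "case-001", fileHashes: ["b.log": "22", "a.log": "11"])
let stable = try isByteStable(sample)
print(stable ? "manifest output is byte-stable" : "manifest output drifted")
```

Sorting keys matters here because Swift dictionaries do not guarantee a stable iteration order across process launches, which is a common source of byte-level drift.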

The six end-to-end scenarios

Each scenario runs the full pipeline (parser → findings → clusters → rule engine → real LanguageModelSession.respond → archetype-classified brief) on a hand-crafted golden log fixture. No mocks, no recorded outputs — the actual on-device model produces the brief, and the test asserts archetype + confidence band + hypothesis keywords.
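The three assertions named above (archetype, confidence band, hypothesis keywords) could be expressed roughly like this. It is a sketch under assumptions: the `Brief` and `ScenarioExpectation` types, their field names, and the sample keyword are hypothetical and are not taken from Loupe's test code.

```swift
import Foundation

// Hypothetical shape of a narrator brief; field names are assumptions.
struct Brief {
    let archetype: String
    let confidence: Int
    let hypothesis: String
}

// What one scenario expects from the brief the model produced.
struct ScenarioExpectation {
    let allowedArchetypes: Set<String>
    let confidenceBand: ClosedRange<Int>
    let requiredKeywords: [String]

    // Returns human-readable failures; an empty array means the brief passes.
    func failures(for brief: Brief) -> [String] {
        var problems: [String] = []
        if !allowedArchetypes.contains(brief.archetype) {
            problems.append("unexpected archetype: \(brief.archetype)")
        }
        if !confidenceBand.contains(brief.confidence) {
            problems.append("confidence \(brief.confidence) outside \(confidenceBand)")
        }
        let text = brief.hypothesis.lowercased()
        for keyword in requiredKeywords where !text.contains(keyword.lowercased()) {
            problems.append("hypothesis is missing keyword: \(keyword)")
        }
        return problems
    }
}

// Expectation modelled on scenario 1; the keyword is an assumed example.
let dbOutage = ScenarioExpectation(
    allowedArchetypes: ["capacity_or_resource_failure", "infrastructure_failure"],
    confidenceBand: 75...100,
    requiredKeywords: ["postgres"]
)
let sampleBrief = Brief(
    archetype: "infrastructure_failure",
    confidence: 82,
    hypothesis: "Postgres was OOM-killed; the app saw ECONNREFUSED until it recovered."
)
print(dbOutage.failures(for: sampleBrief))
```

Collecting every failure instead of stopping at the first one makes a failing run against a non-deterministic model easier to diagnose.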

Each scenario's pipeline diagram below shows the exact data flow: input fixture(s) → parser → detection layers (anomaly, cluster, rule engine) → CaseFile assembly → narrator with consensus voting → assertions. Diagrams are rendered to static SVG at build time from Mermaid source at useloupe-web/scripts/diagrams/scenarios/ — they update with the test code.

1. DB OOM outage (single/easy)
8 events from one source: postgres invoked OOM-killer, app errors with ECONNREFUSED until DB recovers. Tests narrator on the cleanest possible signal.
Expected: Single-pass brief; archetype is capacity_or_resource_failure or infrastructure_failure; confidence ≥ 75.
[Diagram: pipeline + assertions]

2. Upstream cascade (single/medium)
25 events across nginx access + app syslog. Slow DB queries → app timeouts → nginx 502 burst → recovery. Tests cross-source correlation.
Expected: Either a single-pass brief naming the cascade OR fall-through to per-cluster analysis (both doctrine-aligned). Per-cluster path verified to produce a confident brief on at least one cluster.
[Diagram: pipeline + assertions]

3. Credential stuffing in noise (single/hard)
~65 events across auth.log + nginx access + app syslog. A 401-burst from one external IP buried in routine traffic, ending in a single 200 success.
Expected: Single-pass brief; archetype is external_attack or unauthorized_access; hypothesis cites the source IP and the credential-stuffing pattern.
[Diagram: pipeline + assertions]

4. Real signal mixed with unrelated noise (anomaly/easy)
30 events: a real DB outage signal plus an unrelated developer laptop's Spotlight indexing, AirPods connections, screen saver activity.
Expected: Brief stays focused on the DB outage. Asserts the model does NOT mention laptop-noise keywords (airpods, spotlight, screen saver, etc.).
[Diagram: pipeline + assertions]

5. Concatenated formats in one file (anomaly/medium)
20 events: nginx CLF and BSD syslog interleaved line-by-line in a single file. Tests per-line format detection.
Expected: Brief is well-formed regardless of which format dominates. Asserts archetype resolves to the closed-vocab enum and supporting evidence is non-empty.
[Diagram: pipeline + assertions]

6. Three interleaved real incidents (chaos) (anomaly/hard)
~50 events: DB OOM outage + SSH brute force + HTTP credential stuffing, all in one file with mixed formats and non-linear timestamps.
Expected: Multi-cluster failure surface fires (the rare doctrine-aligned path). Analyze All over the resulting clusters produces 3 distinct threads with different archetypes.
[Diagram: pipeline + assertions]
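Scenario 4's negative assertion, that the brief must not mention laptop-noise keywords, can be sketched as a case-insensitive scan. The function name is hypothetical, and the keyword list holds only the three terms named in the scenario description; the real test's list may be longer.

```swift
import Foundation

// Keywords taken from the scenario 4 description; the real list may differ.
let laptopNoise = ["airpods", "spotlight", "screen saver"]

// Returns every forbidden keyword that appears in the brief text, ignoring case.
func leakedKeywords(in briefText: String, forbidden: [String]) -> [String] {
    forbidden.filter { briefText.range(of: $0, options: .caseInsensitive) != nil }
}

let focused = "Postgres was OOM-killed and the app returned ECONNREFUSED."
let distracted = "DB outage, possibly related to AirPods connecting during Spotlight indexing."

print(leakedKeywords(in: focused, forbidden: laptopNoise))
print(leakedKeywords(in: distracted, forbidden: laptopNoise))
```

A test built on this would pass only when the returned array is empty, and the array itself tells you which distractor the model picked up.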

Adversarial input suites

Every surface that takes user-supplied bytes has a test file dedicated to feeding it deliberately bad input.

AdversarialCitationTests.swift — citation index hallucination, malformed refs, file-not-loaded fallback
AdversarialRuleEngineTests.swift — ReDoS rejection, empty predicates, zero evidenceMinimum
AdversarialSurfaceTests.swift — parser fuzzing on garbage input, schema-evolution, audit-chain tamper detection
BinarySourceCitationTests.swift — pcap binary-source fallback to parsed message
CitationHallucinationTests.swift — out-of-bounds event indexes
ParserRobustnessTests.swift — empty files, gigantic single lines, mixed line endings
SyslogParserRobustnessTests.swift — truncated lines, embedded nulls, BOM, year rollover
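As an illustration of what the citation suites guard against, here is a minimal sketch of resolving a model-supplied citation without trusting it. The `E<index>` reference format, the `LogEvent` type, and the function name are assumptions made for this example; Loupe's real citation format is not described on this page.

```swift
// Hypothetical parsed event: the source line number plus its message.
struct LogEvent {
    let line: Int
    let message: String
}

enum CitationResolution: Equatable {
    case resolved(line: Int)
    case rejected(reason: String)
}

// Resolve a reference such as "E2" against the loaded events.
// Anything malformed or out of range is rejected instead of crashing.
func resolveCitation(_ raw: String, in events: [LogEvent]) -> CitationResolution {
    let digits = raw.dropFirst()
    guard raw.hasPrefix("E"),
          digits.allSatisfy({ $0.isASCII && $0.isNumber }),
          let index = Int(digits) else {
        return .rejected(reason: "malformed reference")
    }
    guard events.indices.contains(index) else {
        return .rejected(reason: "index out of bounds")
    }
    return .resolved(line: events[index].line)
}

let events = [
    LogEvent(line: 10, message: "postgres invoked oom-killer"),
    LogEvent(line: 20, message: "connect ECONNREFUSED"),
]
print(resolveCitation("E1", in: events))
print(resolveCitation("E99", in: events))
print(resolveCitation("event-one", in: events))
```

Checking `events.indices.contains(index)` before subscripting is what turns a hallucinated index into a reportable rejection rather than a runtime trap.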

Why we test against the real model

Mocked narrator outputs prove pipeline plumbing works. They do not prove the model classifies correctly. The narrator's job — picking the right archetype out of eleven, citing the right events — is exactly what we cannot fake without losing the audit trail.

The 6 scenario tests run the actual on-device LanguageModelSession. They take ~3 minutes total (each runs consensus voting, so 3–5 inference passes). They will refuse to run on a machine without Apple Intelligence rather than substitute a mock — the test reports SKIP, not PASS, when AI is unavailable.
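Consensus voting over several inference passes might look like the sketch below. The voting rule is an assumption: this page does not say how Loupe combines passes, so a strict-majority rule is used purely for illustration, and the function name is hypothetical.

```swift
// Returns the archetype that a strict majority of passes agree on,
// or nil when no label wins more than half of the votes.
func consensus(_ votes: [String]) -> String? {
    var counts: [String: Int] = [:]
    for vote in votes {
        counts[vote, default: 0] += 1
    }
    guard let top = counts.max(by: { $0.value < $1.value }) else {
        return nil // no votes at all
    }
    return top.value * 2 > votes.count ? top.key : nil
}

let agreed = consensus(["infrastructure_failure", "infrastructure_failure", "external_attack"])
let split = consensus(["infrastructure_failure", "external_attack", "unauthorized_access"])
print(agreed ?? "no consensus")
print(split ?? "no consensus")
```

Because a strict majority can belong to at most one label, the result does not depend on how ties between smaller counts are ordered.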

What the 12.42% gap is, and isn't

The 87.58% logic-surface coverage figure has documented exemptions:

Security-scoped bookmark code paths in IngestService — require an NSWindow + user-driven file open dialog. Not testable from XCTest without a running app.
LanguageModelSession.respond glue in NarratorService — the live model invocation. Covered indirectly by the 6 real-model E2E scenarios when AI is available.
Error-display branches in views — exercised manually via the running app.

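These exemptions are deliberate. Adding mock seams over NSWindow or LanguageModelSession would defeat the point of testing against the real platform. The 6 real-model scenarios are the trade-off: they cover the live model path end to end, at the cost of running only where Apple Intelligence is available.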

Reproducibility

Test files: `Tests/LoupeTests/`
Scenario fixtures: `Tests/LoupeTests/Fixtures/Scenarios/`
Run command: `xcodebuild test -scheme Loupe`
Coverage report: `xcrun xccov view --report <xcresult-path>`

The full Testing-Methodology.md spec is bundled inside the shipped Loupe.app at docs/Testing-Methodology.md for offline review.
