Testing methodology.

Loupe makes one promise that everything else hangs on: every claim a brief makes resolves to a labelled citation. The testing strategy below exists to defend that promise — across every layer, under every adversarial input we could think of, with the actual on-device model that ships in production.

Numbers below are reproducible on any Apple Silicon Mac with Apple Intelligence enabled. The full test corpus is checked into Tests/LoupeTests/.

Reliability metrics

| Metric | Value | How to reproduce |
|---|---|---|
| Total automated tests | 500 passing, 1 skipped | `xcodebuild test -scheme Loupe`. The skipped test is the real-model E2E suite, which only runs when Apple Intelligence is available on the test machine. |
| Logic-surface code coverage | 87.58% | `xcodebuild test -scheme Loupe -enableCodeCoverage YES` followed by `xcrun xccov view --report`. Measured across `Models/`, `Parsers/`, `Services/`, `Utilities/`. SwiftUI views, the app entry point, and framework-integration paths (security-scoped bookmarks + LanguageModelSession glue) are excluded by design because they need NSWindow / Apple Intelligence runtime to test properly. |
| Full unit-test suite runtime | ~38 seconds | On a 2024 Apple Silicon Mac. Excludes the 6 real-model scenarios, which add ~3 minutes when AI is available. |
| Real-model end-to-end scenarios | 6 scenarios, all passing | Hand-crafted golden log fixtures at `Tests/LoupeTests/Fixtures/Scenarios/`. Scenarios listed below. |
| Hammer test on the narrator | 100/100 SHIP, 0 correctness failures | Sequential narrator runs against the canonical DB-outage scenario, asserting consensus + citation correctness. Run on macOS 26.5. |
| Determinism-asserted layers | 8 layers | RCA Markdown, manifest, hashes, CSVs, IODEF XML, HTML, redactor, audit log. Each has a byte-identity test that locks output stability. |
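
A byte-identity test of the kind listed in the last row can be sketched as follows. This is a minimal illustration, not Loupe's code: `Finding` and `renderManifest` are hypothetical stand-ins for any deterministic output layer. The idea is that the renderer sorts its input and avoids anything order- or time-dependent, so two runs over differently ordered input produce identical bytes.

```swift
import Foundation

// Hypothetical stand-in for one finding; not Loupe's actual model.
struct Finding {
    let id: String
    let message: String
}

// Hypothetical renderer. Sorting by id removes any dependence on input order,
// which is what allows the output to be compared byte for byte.
func renderManifest(_ findings: [Finding]) -> Data {
    let lines = findings
        .sorted { $0.id < $1.id }
        .map { "\($0.id)\t\($0.message)" }
    return Data(lines.joined(separator: "\n").utf8)
}

let findings = [
    Finding(id: "F-002", message: "app: ECONNREFUSED"),
    Finding(id: "F-001", message: "postgres: OOM-killer invoked"),
]

// Render the same findings in two different orders and compare the raw bytes.
let first = renderManifest(findings)
let second = renderManifest(Array(findings.reversed()))
let isByteIdentical = (first == second)
print(isByteIdentical)
```

In a real suite the second rendering would more likely be compared against a checked-in golden file, so that any change to the output format shows up as a test failure.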

The six end-to-end scenarios

Each scenario runs the full pipeline (parser → findings → clusters → rule engine → real LanguageModelSession.respond → archetype-classified brief) on a hand-crafted golden log fixture. No mocks, no recorded outputs — the actual on-device model produces the brief, and the test asserts archetype + confidence band + hypothesis keywords.

Each scenario's pipeline diagram below shows the exact data flow: input fixture(s) → parser → detection layers (anomaly, cluster, rule engine) → CaseFile assembly → narrator with consensus voting → assertions. Diagrams are rendered to static SVG at build time from Mermaid source at useloupe-web/scripts/diagrams/scenarios/, so they are regenerated alongside the test code.

  1. DB OOM outage

    single/easy

    8 events from one source — postgres invoked OOM-killer, app errors with ECONNREFUSED until DB recovers. Tests narrator on the cleanest possible signal.

    Expected: Single-pass brief; archetype is capacity_or_resource_failure or infrastructure_failure; confidence ≥ 75.

    [Figure: pipeline and assertions diagram for the DB OOM outage scenario]
  2. Upstream cascade

    single/medium

    25 events across nginx access + app syslog. Slow DB queries → app timeouts → nginx 502 burst → recovery. Tests cross-source correlation.

    Expected: Either a single-pass brief naming the cascade OR fall-through to per-cluster analysis (both doctrine-aligned). Per-cluster path verified to produce a confident brief on at least one cluster.

    [Figure: pipeline and assertions diagram for the Upstream cascade scenario]
  3. Credential stuffing in noise

    single/hard

    ~65 events across auth.log + nginx access + app syslog. A 401-burst from one external IP buried in routine traffic, ending in a single 200 success.

    Expected: Single-pass brief; archetype is external_attack or unauthorized_access; hypothesis cites the source IP and the credential-stuffing pattern.

    [Figure: pipeline and assertions diagram for the Credential stuffing in noise scenario]
  4. Real signal mixed with unrelated noise

    anomaly/easy

    30 events: a real DB outage signal plus an unrelated developer laptop's Spotlight indexing, AirPods connections, screen saver activity.

    Expected: Brief stays focused on the DB outage. Asserts the model does NOT mention laptop-noise keywords (airpods, spotlight, screen saver, etc.).

    [Figure: pipeline and assertions diagram for the Real signal mixed with unrelated noise scenario]
  5. Concatenated formats in one file

    anomaly/medium

    20 events: nginx CLF and BSD syslog interleaved line-by-line in a single file. Tests per-line format detection.

    Expected: Brief is well-formed regardless of which format dominates. Asserts archetype resolves to the closed-vocab enum and supporting evidence is non-empty.

    [Figure: pipeline and assertions diagram for the Concatenated formats in one file scenario]
  6. Three interleaved real incidents (chaos)

    anomaly/hard

    ~50 events: DB OOM outage + SSH brute force + HTTP credential stuffing, all in one file with mixed formats and non-linear timestamps.

    Expected: Multi-cluster failure surface fires (the rare doctrine-aligned path). Analyze All over the resulting clusters produces 3 distinct threads with different archetypes.

    [Figure: pipeline and assertions diagram for the Three interleaved real incidents (chaos) scenario]
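
The expectations above (allowed archetypes, a confidence floor, keywords that must or must not appear) could be checked with a helper along these lines. This is a sketch under assumptions: the `Brief` and `Expectation` types, their field names, and the sample briefs are all hypothetical. Only the archetype names, the confidence threshold of 75, and the laptop-noise keywords come from the scenario descriptions, and they are combined here purely for illustration.

```swift
import Foundation

// Hypothetical shape of a narrator brief; field names are illustrative only.
struct Brief {
    let archetype: String
    let confidence: Int
    let hypothesis: String
}

// Hypothetical expectation for one scenario.
struct Expectation {
    let allowedArchetypes: Set<String>
    let minimumConfidence: Int
    let requiredKeywords: [String]
    let forbiddenKeywords: [String]
}

// Returns human-readable failures; an empty array means the brief passes.
func check(_ brief: Brief, against expectation: Expectation) -> [String] {
    var failures: [String] = []
    let text = brief.hypothesis.lowercased()

    if !expectation.allowedArchetypes.contains(brief.archetype) {
        failures.append("unexpected archetype: \(brief.archetype)")
    }
    if brief.confidence < expectation.minimumConfidence {
        failures.append("confidence \(brief.confidence) below \(expectation.minimumConfidence)")
    }
    for keyword in expectation.requiredKeywords where !text.contains(keyword.lowercased()) {
        failures.append("missing keyword: \(keyword)")
    }
    for keyword in expectation.forbiddenKeywords where text.contains(keyword.lowercased()) {
        failures.append("forbidden keyword present: \(keyword)")
    }
    return failures
}

let expectation = Expectation(
    allowedArchetypes: ["capacity_or_resource_failure", "infrastructure_failure"],
    minimumConfidence: 75,
    requiredKeywords: ["postgres"],  // assumed keyword, for illustration
    forbiddenKeywords: ["airpods", "spotlight", "screen saver"]
)

// Made-up briefs: one that should pass and one that drifts into laptop noise.
let goodBrief = Brief(
    archetype: "infrastructure_failure",
    confidence: 82,
    hypothesis: "Postgres was OOM-killed; the app saw ECONNREFUSED until it recovered."
)
let noisyBrief = Brief(
    archetype: "infrastructure_failure",
    confidence: 60,
    hypothesis: "Spotlight indexing slowed the database host."
)

let goodFailures = check(goodBrief, against: expectation)
let noisyFailures = check(noisyBrief, against: expectation)
print(goodFailures.count, noisyFailures.count)
```

Returning a list of failures rather than a single Boolean keeps the test output readable when the model misses on more than one axis at once.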

Adversarial input suites

Every surface that takes user-supplied bytes has a test file dedicated to feeding it deliberately bad input.

  • AdversarialCitationTests.swift — citation index hallucination, malformed refs, file-not-loaded fallback
  • AdversarialRuleEngineTests.swift — ReDoS rejection, empty predicates, zero evidenceMinimum
  • AdversarialSurfaceTests.swift — parser fuzzing on garbage input, schema-evolution, audit-chain tamper detection
  • BinarySourceCitationTests.swift — pcap binary-source fallback to parsed message
  • CitationHallucinationTests.swift — out-of-bounds event indexes
  • ParserRobustnessTests.swift — empty files, gigantic single lines, mixed line endings
  • SyslogParserRobustnessTests.swift — truncated lines, embedded nulls, BOM, year rollover
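
As an illustration of what an out-of-bounds citation check might look like, the sketch below validates cited event indexes against the number of events actually loaded. It is not taken from Loupe: `CitationCheck` and `validateCitations` are hypothetical names, and zero-based indexing is an assumption.

```swift
import Foundation

// Hypothetical result type; Loupe's real citation model is not shown here.
enum CitationCheck: Equatable {
    case valid
    case outOfBounds(index: Int)
}

// Validates model-cited event indexes against the number of loaded events.
// Any index outside 0..<eventCount is treated as a hallucinated citation.
// Assumes zero-based indexes.
func validateCitations(_ cited: [Int], eventCount: Int) -> [CitationCheck] {
    // Clamp so that a nonsensical negative count cannot form an invalid range.
    let upper = max(eventCount, 0)
    return cited.map { index in
        (0..<upper).contains(index)
            ? CitationCheck.valid
            : CitationCheck.outOfBounds(index: index)
    }
}

// Eight events loaded: indexes 0 and 7 exist, 8 and -1 do not.
let results = validateCitations([0, 7, 8, -1], eventCount: 8)
print(results)
```

An adversarial test would feed boundary values such as `eventCount`, `-1`, and `Int.max` and assert that each is rejected rather than crashing or silently resolving to another event.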

Why we test against the real model

Mocked narrator outputs prove pipeline plumbing works. They do not prove the model classifies correctly. The narrator's job — picking the right archetype out of eleven, citing the right events — is exactly what we cannot fake without losing the audit trail.

The 6 scenario tests run the actual on-device LanguageModelSession. They take ~3 minutes total (each runs consensus voting, so 3–5 inference passes). They will refuse to run on a machine without Apple Intelligence rather than substitute a mock — the test reports SKIP, not PASS, when AI is unavailable.
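
Consensus voting over several inference passes can be pictured with a small sketch. The rule used here, a strict majority of archetype labels, is an assumption chosen for illustration; this document does not specify Loupe's actual consensus rule, and `consensus(of:)` is a hypothetical name.

```swift
import Foundation

// Hypothetical consensus rule: the archetype returned by a strict majority of
// passes wins; otherwise there is no consensus.
func consensus(of votes: [String]) -> String? {
    guard !votes.isEmpty else { return nil }

    // Tally how many passes returned each label.
    var counts: [String: Int] = [:]
    for vote in votes {
        counts[vote, default: 0] += 1
    }

    // Take the most frequent label, then require a strict majority.
    guard let top = counts.max(by: { $0.value < $1.value }) else { return nil }
    return top.value * 2 > votes.count ? top.key : nil
}

// Three passes, two of which agree.
let agreed = consensus(of: [
    "infrastructure_failure",
    "infrastructure_failure",
    "capacity_or_resource_failure",
])

// Three passes that all disagree: no consensus.
let split = consensus(of: [
    "external_attack",
    "unauthorized_access",
    "infrastructure_failure",
])

print(agreed ?? "no consensus", split ?? "no consensus")
```

Requiring a strict majority means a three-way split yields no answer at all, which fits a design that prefers falling through to another path over guessing.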

What the 12.42% gap is, and isn't

The 87.58% logic-surface coverage figure has documented exemptions:

  • Security-scoped bookmark code paths in IngestService — require an NSWindow + user-driven file open dialog. Not testable from XCTest without a running app.
  • LanguageModelSession.respond glue in NarratorService — the live model invocation. Covered indirectly by the 6 real-model E2E scenarios when AI is available.
  • Error-display branches in views — exercised manually via the running app.

These exemptions are deliberate. Adding mock seams over NSWindow or LanguageModelSession would defeat the point of testing against the real platform. The 6 real-model scenarios are the compromise: they exercise the live model path end to end without introducing a mock layer.

Reproducibility

  • Test files: Tests/LoupeTests/
  • Scenario fixtures: Tests/LoupeTests/Fixtures/Scenarios/
  • Run command: `xcodebuild test -scheme Loupe`
  • Coverage report: `xcrun xccov view --report <xcresult-path>`

The full Testing-Methodology.md spec is bundled inside the shipped Loupe.app at docs/Testing-Methodology.md for offline review.