

The Codebase Assurance Index, explained

What we measure — the Codebase Assurance Index

The Codebase Assurance Index (CAI) is one reproducible 0–100 score for a whole C#/.NET codebase. It rolls up ten lenses whose dimensions are measured by deterministic tools; a few findings also get an advisory, tolerance-banded LLM read that never moves the score. A measurement, not an opinion.


Sign in with GitHub · no card · C#/.NET · the first full report is €0.

A quick orientation

What makes the Codebase Assurance Index different.

Reproducible

Every dimension is computed by a deterministic tool reading your code. Commit + frozen rubric + advisory data → exactly one score: same commit, same rubric, same advisory data, same number.

Depth is never gated

Every survey — including the free first report — computes the full CAI: all dimensions, all lenses. You never pay for depth; you pay for breadth and cadence.

Verifiable

Read the exact rule each dimension was scored by, and run the same measurement on your own code — the algorithm and rubric are open.

Verify a score yourself →

How we measure

Graded by the open CAI standard — across ten lenses.

Five are always on; five light up with your architecture. Watchdog doesn't grade by house style — it measures against CAI, the Codebase Assurance Index: an open, reproducible 0–100 standard. The full algorithm, the worst-first fold, the firewall, and the four git-history-mined dimensions all live on the standard — open to read, cite, or recompute.

Always on

Code health

Complexity, duplication, code shape and naming — how maintainable the code itself is.

Always on

Architecture

Module boundaries, coupling, cohesion and dependency direction — whether structure holds up as the repo grows.

Always on

Maturity

Docs, ADRs, comments and process signals — how well the project explains and governs itself.

Always on

Readiness

Tests, CI gates, observability, resilience and rollback — readiness to run in production.

Always on

Security & Compliance

Secrets, dependency CVEs, SAST and licence/PII posture — the deep-scan security lens.

Lights up with your architecture

Domain Modelling

DDD tactical health — aggregates, value objects and the invariants your business rules depend on.

Lights up with your architecture

Event-Driven

Messaging and integration discipline — outbox, async handlers and contract coupling.

Lights up with your architecture

Event Sourcing

Event-store correctness — immutable events, deterministic folds and PII-in-events.

Lights up with your architecture

Accessibility

Text alternatives, labels, keyboard semantics, ARIA and a11y enforcement.

Lights up with your architecture

Performance

Benchmarks, allocation-aware APIs and async hygiene.

The full vocabulary — every dimension, its evaluator and rubric version — lives on the open standard. Browse the catalog →

The firewall

Nothing moves the number but the code.

The deterministic score sits on one side of a firewall; an advisory LLM read sits on the other — and it can never cross. That's the difference between a measurement and asking an LLM, which answers differently every time.

The AI only ever advises

A few findings get an advisory, tolerance-banded LLM read that can never, by construction, move the headline number. It explains in plain English; it never scores. The measurement stays pure.

Your inputs never score

Your compliance declarations, a suppressed finding, your contract profile — they change what the artifact says, never the CAI. A declaration is presentation; the score is measurement. Neither party to a contract can tilt the number — only the code changing moves it (or a disclosed advisory refresh like a new CVE).

The full firewall, drawn on the standard →

From the lenses to one number

How the lenses roll up.

The CAI is a weighted roll-up of the lens scores under the frozen rubric — not an average you can't see inside.

Core always counts; conditional lenses only when they apply

The five core lenses always contribute. The conditional lenses contribute only when the code calls for them, and the weights re-normalise — so a repo is never penalised for a lens that doesn't apply.
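A minimal sketch of that re-normalisation, under stated assumptions: the lens names, the equal weights and the plain weighted mean are all hypothetical stand-ins, since the real weights and fold are defined by the frozen rubric and the spec. A lens that does not apply is passed as `None` and drops out of both the numerator and the denominator.

```python
from typing import Mapping, Optional

# Hypothetical weights for illustration only; the real values live in the rubric.
WEIGHTS = {
    "code_health": 0.25,
    "architecture": 0.25,
    "readiness": 0.25,
    "event_sourcing": 0.25,  # stands in for a conditional lens
}


def roll_up(scores: Mapping[str, Optional[float]],
            weights: Mapping[str, float] = WEIGHTS) -> float:
    """Weighted mean over the lenses that apply; None marks a lens that does not."""
    applicable = {name: s for name, s in scores.items() if s is not None}
    total_weight = sum(weights[name] for name in applicable)
    if total_weight == 0:
        raise ValueError("no applicable lenses to roll up")
    # Dividing by the weight of the applicable lenses is the re-normalisation step.
    return sum(weights[name] * s for name, s in applicable.items()) / total_weight


# Same core scores; the conditional lens applies in one repo and not in the other.
with_es = roll_up({"code_health": 80, "architecture": 70,
                   "readiness": 90, "event_sourcing": 40})
without_es = roll_up({"code_health": 80, "architecture": 70,
                      "readiness": 90, "event_sourcing": None})
```

With these made-up numbers, the repo where the conditional lens does not apply is rolled up over three lenses only, instead of being treated as if it had scored zero on the fourth.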

A critical lens caps the headline

The roll-up can't read Strong while a lens reads Critical: a single critical-band lens caps the CAI, so the one number can't hide a serious failure in one dimension behind strong scores elsewhere.
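One way to picture the cap, as a sketch rather than the spec's rule: if any lens falls in the Critical band, the headline is clamped below the Strong cut. The cuts of 25 and 70 come from the band scale; the exact cap value (69 here) and the treatment of boundary scores are assumptions made for illustration.

```python
from typing import Sequence

CRITICAL_CUT = 25  # band cut: scores below this read Critical
STRONG_CUT = 70    # band cut: scores from here up read Strong or better


def apply_critical_cap(rolled_up: float,
                       lens_scores: Sequence[float],
                       cap: float = STRONG_CUT - 1) -> float:
    """Clamp the headline below Strong when any lens is in the Critical band.

    The default cap of 69 is an assumed value, not one taken from the spec.
    """
    if any(score < CRITICAL_CUT for score in lens_scores):
        return min(rolled_up, cap)
    return rolled_up


capped = apply_critical_cap(85, [90, 88, 20])    # one lens is Critical
uncapped = apply_critical_cap(85, [90, 88, 60])  # no lens is Critical
```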

Mined from git history, not just the code

Four behavioural dimensions — hotspots, bus factor, knowledge freshness, change coupling — are read deterministically from your git history and scored into the same CAI.
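To make one of these concrete, here is a rough sketch of the raw signal behind change coupling: counting how often two files change in the same commit. It assumes the per-commit file lists have already been extracted (for example from `git log --name-only`); the file names are invented, and how the CAI turns such counts into a score is not shown here.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, Sequence


def change_coupling(commits: Iterable[Sequence[str]]) -> Counter:
    """Count how often each pair of files changed in the same commit."""
    pairs = Counter()
    for files in commits:
        # Sort so each pair has one canonical key regardless of listing order.
        for a, b in combinations(sorted(set(files)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical history: each inner list is the set of files touched by one commit.
commits = [
    ["Order.cs", "OrderTests.cs"],
    ["Order.cs", "OrderTests.cs", "Invoice.cs"],
    ["Invoice.cs"],
]
coupling = change_coupling(commits)
```

Because the input is only the commit history, the same history always yields the same counts, which is what makes a dimension like this deterministic.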

So a contract floor of CAI ≥ 80 means every always-on lens is Strong or better with no lens Critical — decomposable, not opaque. The authoritative spec: cai.canine.dev/spec

How to read the number

One fixed scale, five bands — and a pin that never moves for anyone.

Every score renders on the same worst→best scale: Critical / Weak / Adequate / Strong / Exemplary, cut at 25 / 50 / 70 / 90. The pin marks the score's exact spot — position on the fixed scale *is* the reading, never a corpus-relative rank.

[Scale graphic: Critical · Weak · Adequate · Strong · Exemplary, pin at 62]

The sample artifact above pins at 62, which reads Adequate and sits in the upper half of the 50–70 band. Banding is presentation only: it never moves a number.
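The fixed scale can be sketched as a simple lookup. The cuts and band names are the ones stated above; treating a score that lands exactly on a cut as belonging to the higher band is an assumption, since the boundary rule is not spelled out here.

```python
from bisect import bisect_right

CUTS = [25, 50, 70, 90]
BANDS = ["Critical", "Weak", "Adequate", "Strong", "Exemplary"]


def band(score: float) -> str:
    """Map a 0-100 score to its band; a score on a cut goes to the higher band (assumed)."""
    return BANDS[bisect_right(CUTS, score)]


sample_band = band(62)
```

The lookup depends only on the score and the fixed cuts, never on how other repositories scored, which is the point of a scale that is not corpus-relative.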

Calibrated, not noisy

A reading you can act on — not a thousand findings to triage.

Watchdog is calibrated against a corpus of real .NET codebases, so the idioms a line-level checker trips over (a repository that coheres through a base class, a test that asserts through a harness, an interface a façade is obliged to implement) don't read as defects. Zero setup, no rule-tuning weekend: the false positives are calibrated out before you ever see them.

Tuned on real code

Every detector is tested, against a public reference corpus, to confirm that it fires on the real defect and stays quiet on the idiom. On that corpus the typical repository's findings are over 95% real, and reference clean-architecture codebases exceed 99%.

Disagree, and it learns

Any scored finding can be disputed in one click: it is routed to human triage, and a confirmed false positive becomes a detector test so it can't recur. The instrument sharpens with use; the score never bends to the dispute.

Quiet by design

Findings are ranked by what moves the grade and folded so one stray stub barely registers, and a lens that can't be measured returns not-measured with the reason, rather than a phantom zero. Volume is never mistaken for rigour.

Calibration is an ongoing programme — idiom-heavy codebases still surface residual noise we keep tuning down, and every dispute feeds the next round.

Worked example: cohesion

LCOM4 counts how many disconnected clusters a class splits into, and provably mis-measures some good designs. A well-designed domain aggregate (an Order with AddItem, ChangeShipping, Cancel) is cohesive by its invariant, yet looks like a god-class to the raw metric. Watchdog recognises the shapes LCOM4 provably mis-measures (domain aggregates, data-access repositories, source-generated view-models, contract-mandated plumbing) and exempts them, while still flagging the real god-object. The result: a cohesion signal you can act on, not a list to triage.
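To see why the raw metric misreads an aggregate, here is a small sketch of LCOM4 as it is commonly defined: methods are nodes, two methods are linked when they share a field or one calls the other, and the score is the number of connected components. The field names on the Order are hypothetical, and this is the textbook metric only, not Watchdog's exemption logic.

```python
from typing import Dict, Iterable, Set, Tuple


def lcom4(method_fields: Dict[str, Set[str]],
          calls: Iterable[Tuple[str, str]] = ()) -> int:
    """Number of connected components among methods linked by shared fields or calls."""
    methods = list(method_fields)
    parent = {m: m for m in methods}

    def find(m: str) -> str:
        # Walk to the component root, shortening the path as we go.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    def union(a: str, b: str) -> None:
        parent[find(a)] = find(b)

    for i, a in enumerate(methods):
        for b in methods[i + 1:]:
            if method_fields[a] & method_fields[b]:  # the two methods share a field
                union(a, b)
    for caller, callee in calls:
        union(caller, callee)

    return len({find(m) for m in methods})


# Hypothetical Order aggregate: each method touches a different private field.
order = {
    "AddItem": {"_items"},
    "ChangeShipping": {"_shippingAddress"},
    "Cancel": {"_status"},
}
order_lcom4 = lcom4(order)
```

On this toy input the metric reports three disconnected clusters, the signature of a class that should be split, even though all three methods guard one business invariant. That gap between the graph and the design intent is what the exemptions described above are meant to close.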

Slop by shape, not authorship

The hollow code that compiles — measured deterministically.

The failure mode of fast, high-volume code isn't bad syntax — a linter catches that. It's code that *looks finished and does nothing*. Watchdog measures the hollowness itself — by shape.

IC1

Stubs that look implemented

A CalculateTax that returns 0, an async method that never awaits, skeleton types, dead branches. Valid C#, type-checks — and does nothing.

D10

Tests that assert nothing

Green tests with no assertions; skipped tests dressed up as coverage. Coverage counts the lines a test executes; it never asks whether the test actually checks a result.

X3

Errors made invisible

Empty catch {} blocks, a rethrow written as throw ex that resets the stack trace where a bare throw would preserve it. Type-based checks pass; the failure disappears.

D17

Untracked debt & dead code

TODO/FIXME/HACK, blanket suppressions, commented-out code, unreferenced symbols — counted raw at best by other tools; scored here.

D4

Copy-paste, never parameterised

Near-duplicate blocks — type-aware detection, scored by density across the codebase.

We don't guess whether a machine typed it — stylometric "AI detection" is a credibility trap, and we refuse to make the claim. We measure the hollowness by shape, so a rushed human and an eager model produce the same finding and the same fix.
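What "by shape" can mean in practice is easiest to show on a syntax tree. The sketch below is illustrative only: it uses Python's standard `ast` module as a stand-in (Watchdog targets C#, and its actual rules are not reproduced here) and checks for two of the shapes listed above, a test with no assertion and an error handler that swallows the failure. The finding labels are made up for this example.

```python
import ast
from typing import List, Tuple


def hollow_shapes(source: str) -> List[Tuple[str, object]]:
    """Return (kind, function name or line number) for two hollow shapes in Python source."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Shape 1: a test function with no assert statement and no assert* method call.
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test"):
            has_assert = any(
                isinstance(n, ast.Assert)
                or (isinstance(n, ast.Call)
                    and isinstance(n.func, ast.Attribute)
                    and n.func.attr.startswith("assert"))
                for n in ast.walk(node)
            )
            if not has_assert:
                findings.append(("test-without-assertion", node.name))
        # Shape 2: an exception handler whose entire body is `pass`.
        if isinstance(node, ast.ExceptHandler) and all(
                isinstance(stmt, ast.Pass) for stmt in node.body):
            findings.append(("swallowed-error", node.lineno))
    return findings


sample = '''
def test_total():
    compute_total()

def test_tax():
    assert compute_tax() == 0

def load():
    try:
        read()
    except Exception:
        pass
'''
findings = hollow_shapes(sample)
```

Nothing in the check looks at who wrote the code or how it is styled; it only asks whether the structure does any work, so the same input always produces the same findings.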

Rubric versioning

Freeze the rubric, keep the score constant.

Watchdog scores with a versioned rubric. Any change that can move a score for unchanged code bumps the rubric version.

Versioned and contestable

The rubric is contestable: a scoring change that isn't reflected in the published spec fails our CI, so every number stays re-derivable from a rule you can read.

Rubric versions & governance →

Contract rubrics

Pin a repository to a frozen rubric and the ruler stops moving under you — the same commit re-scores to the same number under that rubric, so any movement you see is the asset changing, never the ruler. Advisory data still refreshes, so a new CVE can legitimately move a security finding — a real signal, disclosed in the changelog.

Get the measurement. No depth is ever gated.
