Agents & MCP · the remediation loop

Watchdog finds it. Your agent fixes it. The next scan proves it.

Watchdog is the independent codebase-assurance surveyor — it measures, it never fixes. It's read-only by doctrine, so instead of touching your code it serves every finding to your own coding agent over a Model Context Protocol server: prioritised by impact, briefed with the rule that fired and the exact file and line, and verified by the next scan. The hand on your code is always yours.


Works with any MCP client — Claude Code, GitHub Copilot, Cursor, opencode. C#/.NET · a measurement, not an opinion.

The closed loop

Measure → fix → prove, on a loop.

Watchdog scans on your cadence and turns the findings into a ranked, briefed task list. Your agent works the highest-leverage item in your own repo and records a provisional fix. The next scan re-measures and decides: the finding is credited as fixed only when it genuinely stops firing. Nothing the agent asserts ever moves the score.

Watchdog scans

On your calendar cadence — weekly, per sprint, monthly or quarterly — plus the daily security watch.

Prioritised tasks, over MCP

Every finding becomes a briefed task ranked by impact ÷ effort — the rule that fired, the file and line, the estimated point-gain.

Your agent fixes, in your own repo

Claude Code, Copilot, Cursor, opencode — the agent works the fix and records a provisional resolve. Watchdog never touches your code.

The re-scan verifies

The fingerprint delta on the next scan is what marks a finding fixed — never the agent's say-so. The loop repeats.

Read-only throughout — the next scan is the arbiter.

The loop repeats — the score only moves when the re-scan proves it.

Built for an agent to act on

A ranked, briefed task list — not a wall of findings.

Point an agent at the repo and it starts with audit() — repo health, every lens score, and a remediation plan ranked by impact ÷ effort. next_task() hands back the single highest-leverage item, fully briefed in one object; get_task("D34") pulls the full packet for one dimension. No round-trips, no guessing.

One call, everything needed

Each task packet carries what the dimension measures, the current score, the estimated point-gain and effort, the remediation brief, and every open instance with its fingerprint and file:line.
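As a rough picture of that packet, the sketch below models it as Python dataclasses. Only the list of contents comes from the description above; every field name, the file path, and the placeholder values are assumptions made for illustration, and the real payload may be shaped differently.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One open finding inside a dimension (field names are hypothetical)."""
    fingerprint: str
    file: str
    line: int


@dataclass
class TaskPacket:
    """Everything the agent needs for one dimension, in a single object."""
    dimension: str          # e.g. "D34"
    measures: str           # what the dimension measures
    current_score: float
    estimated_gain: float   # estimated point-gain if the task is completed
    effort: float           # estimated effort, in whatever unit the plan uses
    brief: str              # the remediation brief
    instances: list[Instance] = field(default_factory=list)

    def locations(self) -> list[str]:
        """Render every open instance as file:line for the agent to visit."""
        return [f"{i.file}:{i.line}" for i in self.instances]


packet = TaskPacket(
    dimension="D34",
    measures="(placeholder description)",
    current_score=6.0,
    estimated_gain=1.8,
    effort=3.0,
    brief="(placeholder brief)",
    instances=[Instance("fp-001", "src/Billing/Tax.cs", 42)],
)
print(packet.locations())  # ['src/Billing/Tax.cs:42']
```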

Ranked by leverage, not noise

The plan ranks whole dimensions by impact ÷ effort — "lift D34 from 6 to 8 for +1.8" — so the agent works what moves the grade, instead of clearing a hundred Info-level findings in an area that's already strong.
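A minimal sketch of that ordering, assuming each plan entry exposes an estimated gain and a positive effort figure. The field names and all numbers except D34's +1.8 are invented for the example; how Watchdog actually estimates effort is not specified here.

```python
from dataclasses import dataclass


@dataclass
class PlanItem:
    dimension: str
    gain: float    # estimated point-gain (hypothetical field)
    effort: float  # estimated effort, must be > 0 (hypothetical field)

    @property
    def leverage(self) -> float:
        """Impact divided by effort: the ranking key."""
        return self.gain / self.effort


def rank(plan: list[PlanItem]) -> list[PlanItem]:
    """Highest impact-per-effort first."""
    return sorted(plan, key=lambda item: item.leverage, reverse=True)


plan = [
    PlanItem("D17", gain=0.2, effort=4.0),  # many small findings, little payoff
    PlanItem("D34", gain=1.8, effort=3.0),
    PlanItem("D10", gain=0.9, effort=1.0),
]
ranked = rank(plan)
print([item.dimension for item in ranked])  # ['D10', 'D34', 'D17']
```

D17 lands last even if it holds the most findings, because its gain per unit of effort is the smallest.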

No hallucinated locations

Every finding is content-addressed — a stable fingerprint plus exact file and line. The agent never invents a location, and it can diff two scans (or two SARIF runs) and get the same adds and removes.
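Because identity is a fingerprint, a diff between two scans reduces to set arithmetic. The sketch below assumes each scan has already been reduced to a set of fingerprint strings; the values are made up.

```python
def diff_scans(before: set[str], after: set[str]) -> tuple[set[str], set[str]]:
    """Return (added, removed) fingerprints between two scans."""
    return after - before, before - after


previous = {"a1f3", "b7c9", "d2e4"}
latest = {"b7c9", "d2e4", "f0a8"}

added, removed = diff_scans(previous, latest)
print(sorted(added), sorted(removed))  # ['f0a8'] ['a1f3']
```

The same function works whether the fingerprints came from the MCP tools or were read out of two SARIF files, which is why both routes agree on the adds and removes.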

Honest absence

When a dimension can't be measured — coverage on a suite that won't run, an npm scan on a Python repo — it returns not-measured with the reason, never a phantom 0.
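One way to model the difference between "not measured" and "scored zero" is an optional score that travels with its reason. This is an illustrative sketch only: the type, its field names, and the averaging helper are assumptions, not Watchdog's data model or aggregation rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reading:
    dimension: str
    score: Optional[float]        # None means "not measured", never 0
    reason: Optional[str] = None  # why the measurement was impossible


def mean_of_measured(readings: list[Reading]) -> Optional[float]:
    """Average only what was measured; unmeasured readings are skipped, not zeroed."""
    scores = [r.score for r in readings if r.score is not None]
    return sum(scores) / len(scores) if scores else None


readings = [
    Reading("D10", 8.0),
    Reading("coverage", None, reason="test suite does not run"),
]
print(mean_of_measured(readings))  # 8.0
```

Had the unmeasured reading been recorded as 0, the same average would have come out as 4.0, which is the distortion a phantom zero introduces.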

Scores are pinned to a frozen rubric version. Findings ship as SARIF too — helpUri to the rule's intent, partialFingerprints for stable cross-tool identity, security findings tagged with their CWE.
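For orientation, here is roughly where those three fields sit in a SARIF 2.1.0 log, built as a Python dictionary. The rule id, the help URL, the file path, the fingerprint key and value, and the use of a `properties.tags` entry for the CWE are all assumptions for this sketch; Watchdog's actual output may differ.

```python
import json

sarif_log = {
    "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
    "version": "2.1.0",
    "runs": [{
        "tool": {"driver": {
            "name": "Watchdog",
            "rules": [{
                "id": "SEC-EXAMPLE",  # hypothetical rule id
                "helpUri": "https://watchdog.example/rules/SEC-EXAMPLE",  # hypothetical URL
                # CWE-798 is "Use of Hard-coded Credentials"; the tag format is an assumption.
                "properties": {"tags": ["security", "external/cwe/cwe-798"]},
            }],
        }},
        "results": [{
            "ruleId": "SEC-EXAMPLE",
            "level": "error",
            "message": {"text": "Hard-coded credential"},
            "locations": [{"physicalLocation": {
                "artifactLocation": {"uri": "src/Api/Startup.cs"},
                "region": {"startLine": 27},
            }}],
            # Stable identity that survives line shifts; the key name is hypothetical.
            "partialFingerprints": {"watchdogFingerprint/v1": "a1f3c2d4e5b6a7f8"},
        }],
    }],
}

print(json.dumps(sarif_log, indent=2))
```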

The hollow code that compiles

What we catch that line-scanners pass.

The failure mode of fast, high-volume code isn't bad syntax — a linter catches that. It's code that *looks finished and does nothing*: stubs under confident names, tests that assert nothing, errors quietly swallowed. It type-checks, it reads as done, it sails past a line-level scanner. Watchdog measures the hollowness itself — by shape, deterministically.

| Signature | What it looks like | Why a line-scanner passes it |
| --- | --- | --- |
| Stubs that look implemented `IC1` | A `CalculateTax` that returns 0, an `async` that never awaits, skeleton types, dead branches | Valid C#, type-checks. We weight a lone stub differently from pervasive scaffolding. |
| Tests that assert nothing `D10` | Green tests with no assertions; skipped tests dressed up as coverage | Coverage counts the test file; it never asks whether the test actually checks a result. |
| Errors made invisible `X3` | Empty `catch {}`, a bare rethrow that loses the stack trace | Type-based checks pass; the failure disappears. |
| Untracked debt & dead code `D17` | TODO/FIXME/HACK, blanket suppressions, commented-out code, unreferenced symbols | Counted raw at best. |
| Copy-paste, never parameterised `D4` | Near-duplicate blocks | Ours is type-aware and scored by density across the codebase. |

One signature, whoever wrote it

We don't guess whether a machine typed it: stylometric "AI detection" is a credibility trap, and we don't make the claim. We measure the hollowness by shape, so a rushed human and an eager model produce the *same* finding and the *same* fix. And it compounds: each signature feeds a per-file quality reading with diminishing returns and floors, the "one stub versus a hundred stubs" distinction a flat rule can't make.

Why it can't be papered over

The re-scan is a ratchet — the score is earned, not talked up.

Because the measurement is deterministic and the next scan re-runs it, a finding can't be dressed up or churned away — it either stops firing or it doesn't. This holds on every scheduled scan, with or without an agent.

Line-insensitive identity

A finding's identity is a hash of its dimension, title and file, never the line number. Reformat, rename a variable, move the block within its file: the finding stays put. You can't churn your way out of it.
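A minimal sketch of such a fingerprint, assuming SHA-256 over the three identity fields. The hash function, the separator, the truncation to 16 hex characters, and the example finding are all choices made for this illustration; the page only states which fields go in and that the line number stays out.

```python
import hashlib


def fingerprint(dimension: str, title: str, file: str) -> str:
    """Hash the identity fields; the line number is deliberately left out."""
    key = "\x1f".join((dimension, title, file))  # unit separator avoids ambiguous joins
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def fingerprint_of(finding: dict) -> str:
    return fingerprint(finding["dimension"], finding["title"], finding["file"])


before = {"dimension": "X3", "title": "Empty catch block",
          "file": "src/Orders/Importer.cs", "line": 88}
after = dict(before, line=131)  # same finding after the block moved down the file

print(fingerprint_of(before) == fingerprint_of(after))  # True
```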

Provisional until proven

An agent's "resolved" is a claim, not a verdict. The next scan re-measures and credits it only when the fingerprint is genuinely gone — and reopens it if the rule still fires. No self-reported fixes.
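The rule can be sketched as a function whose only evidence is the latest scan. The status names and the fingerprints below are hypothetical; the point is that the previous status, including an agent's provisional resolve, is not an input at all.

```python
from enum import Enum


class Status(Enum):
    OPEN = "open"
    PROVISIONAL = "provisional"  # the agent recorded a resolve; unverified
    FIXED = "fixed"


def after_rescan(fingerprint: str, latest_scan: set[str]) -> Status:
    """Settle a finding from the latest scan alone.

    The previous status is deliberately not an input: a provisional resolve
    and an untouched open finding are judged by exactly the same evidence.
    """
    return Status.OPEN if fingerprint in latest_scan else Status.FIXED


claimed = {"a1f3": Status.PROVISIONAL, "b7c9": Status.PROVISIONAL}
latest_scan = {"b7c9", "d2e4"}  # b7c9 still fires despite the claim

settled = {fp: after_rescan(fp, latest_scan) for fp in claimed}
print(settled["a1f3"].value, settled["b7c9"].value)  # fixed open
```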

Deterministic, floored scoring

Repeats of the same problem decay — each costs half the last — and every category has a floor, so you can't flood a file with cosmetic findings to game it. Same commit in, same score out.
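The halving rule is a geometric series: if the first occurrence costs c, then n repeats cost c(1 + 1/2 + 1/4 + ...), which never exceeds 2c. The sketch below applies that with a floor. The starting score, the first-occurrence cost, and the floor value are invented for the example; only the halving and the existence of a floor come from the text above.

```python
def category_score(repeats: int, start: float = 10.0,
                   first_cost: float = 1.0, floor: float = 4.0) -> float:
    """Score a category after `repeats` findings of the same kind.

    Each repeat costs half the previous one, and the result never drops
    below `floor`. All default values are illustrative assumptions.
    """
    penalty = sum(first_cost * 0.5 ** i for i in range(repeats))
    return max(floor, start - penalty)


print(category_score(1))                  # 9.0
print(category_score(2))                  # 8.5
print(category_score(3))                  # 8.25
print(category_score(3, first_cost=4.0))  # 4.0 (the floor holds)
```

The function is pure, so the same inputs always yield the same score, and the hundredth repeat moves the number far less than the first.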

This runs the same whether the code came from a coding assistant, a contractor, or a 2 a.m. hotfix. The deterministic re-scan is the ratchet under all of it.

The MCP surface

Twelve tools, four jobs, including a way to push back.

Everything runs over one HTTPS endpoint (/mcp), Bearer-token-scoped to a single repo. There is no write-to-repo tool — no commit, no push, no open-PR.
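MCP tool invocations are JSON-RPC 2.0 messages with the method `tools/call`, so a single call over that endpoint looks roughly like the request built below. The host name, the token placeholder, and the `dimension` argument name are assumptions; the sketch builds the request without sending it, and a real session would first go through the protocol's `initialize` handshake, which MCP client libraries normally handle.

```python
import json
import urllib.request

ENDPOINT = "https://watchdog.example/mcp"  # hypothetical; use the URL shown for your repo
TOKEN = "REPLACE_WITH_REPO_TOKEN"          # placeholder for the per-repo bearer token


def tool_call(name: str, arguments: dict, request_id: int) -> urllib.request.Request:
    """Build (not send) an MCP tools/call request as a JSON-RPC 2.0 POST."""
    body = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
            "Accept": "application/json, text/event-stream",
        },
    )


# The argument name "dimension" is a guess; the page only shows get_task("D34").
req = tool_call("get_task", {"dimension": "D34"}, request_id=1)
print(req.get_method(), req.full_url)
print(json.loads(req.data)["params"])
```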

| Job | Tool | What it does |
| --- | --- | --- |
| Read & plan | `audit` | Repo health, lens scores, and a plan ranked by impact ÷ effort; start here |
| Read & plan | `next_task` | The single highest-leverage item, fully briefed |
| Read & plan | `get_task` | The full packet for one dimension (e.g. D34) |
| Read & plan | `list_findings` | Latest-scan findings: fingerprint, dimension, level, file:line |
| Read & plan | `get_finding` | One finding by fingerprint |
| Work the fix | `claim_finding` | A short lease (≈45 min) so two agents don't collide |
| Work the fix | `release_finding` | Hand a claimed finding back to the open queue |
| Work the fix | `resolve_finding` | Record a provisional fix; the next scan confirms by fingerprint delta |
| Push back | `dispute_finding` | Flag a scored finding as a false positive → human triage: Fixed or Declined |
| Push back | `flag_advisory` | Flag an advisory LLM-judged note as unhelpful; a separate channel that never touches the score |
| Push back | `report_detector_gap` | Report something we missed → improves the detector, never your score |
| Verify | `request_rescan` | A verify re-scan: opt-in per repo, off by default, counts toward your scan budget |
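The claim and release pair behaves like a short lease. The sketch below shows one way such a lease could work, as an in-memory table keyed by fingerprint. It is an assumption about the semantics, not a description of Watchdog's server: the class, the agent names, and the exact expiry handling are hypothetical, and the 45 minutes is only approximate.

```python
from datetime import datetime, timedelta, timezone

LEASE = timedelta(minutes=45)  # approximate lease length, taken from the table


class LeaseTable:
    """In-memory claim registry keyed by finding fingerprint."""

    def __init__(self) -> None:
        self._leases: dict[str, tuple[str, datetime]] = {}

    def claim(self, fingerprint: str, agent: str, now: datetime) -> bool:
        """Grant the lease unless another agent holds an unexpired one."""
        lease = self._leases.get(fingerprint)
        if lease is not None:
            holder, expires = lease
            if holder != agent and expires > now:
                return False
        self._leases[fingerprint] = (agent, now + LEASE)
        return True

    def release(self, fingerprint: str, agent: str) -> None:
        """Hand the finding back early; only the holder may release it."""
        lease = self._leases.get(fingerprint)
        if lease is not None and lease[0] == agent:
            del self._leases[fingerprint]


start = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)
leases = LeaseTable()
first = leases.claim("a1f3", "agent-a", start)                          # granted
clash = leases.claim("a1f3", "agent-b", start + timedelta(minutes=10))  # refused
later = leases.claim("a1f3", "agent-b", start + timedelta(minutes=46))  # lease expired
print(first, clash, later)  # True False True
```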

The agent proposes; the measurement disposes

No dispute, claim or resolve ever moves the score or erases a finding. A dispute goes to a human, and if it is declined, the finding carries a maintainer's note explaining why it stands, so the same one isn't re-litigated next scan. Disagreement is a first-class signal; it just isn't a back door to the number.

Open, not closed

The hand on your code is always yours.

Watchdog runs against a throwaway clone and exposes no way to write to your repo. That's deliberate: a measurer that also rewrites the thing it grades can't stay neutral, and you'd lose chain-of-custody on every change. Tools that auto-refactor inside their own engine make the opposite trade.

What Watchdog does

Serves prioritised, briefed findings over MCP; records a provisional resolve; re-measures on the next scan and proves what genuinely moved.

What it never does

Edit, commit, push, or open a PR; move the score on an agent's say-so; touch your working tree. The change — and the credit, and the chain of custody — is always yours.

Connect an agent

Each repository has an MCP endpoint and a scoped bearer token. Add them to your agent's MCP configuration (Claude Code, GitHub Copilot, Cursor, opencode, or your own) and the task list is live. The exact endpoint URL and the per-repo token are shown for each repository once it's surveyed. Verify-now re-scans are opt-in per repo (off by default) and count toward your scan budget.
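As a starting point, the sketch below renders a configuration entry in the `mcpServers` layout that several MCP clients read from a JSON file. Treat every detail as an assumption to check against your client's documentation: the server name `watchdog`, the `type`, `url` and `headers` keys, the environment variable names, and the example host are all illustrative.

```python
import json
import os


def mcp_server_entry(url: str, token: str) -> dict:
    """Build one remote-server entry carrying the per-repo bearer token."""
    return {
        "mcpServers": {
            "watchdog": {  # hypothetical server name
                "type": "http",
                "url": url,
                "headers": {"Authorization": f"Bearer {token}"},
            }
        }
    }


config = mcp_server_entry(
    # Hypothetical variable names; fall back to obvious placeholders.
    url=os.environ.get("WATCHDOG_MCP_URL", "https://watchdog.example/mcp"),
    token=os.environ.get("WATCHDOG_TOKEN", "REPLACE_WITH_REPO_TOKEN"),
)
print(json.dumps(config, indent=2))
```

Reading the token from the environment keeps it out of any file that might be committed alongside the configuration.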

Stop triaging findings by hand. Hand them to your agent.
