v2.1 is live — 20 QA skills, MCP KB, all three providers

You type a command. The AI team handles the rest.

20 slash commands. One for every QA job. Powered by Anthropic, OpenAI, or Claude Code.

Nitpick gives your coding agent a full QA team: scope decisions, page exploration, test generation, failure triage, release gates. Type a command — agents and curated skills do the work.

Self-hosted · Bring your own key (Anthropic, OpenAI, or Claude Code)

Claude Code
/qa-scope I just refactored the task creation form to use a new API endpoint

Nitpick · qa-scope

Scoped 3 pages from your change description:

/tasks/new · High · form directly modified
/tasks/:id/edit · High · same API endpoint used
/dashboard · Low · task count widget may be affected

Run /qa-run on each page, or nitpick run --scope targeted to test all three now.

Type /qa-run, /qa-triage, /qa-release-gate

Works with your stack

Playwright · test runtime
Anthropic · default LLM
OpenAI · alternate LLM
Claude Code · MCP client
Cursor · MCP client
Cline · MCP client
Model Context Protocol · KB protocol
Node.js ≥20 · runtime
Linear · coming soon
GitHub · coming soon
Slack · coming soon

The problem

You shouldn't have to think about QA. You should be shipping.

QA is a bottleneck on every team. Someone has to write the tests, run the regression suite, triage the failures, and decide what's safe to ship. That someone is usually an engineer who should be building.

Manual regression doesn't scale

Every release waits on a QA pass. Every pass takes a day or two. Every feature makes the pass longer. Eventually QA becomes the release cadence.

Writing E2E tests is a full-time job

Someone writes Playwright selectors, fills fixtures, debugs flaky assertions. It never stays done. It steals afternoons that should go to real work.

AI test generators don't remember anything

They generate once and forget. They don't discover your app, don't track what changed, don't classify failures, don't file bugs. Output, not a teammate.

Meet the agents

Not one tool. A team of specialist agents.

Nitpick maps directly onto how real QA orgs are structured. Senior reviews the scope. Juniors do the deep work. A Flow Lead handles cross-cutting scenarios. A Head of QA reports on trends.

Senior QA

The orchestrator

  • Reads your code change and decides scope
  • Proposes which pages look affected. You confirm
  • Spins up Junior QAs per page and per flow
  • Aggregates everything into one unified report

Junior QA

Per-page subagent

  • Explores a single page until saturation
  • Builds a Derived UI Model: fields, validations, conditionals, API calls
  • Generates 20+ Playwright tests per page
  • Retries failures with a classified fix ladder

Flow Lead

Multi-actor subagent

  • Coordinates Juniors across stages of a flow
  • Admin creates → candidate approves → admin finalizes
  • Handles cross-actor assertions across roles
  • Surfaces state-propagation bugs traditional E2E misses

Head of QA

Weekly health · Future release

  • Reads your KB after every run, tracks trends over time
  • Flags flaky tests, regressions, coverage gaps
  • Posts weekly Slack summary to your team channel
  • Files bug tickets in Linear / GitHub with dedupe

How it works

You give the command. Agents do the work.

Type /qa-scope or nitpick run and nitpick's agents handle discovery, testing, triage, and reporting — no selectors, no fixtures, no setup beyond a URL and a key.

  1. Discover

    Crawls your app as each role. Handles SPAs by expanding ARIA menus and clicking through the nav. Builds a page graph and saves auth state once, reused on every subsequent run.

  2. Plan

    Describe a change in plain English. Senior QA maps it to affected pages, proposes a scope, and waits for your sign-off before touching anything.

  3. Test

    Junior QAs run in parallel, one per page. Each one explores the UI, builds a model, generates real Playwright tests, runs them, and escalates failures with a fix ladder.

  4. Remember

    Page models, bugs, flaky tests, and your approvals all land in a local knowledge base. Query it from Claude Code or Cursor via MCP. Next run starts smarter than the last.

Under the hood

Agents and skills that do the work.

Every command you type routes to a curated skill and a provider-backed agent. Here's what runs beneath.

MCP knowledge base server

Run `nitpick mcp serve` to expose your KB as an MCP server. Claude Code, Cursor, Cline, and any MCP client can read page models, bugs, and flows without launching the full agent runtime.

20 qa-team skills

A full skill bundle: qa-scope, qa-explore, qa-flow, qa-triage, qa-release-gate, qa-kickoff, qa-smoke, qa-canary, qa-weekly-report, and 11 more, each shipped with structured frontmatter for host routing.

SPA-aware crawler

Handles React Router, Vue, Svelte, and custom client-side routing. ARIA-tree inventory + click-and-observe. LLM-assist fallback for heavy SPAs with `--strategy agent`.

Persistent knowledge base

Every page model, bug, and flaky test persists across runs. Second run knows everything the first learned. KB is plain JSON: commit it, copy it, diff it.
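
For a sense of what that looks like, here is a purely illustrative page-model entry, built from the fields the Junior QA section describes; the real schema may differ:

```json
{
  "page": "/tasks/new",
  "fields": [
    { "name": "title", "required": true, "validation": "non-empty" }
  ],
  "conditionals": ["assignee picker renders only after a project is selected"],
  "api_calls": [{ "method": "POST", "path": "/api/tasks" }],
  "last_tested": "2025-01-15T02:14:00Z"
}
```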

Multi-role flows

Test admin-creates-then-candidate-approves-then-admin-finalizes flows with cross-actor assertions. Not just per-page tests.

Pluggable LLM providers

Anthropic (prompt caching), OpenAI (Skills + parallel tool calls), Claude Code (subprocess + MCP). All three backends consume the MCP KB natively.

Human-in-the-loop kickoff

Before generating tests, Junior QA presents its understanding of the page and asks for confirmation, like a real kickoff meeting. Non-interactive mode for CI.

Classified retry system

Failures are diagnosed as timing, selector drift, data collision, or real app bugs. Each gets an escalating fix ladder.
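
Conceptually, the ladder pairs each diagnosis with an ordered list of remedies. A sketch of the idea; the class names come from the line above, but the remedies are assumptions, not Nitpick's source:

```ts
// Sketch only: illustrative remedies per failure class, not Nitpick's internals.
type FailureClass = 'timing' | 'selector-drift' | 'data-collision' | 'app-bug';

const fixLadder: Record<FailureClass, string[]> = {
  'timing': ['wait for network idle', 'raise the step timeout', 'retry once'],
  'selector-drift': ['re-derive the selector from the page model', 'fall back to a role/name query'],
  'data-collision': ['regenerate unique fixture values', 'reset test data and rerun'],
  'app-bug': ['stop retrying', 'record it as a real bug'],
};
```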

API validation

Explorer intercepts fetch/XHR during submissions. Detects UI-shows-success-but-API-returns-500 mismatches that traditional E2E misses.
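
In Playwright terms, the mechanism looks roughly like this. A minimal sketch assuming a hypothetical task form; the page, selectors, and messages are illustrative:

```ts
import { test, expect } from '@playwright/test';

test('submit does not mask API failures', async ({ page }) => {
  const apiFailures: { url: string; status: number }[] = [];

  // Record any fetch/XHR response that comes back as a server error.
  page.on('response', (response) => {
    const type = response.request().resourceType();
    if ((type === 'fetch' || type === 'xhr') && response.status() >= 500) {
      apiFailures.push({ url: response.url(), status: response.status() });
    }
  });

  await page.goto('/tasks/new'); // assumes baseURL is configured
  await page.getByLabel('Title').fill('Demo task');
  await page.getByRole('button', { name: 'Create' }).click();

  await expect(page.getByText('Task created')).toBeVisible(); // the UI claims success
  expect(apiFailures).toHaveLength(0); // the API has to agree
});
```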

Unified HTML report

GitHub-styled dashboard with per-page drill-downs, inline failure details, persistence badges, and a bugs section.

Curated skills

Every QA job. One command away.

20 curated skills built for Claude Code, Cursor, and Codex. Describe what you need and the right agent handles it — from a quick smoke check to a full release gate.

Claude Code

$ /qa-smoke

✓ All 12 critical pages passed · 4.1s

$ /qa-scope I merged the billing rewrite

→ 4 pages scoped · confidence scores ready

$ /qa-triage

2 real bugs · 3 timing flakes · 1 selector drift

Daily work

  • Scope a change before a single test runs
  • Run tests and classify every failure automatically
  • Smoke check all critical pages in seconds
  • Triage and deduplicate failures from the last run

Before shipping

  • Get a go / no-go decision before merging
  • Catch regressions post-deploy with canary checks
  • Require explicit sign-off before tests are generated

Diagnosis

  • Drill into a failure: timing, selector, or real bug
  • Find the root cause of flaky tests with full history
  • Gate destructive actions behind a confirmation step
  • Explore any page and build a full UI model

Reporting and ops

  • Weekly pass rate, flake rate, and coverage delta
  • Audit every page with a last-tested timestamp
  • Rank pages by risk and recent failure rate
  • Manage auth state files across roles

How we're different

Not another AI test-gen SaaS.

Most AI test tools sell you a regenerating black box that resets on every run. Nitpick runs on your machine, learns your app, and gets sharper with each pass.

Nitpick

A QA team that stays with you

  • Self-hosted. Your infra, your data, nothing leaves your machine.
  • Bring your own API key. Pay the LLM directly with no markup on runs.
  • Persistent knowledge base: page models, bugs, and flake history survive across runs.
  • Query the KB from Claude Code, Cursor, or Cline via MCP at any time.
  • 20 role-based skills covering every QA job: scope, explore, write, triage, release gate.
  • Multi-actor flow testing with cross-role assertions built in.
  • Full audit trail: every LLM call and tool invocation logged per run.
  • Human-in-the-loop: your sign-off before tests are generated.

Typical AI test-gen tools

testRigor, Octomind, Qase AI, and the rest

  • Hosted SaaS. Your test data lives on their servers.
  • LLM cost bundled with markup on every run.
  • Regenerates tests from scratch each time. No persistent memory.
  • Results in a separate dashboard, not inside your coding agent.
  • One generic agent for everything. No role or skill specialization.
  • Per-page tests only. Cross-actor flows not supported.
  • Black box. You see outputs, not how they were produced.
  • Fire-and-forget generation. You review failures after the fact.

Roadmap

Where we are. Where we're going.

A clear picture of what's live today and what's next.

Shipping now

  • MCP KB server: query page models, bugs, flows from Claude Code, Cursor, or Cline
  • 20 qa-team skills covering every QA role, from daily testing to weekly reports
  • All three providers (Anthropic, OpenAI, Claude Code) read and write the KB natively
  • LLM-assist crawler for SPAs the standard crawler misses
  • Non-interactive mode for CI with no stdin blocking
  • OpenAI Skills via the Responses API (skills referenced by skill_id instead of re-sending ~15 KB of prompt on every call)
  • Resume interrupted runs at zero LLM cost for pages already passed
  • Unified HTML, Markdown, and JSON reports
  • nitpick mcp register: one command wires up Claude Code, Cursor, and Codex automatically

Coming next

  • Team mode: drop a .claude/ config into your app repo so the whole team shares the KB
  • Host packs for Codex CLI and Cursor with adapted skills and routing

On the horizon

  • GitHub App: run on every pull request, map diff to affected pages
  • Slack bot: post results to a channel, trigger runs with a slash command
  • Bug filer for Linear and GitHub Issues with deduplication
  • Visual regression via screenshot baseline diffing
  • Hosted version for teams that prefer not to self-host

Questions

The objections before they're asked.

Still stuck? Email us at vaibhav@kindflow.ai or file an issue on GitHub.

Can I use the knowledge base from Claude Code or Cursor?

Yes. Run `nitpick mcp serve --kb ./knowledge-base` and add it as an MCP server in your host's config. Claude Code, Cursor, Cline, and any MCP-aware client can then query page models, open bugs, and flow definitions without launching the full agent. The KB path needs to be absolute so your IDE resolves it correctly.
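
For example, a typical MCP host config entry (the exact file location varies by host, e.g. .cursor/mcp.json for Cursor; the server name is your choice):

```json
{
  "mcpServers": {
    "nitpick-kb": {
      "command": "nitpick",
      "args": ["mcp", "serve", "--kb", "/Users/you/my-app-qa/knowledge-base"]
    }
  }
}
```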

Does my app data get sent to the LLM?

What the crawler sees on a page (headings, form labels, error messages, visible text) becomes part of the LLM context. You control what gets crawled via scope.exclusions and which buttons never get clicked via terminal_guards_global. Nothing goes outside your configured LLM provider. Nitpick has no servers, no telemetry, and makes no outbound calls except to your provider.
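
A sketch of those two settings. The key names come from the answer above; the file shape and values here are illustrative, not a documented schema:

```json
{
  "scope": {
    "exclusions": ["/admin/billing", "/settings/api-keys"]
  },
  "terminal_guards_global": ["Delete account", "Send invoice"]
}
```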

How do I cap LLM costs?

Four levers today: max_iterations (safety cap on the agent loop), fast_model (use Haiku or gpt-4o-mini for cheap decisions), iterative scope (3-5 pages instead of everything), and critical-pages-only runs. The crawler and smoker are LLM-free, so they cost $0. A hard per-run dollar cap is on the roadmap.
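
The first two levers live in config. A hypothetical snippet (key names from above; the values and file shape are assumptions). The other two are run-time choices, like nitpick run --scope iterative:

```json
{
  "max_iterations": 40,
  "fast_model": "claude-3-5-haiku-latest"
}
```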

What if my app uses OAuth, MFA, or a custom login?

The default auth handles email and password. For OAuth or MFA, pre-populate knowledge-base/auth/<role>.json with a Playwright storage state captured manually. Nitpick reuses it on every subsequent run. A pluggable auth hook system is planned.
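
One way to capture that storage state by hand, using nothing beyond stock Playwright. The URL, role name, and runner are placeholders:

```ts
// capture-auth.ts: run once (e.g. with `npx tsx capture-auth.ts`), log in by hand, then resume.
import { chromium } from 'playwright';

const browser = await chromium.launch({ headless: false });
const context = await browser.newContext();
const page = await context.newPage();

await page.goto('https://staging.example.com/login'); // your app's login page
await page.pause(); // complete OAuth / MFA in the window, then press Resume in the inspector

await context.storageState({ path: 'knowledge-base/auth/admin.json' }); // <role>.json
await browser.close();
```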

Does it work on localhost, staging, or internal apps?

Yes. Nitpick needs the URL to be reachable from the machine running it. Localhost, VPN-only staging, and internal tools all work. Your LLM provider still needs internet unless you point it at an on-prem endpoint like Azure OpenAI or a self-hosted proxy.

What Node.js and Playwright versions are required?

Node.js 20 or newer. Playwright is pinned in package.json. You run npx playwright install --with-deps chromium once. No other runtime dependencies. Everything runs in one Node process.

What does a realistic monthly bill look like?

Rough estimates with prompt caching enabled. A 15-page app running a nightly full regression (~30 days × ~$40/run) lands at ~$1,000-1,500/month on Opus, ~$200-400/month on Sonnet. PR-triggered iterative runs are typically 3-5 pages, landing at ~$5-15 per run. The smoker is free (no LLM).

Get started

Four commands to your first run.

macOS and Linux. Node 20+. Bring your own Anthropic or OpenAI key, or route through Claude Code.

  • nitpick run --scope iterative --change "..." · iterative run
  • nitpick run --scope full --non-interactive · CI / nightly
  • nitpick crawl --strategy agent · LLM-assist crawler
  • nitpick mcp serve --kb ./knowledge-base · MCP KB server
~/my-app-qa
$ git clone https://github.com/vaibhav-kindflow/nitpick
$ ./scripts/install.sh # deps, playwright, nitpick on PATH
$ nitpick init # configure once per app
$ nitpick run --scope full
# then connect your IDE to the KB:
$ nitpick mcp serve --kb ./knowledge-base
→ KB now queryable from Claude Code, Cursor, Cline

Get in touch

Questions, feedback, or trying this on your team?

Reach out directly. No forms, no drip campaigns. Just a real reply from us.