v2.1 is live — 20 QA skills, MCP KB, all three providers

You type a command. The AI team handles the rest.

20 slash commands. One for every QA job. Powered by Anthropic, OpenAI, or Claude Code.

Nitpick gives your coding agent a full QA team: scope decisions, page exploration, test generation, failure triage, release gates. Type a command — agents and curated skills do the work.

Self-hosted · Bring your own key (Anthropic, OpenAI, or Claude Code)

Claude Code
/qa-scope I just refactored the task creation form to use a new API endpoint

Nitpick · qa-scope

Scoped 3 pages from your change description:

/tasks/new · High · form directly modified
/tasks/:id/edit · High · same API endpoint used
/dashboard · Low · task count widget may be affected

Run /qa-run on each page, or nitpick run --scope targeted to test all three now.

Type /qa-run, /qa-triage, /qa-release-gate

Works with your stack

Playwright · test runtime
Anthropic · default LLM
OpenAI · alternate LLM
Claude Code · MCP client
Cursor · MCP client
Cline · MCP client
Model Context Protocol · KB protocol
Node.js ≥20 · runtime
Linear · coming soon
GitHub · coming soon
Slack · coming soon

The problem

You shouldn't have to think about QA. You should be shipping.

QA is a bottleneck on every team. Someone has to write the tests, run the regression suite, triage the failures, and decide what's safe to ship. That someone is usually an engineer who should be building.

Manual regression doesn't scale

Every release waits on a QA pass. Every pass takes a day or two. Every feature makes the pass longer. Eventually QA becomes the release cadence.

Writing E2E tests is a full-time job

Someone writes Playwright selectors, fills fixtures, debugs flaky assertions. It never stays done. It steals afternoons that should go to real work.

AI test generators don't remember anything

They generate once and forget. They don't discover your app, don't track what changed, don't classify failures, don't file bugs. Output, not a teammate.

Meet the agents

Not one tool. A team of specialist agents.

Nitpick maps directly onto how real QA orgs are structured. Senior reviews the scope. Juniors do the deep work. A Flow Lead handles cross-cutting scenarios. A Head of QA reports on trends.

Senior QA

The orchestrator

  • Reads your code change and decides scope
  • Proposes which pages look affected. You confirm
  • Spins up Junior QAs per page and per flow
  • Aggregates everything into one unified report

Junior QA

Per-page subagent

  • Explores a single page until saturation
  • Builds a Derived UI Model: fields, validations, conditionals, API calls
  • Generates 20+ Playwright tests per page
  • Retries failures with a classified fix ladder

Flow Lead

Multi-actor subagent

  • Coordinates Juniors across stages of a flow
  • Admin creates → candidate approves → admin finalizes
  • Handles cross-actor assertions across roles
  • Surfaces state-propagation bugs traditional E2E misses

Head of QA

Weekly health · Future release

  • Reads your KB after every run, tracks trends over time
  • Flags flaky tests, regressions, coverage gaps
  • Posts weekly Slack summary to your team channel
  • Files bug tickets in Linear / GitHub with dedupe

How it works

You give the command. Agents do the work.

Type /qa-scope or nitpick run and nitpick's agents handle discovery, testing, triage, and reporting — no selectors, no fixtures, no setup beyond a URL and a key.

  1. Discover

    Crawls your app as each role. Handles SPAs by expanding ARIA menus and clicking through the nav. Builds a page graph and saves auth state once, reused on every subsequent run.

  2. Plan

    Describe a change in plain English. Senior QA maps it to affected pages, proposes a scope, and waits for your sign-off before touching anything.

  3. Test

    Junior QAs run in parallel, one per page. Each one explores the UI, builds a model, generates real Playwright tests, runs them, and escalates failures with a fix ladder.

  4. Remember

    Page models, bugs, flaky tests, and your approvals all land in a local knowledge base. Query it from Claude Code or Cursor via MCP. Next run starts smarter than the last.

Under the hood

Agents and skills that do the work.

Every command you type routes to a curated skill and a provider-backed agent. Here's what runs beneath.

MCP knowledge base server

Run `nitpick mcp serve` to expose your KB as an MCP server. Claude Code, Cursor, Cline, and any MCP client can read page models, bugs, and flows without launching the full agent runtime.

20 qa-team skills

A full skill bundle: qa-scope, qa-explore, qa-flow, qa-triage, qa-release-gate, qa-kickoff, qa-smoke, qa-canary, qa-weekly-report, and 11 more, each shipped with structured frontmatter for host routing.

SPA-aware crawler

Handles React Router, Vue, Svelte, and custom client-side routing. ARIA-tree inventory + click-and-observe. LLM-assist fallback for heavy SPAs with `--strategy agent`.

Persistent knowledge base

Every page model, bug, and flaky test persists across runs. Second run knows everything the first learned. KB is plain JSON: commit it, copy it, diff it.
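
For a sense of what that looks like, here is a purely illustrative page-model entry, built from the fields the Junior QA section describes; the real schema may differ:

```json
{
  "page": "/tasks/new",
  "fields": [
    { "name": "title", "required": true, "validation": "non-empty" }
  ],
  "conditionals": ["assignee picker renders only after a project is selected"],
  "api_calls": [{ "method": "POST", "path": "/api/tasks" }],
  "last_tested": "2025-01-15T02:14:00Z"
}
```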

Multi-role flows

Test admin-creates-then-candidate-approves-then-admin-finalizes flows with cross-actor assertions. Not just per-page tests.

Pluggable LLM providers

Anthropic (prompt caching), OpenAI (Skills + parallel tool calls), Claude Code (subprocess + MCP). All three backends consume the MCP KB natively.

Human-in-the-loop kickoff

Before generating tests, Junior QA presents its understanding of the page and asks for confirmation, like a real kickoff meeting. Non-interactive mode for CI.

Classified retry system

Failures are diagnosed as timing, selector drift, data collision, or real app bugs. Each gets an escalating fix ladder.
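
Conceptually, the ladder pairs each diagnosis with an ordered list of remedies. A sketch of the idea; the class names come from the line above, but the remedies are assumptions, not Nitpick's source:

```ts
// Sketch only: illustrative remedies per failure class, not Nitpick's internals.
type FailureClass = 'timing' | 'selector-drift' | 'data-collision' | 'app-bug';

const fixLadder: Record<FailureClass, string[]> = {
  'timing': ['wait for network idle', 'raise the step timeout', 'retry once'],
  'selector-drift': ['re-derive the selector from the page model', 'fall back to a role/name query'],
  'data-collision': ['regenerate unique fixture values', 'reset test data and rerun'],
  'app-bug': ['stop retrying', 'record it as a real bug'],
};
```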

API validation

Explorer intercepts fetch/XHR during submissions. Detects UI-shows-success-but-API-returns-500 mismatches that traditional E2E misses.
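
In Playwright terms, the mechanism looks roughly like this. A minimal sketch assuming a hypothetical task form; the page, selectors, and messages are illustrative:

```ts
import { test, expect } from '@playwright/test';

test('submit does not mask API failures', async ({ page }) => {
  const apiFailures: { url: string; status: number }[] = [];

  // Record any fetch/XHR response that comes back as a server error.
  page.on('response', (response) => {
    const type = response.request().resourceType();
    if ((type === 'fetch' || type === 'xhr') && response.status() >= 500) {
      apiFailures.push({ url: response.url(), status: response.status() });
    }
  });

  await page.goto('/tasks/new'); // assumes baseURL is configured
  await page.getByLabel('Title').fill('Demo task');
  await page.getByRole('button', { name: 'Create' }).click();

  await expect(page.getByText('Task created')).toBeVisible(); // the UI claims success
  expect(apiFailures).toHaveLength(0); // the API has to agree
});
```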

Unified HTML report

GitHub-styled dashboard with per-page drill-downs, inline failure details, persistence badges, and a bugs section.

Curated skills

Every QA job. One command away.

20 curated skills built for Claude Code, Cursor, and Codex. Describe what you need and the right agent handles it — from a quick smoke check to a full release gate.

Claude Code

$ /qa-smoke

✓ All 12 critical pages passed · 4.1s

$ /qa-scope I merged the billing rewrite

→ 4 pages scoped · confidence scores ready

$ /qa-triage

2 real bugs · 3 timing flakes · 1 selector drift

Daily work

  • Scope a change before a single test runs
  • Run tests and classify every failure automatically
  • Smoke check all critical pages in seconds
  • Triage and deduplicate failures from the last run

Before shipping

  • Get a go / no-go decision before merging
  • Catch regressions post-deploy with canary checks
  • Require explicit sign-off before tests are generated

Diagnosis

  • Drill into a failure: timing, selector, or real bug
  • Find the root cause of flaky tests with full history
  • Gate destructive actions behind a confirmation step
  • Explore any page and build a full UI model

Reporting and ops

  • Weekly pass rate, flake rate, and coverage delta
  • Audit every page with a last-tested timestamp
  • Rank pages by risk and recent failure rate
  • Manage auth state files across roles

How we're different

Not another AI test-gen SaaS.

Most AI test tools sell you a regenerating black box that resets on every run. Nitpick runs on your machine, learns your app, and gets sharper with each pass.

Nitpick

A QA team that stays with you

  • Self-hosted. Your infra, your data, nothing leaves your machine.
  • Bring your own API key. Pay the LLM directly with no markup on runs.
  • Persistent knowledge base: page models, bugs, and flake history survive across runs.
  • Query the KB from Claude Code, Cursor, or Cline via MCP at any time.
  • 20 role-based skills covering every QA job: scope, explore, write, triage, release gate.
  • Multi-actor flow testing with cross-role assertions built in.
  • Full audit trail: every LLM call and tool invocation logged per run.
  • Human-in-the-loop: your sign-off before tests are generated.

Typical AI test-gen tools

testRigor, Octomind, Qase AI, and the rest

  • Hosted SaaS. Your test data lives on their servers.
  • LLM cost bundled with markup on every run.
  • Regenerates tests from scratch each time. No persistent memory.
  • Results in a separate dashboard, not inside your coding agent.
  • One generic agent for everything. No role or skill specialization.
  • Per-page tests only. Cross-actor flows not supported.
  • Black box. You see outputs, not how they were produced.
  • Fire-and-forget generation. You review failures after the fact.

Roadmap

Where we are. Where we're going.

A clear picture of what's live today and what's next.

Shipping now

  • MCP KB server: query page models, bugs, flows from Claude Code, Cursor, or Cline
  • 20 qa-team skills covering every QA role, from daily testing to weekly reports
  • All three providers (Anthropic, OpenAI, Claude Code) read and write the KB natively
  • LLM-assist crawler for SPAs the standard crawler misses
  • Non-interactive mode for CI with no stdin blocking
  • OpenAI Skills via the Responses API (skills referenced by skill_id instead of re-sending ~15 KB of prompt on every call)
  • Resume interrupted runs at zero LLM cost for pages already passed
  • Unified HTML, Markdown, and JSON reports
  • nitpick mcp register: one command wires up Claude Code, Cursor, and Codex automatically

Coming next

  • Team mode: drop a .claude/ config into your app repo so the whole team shares the KB
  • Host packs for Codex CLI and Cursor with adapted skills and routing

On the horizon

  • GitHub App: run on every pull request, map diff to affected pages
  • Slack bot: post results to a channel, trigger runs with a slash command
  • Bug filer for Linear and GitHub Issues with deduplication
  • Visual regression via screenshot baseline diffing
  • Hosted version for teams that prefer not to self-host

Questions

The objections before they're asked.

Still stuck? Email us at vaibhav@kindflow.ai or file an issue on GitHub.

Can I use the knowledge base from Claude Code or Cursor?

Yes. Run `nitpick mcp serve --kb ./knowledge-base` and add it as an MCP server in your host's config. Claude Code, Cursor, Cline, and any MCP-aware client can then query page models, open bugs, and flow definitions without launching the full agent. The KB path needs to be absolute so your IDE resolves it correctly.
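
For example, a typical MCP host config entry (the exact file location varies by host, e.g. .cursor/mcp.json for Cursor; the server name is your choice):

```json
{
  "mcpServers": {
    "nitpick-kb": {
      "command": "nitpick",
      "args": ["mcp", "serve", "--kb", "/Users/you/my-app-qa/knowledge-base"]
    }
  }
}
```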

Does my app data get sent to the LLM?

What the crawler sees on a page (headings, form labels, error messages, visible text) becomes part of the LLM context. You control what gets crawled via scope.exclusions and which buttons never get clicked via terminal_guards_global. Nothing goes outside your configured LLM provider. Nitpick has no servers, no telemetry, and makes no outbound calls except to your provider.
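
A sketch of those two settings. The key names come from the answer above; the file shape and values here are illustrative, not a documented schema:

```json
{
  "scope": {
    "exclusions": ["/admin/billing", "/settings/api-keys"]
  },
  "terminal_guards_global": ["Delete account", "Send invoice"]
}
```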

How do I cap LLM costs?

Four levers today: max_iterations (safety cap on the agent loop), fast_model (use Haiku or gpt-4o-mini for cheap decisions), iterative scope (3-5 pages instead of everything), and critical-pages-only runs. The crawler and smoker are LLM-free, so they cost $0. A hard per-run dollar cap is on the roadmap.
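
The first two levers live in config. A hypothetical snippet (key names from above; the values and file shape are assumptions). The other two are run-time choices, like nitpick run --scope iterative:

```json
{
  "max_iterations": 40,
  "fast_model": "claude-3-5-haiku-latest"
}
```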

What if my app uses OAuth, MFA, or a custom login?

The default auth handles email and password. For OAuth or MFA, pre-populate knowledge-base/auth/<role>.json with a Playwright storage state captured manually. Nitpick reuses it on every subsequent run. A pluggable auth hook system is planned.
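
One way to capture that storage state by hand, using nothing beyond stock Playwright. The URL, role name, and runner are placeholders:

```ts
// capture-auth.ts: run once (e.g. with `npx tsx capture-auth.ts`), log in by hand, then resume.
import { chromium } from 'playwright';

const browser = await chromium.launch({ headless: false });
const context = await browser.newContext();
const page = await context.newPage();

await page.goto('https://staging.example.com/login'); // your app's login page
await page.pause(); // complete OAuth / MFA in the window, then press Resume in the inspector

await context.storageState({ path: 'knowledge-base/auth/admin.json' }); // <role>.json
await browser.close();
```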

Does it work on localhost, staging, or internal apps?

Yes. Nitpick needs the URL to be reachable from the machine running it. Localhost, VPN-only staging, and internal tools all work. Your LLM provider still needs internet unless you point it at an on-prem endpoint like Azure OpenAI or a self-hosted proxy.

What Node.js and Playwright versions are required?

Node.js 20 or newer. Playwright is pinned in package.json. You run npx playwright install --with-deps chromium once. No other runtime dependencies. Everything runs in one Node process.

What does a realistic monthly bill look like?

Rough estimates with prompt caching enabled. A 15-page app running a nightly full regression (~30 days × ~$40/run) lands at ~$1,000-1,500/month on Opus, ~$200-400/month on Sonnet. PR-triggered iterative runs are typically 3-5 pages, landing at ~$5-15 per run. The smoker is free (no LLM).

Get started

Four commands to your first run.

macOS and Linux. Node 20+. Bring your own Anthropic or OpenAI key, or route through Claude Code.

  • nitpick run --scope iterative --change "..." · iterative run
  • nitpick run --scope full --non-interactive · CI / nightly
  • nitpick crawl --strategy agent · LLM-assist crawler
  • nitpick mcp serve --kb ./knowledge-base · MCP KB server
~/my-app-qa
$ git clone https://github.com/vaibhav-kindflow/nitpick
$ ./scripts/install.sh # deps, playwright, nitpick on PATH
$ nitpick init # configure once per app
$ nitpick run --scope full
# then connect your IDE to the KB:
$ nitpick mcp serve --kb ./knowledge-base
→ KB now queryable from Claude Code, Cursor, Cline

Get in touch

Questions, feedback, or trying this on your team?

Reach out directly. No forms, no drip campaigns. Just a real reply from us.