██████╗ ███████╗ ███████╗ ██████╗ ███████╗ ██╗ ██████╗ ██╗ ██╗
██╔══██╗ ██╔════╝ ██╔════╝ ██╔══██╗ ██╔════╝ ██║ ██╔═══██╗ ██║ ██║
██║ ██║ █████╗ █████╗ ██████╔╝ █████╗ ██║ ██║ ██║ ██║ █╗ ██║
██║ ██║ ██╔══╝ ██╔══╝ ██╔═══╝ ██╔══╝ ██║ ██║ ██║ ██║███╗██║
██████╔╝ ███████╗ ███████╗ ██║ ██║ ███████╗ ╚██████╔╝ ╚███╔███╔╝
╚═════╝ ╚══════╝ ╚══════╝ ╚═╝ ╚═╝ ╚══════╝ ╚═════╝ ╚══╝╚══╝
Doing reveals what thinking can't predict
Quick Start • Two Modes • Commands • What It Rejects • Principles
You can't foresee what you don't know to ask. Doing reveals — at every layer.
Most spec-driven frameworks start from a finished spec and execute a static plan. Deepflow treats the entire process as discovery: asking reveals hidden requirements, debating reveals blind spots, spiking reveals technical risks, implementing reveals edge cases. Each step makes the next one sharper.
Deepflow started with adversarial selection: one AI evaluated another AI's code in a fresh context. The "doing reveals" philosophy applied to the system itself — we discovered that LLM judging LLM produces gaming: agents that estimated instead of measuring, simulated instead of implementing, presented shortcuts as deliverables.
The fix: eliminate subjective judgment. Only objective metrics decide. Tests created by the agent itself are excluded from the baseline to prevent self-validation. We call this a ratchet — inspired by Karpathy's autoresearch: a mechanism where the metric can only improve, never regress. Each cycle ratchets quality forward.
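As a minimal sketch of the ratchet idea (function and file names are hypothetical, not deepflow's actual implementation), the gate reduces to a comparison against a persisted baseline:

```shell
# Hypothetical ratchet gate: accept a change only when the objective
# metric meets or beats the recorded baseline, then advance the baseline.
ratchet() {  # usage: ratchet <score> <baseline-file>
  score=$1; file=$2
  baseline=$(cat "$file" 2>/dev/null)
  baseline=${baseline:-0}
  if [ "$score" -ge "$baseline" ]; then
    echo "$score" > "$file"   # ratchet forward: new floor
    echo accept
  else
    echo reject               # metric regressed: never merged
  fi
}
```

Because the baseline only moves up, a later cycle can never quietly undo an earlier gain.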
# Install (or update)
npx deepflow
# Uninstall
npx deepflow --uninstall
The installer configures granular permissions so background agents can read, write, run git, and execute health checks (build/test/typecheck/lint) without blocking on approval prompts. All permissions are scoped and cleaned up on uninstall.
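For illustration, scoped Claude Code permissions of this shape live in .claude/settings.json; the entries below are examples of the pattern syntax, not necessarily the exact rules deepflow writes:

```json
{
  "permissions": {
    "allow": [
      "Read",
      "Edit",
      "Bash(git add:*)",
      "Bash(git commit:*)",
      "Bash(npm run build)",
      "Bash(npm test:*)"
    ]
  }
}
```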
You explore the problem, shape the spec, and trigger execution — all inside a Claude Code session.
claude
# 1. Discover — understand the problem before solving it
/df:discover image-upload
# "Why do you need image upload? What exists today?
# What file sizes? What formats? Where are images stored?
# What does 'done' look like? What should this NOT do?"
# 2. Debate — stress-test the approach (optional)
/df:debate upload-strategy
# User Advocate: "Drag-and-drop is table stakes, not a feature"
# Tech Skeptic: "Client-side resize before upload, or you'll hit memory limits"
# Systems Thinker: "What happens when storage goes down mid-upload?"
# LLM Efficiency: "Split this into two specs: upload + processing"
# 3. Spec — now the conversation is rich enough to produce a solid spec
/df:spec image-upload
# 4-6: the AI takes over
/df:plan # Compare spec to code, create tasks
/df:execute # Parallel agents in worktree, ratchet validates
/df:verify # Check spec satisfied, merge to main
What requires you: Steps 1-3 (defining the problem and approving the spec). Steps 4-6 run autonomously but you trigger each one and can intervene.
The human loop comes first — discover and debate are where intent gets shaped. You refine the problem, stress-test ideas, and produce a spec that captures what you actually need. That's the living contract. Then you hand it off.
# First: the human loop — discover, debate, refine until the spec is solid
$ claude
> /df:discover auth
> /df:debate auth-strategy
> /df:spec auth # specs/auth.md — the handoff point
> /exit
# Then: the AI loop — plan, execute, validate, merge
$ claude
> /df:auto
# Next morning
$ cat .deepflow/auto-report.md
$ git log --oneline
What the AI does alone:
- Runs /df:plan if no PLAN.md exists
- Loops /df:auto-cycle — fresh context each cycle
- Runs /df:verify and merges to main

Safety: never pushes to remote. Failed approaches are recorded in .deepflow/experiments/ and never repeated. Specs are validated before processing.
HUMAN LOOP AI LOOP
───────────────────────────────── ──────────────────────────────────
/df:discover — ask, surface gaps /df:plan — compare spec to code
/df:debate — stress-test approach /df:execute — spike, implement
/df:spec — produce living contract /df:verify — health checks, merge
↻ refine until solid ↻ retry until converged
───────────────────────────────── ──────────────────────────────────
specs/*.md is the handoff point
Spec lifecycle: feature.md (new) → doing-feature.md (in progress) → done-feature.md (decisions extracted, then deleted)
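The lifecycle amounts to renames in specs/ (sketched here as plain mv; deepflow performs these transitions itself):

```shell
# Spec lifecycle as file renames (illustrative, not run by hand)
mkdir -p specs
echo "# feature" > specs/feature.md              # new spec
mv specs/feature.md specs/doing-feature.md       # execution starts
mv specs/doing-feature.md specs/done-feature.md  # spec satisfied
rm specs/done-feature.md                         # decisions extracted, spec deleted
```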
| Command | Purpose |
|---|---|
| /df:discover <name> | Explore problem space with Socratic questioning |
| /df:debate <topic> | Multi-perspective analysis (4 agents) |
| /df:spec <name> | Generate spec from conversation |
| /df:plan | Compare specs to code, create tasks |
| /df:execute | Run tasks with parallel agents |
| /df:verify | Check specs satisfied (L0-L5), merge to main |
| /df:update | Update deepflow to latest |
| /df:auto | Autonomous mode (plan → loop → verify, no human needed) |
your-project/
+-- specs/
|   +-- auth.md             # new spec
|   +-- doing-upload.md     # in progress
+-- PLAN.md                 # active tasks
+-- .deepflow/
    +-- config.yaml         # project settings
    +-- decisions.md        # auto-extracted + ad-hoc decisions
    +-- auto-report.md      # morning report (autonomous mode)
    +-- auto-memory.yaml    # cross-cycle learning
    +-- token-history.jsonl # per-render token usage (auto)
    +-- experiments/        # spike results (pass/fail)
    +-- worktrees/          # isolated execution
        +-- upload/         # one worktree per spec
Deepflow's design isn't opinionated — it's a direct response to measured LLM limitations:
Focused tasks > giant context — LLMs lose ~2% effectiveness per 100K additional tokens, even on trivial tasks (Chroma "Context Rot", 2025, 18 models tested). Accuracy drops from 89% at 8K tokens to 25% at 1M tokens (Augment Code, 2025). Deepflow keeps each task's context minimal and focused instead of loading the entire codebase.
Search efficiency > model capability — Coding agents spend 60% of their time searching, not coding (Cognition, 2025). Input tokens dominate cost with up to 10x variance driven entirely by search efficiency, not coding ability. Deepflow's LSP-first search and 3-phase explore protocol (DIVERSIFY/CONVERGE/EARLY STOP) minimize search waste.
The framework matters more than the model — Same model, same tasks, different orchestration: 25.6 percentage point swing on SWE-Bench Lite (GPT-4: 2.7% with naive retrieval vs 28.3% with structured orchestration). On SWE-Bench Pro, three products using the same model scored 17 problems apart on 731 issues — the only difference was how they managed context, search, and edits. Deepflow is that orchestration layer.
Tool use > context stuffing — Information in the middle of context has up to 40% less recall than at the start/end (Lost in the Middle, 2024, Stanford/TACL). LongMemEval (ICLR 2025) found GPT-4o scoring 60-64% at full context vs 87-92% with oracle retrieval. Agents access code on-demand via LSP (findReferences, incomingCalls) and grep — always fresh, no attention dilution.
Fresh context beats long sessions — Every AI agent's success rate decreases after 35 minutes of equivalent task time; doubling duration quadruples failure rate. Deepflow's autonomous mode (/df:auto) starts a fresh context each cycle — checkpoint state, not conversation history.
Input:output ratio matters — Agent token ratio is ~100:1 input to output (Manus, 2025). Deepflow truncates ratchet output (success = zero tokens), context-forks high-ratio skills, and strips prompt sections by effort level to keep the ratio low.
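A sketch of the "success = zero tokens" idea (the wrapper name is hypothetical): a passing check emits nothing, a failing one emits only its log tail, so input tokens stay low either way.

```shell
# Hypothetical wrapper around any health check: silence on success,
# truncated log tail on failure.
quiet_check() {  # usage: quiet_check <command...>
  log=$(mktemp)
  if "$@" > "$log" 2>&1; then
    rm -f "$log"          # success costs zero tokens
  else
    tail -n 20 "$log"     # only the failing tail reaches the model
    rm -f "$log"
    return 1
  fi
}
```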
Model routing > one-size-fits-all — Mechanical tasks with cheap models (haiku), complex tasks with powerful models (opus). Fewer tokens per task = less degradation = better results. Effort-aware context budgets strip unnecessary sections from prompts for simpler tasks.
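Routing by effort can be as simple as a lookup; the effort labels and model names below are illustrative, not deepflow's actual routing table.

```shell
# Illustrative effort-to-model routing
model_for() {  # usage: model_for <effort: low|medium|high>
  case "$1" in
    low)  echo haiku  ;;  # mechanical edits, renames, formatting
    high) echo opus   ;;  # cross-cutting or ambiguous work
    *)    echo sonnet ;;  # default middle tier
  esac
}
```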
Prompt order follows attention — Execute prompts follow the attention U-curve: critical instructions (task definition, failure history, success criteria) at start and end, navigable data (impact analysis, dependency context) in the middle. Distractors eliminated by design.
LSP-powered impact analysis — Plan-time uses findReferences and incomingCalls to map blast radius precisely. Execute-time runs a freshness check before implementing — catching callers added after planning. Grep as fallback — though embedding-based retrieval has a hard mathematical ceiling (Google DeepMind, 2025) that LSP doesn't share.
| Skill | Purpose |
|---|---|
| browse-fetch | Fetch external API docs via headless Chromium (replaces context-hub) |
| browse-verify | L5 browser verification — Playwright a11y tree assertions |
| atomic-commits | One logical change per commit |
| code-completeness | Find TODOs, stubs, and missing implementations |
| gap-discovery | Surface missing requirements during ideation |
MIT