fix: ignore binary/data files in skills watcher to prevent FD exhaustion by household-bard · Pull Request #11325 · openclaw/openclaw

household-bard · 2026-02-07T18:04:53Z

Problem

A skills directory containing thousands of binary files (e.g. .wav training data, .onnx models, .joblib classifiers) causes chokidar to open every file, exhausting the process file-descriptor limit. Once FDs are exhausted, all exec/spawn calls fail with EBADF, and the agent cannot self-heal because the fix requires the very exec that is broken.

Root Cause

DEFAULT_SKILLS_WATCH_IGNORED in src/agents/skills/refresh.ts only ignores .git, node_modules, and dist directories. Binary files are watched despite having no relevance to skill definitions.

Fix

Add a regex to ignore common binary and data file extensions (.wav, .mp3, .onnx, .pt, .bin, .zip, etc.). The skills watcher only needs to react to skill definition changes (.md, .py, .ts, .json), never to bulk training data or model weights.

Discovered

The hard way: 11,071 .wav files in a wake-word training directory exhausted all FDs on macOS, causing 6+ hours of complete exec failure. The circular nature of the bug (resource exhaustion prevents the commands needed to fix it) made manual intervention necessary.

Testing

Existing test passes: pnpm vitest run src/agents/skills/refresh.test.ts ✅
Full unit test suite: 5,383 tests pass ✅
Deployed and running in production on our gateway ✅

Greptile Overview

Greptile Summary

This PR updates the skills filesystem watcher (src/agents/skills/refresh.ts) by expanding DEFAULT_SKILLS_WATCH_IGNORED to include a case-insensitive regex that ignores a list of common binary/model/archive/data extensions. The intent is to prevent chokidar from opening thousands of irrelevant files in large skill trees (e.g., training datasets or model weights), which can exhaust file descriptors and cause downstream exec/spawn failures.

The change plugs into the existing watcher creation in ensureSkillsWatcher(...), where ignored: DEFAULT_SKILLS_WATCH_IGNORED is passed to chokidar.watch(...), so it directly impacts what files are watched without altering watcher scheduling/version bump logic.

Confidence Score: 4/5

This PR is likely safe to merge; it narrows the watcher scope to avoid FD exhaustion with minimal behavioral impact.
The change is isolated to the chokidar ignore list and doesn’t alter watcher lifecycle or debounce/version bump logic. Main risk is accidental over-ignoring of files whose changes should trigger a refresh, but the regex is anchored to extensions and is case-insensitive, which matches the stated intent.
src/agents/skills/refresh.ts

_{(2/5) Greptile learns from your feedback when you react with thumbs up/down!}

A skills directory containing thousands of binary files (e.g. .wav training data, .onnx models) can cause chokidar to exhaust the process file-descriptor limit. This breaks all exec/spawn calls with EBADF and requires manual intervention to recover. Add a regex to DEFAULT_SKILLS_WATCH_IGNORED that skips common binary and data file extensions. The skills watcher only needs to react to skill definition changes (.md, .py, .ts, .json etc.), never to bulk training data or model files. Discovered the hard way: 11,071 .wav files in a wake-word training directory exhausted all FDs and caused 6+ hours of downtime.

- child.unref() on backgrounded processes (zombie prevention) - 30-second graceful shutdown timeout (prevents infinite hang) - configurable skills watcher ignore patterns (FD protection) - subagent concurrency limit (default 8, prevents runaway spawns) - memory manager interval timer unref (clean process exit) - process registry sweeper stop on shutdown (memory leak fix) All 5,396 tests passing. Tested on 24/7 household deployment.

- Timing-safe token comparison: pad to equal length before compare, eliminates token-length side channel in auth - SQLite corruption recovery: auto-detect and rebuild from scratch on SQLITE_CORRUPT/NOTADB/IOERR (memory search self-heals) - Config reload max-wait debounce: force reload after 3x debounce period, prevents indefinite delay from rapid editor saves - FD monitoring in health checks: reports open FD count/limit/percent on Linux and macOS (catches FD exhaustion before it kills exec) All tests passing. 256 lines added across 6 files.

The plugin hook system defined before_compaction and after_compaction hooks but they were never called — dead code. This wires them into the actual compaction flow in compact.ts: - before_compaction fires BEFORE session.compact() with messageCount + tokenCount - after_compaction fires AFTER with compactedCount - Both are non-fatal (errors logged as warnings, don't block compaction) - Hook context includes sessionKey + workspaceDir for plugin use This enables plugins to flush memory, save project state, create handoffs, or log session summaries before context is lost to compaction. 5 new tests for hook infrastructure. 6,546/6,546 full suite passing.

…e /compact New bundled hook that fires on 'command:compact' and injects a system message reminding the agent to: - Update daily logs with session progress - Update PROJECT-STATE.md files touched this session - Update PROJECTS.md last-touched dates - Create handoff if mid-complex-work This ensures the agent has a chance to persist context before compaction discards conversation history. 4 tests: inject on compact, skip non-compact, skip non-command, content validation.

Fixes critical issue found by 3-way code review (Opus + Sonnet consensus): 1. Wire triggerInternalHook('command:compact') in commands-compact.ts - The pre-compaction-flush handler was dead code — command:compact was never dispatched. Now fires before compaction, matching the pattern used by /new and /reset in commands-core.ts. - Hook messages are enqueued as system events for the agent to process in the compacted session's first turn. 2. Add 10-second timeout on plugin hook execution (compact.ts) - Prevents a hanging plugin from blocking compaction indefinitely. - Uses Promise.race with timeout rejection. 3. Add 2000-message threshold for token estimation (compact.ts) - Skips expensive token estimation for very large sessions. - tokenCount will be undefined for sessions >2000 messages. 4. Add architecture comment explaining dual hook systems - Internal hooks (command:compact) fire in commands-compact.ts - Plugin hooks (before/after_compaction) fire in compact.ts - Both are non-fatal with independent error handling. 5. Improve hook error logging with session context Code review: docs/CODE-REVIEW-OPUS.md, docs/CODE-REVIEW-SONNET.md, docs/CODE-REVIEW-SYNTHESIS.md

Changes by gpt-5.2-codex (Codex CLI), reviewed by Will: 1. Fix flush message wording — now post-compaction reminder, not misleading 'imminent' warning (handler.ts) 2. Remove unused 'vi' import, fix createHookRunner() to use proper PluginRegistry type instead of [] (compact-hooks.test.ts) 3. Update test assertions for new message wording (handler.test.ts) 4. Move tokensAfter computation before after_compaction hook call, pass it as tokenCount instead of undefined (compact.ts) 9/9 hook tests passing. Co-authored-by: Codex CLI <codex@openai.com>

Resolved 2 conflicts in skills watcher (refresh.ts, refresh.test.ts): - Merged upstream Python venv/cache ignore patterns with our binary extension patterns - Combined both test suites with vi.hoisted() for mock compatibility - Used dynamic imports in tests to work with chokidar mock 8 files auto-merged cleanly (verified by Codex investigation agent): - bash-process-registry.ts, bash-tools.exec.ts, compact.ts - subagent-registry.ts, zod-schema.ts, auth.ts - config-reload.ts, memory/manager.ts Our 7 custom commits preserved: - Skills watcher binary file ignore + configurable patterns - Compaction hooks (before/after) wiring - Pre-compaction flush bundled hook - Resource leak fixes (child unref, sweeper, concurrency guards) - Auth/config-reload safety improvements - Health command FD diagnostics - Config schema for watchIgnoredPatterns Investigation: Agent A (Codex, 80K tokens) reviewed all 10 merge files. Risk rating: 4/10 (moderate-low). Test suite: 7,114 passed, 0 failed (1,027 test files).

…ions Adds a thin HTTP proxy that forwards requests directly to the configured upstream LLM provider without running a full agent turn. This eliminates ~2s of overhead (session resolution, history loading, system prompt injection, tool registration) that the existing /v1/chat/completions endpoint incurs via agentCommand(). Benchmarks: - Proxy route: ~200-500ms (mostly Cerebras inference time) - Agent route: ~1,800-2,500ms (full agent lifecycle) - Speedup: 4-10x for latency-sensitive use cases (voice pipelines) The proxy reuses gateway auth and provider config from openclaw.json. Model routing uses 'provider/model' format (e.g., cerebras/zai-glm-4.7). Supports both streaming and non-streaming responses.

steipete · 2026-02-14T18:17:20Z

This PR does a lot of different things, not just what you wrote.

steipete · 2026-02-14T18:27:12Z

Thanks @household-bard.

I implemented the skills watcher FD exhaustion fix directly on main using a narrower watcher strategy (watch SKILL.md only, plus one-level <skillsRoot>/*/SKILL.md) so chokidar never tracks large unrelated data trees under skills/.

Landed as:

0e046f6 (fix)
9409942 (test move into unit suite)

Changelog includes: "Skills: watch SKILL.md only ... Thanks @household-bard." and the fix commit includes:
Co-authored-by: household-bard shakespeare@hessianinformatics.com

Closing this PR as superseded by the main change.

openclaw-barnacle bot added the agents Agent runtime and tooling label Feb 7, 2026

openclaw-barnacle bot added the gateway Gateway runtime label Feb 7, 2026

openclaw-barnacle bot added the commands Command implementations label Feb 7, 2026

household-bard and others added 5 commits February 9, 2026 14:01

openclaw-barnacle bot added the size: XL label Feb 12, 2026

household-bard added 3 commits February 13, 2026 21:51

Merge upstream v2026.2.13 (592 commits) into fork

2955324

fix: resolve lint error in compact hook (restrict-template-expressions)

58bd013

steipete self-assigned this Feb 14, 2026

steipete closed this Feb 14, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: ignore binary/data files in skills watcher to prevent FD exhaustion#11325

fix: ignore binary/data files in skills watcher to prevent FD exhaustion#11325
household-bard wants to merge 11 commits intoopenclaw:mainfrom
household-bard:fix/skills-watcher-binary-ignore

household-bard commented Feb 7, 2026 •

edited by greptile-apps bot

Loading

Uh oh!

steipete commented Feb 14, 2026

Uh oh!

steipete commented Feb 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

household-bard commented Feb 7, 2026 • edited by greptile-apps bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root Cause

Fix

Discovered

Testing

Greptile Overview

Greptile Summary

Confidence Score: 4/5

Uh oh!

steipete commented Feb 14, 2026

Uh oh!

steipete commented Feb 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

household-bard commented Feb 7, 2026 •

edited by greptile-apps bot

Loading