Context Economy Thresholds Refinement
Status: ✅ Complete · Priority: High · Created: 2025-11-11 · Tags: validation, philosophy, context-economy, quality
Overview
The Suspicion: Hard line thresholds (>300 warning, >400 error) may not accurately reflect spec complexity and readability.
The Investigation: Deep dive into existing specs + research on LLM performance reveals that structure, density, and token count matter more than raw line count.
Key Finding: A well-structured 394-line spec with sub-specs can be more readable than a dense 316-line spec with 26 code blocks. Token count is critical - research shows 39% performance drop in multi-turn contexts and quality degradation beyond 50K tokens.
Evidence from Current Specs
Top 4 Largest Specs (All Near Threshold)
| Spec | Lines | Tokens* | Sections | Code Blocks | Lines/Section | Sub-specs | Readable? |
|---|---|---|---|---|---|---|---|
| 059-programmatic | 394 | ~2,100 | 32 | 8 | ~12 | 6 files | ✅ Yes |
| 049-first-principles | 374 | ~1,700 | 38 | 0 | ~9 | 5 files | ✅ Yes |
| 051-docs-system-prompt | 339 | ~1,600 | 28 | 4 | ~12 | 0 | ✅ Yes |
| 016-github-action | 315 | ~2,400 | 20 | 26 | ~15 | 0 | ⚠️ Dense |
*Estimated tokens (code is denser: ~3 chars/token vs prose: ~4 chars/token)
Key Observations
1. Sub-specs Improve Readability
- Spec 059 (394 lines): Has 6 sub-spec files → README is just an overview
- Spec 049 (374 lines): Has 5 sub-spec files → Progressive disclosure works
- These are easier to navigate than 300-line single-file specs
2. Section Density Matters More Than Line Count
- Spec 049: 9 lines/section → Easy to scan and understand
- Spec 016: 15 lines/section + 26 code blocks → Cognitively heavier despite fewer lines
3. Code Block Density is a Complexity Factor
- High code density (like spec 016 with 26 blocks) increases cognitive load
- Code requires more attention than prose
- Not captured by simple line counting
4. Structure Trumps Size
- A well-organized 400-line spec with clear sections and sub-specs
- Is MORE readable than a poorly structured 250-line spec
- Current validation misses this
5. Token Count Reveals True Cognitive Load
- Spec 016: Only 315 lines but ~2,400 tokens (26 code blocks = dense)
- Spec 049: 374 lines but only ~1,700 tokens (pure prose, no code)
- Research shows: Quality drops beyond 50K tokens, 6x cost difference between 2,000-line vs 300-line specs
- Token count better predicts AI performance than line count
The Problem with Hard Thresholds
What Current Validation Checks
// Current: Simple line counting
if (lines > 400) return ERROR;
if (lines > 300) return WARNING;
Issues:
- ❌ Doesn't account for structure (section organization)
- ❌ Doesn't account for density (code blocks, lists, tables)
- ❌ Doesn't account for sub-specs (progressive disclosure)
- ❌ Doesn't account for content type (prose vs. code vs. data)
- ❌ Doesn't account for token count (true cognitive load for AI)
- ❌ False positives: 394-line spec with 6 sub-specs → WARNING (but it's fine!)
- ❌ False negatives: 280-line dense spec with no structure → PASS (but it's hard to read!)
What Actually Affects Readability
Cognitive Load Factors (in priority order):
-
Cognitive Chunking - Can you break it into 7±2 concepts?
- Well-sectioned spec with 15-30 sections: Easy to chunk
- Monolithic wall of text: Hard to process
-
Information Density - How much attention does each line require?
- Code blocks: High cognitive load
- Tables: Medium load
- Narrative prose: Lower load
- Frontmatter/lists: Scannable
-
Progressive Disclosure - Can you defer details?
- Spec with sub-specs: Read README for overview, dive into DESIGN.md when needed
- Single file: Must read everything to understand
-
Signal-to-Noise - How much is fluff vs. decision-critical?
- High signal: Every sentence informs decisions
- Low signal: Obvious content, verbose explanations
-
Token Count - True AI cognitive load (CRITICAL)
- Research: 39% performance drop in multi-turn contexts
- Quality degrades beyond 50K tokens despite 200K limits
- 6x cost difference: 2,000 lines vs 300 lines
- Code is denser (~3 chars/token) than prose (~4 chars/token)
- Better predictor of AI effectiveness than line count
-
Total Length - Raw line count (legacy metric)
- Yes, it matters, but LESS than the above factors
- A necessary but not sufficient condition
- Proxy for token count but less accurate
Proposed Refined Approach
Multi-Dimensional Complexity Score
Instead of just line count, calculate Cognitive Load Score:
type ComplexityMetrics = {
lineCount: number;
sectionCount: number;
codeBlockCount: number;
codeBlockChars: number; // Total characters in code blocks
listItemCount: number;
tableCount: number;
tableChars: number; // Total characters in tables
hasSubSpecs: boolean;
subSpecCount: number;
averageSectionLength: number;
estimatedTokens: number; // Estimated token count for LLM input
estimatedReadingTime: number; // minutes
};
type ComplexityScore = {
score: number; // 0-100
factors: {
tokens: number; // Primary: token-based score (0-60)
structure: number; // Modifier: structure quality (-30 to +20)
};
recommendation: 'good' | 'review' | 'split';
costMultiplier: number; // vs 300-line baseline
aiEffectiveness: number; // 0-100% (hypothesis to validate)
};
Scoring Algorithm (Draft)
// Token estimation using tokenx package (https://www.npmjs.com/package/tokenx)
// 94% accuracy, 2kB size, zero dependencies
import { estimateTokenCount } from 'tokenx';
function calculateComplexityScore(metrics: ComplexityMetrics): ComplexityScore {
// PRIMARY: Token count (research-backed predictor of AI performance)
// Thresholds to be validated empirically - these are hypotheses
const tokenScore =
metrics.estimatedTokens < 2000 ? 0 : // Excellent
metrics.estimatedTokens < 3500 ? 20 : // Good
metrics.estimatedTokens < 5000 ? 40 : // Warning
60; // Should split
// MODIFIERS: Structure quality adjusts token-based score
// Sub-specs enable progressive disclosure (big win for Context Economy)
// Good sectioning enables cognitive chunking (7±2 rule)
const structureModifier =
metrics.hasSubSpecs ? -30 : // Progressive disclosure bonus
(metrics.sectionCount >= 15 && metrics.sectionCount <= 35) ? -15 : // Good chunking
(metrics.sectionCount < 8) ? +20 : // Too monolithic
0; // Acceptable
const finalScore = Math.max(0, Math.min(100, tokenScore + structureModifier));
// Calculate cost multiplier (vs 300-line baseline ≈ 1,200 tokens)
const baselineTokens = 1200;
const costMultiplier = metrics.estimatedTokens / baselineTokens;
// AI effectiveness estimate (to be validated empirically)
// Research suggests degradation, but exact thresholds need testing
let aiEffectiveness = 100;
if (metrics.estimatedTokens > 10000) {
aiEffectiveness = 50; // Severe degradation (hypothesis)
} else if (metrics.estimatedTokens > 5000) {
aiEffectiveness = 65; // Significant degradation (hypothesis)
} else if (metrics.estimatedTokens > 3500) {
aiEffectiveness = 80; // Noticeable degradation (hypothesis)
} else if (metrics.estimatedTokens > 2000) {
aiEffectiveness = 90; // Slight degradation (hypothesis)
}
return {
score: finalScore,
factors: {
tokens: tokenScore, // Primary factor
structure: structureModifier, // Modifier
},
recommendation:
finalScore <= 25 ? 'good' :
finalScore <= 50 ? 'review' :
'split',
costMultiplier: Math.round(costMultiplier * 10) / 10,
aiEffectiveness: Math.round(aiEffectiveness),
};
}
New Thresholds
Instead of hard line limits:
- Score 0-30: ✅ Good - Readable and well-structured
- Score 31-60: ⚠️ Review - Consider simplification or splitting
- Score 61-100: 🔴 Split - Too complex, should split
Examples Applied to Current Specs
Spec 059 (394 lines, ~2,100 tokens, 32 sections, 8 code blocks, 6 sub-specs):
- Token score: 20 (~2,100 tokens)
- Structure modifier: -30 (has sub-specs)
- Total: -10 points → ✅ Excellent | Cost: 1.8x | AI: 90%
Spec 016 (315 lines, ~2,400 tokens, 20 sections, 26 code blocks, no sub-specs):
- Token score: 20 (~2,400 tokens)
- Structure modifier: -15 (20 sections, good chunking)
- Total: 5 points → ✅ Good | Cost: 2.0x | AI: 90%
- Key insight: Token count captures code density automatically
Spec 051 (339 lines, ~1,600 tokens, 28 sections, 4 code blocks, no sub-specs):
- Token score: 0 (~1,600 tokens)
- Structure modifier: -15 (28 sections, good chunking)
- Total: -15 points → ✅ Excellent | Cost: 1.3x | AI: 100%
Spec 049 (374 lines, ~1,700 tokens, 38 sections, 0 code blocks, 5 sub-specs):
- Token score: 0 (~1,700 tokens)
- Structure modifier: -30 (has sub-specs, 38 sections)
- Total: -30 points → ✅ Excellent | Cost: 1.4x | AI: 100%
- Key insight: Sub-specs + good structure = optimal
Hypothetical: 280 lines, ~1,400 tokens, 5 sections, no code blocks, no sub-specs:
- Token score: 0 (~1,400 tokens)
- Structure modifier: +20 (only 5 sections, poor chunking)
- Total: 20 points → ✅ Good | Cost: 1.2x | AI: 100%
- Key insight: Short with poor structure still acceptable (tokens dominate)
Validation Changes Needed
Phase 1: Add Complexity Metrics (v0.3.0)
Enhance validation to collect:
interface SpecComplexity {
lineCount: number;
sectionCount: number;
codeBlockCount: number;
listItemCount: number;
tableCount: number;
subSpecFiles: string[];
averageSectionLength: number;
estimatedReadingTime: number;
}
Phase 2: Implement Complexity Scoring (v0.3.0)
Add new validator:
class ComplexityScoreValidator implements ValidationRule {
name = 'complexity-score';
description = 'Multi-dimensional complexity analysis';
validate(spec: SpecInfo, content: string): ValidationResult {
const metrics = analyzeComplexity(spec, content);
const score = calculateComplexityScore(metrics);
if (score.recommendation === 'split') {
return {
passed: false,
errors: [{
message: `Spec complexity too high (score: ${score.score}/100)`,
suggestion: `Consider splitting. Main issues: ${identifyTopIssues(score.factors)}`,
}],
};
}
if (score.recommendation === 'review') {
return {
passed: true,
warnings: [{
message: `Spec complexity moderate (score: ${score.score}/100)`,
suggestion: `Consider: ${suggestImprovements(metrics, score.factors)}`,
}],
};
}
return { passed: true, errors: [], warnings: [] };
}
}
Phase 3: Keep Line Count as Backstop (v0.3.0)
Don't remove line count validation entirely - use it as a backstop:
// Complexity score is primary
// Line count is secondary safety net
if (complexityScore < 60 && lineCount < 500) {
// Good - pass both checks
} else if (complexityScore < 60 && lineCount >= 500) {
// Warning - good structure but very long
warning("Well-structured but consider splitting for Context Economy");
} else if (complexityScore >= 60 && lineCount < 400) {
// Error - complex despite being shorter
error("Poor structure or high density - needs refactoring");
} else {
// Error - both metrics problematic
error("Too complex - split into sub-specs");
}
Phase 4: Educate Users (v0.3.0)
Update guidance:
- AGENTS.md: Explain complexity factors beyond line count
- README.md: Show examples of good vs. poor structure
- Validation output: Explain WHY a spec is complex
- CLI: Add
lean-spec complexity <spec>command for detailed analysis
Research Evidence
1. Token Count is Critical
Source: AI Agent Performance Blog Post
- Finding: 2,000-line spec costs 6x more than 300-line spec
- Finding: Quality degradation happens even within context limits (not just at 50K)
- Key Quote: "Quality drops beyond 50K tokens despite 200K limits" - but degradation starts much earlier
- Why: Attention dilution (N² complexity), context rot, option overload, premature convergence
2. Multi-Turn Performance Degradation
Source: arXiv:2505.06120 - "LLMs Get Lost In Multi-Turn Conversation"
- Finding: 39% average performance drop across six generation tasks
- Root Cause: LLMs make premature assumptions and can't recover
- Key Quote: "When LLMs take a wrong turn, they get lost and do not recover"
3. Function-Calling Performance
Source: Berkeley Function-Calling Leaderboard (BFCL)
- Finding: ALL models perform worse with more tools/options
- Implication: More context = more confusion = lower accuracy
4. Information Density Matters
Source: arXiv:2407.11963 - "NeedleBench"
- Finding: Models struggle with information-dense scenarios even at shorter context lengths
- Phenomenon: "Under-thinking" - premature reasoning termination
5. Long-Context RAG Performance
Source: Databricks Research
- Finding: Long-context performance degrades significantly even within theoretical limits
- Implication: Smaller models degrade earlier
Key Takeaway
Token count is a better predictor of AI performance than line count because:
- Direct measure of LLM input cost
- Accounts for content density (code vs prose)
- Backed by research showing non-linear degradation patterns
- Correlates with actual AI effectiveness
Degradation Gradient (based on research):
- <2K tokens: Baseline performance (~100% effectiveness)
- 2-5K tokens: Early degradation begins (~85-95% effectiveness)
- 5-10K tokens: Noticeable degradation (~65-85% effectiveness)
- 10-20K tokens: Moderate degradation (~50-65% effectiveness)
- 50K+ tokens: Severe "cliff" effect (~40% performance drop or worse)
For validation, we use conservative thresholds (5K tokens = severe penalty) to catch specs before they reach problematic sizes.
Implementation Plan
Phase 1: Research & Metrics ✅ (This spec)
- Investigate current specs
- Identify complexity factors
- Propose scoring algorithm
- Design empirical validation plan
- Get feedback on approach
Phase 2: Core Implementation (v0.3.0 - Next)
- Install
tokenxfor token estimation - Implement simplified
calculateComplexityScore()function - Use hypothesis thresholds (2K/3.5K/5K tokens) initially
- Create
ComplexityScoreValidatorclass - Add tests for edge cases
- Integrate with existing validation framework
Phase 3: CLI Integration (v0.3.0)
- Add
lean-spec complexity <spec>command - Show breakdown: token score, structure modifier
- Display cost multiplier and AI effectiveness estimates
- Provide actionable suggestions
- Update
lean-spec validateoutput
Phase 4: Documentation (v0.3.0)
- Update AGENTS.md with complexity guidance
- Update README.md with examples
- Create "good structure" showcase
- Document that thresholds are hypotheses pending validation
Phase 5: Empirical Validation (Future - v0.4.0+)
Deferred until we have:
- More real-world usage data from v0.3.0
- Clear methodology for LLM integration
- Resources for comprehensive benchmarking
Tasks when ready:
- Implement benchmark framework (already stubbed in
src/benchmark/) - Define benchmark tasks for 10+ specs
- Run benchmarks to validate thresholds
- Refine scoring weights based on data
- Publish empirical findings
Phase 6: Advanced Features (v0.4.0+)
- Complexity trends over time
- Project-wide complexity dashboard
- Automated splitting suggestions
- Model-specific thresholds (if empirical data shows need)
Implementation Details
Token Estimation: Two Options
We have two viable approaches for token counting, each with different tradeoffs:
Option 1: tokenx (Recommended for Validation)
Fast, lightweight estimation - Best for validation thresholds where perfect accuracy isn't critical.
Pros:
- ✅ 94% accuracy compared to full tokenizers
- ✅ Just 2kB bundle size with zero dependencies
- ✅ Very fast - no tokenization overhead
- ✅ Multi-language support (English, German, French, Chinese, etc.)
- ✅ Good enough for validation warnings/errors
- ✅ 45K+ weekly downloads
Cons:
- ❌ Not 100% accurate (6-12% error margin)
- ❌ Estimation-based, not true BPE encoding
Installation:
npm install tokenx
Usage:
import { estimateTokenCount, isWithinTokenLimit } from 'tokenx';
// Fast estimation for validation
const tokens = estimateTokenCount(specContent);
// Check if within limit (e.g., 5000 token warning threshold)
const needsReview = !isWithinTokenLimit(specContent, 5000);
Accuracy benchmarks:
- English prose: 10-12% error margin
- Code (TypeScript): 6.18% error margin
- Large text (31K tokens): 12.29% error margin
Option 2: gpt-tokenizer (For Exact Counts)
Precise tokenization - Port of OpenAI's tiktoken with 100% accuracy.
Pros:
- ✅ 100% accurate - exact BPE encoding
- ✅ Supports all OpenAI models (GPT-4o, GPT-4, GPT-3.5, etc.)
- ✅ Fastest full tokenizer on NPM (faster than WASM bindings)
- ✅ Built-in cost estimation with
estimateCost() - ✅ Chat-specific tokenization with
encodeChat() - ✅ 283K+ weekly downloads, trusted by Microsoft, Elastic
Cons:
- ❌ 53.1 MB unpacked size (vs 2kB for tokenx)
- ❌ Slower than estimation (but still fastest full tokenizer)
- ❌ Model-specific - need to import correct encoding
Installation:
npm install gpt-tokenizer
Usage:
import { encode, countTokens, isWithinTokenLimit } from 'gpt-tokenizer';
// or model-specific: from 'gpt-tokenizer/model/gpt-4o'
// Exact token count
const tokens = encode(specContent);
const count = tokens.length;
// Or use helper
const exactCount = countTokens(specContent);
// Check limit with exact counting
const needsReview = !isWithinTokenLimit(specContent, 5000);
Accuracy:
- 100% accurate (port of OpenAI's tiktoken)
- Benchmarked against OpenAI's Python library
Recommendation: Hybrid Approach
For v0.3.0, use tokenx:
- Fast validation during CLI commands
- 2kB size won't bloat the package
- 94% accuracy is sufficient for warnings/errors
- 6-12% margin is acceptable for thresholds
Future: Offer gpt-tokenizer as optional
- Add as peer dependency (optional install)
- Use if available for exact counts
- Fall back to tokenx if not installed
- Display "estimated" vs "exact" in output
Implementation:
// Try exact tokenizer first, fall back to estimation
let tokenCount: number;
let isExact = false;
try {
const { countTokens } = await import('gpt-tokenizer');
tokenCount = countTokens(content);
isExact = true;
} catch {
const { estimateTokenCount } = await import('tokenx');
tokenCount = estimateTokenCount(content);
isExact = false;
}
// Display in output
console.log(`Tokens: ${tokenCount} ${isExact ? '(exact)' : '(estimated ±6%)'}`);
Empirical Validation Plan (Future Work)
Status: Deferred to v0.4.0+ - too early for comprehensive benchmarking
The Problem: Current thresholds (2K/3.5K/5K tokens) are hypotheses based on research, not validated on LeanSpec's actual use case.
The Challenge: Building a proper benchmark suite requires:
- Clear methodology for LLM integration and evaluation
- Significant time investment for framework + test data
- Real-world usage patterns from v0.3.0 to guide validation
- Resources for running benchmarks across multiple models
The Pragmatic Approach:
- v0.3.0: Ship complexity scoring with research-based thresholds (good enough to start)
- Collect data: Gather real-world usage patterns, see which specs trigger warnings
- v0.4.0+: Build benchmark framework when we have clearer requirements
Validation Framework Stub (see src/benchmark/ for implementation):
- Type definitions and interfaces ready
- Complexity analysis functions implemented
- Benchmark task examples defined
- LLM integration and statistical analysis deferred
When ready to validate, the framework will answer:
- Does token count predict performance better than line count?
- Where does degradation actually start? (2K? 3K? 5K?)
- How much do sub-specs improve AI comprehension?
- What's the real cost multiplier for large specs?
For now: Use hypothesis thresholds, document them as such, refine based on user feedback in v0.3.0.
Open Questions
-
Token Thresholds: Are 2K/3.5K/5K correct?
- Current: Using research-based hypotheses
- To validate (v0.4.0+): Run benchmark suite when methodology is clear
- For now: Gather user feedback in v0.3.0, adjust if obviously wrong
-
Structure Impact: How much does it matter?
- Current: -30 bonus for sub-specs, -15 for good sectioning
- To validate (v0.4.0+): Compare monolithic vs sub-spec variants
- For now: Based on cognitive science (7±2 chunks) and intuition
-
Section Count Sweet Spot: Is 15-35 sections right?
- Current: Based on cognitive load theory
- To validate (v0.4.0+): Test specs with varying section counts
- For now: Seems reasonable, may adjust based on user feedback
-
Model Differences: Do thresholds vary by model?
- Current: Assume similar across Claude/GPT
- To validate (v0.4.0+): Test multiple models if data shows divergence
- For now: Single set of thresholds
-
Performance: Can we run this efficiently?
- Current: tokenx is very fast (2kB, no dependencies)
- To benchmark: Test on 100+ specs to verify <100ms per spec
- For now: Should be fine, optimize if issues arise
Success Criteria
Phase 2-4: Initial Implementation (v0.3.0)
- ✅ Spec 059 (394 lines, 6 sub-specs) scores well (≤25 points)
- ✅ Poorly structured specs flagged even if short
- ✅ Users understand WHY a spec is complex (clear breakdown)
- ✅ Validation guides toward better structure, not just length reduction
- ✅ AI agents make informed splitting decisions based on token count + structure
- ✅ No false negatives: Truly oversized specs (>600 lines) caught
- ✅ Thresholds documented as hypotheses, not validated facts
Phase 5: Empirical Validation (v0.4.0+ - When Ready)
- ✅ Token count predicts performance better than line count (R² > 0.7 vs < 0.5)
- ✅ Degradation thresholds validated within ±500 tokens of hypothesis
- ✅ Sub-specs show measurable quality improvement (>5% accuracy)
- ✅ Cost multiplier validated against actual API usage
- ✅ Multi-turn degradation measured and documented
User Experience (All Phases)
- ✅ Complexity scores align with user intuition
- ✅ Suggestions are actionable and specific
- ✅ AGENTS.md reflects current best practices (hypothesis-based for v0.3.0)
Related Specs
- 048-spec-complexity-analysis - Identified line count thresholds
- 049-leanspec-first-principles - Context Economy principle
- 059-programmatic-spec-management - Context engineering and programmatic analysis
- 018-spec-validation - Current validation framework
Notes
Why This Matters
Current Problem: False positives and false negatives
- We're warning about well-structured 394-line specs (false positive)
- We're missing dense 280-line specs with poor structure (false negative)
Impact:
- Users may ignore warnings if they seem arbitrary
- AI agents get confused about when to split
- We're not measuring what we actually care about (readability, not just length)
Solution: Measure complexity more holistically
- Line count remains important but not sufficient
- Structure, density, and progressive disclosure matter
- Give users actionable feedback
The Meta-Learning
This spec itself demonstrates the principle:
- 410 lines, ~2,200 tokens (includes code examples)
- Well-structured with clear sections (28 sections)
- Each section is scannable and focused
- Tables and lists make information easy to parse
- References research with clear citations
Applying the simplified scoring to this spec:
- Token score: 20 (~2,200 tokens, in good range)
- Structure modifier: -15 (28 sections, good chunking, no sub-specs)
- Total: 5 points → ✅ Good | Cost: 1.8x | AI: 90%
Insight: Well-structured with clear sections. Could benefit from sub-specs for benchmark details (would get -30 modifier).
Using old rules: "🔴 Error: 410/400 lines - must split!" Using new rules: "✅ Good: Score 5/100 - well-structured, token count acceptable, consider sub-specs to reach 'excellent'"
Implementation Status (2025-11-11)
✅ What's Done
Core Implementation Exists:
ComplexityValidatorclass implemented inpackages/core/src/validators/complexity.ts- Token estimation using
tokenxpackage (installed and working) - Multi-dimensional scoring algorithm implemented
- Registered in validation pipeline (
packages/cli/src/commands/validate.ts) - All token thresholds and structure modifiers coded as designed
Build Status:
@leanspec/corepackage builds successfully@leanspec/clipackage builds successfullytokenxdependency properly installed
🐛 Issues Found
1. Sub-Spec Detection Bug (Critical)
- Problem:
hasSubSpecsdetected by text pattern matching, not actual file existence - Current Code:
/\b(DESIGN|IMPLEMENTATION|TESTING|CONFIGURATION|API|MIGRATION)\.md\b/.test(content) - Issue: Spec 066 mentions "DESIGN.md" in documentation → gets -30 bonus it doesn't deserve
- Impact: False negatives - specs get structure bonuses for merely documenting sub-specs
2. Silent Warning Issue
- Problem: Complexity validator produces no output even when it should warn
- Actual Score for Spec 066:
- 706 lines
- 7,307 tokens (very high, >5000 threshold)
- 45 sections
- Token score: 60
- Structure modifier: -30 (false positive due to bug #1)
- Final score: 30 → "review" recommendation (should show warning)
- Expected: Warning message displayed
- Actual: No output from complexity validator
- Hypothesis: Warning not being formatted/displayed (need to debug formatter or result handling)
3. Old Line Count Validator Still Active
- Current Behavior: Shows "Error: Spec exceeds 400 lines (706 lines)"
- Expected: Complexity validator should be primary, line count as backstop
- Decision Needed: Should old validator be:
- Disabled entirely?
- Adjusted to only warn at 500+ lines?
- Kept as-is for redundancy?
🔧 Fixes Needed for Next Session
Priority 1: Fix Sub-Spec Detection
// Current (WRONG):
const hasSubSpecs = /\b(DESIGN|IMPLEMENTATION|TESTING|CONFIGURATION|API|MIGRATION)\.md\b/.test(content);
// Should be (need to check actual files):
// Option A: Pass file list to validator
const hasSubSpecs = subSpecFiles.length > 0;
// Option B: Check spec directory for .md files (requires fs access)
const files = await fs.readdir(path.dirname(spec.filePath));
const mdFiles = files.filter(f => f.endsWith('.md') && f !== 'README.md');
const hasSubSpecs = mdFiles.length > 0;
Location: packages/core/src/validators/complexity.ts, line ~160
Priority 2: Debug Silent Warning
- Add logging to see if validator is running and producing results
- Check if
ValidationResultwith warnings is being filtered out - Verify formatter (
validate-formatter.ts) handles complexity validator output - Test with simpler spec to isolate issue
Priority 3: Coordinate Line Count Validator
- Decide on line count validator role (keep, adjust, or remove)
- Update thresholds if keeping (suggest 500/600 instead of 300/400)
- Document relationship between validators in code comments
📊 Test Cases for Validation
Test with Spec 066 (this spec):
- Expected: Score 60 (no sub-specs) → "split" recommendation → ERROR
- Currently: Score 30 (false bonus) → "review" → WARNING (but silent)
Test with Spec 049 (has 5 sub-specs):
- 374 lines, ~1,700 tokens, 38 sections, 5 sub-spec files
- Expected: Score -30 → "excellent" → PASS
- Should verify this works correctly
Test with Spec 059 (has 6 sub-specs):
- 394 lines, ~2,100 tokens, 32 sections, 6 sub-spec files
- Expected: Score -10 → "excellent" → PASS
- Should verify this works correctly
🎯 Next Steps
-
Fix sub-spec detection (30 min):
- Modify
analyzeComplexity()to check actual files - May need to pass spec path or file list to validator
- Update tests to verify correct detection
- Modify
-
Debug warning output (20 min):
- Add temporary console.log in validator
- Rebuild and test
- Check if result is being produced but not displayed
-
Verify full pipeline (10 min):
- Run validation on multiple specs
- Confirm token counts and scores match expectations
- Validate formatter displays all validator results
-
Update line count validator (10 min):
- Adjust thresholds to 500/600 or disable
- Update messages to reference complexity validator
- Document as backstop in comments
-
Test and document (20 min):
- Validate spec 049, 059, 066 with corrected logic
- Update AGENTS.md with complexity guidance
- Mark spec 066 as fully implemented
Estimated Total: ~90 minutes of focused work
💡 Design Questions to Resolve
-
File Detection Approach: Should we:
- A) Pass file list to validator (cleaner, requires API change)
- B) Let validator read directory (simpler, but core needs fs access)
- C) Pre-compute in CLI and pass as metadata (best separation of concerns)
-
Line Count Validator: Should we:
- A) Remove it (complexity validator handles everything)
- B) Keep with raised thresholds (500/600) as backstop
- C) Keep current thresholds for redundancy
-
Sections Outside 15-35 Range: Currently gives 0 modifier. Should we:
- A) Keep as-is (only penalize <8 sections)
- B) Penalize >35 sections (too fragmented)
- C) Use sliding scale instead of fixed ranges
Status: Implementation exists but has bugs. Ready for debugging and fixing in next session.