M-EVAL-LOOP Architecture (v2.1)
Quick Reference
Running Evaluations
# Quick dev eval (default: gpt5-mini, gemini-2-5-flash)
ailang eval-suite
# Full comprehensive eval (gpt5, claude-sonnet-4-5, gemini-2-5-pro)
ailang eval-suite --full
# Custom models
ailang eval-suite --models gpt5,claude-sonnet-4-5
# Create baseline
make eval-baseline # Quick baseline (dev models)
make eval-baseline FULL=true # Full baseline (all models)
# Compare results
ailang eval-compare eval_results/baselines/v0.3.0 eval_results/current
# Validate specific fix
ailang eval-validate float_eq
Available Commands
| Command | Purpose | Example |
|---|---|---|
| ailang eval-suite | Run full benchmark suite | ailang eval-suite --full |
| ailang eval-compare | Compare two eval runs | ailang eval-compare baseline current |
| ailang eval-validate | Validate specific fix | ailang eval-validate records_person |
| ailang eval-matrix | Generate performance matrix | ailang eval-matrix results/ v0.3.0 |
| ailang eval-summary | Export to JSONL | ailang eval-summary results/ |
| ailang eval-report | Generate reports | ailang eval-report results/ v0.3.0 --format=html |
Architecture Overview
User Input ("validate my fix")
↓
Smart Agent (eval-orchestrator) ← Natural language interface
↓
Native Go Commands (ailang eval-*) ← Fast, type-safe execution
↓
Results + Interpretation
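The routing layer is deliberately thin: each eval-* subcommand maps straight to a native Go handler. The sketch below illustrates the pattern only; the handler map and stub bodies are assumptions, not the actual code in cmd/ailang/main.go, eval_suite.go, or eval_tools.go.

```go
// Illustrative sketch of Tier-1 command routing; not the actual implementation.
package main

import (
	"fmt"
	"os"
)

// handler runs one subcommand with its remaining CLI arguments.
type handler func(args []string) error

func main() {
	commands := map[string]handler{
		"eval-suite":    func(args []string) error { fmt.Println("run benchmarks:", args); return nil },
		"eval-compare":  func(args []string) error { fmt.Println("diff two runs:", args); return nil },
		"eval-validate": func(args []string) error { fmt.Println("validate a fix:", args); return nil },
	}
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: ailang <command> [flags]")
		os.Exit(2)
	}
	cmd, ok := commands[os.Args[1]]
	if !ok {
		fmt.Fprintf(os.Stderr, "unknown command: %s\n", os.Args[1])
		os.Exit(2)
	}
	if err := cmd(os.Args[2:]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```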
Tier 1: Native Go Commands
Location: internal/eval_analysis/, internal/eval_harness/, and cmd/ailang/
eval-suite: Full Benchmark Execution
Key Flags:
- `--full`: Use expensive models (gpt5, claude-sonnet-4-5, gemini-2-5-pro)
- `--models X,Y,Z`: Custom model list (default: gpt5-mini, gemini-2-5-flash)
- `--benchmarks X,Y,Z`: Specific tests (default: all)
- `--langs X,Y`: Target languages (default: python,ailang)
- `--parallel N`: Concurrent API calls (default: 5)
- `--self-repair`: Enable self-repair on errors
- `--output DIR`: Output directory (default: eval_results)
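These flags correspond to ordinary Go flag definitions. The sketch below shows one plausible wiring using the standard library flag package; the flag names and defaults mirror the list above, but the actual declarations in cmd/ailang/eval_suite.go may differ.

```go
// Illustrative flag wiring for eval-suite; not the actual eval_suite.go.
package main

import (
	"flag"
	"fmt"
)

func main() {
	fs := flag.NewFlagSet("eval-suite", flag.ExitOnError)
	full := fs.Bool("full", false, "use expensive models (gpt5, claude-sonnet-4-5, gemini-2-5-pro)")
	models := fs.String("models", "gpt5-mini,gemini-2-5-flash", "comma-separated model list")
	benchmarks := fs.String("benchmarks", "", "comma-separated benchmark subset (default: all)")
	langs := fs.String("langs", "python,ailang", "comma-separated target languages")
	parallel := fs.Int("parallel", 5, "concurrent API calls")
	selfRepair := fs.Bool("self-repair", false, "enable self-repair on errors")
	output := fs.String("output", "eval_results", "output directory")

	// Example invocation: ailang eval-suite --models gpt5 --self-repair
	fs.Parse([]string{"--models", "gpt5", "--self-repair"})
	fmt.Println(*full, *models, *benchmarks, *langs, *parallel, *selfRepair, *output)
}
```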
Examples:
# Quick dev check (cheap/fast)
ailang eval-suite
# Full validation (expensive)
ailang eval-suite --full
# Single model + self-repair
ailang eval-suite --models gpt5 --self-repair
# Custom subset
ailang eval-suite --models gpt5 --benchmarks fizzbuzz,json_parse
Model Cost Comparison:
- Dev models (default): ~$0.0003-0.002 per benchmark
- Full models (--full): ~$0.003-0.015 per benchmark
- Dev models are roughly 5-10x cheaper, which is why they are the default for day-to-day development
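As a rough worked example (hypothetical benchmark count, assuming the per-benchmark figures above apply per model): 20 benchmarks × 2 dev models at ~$0.001 each is on the order of $0.04, while 20 benchmarks × 3 full models at ~$0.008 each is closer to $0.50.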
eval-compare: Diff Two Runs
ailang eval-compare baseline/ current/
Shows:
- Success rate changes
- Newly passing/failing tests
- Token usage deltas
- Cost differences
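Conceptually, the comparison subtracts one run's aggregate stats from the other's. The types and field names below are an illustrative sketch, not the real ones in internal/eval_analysis/comparison.go.

```go
// Illustrative sketch of the eval-compare diff; not the actual types.
package main

import "fmt"

// RunStats summarizes one eval run.
type RunStats struct {
	Passed, Total int
	Tokens        int
	CostUSD       float64
}

// Diff captures the deltas eval-compare reports.
type Diff struct {
	SuccessRateDelta float64 // percentage points
	TokenDelta       int
	CostDeltaUSD     float64
}

func compare(baseline, current RunStats) Diff {
	rate := func(s RunStats) float64 {
		if s.Total == 0 {
			return 0 // guard against empty runs
		}
		return 100 * float64(s.Passed) / float64(s.Total)
	}
	return Diff{
		SuccessRateDelta: rate(current) - rate(baseline),
		TokenDelta:       current.Tokens - baseline.Tokens,
		CostDeltaUSD:     current.CostUSD - baseline.CostUSD,
	}
}

func main() {
	d := compare(
		RunStats{Passed: 18, Total: 30, Tokens: 52000, CostUSD: 0.21},
		RunStats{Passed: 22, Total: 30, Tokens: 49000, CostUSD: 0.19},
	)
	fmt.Printf("success %+.1f pp, tokens %+d, cost %+.2f USD\n",
		d.SuccessRateDelta, d.TokenDelta, d.CostDeltaUSD)
}
```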
eval-validate: Check Specific Fix
ailang eval-validate records_person [version]
Validates that a fix works by comparing current implementation against baseline.
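At its core this is a two-way classification of one benchmark's outcome across the baseline and current runs. The status labels and types below are assumptions for illustration; the real logic lives in internal/eval_analysis/validate.go.

```go
// Illustrative sketch of fix validation; not the actual validate.go.
package main

import "fmt"

type outcome struct {
	Benchmark string
	Passed    bool
}

// classify reports how a benchmark moved between two runs.
func classify(baseline, current outcome) string {
	switch {
	case !baseline.Passed && current.Passed:
		return "FIXED" // the fix worked
	case baseline.Passed && !current.Passed:
		return "REGRESSION" // the fix broke something
	case current.Passed:
		return "STILL PASSING"
	default:
		return "STILL FAILING"
	}
}

func main() {
	fmt.Println(classify(
		outcome{"records_person", false},
		outcome{"records_person", true},
	))
}
```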
Other Commands
ailang eval-matrix results/ v0.3.0 # Aggregate stats
ailang eval-summary results/ # Export JSONL
ailang eval-report results/ v0.3.0 -f html # Generate report
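For the JSONL export, each benchmark result becomes one JSON object per line, which encoding/json's Encoder produces naturally. The field set below is illustrative; the actual export schema is defined in internal/eval_analysis/export.go.

```go
// Illustrative JSONL export; field names are assumptions, not the real schema.
package main

import (
	"encoding/json"
	"os"
)

type result struct {
	Benchmark string  `json:"benchmark"`
	Model     string  `json:"model"`
	Passed    bool    `json:"passed"`
	Tokens    int     `json:"tokens"`
	CostUSD   float64 `json:"cost_usd"`
}

func main() {
	enc := json.NewEncoder(os.Stdout) // Encode writes one value per line -> JSONL
	results := []result{
		{"fizzbuzz", "gpt5-mini", true, 812, 0.0004},
		{"json_parse", "gpt5-mini", false, 1290, 0.0007},
	}
	for _, r := range results {
		if err := enc.Encode(r); err != nil {
			panic(err)
		}
	}
}
```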
Tier 2: Smart Agents
eval-orchestrator (.claude/agents/eval-orchestrator.md)
Interprets natural language and routes to correct commands:
User: "validate the float_eq fix"
Agent: → ailang eval-validate float_eq
→ Interprets results
→ Suggests next steps
eval-fix-implementer (.claude/agents/eval-fix-implementer.md)
Automates fix implementation from design docs:
User: "implement the float_eq fix"
Agent: → Reads design_docs/planned/EVAL_ANALYSIS_float_eq.md
→ Implements fix
→ Runs tests
→ Validates with ailang eval-validate
→ Reports metrics
Make Targets (Convenience)
make eval-baseline # Quick baseline
make eval-baseline FULL=true # Full baseline
make eval-baseline MODELS=X,Y # Custom baseline
make eval-suite # Run benchmarks
make eval-analyze # Generate design docs
make eval-diff BASELINE=X NEW=Y # Compare runs
User Experience
Natural Language (Recommended)
✅ "validate my fix for records"
✅ "how is AILANG performing?"
✅ "compare baseline to current"
✅ "generate a release report"
Direct Commands (Power Users)
ailang eval-validate records_person
ailang eval-compare baselines/v0.3.0 current
ailang eval-report results/ v0.3.0 --format=html
Model Selection Strategy
Default: Cheap & Fast (Dev Models)
- Models: gpt5-mini, gemini-2-5-flash
- Cost: roughly 1/5-1/10 of the full suite
- Use for: Daily development, rapid iteration, CI checks
- Command: `ailang eval-suite` (no flags needed)
Full Suite: Comprehensive (Production Models)
- Models: gpt5, claude-sonnet-4-5, gemini-2-5-pro
- Cost: Full price
- Use for: Release validation, final QA, baseline creation
- Command: `ailang eval-suite --full`
Custom: Mix & Match
- Models: Your choice
- Cost: Varies
- Use for: Targeted testing, specific model evaluation
- Command: `ailang eval-suite --models X,Y,Z`
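Custom selection is just a comma-separated list that gets split and checked against the configured models. The sketch below is a minimal illustration of that lookup; the model names are taken from this document, and the real configuration is read from internal/eval_harness/models.yml, not hard-coded.

```go
// Illustrative handling of a --models value; not the actual models.yml lookup.
package main

import (
	"fmt"
	"strings"
)

func main() {
	known := map[string]bool{
		"gpt5": true, "gpt5-mini": true,
		"claude-sonnet-4-5": true,
		"gemini-2-5-pro":    true, "gemini-2-5-flash": true,
	}
	flagValue := "gpt5,claude-sonnet-4-5" // would come from --models
	for _, m := range strings.Split(flagValue, ",") {
		m = strings.TrimSpace(m)
		if !known[m] {
			fmt.Printf("unknown model %q (check models.yml)\n", m)
			continue
		}
		fmt.Println("selected:", m)
	}
}
```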
File Organization
.claude/
  agents/
    eval-orchestrator.md       # Smart workflow router
    eval-fix-implementer.md    # Automated fix implementation
  EVAL_ARCHITECTURE.md         # This file
internal/
  eval_analysis/               # Native Go implementation (2,070 LOC)
    types.go, loader.go, comparison.go, matrix.go, formatter.go,
    validate.go, export.go, *_test.go (90%+ coverage)
  eval_harness/                # Benchmark execution
    models.yml                 # Model configurations
cmd/ailang/
  eval_suite.go                # eval-suite command
  eval_tools.go                # Other eval commands
  main.go                      # Command routing
tools/
  eval_baseline.sh             # Baseline creation helper
Makefile                       # Convenience targets
Design Principles
- Native Go First: Fast, type-safe, testable
- Smart Agents Layer: Add intelligence without forcing syntax
- Cost-Conscious Defaults: Cheap models for dev, expensive for release
- Flexible: Natural language, direct commands, or make targets
- Separation of Concerns: Execution vs. interpretation
Why This Architecture?
Performance
- Native Go = 5-10x faster than old bash scripts
- No jq/sed/awk overhead
- Parallel execution built-in
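To illustrate the built-in parallelism: the --parallel N flag bounds how many API calls are in flight at once, which in Go is naturally expressed as a buffered-channel semaphore. This is a sketch of the pattern only, not the harness's actual scheduler.

```go
// Illustrative bounded-concurrency pattern; not the harness's real scheduler.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	benchmarks := []string{"fizzbuzz", "json_parse", "records_person", "float_eq"}
	parallel := 2 // would come from --parallel

	sem := make(chan struct{}, parallel) // at most `parallel` slots
	var wg sync.WaitGroup
	for _, b := range benchmarks {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it
			time.Sleep(50 * time.Millisecond) // stand-in for a model API call
			fmt.Println("done:", name)
		}(b)
	}
	wg.Wait()
}
```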
Reliability
- 90%+ test coverage
- Type-safe, with guarded arithmetic (success rates never divide by zero)
- Proper error handling
Usability
- Natural language interface (agents)
- Power user direct commands
- Cost-conscious defaults
Maintainability
- Clear layer boundaries
- Easy to test and debug
- No brittle bash scripts
Version: 2.1 | Updated: 2025-10-15 | Status: Production Ready