AI Usability Benchmarking Guide
Overview
AILANG includes a built-in benchmarking system (M-EVAL) to empirically measure AI efficiency when generating code in AILANG vs Python. This helps validate AILANG's core value proposition: reducing AI token usage and improving code generation success rates.
NOTE: Current Phase: Baseline Single-Shot (M-EVAL Phase 1)
This guide covers single-shot code generation — one prompt, one response, no iteration. This provides baseline data on:
- Initial token efficiency
- Syntax familiarity
- First-attempt success rate
Multi-turn agentic evaluation (M-EVAL2) is planned for v0.3.0 and will measure:
- Total effort across multiple iterations
- Debugging/fix cycles
- Real-world AI coding workflows
See M-EVAL2 Design Doc for details.
Quick Start
Running a Benchmark (Mock Mode)
# Run FizzBuzz benchmark with mock AI agent
ailang eval --benchmark fizzbuzz --mock
# Run with both Python and AILANG
ailang eval --benchmark fizzbuzz --mock --langs python,ailang
# Generate report
make eval-report
Running with Real AI Models
# Set API keys
export OPENAI_API_KEY="your-key-here"
# or
export ANTHROPIC_API_KEY="your-key-here"
# Run benchmark with GPT-4
ailang eval --benchmark fizzbuzz --model gpt-4 --seed 42
# Run with Claude
ailang eval --benchmark adt_option --model claude-3 --seed 42
Available Benchmarks
ID | Description | Difficulty | Focus Area |
---|---|---|---|
fizzbuzz | Classic FizzBuzz (1-100) | Easy | Control flow |
json_parse | Parse JSON, filter, output | Medium | Data parsing |
pipeline | stdin → transform → stdout | Medium | IO + lists |
cli_args | Read file, process, sum | Hard | IO + FS |
adt_option | Option/Maybe monad ops | Medium | Algebraic types |
Latest Benchmark Results (v0.3.5)
Model: Claude Sonnet 4.5 (claude-sonnet-4-5-20250929)
Date: October 13, 2025
Success Rate: 10/19 (52.6%)
Benchmark | Compile | Runtime | Output | Status |
---|---|---|---|---|
adt_option | Yes | Yes | Yes | PASS |
cli_args | No | No | No | FAIL |
error_handling | Yes | Yes | Yes | PASS |
fizzbuzz | Yes | Yes | Yes | PASS |
float_eq | No | No | No | FAIL |
higher_order_functions | Yes | Yes | Yes | PASS |
json_parse | No | No | No | FAIL |
list_comprehension | Yes | Yes | No | FAIL |
list_operations | Yes | Yes | Yes | PASS |
nested_records | Yes | Yes | Yes | PASS |
numeric_modulo | Yes | No | No | FAIL |
pattern_matching_complex | No | No | No | FAIL |
pipeline | No | No | No | FAIL |
record_update | Yes | Yes | Yes | PASS |
records_person | Yes | Yes | Yes | PASS |
recursion_factorial | Yes | Yes | No | FAIL |
recursion_fibonacci | No | No | No | FAIL |
string_manipulation | Yes | Yes | Yes | PASS |
targeted_repair_test | Yes | Yes | Yes | PASS |
Key Insights:
- Strong areas: Records, ADTs, basic recursion, string operations
- Needs improvement: File I/O, JSON parsing, complex pattern matching, numeric operations
- Compile success: 16/19 (84.2%) - AI understands AILANG syntax well
- Runtime success: 14/19 (73.7%) - Most syntax issues resolved
- Output correctness: 10/19 (52.6%) - Logic errors in complex scenarios
Command-Line Options
ailang eval [options]
Required:
--benchmark <id> Benchmark ID to run (e.g., fizzbuzz)
Optional:
--langs <list> Comma-separated languages (default: python,ailang)
--model <name> LLM model (default: gpt-4)
--seed <n> Random seed for reproducibility (default: 42)
--output <dir> Output directory (default: eval_results)
--timeout <dur> Execution timeout (default: 30s)
--mock Use mock AI agent (for testing)
Understanding Results
JSON Metrics
Each run produces a JSON file in eval_results/ with:
{
  "id": "fizzbuzz",
  "lang": "python",
  "model": "gpt-4",
  "seed": 42,
  "tokens": 290,
  "cost_usd": 0.0087,
  "compile_ok": true,
  "runtime_ok": true,
  "stdout_ok": true,
  "duration_ms": 120,
  "error_category": "none",
  "timestamp": "2025-10-02T12:34:56Z",
  "code": "..."
}
Error Categories
- none: All checks passed
- compile_error: Syntax or type error
- runtime_error: Crashed during execution
- logic_error: Wrong output (compile/runtime OK)
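The category follows directly from the three boolean flags in the JSON metrics. A small sketch of that mapping (it mirrors the definitions above, not the harness's actual implementation):

```python
# Sketch of how the categories map onto the three result flags.
# Mirrors the definitions above; not the harness's actual code.
def error_category(compile_ok: bool, runtime_ok: bool, stdout_ok: bool) -> str:
    if not compile_ok:
        return "compile_error"   # syntax or type error
    if not runtime_ok:
        return "runtime_error"   # crashed during execution
    if not stdout_ok:
        return "logic_error"     # ran fine, wrong output
    return "none"                # all checks passed

assert error_category(True, True, False) == "logic_error"
```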
Reports
The make eval-report command generates:
CSV (eval_results/summary.csv):
benchmark,lang,model,seed,tokens,cost_usd,compile,runtime,stdout,duration_ms,error_type
fizzbuzz,python,gpt-4,42,290,0.0087,true,true,true,120,none
fizzbuzz,ailang,gpt-4,42,180,0.0054,true,true,true,80,none
Markdown (eval_results/leaderboard.md):
# AILANG vs Python Benchmark Results
| Benchmark | Lang | Tokens | Cost | Compile | Run | Pass | Duration |
|-----------|------|--------|------|---------|-----|------|----------|
| fizzbuzz | python | 290 | $0.0087 | Yes | Yes | Yes | 0.12s |
| fizzbuzz | ailang | 180 | $0.0054 | Yes | Yes | Yes | 0.08s |
## Summary
- **Avg Token Reduction:** 37.9%
- **AILANG Success Rate:** 100% (1/1)
- **Python Success Rate:** 100% (1/1)
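The token-reduction figure is simply (python_tokens - ailang_tokens) / python_tokens per benchmark, averaged across benchmarks; for the fizzbuzz rows above that is (290 - 180) / 290 ≈ 37.9%. A rough sketch that recomputes it from the CSV (column names are taken from the header above; pairing rows by benchmark, model, and seed is an assumption about how the report aggregates):

```python
# Recompute average token reduction from eval_results/summary.csv.
# Column names match the CSV header shown above; the pairing logic is illustrative.
import csv
from collections import defaultdict

rows = list(csv.DictReader(open("eval_results/summary.csv")))
by_key = defaultdict(dict)
for r in rows:
    by_key[(r["benchmark"], r["model"], r["seed"])][r["lang"]] = r

reductions = []
for pair in by_key.values():
    if "python" in pair and "ailang" in pair:
        py, ai = int(pair["python"]["tokens"]), int(pair["ailang"]["tokens"])
        reductions.append((py - ai) / py)

if reductions:
    print(f"Avg token reduction: {100 * sum(reductions) / len(reductions):.1f}%")
```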
Creating Custom Benchmarks
1. Create YAML Spec
Create benchmarks/my_benchmark.yml:
id: my_benchmark
description: "My custom benchmark"
languages: ["python", "ailang"]
entrypoint: "main"
caps: ["IO"]
difficulty: "medium"
expected_gain: "high"
prompt: |
  Write a program in <LANG> that:
  1. Does something interesting
  2. Produces specific output
  Output only the code, no explanations.
expected_stdout: |
  expected
  output
  here
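Before spending API credits, it can help to sanity-check the spec. The sketch below is a stand-alone helper, not part of the ailang toolchain, and the list of required fields is inferred from the example above rather than from a published schema:

```python
# Quick sanity check for a benchmark spec before running it.
# Stand-alone helper; the "required" field list is inferred from the example above.
import sys
import yaml  # pip install pyyaml

REQUIRED = ["id", "description", "languages", "prompt", "expected_stdout"]

spec = yaml.safe_load(open("benchmarks/my_benchmark.yml"))
missing = [k for k in REQUIRED if k not in spec]
if missing:
    sys.exit(f"missing fields: {', '.join(missing)}")
print(f"ok: {spec['id']} targets {', '.join(spec['languages'])}")
```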
2. Run It
ailang eval --benchmark my_benchmark --mock
3. Validate Output
The benchmark passes if:
- Code compiles without errors
- Executes without crashing
- Output matches expected_stdout (exact string match after trimming whitespace; see the sketch below)
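The output check amounts to the comparison sketched here; this mirrors the rule as stated, and the real harness may normalize output differently:

```python
# Sketch of the output pass criterion: exact match after trimming surrounding whitespace.
def output_matches(actual: str, expected: str) -> bool:
    return actual.strip() == expected.strip()

print(output_matches("Fizz\nBuzz\n", "Fizz\nBuzz"))  # True: trailing newline is trimmed
```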
Reproducibility
Deterministic Runs
Use --seed for reproducible results:
# Same seed → same model output (when supported)
ailang eval --benchmark fizzbuzz --model gpt-4 --seed 123
ailang eval --benchmark fizzbuzz --model gpt-4 --seed 123
Comparing Across Models
# Run same benchmark with different models
ailang eval --benchmark adt_option --model gpt-4 --seed 42
ailang eval --benchmark adt_option --model gpt-3.5-turbo --seed 42
ailang eval --benchmark adt_option --model claude-3 --seed 42
# Generate unified report
make eval-report
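If you want to queue several models unattended, the same sweep can be scripted. A sketch using the commands above (model names as in the examples; nothing here is specific to the harness beyond the documented flags):

```python
# Run the same benchmark and seed across several models, then build one report.
# Commands and flags are taken from the examples above; subprocess usage is illustrative.
import subprocess

MODELS = ["gpt-4", "gpt-3.5-turbo", "claude-3"]

for model in MODELS:
    subprocess.run(
        ["ailang", "eval", "--benchmark", "adt_option",
         "--model", model, "--seed", "42"],
        check=True,
    )

subprocess.run(["make", "eval-report"], check=True)
```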
Best Practices
1. Start with Mock Mode
# Verify benchmark works before using real API
ailang eval --benchmark my_benchmark --mock
2. Use Consistent Seeds
# Always use same seed for fair comparisons
ailang eval --benchmark fizzbuzz --model gpt-4 --seed 42
3. Test Both Languages
# Always compare AILANG vs Python
ailang eval --benchmark fizzbuzz --langs python,ailang
4. Clean Between Runs
# Avoid confusion from stale results
make eval-clean
Interpreting Results
Token Reduction
Good: 30-50% reduction
- Indicates AILANG's concise syntax is effective
- Lower API costs
- Faster generation
Neutral: 0-30% reduction
- May indicate benchmark doesn't stress AILANG's strengths
- Consider more ADT-heavy or effect-heavy tasks
Bad: Negative reduction (AILANG uses more tokens)
- Investigate: Is AILANG verbose for this task?
- May indicate need for language improvements
Success Rate
Target: 80%+ for both languages
If AILANG < 80%:
- Check runtime errors (capability issues? stdlib gaps?)
- Review generated code for common patterns
- May indicate missing language features
If Python < 80%:
- Check if task is too hard for AI
- Simplify prompt or expected output
Error Patterns
Monitor the error_category distribution (a tally sketch follows this list):
- High compile_error: the AI doesn't understand the syntax
- High runtime_error: missing builtins or capabilities
- High logic_error: the AI misunderstood the requirements
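A quick way to see that distribution across a batch of runs (same assumption about a flat eval_results/*.json layout as the earlier sketch):

```python
# Tally the error_category distribution across all result files.
import json
from collections import Counter
from pathlib import Path

counts = Counter(
    json.loads(p.read_text()).get("error_category", "unknown")
    for p in Path("eval_results").glob("*.json")
)
for category, n in counts.most_common():
    print(f"{category:<15} {n}")
```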
Troubleshooting
"OPENAI_API_KEY environment variable not set"
export OPENAI_API_KEY="sk-..."
"ailang binary not found"
make install
# Ensure $GOPATH/bin is in PATH
"Benchmark not found"
ls benchmarks/
# Check that <id>.yml exists
Mock code doesn't match expected output
This is expected! Mock code is for testing the harness, not for real evaluation. Use --model gpt-4 for actual benchmarks.
Makefile Targets
make eval # Run FizzBuzz with mock agent
make eval-report # Generate CSV + Markdown reports
make eval-clean # Remove all eval_results files
Future Extensions
v0.3.0:
- Net/Clock benchmarks (once effects land)
- Auto-retry on failure
- Cost tracking across vendors
v0.4.0:
- Concurrency benchmarks (spawn, channels)
- Performance metrics
- Memory usage tracking
v0.5.0:
- Web leaderboard
- Historical trending
- CI integration
Contributing
When adding benchmarks:
- Use neutral prompts (no language bias)
- Test with --mock first
- Verify expected output is achievable
- Document difficulty and expected gain
- Submit PR with results from at least 2 models
Next Steps:
- Run your first benchmark: ailang eval --benchmark fizzbuzz --mock
- Create a custom benchmark for your domain
- Share results to help guide AILANG development