
AI Code Generation Benchmarks

Real-world performance metrics for AILANG vs Python across multiple AI models.

Model Comparison (Radar Plot)

(Interactive radar chart; the underlying numbers appear in the tables under Technical Details below.)

Detailed Results

(Interactive results table; see the static tables under Technical Details below.)

What These Numbers Mean

Our benchmark suite tests AI models' ability to generate correct, working code in both AILANG and Python.

Success Metrics

  • 0-Shot Success: the generated code works on the first try, with no repairs
  • Final Success: the code works after M-EVAL-LOOP self-repair
  • Token Efficiency: average tokens per generated program; fewer tokens means more concise code
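
For concreteness, here is a minimal sketch of how these metrics could be aggregated from per-run records. The record fields and function names are illustrative assumptions, not the actual harness schema.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    # Hypothetical per-run record; field names are illustrative, not the suite's real schema.
    benchmark: str
    zero_shot_ok: bool   # ran correctly on the first attempt, no repairs
    final_ok: bool       # ran correctly after M-EVAL-LOOP self-repair
    tokens: int          # size of the generated program in tokens

def summarize(runs: list[RunResult]) -> dict:
    """Aggregate 0-shot success, final success, and average token count."""
    n = len(runs)
    return {
        "zero_shot_success": sum(r.zero_shot_ok for r in runs) / n,
        "final_success": sum(r.final_ok for r in runs) / n,
        "avg_tokens": sum(r.tokens for r in runs) / n,
    }
```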

Why This Matters

These benchmarks demonstrate:

  1. Type Safety Works: AILANG's type system catches errors early
  2. Effects Are Clear: Explicit effect annotations help AI models
  3. Patterns Are Learnable: AI models understand functional programming
  4. Room to Grow: Benchmarks identify language gaps and guide development

Where AILANG Shines

AILANG excels at these problem types:

  • Fizzbuzz: 100.0% success rate
  • Records Person: 100.0% success rate
  • Nested Records: 100.0% success rate
  • String Manipulation: 100.0% success rate
  • Simple Print: 100.0% success rate

How Benchmarks Guide Development

The M-EVAL-LOOP system uses these benchmarks to:

  1. Identify Bugs: Failing benchmarks reveal language issues
  2. Validate Fixes: Compare before/after to confirm improvements
  3. Track Progress: Historical data shows language evolution
  4. Prioritize Features: High-impact failures guide roadmap
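
Conceptually, that feedback loop boils down to something like the sketch below. The names (`model.generate`, `benchmark.run`, `error_report`) are placeholders for illustration, not the actual M-EVAL-LOOP API.

```python
def eval_with_repair(model, benchmark, max_repairs=1):
    """Sketch of a generate -> run -> repair cycle (illustrative, not the real harness)."""
    code = model.generate(benchmark.prompt)        # first attempt
    result = benchmark.run(code)                   # compile and execute
    zero_shot_ok = result.ok

    repairs = 0
    while not result.ok and repairs < max_repairs:
        # Feed the structured error report back to the model and retry.
        code = model.generate(benchmark.prompt, feedback=result.error_report)
        result = benchmark.run(code)
        repairs += 1

    return zero_shot_ok, result.ok                 # (0-shot success, final success)
```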

Case Study: Float Equality Bug

The adt_option benchmark caught a critical bug in which float comparisons involving variables dispatched to eq_Int instead of eq_Float. The benchmark suite detected the regression, guided the fix, and validated the solution.

Result: Benchmark went from runtime_error → PASSING ✅
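
As a rough analogy (simplified Python, not AILANG's actual instance-selection machinery), the bug class looks like an equality instance being chosen from a defaulted type rather than the value's actual type:

```python
def eq_Int(a, b):
    return int(a) == int(b)          # integer equality; truncates floats

def eq_Float(a, b):
    return abs(a - b) < 1e-9         # float equality with a tolerance

instances = {"Int": eq_Int, "Float": eq_Float}

def buggy_eq(a, b):
    # Bug analogue: the comparison defaults to the Int instance regardless of operand type.
    return instances["Int"](a, b)

def fixed_eq(a, b):
    # Fix analogue: dispatch on the operands' actual type.
    key = "Float" if isinstance(a, float) or isinstance(b, float) else "Int"
    return instances[key](a, b)

print(buggy_eq(0.5, 0.6))   # True  (wrong: both operands truncate to 0)
print(fixed_eq(0.5, 0.6))   # False (correct)
```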

Try It Yourself

Want to see AILANG in action?

Technical Details

Version: 0.3.14

Total Runs: 227

Generated: 2025-10-18 22:52:34

Model Performance Details

| Model | Runs | 0-Shot | Final | Avg Tokens | Cost/Run | Baseline |
|-------|------|--------|-------|------------|----------|----------|
| Claude Sonnet 4.5 | 42 | 64.3% | 71.4% | 2523 | $0.0090 | 0.3.14 |
| gpt5-mini | 31 | 71.0% | 71.0% | 2396 | $0.0008 | 0.3.14 |
| gpt5 | 42 | 59.5% | 61.9% | 2267 | $0.0037 | 0.3.14 |
| Gemini 2.5 Pro | 39 | 61.5% | 61.5% | 2150 | $0.0041 | 0.3.14 |
| claude-haiku-4-5 | 42 | 50.0% | 59.5% | 2633 | $0.0033 | 0.3.14 |
| gemini-2-5-flash | 31 | 51.6% | 58.1% | 2676 | $0.0010 | 0.3.14 |

Benchmark Details

| Benchmark | Success Rate | Avg Tokens | Languages |
|-----------|--------------|------------|-----------|
| ✅ Fizzbuzz | 100.0% | 129 | ailang, python |
| ✅ Nested Records | 100.0% | 211 | ailang, python |
| ✅ Records Person | 100.0% | 120 | ailang, python |
| ✅ Simple Print | 100.0% | 20 | python |
| ✅ String Manipulation | 100.0% | 101 | ailang, python |
| ⚠️ Adt Option | 91.7% | 273 | ailang, python |
| ⚠️ Pattern Matching Complex | 91.7% | 378 | ailang, python |
| ⚠️ Recursion Fibonacci | 91.7% | 84 | ailang, python |
| ⚠️ Recursion Factorial | 83.3% | 80 | ailang, python |
| ⚠️ Error Handling | 80.0% | 552 | ailang, python |
| ⚠️ Targeted Repair Test | 80.0% | 53 | ailang |
| ⚠️ Record Update | 58.3% | 160 | ailang, python |
| ⚠️ Higher Order Functions | 55.6% | 171 | ailang, python |
| ⚠️ Json Encode | 50.0% | 95 | ailang, python |
| ⚠️ Json Parse | 50.0% | 88 | ailang, python |
| ⚠️ List Operations | 50.0% | 243 | ailang, python |
| ❌ Numeric Modulo | 45.5% | 30 | ailang, python |
| ❌ Api Call Json | 33.3% | 131 | ailang, python |
| ❌ List Comprehension | 33.3% | 296 | ailang, python |
| ❌ Float Eq | 18.2% | 23 | ailang, python |
| ❌ Cli Args | 0.0% | 128 | ailang, python |
| ❌ Pipeline | 0.0% | 52 | ailang, python |

Methodology: Benchmarks use deterministic seeds across multiple AI models. Each benchmark tests code generation, compilation, and execution. The M-EVAL-LOOP system provides structured error feedback for automatic repair.
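
To make the methodology concrete, a run of the full model × benchmark matrix could be sketched as below; the seed value and object interfaces are assumptions for illustration, not the suite's actual code.

```python
SEED = 42  # illustrative fixed seed; the suite's actual seed values are not listed here

def run_suite(models, benchmarks):
    """Run every benchmark against every model with deterministic generation (sketch only)."""
    rows = []
    for model in models:
        for bench in benchmarks:
            code = model.generate(bench.prompt, seed=SEED)   # deterministic code generation
            result = bench.run(code)                         # compile and execute the program
            rows.append({
                "model": model.name,
                "benchmark": bench.name,
                "ok": result.ok,
                "tokens": result.tokens,
            })
    return rows
```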

Learn More: M-EVAL-LOOP Design | Evaluation Guide