AILANG Vision
A language where AI-generated code is easier to debug, replay, and fix.
AILANG provides:
- Effect boundaries that constrain what code can do
- Structured traces that show exactly what happened
- Deterministic execution that enables replay and caching
- Inline tests as machine-readable specs
What Makes AILANG Different
1. Effects as Capability Boundaries (Not Just Types)
Normal compilers:
- `doAnything()` is fine as long as types line up
- Side effects are invisible to the type system
- Logs are ad-hoc; no semantic boundary between "pure logic" and "this talks to the database"
AILANG:
-- The type signature constrains what this function CAN do
let processUser: User -> ! {DB} UserRecord =
  \user. dbWrite(transformUser(user))

-- This CANNOT compile - no DB capability granted
let pureTransform: User -> UserRecord =
  \user. dbWrite(user) -- ERROR: Effect mismatch
Why this matters for AI:
- The model literally cannot "hallucinate" a network call in a pure function
- Effect boundaries are machine-checkable constraints, not just documentation
- The search space for valid implementations is genuinely smaller
2. Traces Are First-Class (Not Forensic Artifacts)
Normal languages give you:
- Arbitrary logs scattered throughout code
- Mutable globals and half-observable state
- Trace reconstruction that amounts to forensic art
AILANG gives you:
- A deterministic core plus effect calls as the interaction points
- Traces structured as pure steps + {effect invocations with typed inputs/outputs}
- Sliceable by effect type: "Show only DB writes in this function for this test"
Example trace slice:
Function: aggregateUsers
Step 1: map(transform, users) [pure]
Step 2: DB.read(userTable) [effect: DB]
Step 3: filter(isActive, _) [pure]
Step 4: DB.write(aggregates) [effect: DB]
-- Slice to DB effects only:
DB.read(userTable) -> [User×10]
DB.write(aggregates) -> [Agg×10]
Why this matters: When something goes wrong, you know exactly where to look. The model gets structured feedback ("DB.write called 10 times but expected 5 after dedupe"), not "it crashed somewhere."
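As a rough illustration of what consuming such a trace could look like on the tooling side, here is a minimal Go sketch. The `TraceEvent` shape and field names are assumptions for the example, not AILANG's actual trace schema.

```go
// Sketch: slicing a structured trace by effect type.
// The TraceEvent shape is hypothetical; the real trace schema may differ.
package main

import "fmt"

type TraceEvent struct {
	Step   int    // position in the run
	Kind   string // "pure" or "effect"
	Effect string // "DB", "IO", ...; empty for pure steps
	Op     string // operation description
}

// sliceByEffect keeps only the events for one effect type, e.g. "DB".
func sliceByEffect(trace []TraceEvent, effect string) []TraceEvent {
	var out []TraceEvent
	for _, ev := range trace {
		if ev.Kind == "effect" && ev.Effect == effect {
			out = append(out, ev)
		}
	}
	return out
}

func main() {
	trace := []TraceEvent{
		{Step: 1, Kind: "pure", Op: "map(transform, users)"},
		{Step: 2, Kind: "effect", Effect: "DB", Op: "read(userTable)"},
		{Step: 3, Kind: "pure", Op: "filter(isActive, _)"},
		{Step: 4, Kind: "effect", Effect: "DB", Op: "write(aggregates)"},
	}
	for _, ev := range sliceByEffect(trace, "DB") {
		fmt.Printf("step %d: DB.%s\n", ev.Step, ev.Op)
	}
}
```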
3. Inline Tests Are a Spec Surface (Not Magic)
A fair-sounding critique: "If the AI is smart enough to write a good test, it's smart enough to write correct code."
But that misunderstands the purpose:
Inline tests are not "AI spontaneously inventing the spec."
They are:
- A spec channel — Humans write/approve tests, or they're generated from API contracts, examples in docs, acceptance criteria
- A portable contract — Tests live next to the function, in the same language, which is what the AI sees and optimizes against
- Strengthened over time — Today's weak test can be improved by tomorrow's model or human reviewer
let dedupe: List[User] -> List[User] =
  \users. uniqueBy(_.id, users)

test dedupe {
  assert dedupe([{id: 1}, {id: 1}, {id: 2}]) == [{id: 1}, {id: 2}]
  assert len(dedupe(users)) <= len(users) -- property
}
Why this matters: The test isn't magic; it's a machine-readable spec that travels with the code and provides leverage for both humans and future models.
Better Feedback Loops
AI code generation is iterative. AILANG makes each iteration more productive:
| Traditional | AILANG |
|---|---|
| Arbitrary logs | Structured effect traces |
| "Something crashed" | "Effect mismatch at line 42" |
| Blind retry | Targeted fix |
The result: faster convergence to working code.
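To make the loop concrete, here is a minimal Go sketch of that iteration. The `generate`, `check`, and `repair` hooks are placeholders standing in for the model and the AILANG toolchain; none of them are real APIs.

```go
// Sketch of a generate / check / repair loop driven by structured feedback.
// The hooks are placeholders, not AILANG APIs.
package feedback

type Result struct {
	OK      bool
	Message string // e.g. "Effect mismatch at line 42: DB capability not granted"
}

// Converge retries until the checker accepts the code or attempts run out.
// A structured Message lets each retry be a targeted fix, not a blind one.
func Converge(
	spec string,
	maxAttempts int,
	generate func(spec string) string,
	check func(code string) Result,
	repair func(code string, r Result) string,
) (code string, ok bool) {
	code = generate(spec)
	for i := 0; i < maxAttempts; i++ {
		r := check(code)
		if r.OK {
			return code, true
		}
		code = repair(code, r)
	}
	return code, false
}
```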
What Effects Actually Buy You
The effect system isn't just "types for side effects." It's a capability constraint surface.
Example: The Missed Dedupe Problem
A human asks: "Aggregate users from the database and write a summary."
In Python/Go:
def aggregate_users():
    users = db.read_all()             # hidden DB access
    # Oops, AI forgot to dedupe
    summary = compute_summary(users)
    db.write(summary)                 # hidden DB write
    log.info("Done")                  # ad-hoc logging
Debugging this:
- Where did it go wrong? Check logs (if you added them)
- What did the DB see? Unknown without tracing infrastructure
- Can you replay? Only if you captured all inputs somewhere
In AILANG:
let aggregateUsers: () -> ! {DB, IO} Summary =
  let users = dbReadAll()
  -- AI forgot dedupe here
  let summary = computeSummary(users)
  let _ = dbWrite(summary)
  let _ = print("Done")
  summary
-- Trace output:
-- DB.readAll -> [User×100]
-- DB.write(Summary{count: 100}) -- Wait, why 100? Should be unique count
Debugging this:
- Effect trace shows exactly what DB operations happened
- You can see: "read 100 users, wrote 100 aggregates" — the bug is obvious
- Replay: deterministic, just re-run with same inputs
The AI gets structured feedback:
Expected: DB.write(Summary{count: <unique_users>})
Actual: DB.write(Summary{count: 100})
Hint: No dedupe operation between DB.read and computeSummary
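One way such feedback could be produced is by diffing the expected effect trace against the observed one. The sketch below is a Go illustration under that assumption; the `EffectCall` shape is hypothetical and this is not AILANG's actual mechanism.

```go
// Sketch: derive structured feedback by diffing expected vs. observed
// effect calls. The EffectCall shape is hypothetical.
package tracecheck

import "fmt"

type EffectCall struct {
	Effect string // "DB", "IO", ...
	Op     string // e.g. "write(Summary{count: 100})"
}

// Diff returns one message per position where the observed effect calls
// diverge from the expected ones.
func Diff(expected, actual []EffectCall) []string {
	var msgs []string
	n := len(expected)
	if len(actual) > n {
		n = len(actual)
	}
	for i := 0; i < n; i++ {
		switch {
		case i >= len(actual):
			msgs = append(msgs, fmt.Sprintf("missing: %s.%s", expected[i].Effect, expected[i].Op))
		case i >= len(expected):
			msgs = append(msgs, fmt.Sprintf("unexpected: %s.%s", actual[i].Effect, actual[i].Op))
		case expected[i] != actual[i]:
			msgs = append(msgs, fmt.Sprintf("expected %s.%s, got %s.%s",
				expected[i].Effect, expected[i].Op, actual[i].Effect, actual[i].Op))
		}
	}
	return msgs
}
```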
Benchmark Results — M-EVAL
We continuously track AI code generation success across 46 benchmarks with Claude, GPT, and Gemini.
Key metrics:
- Zero-Shot: Code works on first try
- Final Success: Code works after self-repair
- Agent Success: Multi-turn agent completes task
The jump from zero-shot to agent mode shows how structured error feedback helps models self-correct.
See the Benchmark Dashboard for live results, trends, and per-model breakdowns.
Current Capabilities — v0.5.x
AILANG today provides:
- Algebraic Effects — `! {IO, FS, Net, DB}` declares capabilities in types
- Deterministic Core — pure functions are referentially transparent
- Structured Traces — effect calls logged with typed inputs/outputs
- Inline Tests — specs travel with code, machine-readable
- Go Codegen — compile to native performance
Roadmap — Where We're Going
v0.6: Execution Profiles
Formalize the three execution modes:
| Profile | Entry Shape | Use Case |
|---|---|---|
| SimProfile | step(World, Input) -> (World, Output) | Simulations, games, RL |
| ServiceProfile | handle(Request) -> Response | Microservices, agents |
| CliProfile | main(args) -> () | CLI tools |
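For intuition, the three entry shapes roughly correspond to the following Go signatures. This is a conceptual sketch only, not the actual codegen output.

```go
// Conceptual Go equivalents of the three profile entry shapes (illustrative only).
package profiles

// SimProfile entry: step(World, Input) -> (World, Output)
type StepFn[World, Input, Output any] func(World, Input) (World, Output)

// ServiceProfile entry: handle(Request) -> Response
type HandleFn[Request, Response any] func(Request) Response

// CliProfile entry: main(args) -> ()
type MainFn func(args []string)
```

The world-in/world-out shape of SimProfile is what keeps simulations replayable: all state flows through the function, so the same inputs always produce the same outputs.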
v0.7: Deterministic Tooling
AI-friendly code transformation tools:
- `ailang normalize` — canonical form for semantic comparison
- `ailang suggest-imports` — auto-fix missing imports
- `ailang apply` — structured code edits
v0.8: Shared Semantic State
Multi-agent coordination through language-level shared memory:
- Semantic caching keyed by (problem + types + tests)
- CAS-based coordination for deterministic updates
- Effect-tracked caching patterns
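A minimal sketch of what a semantic cache with CAS-based coordination might look like, assuming a key derived by hashing the problem statement, type signature, and tests together. The API here is illustrative, not the planned design.

```go
// Sketch: semantic cache keyed by (problem + types + tests) with
// compare-and-swap updates. Key scheme and API are assumptions.
package semcache

import (
	"crypto/sha256"
	"encoding/hex"
	"sync"
)

// Key hashes the problem statement, type signature, and tests together,
// so two agents solving the same problem hit the same cache entry.
func Key(problem, types, tests string) string {
	h := sha256.Sum256([]byte(problem + "\x00" + types + "\x00" + tests))
	return hex.EncodeToString(h[:])
}

type Cache struct {
	mu      sync.Mutex
	entries map[string]string
}

func New() *Cache { return &Cache{entries: make(map[string]string)} }

// CompareAndSwap installs next only if the current value equals prev
// (use prev == "" for "no entry yet"). Returns true if the write won.
func (c *Cache) CompareAndSwap(key, prev, next string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.entries[key] != prev {
		return false
	}
	c.entries[key] = next
	return true
}

func (c *Cache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.entries[key]
	return v, ok
}
```

With CAS semantics, the first agent to publish a result wins deterministically, and later agents read a consistent value instead of overwriting each other.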
What AILANG Deliberately Excludes
AILANG prioritizes machine reasoning over human ergonomics:
- ❌ LSP/IDE servers — AIs use CLI/API, not text editors
- ❌ Multiple syntaxes — one canonical way to express each concept
- ❌ Implicit behaviors — all effects are explicit
These aren't limitations — they're design choices that make the language more predictable and constrainable for AI generation.
Summary
AILANG gives you:
- Effect constraints — Models can't generate impossible side effects
- Structured traces — See exactly what happened, slice by effect type
- Better error signals — Specific feedback, not "it crashed"
- Deterministic replay — Same input, same output, every time
Get Involved
- Try it: See Getting Started
- Benchmark it: Run `ailang eval-suite` to test AI code generation
- Contribute: github.com/sunholo-data/ailang
AILANG — AI-first programming, done right.