Python 76.2%
JavaScript 14.2%
CSS 8.6%
HTML 1%

kevindowling dd9cd07015 removed redundant line in gitignore		2026-06-02 07:12:58 -04:00
engine	first commit	2026-05-23 21:53:15 -04:00
replays	Add core functionality for Balatro benchmark framework	2026-06-02 07:11:44 -04:00
src	Add core functionality for Balatro benchmark framework	2026-06-02 07:11:44 -04:00
tests	Add core functionality for Balatro benchmark framework	2026-06-02 07:11:44 -04:00
viewer	first commit	2026-05-23 21:53:15 -04:00
.gitignore	removed redundant line in gitignore	2026-06-02 07:12:58 -04:00
hatch_build.py	Add core functionality for Balatro benchmark framework	2026-06-02 07:11:44 -04:00
pyproject.toml	Add core functionality for Balatro benchmark framework	2026-06-02 07:11:44 -04:00
README.md	Add core functionality for Balatro benchmark framework	2026-06-02 07:11:44 -04:00

README.md

balatro-bench

A long-horizon LLM planning benchmark built on a from-scratch re-implementation of Balatro's game engine. Stdlib-only Python. No UI on the engine itself — a benchmark harness drives the state machine from a model (or a deterministic baseline) and records a full JSONL replay that a browser viewer can scrub.

The engine is config-driven: jokers, consumables, decks, vouchers, blinds, etc. live in data/*.json and are validated against schemas/*.schema.json at load.
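As a rough illustration of the load-time validation described above, here is a minimal stdlib-only sketch. The schema contents, the row fields (`id`, `name`, `effect`, `params`), and the `load_rows` helper are all hypothetical; the real loader lives under src/balatro/content/ and presumably checks far more than required keys.

```python
import json
from types import MappingProxyType

# Hypothetical schema and content, inlined so the sketch runs without files.
JOKER_SCHEMA = {"required": ["id", "name", "effect"]}
JOKERS_JSON = """
[
  {"id": "joker.basic", "name": "Joker",
   "effect": "template.joker.flat_mult", "params": {"mult": 4}}
]
"""


def load_rows(raw, schema):
    """Parse JSON rows and reject any row that lacks a required key."""
    rows = json.loads(raw)
    for row in rows:
        missing = [key for key in schema["required"] if key not in row]
        if missing:
            raise ValueError(f"row {row.get('id', '?')} is missing {missing}")
    # Read-only view keyed by id, standing in for an immutable registry.
    return MappingProxyType({row["id"]: row for row in rows})


registry = load_rows(JOKERS_JSON, JOKER_SCHEMA)
print(registry["joker.basic"]["effect"])
```

Failing at load rather than at first use means a malformed content row surfaces before any run starts.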

Layout

src/
  balatro/        engine (stdlib Python)
    core/          card, deck, joker, enums, ids
    scoring/       pipeline, hand_eval, card_eval
    run/           state machine, actions, lifecycle, shop, modifiers
    effects/       per-owner handlers + templates
    shop/          generation, economy, booster_pack
    content/       JSON+schema loader, immutable registry
    rng/           seeded, addressable
    events/        event bus
  bench/          LLM harness — see src/bench/README.md
    models/        Anthropic native, OpenAI-compatible
    baselines/     greedy heuristic
    runner.py, prompt.py, obs.py, actions_io.py, replay.py, score.py

data/           JSON content (jokers, consumables, blinds, decks, ...)
schemas/        JSON Schemas (enforced at load)
engine/         Design specs (.md); predate the Python impl but remain useful
                as a reading guide to conventions, scoring order, RNG paths
replays/        JSONL replays from past runs
viewer/         Browser replay viewer (index.html + app.js, no build step)
tests/          Smoke tests (stdlib unittest)

Quickstart

# Smoke-test the engine — deterministic run, no API
PYTHONPATH=src python -m balatro

# Greedy baseline through the bench harness
PYTHONPATH=src python -m balatro_bench --model greedy --seed 42

# Play yourself (stdin REPL)
PYTHONPATH=src python -m balatro_bench --model human --seed 42

# Real model
ANTHROPIC_API_KEY=sk-... \
  PYTHONPATH=src python -m balatro_bench --model claude-opus-4-7 \
    --compare-with greedy --seeds 0-19

# Tests
PYTHONPATH=src python -m unittest tests.test_smoke

Full harness docs — providers, adapters, observation format, scratchpad, replay event schema, metrics — live in src/bench/README.md.

What the benchmark measures

Long-horizon planning in a stochastic, partially-observable environment where economy, joker synergy, and hand levels compound across antes.

Three outcomes:

Defeat — ran out of hands below the chip requirement.
Victory (ante 8 cleared) — finished the standard run.
Victory (naninf) — score overflowed to +inf. The competitive Balatro win condition. With --endless, this becomes the only win condition: chip requirements scale via the canonical endless formula and the run continues until defeat or overflow.
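The three outcomes can be sketched as a small classifier. This is an illustration only: the `Outcome` enum, the `classify` function, and its parameters are hypothetical names, and the harness may decide outcomes differently (for example, mid-run rather than at the end).

```python
import math
from enum import Enum


class Outcome(Enum):
    DEFEAT = "defeat"
    VICTORY_ANTE_8 = "victory_ante_8"
    VICTORY_NANINF = "victory_naninf"


def classify(final_score, antes_cleared, endless=False):
    """Classify a finished run (hypothetical helper, not the harness API)."""
    # Overflow to +inf counts as a win in both standard and endless mode.
    if math.isinf(final_score) and final_score > 0:
        return Outcome.VICTORY_NANINF
    # Clearing ante 8 only counts as a win outside endless mode.
    if not endless and antes_cleared >= 8:
        return Outcome.VICTORY_ANTE_8
    return Outcome.DEFEAT


# Float multiplication overflows to inf instead of raising.
overflowed = 1e308 * 10.0
print(classify(overflowed, antes_cleared=12, endless=True))
print(classify(5.0e4, antes_cleared=8))
print(classify(5.0e4, antes_cleared=8, endless=True))
```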

Replay viewer

Open viewer/index.html in a browser and load any replays/*.jsonl file. The viewer scrubs turn by turn, showing the observation, model output, parsed action, and emitted events for each turn. No build step, no server: pure static assets.

The two-axis idea

Engine code splits into mechanisms (general, in modules under src/balatro/) and effects (specific, indexed by a three-segment string id of the form <scope>.<owner>.<name>).

Adding most new jokers is JSON-only:

+4 Mult joker → one JSON row, effect: "template.joker.flat_mult"
+3 Mult per Diamond → one JSON row, template.joker.per_card
Something a template can't express → one JSON row pointing at custom.joker.<name> + a small handler in src/balatro/effects/.

The same pattern covers consumables, boss blinds, vouchers, and tags.
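One way to picture the split is a handler table keyed by effect id, with JSON rows supplying only parameters. The sketch below is an assumption about shape, not the engine's code: the `effect` decorator, the `HANDLERS` dict, the `params` field, and the `ctx` dict are all hypothetical.

```python
HANDLERS = {}


def effect(effect_id):
    """Register a handler under a three-segment id: <scope>.<owner>.<name>."""
    scope, owner, name = effect_id.split(".")  # raises unless exactly 3 segments

    def register(fn):
        HANDLERS[effect_id] = fn
        return fn

    return register


@effect("template.joker.flat_mult")
def flat_mult(params, ctx):
    ctx["mult"] += params["mult"]


@effect("template.joker.per_card")
def per_card(params, ctx):
    matches = sum(1 for card in ctx["scored_cards"] if card["suit"] == params["suit"])
    ctx["mult"] += params["mult"] * matches


# Two "JSON rows" (field names are illustrative): content only, no new code.
rows = [
    {"name": "+4 Mult", "effect": "template.joker.flat_mult",
     "params": {"mult": 4}},
    {"name": "+3 Mult per Diamond", "effect": "template.joker.per_card",
     "params": {"mult": 3, "suit": "diamonds"}},
]

ctx = {
    "mult": 1,
    "scored_cards": [{"suit": "diamonds"}, {"suit": "diamonds"}, {"suit": "spades"}],
}
for row in rows:
    HANDLERS[row["effect"]](row["params"], ctx)

print(ctx["mult"])  # 1 + 4 + 3 * 2 = 11
```

Under this shape, a custom.joker.<name> effect would be one more registered function plus one more row.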

Scoring is order-sensitive

Per-card resolution order: rank → enhancement → edition → seal → jokers. Joker resolution: one ordered pass, left → right; slot order is the arithmetic order (no separate "additive then multiplicative" phase). Worked example in engine/scoring/pipeline.md.
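A toy calculation shows why slot order matters when additive and multiplicative jokers share one pass. The `score` function and the `(op, value)` tuples are a simplified stand-in, not the engine's pipeline.

```python
def score(chips, mult, jokers):
    """Apply jokers in slot order (left to right), then return chips * mult."""
    for op, value in jokers:
        if op == "add_mult":
            mult += value
        elif op == "x_mult":
            mult *= value
        else:
            raise ValueError(f"unknown op: {op}")
    return chips * mult


plus_four = ("add_mult", 4)
times_two = ("x_mult", 2)

# Same jokers, different slot order, different totals.
print(score(100, 2, [plus_four, times_two]))  # 100 * ((2 + 4) * 2) = 1200
print(score(100, 2, [times_two, plus_four]))  # 100 * ((2 * 2) + 4) = 800
```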

Determinism

Runs are deterministic from a seed. Every RNG roll is addressed by a tuple path (("shop", visit_index), ("glass_break", card_id, pass), …) so new RNG consumers don't shift existing rolls. Path catalog: engine/rng/seeded_rng.md.
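Path-addressed randomness can be sketched by deriving an independent generator from the seed and the path, so that a roll depends only on its address and not on how many other rolls happened first. The `rng_at` helper and the SHA-256 derivation are assumptions for illustration; engine/rng/seeded_rng.md describes the actual scheme.

```python
import hashlib
import random


def rng_at(seed, path):
    """Return a generator determined only by (seed, path).

    Uses SHA-256 rather than the built-in hash(), which is salted per
    process for strings and would break reproducibility across runs.
    """
    key = repr((seed, path)).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


first = rng_at(42, ("shop", 0)).random()

# A new RNG consumer at another path does not disturb the shop roll.
rng_at(42, ("glass_break", 17, 1)).random()

second = rng_at(42, ("shop", 0)).random()
print(first == second)  # True
```

With a single shared stream, by contrast, inserting one extra roll would shift every roll after it.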

A save is just the pair (seed, action_log); replaying the logged actions against the same seed reconstructs the full state.
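The save format works because state is a pure function of the seed and the actions taken. A toy version, with a made-up two-action game and a single sequential `random.Random` stream standing in for the engine's state machine and path-addressed RNG:

```python
import random


def step(state, action, rng):
    """Toy transition function; the real engine's actions are richer."""
    if action == "draw":
        return {**state, "hand": state["hand"] + [rng.randint(1, 13)]}
    if action == "discard":
        return {**state, "hand": state["hand"][:-1]}
    raise ValueError(f"unknown action: {action}")


def replay(seed, action_log):
    """Rebuild the final state from nothing but the seed and the actions."""
    rng = random.Random(seed)
    state = {"hand": []}
    for action in action_log:
        state = step(state, action, rng)
    return state


save = (42, ["draw", "draw", "discard", "draw"])
print(replay(*save) == replay(*save))  # True
```

No state snapshot needs to be stored, which keeps saves small and makes any divergence between two replays of the same save a sign of non-determinism.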