balatro-bench
A long-horizon LLM planning benchmark built on a from-scratch re-implementation of Balatro's game engine. Stdlib-only Python. No UI on the engine itself — a benchmark harness drives the state machine from a model (or a deterministic baseline) and records a full JSONL replay that a browser viewer can scrub.
The engine is config-driven: jokers, consumables, decks, vouchers,
blinds, etc. live in data/*.json and are validated against
schemas/*.schema.json at load.
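As a rough illustration of "validate at load, then freeze", here is a minimal stdlib-only sketch. The row shape, the toy `SCHEMA` format (required keys mapped to Python types), and the names `validate_row` and `load_registry` are all assumptions made for this example; the real files use JSON Schema and the engine's own loader in src/balatro/content/.

```python
import json
from types import MappingProxyType

# Hypothetical stand-ins for one data/*.json file and its schema.
# The real project uses JSON Schema files; this toy schema only lists
# required keys and the Python type each must have.
JOKERS_JSON = (
    '[{"id": "joker.basic", "effect": "template.joker.flat_mult",'
    ' "params": {"mult": 4}}]'
)
SCHEMA = {"required": {"id": str, "effect": str, "params": dict}}


def validate_row(row, schema):
    """Raise ValueError on the first missing key or wrongly typed value."""
    for key, expected in schema["required"].items():
        if key not in row:
            raise ValueError(f"missing key {key!r} in row {row!r}")
        if not isinstance(row[key], expected):
            raise ValueError(f"key {key!r} must be of type {expected.__name__}")


def load_registry(text, schema):
    """Parse JSON rows, validate each one, and return a read-only mapping by id."""
    registry = {}
    for row in json.loads(text):
        validate_row(row, schema)
        if row["id"] in registry:
            raise ValueError(f"duplicate id {row['id']!r}")
        # MappingProxyType gives a shallow read-only view of each row.
        registry[row["id"]] = MappingProxyType(row)
    return MappingProxyType(registry)


registry = load_registry(JOKERS_JSON, SCHEMA)
print(registry["joker.basic"]["effect"])
```

Failing fast at load time means a malformed content row surfaces as one clear error before any run starts, rather than as a scoring bug mid-game.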
Layout
src/
  balatro/        engine (stdlib Python)
    core/         card, deck, joker, enums, ids
    scoring/      pipeline, hand_eval, card_eval
    run/          state machine, actions, lifecycle, shop, modifiers
    effects/      per-owner handlers + templates
    shop/         generation, economy, booster_pack
    content/      JSON+schema loader, immutable registry
    rng/          seeded, addressable
    events/       event bus
  bench/          LLM harness (see src/bench/README.md)
    models/       Anthropic native, OpenAI-compatible
    baselines/    greedy heuristic
    runner.py, prompt.py, obs.py, actions_io.py, replay.py, score.py
  data/           JSON content (jokers, consumables, blinds, decks, ...)
  schemas/        JSON Schemas (enforced at load)
engine/           Design specs (.md); they predate the Python impl but are still
                  useful as a reading guide to conventions, scoring order, RNG paths
replays/          JSONL replays from past runs
viewer/           Browser replay viewer (index.html + app.js, no build step)
tests/            Smoke tests (stdlib unittest)
Quickstart
# Smoke-test the engine — deterministic run, no API
PYTHONPATH=src python -m balatro
# Greedy baseline through the bench harness
PYTHONPATH=src python -m balatro_bench --model greedy --seed 42
# Play yourself (stdin REPL)
PYTHONPATH=src python -m balatro_bench --model human --seed 42
# Real model
ANTHROPIC_API_KEY=sk-... \
PYTHONPATH=src python -m balatro_bench --model claude-opus-4-7 \
--compare-with greedy --seeds 0-19
# Tests
PYTHONPATH=src python -m unittest tests.test_smoke
Full harness docs — providers, adapters, observation format, scratchpad,
replay event schema, metrics — live in
src/bench/README.md.
What the benchmark measures
Long-horizon planning in a stochastic, partially-observable environment where economy, joker synergy, and hand levels compound across antes.
Three outcomes:
- Defeat — ran out of hands below the chip requirement.
- Victory (ante 8 cleared) — finished the standard run.
- Victory (naninf) — score overflowed to +inf. The competitive Balatro win condition. With --endless, this becomes the only win condition: chip requirements scale via the canonical endless formula and the run continues until defeat or overflow.
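The three outcomes can be sketched as a small classifier. This is only an illustration: the function name `classify_outcome`, its parameters, and the returned labels are hypothetical and are not the harness's actual API. It assumes scores are Python floats, so an overflowing multiplication yields `+inf` instead of raising.

```python
import math


def classify_outcome(final_score, antes_cleared, defeated, endless=False):
    """Map end-of-run facts to an outcome label (labels are hypothetical)."""
    # Overflow to +inf counts as a win in both standard and endless mode.
    if math.isinf(final_score) and final_score > 0:
        return "victory_naninf"
    if defeated:
        return "defeat"
    # Clearing ante 8 only ends the run when endless mode is off.
    if not endless and antes_cleared >= 8:
        return "victory_ante8"
    return "in_progress"


# Float multiplication past the largest double overflows to +inf.
overflowed = 1e308 * 10

print(classify_outcome(overflowed, antes_cleared=12, defeated=False, endless=True))
print(classify_outcome(300.0, antes_cleared=2, defeated=True))
print(classify_outcome(5e6, antes_cleared=8, defeated=False))
```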
Replay viewer
Open viewer/index.html in a browser and load any replays/*.jsonl
file. Scrubs turn-by-turn: observation, model output, parsed action,
events emitted. No build step, no server — pure static assets.
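Because a replay is plain JSONL (one JSON object per line), it can also be inspected outside the viewer. The sketch below is a generic JSONL reader; the record keys `turn` and `action` in the sample are made up for illustration, and the real event schema is documented in src/bench/README.md.

```python
import io
import json


def read_jsonl(stream):
    """Yield (line_number, record) for each non-blank line of a JSONL stream."""
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield line_no, json.loads(line)


# In-memory stand-in for a replay file; these keys are hypothetical.
sample = io.StringIO(
    '{"turn": 1, "action": "play"}\n'
    "\n"
    '{"turn": 2, "action": "discard"}\n'
)

records = [record for _, record in read_jsonl(sample)]
print(len(records))  # 2
```

For a file on disk, pass an open file object instead of the `io.StringIO` sample.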
The two-axis idea
Engine code splits into mechanisms (general, in modules under
src/balatro/) and effects (specific, indexed by a three-segment
string id of the form <scope>.<owner>.<name>).
Adding most new jokers is JSON-only:
- +4 Mult joker → one JSON row, effect: "template.joker.flat_mult"
- +3 Mult per Diamond → one JSON row, template.joker.per_card
- Something a template can't express → one JSON row pointing at custom.joker.<name> + a small handler in src/balatro/effects/.
The same pattern covers consumables, boss blinds, vouchers, and tags.
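One way to picture the split is a handler table keyed by effect id, with JSON rows supplying only parameters. Everything below is a sketch under assumptions: the `effect` decorator, the handler signature `(params, mult, played_cards)`, and the row fields are invented for illustration and may not match the engine's real interfaces.

```python
HANDLERS = {}


def effect(effect_id):
    """Register a handler under a three-segment id: <scope>.<owner>.<name>."""
    # Unpacking raises ValueError unless there are exactly three segments.
    scope, owner, name = effect_id.split(".")

    def register(fn):
        HANDLERS[effect_id] = fn
        return fn

    return register


@effect("template.joker.flat_mult")
def flat_mult(params, mult, played_cards):
    # Adds a fixed amount of Mult regardless of the cards played.
    return mult + params["mult"]


@effect("template.joker.per_card")
def per_card(params, mult, played_cards):
    # Adds Mult once for every played card of the configured suit.
    matches = sum(1 for card in played_cards if card["suit"] == params["suit"])
    return mult + params["mult"] * matches


# Hypothetical JSON rows: new jokers are data, not code.
rows = [
    {"id": "joker.plus_four", "effect": "template.joker.flat_mult",
     "params": {"mult": 4}},
    {"id": "joker.diamond", "effect": "template.joker.per_card",
     "params": {"mult": 3, "suit": "diamonds"}},
]

hand = [{"suit": "diamonds"}, {"suit": "diamonds"}, {"suit": "spades"}]
mult = 1
for row in rows:
    mult = HANDLERS[row["effect"]](row["params"], mult, hand)

print(mult)  # 1 + 4 + 3 * 2 = 11
```

In this sketch a custom handler would be registered the same way under an id such as `custom.joker.<name>`, so the lookup path stays identical for templates and one-offs.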
Scoring is order-sensitive
Per-card resolution order: rank → enhancement → edition → seal → jokers. Joker resolution: one ordered pass, left → right; slot order
is the arithmetic order (no separate "additive then multiplicative"
phase). Worked example in engine/scoring/pipeline.md.
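A toy calculation shows why slot order matters when additive and multiplicative jokers share a single left-to-right pass. The `("add_mult", n)` and `("x_mult", n)` tuples and the `score` function are simplifications invented for this example, not the engine's pipeline.

```python
def score(chips, mult, jokers):
    """Apply jokers in slot order (left to right), then return chips * mult."""
    for op, value in jokers:
        if op == "add_mult":
            mult += value
        elif op == "x_mult":
            mult *= value
        else:
            raise ValueError(f"unknown joker op {op!r}")
    return chips * mult


plus_four = ("add_mult", 4)
times_two = ("x_mult", 2)

# Additive joker first: 100 * ((1 + 4) * 2)
print(score(100, 1, [plus_four, times_two]))  # 1000

# Multiplicative joker first: 100 * ((1 * 2) + 4)
print(score(100, 1, [times_two, plus_four]))  # 600
```

The same two jokers give 1000 or 600 depending only on their slots, which is why the order of jokers is part of the planning problem.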
Determinism
Runs are deterministic from a seed. Every RNG roll is addressed by a
tuple path (("shop", visit_index), ("glass_break", card_id, pass),
…) so new RNG consumers don't shift existing rolls. Path catalog:
engine/rng/seeded_rng.md.
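The idea of path-addressed rolls can be sketched by deriving an independent generator from the pair (seed, path). The SHA-256 derivation and the name `rng_for` are assumptions chosen for this illustration; the engine's actual scheme is the one specified in engine/rng/seeded_rng.md.

```python
import hashlib
import random


def rng_for(seed, path):
    """Return a generator that depends only on (seed, path), not on call order."""
    # repr() of a tuple of ints and strings is stable, so it can serve as a key.
    key = repr((seed, path)).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


# Roll for the first shop visit.
first = rng_for(42, ("shop", 0)).random()

# An unrelated consumer rolls on its own path in between...
rng_for(42, ("glass_break", "card_17", 1)).random()

# ...and the shop roll is unchanged when requested again.
second = rng_for(42, ("shop", 0)).random()

print(first == second)  # True
```

Since each path gets its own stream, adding a new consumer cannot shift the numbers that existing consumers see, which keeps old seeds and replays reproducible.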
A save is (seed, action_log); replay reconstructs state.
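To illustrate why (seed, action_log) is enough for a save, here is a toy state machine whose state is rebuilt by re-applying the logged actions with a seeded generator. The `step` and `replay` functions and the "draw"/"discard" actions are hypothetical, and this sketch uses one sequential `random.Random` rather than the engine's path-addressed rolls.

```python
import random


def step(state, action, rng):
    """Toy transition: 'draw' appends a seeded card rank, 'discard' drops the last."""
    if action == "draw":
        return state + (rng.randint(1, 13),)
    if action == "discard":
        return state[:-1]
    raise ValueError(f"unknown action {action!r}")


def replay(seed, action_log):
    """Rebuild the final state from nothing but the seed and the action log."""
    rng = random.Random(seed)
    state = ()
    for action in action_log:
        state = step(state, action, rng)
    return state


save = (42, ["draw", "draw", "discard", "draw"])

# Replaying the same save twice reconstructs the same state.
print(replay(*save) == replay(*save))  # True
```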