Grammar Quest Backend: An Illustrated Tour
Grammar Quest is a language-learning game where a neural network redraws a comic-style scene from a literal reading of the player's English sentence. Correct grammar advances the story. A missing article blurs an object toward a generic silhouette; the wrong tense shifts a character across decades; subject-verb disagreement splits an entity into conflicting states. Errors accumulate across panels. The game runs at inkmorph.com.
This article walks the backend visually — diagrams where they help (the two-phase pipeline, data shapes, cache decisions, the eval harness, judge rubric, CEFR matrix) and text where they don't (schema enforcement, model map).
1. Two-phase request pipeline
The full pipeline runs four LLM calls plus per-entity image generation. To keep the UI responsive, the server returns as soon as Phase 1 finishes and continues Phase 2 in the background.
Phase 1 runs two parallel LLM calls: the Grammar Analyzer returns typed mistakes with severity scores (an ErrorMap); Scene Coverage returns which entities the player mentioned, which they missed, and a hint if the input is off-topic or too short. Both use temperature 0.1. The HTTP response ships when both finish.
Phase 2 runs three steps in a background worker: Unmentioned Instructions handles entities the player never mentioned (CEFR-dependent — at A2 they're left alone; at C2 they can be deleted); the Mutation Engine turns typed errors into per-entity changes (temperature 0.8); then per-entity sprite generation runs (parallel across entities — each with its own cache lookup and fallback image call). A semantic sprite cache sits in front of the image model to short-circuit repeat work.
Phase 1 calls use temperature 0.1; the Mutation Engine in Phase 2 uses 0.8. The panel's server-side status walks canonical → mutating → mutated (simplified — there is also a sprite_failed terminal state on image-gen failures, and the cycle can repeat for pre/post-mutation on the same panel). The client polls the panel endpoint and swaps its display once the status reaches mutated.
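The phase split can be sketched as an async request handler. This is a minimal sketch, assuming asyncio; `analyze_grammar`, `scene_coverage`, and `run_phase2` are hypothetical stand-ins for the real LLM calls, not names from the codebase:

```python
import asyncio

# Hypothetical stand-ins for the real Phase 1 LLM calls (temperature 0.1).
async def analyze_grammar(text: str) -> dict:
    return {"mistakes": []}          # ErrorMap: typed mistakes with severities

async def scene_coverage(text: str) -> dict:
    return {"mentioned": [], "missed": [], "hint": None}

async def run_phase2(error_map: dict, coverage: dict) -> None:
    ...  # Unmentioned Instructions -> Mutation Engine -> per-entity sprites

async def handle_submission(text: str) -> dict:
    # Phase 1: two parallel calls; the HTTP response ships when both finish.
    error_map, coverage = await asyncio.gather(
        analyze_grammar(text), scene_coverage(text)
    )
    # Phase 2 continues in the background; the client polls panel status
    # until it reaches "mutated".
    asyncio.create_task(run_phase2(error_map, coverage))
    return {"errors": error_map, "coverage": coverage}
```

The key property is that the return happens before Phase 2 runs, which is why the panel needs the canonical → mutating → mutated status walk at all.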
2. Data around the Mutation Engine
The Mutation Engine does not receive prose. Its input is a structured JSON object assembled on each submission; its output is a strict ApplyResponse. The shape of both sides is below.
The effects string is both read as input (from the prior panel) and written as output (to seed the next one); the dashed loop in the diagram shows the carry-over. The other input fields (marked_text, mistakes[], panel_context) come from Phase 1 or the quest definition. On the output side, applied_errors[] records which mistakes drove each change, and scene_state summarizes overall damage (normal on a clean turn, collapsed on a run-ending one). The deleted flag removes the entity from the scene; fallen marks an entity whose support was removed (for example, a teacup on a desk that just got deleted) so the sprite compositor can offset its position. Both get written into changes[] alongside the usual edit instructions.
Only the narration, explanation, and hint fields render in the player's language; correction, fragment, edit_instruction, and effects always stay English because the image model downstream expects English instructions.
3. Effects accumulate across panels
Each entity carries a short string describing what it has become. The Mutation Engine reads the previous value as input, can extend or replace it, and the new value is stored for the next panel. This is how mutations compound without re-sending entity history on every request — traced across three panels of the same game below.
The effects string is extended in panel 2 (new error) and left untouched in panel 3 (no errors). Prior state persists even when the player writes cleanly.
4. Structured outputs, enforced
Every OpenAI-compatible LLM call (OpenRouter, Cerebras) uses response_format: json_schema with strict: true; the Anthropic fallback goes through forced tool-use (tools + tool_choice) instead. Schemas are either Pydantic models passed through a _strict_schema() pass that sets additionalProperties: false on every object and marks every field required, or hand-written JSON schemas with the same constraints baked in.
    from pydantic import BaseModel

    class EntityChange(BaseModel):
        game_entity_id: str
        edit_instruction: str
        applied_errors: list[str]
        effects: str
        deleted: bool = False
        fallen: bool = False

    class ApplyResponse(BaseModel):
        changes: list[EntityChange]
        scene_state: str  # "normal" | "degraded" | "collapsed"
        narration: str
Optional fields and extra keys are treated as bugs at parse time. A common wrapper, call_llm_structured(), selects the provider, retries transient errors (408, 429, 5xx), and validates the response against the schema.
5. Model map
| Role | Model | Provider |
|---|---|---|
| Grammar Analyzer, Scene Coverage, Mutation Engine | gpt-oss-120b (preset gpt-oss-120b-cerebras) | Cerebras |
| Alternative text presets (via MODEL_PRESET) | google/gemini-2.5-flash | OpenRouter |
| | claude-haiku-4-5-20251001 | Anthropic |
| Sprite mutation + alpha mask | gemini-3.1-flash-image-preview | OpenRouter |
| Sprite cache embeddings | text-embedding-3-small | OpenRouter |
| Eval judge (eval harness only) | Claude Opus | Anthropic via claude -p |
Foreground sprites are mutated in two image-model calls: a mutation call at temperature=2.0 producing a sprite on a black background, followed by an alpha-mask call at temperature=0.0. Background scenes (where z = -1) skip the alpha step and use a single call, since the background is full-frame opaque.
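That branching can be sketched in a few lines. `call_image_model` here is a hypothetical placeholder for the real OpenRouter image call, not an actual API:

```python
def call_image_model(prompt: str, temperature: float) -> bytes:
    # Placeholder for the real image-model call via OpenRouter.
    return b"<png bytes>"

def mutate_sprite(entity: dict, instruction: str) -> dict:
    """Two-call flow for foregrounds; single call for backgrounds."""
    # Mutation call at temperature 2.0, producing a sprite on a black background.
    sprite = call_image_model(instruction, temperature=2.0)
    if entity.get("z") == -1:
        # Background scene: full-frame opaque, no alpha mask needed.
        return {"sprite": sprite, "mask": None}
    # Foreground: deterministic alpha-mask call at temperature 0.0.
    mask = call_image_model(f"alpha mask for: {instruction}", temperature=0.0)
    return {"sprite": sprite, "mask": mask}
```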
6. Sprite cache decision path
Sprite generation is the expensive step. The cache key is semantic, not hash-based: two different strings with the same meaning should hit the same sprite. The lookup flow:
The 0.35 threshold lives as _SPRITE_DISTANCE_THRESHOLD in server/game_db.py. On a miss, the newly generated sprite is embedded and stored, so subsequent near-matches become hits.

A quest manifest's meta.sprite_group lets quests share a cache bucket: the lookup's quest_id component is actually meta.sprite_group or meta.id, so a demo quest can alias to a production quest and inherit its cache.
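The semantic lookup reduces to a nearest-neighbor search under a distance cutoff. A minimal sketch, assuming cosine distance over text-embedding vectors and an in-memory cache (the real store and `lookup_sprite` helper name are assumptions):

```python
import math

_SPRITE_DISTANCE_THRESHOLD = 0.35  # value from server/game_db.py

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def lookup_sprite(query_vec: list[float], cache: list[tuple[list[float], str]]):
    """cache: list of (embedding, sprite). Return the nearest sprite
    within the threshold, else None (caller generates, embeds, stores)."""
    if not cache:
        return None
    best = min(cache, key=lambda row: cosine_distance(query_vec, row[0]))
    if cosine_distance(query_vec, best[0]) <= _SPRITE_DISTANCE_THRESHOLD:
        return best[1]
    return None
```

Two differently worded instructions with near-identical embeddings land on the same cached sprite, which is the whole point of keying by meaning rather than by string hash.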
7. The eval harness
Prompt changes are gated by an evaluation harness at tests/mutation_evals/run_eval.py against a frozen dataset of 53 hand-written cases. Two layers, shown below: structural checkers (pure functions) and an LLM judge (Claude Opus via claude -p).
The six structural checkers:
- check_binding — every mistake must land on the correct entity.
- check_edit_instruction_quality — rejects weak words (slightly, subtly, faintly, gently, barely, a bit).
- check_effects_quality — rejects appearance-only descriptions (effects must describe what the entity has become, not what it looks like).
- check_entity_isolation — flags when an edit_instruction references other entities (they would leak into the target sprite at render time).
- check_expected_cascades — physical dependencies must propagate (items on a deleted support fall).
- check_expected_scene_state — scene_state (normal/degraded/collapsed) matches expected severity.
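Structural checkers are pure functions over an EntityChange, so they run without any LLM call. A sketch of one of them under assumed signatures (returning a list of violation strings; the real return shape may differ):

```python
# Weak words listed in the checker description above.
WEAK_WORDS = {"slightly", "subtly", "faintly", "gently", "barely", "a bit"}

def check_edit_instruction_quality(change: dict) -> list[str]:
    """Reject hedging words that would produce a visually weak mutation.
    Returns a list of violations; an empty list means the check passed."""
    text = change["edit_instruction"].lower()
    return [f"weak word: {w!r}" for w in sorted(WEAK_WORDS) if w in text]
```

Because these are deterministic, a regression here fails the harness immediately, before any judge scoring is spent.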
8. Judge dimensions
The judge scores every EntityChange on four integer dimensions, each on a 1–5 scale. Individual scores matter less than the distribution shift between runs. The radar below plots an example scoring against the rubric.
An example EntityChange scoring (relevance 4, intensity 3, essence 4, coherence 4). In practice, what matters is the distribution shift across all cases between runs, not any single point. The judge's rubric pins one canonical visual direction per error type (missing article → identity loss; wrong tense → temporal shift; etc.), so relevance is unambiguous. The harness runs judge calls in a ThreadPoolExecutor across test cases, not inside a single call.
9. CEFR as a prompt switch
Difficulty level (A2–C2) is a single global setting that selects different Grammar Analyzer system prompts and different Mutation Engine deletion rules. No extra runtime cost — it's a dictionary lookup per request. The matrix below shows each knob per level.
At B2, "tiny prop" means a small peripheral object (a teacup, a lamp) — deletions at this level are rare and cosmetic. At C2 the analyzer enters a "Devil's Advocate" posture: if a sentence can be parsed two ways, it picks the reading with an error. Any ambiguity is treated as an error signal, so the only defense at this level is precise prose. (C1 is "very strict" but does not flip on Devil's Advocate in the current prompt — the adversarial posture is a C2-specific escalation.)
The "Analyzer stance" column names the Grammar Analyzer system prompt in use at each level: lenient (A2) ignores most ambiguous constructions; moderate (B1) flags the obvious and a few ambiguous cases; strict (B2) flags what a B2 learner should catch; very strict (C1) ignores only pure stylistic variants; maximum strictness (C2) layers Devil's Advocate on top.
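The "dictionary lookup per request" claim is literal. A sketch of the switch, with hypothetical names and prompt text standing in for the real server-side prompts:

```python
# Hypothetical mapping; the real system prompts live server-side.
ANALYZER_STANCE = {
    "A2": "lenient",             # ignores most ambiguous constructions
    "B1": "moderate",            # obvious errors plus a few ambiguous cases
    "B2": "strict",              # what a B2 learner should catch
    "C1": "very strict",         # ignores only pure stylistic variants
    "C2": "maximum strictness",  # Devil's Advocate layered on top
}

def analyzer_prompt(level: str) -> str:
    # One dict lookup per request: no extra runtime cost for difficulty.
    return f"You are a {ANALYZER_STANCE[level]} grammar analyzer."
```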
Summary
Four load-bearing decisions show up across the diagrams: split the pipeline into a fast user-facing phase and a slower background phase; carry per-entity state as a persisted effects string so the model doesn't re-invent history; cache expensive sprite output by semantic similarity; gate prompt changes on before/after diffs of a judge-scored dataset. The rest — model choices, field-level language routing, CEFR as a prompt switch — follows from those.