
Grammar Quest Backend: An Illustrated Tour

April 20, 2026

Grammar Quest is a language-learning game where a neural network takes the player's English sentence literally and redraws a comic-style scene from it. Correct grammar advances the story. A missing article blurs an object toward a generic silhouette; the wrong tense shifts a character across decades; subject-verb disagreement splits an entity into conflicting states. Errors accumulate across panels. The game runs at inkmorph.com.

This article walks the backend visually — diagrams where they help (the two-phase pipeline, data shapes, cache decisions, the eval harness, judge rubric, CEFR matrix) and text where they don't (schema enforcement, model map).

1. Two-phase request pipeline

The full pipeline runs four LLM calls plus per-entity image generation. To keep the UI responsive, the server returns as soon as Phase 1 finishes and continues Phase 2 in the background.

Phase 1 runs two parallel LLM calls: the Grammar Analyzer returns typed mistakes with severity scores (an ErrorMap); Scene Coverage returns which entities the player mentioned, which they missed, and a hint if the input is off-topic or too short. Both use temperature 0.1. The HTTP response ships when both finish.

Phase 2 runs three steps in a background worker: Unmentioned Instructions handles entities the player never mentioned (CEFR-dependent — at A2 they're left alone; at C2 they can be deleted); the Mutation Engine turns typed errors into per-entity changes (temperature 0.8); then per-entity sprite generation runs (parallel across entities — each with its own cache lookup and fallback image call). A semantic sprite cache sits in front of the image model to short-circuit repeat work.
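The phase split can be sketched in a few lines of asyncio. This is an illustrative stand-in, not the production code — the handler names and return shapes are assumptions; the point is that Phase 1's two calls run concurrently and Phase 2 is detached before the response is built.

```python
import asyncio

async def analyze_grammar(text: str) -> dict:
    return {"mistakes": []}  # stand-in for the Grammar Analyzer call (temp 0.1)

async def check_coverage(text: str) -> dict:
    return {"covered": [], "missing": [], "hint": None}  # stand-in for Scene Coverage

async def run_phase2(text: str, errors: dict) -> None:
    await asyncio.sleep(0)  # stand-in for mutation + per-entity sprite generation

async def handle_submission(text: str) -> dict:
    # Phase 1: both LLM calls in parallel; the response waits only on these.
    errors, coverage = await asyncio.gather(
        analyze_grammar(text),
        check_coverage(text),
    )
    # Phase 2: fire-and-forget background task; the client polls for its result.
    asyncio.create_task(run_phase2(text, errors))
    return {"mistakes": errors["mistakes"], "hint": coverage["hint"]}
```

A real server would hand `run_phase2` to its worker machinery rather than a bare task, but the ordering constraint is the same: nothing in Phase 2 blocks the HTTP response.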

[Diagram: two-phase pipeline. Phase 1 (sync): player submits a sentence → Grammar Analyzer (ErrorMap, temp 0.1) and Scene Coverage (covered / missing / hint) run in parallel → HTTP response with mistakes + hint. Phase 2 (background): Unmentioned Instructions (per-CEFR adjust / delete) → Mutation Engine (changes, temp 0.8) → sprite cache (pgvector) → Gemini 3.1 Flash Image generation on cache miss → mutated panel, which the client polls for.]
Two phases exist so the HTTP response doesn't wait on image generation. Phase 1 returns as soon as Grammar Analyzer and Scene Coverage finish; Phase 2 continues in the background, with the sprite cache short-circuiting the expensive image model on hits. The client polls for the final mutated state.

Phase 1 calls use temperature 0.1; the Mutation Engine in Phase 2 uses 0.8. The panel's server-side status walks canonical → mutating → mutated (simplified — there is also a sprite_failed terminal state on image-gen failures, and the cycle can repeat for pre/post-mutation on the same panel). The client polls the panel endpoint and swaps its display once the status reaches mutated.

2. Data around the Mutation Engine

The Mutation Engine does not receive prose. Its input is a structured JSON object assembled on each submission; its output is a strict ApplyResponse. The shape of both sides is below.

[Diagram: Mutation Engine I/O. Input (JSON): marked_text (string); mistakes[] (id, type, fragment, severity, correction); panel_context (string); entities{} (description, effects — persisted previous-panel state carried forward). Engine: gpt-oss-120b on Cerebras, temp 0.8, strict JSON schema. Output (ApplyResponse): changes[] (game_entity_id, edit_instruction, applied_errors[], effects / new state, deleted, fallen); scene_state (normal / degraded / collapsed); narration (localized string). The new effects are persisted and seed the next panel's input.]
Yellow fields are the persisted effects string — both read as input (from the prior panel) and written as output (to seed the next one). The dashed loop shows the carry-over. The other input fields (marked_text, mistakes[], panel_context) come from Phase 1 or the quest definition; on the output side applied_errors[] records which mistakes drove each change and scene_state summarizes overall damage (normal on a clean turn, collapsed on a run-ending one).

The deleted flag removes the entity from the scene; fallen marks an entity whose support was removed (for example, a teacup on a desk that just got deleted) so the sprite compositor can offset its position. Both get written into changes[] alongside the usual edit instructions.

Only the narration, explanation, and hint fields render in the player's language; correction, fragment, edit_instruction, and effects always stay English because the image model downstream expects English instructions.

3. Effects accumulate across panels

Each entity carries a short string describing what it has become. The Mutation Engine reads the previous value as input, can extend or replace it, and the new value is stored for the next panel. This is how mutations compound without re-sending entity history on every request — traced across three panels of the same game below.

[Diagram: three successive panels. Panel 1 — player text "Detective examine desk"; detective.effects becomes "fading identity," "outline softening". Panel 2 — "Detective walked to door"; effects extended to "fading identity," "drifting through decades". Panel 3 — clean input "The detective opens the door."; effects unchanged.]
Three successive panels for the same game. The effects string is extended in panel 2 (new error), untouched in panel 3 (no errors). Prior state persists even when the player writes cleanly.

4. Structured outputs, enforced

Every OpenAI-compatible LLM call (OpenRouter, Cerebras) uses response_format: json_schema with strict: true; the Anthropic fallback goes through forced tool-use (tools + tool_choice) instead. Schemas are either Pydantic models passed through a _strict_schema() pass that sets additionalProperties: false on every object and marks every field required, or hand-written JSON schemas with the same constraints baked in.
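A sketch of what such a strict-schema pass could look like — this is an assumed reimplementation of the behavior described above (every object gets additionalProperties: false and all of its properties marked required), not the actual _strict_schema():

```python
def strict_schema(schema: dict) -> dict:
    """Recursively force strict-mode constraints onto a JSON schema dict:
    every object node forbids extra keys and requires every declared field.
    Mutates and returns the schema. (Illustrative sketch, not the real pass.)"""
    if schema.get("type") == "object" and "properties" in schema:
        schema["additionalProperties"] = False
        schema["required"] = list(schema["properties"])
    for value in schema.values():
        if isinstance(value, dict):
            strict_schema(value)           # nested objects, $defs, properties map
        elif isinstance(value, list):
            for item in value:             # anyOf / allOf / items-as-list entries
                if isinstance(item, dict):
                    strict_schema(item)
    return schema
```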

from pydantic import BaseModel

class EntityChange(BaseModel):
    game_entity_id: str
    edit_instruction: str
    applied_errors: list[str]
    effects: str
    deleted: bool = False
    fallen: bool = False

class ApplyResponse(BaseModel):
    changes: list[EntityChange]
    scene_state: str   # "normal" | "degraded" | "collapsed"
    narration: str

Optional fields and extra keys are treated as bugs at parse time. A common wrapper, call_llm_structured(), selects the provider, retries transient errors (408, 429, 5xx), and validates the response against the schema.
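The wrapper's control flow can be sketched as follows. Everything here is an assumed shape — the signature, error type, and backoff policy are illustrative; only the retryable status set (408, 429, 5xx) and the validate-after-retry order come from the text above.

```python
import random
import time

RETRYABLE = {408, 429, 500, 502, 503, 504}

class LLMHTTPError(Exception):
    """Hypothetical provider error carrying an HTTP status code."""
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status

def call_llm_structured(call, schema_model, max_attempts: int = 3):
    """Retry transient provider errors, then validate the raw JSON.

    `call` performs one provider request and returns parsed JSON;
    `schema_model` exposes model_validate() (e.g. a Pydantic model),
    which raises on extra keys or missing fields.
    """
    for attempt in range(max_attempts):
        try:
            raw = call()
        except LLMHTTPError as err:
            if err.status not in RETRYABLE or attempt == max_attempts - 1:
                raise  # non-transient, or out of attempts: surface the error
            time.sleep(2 ** attempt + random.random())  # backoff with jitter
            continue
        return schema_model.model_validate(raw)
    raise RuntimeError("unreachable")
```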

5. Model map

Role                                               Model                                        Provider
Grammar Analyzer, Scene Coverage, Mutation Engine  gpt-oss-120b (preset gpt-oss-120b-cerebras)  Cerebras
Alternative text presets (via MODEL_PRESET)        google/gemini-2.5-flash                      OpenRouter
                                                   claude-haiku-4-5-20251001                    Anthropic
Sprite mutation + alpha mask                       gemini-3.1-flash-image-preview               OpenRouter
Sprite cache embeddings                            text-embedding-3-small                       OpenRouter
Eval judge (eval harness only)                     Claude Opus                                  Anthropic (via claude -p)

Foreground sprites are mutated in two image-model calls: a mutation call at temperature=2.0 producing a sprite on a black background, followed by an alpha-mask call at temperature=0.0. Background scenes (where z = -1) skip the alpha step and use a single call, since the background is full-frame opaque.
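The two-call flow above can be sketched with the image call injected as a parameter. The `generate` interface and prompt strings here are assumptions; what the sketch pins down is the call pattern: high-temperature mutation pass, then a deterministic mask pass that backgrounds skip.

```python
def mutate_sprite(generate, instruction: str, is_background: bool):
    """Run the sprite-mutation flow using `generate(prompt, temperature)`,
    a hypothetical wrapper around one image-model call returning image bytes.

    Foreground: creative mutation pass (temp 2.0) + deterministic alpha-mask
    pass (temp 0.0). Background (z = -1): single opaque full-frame call.
    """
    sprite = generate(
        prompt=f"Apply edit: {instruction}. Render the sprite on black.",
        temperature=2.0,   # creative mutation pass
    )
    if is_background:
        return sprite, None   # full-frame opaque: no mask needed
    mask = generate(
        prompt="Produce the alpha mask for the previous sprite.",
        temperature=0.0,      # deterministic masking pass
    )
    return sprite, mask
```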

6. Sprite cache decision path

Sprite generation is the expensive step. The cache key is semantic, not hash-based: two different strings with the same meaning should hit the same sprite. The lookup flow:

[Diagram: cache lookup. The effects string for one entity is embedded with text-embedding-3-small, then a pgvector cosine search runs, scoped to (quest_id, panel_id, entity_id). Distance ≤ 0.35 → HIT: return the cached sprite. Otherwise MISS: generate with Gemini 3.1 Flash Image (mutate + mask) and store the embedding + sprite in the index.]
The threshold 0.35 lives as _SPRITE_DISTANCE_THRESHOLD in server/game_db.py. On a miss, the newly generated sprite is embedded and stored, so subsequent near-matches become hits.
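The decision itself is just a scoped nearest-neighbour check against that threshold. A minimal in-memory sketch of what pgvector computes server-side — the index shape and scope tuple are assumptions for illustration:

```python
import math

_SPRITE_DISTANCE_THRESHOLD = 0.35  # mirrors the constant described above

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def lookup(index, scope, query_vec):
    """Scoped semantic lookup. `index` maps a (quest_id, panel_id, entity_id)
    scope to a list of (embedding, sprite) pairs; returns the cached sprite
    on a hit, or None to signal the caller to generate, embed, and store."""
    best = None
    for vec, sprite in index.get(scope, []):
        d = cosine_distance(query_vec, vec)
        if best is None or d < best[0]:
            best = (d, sprite)
    if best and best[0] <= _SPRITE_DISTANCE_THRESHOLD:
        return best[1]   # HIT: short-circuit the image model
    return None          # MISS
```

The real search runs as a SQL cosine-distance query; the threshold semantics are the same.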

A quest manifest's meta.sprite_group lets quests share a cache bucket — the lookup's quest_id component is actually meta.sprite_group or meta.id, so a demo quest can alias to a production quest and inherit its cache.

7. The eval harness

Prompt changes are gated by an evaluation harness at tests/mutation_evals/run_eval.py against a frozen dataset of 53 hand-written cases. Two layers, shown below: structural checkers (pure functions) and an LLM judge (Claude Opus via claude -p).

[Diagram: eval harness flow. dataset.jsonl (53 hand-written cases) → Mutation Engine (one real call per case) → two layers: structural checkers (check_binding, check_edit_instruction_quality, check_effects_quality, check_entity_isolation, check_expected_cascades, check_expected_scene_state) and the LLM judge (Claude Opus via claude -p, case-level parallelism, 1–5 on four dimensions, warn at ≤ 3) → aggregate report (pass / warn / fail) saved as JSON → comparator diffs two runs, before.json vs after.json, per case and dimension, for regression gating.]
Each prompt or model change is evaluated before and after; the comparator reports which cases regressed on which dimensions. Net-negative diffs do not ship.
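The comparator's core is a per-case, per-dimension diff. A sketch under an assumed report shape ({case_id: {dimension: score}} — the real saved JSON may be richer):

```python
def compare_runs(before: dict, after: dict):
    """Return (case_id, dimension, old_score, new_score) tuples for every
    dimension that scored lower in `after` than in `before`. An empty
    result means the prompt change introduced no judged regressions."""
    regressions = []
    for case_id, dims in before.items():
        for dim, old in dims.items():
            new = after.get(case_id, {}).get(dim)
            if new is not None and new < old:
                regressions.append((case_id, dim, old, new))
    return regressions

regs = compare_runs(
    {"case_01": {"relevance": 4, "coherence": 4}},
    {"case_01": {"relevance": 3, "coherence": 5}},
)
print(regs)  # [('case_01', 'relevance', 4, 3)]
```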

The six structural checkers are check_binding, check_edit_instruction_quality, check_effects_quality, check_entity_isolation, check_expected_cascades, and check_expected_scene_state — all pure functions over the engine's output.

8. Judge dimensions

The judge scores every EntityChange on four dimensions, each an integer from 1 to 5. Individual scores matter less than the distribution shift between runs. The radar below plots an example scoring against the rubric.

[Diagram: radar chart over the four judge dimensions, each scored 1–5, with the per-dimension rubric — relevance: does edit_instruction match the error type's one canonical direction? intensity: does the drama match the severity? essence: does effects describe what the entity has become? coherence: does the change make sense for THIS entity?]
The polygon shown is a sample scoring for a single EntityChange (relevance 4, intensity 3, essence 4, coherence 4) — in practice, what matters is the distribution shift across all cases between runs, not any single point. The judge's rubric pins one canonical visual direction per error type (missing article → identity loss; wrong tense → temporal shift; etc.), so relevance is unambiguous. The harness runs judge calls in a ThreadPoolExecutor across test cases, not inside a single call.
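That case-level parallelism is a straightforward thread-pool map. A sketch with an assumed judge interface (one EntityChange in, a {dimension: score} dict out):

```python
from concurrent.futures import ThreadPoolExecutor

def judge_all(judge, changes, max_workers: int = 8):
    """Run one judge call per test case concurrently; `judge` wraps a
    single scoring call. Returns all score dicts plus the subset where
    any dimension scored <= 3 (the harness's warn threshold)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(judge, changes))
    flagged = [s for s in scores if min(s.values()) <= 3]
    return scores, flagged
```

Threads are a fit here because each judge call is a subprocess/network wait, not CPU work.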

9. CEFR as a prompt switch

Difficulty level (A2–C2) is a single global setting that selects different Grammar Analyzer system prompts and different Mutation Engine deletion rules. No extra runtime cost — it's a dictionary lookup per request. The matrix below shows each knob per level.

CEFR  Adversarial parsing            Entity deletion                      Analyzer stance
A2    off — flag only clear errors   DISABLED                             lenient / pity the student
B1    low                            disabled                             moderate
B2    off                            rare / only if tiny prop             strict
C1    off                            severity > 0.7, secondary entities   aggressive
C2    ANY ambiguity is a flag        severity > 0.7 + destroys essence    maximum strictness
Green → tolerant, yellow → strict, red → adversarial. Each level is a different set of system prompts and deletion rules; model and schema are identical.

At B2, "tiny prop" means a small peripheral object (a teacup, a lamp) — deletions at this level are rare and cosmetic. At C2 the analyzer enters a "Devil's Advocate" posture: if a sentence can be parsed two ways, it picks the reading with an error. Any ambiguity is treated as an error signal, so the only defense at this level is precise prose. (C1 is "very strict" but does not flip on Devil's Advocate in the current prompt — the adversarial posture is a C2-specific escalation.)

The "Analyzer stance" column names the Grammar Analyzer system prompt in use at each level: lenient (A2) ignores most ambiguous constructions; moderate (B1) flags the obvious and a few ambiguous cases; strict (B2) flags what a B2 learner should catch; very strict (C1) ignores only pure stylistic variants; maximum strictness (C2) layers Devil's Advocate on top.
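The "dictionary lookup per request" claim can be made concrete. The keys and values below are illustrative assumptions distilled from the matrix above, not the real config structure:

```python
# Per-level knobs: analyzer prompt variant, adversarial-parsing mode,
# and the entity-deletion rule. (Illustrative sketch of the lookup.)
CEFR_CONFIG = {
    "A2": {"analyzer_prompt": "lenient",     "adversarial": "off",           "deletion": "disabled"},
    "B1": {"analyzer_prompt": "moderate",    "adversarial": "low",           "deletion": "disabled"},
    "B2": {"analyzer_prompt": "strict",      "adversarial": "off",           "deletion": "tiny_prop_only"},
    "C1": {"analyzer_prompt": "very_strict", "adversarial": "off",           "deletion": "severity_gt_0.7"},
    "C2": {"analyzer_prompt": "maximum",     "adversarial": "any_ambiguity", "deletion": "severity_gt_0.7_and_destroys_essence"},
}

def knobs_for(level: str) -> dict:
    return CEFR_CONFIG[level]  # O(1) per request; no extra model cost
```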

Summary

Four load-bearing decisions show up across the diagrams: split the pipeline into a fast user-facing phase and a slower background phase; carry per-entity state as a persisted effects string so the model doesn't re-invent history; cache expensive sprite output by semantic similarity; gate prompt changes on before/after diffs of a judge-scored dataset. The rest — model choices, field-level language routing, CEFR as a prompt switch — follows from those.