gpt-5 context-free grammar guide
on this page
gpt-5 (finally!) introduced native context-free grammars (CFG), enabling formally constrained text generation. this doc contains my notes on the correct api usage, lark grammar, and related performance observations/issues.
tl;dr - cfg today in the openai api is extremely slow, especially relative to options that work with transformers or llama.cpp. if you really need a cfg, you should look at local generation; otherwise, cfg with gpt-5 only makes sense if you are already retrying invalid structured outputs at least 3-5 times.
note on documentation - as of writing, there’s only a single post on community.openai.com about cfg and just one documentation page in the cookbook. official information is extremely limited, so much of this comes from testing and inference.
alternative approaches
while this guide focuses on gpt-5’s native cfg support, most “local” llm stacks have supported this for years.
- guidance: powerful library supporting cfgs, json schemas, and interleaved control flow. works with multiple backends (transformers, llama.cpp, openai)
- llguidance: the rust library that powers guidance’s grammar engine. openai uses this internally for their structured outputs feature (json schema only). extremely fast (~50μs per token) with earley’s algorithm
- transformers logitsprocessor: custom processors in huggingface transformers for manual token filtering
- outlines: structured generation with regex, json schemas, and cfgs
- llama.cpp grammars: built-in grammar support for local models
- openai structured outputs: json schema enforcement via function calling (powered by llguidance internally)
each approach has different performance characteristics and compatibility requirements, but in theory, cfg provides the strongest guarantees. the performance impact varies by library and grammar, so you always need to benchmark if performance is important.
note - currently, the call overhead is inexcusably bad, rendering the usefulness of openai's implementation very limited. see the performance details below.
how cfg really works - logit filtering
all constrained generation methods fundamentally work by manipulating token probabilities:
- logit generation: the model produces raw scores (logits) for each token in its vocabulary
- filtering/masking: invalid tokens according to constraints get their logits set to negative infinity or masked out
- temperature scaling: remaining logits are divided by temperature to control randomness
- softmax: converts scaled logits to probabilities that sum to 1.0
- sampling: a token is selected based on the resulting probability distribution
cfg automates this process by computing valid next tokens based on the grammar state, ensuring only syntactically valid tokens can be selected at each step. this eliminates parse errors but requires evaluating the grammar at every token, causing the performance overhead.
note that this process does not require any modification of the model's forward pass; it only changes how the logits are processed at the sampling step.
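to make the mechanics concrete, here is a toy numpy sketch of that mask-scale-softmax-sample loop. the allowed-token set is a stand-in for whatever the grammar engine would compute at each step; real implementations run inside the inference server over the full vocabulary.

import numpy as np

def sample_constrained(logits, allowed_token_ids, temperature=1.0):
    """toy grammar-constrained sampling: mask -> temperature -> softmax -> sample"""
    masked = np.full_like(logits, -np.inf)        # invalid tokens get -inf
    masked[allowed_token_ids] = logits[allowed_token_ids]
    scaled = masked / temperature                 # temperature scaling
    probs = np.exp(scaled - scaled.max())         # numerically stable softmax
    probs /= probs.sum()
    return np.random.default_rng().choice(len(probs), p=probs)

# example: suppose the grammar only allows token ids 3 and 7 at this step
logits = np.random.default_rng(0).standard_normal(16)
print(sample_constrained(logits, allowed_token_ids=[3, 7]))   # prints 3 or 7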
critical api difference
cfg in gpt-5 uses a different api than standard chat completions:
- endpoint: client.responses.create() (not chat.completions.create())
- method: custom tools with a grammar format
- model: gpt-5 or gpt-5-2025-08-07
- availability: public api as of august 2025
# correct cfg usage
response = client.responses.create(
model="gpt-5",
input="prompt",
text={"format": {"type": "text"}},
tools=[{
"type": "custom",
"name": "tool_name",
"format": {
"type": "grammar",
"syntax": "lark" | "regex",
"definition": grammar_string
}
}]
)
what is context-free grammar?
a context-free grammar (cfg) is a formal system for defining valid strings in a language. each grammar consists of:
- terminals: literal strings or patterns that appear in the output
- non-terminals: rules that expand into terminals or other non-terminals
- production rules: define how non-terminals can be rewritten
- start symbol: the root rule where parsing begins
cfg in gpt-5 ensures output always matches your specified format, eliminating parsing errors and validation overhead.
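as a minimal (made-up) illustration of how those pieces map onto lark syntax:

# toy grammar: start symbol, one non-terminal, one terminal
grammar = r'''
start: greeting " " NAME        // start symbol + production rule
greeting: "Hello" | "Hi"        // non-terminal with two alternatives
NAME: /[A-Z][a-z]+/             // terminal: a pattern that appears literally in the output
'''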
supported grammar syntaxes
gpt-5 supports two grammar syntax types:
lark grammar
lark is a modern parsing library with ebnf-based syntax:
- terminals (uppercase): matched by lexer, longest match wins
- rules (lowercase): define structure, can be recursive
- operators:
  - | - alternation (choice)
  - + - one or more
  - * - zero or more
  - ? - optional
  - () - grouping
regex grammar
standard regular expression syntax for simpler patterns:
- faster compilation than lark
- suitable for fixed formats
- no recursion support
- uses rust regex syntax
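for instance, a fixed iso-style date needs nothing more than a single pattern string as the grammar definition (illustrative pattern, not taken from the openai docs):

# matches dates like "2025-08-07"
date_regex = r'^\d{4}-\d{2}-\d{2}$'
# passed as: {"type": "grammar", "syntax": "regex", "definition": date_regex}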
lark grammar basics
simple example
grammar = r'''
start: greeting
greeting: "Hello" | "Hi" | "Hey"
'''
terminals vs rules
grammar = r'''
// rules (lowercase) - define structure
start: sentence
sentence: subject " " verb " " object
subject: NOUN
verb: VERB
object: NOUN
// terminals (uppercase) - actual tokens
NOUN: "cat" | "dog" | "bird"
VERB: "chases" | "sees" | "hears"
'''
common patterns
# repetition
items: item+ // one or more
optional_items: item* // zero or more
maybe_item: item? // optional
# alternation
choice: option_a | option_b | option_c
# grouping
grouped: (item " ")+ item // "a b c d"
# character classes
NUMBER: /[0-9]+/
WORD: /[a-zA-Z]+/
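combining a few of these, a hypothetical grammar for a comma-separated list of numbers could look like:

list_grammar = r'''
start: NUMBER (", " NUMBER)*    // one number, then zero or more ", number" groups
NUMBER: /[0-9]+/
'''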
practical examples
binary choice
from openai import OpenAI
client = OpenAI()
response = client.responses.create(
model="gpt-5",
input="Call answer to: Is 2+2 equal to 4?",
text={"format": {"type": "text"}},
tools=[{
"type": "custom",
"name": "answer",
"description": "Binary answer",
"format": {
"type": "grammar",
"syntax": "lark",
"definition": 'start: "YES" | "NO"'
}
}],
parallel_tool_calls=False,
timeout=300
)
result = response.output[1].input # "YES"
finite state automaton
# fsa that accepts binary strings ending in "01"
fsa_grammar = r'''
start: trace "\n" result
trace: "TRACE:" SP transition+
transition: state " --[" bit "]--> " state SP
state: "q0" | "q1" | "q2"
bit: "0" | "1"
result: "ACCEPT" | "REJECT"
SP: " "
'''
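the grammar plugs into the same call shape as the binary-choice example above; the prompt wording below is my guess at what reliably triggers the tool, so treat it as illustrative:

response = client.responses.create(
    model="gpt-5",
    input="Call fsa to trace the string 1101 and state whether it is accepted",
    text={"format": {"type": "text"}},
    tools=[{
        "type": "custom",
        "name": "fsa",
        "description": "FSA trace generator",
        "format": {
            "type": "grammar",
            "syntax": "lark",
            "definition": fsa_grammar
        }
    }],
    parallel_tool_calls=False,
    timeout=300
)
print(response.output[1].input)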
regex for structured data
# us phone number pattern
phone_regex = r'^\d{3}-\d{3}-\d{4}$'
response = client.responses.create(
model="gpt-5",
input="Generate a phone number for area code 555",
text={"format": {"type": "text"}},
tools=[{
"type": "custom",
"name": "phone_gen",
"description": "Generate phone number",
"format": {
"type": "grammar",
"syntax": "regex",
"definition": phone_regex
}
}],
parallel_tool_calls=False
)
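as with the lark examples, the constrained output comes back on the custom tool call item:

result = response.output[1].input  # e.g. "555-867-5309" (digits will vary)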
performance characteristics
timing comparison
based on empirical testing with gpt-5:
| method | average time | format guarantee | validation needed |
|---|---|---|---|
| with cfg | ~9.5s | 100% | no |
| without cfg | ~1.0s | variable | yes |
key findings:
- cfg adds 8-10x latency overhead for simple grammars
- first compilation can take 1-5 minutes for complex grammars
- subsequent calls may benefit from caching
- overhead increases with grammar complexity
factors affecting performance
- grammar complexity
  - number of rules
  - nesting depth
  - use of repetition operators
  - terminal count
- grammar type
  - regex: faster compilation, simpler patterns
  - lark: slower compilation, more expressive
- compilation phases
  - initial parse: grammar syntax validation
  - optimization: rule simplification
  - compilation: internal representation building
observed performance
from limited testing:
- simple 2-rule grammar: ~9.5s total latency
- without cfg: ~1.0s for same task
- openai documentation notes: “first compilation can take 1-5 minutes for complex grammars”
lark compilation comparison
to understand the cfg overhead, we benchmarked the same grammars using the lark parser library (benchmark script):
| grammar | rules | lark (ms) | gpt-5 cfg (ms) | overhead |
|---|---|---|---|---|
| binary | 1 | 0.5-3.2 | ~9,500 | ~3000x |
| classifier | 4 | 1.5-1.8 | estimated | - |
| sentence | 7 | 1.9-2.0 | estimated | - |
| json | 10 | 3.1-4.4 | estimated | - |
| sir model | 13 | 3.2-4.0 | estimated | - |
the massive overhead (1000-10000x) suggests gpt-5 cfg latency comes from:
- network/api round-trip time
- token-by-token grammar validation during generation
- integration overhead with the llm inference pipeline
- smaller pool of resources serving cfg requests
pure grammar compilation is negligible (<5ms) compared to the generative process. for context, llguidance (which openai uses for json schema) takes only ~50μs per token, suggesting the bottleneck is not the grammar engine itself but the integration and api layers.
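for reference, a minimal local version of that compilation benchmark (not the exact script linked above) can be reproduced with the lark package directly:

import time
from lark import Lark

binary_grammar = 'start: "YES" | "NO"'

def avg_compile_ms(grammar, runs=10):
    """average wall-clock time in ms to construct a Lark parser from a grammar string"""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        Lark(grammar, start="start")  # grammar parsing/compilation happens in the constructor
        samples.append((time.perf_counter() - t0) * 1000)
    return sum(samples) / len(samples)

print(f"binary grammar: {avg_compile_ms(binary_grammar):.2f} ms")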
advanced examples
s-i-r epidemiological model
sir_grammar = r'''
start: header "\n" transitions "\n" summary
header: "TIME_STEP:" SP number
transitions: transition+
transition: person ":" SP state " -> " state SP reason "\n"
person: /[A-Z]/
state: "S" | "I" | "R"
reason: "[" /[a-z_]+/ "]"
summary: "COUNTS: S=" number ", I=" number ", R=" number
number: /[0-9]+/
SP: " "
'''
chemical equation balancer
chemistry_grammar = r'''
start: equation "\n" balanced
equation: "EQUATION: " reactants " → " products
reactants: compound (SP "+" SP compound)*
products: compound (SP "+" SP compound)*
compound: coefficient? molecule
coefficient: /[2-8]/
molecule: "H2O" | "O2" | "H2" | "CO2" | "CH4" | "NH3"
balanced: "BALANCED: " ("Yes" | "No")
SP: " "
'''
json with specific schema
# enforce exact json structure
json_grammar = r'''
start: object
object: "{" WS pair (WS "," WS pair)* WS "}"
pair: '"name"' WS ":" WS string
| '"age"' WS ":" WS number
| '"active"' WS ":" WS boolean
string: '"' /[a-zA-Z ]+/ '"'
number: /[0-9]+/
boolean: "true" | "false"
WS: /[ \t\n]*/
'''
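because every completion from this grammar is json-parseable, the tool output can go straight into json.loads. the prompt and tool name here are illustrative, and client is reused from the earlier examples:

import json

response = client.responses.create(
    model="gpt-5",
    input="Call emit_user to produce a record for an active 34-year-old user named Ada",
    text={"format": {"type": "text"}},
    tools=[{
        "type": "custom",
        "name": "emit_user",
        "description": "emit a user record",
        "format": {
            "type": "grammar",
            "syntax": "lark",
            "definition": json_grammar
        }
    }],
    parallel_tool_calls=False,
    timeout=300
)
user = json.loads(response.output[1].input)  # e.g. {'name': 'Ada', 'age': 34, 'active': True}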
best practices
grammar design
- keep terminals bounded
# good - bounded pattern
IDENTIFIER: /[a-z][a-z0-9_]{0,31}/
# avoid - unbounded
TEXT: /.*/
- explicit whitespace handling
# good - explicit spaces
sentence: word SP word SP word
SP: " "
# avoid - implicit whitespace
sentence: word word word
- prefer terminals over complex rules
# good - terminal for common patterns
EMAIL: /[a-z]+@[a-z]+\.[a-z]+/
# avoid - complex rule structure
email: local "@" domain
local: letter+
domain: subdomain "." tld
- use meaningful names
# good
sql_statement: select_clause from_clause where_clause?
# avoid
s: sc fc wc?
error handling
import time
def call_with_cfg(client, prompt, grammar, syntax="lark", max_retries=3):
"""wrapper with retry logic for cfg calls"""
for attempt in range(max_retries):
try:
response = client.responses.create(
model="gpt-5",
input=prompt,
text={"format": {"type": "text"}},
tools=[{
"type": "custom",
"name": "tool",
"description": "cfg tool",
"format": {
"type": "grammar",
"syntax": syntax,
"definition": grammar
}
}],
parallel_tool_calls=False,
timeout=300
)
return response.output[1].input
except Exception as e:
if "timeout" in str(e).lower() and attempt < max_retries - 1:
print(f"timeout on attempt {attempt + 1}, retrying...")
time.sleep(2 ** attempt) # exponential backoff
else:
raise
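usage mirrors the earlier examples; for instance, reusing the binary grammar:

from openai import OpenAI

client = OpenAI()
answer = call_with_cfg(client, "Call tool to answer: is 2+2 equal to 4?", 'start: "YES" | "NO"')
print(answer)  # "YES"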
limitations and gotchas
unsupported lark features
- lookaround in regex: (?=...), (?!...)
- lazy modifiers: *?, +?
- terminal priorities: %priority
- templates: %template
- most imports: %import (except common.WS)
common issues
- grammar compilation timeout
  - complex grammars can take minutes to compile
  - set timeout=300 or higher
  - consider simplifying the grammar
- terminal vs rule confusion
# wrong - terminals can't be recursive
TEXT: text_part TEXT?
# correct - use rules for recursion
text: text_part text?
- greedy lexing issues
# problem - NUMBER matches everything
NUMBER: /[0-9]+/
DIGIT: /[0-9]/  # never matches
# solution - order matters or use rules
start: NUMBER | DIGIT
complete working example
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.11"
# dependencies = ["openai"]
# ///
from openai import OpenAI
import time
def binary_classifier_with_cfg():
"""classify text as positive or negative sentiment"""
client = OpenAI()
# define grammar for classification
classifier_grammar = r'''
start: classification
classification: sentiment " (" confidence ")"
sentiment: "POSITIVE" | "NEGATIVE" | "NEUTRAL"
confidence: "HIGH" | "MEDIUM" | "LOW"
'''
texts = [
"this product is amazing!",
"terrible experience, would not recommend",
"it's okay, nothing special"
]
for text in texts:
start_time = time.time()
response = client.responses.create(
model="gpt-5",
input=f"Call classifier to analyze: '{text}'",
text={"format": {"type": "text"}},
tools=[{
"type": "custom",
"name": "classifier",
"description": "sentiment classifier",
"format": {
"type": "grammar",
"syntax": "lark",
"definition": classifier_grammar
}
}],
parallel_tool_calls=False,
timeout=300
)
result = response.output[1].input
elapsed = time.time() - start_time
print(f"text: {text}")
print(f"result: {result}")
print(f"time: {elapsed:.2f}s\n")
if __name__ == "__main__":
binary_classifier_with_cfg()
when to use cfg
use cfg when
- format compliance is critical: legal documents, protocols, apis
- parsing errors are expensive: production systems
- schema is well-defined: structured data extraction
- validation logic is complex: nested conditions
avoid cfg when
- latency is critical: real-time systems (<1s requirement)
- output is free-form: creative writing, summaries
- grammar changes frequently: prototyping phase
- simple validation suffices: basic string checks
resources
- openai cookbook - gpt-5 cfg
- lark documentation
- lark online ide
- example implementations - complete working examples
summary
context-free grammars in gpt-5 provide guaranteed format compliance at the cost of increased latency. the 8-10x performance overhead is often justified when correctness matters more than speed. for production use:
- start with simple grammars and iterate
- cache compiled grammars when possible
- use regex for simple patterns, lark for complex structures
- always set appropriate timeouts
- validate grammar syntax before deployment
cfg transforms llms from probabilistic text generators into formally constrained computational engines, opening new possibilities for reliable ai systems.