gpt-5 context-free grammar guide
on this page
gpt-5 (finally!) introduced native context-free grammars (CFG), enabling formally constrained text generation. this doc contains my notes on the correct api usage, lark grammar, and related performance observations/issues.
tl;dr - cfg today in the openai api is extremely slow, especially relative to options that work with transformers or llama.cpp. if you really need a cfg, you should look at local generation; otherwise, cfg with gpt-5 only makes sense if you are already retrying invalid structured outputs at least 3-5 times.
note on documentation - as of writing, there’s only a single post on community.openai.com about cfg and just one documentation page in the cookbook. official information is extremely limited, so much of this comes from testing and inference.
alternative approaches
while this guide focuses on gpt-5’s native cfg support, most “local” llm stacks have supported this for years.
- guidance: powerful library supporting cfgs, json schemas, and interleaved control flow. works with multiple backends (transformers, llama.cpp, openai)
- llguidance: the rust library that powers guidance’s grammar engine. openai uses this internally for their structured outputs feature (json schema only). extremely fast (~50μs per token) with earley’s algorithm
- transformers logitsprocessor: custom processors in huggingface transformers for manual token filtering
- outlines: structured generation with regex, json schemas, and cfgs
- llama.cpp grammars: built-in grammar support for local models
- openai structured outputs: json schema enforcement via function calling (powered by llguidance internally)
each approach has different performance characteristics and compatibility requirements, but in theory, cfg provides the strongest guarantees. the performance impact varies by library and grammar, so you always need to benchmark if performance is important.
note - currently, the call overhead is inexcusably bad, rendering the usefulness of openai's implementation very limited. see the performance details below.
how cfg really works - logit filtering
all constrained generation methods fundamentally work by manipulating token probabilities:
- logit generation: the model produces raw scores (logits) for each token in its vocabulary
- filtering/masking: invalid tokens according to constraints get their logits set to negative infinity or masked out
- temperature scaling: remaining logits are divided by temperature to control randomness
- softmax: converts scaled logits to probabilities that sum to 1.0
- sampling: a token is selected based on the resulting probability distribution
cfg automates this process by computing valid next tokens based on the grammar state, ensuring only syntactically valid tokens can be selected at each step. this eliminates parse errors but requires evaluating the grammar at every token, causing the performance overhead.
note that this process does not require any modification of the model's forward pass; it only changes how the logits are processed at the sampling step.
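to make the mechanics concrete, here is a toy numpy sketch of that mask-scale-softmax-sample loop. the allowed-token set is a stand-in for whatever the grammar engine would compute at each step; real implementations run inside the inference server over the full vocabulary.

import numpy as np

def sample_constrained(logits, allowed_token_ids, temperature=1.0):
    """toy grammar-constrained sampling: mask -> temperature -> softmax -> sample"""
    masked = np.full_like(logits, -np.inf)        # invalid tokens get -inf
    masked[allowed_token_ids] = logits[allowed_token_ids]
    scaled = masked / temperature                 # temperature scaling
    probs = np.exp(scaled - scaled.max())         # numerically stable softmax
    probs /= probs.sum()
    return np.random.default_rng().choice(len(probs), p=probs)

# example: suppose the grammar only allows token ids 3 and 7 at this step
logits = np.random.default_rng(0).standard_normal(16)
print(sample_constrained(logits, allowed_token_ids=[3, 7]))   # prints 3 or 7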
critical api difference
cfg in gpt-5 uses a different api than standard chat completions:
- endpoint: client.responses.create() (not chat.completions.create())
- method: custom tools with a grammar format
- model: gpt-5 or gpt-5-2025-08-07
- availability: public api as of august 2025
# correct cfg usage
response = client.responses.create(
model="gpt-5",
input="prompt",
text={"format": {"type": "text"}},
tools=[{
"type": "custom",
"name": "tool_name",
"format": {
"type": "grammar",
"syntax": "lark" | "regex",
"definition": grammar_string
}
}]
)
what is context-free grammar?
a context-free grammar (cfg) is a formal system for defining valid strings in a language. each grammar consists of:
- terminals: literal strings or patterns that appear in the output
- non-terminals: rules that expand into terminals or other non-terminals
- production rules: define how non-terminals can be rewritten
- start symbol: the root rule where parsing begins
cfg in gpt-5 ensures output always matches your specified format, eliminating parsing errors and validation overhead.
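as a minimal (made-up) illustration of how those pieces map onto lark syntax:

# toy grammar: start symbol, one non-terminal, one terminal
grammar = r'''
start: greeting " " NAME        // start symbol + production rule
greeting: "Hello" | "Hi"        // non-terminal with two alternatives
NAME: /[A-Z][a-z]+/             // terminal: a pattern that appears literally in the output
'''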
supported grammar syntaxes
gpt-5 supports two grammar syntax types:
lark grammar
lark is a modern parsing library with ebnf-based syntax:
- terminals (uppercase): matched by lexer, longest match wins
- rules (lowercase): define structure, can be recursive
- operators:
  - | - alternation (choice)
  - + - one or more
  - * - zero or more
  - ? - optional
  - () - grouping
regex grammar
standard regular expression syntax for simpler patterns:
- faster compilation than lark
- suitable for fixed formats
- no recursion support
- uses rust regex syntax
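for instance, a fixed iso-style date needs nothing more than a single pattern string as the grammar definition (illustrative pattern, not taken from the openai docs):

# matches dates like "2025-08-07"
date_regex = r'^\d{4}-\d{2}-\d{2}$'
# passed as: {"type": "grammar", "syntax": "regex", "definition": date_regex}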
lark grammar basics
simple example
grammar = r'''
start: greeting
greeting: "Hello" | "Hi" | "Hey"
'''
terminals vs rules
grammar = r'''
// rules (lowercase) - define structure
start: sentence
sentence: subject " " verb " " object
subject: NOUN
verb: VERB
object: NOUN
// terminals (uppercase) - actual tokens
NOUN: "cat" | "dog" | "bird"
VERB: "chases" | "sees" | "hears"
'''
common patterns
# repetition
items: item+ // one or more
optional_items: item* // zero or more
maybe_item: item? // optional
# alternation
choice: option_a | option_b | option_c
# grouping
grouped: (item " ")+ item // "a b c d"
# character classes
NUMBER: /[0-9]+/
WORD: /[a-zA-Z]+/
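combining a few of these, a hypothetical grammar for a comma-separated list of numbers could look like:

list_grammar = r'''
start: NUMBER (", " NUMBER)*    // one number, then zero or more ", number" groups
NUMBER: /[0-9]+/
'''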
practical examples
binary choice
from openai import OpenAI
client = OpenAI()
response = client.responses.create(
model="gpt-5",
input="Call answer to: Is 2+2 equal to 4?",
text={"format": {"type": "text"}},
tools=[{
"type": "custom",
"name": "answer",
"description": "Binary answer",
"format": {
"type": "grammar",
"syntax": "lark",
"definition": 'start: "YES" | "NO"'
}
}],
parallel_tool_calls=False,
timeout=300
)
result = response.output[1].input # "YES"
finite state automaton
# fsa that accepts binary strings ending in "01"
fsa_grammar = r'''
start: trace "\n" result
trace: "TRACE:" SP transition+
transition: state " --[" bit "]--> " state SP
state: "q0" | "q1" | "q2"
bit: "0" | "1"
result: "ACCEPT" | "REJECT"
SP: " "
'''
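the grammar plugs into the same call shape as the binary-choice example above; the prompt wording below is my guess at what reliably triggers the tool, so treat it as illustrative:

response = client.responses.create(
    model="gpt-5",
    input="Call fsa to trace the string 1101 and state whether it is accepted",
    text={"format": {"type": "text"}},
    tools=[{
        "type": "custom",
        "name": "fsa",
        "description": "FSA trace generator",
        "format": {
            "type": "grammar",
            "syntax": "lark",
            "definition": fsa_grammar
        }
    }],
    parallel_tool_calls=False,
    timeout=300
)
print(response.output[1].input)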
regex for structured data
# us phone number pattern
phone_regex = r'^\d{3}-\d{3}-\d{4}$'
response = client.responses.create(
model="gpt-5",
input="Generate a phone number for area code 555",
text={"format": {"type": "text"}},
tools=[{
"type": "custom",
"name": "phone_gen",
"description": "Generate phone number",
"format": {
"type": "grammar",
"syntax": "regex",
"definition": phone_regex
}
}],
parallel_tool_calls=False
)
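as with the lark examples, the constrained output comes back on the custom tool call item:

result = response.output[1].input  # e.g. "555-867-5309" (digits will vary)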
performance characteristics
timing comparison
based on empirical testing with gpt-5:
| method | average time | format guarantee | validation needed |
|---|---|---|---|
| with cfg | ~9.5s | 100% | no |
| without cfg | ~1.0s | variable | yes |
key findings:
- cfg adds 8-10x latency overhead for simple grammars
- first compilation can take 1-5 minutes for complex grammars
- subsequent calls may benefit from caching
- overhead increases with grammar complexity
factors affecting performance
- grammar complexity
  - number of rules
  - nesting depth
  - use of repetition operators
  - terminal count
- grammar type
  - regex: faster compilation, simpler patterns
  - lark: slower compilation, more expressive
- compilation phases
  - initial parse: grammar syntax validation
  - optimization: rule simplification
  - compilation: internal representation building
observed performance
from limited testing:
- simple 2-rule grammar: ~9.5s total latency
- without cfg: ~1.0s for same task
- openai documentation notes: “first compilation can take 1-5 minutes for complex grammars”
lark compilation comparison
to understand the cfg overhead, we benchmarked the same grammars using the lark parser library (benchmark script):
| grammar | rules | lark (ms) | gpt-5 cfg (ms) | overhead |
|---|---|---|---|---|
| binary | 1 | 0.5-3.2 | ~9,500 | ~3000x |
| classifier | 4 | 1.5-1.8 | estimated | - |
| sentence | 7 | 1.9-2.0 | estimated | - |
| json | 10 | 3.1-4.4 | estimated | - |
| sir model | 13 | 3.2-4.0 | estimated | - |
the massive overhead (1000-10000x) suggests gpt-5 cfg latency comes from:
- network/api round-trip time
- token-by-token grammar validation during generation
- integration overhead with the llm inference pipeline
- smaller pool of resources serving cfg requests
pure grammar compilation is negligible (<5ms) compared to the generative process. for context, llguidance (which openai uses for json schema) takes only ~50μs per token, suggesting the bottleneck is not the grammar engine itself but the integration and api layers.
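for reference, a minimal local version of that compilation benchmark (not the exact script linked above) can be reproduced with the lark package directly:

import time
from lark import Lark

binary_grammar = 'start: "YES" | "NO"'

def avg_compile_ms(grammar, runs=10):
    """average wall-clock time in ms to construct a Lark parser from a grammar string"""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        Lark(grammar, start="start")  # grammar parsing/compilation happens in the constructor
        samples.append((time.perf_counter() - t0) * 1000)
    return sum(samples) / len(samples)

print(f"binary grammar: {avg_compile_ms(binary_grammar):.2f} ms")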
advanced examples
s-i-r epidemiological model
sir_grammar = r'''
start: header "\n" transitions "\n" summary
header: "TIME_STEP:" SP number
transitions: transition+
transition: person ":" SP state " -> " state SP reason "\n"
person: /[A-Z]/
state: "S" | "I" | "R"
reason: "[" /[a-z_]+/ "]"
summary: "COUNTS: S=" number ", I=" number ", R=" number
number: /[0-9]+/
SP: " "
'''
chemical equation balancer
chemistry_grammar = r'''
start: equation "\n" balanced
equation: "EQUATION: " reactants " → " products
reactants: compound (SP "+" SP compound)*
products: compound (SP "+" SP compound)*
compound: coefficient? molecule
coefficient: /[2-8]/
molecule: "H2O" | "O2" | "H2" | "CO2" | "CH4" | "NH3"
balanced: "BALANCED: " ("Yes" | "No")
SP: " "
'''
json with specific schema
# enforce exact json structure
json_grammar = r'''
start: object
object: "{" WS pair (WS "," WS pair)* WS "}"
pair: '"name"' WS ":" WS string
| '"age"' WS ":" WS number
| '"active"' WS ":" WS boolean
string: '"' /[a-zA-Z ]+/ '"'
number: /[0-9]+/
boolean: "true" | "false"
WS: /[ \t\n]*/
'''
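because every completion from this grammar is json-parseable, the tool output can go straight into json.loads. the prompt and tool name here are illustrative, and client is reused from the earlier examples:

import json

response = client.responses.create(
    model="gpt-5",
    input="Call emit_user to produce a record for an active 34-year-old user named Ada",
    text={"format": {"type": "text"}},
    tools=[{
        "type": "custom",
        "name": "emit_user",
        "description": "emit a user record",
        "format": {
            "type": "grammar",
            "syntax": "lark",
            "definition": json_grammar
        }
    }],
    parallel_tool_calls=False,
    timeout=300
)
user = json.loads(response.output[1].input)  # e.g. {'name': 'Ada', 'age': 34, 'active': True}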
best practices
grammar design
- keep terminals bounded
# good - bounded pattern
IDENTIFIER: /[a-z][a-z0-9_]{0,31}/
# avoid - unbounded
TEXT: /.*/
- explicit whitespace handling
# good - explicit spaces
sentence: word SP word SP word
SP: " "
# avoid - implicit whitespace
sentence: word word word
- prefer terminals over complex rules
# good - terminal for common patterns
EMAIL: /[a-z]+@[a-z]+\.[a-z]+/
# avoid - complex rule structure
email: local "@" domain
local: letter+
domain: subdomain "." tld
- use meaningful names
# good
sql_statement: select_clause from_clause where_clause?
# avoid
s: sc fc wc?
error handling
import time
def call_with_cfg(client, prompt, grammar, syntax="lark", max_retries=3):
"""wrapper with retry logic for cfg calls"""
for attempt in range(max_retries):
try:
response = client.responses.create(
model="gpt-5",
input=prompt,
text={"format": {"type": "text"}},
tools=[{
"type": "custom",
"name": "tool",
"description": "cfg tool",
"format": {
"type": "grammar",
"syntax": syntax,
"definition": grammar
}
}],
parallel_tool_calls=False,
timeout=300
)
return response.output[1].input
except Exception as e:
if "timeout" in str(e).lower() and attempt < max_retries - 1:
print(f"timeout on attempt {attempt + 1}, retrying...")
time.sleep(2 ** attempt) # exponential backoff
else:
raise
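usage mirrors the earlier examples; for instance, reusing the binary grammar:

from openai import OpenAI

client = OpenAI()
answer = call_with_cfg(client, "Call tool to answer: is 2+2 equal to 4?", 'start: "YES" | "NO"')
print(answer)  # "YES"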
limitations and gotchas
unsupported lark features
- lookaround in regex: (?=...), (?!...)
- lazy modifiers: *?, +?
- terminal priorities: %priority
- templates: %template
- most imports: %import (except common.WS)
common issues
- grammar compilation timeout
  - complex grammars can take minutes to compile
  - set timeout=300 or higher
  - consider simplifying the grammar
- terminal vs rule confusion
# wrong - terminals can't be recursive
TEXT: text_part TEXT?
# correct - use rules for recursion
text: text_part text?
- greedy lexing issues
# problem - NUMBER matches everything
NUMBER: /[0-9]+/
DIGIT: /[0-9]/  # never matches
# solution - order matters or use rules
start: NUMBER | DIGIT
complete working example
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.11"
# dependencies = ["openai"]
# ///
from openai import OpenAI
import time
def binary_classifier_with_cfg():
"""classify text as positive or negative sentiment"""
client = OpenAI()
# define grammar for classification
classifier_grammar = r'''
start: classification
classification: sentiment " (" confidence ")"
sentiment: "POSITIVE" | "NEGATIVE" | "NEUTRAL"
confidence: "HIGH" | "MEDIUM" | "LOW"
'''
texts = [
"this product is amazing!",
"terrible experience, would not recommend",
"it's okay, nothing special"
]
for text in texts:
start_time = time.time()
response = client.responses.create(
model="gpt-5",
input=f"Call classifier to analyze: '{text}'",
text={"format": {"type": "text"}},
tools=[{
"type": "custom",
"name": "classifier",
"description": "sentiment classifier",
"format": {
"type": "grammar",
"syntax": "lark",
"definition": classifier_grammar
}
}],
parallel_tool_calls=False,
timeout=300
)
result = response.output[1].input
elapsed = time.time() - start_time
print(f"text: {text}")
print(f"result: {result}")
print(f"time: {elapsed:.2f}s\n")
if __name__ == "__main__":
binary_classifier_with_cfg()
when to use cfg
use cfg when
- format compliance is critical: legal documents, protocols, apis
- parsing errors are expensive: production systems
- schema is well-defined: structured data extraction
- validation logic is complex: nested conditions
avoid cfg when
- latency is critical: real-time systems (<1s requirement)
- output is free-form: creative writing, summaries
- grammar changes frequently: prototyping phase
- simple validation suffices: basic string checks
resources
- openai cookbook - gpt-5 cfg
- lark documentation
- lark online ide
- example implementations - complete working examples
summary
context-free grammars in gpt-5 provide guaranteed format compliance at the cost of increased latency. the 8-10x performance overhead is often justified when correctness matters more than speed. for production use:
- start with simple grammars and iterate
- cache compiled grammars when possible
- use regex for simple patterns, lark for complex structures
- always set appropriate timeouts
- validate grammar syntax before deployment
cfg transforms llms from probabilistic text generators into formally constrained computational engines, opening new possibilities for reliable ai systems.