gpt-5 context-free grammar guide


gpt-5 (finally!) introduced native context-free grammars (cfg), enabling formally constrained text generation. this doc contains my notes on correct api usage, lark grammar syntax, and related performance observations/issues.

tl;dr - cfg today in the openai api is extremely slow, especially relative to options that work with transformers or llama.cpp. if you really need a cfg, you should look at local generation; otherwise, cfg with gpt-5 only makes sense if you are already retrying invalid structured outputs at least 3-5 times.

note on documentation - as of writing, there’s only a single post on community.openai.com about cfg and just one documentation page in the cookbook. official information is extremely limited, so much of this comes from testing and inference.

alternative approaches

while this guide focuses on gpt-5’s native cfg support, most “local” llm stacks have supported this for years.

  • guidance: powerful library supporting cfgs, json schemas, and interleaved control flow. works with multiple backends (transformers, llama.cpp, openai)
  • llguidance: the rust library that powers guidance’s grammar engine. openai uses this internally for their structured outputs feature (json schema only). extremely fast (~50μs per token) with earley’s algorithm
  • transformers logitsprocessor: custom processors in huggingface transformers for manual token filtering
  • outlines: structured generation with regex, json schemas, and cfgs
  • llama.cpp grammars: built-in grammar support for local models
  • openai structured outputs: json schema enforcement via function calling (powered by llguidance internally)

each approach has different performance characteristics and compatibility requirements, but in theory, cfg provides the strongest guarantees. the performance impact varies by library and grammar, so you always need to benchmark if performance is important.
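for a concrete sense of the local route, here is a sketch using the outlines library listed above (based on the outlines 0.x api, which has changed across versions; the model name and pattern are illustrative):

import outlines

# load a local model through transformers (any causal lm should work)
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# build a generator that can only emit strings matching the regex
generator = outlines.generate.regex(model, r"[0-9]{3}-[0-9]{3}-[0-9]{4}")

print(generator("Give me a phone number: "))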

note - currently, the per-call overhead is inexcusably bad, which severely limits the usefulness of openai's implementation. see the performance details below.

how cfg really works - logit filtering

all constrained generation methods fundamentally work by manipulating token probabilities:

  1. logit generation: the model produces raw scores (logits) for each token in its vocabulary
  2. filtering/masking: invalid tokens according to constraints get their logits set to negative infinity or masked out
  3. temperature scaling: remaining logits are divided by temperature to control randomness
  4. softmax: converts scaled logits to probabilities that sum to 1.0
  5. sampling: a token is selected based on the resulting probability distribution

cfg automates this process by computing valid next tokens based on the grammar state, ensuring only syntactically valid tokens can be selected at each step. this eliminates parse errors but requires evaluating the grammar at every token, causing the performance overhead.

note that this process does not require any modification of the model's forward pass. it only changes how the logits are processed before or during sampling at the end.
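a toy sketch of steps 1-5 with numpy, assuming a hypothetical `allowed` set of token ids already computed from the grammar state:

import numpy as np

rng = np.random.default_rng(0)

logits = rng.normal(size=8)        # 1) raw scores for a toy 8-token vocabulary
allowed = [1, 4, 5]                # 2) token ids the grammar permits here (made up)

masked = np.full_like(logits, -np.inf)
masked[allowed] = logits[allowed]  # grammar-invalid tokens get -inf

temperature = 0.7
scaled = masked / temperature      # 3) temperature scaling

probs = np.exp(scaled - scaled.max())
probs /= probs.sum()               # 4) softmax; invalid tokens end up with probability 0

token = rng.choice(len(probs), p=probs)  # 5) sample the next token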

critical api difference

cfg in gpt-5 uses a different api than standard chat completions:

  • endpoint: client.responses.create() (not chat.completions.create())
  • method: custom tools with grammar format
  • model: gpt-5 or gpt-5-2025-08-07
  • availability: public api as of august 2025
# correct cfg usage
response = client.responses.create(
    model="gpt-5",
    input="prompt",
    text={"format": {"type": "text"}},
    tools=[{
        "type": "custom",
        "name": "tool_name",
        "format": {
            "type": "grammar",
            "syntax": "lark" | "regex",
            "definition": grammar_string
        }
    }]
)

what is context-free grammar?

a context-free grammar (cfg) is a formal system for defining valid strings in a language. each grammar consists of:

  • terminals: literal strings or patterns that appear in the output
  • non-terminals: rules that expand into terminals or other non-terminals
  • production rules: define how non-terminals can be rewritten
  • start symbol: the root rule where parsing begins

cfg in gpt-5 ensures output always matches your specified format, eliminating parsing errors and validation overhead.

supported grammar syntaxes

gpt-5 supports two grammar syntax types:

lark grammar

lark is a modern parsing library with ebnf-based syntax:

  • terminals (uppercase): matched by lexer, longest match wins
  • rules (lowercase): define structure, can be recursive
  • operators:
    • | - alternation (choice)
    • + - one or more
    • * - zero or more
    • ? - optional
    • () - grouping

regex grammar

standard regular expression syntax for simpler patterns:

  • faster compilation than lark
  • suitable for fixed formats
  • no recursion support
  • uses rust regex syntax

lark grammar basics

simple example

grammar = r'''
start: greeting
greeting: "Hello" | "Hi" | "Hey"
'''

terminals vs rules

grammar = r'''
// rules (lowercase) - define structure
start: sentence
sentence: subject " " verb " " object

subject: NOUN
verb: VERB
object: NOUN

// terminals (uppercase) - actual tokens, matched by the lexer
NOUN: "cat" | "dog" | "bird"
VERB: "chases" | "sees" | "hears"
'''

common patterns

// repetition
items: item+              // one or more
optional_items: item*     // zero or more
maybe_item: item?         // optional

// alternation
choice: option_a | option_b | option_c

// grouping
grouped: (item " ")+ item  // matches "a b c d"

// character classes (regex terminals)
NUMBER: /[0-9]+/
WORD: /[a-zA-Z]+/

practical examples

binary choice

from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5",
    input="Call answer to: Is 2+2 equal to 4?",
    text={"format": {"type": "text"}},
    tools=[{
        "type": "custom",
        "name": "answer",
        "description": "Binary answer",
        "format": {
            "type": "grammar",
            "syntax": "lark",
            "definition": 'start: "YES" | "NO"'
        }
    }],
    parallel_tool_calls=False,
    timeout=300
)

result = response.output[1].input  # "YES"

finite state automaton

# fsa that accepts binary strings ending in "01"
fsa_grammar = r'''
start: trace "\n" result

trace: "TRACE:" SP transition+
transition: state " --[" bit "]--> " state SP
state: "q0" | "q1" | "q2"
bit: "0" | "1"

result: "ACCEPT" | "REJECT"
SP: " "
'''
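a grammar like this can be smoke-tested locally with the lark library before paying the api latency. a sketch (the sample trace is hand-written, and lark's feature set doesn't exactly match openai's cfg engine, so treat a successful local parse as a sanity check, not a guarantee):

from lark import Lark

parser = Lark(fsa_grammar)  # raises if the grammar itself is malformed
sample = "TRACE: q0 --[0]--> q1 q1 --[1]--> q2 \nACCEPT"
parser.parse(sample)        # raises if the sample doesn't match the grammar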

regex for structured data

# us phone number pattern (rust regex syntax)
phone_regex = r'^\d{3}-\d{3}-\d{4}$'

response = client.responses.create(
    model="gpt-5",
    input="Generate a phone number for area code 555",
    text={"format": {"type": "text"}},
    tools=[{
        "type": "custom",
        "name": "phone_gen",
        "description": "Generate phone number",
        "format": {
            "type": "grammar",
            "syntax": "regex",
            "definition": phone_regex
        }
    }],
    parallel_tool_calls=False
)
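as in the lark example, the generated number is read from response.output[1].input.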

performance characteristics

timing comparison

based on empirical testing with gpt-5:

method         average time   format guarantee   validation needed
with cfg       ~9.5s          100%               no
without cfg    ~1.0s          variable           yes

key findings:

  • cfg adds 8-10x latency overhead for simple grammars
  • first compilation can take 1-5 minutes for complex grammars
  • subsequent calls may benefit from caching
  • overhead increases with grammar complexity

factors affecting performance

  1. grammar complexity

    • number of rules
    • nesting depth
    • use of repetition operators
    • terminal count
  2. grammar type

    • regex: faster compilation, simpler patterns
    • lark: slower compilation, more expressive
  3. compilation phases

    • initial parse: grammar syntax validation
    • optimization: rule simplification
    • compilation: internal representation building

observed performance

from limited testing:

  • simple 2-rule grammar: ~9.5s total latency
  • without cfg: ~1.0s for same task
  • openai documentation notes: “first compilation can take 1-5 minutes for complex grammars”

lark compilation comparison

to understand the cfg overhead, we benchmarked the same grammars using the lark parser library (benchmark script):

grammar      rules   lark (ms)   gpt-5 cfg (ms)   overhead
binary       1       0.5-3.2     ~9,500           ~3000x
classifier   4       1.5-1.8     estimated        -
sentence     7       1.9-2.0     estimated        -
json         10      3.1-4.4     estimated        -
sir model    13      3.2-4.0     estimated        -

the massive overhead (1000-10000x) suggests gpt-5 cfg latency comes from:

  1. network/api round-trip time
  2. token-by-token grammar validation during generation
  3. integration overhead with the llm inference pipeline
  4. smaller pool of resources serving cfg requests

pure grammar compilation is negligible (<5ms) compared to the generative process. for context, llguidance (which openai uses for json schema) takes only ~50μs per token, suggesting the bottleneck is not the grammar engine itself but the integration and api layers.
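a minimal version of the lark-side measurement (using the binary grammar from above; absolute numbers will vary by machine):

import time
from lark import Lark

def compile_ms(grammar: str, runs: int = 10) -> float:
    """average grammar compilation time in milliseconds"""
    t0 = time.perf_counter()
    for _ in range(runs):
        Lark(grammar)  # compilation only; nothing is parsed or generated
    return (time.perf_counter() - t0) / runs * 1000

binary_grammar = 'start: "YES" | "NO"'
print(f"binary: {compile_ms(binary_grammar):.2f} ms")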

advanced examples

s-i-r epidemiological model

sir_grammar = r'''
start: header "\n" transitions "\n" summary

header: "TIME_STEP:" SP number

transitions: transition+
transition: person ":" SP state " -> " state SP reason "\n"

person: /[A-Z]/
state: "S" | "I" | "R"
reason: "[" /[a-z_]+/ "]"

summary: "COUNTS: S=" number ", I=" number ", R=" number

number: /[0-9]+/
SP: " "
'''
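for reference, one string this grammar permits (hand-constructed; the blank line comes from the transition's trailing "\n" followed by the "\n" in the start rule):

TIME_STEP: 3
A: S -> I [contact]

COUNTS: S=4, I=2, R=1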

chemical equation balancer

chemistry_grammar = r'''
start: equation "\n" balanced

equation: "EQUATION: " reactants " → " products
reactants: compound (SP "+" SP compound)*
products: compound (SP "+" SP compound)*

compound: coefficient? molecule
coefficient: /[2-8]/
molecule: "H2O" | "O2" | "H2" | "CO2" | "CH4" | "NH3"

balanced: "BALANCED: " ("Yes" | "No")
SP: " "
'''
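one string this grammar permits:

EQUATION: 2H2 + O2 → 2H2O
BALANCED: Yes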

json with specific schema

# enforce exact json structure
json_grammar = r'''
start: object

object: "{" WS pair (WS "," WS pair)* WS "}"
pair: '"name"' WS ":" WS string
    | '"age"' WS ":" WS number
    | '"active"' WS ":" WS boolean

string: '"' /[a-zA-Z ]+/ '"'
number: /[0-9]+/
boolean: "true" | "false"

WS: /[ \t\n]*/
'''
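since WS can also match the empty string, anything from compact to pretty-printed output is valid. one permitted string:

{"name": "Alice", "age": 30}

note that nothing in this grammar forces each key to appear exactly once or in a fixed order; the pair alternatives can repeat.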

best practices

grammar design

  1. keep terminals bounded

    // good - bounded pattern
    IDENTIFIER: /[a-z][a-z0-9_]{0,31}/

    // avoid - unbounded
    TEXT: /.*/
  2. explicit whitespace handling

    // good - explicit spaces
    sentence: word SP word SP word
    SP: " "

    // avoid - implicit whitespace
    sentence: word word word
  3. prefer terminals over complex rules

    // good - terminal for common patterns
    EMAIL: /[a-z]+@[a-z]+\.[a-z]+/

    // avoid - complex rule structure
    email: local "@" domain
    local: letter+
    domain: subdomain "." tld
  4. use meaningful names

    // good
    sql_statement: select_clause from_clause where_clause?

    // avoid
    s: sc fc wc?

error handling

import time

def call_with_cfg(client, prompt, grammar, syntax="lark", max_retries=3):
    """wrapper with retry logic for cfg calls"""

    for attempt in range(max_retries):
        try:
            response = client.responses.create(
                model="gpt-5",
                input=prompt,
                text={"format": {"type": "text"}},
                tools=[{
                    "type": "custom",
                    "name": "tool",
                    "description": "cfg tool",
                    "format": {
                        "type": "grammar",
                        "syntax": syntax,
                        "definition": grammar
                    }
                }],
                parallel_tool_calls=False,
                timeout=300
            )
            return response.output[1].input

        except Exception as e:
            if "timeout" in str(e).lower() and attempt < max_retries - 1:
                print(f"timeout on attempt {attempt + 1}, retrying...")
                time.sleep(2 ** attempt)  # exponential backoff
            else:
                raise
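
a quick usage sketch (prompt and grammar are illustrative):

from openai import OpenAI

client = OpenAI()
answer = call_with_cfg(client, "Call tool to answer: is 7 prime?", 'start: "YES" | "NO"')
print(answer)  # "YES" or "NO"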

limitations and gotchas

unsupported lark features

  • lookaround in regex: (?=...), (?!...)
  • lazy modifiers: *?, +?
  • terminal priorities: %priority
  • templates: %template
  • most imports: %import (except common.WS)

common issues

  1. grammar compilation timeout

    • complex grammars can take minutes to compile
    • set timeout=300 or higher
    • consider simplifying grammar
  2. terminal vs rule confusion

    // wrong - terminals can't be recursive
    TEXT: text_part TEXT?

    // correct - use rules for recursion
    text: text_part text?
  3. greedy lexing issues

    // problem - NUMBER matches every digit run (longest match wins)
    NUMBER: /[0-9]+/
    DIGIT: /[0-9]/  // never matches

    // solution - order matters, or use rules instead of terminals
    start: NUMBER | DIGIT

complete working example

#!/usr/bin/env python3
# /// script
# requires-python = ">=3.11"
# dependencies = ["openai"]
# ///

from openai import OpenAI
import time

def binary_classifier_with_cfg():
    """classify text as positive or negative sentiment"""

    client = OpenAI()

    # define grammar for classification
    classifier_grammar = r'''
    start: classification

    classification: sentiment " (" confidence ")"
    sentiment: "POSITIVE" | "NEGATIVE" | "NEUTRAL"
    confidence: "HIGH" | "MEDIUM" | "LOW"
    '''

    texts = [
        "this product is amazing!",
        "terrible experience, would not recommend",
        "it's okay, nothing special"
    ]

    for text in texts:
        start_time = time.time()

        response = client.responses.create(
            model="gpt-5",
            input=f"Call classifier to analyze: '{text}'",
            text={"format": {"type": "text"}},
            tools=[{
                "type": "custom",
                "name": "classifier",
                "description": "sentiment classifier",
                "format": {
                    "type": "grammar",
                    "syntax": "lark",
                    "definition": classifier_grammar
                }
            }],
            parallel_tool_calls=False,
            timeout=300
        )

        result = response.output[1].input
        elapsed = time.time() - start_time

        print(f"text: {text}")
        print(f"result: {result}")
        print(f"time: {elapsed:.2f}s\n")

if __name__ == "__main__":
    binary_classifier_with_cfg()

when to use cfg

use cfg when

  • format compliance is critical: legal documents, protocols, apis
  • parsing errors are expensive: production systems
  • schema is well-defined: structured data extraction
  • validation logic is complex: nested conditions

avoid cfg when

  • latency is critical: real-time systems (<1s requirement)
  • output is free-form: creative writing, summaries
  • grammar changes frequently: prototyping phase
  • simple validation suffices: basic string checks

summary

context-free grammars in gpt-5 provide guaranteed format compliance at the cost of increased latency. the 8-10x performance overhead is often justified when correctness matters more than speed. for production use:

  • start with simple grammars and iterate
  • cache compiled grammars when possible
  • use regex for simple patterns, lark for complex structures
  • always set appropriate timeouts
  • validate grammar syntax before deployment

cfg transforms llms from probabilistic text generators into formally constrained computational engines, opening new possibilities for reliable ai systems.