building a 150k-word english dictionary with llms: opengloss
on this page
| datasets: | mjbommar/opengloss-dictionary and mjbommar/opengloss-dictionary-definitions on HuggingFace |
| live api: | opengloss.com/api |
| scale: | 150K lexemes, 537K senses, 9.14M edges, 60M words of encyclopedic content |
| license: | CC-BY-4.0 (data), Apache-2.0 (code) |
overview
opengloss is a synthetic encyclopedic dictionary and semantic knowledge graph for English. it integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships into a single resource — providing 4.59x more sense definitions than WordNet 3.0.
the datasets are available on HuggingFace under CC-BY-4.0, a live REST API runs at opengloss.com, and the full explorer supports search, browsing, relation puzzles, and typeahead autocomplete.
when to use
- as a richer alternative to WordNet for NLP tasks requiring sense-level semantic edges
- as training data for language models needing structured lexical knowledge
- for educational tools that need etymology and encyclopedic context
- for knowledge graph construction or relation extraction experiments
- as a reference dictionary with broader coverage than traditional resources
using the huggingface datasets
install
uv pip install datasets
lexeme-level dataset (150K entries)
the opengloss-dictionary dataset contains one row per word with all senses, edges, etymology, and encyclopedia content (~1.2 GB download):
from datasets import load_dataset

ds = load_dataset("mjbommar/opengloss-dictionary", split="train")
print(f"{len(ds):,} lexemes")  # 150,101 lexemes

# look up a word
results = ds.filter(lambda x: x["word"] == "algorithm")
entry = results[0]

print(entry["parts_of_speech"])           # ['noun']
print(entry["total_senses"])              # 3
print(entry["total_edges"])               # 36
print(entry["all_synonyms"])              # ['formula', 'method', 'procedure', ...]
print(entry["encyclopedia_entry"][:200])  # encyclopedia text
print(entry["etymology_summary"])         # etymology if available

# iterate over individual senses
for sense in entry["senses"]:
    print(f"[{sense['part_of_speech']}:{sense['sense_index']}] {sense['definition']}")
    print(f"  synonyms: {sense['synonyms']}")
    print(f"  hypernyms: {sense['hypernyms']}")
    print(f"  examples: {sense['examples']}")

key columns: word, parts_of_speech, senses, all_definitions, all_synonyms, all_antonyms, all_hypernyms, all_hyponyms, all_collocations, all_examples, etymology_summary, encyclopedia_entry, edges, total_edges
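note that `ds.filter` scans the full 150K-row dataset on every call, so for repeated lookups it is cheaper to build an in-memory index once. a minimal sketch of that pattern, using small sample rows shaped like the dataset schema in place of the real download:

```python
def build_index(rows):
    """Map each word to its full row for O(1) repeated lookups."""
    return {row["word"]: row for row in rows}

# sample rows shaped like the opengloss-dictionary schema
# (the real dataset has 150,101 rows)
rows = [
    {"word": "algorithm", "total_senses": 3},
    {"word": "set", "total_senses": 12},
]

index = build_index(rows)
print(index["algorithm"]["total_senses"])  # 3
```

the same dict works unchanged when built from the real dataset (`build_index(ds)`), at the cost of holding all rows in memory.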
definition-level dataset (537K senses)
the opengloss-dictionary-definitions dataset has one row per sense definition (~310 MB download), useful when you need sense-level granularity:
from datasets import load_dataset

defs = load_dataset("mjbommar/opengloss-dictionary-definitions", split="train")
print(f"{len(defs):,} definitions")  # 536,829 definitions

# filter by part of speech
nouns = defs.filter(lambda x: x["part_of_speech"] == "noun")

# find highly polysemous words
polysemous = defs.filter(lambda x: x["total_senses_for_word"] >= 10)

# get all senses for a word
word_defs = defs.filter(lambda x: x["word"] == "set")
for d in word_defs:
    print(f"[{d['part_of_speech']}:{d['sense_index']}] {d['definition']}")

key columns: word, part_of_speech, sense_index, definition, synonyms, antonyms, hypernyms, hyponyms, examples, collocations, sense_edges, pos_level_edges
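because each row is a single sense, aggregate statistics reduce to simple group-bys over the `word` column. a hedged sketch of counting senses per word with the standard library, with sample rows standing in for the real dataset:

```python
from collections import Counter

# sample sense rows shaped like the definitions dataset
# (a word appears once per sense)
rows = [
    {"word": "set", "part_of_speech": "noun"},
    {"word": "set", "part_of_speech": "verb"},
    {"word": "algorithm", "part_of_speech": "noun"},
]

senses_per_word = Counter(r["word"] for r in rows)
print(senses_per_word.most_common(1))  # [('set', 2)]
```

against the full dataset this reproduces the `total_senses_for_word` column without any filtering passes.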
convert to pandas
import pandas as pd
df = defs.to_pandas()
print(df["part_of_speech"].value_counts())
# noun 278,568
# adjective 144,571
# verb 90,715
# ...
educational drafting dataset (27.6K documents)
opengloss-v1.1-drafting contains 27,635 synthetic educational documents (articles, essays, stories) generated from the vocabulary:
from datasets import load_dataset

drafts = load_dataset("mjbommar/opengloss-v1.1-drafting", split="train")
print(drafts[0]["title"])
print(drafts[0]["content"][:200])

using the live api
the REST API at opengloss.com returns JSON and requires no authentication:
lookup a word
curl -s 'https://opengloss.com/api/lexeme?word=algorithm' | python3 -m json.tool

import requests
resp = requests.get("https://opengloss.com/api/lexeme", params={"word": "algorithm"})
entry = resp.json()
print(entry["all_definitions"])
# ['A finite, stepwise procedure for solving a problem or completing a computation.',
# 'A set of precise rules used to generate a predictable output from given inputs.']
print(entry["all_synonyms"])
# ['formula', 'method', 'procedure', 'process', 'protocol', 'routine', 'rule']

search
# fuzzy search
curl -s 'https://opengloss.com/api/search?q=tensor&mode=fuzzy&limit=5' | python3 -m json.tool
# typeahead / prefix search
curl -s 'https://opengloss.com/api/typeahead?q=algo&limit=5&mode=prefix' | python3 -m json.tool

api endpoints
| endpoint | method | description |
|---|---|---|
| `/api/lexeme?word=<string>` | GET | lookup by word |
| `/api/lexeme?id=<u32>` | GET | lookup by numeric ID |
| `/api/search?q=<query>&mode=fuzzy\|substring` | GET | search with fuzzy or substring matching |
| `/api/typeahead?q=<query>&limit=12&mode=prefix` | GET | autocomplete suggestions |
| `/api/analytics/trending` | GET | trending searches |
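the query parameters above compose into plain GET URLs, so a client needs no SDK. a minimal sketch of building a search URL with the standard library (the endpoint and parameter names come from the table above; the helper function itself is illustrative):

```python
from urllib.parse import urlencode

BASE = "https://opengloss.com/api"

def search_url(query, mode="fuzzy", limit=5):
    """Construct an OpenGloss search URL from query parameters."""
    return f"{BASE}/search?{urlencode({'q': query, 'mode': mode, 'limit': limit})}"

print(search_url("tensor"))
# https://opengloss.com/api/search?q=tensor&mode=fuzzy&limit=5
```

pass the result to `requests.get` (or any HTTP client) and decode the JSON body, as in the lookup example above.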
data
| metric | value |
|---|---|
| lexemes | 150,101 (94,106 single-word, 55,995 multi-word) |
| sense definitions | 536,829 (avg 3.58 per lexeme) |
| semantic edges | 9.14 million (5.2M sense-level, 3.9M POS-level) |
| usage examples | ~1 million |
| collocations | 3 million |
| encyclopedic content | 60 million words (99.7% coverage) |
| etymology coverage | 97.3% |
comparison with other resources
| resource | lexemes | senses | edges | content |
|---|---|---|---|---|
| OpenGloss | 150,101 | 536,829 | 9.14M | encyclopedia + etymology |
| WordNet 3.0 | 147,306 | 117,659 | — | definitions only |
| BabelNet 5.3 | — | 23M synsets | — | multilingual |
| ConceptNet 5.7 | 8M | — | 21M | commonsense |
how it was built
opengloss was produced through a multi-agent pipeline in under one week for under $1,000:
- lexeme selection — foundation from an American English dictionary (104K words) expanded with 77K pedagogical additions via iterative neighbor-graph traversal
- sense generation — two-agent architecture using pydantic-ai: an overview agent determines POS categories, then a POS-details agent generates 1-4 definitions per category with synonyms, antonyms, hypernyms, hyponyms, and examples
- graph construction — deterministic extraction producing sense-level edges (5.2M) and POS-level edges (3.9M)
- enrichment — separate agents for etymology (97.5% coverage) and encyclopedic content (99.7% coverage, 200-400 words each)
generation model: gpt-4-mini. QA model: Claude Sonnet 4.5.
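the graph-construction step above can be illustrated with a toy version: given a sense row containing relation lists, emit one (source, relation, target) triple per listed neighbor. the field names follow the definition-level dataset schema; the real pipeline is more involved, so treat this as a sketch of the idea rather than the actual code:

```python
def extract_edges(sense):
    """Emit (source, relation, target) triples from one sense row."""
    edges = []
    for relation in ("synonyms", "antonyms", "hypernyms", "hyponyms"):
        for target in sense.get(relation, []):
            edges.append((sense["word"], relation, target))
    return edges

sense = {"word": "algorithm", "synonyms": ["procedure"], "hypernyms": ["method"]}
print(extract_edges(sense))
# [('algorithm', 'synonyms', 'procedure'), ('algorithm', 'hypernyms', 'method')]
```

because the extraction is deterministic over already-generated relation lists, it adds no model calls: the 9.14M edges come "for free" once the sense definitions exist.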
runtime architecture
the live site is powered by a Rust binary (opengloss-rs, v0.4.3) using:
- Axum/Tokio async web server
- FST (finite-state transducer) for zero-copy prefix and fuzzy lookups
- Rkyv + Zstd compressed data embedded in the binary (~830 MB)
- single entry lookup: 6.5-10.3 microseconds
- prefix search (10 results): 1.73 microseconds
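conceptually, the FST prefix lookup behaves like binary search over a sorted key set: jump to the first key >= the prefix, then walk forward while keys still match. a Python sketch with `bisect` (illustrative only; the real server uses the Rust `fst` crate, which also supports fuzzy matching):

```python
import bisect

# a sorted vocabulary stands in for the FST's key set
words = sorted(["algae", "algebra", "algorithm", "align", "tensor"])

def prefix_search(prefix, limit=10):
    """Return up to `limit` words starting with `prefix` from a sorted list."""
    i = bisect.bisect_left(words, prefix)  # first word >= prefix
    out = []
    while i < len(words) and words[i].startswith(prefix) and len(out) < limit:
        out.append(words[i])
        i += 1
    return out

print(prefix_search("alg"))  # ['algae', 'algebra', 'algorithm']
```

an FST improves on this by sharing prefixes and suffixes across keys, which is what lets the full 150K-word index live compressed inside the binary while still answering in microseconds.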
references
- Bommarito, M.J. “OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph.” arXiv:2511.18622, November 2025. https://arxiv.org/abs/2511.18622
- HuggingFace datasets: opengloss-dictionary, opengloss-dictionary-definitions, opengloss-v1.1-drafting
- GitHub: https://github.com/mjbommar/opengloss-rs
- Live site: https://opengloss.com/