building a 150k-word english dictionary with llms: opengloss
on this page
| datasets: | mjbommar/opengloss-dictionary and mjbommar/opengloss-dictionary-definitions on HuggingFace |
| live api: | opengloss.com/api |
| scale: | 150K lexemes, 537K senses, 9.14M edges, 60M words of encyclopedic content |
| license: | CC-BY-4.0 (data), Apache-2.0 (code) |
overview
opengloss is a synthetic encyclopedic dictionary and semantic knowledge graph for English. it integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships into a single resource — providing 4.59x more sense definitions than WordNet 3.0.
the datasets are available on HuggingFace under CC-BY-4.0, a live REST API runs at opengloss.com, and the full explorer supports search, browsing, relation puzzles, and typeahead autocomplete.
when to use
- as a richer alternative to WordNet for NLP tasks requiring sense-level semantic edges
- as training data for language models needing structured lexical knowledge
- for educational tools that need etymology and encyclopedic context
- for knowledge graph construction or relation extraction experiments
- as a reference dictionary with broader coverage than traditional resources
using the huggingface datasets
install
uv pip install datasets
lexeme-level dataset (150K entries)
the opengloss-dictionary dataset contains one row per word with all senses, edges, etymology, and encyclopedia content (~1.2 GB download):
from datasets import load_dataset

ds = load_dataset("mjbommar/opengloss-dictionary", split="train")
print(f"{len(ds):,} lexemes")  # 150,101 lexemes

# look up a word
results = ds.filter(lambda x: x["word"] == "algorithm")
entry = results[0]

print(entry["parts_of_speech"])           # ['noun']
print(entry["total_senses"])              # 3
print(entry["total_edges"])               # 36
print(entry["all_synonyms"])              # ['formula', 'method', 'procedure', ...]
print(entry["encyclopedia_entry"][:200])  # encyclopedia text
print(entry["etymology_summary"])         # etymology if available

# iterate over individual senses
for sense in entry["senses"]:
    print(f"[{sense['part_of_speech']}:{sense['sense_index']}] {sense['definition']}")
    print(f"  synonyms: {sense['synonyms']}")
    print(f"  hypernyms: {sense['hypernyms']}")
    print(f"  examples: {sense['examples']}")

key columns: word, parts_of_speech, senses, all_definitions, all_synonyms, all_antonyms, all_hypernyms, all_hyponyms, all_collocations, all_examples, etymology_summary, encyclopedia_entry, edges, total_edges
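note that `ds.filter` scans the full 150K-row dataset on every call, so for repeated lookups it is cheaper to build an in-memory index once. a minimal sketch of that pattern, using small sample rows shaped like the dataset schema in place of the real download:

```python
def build_index(rows):
    """Map each word to its full row for O(1) repeated lookups."""
    return {row["word"]: row for row in rows}

# sample rows shaped like the opengloss-dictionary schema
# (the real dataset has 150,101 rows)
rows = [
    {"word": "algorithm", "total_senses": 3},
    {"word": "set", "total_senses": 12},
]

index = build_index(rows)
print(index["algorithm"]["total_senses"])  # 3
```

the same dict works unchanged when built from the real dataset (`build_index(ds)`), at the cost of holding all rows in memory.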
definition-level dataset (537K senses)
the opengloss-dictionary-definitions dataset has one row per sense definition (~310 MB download), useful when you need sense-level granularity:
from datasets import load_dataset

defs = load_dataset("mjbommar/opengloss-dictionary-definitions", split="train")
print(f"{len(defs):,} definitions")  # 536,829 definitions

# filter by part of speech
nouns = defs.filter(lambda x: x["part_of_speech"] == "noun")

# find highly polysemous words
polysemous = defs.filter(lambda x: x["total_senses_for_word"] >= 10)

# get all senses for a word
word_defs = defs.filter(lambda x: x["word"] == "set")
for d in word_defs:
    print(f"[{d['part_of_speech']}:{d['sense_index']}] {d['definition']}")

key columns: word, part_of_speech, sense_index, definition, synonyms, antonyms, hypernyms, hyponyms, examples, collocations, sense_edges, pos_level_edges
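because each row is a single sense, aggregate statistics reduce to simple group-bys over the `word` column. a hedged sketch of counting senses per word with the standard library, with sample rows standing in for the real dataset:

```python
from collections import Counter

# sample sense rows shaped like the definitions dataset
# (a word appears once per sense)
rows = [
    {"word": "set", "part_of_speech": "noun"},
    {"word": "set", "part_of_speech": "verb"},
    {"word": "algorithm", "part_of_speech": "noun"},
]

senses_per_word = Counter(r["word"] for r in rows)
print(senses_per_word.most_common(1))  # [('set', 2)]
```

against the full dataset this reproduces the `total_senses_for_word` column without any filtering passes.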
convert to pandas
import pandas as pd
df = defs.to_pandas()
print(df["part_of_speech"].value_counts())
# noun 278,568
# adjective 144,571
# verb 90,715
# ...
educational drafting dataset (27.6K documents)
opengloss-v1.1-drafting contains 27,635 synthetic educational documents (articles, essays, stories) generated from the vocabulary:
from datasets import load_dataset

drafts = load_dataset("mjbommar/opengloss-v1.1-drafting", split="train")
print(drafts[0]["title"])
print(drafts[0]["content"][:200])

using the live api
the REST API at opengloss.com returns JSON and requires no authentication:
lookup a word
curl -s 'https://opengloss.com/api/lexeme?word=algorithm' | python3 -m json.tool

import requests
resp = requests.get("https://opengloss.com/api/lexeme", params={"word": "algorithm"})
entry = resp.json()
print(entry["all_definitions"])
# ['A finite, stepwise procedure for solving a problem or completing a computation.',
# 'A set of precise rules used to generate a predictable output from given inputs.']
print(entry["all_synonyms"])
# ['formula', 'method', 'procedure', 'process', 'protocol', 'routine', 'rule']

search
# fuzzy search
curl -s 'https://opengloss.com/api/search?q=tensor&mode=fuzzy&limit=5' | python3 -m json.tool
# typeahead / prefix search
curl -s 'https://opengloss.com/api/typeahead?q=algo&limit=5&mode=prefix' | python3 -m json.tool

api endpoints
| endpoint | method | description |
|---|---|---|
| `/api/lexeme?word=<string>` | GET | lookup by word |
| `/api/lexeme?id=<u32>` | GET | lookup by numeric ID |
| `/api/search?q=<query>&mode=fuzzy\|substring` | GET | search with fuzzy or substring matching |
| `/api/typeahead?q=<query>&limit=12&mode=prefix` | GET | autocomplete suggestions |
| `/api/analytics/trending` | GET | trending searches |
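the query parameters above compose into plain GET URLs, so a client needs no SDK. a minimal sketch of building a search URL with the standard library (the endpoint and parameter names come from the table above; the helper function itself is illustrative):

```python
from urllib.parse import urlencode

BASE = "https://opengloss.com/api"

def search_url(query, mode="fuzzy", limit=5):
    """Construct an OpenGloss search URL from query parameters."""
    return f"{BASE}/search?{urlencode({'q': query, 'mode': mode, 'limit': limit})}"

print(search_url("tensor"))
# https://opengloss.com/api/search?q=tensor&mode=fuzzy&limit=5
```

pass the result to `requests.get` (or any HTTP client) and decode the JSON body, as in the lookup example above.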
data
| metric | value |
|---|---|
| lexemes | 150,101 (94,106 single-word, 55,995 multi-word) |
| sense definitions | 536,829 (avg 3.58 per lexeme) |
| semantic edges | 9.14 million (5.2M sense-level, 3.9M POS-level) |
| usage examples | ~1 million |
| collocations | 3 million |
| encyclopedic content | 60 million words (99.7% coverage) |
| etymology coverage | 97.3% |
comparison with other resources
| resource | lexemes | senses | edges | content |
|---|---|---|---|---|
| OpenGloss | 150,101 | 536,829 | 9.14M | encyclopedia + etymology |
| WordNet 3.0 | 147,306 | 117,659 | — | definitions only |
| BabelNet 5.3 | — | 23M synsets | — | multilingual |
| ConceptNet 5.7 | 8M | — | 21M | commonsense |
how it was built
opengloss was produced through a multi-agent pipeline in under one week for under $1,000:
- lexeme selection — foundation from an American English dictionary (104K words) expanded with 77K pedagogical additions via iterative neighbor-graph traversal
- sense generation — two-agent architecture using pydantic-ai: an overview agent determines POS categories, then a POS-details agent generates 1-4 definitions per category with synonyms, antonyms, hypernyms, hyponyms, and examples
- graph construction — deterministic extraction producing sense-level edges (5.2M) and POS-level edges (3.9M)
- enrichment — separate agents for etymology (97.5% coverage) and encyclopedic content (99.7% coverage, 200-400 words each)
generation model: gpt-4-mini. QA model: Claude Sonnet 4.5.
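the graph-construction step above can be illustrated with a toy version: given a sense row containing relation lists, emit one (source, relation, target) triple per listed neighbor. the field names follow the definition-level dataset schema; the real pipeline is more involved, so treat this as a sketch of the idea rather than the actual code:

```python
def extract_edges(sense):
    """Emit (source, relation, target) triples from one sense row."""
    edges = []
    for relation in ("synonyms", "antonyms", "hypernyms", "hyponyms"):
        for target in sense.get(relation, []):
            edges.append((sense["word"], relation, target))
    return edges

sense = {"word": "algorithm", "synonyms": ["procedure"], "hypernyms": ["method"]}
print(extract_edges(sense))
# [('algorithm', 'synonyms', 'procedure'), ('algorithm', 'hypernyms', 'method')]
```

because the extraction is deterministic over already-generated relation lists, it adds no model calls: the 9.14M edges come "for free" once the sense definitions exist.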
runtime architecture
the live site is powered by a Rust binary (opengloss-rs, v0.4.3) using:
- Axum/Tokio async web server
- FST (finite-state transducer) for zero-copy prefix and fuzzy lookups
- Rkyv + Zstd compressed data embedded in the binary (~830 MB)
- single entry lookup: 6.5-10.3 microseconds
- prefix search (10 results): 1.73 microseconds
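conceptually, the FST prefix lookup behaves like binary search over a sorted key set: jump to the first key >= the prefix, then walk forward while keys still match. a Python sketch with `bisect` (illustrative only; the real server uses the Rust `fst` crate, which also supports fuzzy matching):

```python
import bisect

# a sorted vocabulary stands in for the FST's key set
words = sorted(["algae", "algebra", "algorithm", "align", "tensor"])

def prefix_search(prefix, limit=10):
    """Return up to `limit` words starting with `prefix` from a sorted list."""
    i = bisect.bisect_left(words, prefix)  # first word >= prefix
    out = []
    while i < len(words) and words[i].startswith(prefix) and len(out) < limit:
        out.append(words[i])
        i += 1
    return out

print(prefix_search("alg"))  # ['algae', 'algebra', 'algorithm']
```

an FST improves on this by sharing prefixes and suffixes across keys, which is what lets the full 150K-word index live compressed inside the binary while still answering in microseconds.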
references
- Bommarito, M.J. “OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph.” arXiv:2511.18622, November 2025. https://arxiv.org/abs/2511.18622
- HuggingFace datasets: opengloss-dictionary, opengloss-dictionary-definitions, opengloss-v1.1-drafting
- GitHub: https://github.com/mjbommar/opengloss-rs
- Live site: https://opengloss.com/