ADR-0024: No Regex Parsing in the DSL - Regex is a Grammar Smell
Status: Accepted
Date: 2026-05-19
Context
The DSL parser is line-and-token oriented. Most constructs (entity,
surface, workspace, scope:, process, event_model, view,
integration …) are parsed by reading tokens off a buffer and dispatching
on keywords — a recursive-descent style that produces typed IR directly.
A handful of places, however, accept a free-form string field that is
parsed at runtime via regex. The longest-lived example is _AGGREGATE_RE,
defined at back/runtime/workspace_aggregation.py:40.
It powers count(Entity), avg(field), sum(field where pred) across
five IR sites (WorkspaceRegion.aggregates, OverlaySeriesSpec,
PipelineStageSpec.value/.progress, ActionCardSpec.count_aggregate,
LensAggregatePrimary.aggregate). Other examples have appeared and been
fixed: _parse_simple_where (since superseded by the predicate parser),
ad-hoc bucket-label regex, current-bucket sentinel substitution.
In every case the regex started as a small expedient. In every case it
accreted disambiguation hacks ("count(X) → entity; avg(X) → column"),
silently rejected new shapes ("avg(Entity.column) is unrepresentable"),
and blocked downstream tooling (linter, doc-gen, IDE completion) because
the contents of the string were invisible to the IR.
Issue #1144 Gap 1 phase 2 surfaced the cost concretely: a cross-entity
aggregate (avg(MarkingResult.score)) cannot be expressed because the
regex has no slot for it. Extending the regex is a one-line change. But
extending the regex again — and adding the matching disambiguation
branch — entrenches the smell.
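To make the failure mode concrete, here is a minimal sketch of a regex-parsed aggregate string. It is a hypothetical stand-in, not the real _AGGREGATE_RE: the pattern, the function name parse_aggregate_string, and the dict it returns are all assumptions made for illustration. It shows the two symptoms described above: one capture group whose meaning depends on the function name, and a new shape that is rejected because the pattern has no slot for it.

```python
import re

# Hypothetical stand-in for a regex-parsed aggregate field.
# This is NOT the real _AGGREGATE_RE; it only mimics the described behaviour.
AGG_RE = re.compile(r"^(count|avg|sum)\((\w+)(?:\s+where\s+(.+))?\)$")


def parse_aggregate_string(text: str) -> dict:
    """Parse an aggregate string at runtime, the way the ADR argues against."""
    m = AGG_RE.match(text.strip())
    if m is None:
        # New shapes fail here, at request time rather than at validation time.
        raise ValueError(f"unparseable aggregate: {text!r}")
    func, target, where = m.groups()
    # The disambiguation hack: the same capture group is an entity for
    # count(...) and a column for every other function.
    if func == "count":
        return {"func": func, "entity": target, "column": None, "where": where}
    return {"func": func, "entity": None, "column": target, "where": where}
```

With this sketch, count(Task) and avg(score) both parse, while avg(MarkingResult.score) raises ValueError because the dot has nowhere to go.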
Decision
The DSL parser does not use regex to parse DSL constructs or expressions. When a piece of DSL needs to be parsed, parse it structurally — produce typed IR — using the same token-driven dispatcher the rest of the language uses.
Specifically:
- No regex for grammar. Regex is reserved for character-class recognition (whitespace, identifier shape, numeric literal shape) and for non-DSL inputs (log line scraping, file globs, etc.). Recognising a DSL keyword or shape via regex is grammar work in disguise.
- No string fields whose contents are later parsed at runtime. If a construct accepts user-authored DSL (an aggregate call, a predicate, a path expression), the parser MUST produce typed IR for it at parse time, validated against the IR schema. Stashing the source string for "later" defers parsing into the runtime, where errors surface as 500s instead of dazzle validate failures and where tooling cannot inspect the construct.
- A regex in the parser is a signal to extend the grammar. When the temptation arises, the right next step is to add an IR type and a dispatcher method, not a re.compile. The regex is the symptom; the missing grammar is the cause.
This applies to new constructs and, on a rolling basis, to existing ones: when a regex-parsed construct grows a new shape or a disambiguation hack is needed, that change MUST be implemented by retiring the regex in favour of typed IR — not by extending the regex.
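The decision can be sketched as a small token-driven parser that emits typed IR. This is an illustration under stated assumptions, not the project's parser: AggregateRef, Comparison, AggregateParser, DslSyntaxError, and the known_entities parameter are hypothetical names, and the where clause is reduced to a single field=value comparison. Regex appears only in the tokenizer, where it recognises lexical shapes (identifier, integer, punctuation); the grammar itself is handled by method dispatch.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Lexical-shape regex only: one identifier, integer, or punctuation mark.
_TOKEN_RE = re.compile(r"\s*([A-Za-z_]\w*|\d+|[().=])")


class DslSyntaxError(ValueError):
    """Raised at parse time, so validation can report it before deployment."""


@dataclass(frozen=True)
class Comparison:  # hypothetical, heavily simplified predicate IR
    field: str
    value: str


@dataclass(frozen=True)
class AggregateRef:  # hypothetical typed IR for an aggregate call
    func: str
    entity: Optional[str] = None
    column: Optional[str] = None
    where: Optional[Comparison] = None


def tokenize(text: str) -> list:
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = _TOKEN_RE.match(text, pos)
        if m is None:
            raise DslSyntaxError(f"unexpected input near offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class AggregateParser:
    FUNCS = frozenset({"count", "avg", "sum"})

    def __init__(self, tokens, known_entities):
        self.tokens, self.pos = tokens, 0
        self.known_entities = known_entities  # assumed symbol table

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, want=None):
        tok = self.peek()
        if tok is None or (want is not None and tok != want):
            raise DslSyntaxError(f"expected {want or 'a token'}, got {tok!r}")
        self.pos += 1
        return tok

    def ident(self):
        tok = self.expect()
        if not (tok[0].isalpha() or tok[0] == "_"):
            raise DslSyntaxError(f"expected identifier, got {tok!r}")
        return tok

    def parse(self) -> AggregateRef:
        func = self.expect()
        if func not in self.FUNCS:
            raise DslSyntaxError(f"unknown aggregate function {func!r}")
        self.expect("(")
        first = self.ident()
        entity = column = None
        if self.peek() == ".":  # Entity.column form
            self.expect(".")
            entity, column = first, self.ident()
        elif first in self.known_entities:  # bare entity name
            entity = first
        else:  # bare column name
            column = first
        where = None
        if self.peek() == "where":
            self.expect("where")
            field = self.ident()
            self.expect("=")
            value = self.expect()
            if not (value[0].isalnum() or value[0] == "_"):
                raise DslSyntaxError(f"expected a value, got {value!r}")
            where = Comparison(field, value)
        self.expect(")")
        if self.peek() is not None:
            raise DslSyntaxError(f"trailing input: {self.peek()!r}")
        return AggregateRef(func, entity, column, where)


def parse_aggregate(text: str, known_entities=frozenset()) -> AggregateRef:
    return AggregateParser(tokenize(text), known_entities).parse()
```

In this sketch, parse_aggregate("avg(MarkingResult.score)", {"MarkingResult"}) returns an AggregateRef with both entity and column populated, and a malformed input such as count(Task raises DslSyntaxError at parse time. The entity-versus-column distinction lands in named fields, so downstream code never has to switch on func to interpret a capture group.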
Consequences
Positive
- Errors surface earlier. Parse-time errors at dazzle validate instead of runtime regex misses.
- Tooling sees the structure. Linter, IDE completion, doc-gen, composition audits all read typed IR. They can't read inside a string.
- No disambiguation hacks. When a shape's meaning depends on the parser context (entity vs column, literal vs aggregate), the IR encodes the distinction in named fields rather than a func-switch.
- Extension has a home. New shapes get an IR field. The grammar grows in one place, not across re.compile calls in multiple modules.
- Compiles to alternative targets. Typed IR can be compiled to SQL, to MCP tool schemas, to OpenAPI examples, etc. Strings can't.
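As one illustration of the last point, a typed aggregate can be lowered to SQL by reading its fields directly. This is a sketch under assumptions: the AggregateRef shape below, the to_sql function, and the mapping of entity names to table names are all hypothetical, and the identifier quoting relies on the parser having already checked that names have identifier shape.

```python
from dataclasses import dataclass
from typing import Optional

_ALLOWED_FUNCS = frozenset({"count", "avg", "sum"})


@dataclass(frozen=True)
class AggregateRef:  # hypothetical typed IR; field names are assumptions
    func: str
    entity: str
    column: Optional[str] = None
    where_field: Optional[str] = None
    where_value: Optional[str] = None


def to_sql(agg: AggregateRef) -> tuple:
    """Lower an aggregate to (sql, params) with a placeholder for the value."""
    if agg.func not in _ALLOWED_FUNCS:
        raise ValueError(f"unsupported aggregate function: {agg.func!r}")
    # count(Entity) has no column, so it becomes COUNT(*).
    target = "*" if agg.column is None else f'"{agg.column}"'
    sql = f'SELECT {agg.func.upper()}({target}) FROM "{agg.entity}"'
    params = ()
    if agg.where_field is not None:
        # The comparison value travels as a bound parameter, never inline.
        sql += f' WHERE "{agg.where_field}" = ?'
        params = (agg.where_value,)
    return sql, params
```

The same fields could feed a schema generator or a doc renderer; none of those consumers needs to re-parse a string.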
Negative
- Migration cost. Five sites carry regex-parsed strings today (_AGGREGATE_RE consumers). Each migration is a clean-break diff per ADR-0003: IR change + parser change + runtime change + tests + example apps in one commit. See dev_docs/2026-05-19-aggregate-ref-ir-brainstorm.md for the aggregate-specific sequencing.
- Parser surface grows. Recursive-descent parsing of small sub-grammars is more code than a regex. The tradeoff is paid back in validation, tooling, and the lack of disambiguation hacks.
- A judgement call remains. Character-class regex (e.g. "is this an identifier?") is still allowed; "is this a count(X) call?" is not. The line is: matches a lexical shape (OK) vs matches a grammar shape (not OK).
Neutral
- The DSL surface syntax does not change. Users continue to write terse, familiar forms (count(Task where status=open)). What changes is how the parser handles them: structurally, not via regex.
Alternatives Considered
1. Extend the regex on demand
Each time a new shape is needed, add a capture group and a downstream
branch. This is what was almost done for #1144 Gap 1 phase 2 (extend
_AGGREGATE_RE to (\w+)(?:\.(\w+))?).
Rejected: The regex was already encoding two distinct grammars via
func-disambiguation. Each extension worsens the smell and pushes the
fix-cost forward. The disambiguation hack is exactly the kind of subtle
source of bugs that ADR-0009 was written to eliminate for predicates.
2. Allow regex for "small, contained" sub-grammars
Permit regex for parsing tiny string fields with clear shape (e.g.
"yyyy-mm-dd" date literals, the count(X) shape).
Rejected: No regex starts large. The _AGGREGATE_RE example began
as "just count(X)" and grew into a five-consumer load-bearing
disambiguation hub over four releases. The rule has to be sharp or it
won't bind.
3. Defer parsing to runtime universally
Keep DSL fields as strings, parse on each request when needed.
Rejected: Defers errors from dazzle validate (where they're
caught pre-deployment) to runtime (where they're 500s). Contradicts
ADR-0006's frozen-IR guarantee and ADR-0009's link-time validation
posture.
Implementation
- This ADR is normative for new constructs from 2026-05-19.
- Existing regex-parsed constructs are migrated as they evolve. Current backlog of regex-encoded grammar slots:
    - _AGGREGATE_RE consumers (5 IR sites). Migration drafted in dev_docs/2026-05-19-aggregate-ref-ir-brainstorm.md.
    - parse_aggregate_where (back/runtime/aggregate_where_parser.py): already a structured parser, but lives in back/ and duplicates the main predicate parser. Folding it in is the second slice of the aggregate migration.
- Linter check: tests/unit/test_no_regex_in_parser.py greps src/dazzle/core/dsl_parser_impl/ for re.compile and re.match and fails on hits outside an explicit allowlist (lexical-shape regex). This is the enforcement gate.
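A grep-style gate of this kind might look roughly like the sketch below. The contents of the real test file are not shown in this ADR, so everything here is an assumption: the function names scan_sources and scan_tree, the per-file allowlist, and the "path:line" hit format are hypothetical. Using a regex in the scanner itself is consistent with the rule, since Python source text is a non-DSL input.

```python
import re
from pathlib import Path

# The two calls the ADR names as the gate's targets.
_REGEX_CALL = re.compile(r"\bre\.(?:compile|match)\(")


def scan_sources(sources: dict, allowlist=frozenset()) -> list:
    """Return 'path:line' for each regex call in files not on the allowlist.

    `sources` maps a relative path to that file's text. The per-file
    allowlist is an assumption; the real gate may allowlist differently.
    """
    hits = []
    for rel in sorted(sources):
        if rel in allowlist:
            continue
        for lineno, line in enumerate(sources[rel].splitlines(), start=1):
            if _REGEX_CALL.search(line):
                hits.append(f"{rel}:{lineno}")
    return hits


def scan_tree(root: Path, allowlist=frozenset()) -> list:
    """Filesystem wrapper: read every .py file under `root` and scan it."""
    sources = {
        p.relative_to(root).as_posix(): p.read_text(encoding="utf-8")
        for p in root.rglob("*.py")
    }
    return scan_sources(sources, allowlist)
```

A test built on this would assert that scan_tree over the parser package returns an empty list, with the lexer module (or whichever file holds the lexical-shape patterns) on the allowlist. A plain text scan will also flag matches inside comments and strings; an AST-based check would be stricter, at the cost of more code.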
Related
- ADR-0003 — Migrations are clean breaks, no compatibility shims. Migrations under this ADR follow the same rule.
- ADR-0006 — Typed IR is the source of truth. Stashed strings violate that posture.
- ADR-0009 — Same shape of argument for scope predicates: formal IR with link-time validation, not ad-hoc pattern matching.
- ADR-0023 — Similar shape of argument for HTML output: pick the right mechanism per intent, don't paper-over with a string concat.