Nicholas Foundation

Research initiative

Rule-Before-Train AI Safety

"Can rule-based scaffolding give us safer AI than post-hoc alignment?"

Modern AI safety is largely retrofitted — we train models on enormous corpora and then apply RLHF, content filters, and prompt engineering to constrain behaviour after the fact. Rule-Before-Train inverts that: define explicit behavioural rules first, ship them as a working rule-based agent, and let machine-learning components extend that scaffold rather than replace it.

Current findings

  • A working rule-based conversational engine is achievable in modest engineering time (Nyx AI ships a C++20 rule engine with normalised token matching, JSON-driven intents, and a scenario-based eval harness).
  • Rule-based scaffolding gives operators an explicit, auditable spec for what an agent will and won't do, something RLHF-aligned models do not provide in readable form.
  • The hard problem isn't building the engine; it's authoring high-quality intents that cover the long tail without rule explosion.

What we want to achieve

  • Reach feature parity with simple LLM chat for a constrained domain (e.g., research assistance) — without invoking a hosted model.
  • Add narrow ML components (intent classification, slot extraction) that augment the rule scaffold rather than replace it.
  • Publish the eval suite so others can compare rule-first vs. train-first approaches on consistent benchmarks.
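One shape such an eval suite could take is a set of scripted conversations run against any agent exposed as a text-in, text-out function, so a rule-first and a train-first agent can be scored on the same scenarios. The sketch below is illustrative only and written in Python for brevity: the scenario format, `make_toy_agent`, and `run_scenarios` are assumptions made for this example, not Nyx AI's actual harness or file format.

```python
from typing import Callable

# Hypothetical scenario format: a name plus ordered (user turn, expected substring) pairs.
SCENARIOS = [
    {"name": "greeting", "turns": [("hello", "help")]},
    {"name": "note_roundtrip",
     "turns": [("note buy milk", "saved"), ("list notes", "buy milk")]},
]


def make_toy_agent() -> Callable[[str], str]:
    """Build a tiny stand-in agent with its own note store (illustration only)."""
    notes: list[str] = []

    def reply(utterance: str) -> str:
        text = utterance.strip().lower()
        if text == "hello":
            return "Hello. How can I help?"
        if text.startswith("note "):
            notes.append(utterance.strip()[5:])
            return "Note saved."
        if text == "list notes":
            return "; ".join(notes) if notes else "No notes yet."
        return "Sorry, I have no rule for that."

    return reply


def run_scenarios(scenarios, agent_factory) -> dict[str, bool]:
    """Run each scenario against a fresh agent and record pass or fail."""
    results = {}
    for scenario in scenarios:
        agent = agent_factory()  # fresh state, so scenarios cannot leak into each other
        results[scenario["name"]] = all(
            expected.lower() in agent(user).lower()
            for user, expected in scenario["turns"]
        )
    return results


results = run_scenarios(SCENARIOS, make_toy_agent)
pass_rate = sum(results.values()) / len(results)
```

Because the harness only needs a factory that returns a callable, the same scenarios could in principle be pointed at a hosted model wrapped in the same interface. Substring checks are a deliberately crude pass criterion; a published suite would likely need something stricter.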

The question, restated

Most shipping AI assistants today follow the same pattern: train a large language model on broad corpora, then patch its behaviour with reinforcement learning from human feedback (RLHF), content filters, and prompt engineering. The behaviour you get is emergent: a side effect of the training pipeline. The constraints you apply are retrofitted: added after the model has already absorbed its training data.

Rule-Before-Train inverts that. The agent’s behaviour starts with explicit, human-readable rules. Machine learning extends that scaffold — for intent classification, for slot extraction, for fluency — but never replaces it.

Why this is worth doing

Three reasons, in increasing order of importance.

  1. Auditability. A rule-based agent’s behaviour is explicit. You can read what it will do. You cannot do that with a 70-billion-parameter language model.
  2. Update locality. Fixing a bad behaviour in a rule-based agent means editing a JSON intent file. Fixing the same behaviour in a trained model usually means a fine-tuning or retraining cycle.
  3. Resource economics. A rule-based engine ships in megabytes and runs on a laptop. A trained model that delivers comparable behaviour for a constrained domain can cost tens of thousands of dollars to train and typically needs a GPU to serve.

What Nyx AI demonstrates

Nyx AI is the working prototype that informs this research. It’s a C++20 rule engine with:

  • Normalised token and phrase matching (case- and diacritic-insensitive, configurable stemming).
  • JSON-driven intent definitions (text patterns → response templates).
  • Lightweight conversation state and persistent memory.
  • A scenario-based evaluation harness for regression testing.
  • Local organiser stores for notes, tasks, and reminders — so the engine can be useful in a personal-assistant role without invoking any hosted model.
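To make the first two items concrete, here is a minimal sketch of normalised phrase matching over JSON-defined intents. It is written in Python for brevity and is not Nyx AI's C++ code: the intent schema (`name`, `patterns`, `response`) and the helper names are assumptions chosen for illustration, and stemming is left out.

```python
import json
import unicodedata

# Hypothetical intent file contents; the schema is an assumption, not Nyx AI's format.
INTENTS_JSON = """
[
  {"name": "greet", "patterns": ["hello", "good morning"],
   "response": "Hello. How can I help?"},
  {"name": "add_note", "patterns": ["add a note"],
   "response": "Note saved."}
]
"""


def normalise(text: str) -> list[str]:
    """Case-fold, strip diacritics, and split into alphanumeric tokens."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in without_marks)
    return cleaned.split()


def contains_phrase(tokens: list[str], phrase: list[str]) -> bool:
    """True if the phrase tokens appear as a contiguous run inside tokens."""
    n = len(phrase)
    return n > 0 and any(
        tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1)
    )


def match_intent(utterance: str, intents: list[dict]):
    """Return the first intent whose pattern occurs in the utterance, else None."""
    tokens = normalise(utterance)
    for intent in intents:
        for pattern in intent["patterns"]:
            if contains_phrase(tokens, normalise(pattern)):
                return intent
    return None


intents = json.loads(INTENTS_JSON)
greeting = match_intent("Héllo there!", intents)
note = match_intent("Please ADD a Note about cafés", intents)
unknown = match_intent("what time is it", intents)
```

In this sketch, changing what the agent does means editing the JSON and reloading it, and the first matching intent wins, so file order acts as priority. A real engine would need an explicit policy for overlapping patterns.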

The engine ships today; the research questions it surfaces are about what to do next.

Open questions

  • How far can a well-authored rule-based agent go in a constrained domain before the rule-explosion cost dominates?
  • What’s the right interface between rules and small, narrow ML models (for example, intent classification on a 100 ms versus a 10 ms latency budget, or slot extraction gated by confidence thresholds)?
  • Can we publish an eval suite that lets practitioners compare rule-first and train-first agents on the same tasks?
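As one possible answer to the interface question, the sketch below routes rules first and consults a narrow classifier only when no rule matches. A prediction is accepted only if it clears a confidence threshold and names an intent the rule set already defines. Everything here is hypothetical: `Prediction`, `route`, the toy rule matcher and classifier, and the 0.8 threshold are placeholders for illustration, not part of Nyx AI.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Prediction:
    intent: str
    confidence: float  # assumed to lie in [0, 1]


def route(
    utterance: str,
    rule_match: Callable[[str], Optional[str]],
    classifier: Callable[[str], Prediction],
    known_intents: set[str],
    threshold: float = 0.8,  # placeholder value, not a recommendation
) -> tuple[str, str]:
    """Return (intent, source), where source is 'rule', 'ml', or 'fallback'."""
    ruled = rule_match(utterance)
    if ruled is not None:
        return ruled, "rule"  # an explicit rule always wins
    prediction = classifier(utterance)
    # The classifier may only select intents the rule set defines,
    # so it extends the scaffold rather than replacing it.
    if prediction.intent in known_intents and prediction.confidence >= threshold:
        return prediction.intent, "ml"
    return "fallback", "fallback"


def toy_rules(utterance: str) -> Optional[str]:
    """Stand-in rule matcher: exact token match on 'hello'."""
    return "greet" if "hello" in utterance.lower().split() else None


def toy_classifier(utterance: str) -> Prediction:
    """Stand-in for a trained model; confidence values are made up."""
    if "remember" in utterance.lower():
        return Prediction("add_note", 0.9)
    return Prediction("add_note", 0.3)


KNOWN = {"greet", "add_note"}
by_rule = route("hello there", toy_rules, toy_classifier, KNOWN)
by_model = route("remember to buy milk", toy_rules, toy_classifier, KNOWN)
by_fallback = route("what is the weather", toy_rules, toy_classifier, KNOWN)
```

Returning the source alongside the intent keeps the decision auditable: an operator can see whether a reply came from a written rule or from a model guess. Latency budgets are not modelled here.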

The answers will inform Nyx’s next year of development.
