Rule-Before-Train AI Safety
Can rule-based scaffolding give us safer AI than post-hoc alignment?
Modern AI safety is largely retrofitted: we train models on enormous corpora and then apply reinforcement learning from human feedback (RLHF), content filters, and prompt engineering to constrain behaviour after the fact. Rule-Before-Train inverts that order: define explicit behavioural rules first, ship them as a working rule-based agent, and let machine-learning components extend that scaffold rather than replace it.
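To make the inversion concrete, here is a minimal Python sketch of one way a rule scaffold might wrap a learned component. Every name in it (Rule, RuleScaffoldAgent, rule_based_policy, learned_policy, the "no_delete" rule) is hypothetical and chosen for illustration; it is not a reference implementation of Rule-Before-Train. It assumes actions are plain strings and that the rule-based policy is compliant with the rules by construction.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Rule:
    """An explicit behavioural rule, written before any model is trained."""
    name: str
    forbids: Callable[[str], bool]  # True if the action violates this rule


class RuleScaffoldAgent:
    """Rule-based agent that an optional learned policy may extend, never bypass."""

    def __init__(
        self,
        rules: Sequence[Rule],
        rule_based_policy: Callable[[str], str],
        learned_policy: Optional[Callable[[str], str]] = None,
    ) -> None:
        self.rules = list(rules)
        # Assumption: this policy only ever returns rule-compliant actions.
        self.rule_based_policy = rule_based_policy
        self.learned_policy = learned_policy

    def violated(self, action: str) -> List[str]:
        """Return the names of all rules the action would break."""
        return [rule.name for rule in self.rules if rule.forbids(action)]

    def act(self, observation: str) -> str:
        # The learned component only proposes; the rules decide.
        if self.learned_policy is not None:
            proposal = self.learned_policy(observation)
            if not self.violated(proposal):
                return proposal
        # No learned policy, or its proposal broke a rule: use the scaffold.
        return self.rule_based_policy(observation)


# Hypothetical rule and policies, for illustration only.
rules = [Rule("no_delete", lambda action: action.startswith("delete"))]


def rule_based_policy(observation: str) -> str:
    return "ask_human"


def learned_policy(observation: str) -> str:
    return "delete_all" if "cleanup" in observation else "summarise"


baseline_agent = RuleScaffoldAgent(rules, rule_based_policy)
extended_agent = RuleScaffoldAgent(rules, rule_based_policy, learned_policy)
```

In this sketch, baseline_agent already works with no learned component at all, which is the "ship a working rule-based agent" step. extended_agent accepts the learned proposal only when no rule forbids it and otherwise falls back to the scaffold, so the learned part extends behaviour without being able to override the rules.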