This post is part of the Domain Orchestrator series.
We’d established the issue and had a solution ready to go - but here’s the kicker: building the solution was the easy part.
The existing system was live, business-critical, and functioning reliably. A “big bang” release - flipping everything over at once - was completely off the table. There was simply too much value flowing through this path to take unnecessary risks.
Almost every transaction that mattered to the business passed through this functionality, so we had to be extremely careful to avoid any customer or partner impact. The analogy in my head was trying to change the engine of a car while driving at full speed. We couldn’t stop, we couldn’t break down, and ideally, we couldn’t even slow down.
So the core questions became:
To manage this, we didn’t just create a task list - we created a phased migration playbook, where each phase increased confidence before we moved to the next.
And at every step, we had a rollback plan ready.
The first step in many migrations is to baseline the old system. For us, this wasn’t easily possible.
The “smart consumer” approach meant logic and behaviour were distributed across both the consumer and several downstream services. Getting a unified view of how all the pieces behaved together wasn’t straightforward - which was part of what motivated this initiative in the first place.
So instead of baselining the old world, Phase 1 focused on building the cockpit for the new world.
Before sending any real traffic, we set up observability for the new system: instrumentation, dashboards, and the ability to emit metrics from the new codebase. We created visualisations for:
This dashboard became our “eyes” for the first real technical phase: Shadow Mode.
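To make the instrumentation concrete, here is a minimal sketch of the kind of counters we mean - an in-process stand-in for a real metrics client (StatsD, Prometheus, etc.); the metric and label names are hypothetical, not our actual dashboard fields:

```python
from collections import Counter


class ShadowMetrics:
    """Tiny in-process metric counter; a stand-in for a real metrics
    client. Each (name, labels) pair becomes one counter key."""

    def __init__(self):
        self.counters = Counter()

    def increment(self, name, **labels):
        # Flatten labels into the key, e.g. "shadow.outcome{result=approve}"
        label_str = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
        self.counters[f"{name}{{{label_str}}}"] += 1


# Emitting one counter per shadow assessment lets a dashboard plot
# approve/decline rates for the new system before it affects anyone.
metrics = ShadowMetrics()
metrics.increment("shadow.outcome", result="approve")
metrics.increment("shadow.outcome", result="approve")
metrics.increment("shadow.outcome", result="decline")
```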
Shadow mode was a critical part of our release strategy. Because the new service and the consumer integration were being built independently, we gave ourselves the ability to run the orchestrator in parallel - without affecting real decisions.
Our first release looked like this:
This gave us meaningful signal early, with zero impact on customers.
As an aside: fire-and-forget in PHP was more workaround than feature. We essentially made an HTTP call with a very short timeout. Not elegant, but quick and effective for the purpose.
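The same short-timeout trick, sketched in Python rather than PHP (the URL, payload, and timeout value are illustrative): fire the request, give up almost immediately, and swallow any network error so the caller is never affected.

```python
import socket
import urllib.error
import urllib.request


def fire_and_forget(url: str, payload: bytes, timeout: float = 0.05) -> None:
    """Best-effort POST: a very short timeout plus a blanket catch,
    mirroring the short-timeout workaround described above."""
    req = urllib.request.Request(url, data=payload, method="POST")
    try:
        urllib.request.urlopen(req, timeout=timeout)
    except (urllib.error.URLError, socket.timeout, OSError):
        pass  # we only care that the request was attempted


# Returns quickly whether or not anything is listening on the other end.
result = fire_and_forget("http://127.0.0.1:9/shadow-assess", b'{"order_id": 123}')
```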
Shadow mode showed us that the new orchestrator was technically stable. Now we needed to show it was logically aligned with the existing system.
We didn’t need a complex real-time comparison engine. Instead, we used a very pragmatic approach: we leveraged the existing flow as a gate.
Here’s the idea:
The gate:
The consumer ran its checks in a sequence. By placing the new risk_assessment check at the end of that sequence (still in shadow mode), we could assume that if an order reached this point, existing checks had already deemed it acceptable under the old model.
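A minimal sketch of that ordering (check names and fields are invented for illustration): the legacy checks gate the decision, and the shadowed risk_assessment runs last, recording its verdict without ever affecting the outcome.

```python
def check_stock(order):
    # Existing check (illustrative)
    return order["in_stock"]


def check_credit(order):
    # Existing check (illustrative)
    return order["credit_ok"]


def shadow_risk_assessment(order, log):
    """New check, appended last and run in shadow mode: its result is
    recorded but never influences the decision."""
    outcome = "approve" if order["risk_score"] < 50 else "decline"
    # Reaching this point means every legacy check already passed, so
    # the expected outcome under the old model is "approve".
    log.append({"order": order["id"], "shadow": outcome, "expected": "approve"})
    return True  # shadow mode: always pass


def run_checks(order, log):
    for check in (check_stock, check_credit):  # legacy sequence, in order
        if not check(order):
            return "decline"
    shadow_risk_assessment(order, log)  # placed at the end, shadow only
    return "approve"


log = []
decision = run_checks(
    {"id": 1, "in_stock": True, "credit_ok": True, "risk_score": 80}, log
)
```

Here the shadow check disagrees (it would decline), but the customer-facing decision is still "approve" - the disagreement only lands in the log.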
The signal:
If the new orchestrator produced a different outcome than expected, that discrepancy became a direct pointer to logic differences we needed to investigate.
The query:
Any unexpected failures in the new check were treated as actionable signals for debugging and refinement.
We ran simple queries to surface these mismatches. Each result represented a valuable test case - a chance to understand where behaviour differed and to align the new logic with established expectations. We iterated until discrepancies dropped to zero.
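The queries themselves were nothing clever. In spirit, they amounted to filtering the shadow log for disagreements, something like this (log shape is illustrative):

```python
def find_mismatches(shadow_log):
    """Surface entries where the shadow outcome disagreed with the
    outcome implied by the legacy checks."""
    return [e for e in shadow_log if e["shadow"] != e["expected"]]


shadow_log = [
    {"order": 1, "shadow": "approve", "expected": "approve"},
    {"order": 2, "shadow": "decline", "expected": "approve"},  # a discrepancy
    {"order": 3, "shadow": "approve", "expected": "approve"},
]
mismatches = find_mismatches(shadow_log)
```

Each entry in `mismatches` is a concrete order to investigate - an input where the new logic diverged from established behaviour.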
One benefit of the existing setup was configurability - the consumer could enable or disable checks per customer. This gave us a natural platform for an incremental rollout.
Beforehand, we collaborated with stakeholders to select a set of customers who would give us broad, representative coverage - across markets, partners, and configurations. These needed to be:
We created a rollout plan summarising which criteria each customer fulfilled. Then, customer by customer, we enabled the new risk_assessment check for them while disabling the legacy ones.
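Conceptually, each rollout step was a small, reversible config flip per customer. A sketch of that toggle, assuming a hypothetical per-customer config shape (the field names are invented):

```python
# Per-customer check configuration (shape is illustrative).
config = {
    "customer-a": {"legacy_checks": True, "risk_assessment": False},
    "customer-b": {"legacy_checks": True, "risk_assessment": False},
}


def enable_new_check(customer_id, config):
    """Move one customer onto the new risk_assessment check and
    disable the legacy ones, leaving every other customer untouched."""
    config[customer_id]["risk_assessment"] = True
    config[customer_id]["legacy_checks"] = False


enable_new_check("customer-a", config)
```

Because the flip is scoped to a single customer, the blast radius of any problem is bounded, and reverting is the same one-line change in reverse.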
During each step, we monitored:
When everything looked stable, we moved to the next set.
For every customer, we had a rollback plan ready.
A simple CLI command could revert the configuration instantly by disabling the new assessment and restoring the legacy checks.
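In outline, the rollback command looked something like this sketch (argument names and config fields are hypothetical; the real command wrote to our actual config store rather than an in-memory dict):

```python
import argparse


def rollback(customer_id, config):
    """Disable the new assessment and restore the legacy checks."""
    config[customer_id]["risk_assessment"] = False
    config[customer_id]["legacy_checks"] = True
    return config[customer_id]


def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Revert a customer to the legacy checks"
    )
    parser.add_argument("customer_id")
    args = parser.parse_args(argv)
    # Stand-in for reading the customer's current (migrated) config.
    config = {args.customer_id: {"risk_assessment": True, "legacy_checks": False}}
    return rollback(args.customer_id, config)


result = main(["customer-a"])
```

The key property is that the revert is a single, well-rehearsed command - no deploy, no data migration, no waiting.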
Absolutely. Early in the rollout, we encountered a few issues. When a problem arose, we rolled back immediately, investigated, patched, and tried again.
For transparency and auditability, we kept a migration log with:
This was a useful document, and one we plan to revisit to help prevent similar issues in the future.
We waited. Dashboards stayed green. Discrepancies dropped to zero. The rollout continued until 100% of assessments flowed through the new orchestrator.
The migration was successful. The engine swap was complete.
But the project wasn’t truly finished. With the new system live, the most important questions still remained:
In the final post of this series, I’ll share our retrospective: the data, the lessons, and what happens next.