This post is part of the Domain Orchestrator series.
We’d established the issue and had a solution ready to go - but here’s the kicker: building the solution was the easy part.
The existing system was live, business-critical, and functioning reliably. A “big bang” release - flipping everything over at once - was completely off the table. There was simply too much value flowing through this path to take unnecessary risks.
Almost every transaction that mattered to the business passed through this functionality, so we had to be extremely careful to avoid any customer or partner impact. The analogy in my head was trying to change the engine of a car while driving at full speed. We couldn’t stop, we couldn’t break down, and ideally, we couldn’t even slow down.
So the core questions became:
To manage this, we didn’t just create a task list - we created a phased migration playbook, where each phase increased confidence before we moved to the next.
And at every step, we had a rollback plan ready.
The first step in many migrations is to baseline the old system. For us, this wasn’t easily possible.
The “smart consumer” approach meant logic and behaviour were distributed across both the consumer and several downstream services. Getting a unified view of how all the pieces behaved together wasn’t straightforward - which was part of what motivated this initiative in the first place.
So instead of baselining the old world, Phase 1 focused on building the cockpit for the new world.
Before sending any real traffic, we set up observability for the new system: instrumentation, dashboards, and the ability to emit metrics from the new codebase. We created visualisations for:
This dashboard became our “eyes” for the first real technical phase: Shadow Mode.
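To make the instrumentation concrete, here is a minimal sketch of the kind of counters we mean - an in-process stand-in for a real metrics client (StatsD, Prometheus, etc.); the metric and label names are hypothetical, not our actual dashboard fields:

```python
from collections import Counter


class ShadowMetrics:
    """Tiny in-process metric counter; a stand-in for a real metrics
    client. Each (name, labels) pair becomes one counter key."""

    def __init__(self):
        self.counters = Counter()

    def increment(self, name, **labels):
        # Flatten labels into the key, e.g. "shadow.outcome{result=approve}"
        label_str = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
        self.counters[f"{name}{{{label_str}}}"] += 1


# Emitting one counter per shadow assessment lets a dashboard plot
# approve/decline rates for the new system before it affects anyone.
metrics = ShadowMetrics()
metrics.increment("shadow.outcome", result="approve")
metrics.increment("shadow.outcome", result="approve")
metrics.increment("shadow.outcome", result="decline")
```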
Shadow mode was a critical part of our release strategy. Because the new service and the consumer integration were being built independently, we gave ourselves the ability to run the orchestrator in parallel - without affecting real decisions.
Our first release looked like this:
This gave us meaningful signal early, with zero impact on customers.
As an aside: fire-and-forget in PHP was more workaround than feature. We essentially made an HTTP call with a very short timeout. Not elegant, but quick and effective for the purpose.
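The same short-timeout trick, sketched in Python rather than PHP (the URL, payload, and timeout value are illustrative): fire the request, give up almost immediately, and swallow any network error so the caller is never affected.

```python
import socket
import urllib.error
import urllib.request


def fire_and_forget(url: str, payload: bytes, timeout: float = 0.05) -> None:
    """Best-effort POST: a very short timeout plus a blanket catch,
    mirroring the short-timeout workaround described above."""
    req = urllib.request.Request(url, data=payload, method="POST")
    try:
        urllib.request.urlopen(req, timeout=timeout)
    except (urllib.error.URLError, socket.timeout, OSError):
        pass  # we only care that the request was attempted


# Returns quickly whether or not anything is listening on the other end.
result = fire_and_forget("http://127.0.0.1:9/shadow-assess", b'{"order_id": 123}')
```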
Shadow mode showed us that the new orchestrator was technically stable. Now we needed to show it was logically aligned with the existing system.
We didn’t need a complex real-time comparison engine. Instead, we used a very pragmatic approach: we leveraged the existing flow as a gate.
Here’s the idea:
The gate:
The consumer ran its checks in a sequence. By placing the new risk_assessment check at the end of that sequence (still in shadow mode), we could assume that if an order reached this point, existing checks had already deemed it acceptable under the old model.
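A minimal sketch of that ordering (check names and fields are invented for illustration): the legacy checks gate the decision, and the shadowed risk_assessment runs last, recording its verdict without ever affecting the outcome.

```python
def check_stock(order):
    # Existing check (illustrative)
    return order["in_stock"]


def check_credit(order):
    # Existing check (illustrative)
    return order["credit_ok"]


def shadow_risk_assessment(order, log):
    """New check, appended last and run in shadow mode: its result is
    recorded but never influences the decision."""
    outcome = "approve" if order["risk_score"] < 50 else "decline"
    # Reaching this point means every legacy check already passed, so
    # the expected outcome under the old model is "approve".
    log.append({"order": order["id"], "shadow": outcome, "expected": "approve"})
    return True  # shadow mode: always pass


def run_checks(order, log):
    for check in (check_stock, check_credit):  # legacy sequence, in order
        if not check(order):
            return "decline"
    shadow_risk_assessment(order, log)  # placed at the end, shadow only
    return "approve"


log = []
decision = run_checks(
    {"id": 1, "in_stock": True, "credit_ok": True, "risk_score": 80}, log
)
```

Here the shadow check disagrees (it would decline), but the customer-facing decision is still "approve" - the disagreement only lands in the log.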
The signal:
If the new orchestrator produced a different outcome than expected, that discrepancy became a direct pointer to logic differences we needed to investigate.
The query:
Any unexpected failures in the new check were treated as actionable signals for debugging and refinement.
We ran simple queries to surface these mismatches. Each result represented a valuable test case - a chance to understand where behaviour differed and to align the new logic with established expectations. We iterated until discrepancies dropped to zero.
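The queries themselves were nothing clever. In spirit, they amounted to filtering the shadow log for disagreements, something like this (log shape is illustrative):

```python
def find_mismatches(shadow_log):
    """Surface entries where the shadow outcome disagreed with the
    outcome implied by the legacy checks."""
    return [e for e in shadow_log if e["shadow"] != e["expected"]]


shadow_log = [
    {"order": 1, "shadow": "approve", "expected": "approve"},
    {"order": 2, "shadow": "decline", "expected": "approve"},  # a discrepancy
    {"order": 3, "shadow": "approve", "expected": "approve"},
]
mismatches = find_mismatches(shadow_log)
```

Each entry in `mismatches` is a concrete order to investigate - an input where the new logic diverged from established behaviour.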
One benefit of the existing setup was configurability - the consumer could enable or disable checks per customer. This gave us a natural platform for an incremental rollout.
Beforehand, we collaborated with stakeholders to select a set of customers who would give us broad, representative coverage - across markets, partners, and configurations. These needed to be:
We created a rollout plan summarising which criteria each customer fulfilled. Then, customer by customer, we enabled the new risk_assessment check for them while disabling the legacy ones.
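Conceptually, each rollout step was a small, reversible config flip per customer. A sketch of that toggle, assuming a hypothetical per-customer config shape (the field names are invented):

```python
# Per-customer check configuration (shape is illustrative).
config = {
    "customer-a": {"legacy_checks": True, "risk_assessment": False},
    "customer-b": {"legacy_checks": True, "risk_assessment": False},
}


def enable_new_check(customer_id, config):
    """Move one customer onto the new risk_assessment check and
    disable the legacy ones, leaving every other customer untouched."""
    config[customer_id]["risk_assessment"] = True
    config[customer_id]["legacy_checks"] = False


enable_new_check("customer-a", config)
```

Because the flip is scoped to a single customer, the blast radius of any problem is bounded, and reverting is the same one-line change in reverse.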
During each step, we monitored:
When everything looked stable, we moved to the next set.
For every customer, we had a rollback plan ready.
A simple CLI command could revert the configuration instantly by disabling the new assessment and restoring the legacy checks.
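In outline, the rollback command looked something like this sketch (argument names and config fields are hypothetical; the real command wrote to our actual config store rather than an in-memory dict):

```python
import argparse


def rollback(customer_id, config):
    """Disable the new assessment and restore the legacy checks."""
    config[customer_id]["risk_assessment"] = False
    config[customer_id]["legacy_checks"] = True
    return config[customer_id]


def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Revert a customer to the legacy checks"
    )
    parser.add_argument("customer_id")
    args = parser.parse_args(argv)
    # Stand-in for reading the customer's current (migrated) config.
    config = {args.customer_id: {"risk_assessment": True, "legacy_checks": False}}
    return rollback(args.customer_id, config)


result = main(["customer-a"])
```

The key property is that the revert is a single, well-rehearsed command - no deploy, no data migration, no waiting.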
Absolutely. Early in the rollout, we encountered a few issues. When a problem arose, we rolled back immediately, investigated, patched, and tried again.
For transparency and auditability, we kept a migration log with:
This was a useful document, and one we plan to revisit to help prevent similar issues in the future.
We waited. Dashboards stayed green. Discrepancies dropped to zero. The rollout continued until 100% of assessments flowed through the new orchestrator.
The migration was successful. The engine swap was complete.
But the project wasn’t truly finished. With the new system live, the most important questions still remained:
In the final post of this series, I’ll share our retrospective: the data, the lessons, and what happens next.