Domain Orchestrator - A retrospective

This post is part of the Domain Orchestrator series:

What went well

It’s easy to be harsh on yourself about things that didn’t go exactly to plan, but looking at the bigger picture, we executed a highly complex project with remarkably few issues.

So first, let’s start with the positives:

  • Planning is good: So many times in the past I’ve jumped into projects code-first, making decisions on the fly. Given the scale of this project, I’m extremely happy we took the time to plan properly. Collaboratively modelling the domain, documenting the release plan, and selecting the pilot customers kept everyone on the same wavelength - and it even helped generate the code!
  • Communication is key: Despite the size of this project and the number of stakeholders, we had remarkably few meetings - in fact, just one per week with around 4-5 people. We used Gemini to summarise the meeting, took our own notes in Confluence, and shared them in a dedicated Slack channel. We also posted regular updates in this channel, creating an “opt-in” system: if you cared about progress, you joined and read the updates; if not, you didn’t have to. People really appreciated having so few meetings while still feeling fully informed - a big win for everyone.
  • Observability: This seems obvious, but it’s often taken for granted. As an engineer, when you’re in the zone writing good, well-tested code, you can begin to think, “there’s no way this can go wrong!” - but we all know that’s not true. Solid, meaningful logging, coupled with OpenTelemetry, custom metrics, and insightful dashboards, gave us massive confidence across all environments. I will be taking this level of observability forward into future projects for sure - there’s a small sketch of the kind of instrumentation I mean just after this list.
  • Isolation: Being able to build our new service - and our new integration - in isolation was extremely powerful. If you have the option, I would recommend it every time. We had our own methods for doing this, but you can, for example, push a message onto a queue when you trigger the existing functionality, and then have a worker consume the queue to execute the new functionality in the background - see the second sketch after this list.
  • Gradual rollout: I suppose “canary release” is the technically correct term, but regardless: if you can facilitate rolling out functionality to a subset of users to measure its performance, it’s a no-brainer. If you have any sort of traffic, “big bang” releases are extremely risky and seldom work. No matter how much you test, it’s very difficult to simulate real user behaviour and production traffic.
  • Rollback mechanisms: At every stage of this process, we had a way to roll back. When we began rolling out merchants one by one, we relied heavily on this. If your work is on a critical path, always ask “how do we roll back?”. Can you revert a deployment quickly? Can you toggle a feature flag? Can you execute a database command? Always strive to have a fast path to recovery.
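
As a flavour of the observability point above, here’s a minimal sketch of the kind of instrumentation I mean, using the OpenTelemetry Python API. The span name, metric name, and the run_checks/assess functions are illustrative stand-ins rather than our real service code:

```python
from opentelemetry import metrics, trace

tracer = trace.get_tracer("risk-orchestrator")
meter = metrics.get_meter("risk-orchestrator")

# Custom metric: how many assessments we process, labelled by outcome.
assessment_counter = meter.create_counter(
    "risk.assessments",
    description="Risk assessments processed, by outcome",
)

def run_checks(payload: dict) -> str:
    # Placeholder for the real provider calls.
    return "approved" if payload.get("amount", 0) < 1000 else "review"

def assess(merchant_id: str, payload: dict) -> str:
    # One span per assessment, so traces, logs, and metrics all line up.
    with tracer.start_as_current_span("risk.assess") as span:
        span.set_attribute("merchant.id", merchant_id)
        outcome = run_checks(payload)
        assessment_counter.add(1, {"outcome": outcome})
        return outcome
```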
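
And here’s the isolation pattern from the list above in its simplest form: the existing path stays in charge of the response, while the new path is exercised in the background via a queue. This uses an in-process queue purely for illustration - in a real system the queue would be a proper broker (SQS, RabbitMQ, Kafka, etc.) and the handler/check names are hypothetical:

```python
import queue
import threading

# In-process queue for illustration only; production would use a real broker.
shadow_queue: "queue.Queue[dict]" = queue.Queue()

def legacy_risk_check(order: dict) -> str:
    return "approved"  # stand-in for the existing, battle-tested behaviour

def new_orchestrator_check(order: dict) -> str:
    return "approved"  # stand-in for the new service being built in isolation

def handle_request(order: dict) -> str:
    """The existing path still produces the user-facing result."""
    result = legacy_risk_check(order)
    # Fire-and-forget: hand the same input to the new path in the background.
    shadow_queue.put(order)
    return result

def shadow_worker() -> None:
    """Consumes the queue and exercises the new path without affecting users."""
    while True:
        order = shadow_queue.get()
        try:
            new_orchestrator_check(order)  # compare/log the outcome, never surface it
        finally:
            shadow_queue.task_done()

threading.Thread(target=shadow_worker, daemon=True).start()
```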

What could have been better?

  • Naming things: Everyone knows it - naming things is difficult. Towards the end of the project, I regretted naming this an “orchestrator”. I feel it opens the door for the service to become a dumping ground. Even a simple name like “risk assessment service” would have put up an invisible boundary to protect it from rogue functionality.
  • Investing in the old: As I mentioned in an earlier post, migrations are easier when you can baseline the old system. If we had a New Relic dashboard for the old system to compare against the new one, things would have been much easier. However, due to conflicting priorities and resources, we simply couldn’t do it. It seems counter-intuitive to invest in something that will ultimately be replaced, but putting in a day or two of effort to help compare both systems would have been a worthy trade-off.
  • Expecting the unexpected: Once live, we encountered some unexpected behaviour because the old system had custom functionality we didn’t know about during the build. Some of it put the entire project at risk: if we couldn’t replicate it, we couldn’t go live. Thankfully, we were able to hack and compromise our way around it, but ideally these surprises would have been surfaced up front.
  • Small bang releases: Whilst we had a mechanism to roll out the functionality to select customers, if you have tens of thousands of customers, you can’t keep doing this one by one - at some point, you need to pull the trigger. In hindsight, I would have invested time in a way of releasing to cohorts (e.g., 10% of remaining customers, then 30%, then 50%). Instead of one big bang, we could have had multiple “small bangs”… as dodgy as that sounds! There’s a rough sketch of what that could look like just after this list.
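
For what it’s worth, a cohort mechanism doesn’t need to be complicated. Here’s a sketch of the idea, assuming a deterministic hash of the customer ID and a rollout percentage held in config or a feature-flag store (all names here are hypothetical):

```python
import hashlib

def rollout_bucket(merchant_id: str) -> int:
    """Deterministically map a merchant to a bucket from 0 to 99."""
    digest = hashlib.sha256(merchant_id.encode()).hexdigest()
    return int(digest, 16) % 100

def use_new_orchestrator(merchant_id: str, rollout_percent: int) -> bool:
    """True if this merchant falls inside the current cohort.

    rollout_percent lives in config or a feature flag, so ramping
    10% -> 30% -> 50% -> 100% (or rolling back to 0) is a single change,
    and each merchant's result stays stable between requests.
    """
    return rollout_bucket(merchant_id) < rollout_percent
```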

What’s next?

  • Reap the benefits: We did this for a reason. Now we have a platform where we can introduce new checks independent of our consumers, A/B test new providers, and much more.
  • Clean up: The “smart consumer” still has a lot of code and database records which are now redundant. Removing them will eliminate a significant amount of technical debt and clean up the domain.
  • Optimising for speed: We made a conscious decision to prioritise cost over speed for the initial release (sequential checks), but now that the system is stable, this can be reassessed. We can also freely adjust all of our own downstream service contracts to receive data we already have, rather than each service independently looking it up, further reducing latency. A quick sketch of the sequential-versus-parallel trade-off follows this list.
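
To illustrate that trade-off: sequential checks can short-circuit (the second provider is never called, or paid for, if the first one fails), while parallel checks bring latency down to the slowest provider rather than the sum of all of them. A toy sketch - the check names and timings are made up:

```python
import asyncio

async def fraud_check(order: dict) -> bool:
    await asyncio.sleep(0.2)  # stand-in for a provider call
    return True

async def credit_check(order: dict) -> bool:
    await asyncio.sleep(0.3)  # stand-in for a provider call
    return True

async def assess_sequentially(order: dict) -> bool:
    # Cheaper: the second call is skipped entirely if the first one fails.
    return await fraud_check(order) and await credit_check(order)

async def assess_in_parallel(order: dict) -> bool:
    # Faster: total latency is the slowest check, not the sum of them all.
    results = await asyncio.gather(fraud_check(order), credit_check(order))
    return all(results)

# asyncio.run(assess_in_parallel({"order_id": "123"}))
```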

Conclusion: Architecture is about enabling teams

Looking back at this journey - from the frustrations of the “Smart Consumer” to the successful rollout of the Risk Orchestrator - the biggest lesson hasn’t been about Kubernetes pods or idempotency keys. It’s been about boundaries.

The “Smart Consumer” pattern served us well in the early days. It was simple and fast to build. But as we scaled, it became a cage that restricted both teams involved. By moving to the Orchestrator, we didn’t just refactor code; we redefined the boundaries between our teams.

We gave the consumer team simplicity and our team autonomy. I think that is one of the ultimate goals of software architecture: not to draw perfect boxes on a whiteboard, but to enable teams to move fast, independently, and with confidence.

The engine swap is done. The car is still moving. And for the first time in a long time, we’re excited to see just how fast it can go, and where it can take us.

Thanks for following along with this series! If you missed any part of the journey, you can catch up here: