Why Service Architectures Should Focus on Workflows
Since this week saw the 15th anniversary of this blog, I thought I’d celebrate by actually writing a post, and a technical one at that.
Recently I’ve been musing over the way we tend to architect distributed systems. Almost certainly none of what follows is original, but like most of my writing it’s designed to help me understand and codify a set of thoughts that have been buzzing around for a while.
If we practise domain-driven design then it’s important to remember that the domain model is not the system. Rather, it’s a tool to help us understand our system better, a map of the territory if you like. It helps us navigate and use our system correctly but it’s insufficient by itself. It’s called a model because it makes simplifying assumptions about the world. It tells us how the world might work if we didn’t have to worry about things that are hard to control like latency, bandwidth, communication failures, resource limits, concurrency and trust.
As a whole, the domain model encapsulates the constraints and business rules that apply to our system. We represent real world entities by objects that encapsulate state and manipulate it via methods. A well normalised system will ensure that objects have a single responsibility with the minimal interface required to fulfil it. Interacting with the system quite often requires the orchestration of many objects, each handling their own unique part of the work. Workflows such as taking orders, paying staff or shipping goods are accomplished by arranging the appropriate objects and invoking methods to achieve the final desired system state according to the constraints defined by the relationships between objects.
The first place we usually encounter the limitation of our domain model is when we start adding persistence. Aside from the well-known object/relational impedance mismatch we also need to introduce new abstractions to deal with the location of the backend database: batching up reads and writes for efficiency, adding transactions for integrity and connection management for reliability.
We are compensating for the deficiencies in our domain model but importantly we are not invalidating it. The constraints and rules still apply: customers have orders which have line items etc. Now, however, we are having to interface our simple model to some of the complexities of the real world. If the database were in the same process space as the domain model then we wouldn’t be as concerned about the impact of interacting with it. However, most databases are physically remote, sit behind a fallible network link with huge latency compared to main memory, and have multiple clients modifying data concurrently.
Another place we encounter limitations, and the main point of this post, is when we want to scale out our system to increase capacity. These days that implies we want to distribute the system over several interacting services that can be scaled as necessary. The temptation here is to simply follow the domain model constraints and distribute the system by entity type. Following this we may end up with services for customers, orders and products each encapsulating its data and exposing actions over it. They interact with the database independently to ensure data integrity and can be scaled out with additional instances as necessary as system load grows.
Placing an order might be orchestrated by the order service that gets the customer details from the customer service, product information from the product service and uses them to create the order. Along the way it might also verify whether the customer is good for credit or needs to pay immediately, check stock levels, look up shipping costs and trigger activities in fulfilment and invoicing services.
On the surface this seems reasonable, after all these steps have to happen somewhere, but digging deeper we can expose more complexity. The customer service will probably need to get data from a database, perhaps calling into an organisation service for corporate customers to get billing and shipping addresses. The product service also looks up data in the database but we only need basic information such as description and package quantities so getting a full product entity is likely to be overkill for our needs. Creating an order requires locking, persistence of line items and the order itself, and... you get the idea: a lot is happening and it’s happening over the network.
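To get a feel for how quickly the hops add up, here is a rough sketch that simply counts the remote calls made while placing an order through entity-based services. Everything in it is an assumption for illustration: the `Network` counter, the service and operation names, and the idea that stock is checked once per line item are all hypothetical, and a real system would differ in the details.

```python
from dataclasses import dataclass, field


@dataclass
class Network:
    """Records simulated remote calls; a stand-in for real network hops."""
    calls: list = field(default_factory=list)

    def call(self, service: str, operation: str) -> None:
        self.calls.append(f"{service}.{operation}")


def place_order_via_entity_services(net: Network, corporate: bool, line_items: int) -> int:
    """Walk the ordering steps and return how many remote calls were made."""
    net.call("customer", "get_customer")
    if corporate:
        # Corporate customers need billing and shipping addresses as well.
        net.call("organisation", "get_addresses")
    net.call("customer", "check_credit")
    for _ in range(line_items):
        net.call("product", "get_product")  # full entity, though we need little of it
        net.call("stock", "check_level")
    net.call("shipping", "lookup_cost")
    net.call("database", "insert_order")
    net.call("fulfilment", "trigger")
    net.call("invoicing", "trigger")
    return len(net.calls)


# A corporate customer ordering three different products.
net = Network()
print(place_order_via_entity_services(net, corporate=True, line_items=3))  # 13
```

Thirteen round trips for a modest order, and that is before counting the database calls each of those services makes on its own behalf.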
This scale out approach carries many of the same problems we encounter with adding persistence to our domain model:
- Latency - the time taken to traverse the network for each service call is at least two orders of magnitude larger than calling an in-process method. Add in disk seeks and it's another order of magnitude slower. The domain model simplifies the world by assuming these calls are free but in reality they can compound up to crippling degrees of slowdown.
- Bandwidth - retrieving a customer's details or a product spec can result in a lot of data being transferred unnecessarily when we only need a name or quantity. Again the domain model assumes data is referenced in-place or that bandwidth is infinite. Introducing alternate service calls for summaries and subsets of the data can mitigate this problem but wasting bandwidth limits scalability of the system as a whole.
- Reliability - we have to assume that every out of process request can fail unexpectedly. The possible causes are nearly infinite: the network link may be down, the server could be out of sockets, routing may be faulty or someone switched off the destination server. The domain model simplified the world by assuming every method call will succeed. A typical workaround for this is to introduce timeouts and automatic retries which can lead to pathological situations where every timeout cascades up into multiple retries at all levels tying up resources and limiting throughput.
- Fragility - because the service model mirrors the domain model, services will participate in multiple workflows and any failure will affect them all. For example a customer service is likely to be used during ordering, shipping, invoicing and payment. If the customer service is unavailable then all those workflows break at the same time. Not only that but each of these workflows depends on multiple services being available.
- Scaling - because lots of services are involved in a single workflow, scaling capacity for one workflow means every service used has to be scaled as well. Since the services may be invoked multiple times during a single interaction, they have to be scaled faster than the invoking service. This adds unnecessary costs to scaling. If we want to take more payments then we need to scale out a bunch of services even if they are only supplying a single value in our workflow.
- Hotspots - most domain models have a few central objects such as customers or companies that are involved in large numbers of workflows. When these are encapsulated as services they become hotspots and bottlenecks for performance, degrading multiple workflows simultaneously.
- Evolvability - all systems evolve with business requirements but service architectures that follow the domain model too closely become difficult to evolve. A change to a workflow such as adding a new required piece of information can lead to a cascade of interface changes between services to enable them to pass it through to dependent services. Conversely, removing a workflow means removing functionality from services and subsequently retesting and redeploying large numbers of services. Given the fragility of the system, this can cause problems in unrelated workflows simply because one service is malfunctioning.
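Two of the points above, reliability and fragility, lend themselves to some back-of-the-envelope arithmetic. The sketch below is only that: it assumes every level of a call chain makes the same number of attempts and that services fail independently, neither of which is exactly true in practice, and the function names are my own.

```python
def worst_case_attempts(attempts_per_level: int, depth: int) -> int:
    """Upper bound on calls reaching the deepest service when each of
    `depth` levels makes `attempts_per_level` attempts (the first try
    plus retries) and every attempt times out."""
    return attempts_per_level ** depth


def workflow_availability(service_availability: float, services: int) -> float:
    """Chance that every service a workflow depends on is up at once,
    assuming each fails independently of the others."""
    return service_availability ** services


# Three attempts at each of three levels: one user request can become
# 27 calls against the service at the bottom of the chain.
print(worst_case_attempts(3, 3))  # 27

# Five services, each available 99.9% of the time.
print(workflow_availability(0.999, 5))
```

With those numbers the workflow as a whole is available roughly 99.5% of the time, noticeably worse than any single service it depends on.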
All of these problems can be mitigated with a different approach to service design. Instead of carving up the domain by entity, focus on the workflows.
In this approach each important function becomes a separate service that can be scaled independently and has minimal dependencies. Using the example earlier we would design a “place order” service that is responsible solely for placing an order. It performs the same tasks as before but instead of calling entity-based services it uses the domain model objects directly to carry out its work, connecting to the database as necessary to select just the information it requires to create the order.
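As a minimal sketch of what such a “place order” service might look like, the example below composes two small domain objects in-process and goes to the database only for the columns it needs. The schema, the `Order` and `OrderLine` classes and the use of an in-memory SQLite database are all assumptions made to keep the example self-contained; they are not a prescription.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class OrderLine:
    sku: str
    quantity: int
    unit_price_pence: int

    def total(self) -> int:
        return self.quantity * self.unit_price_pence


@dataclass
class Order:
    customer_name: str
    lines: list

    def total(self) -> int:
        return sum(line.total() for line in self.lines)


def make_demo_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with a hypothetical schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, billing_address TEXT);
        CREATE TABLE products (sku TEXT PRIMARY KEY, description TEXT, unit_price_pence INTEGER);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total_pence INTEGER);
        INSERT INTO customers VALUES (1, 'Ada', '1 Example Street');
        INSERT INTO products VALUES ('WIDGET', 'A widget', 250);
    """)
    return db


def place_order(db: sqlite3.Connection, customer_id: int, items: dict) -> Order:
    """Build an order in-process, selecting only the columns this workflow needs."""
    row = db.execute("SELECT name FROM customers WHERE id = ?", (customer_id,)).fetchone()
    if row is None:
        raise LookupError(f"unknown customer {customer_id}")
    lines = []
    for sku, quantity in items.items():
        product = db.execute(
            "SELECT unit_price_pence FROM products WHERE sku = ?", (sku,)
        ).fetchone()
        if product is None:
            raise LookupError(f"unknown product {sku}")
        lines.append(OrderLine(sku, quantity, product[0]))
    order = Order(row[0], lines)
    with db:  # commits on success, rolls back if an exception escapes
        db.execute(
            "INSERT INTO orders (customer_id, total_pence) VALUES (?, ?)",
            (customer_id, order.total()),
        )
    return order


db = make_demo_db()
print(place_order(db, 1, {"WIDGET": 4}).total())  # 1000
```

The business rules (how a line or an order is totalled) stay in the domain objects; the service only decides which objects to compose and which data to fetch for them.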
Communication between workflows is performed asynchronously by sharing state in the database or publishing an event to a message queue. For example once the order has been placed, events are fired that will kick off invoicing and fulfilment workflows.
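The hand-off might look something like the sketch below. It uses in-process `queue.Queue` objects as a stand-in for a real message queue, and the `EventBus`, `OrderPlaced` and `drain` names are hypothetical; a real broker would add durability, acknowledgements and delivery guarantees that this toy ignores.

```python
import queue
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderPlaced:
    order_id: int
    total_pence: int


class EventBus:
    """Toy publish/subscribe: each subscribing workflow gets its own queue."""

    def __init__(self) -> None:
        self._subscribers = {}

    def subscribe(self, workflow: str) -> queue.Queue:
        q = queue.Queue()
        self._subscribers[workflow] = q
        return q

    def publish(self, event) -> None:
        # The publisher does not wait for, or know about, the consumers.
        for q in self._subscribers.values():
            q.put(event)


def drain(q: queue.Queue) -> list:
    """Collect whatever events are currently waiting for a workflow."""
    events = []
    while True:
        try:
            events.append(q.get_nowait())
        except queue.Empty:
            return events


bus = EventBus()
invoicing = bus.subscribe("invoicing")
fulfilment = bus.subscribe("fulfilment")

# The ordering workflow finishes its work and announces the fact.
bus.publish(OrderPlaced(order_id=7, total_pence=1000))

# Each downstream workflow picks the event up in its own time.
print(drain(invoicing))
print(drain(fulfilment))
```

If invoicing is down, its queue simply fills up until it returns, while ordering and fulfilment carry on unaffected.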
This brings a number of immediate advantages:
- Deploying or retiring a workflow becomes as simple as switching a service on or off which leads to greater freedom to experiment.
- Scaling a workflow is limited to scaling a single service horizontally and the costs of doing this can be cleanly evaluated.
- The system as a whole becomes much more robust. When a service encounters problems it is limited to a single workflow such as issuing invoices. Other workflows can continue to operate independently.
- Latency, bandwidth use and reliability are all improved because there are fewer network calls. The service still relies on the database and other support systems such as lock servers, but most of the data flow is controlled in-process.
- The unit of testing and deployment is a single service which reduces the complexity and cost of maintenance.
It’s important to recognise the difference between the organisation of code and its deployment. We logically separate the objects we use to implement the domain model using SOLID practices that help us maintain, evolve and reason about our code. But we don’t need to be constrained in that way when it comes to deployment as a service. The service uses the domain model objects but it doesn’t need to own them - it just composes the objects it needs into the patterns required for its workflow. Data is still encapsulated by each domain object and their interactions are still well defined.
In this respect the domain model actually becomes more important because it defines the reusable business rules and logic that will be used to create services for each workflow.
So we should keep the domain model but be aware that it’s only a map to help us navigate our system and tell us how things should work under ideal conditions. As we interface that to the real world we need to introduce abstractions and aggregations tuned to the messy realities of networks, resource constraints and concurrency. To improve reliability, throughput and scaling economics we need to focus on composing domain objects into services that implement distinct workflows through our system and arrange our architecture to minimise dependencies between services.