A workflow engine becomes essential when dealing with hundreds of microservices, all interconnected in various ways; when a bug arises, pinpointing it can become a nightmare.
This is where a workflow engine comes in. In this context, a workflow engine acts as a 'hub' that sends requests to microservices and processes their outputs, which mitigates the issues that arise when microservices are wired directly to each other.
Creating a resilient workflow engine is challenging. There are two main approaches: durable workflows and 'Fail-Quick' workflows.
For my purposes, I chose the 'Fail-Quick' method. Durable workflows have key drawbacks that made them unsuitable for my needs, particularly the lack of idempotency.
Idempotency means that a service produces the same result no matter how many times it is run, i.e., a service can be retried safely because repeated calls leave the system in the same state as a single call.
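As a toy illustration with made-up names: the first function below is not idempotent, because every retry grants another license, while the second converges to the same state no matter how many times it runs.

licenses: list[tuple[str, str]] = []

def grant_license_naive(user_id: str, license_id: str) -> None:
    licenses.append((user_id, license_id))        # retrying creates a duplicate license

def grant_license_idempotent(user_id: str, license_id: str) -> None:
    if (user_id, license_id) not in licenses:     # retrying is a harmless no-op
        licenses.append((user_id, license_id))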
The basic architecture of the system was set up with rollbacks in case of failure. During my initial research, I came across the SAGA pattern, which closely resembles the system I was planning to build. Although it is mainly used for coordinating large-scale database transactions, its core ideas apply here.
I decided to use Kafka as the message broker because it is efficient under large loads. In Kafka, I created three topics, each with its JSON format defined using Python's pydantic module:
tomanager: Messages from microservices to the manager.
from enum import Enum
from typing import Any
from uuid import UUID

from pydantic import BaseModel

class ToManager(BaseModel):
    uuid: UUID
    name: str
    version: str  # Discussed below
    service: str
    function: str
    status: Status
    schema_version: str  # Helps in future proofing
    data: Any
where Status corresponds to the following enum,
class Status(Enum):
    SUCCESS = "success"
    IN_PROGRESS = "inprogress"
    FAILURE = "failure"
The inclusion of a version field lets us edit workflows without fear. When a workflow is edited, the new definition is added under a new version: any process already following the old workflow continues normally, while new processes use the latest version.
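A rough sketch of this version pinning, with illustrative names rather than the actual implementation:

workflows = {}  # workflow name -> version -> workflow definition

def register(name, version, definition):
    # An edited workflow is stored under a new version instead of overwriting the old one.
    workflows.setdefault(name, {})[version] = definition

def latest_version(name):
    # New processes start against the most recent version...
    return max(workflows[name])  # assumes version strings sort in release order

def definition_for(name, version):
    # ...while in-flight processes keep resolving the exact version recorded in their messages.
    return workflows[name][version]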
fromanager: Messages from the manager to the microservices.
class FromManager(BaseModel):
    uuid: UUID
    name: str
    version: str
    service: str
    function: str
    schema_version: str  # Helps in future proofing
    data: Any
tomanageredits: Messages from the workflow maker to the manager.
class ToManagerEdits(BaseModel):
    name: str
    version: str
    schema_version: str  # Helps in future proofing
    workflow: Any  # Discussed in the next section
Here, the data field is left flexible because a microservice's output is not known in advance. Once received, the output is simply relayed to the 'next step'.
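To make that relay concrete, here is a minimal sketch of how the manager might forward a finished step's output to the next service, assuming the kafka-python client and pydantic v2; the helper name and broker address are illustrative.

from kafka import KafkaProducer  # assumption: kafka-python is the client in use

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # illustrative broker address
    value_serializer=lambda s: s.encode("utf-8"),
)

def relay_to_next_step(msg: ToManager, next_service: str, next_function: str) -> None:
    """Wrap the previous step's output in a FromManager message and publish it."""
    out = FromManager(
        uuid=msg.uuid,
        name=msg.name,
        version=msg.version,
        service=next_service,
        function=next_function,
        schema_version=msg.schema_version,
        data=msg.data,  # previous step's output, passed through untouched
    )
    producer.send("fromanager", out.model_dump_json())  # pydantic v2; use .json() on v1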
To allow users to create workflows, I built a simple graphical node editor using Flume. This lets users describe the workflow intuitively, which is easier than text-based methods.

Once the workflow is drawn, it can be exported to JSON. Flume provides a JSON export that includes detailed node information. For our purposes, we focus on the inter-dependencies.
From the exported workflow, we build a directed acyclic graph (DAG). We also build an 'anti-graph' by reversing all connections; this anti-graph is crucial for the rollback mechanism.
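As a minimal sketch of this step, assuming the Flume export has already been reduced to a flat list of (parent, child) connections (the edge list below is illustrative, not Flume's actual schema):

from collections import defaultdict

# Illustrative edge list; in practice these pairs come from the Flume JSON export.
edges = [
    ("create_user", "add_license"),
    ("add_license", "add_membership"),
]

graph = defaultdict(list)       # step -> steps that depend on it (the DAG)
anti_graph = defaultdict(list)  # step -> steps it depends on (reversed edges, for rollbacks)

for parent, child in edges:
    graph[parent].append(child)
    anti_graph[child].append(parent)

# graph      == {"create_user": ["add_license"], "add_license": ["add_membership"]}
# anti_graph == {"add_license": ["create_user"], "add_membership": ["add_license"]}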
Consider a scenario where our workflow includes create_user, add_license, and add_membership. If add_membership fails, we start the rollback by calling the rollback functions of add_license, followed by create_user.

A rollback function can involve multiple steps. For example, the rollback for add_license might include remove_license and log_removal. Our system keeps track of these functions and calls them in parallel when needed:
services_function_negation_list = {
    "user": {
        "create_user": ["delete_user"],
    },
    "license": {
        "add_license": ["remove_license", "log_removal"],
    },
}
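Putting the anti-graph and the negation list together, a rollback walk could look roughly like the sketch below. dispatch_rollback and step_to_service are illustrative stand-ins: in the real system the manager publishes a FromManager message to the relevant microservice, and each node in the workflow definition already records its service.

from concurrent.futures import ThreadPoolExecutor, wait

# Hypothetical mapping from workflow step to its owning service.
step_to_service = {
    "create_user": "user",
    "add_license": "license",
    "add_membership": "membership",
}

def dispatch_rollback(service, function):
    # Placeholder: the real manager would publish a FromManager message on 'fromanager'.
    print(f"rolling back via {service}.{function}")

def rollback_from(failed_step, anti_graph, negations):
    """Walk backwards from the failed step, undoing completed steps level by level.

    This sketch assumes a linear workflow; branched rollbacks (discussed later)
    would additionally need de-duplication of visited steps.
    """
    current = anti_graph.get(failed_step, [])
    with ThreadPoolExecutor() as pool:
        while current:
            # All rollback functions for this level run in parallel.
            futures = [
                pool.submit(dispatch_rollback, step_to_service[step], undo)
                for step in current
                for undo in negations[step_to_service[step]][step]
            ]
            wait(futures)
            # Then move one level further back along the anti-graph.
            current = [prev for step in current for prev in anti_graph.get(step, [])]

# rollback_from("add_membership", anti_graph, services_function_negation_list)
# -> remove_license and log_removal run in parallel, followed by delete_user.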
A single manager instance handling hundreds of requests is a recipe for disaster. To ensure scalability, the manager has two child threads:
One thread consumes ToManager requests, dealing with microservice responses. This thread sits in a shared 'manager' consumer group to avoid duplicate actions.
The other thread consumes ToManagerEdits requests. It is not in a shared group, as edits need to propagate across all managers.
Handling long in-progress events is crucial to maintaining a responsive system. Here are two approaches:
Branched rollbacks can complicate the rollback process. Here are two strategies:
Kafka failures can disrupt the communication flow. To handle this:
Sometimes, a microservice might fail catastrophically and be unable to relay a 'failed' status. To address this:
I am not sure of the actual names for these ;-;