Paul Russell
2004-02-25 23:45:25 UTC
Guys,
Firstly, a caveat: There's going to be a lot of teaching grandma to
suck eggs in this e-mail. I know you guys know most of this, but I'm
deliberately going from 'the top' because it helps me collate my
thoughts, and it also helps when I look back at this e-mail later and
can't remember why I said something!
Also, I'm still new to Werkflow, so I might have misunderstood how bits
of it work, feel free to pick me up on this!
Apologies for the length of this e-mail, it's a bit of a brain-dump.
========
In a separate mail, I've been talking about running business processes
across multiple nodes in a cluster with a view to providing high
availability and to a lesser extent load balancing.
I'm going to describe this by example, rather than in abstract terms.
I've put a diagram of a trivial business process on my website at...
The blue boxes represent logical units of work, i.e. steps we want to
execute as a unit. The implication here is that each blue box executes
atomically. In general, we don't care where each step executes -- it'd
be nice if the load was evenly distributed across the cluster of
servers. More importantly, we don't want to lose an 'in flight'
business process if one of the nodes fails. We need to be sure that we
follow some rules to make sure we don't mess things up:
* Only execute each task once.
* Don't let the process instance/case state get out of sync with the
executing tasks.
* Processes must be forward recoverable if something fails.
* Processes must be able to undo their previous actions if something
terminal happens.
What do we need to make this work? Well, we need:
* Some way to persist the state of a running process at any time.
* Some way to resurrect the state of a running process at any time on
any node.
* Clear separation of activities. The activities mustn't 'run into'
each other.
* Some way to encapsulate 'what to do next' and place the encapsulated
command somewhere safe where it can be executed from anywhere, even if
the node that created it ceases to exist.
* Synchronize the command and the state, so that we only commit the
state changes if we commit the command and vice versa.
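The 'encapsulate what to do next' point above can be sketched in plain
Java. Everything here is illustrative (the class names aren't Werkflow
or JMS APIs); the point is just that a command is dumb, serializable
data, so any node can pick it up after the producer has died:

```java
import java.io.Serializable;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class CommandMessageDemo {

    // A command captures 'what to do next' as plain data, so any node in
    // the cluster can pick it up and run it, even if the node that created
    // it has since ceased to exist.
    public interface Command extends Serializable {
        String execute();
    }

    // Hypothetical command: advance a process instance to a named activity.
    public static class ScheduleActivityCommand implements Command {
        final String processId;
        final String activity;

        ScheduleActivityCommand(String processId, String activity) {
            this.processId = processId;
            this.activity = activity;
        }

        public String execute() {
            return processId + ":" + activity;
        }
    }

    // Stand-in for a durable, cluster-wide JMS queue: the producer enqueues
    // a command and disappears; any consumer on any node can drain it later.
    public static List<String> run() {
        Queue<Command> queue = new ArrayDeque<>();
        queue.add(new ScheduleActivityCommand("case-42", "approve-order"));

        List<String> log = new ArrayList<>();
        while (!queue.isEmpty()) {
            log.add(queue.poll().execute());
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In the real thing the `ArrayDeque` would be a clustered JMS queue and
the command would carry the process instance ID rather than doing any
work itself.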
Helpfully, Werkflow and J2EE provide most of the gubbins we need to do
this:
* Werkflow already has support for pluggable persistence providers, so
connecting to some kind of cluster wide persistent store shouldn't be a
problem.
* J2EE provides JMS, which can be used with the Command Message pattern
to dispatch commands. Helpfully, JMS providers can be clustered, and a
message on a queue is delivered to exactly one consumer.
* J2EE provides Message Driven Beans, which provide an easy way to
consume messages on queues and, more importantly, provide automatic
thread pooling across a cluster, so the container should automatically
load balance messages across all nodes as well as providing transparent
failover.
* J2EE provides transparent (container managed) transaction support,
which means that we can rely on the container to make sure that we
either commit both the process state and the next command message or we
persist neither. (i.e. we make it look like the whole activity never
happened)
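The 'commit both or neither' point can be shown in miniature with a toy
unit-of-work standing in for the container's transaction manager (all
names here are made up; a real container would enlist the database and
the JMS session in one XA transaction):

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class AtomicUnitOfWorkDemo {

    // Toy stand-ins for the durable, cluster-wide resources: a database
    // holding process state, and a JMS queue holding the next command.
    public static final Map<String, String> processState = new HashMap<>();
    public static final Queue<String> commandQueue = new ArrayDeque<>();

    // Buffers both writes and applies them together, or not at all --
    // a crude sketch of container-managed transactions.
    public static class UnitOfWork {
        private String stateKey, stateValue, outgoingCommand;

        void setState(String key, String value) { stateKey = key; stateValue = value; }
        void sendCommand(String command)        { outgoingCommand = command; }

        void commit() {
            processState.put(stateKey, stateValue);
            commandQueue.add(outgoingCommand);
        }

        void rollback() { /* buffered writes are simply discarded */ }
    }

    public static void runActivity(boolean fail) {
        UnitOfWork tx = new UnitOfWork();
        tx.setState("case-42", "awaiting-approval");
        tx.sendCommand("schedule:approve-order");
        if (fail) tx.rollback(); else tx.commit();
    }

    public static void main(String[] args) {
        runActivity(true);   // failure: neither write becomes visible
        System.out.println(processState + " " + commandQueue);
        runActivity(false);  // success: both writes become visible together
        System.out.println(processState + " " + commandQueue);
    }
}
```

After the rolled-back run, both stores are still empty; after the
committed run, the state change and the next command appear together,
which is exactly the 'make it look like the whole activity never
happened' property.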
What's missing, then?
* Werkflow doesn't currently provide JMS support, and neither does it
provide the hooks to plug this support in.
* Werkflow doesn't provide a mechanism for 'undoing' changes that have
been committed already.
Neither of these things strikes me as particularly complicated to
address. We should be able to perform a dependency inversion to
abstract the current Concurrent-based scheduler out into a service,
just like the existing persistence and messaging services. The
compensation services ('undo') could be provided as an external module
that maintains a 'compensation list' of tasks which need to be executed
to undo the results of a process. It may be possible to inject these
tasks into the same scheduler that's running the rest of the workflow.
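A compensation list like the one described might look something like
this (again, a sketch with invented names, since this module doesn't
exist in Werkflow yet). The key design choice is that undo tasks run in
reverse, LIFO order, so the most recently completed activity is undone
first:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class CompensationDemo {

    // One compensating ('undo') task per committed activity.
    public interface CompensationTask {
        String undo();
    }

    // On terminal failure, run the accumulated undo tasks in reverse
    // (LIFO) order, unwinding the process back to its starting state.
    public static List<String> compensate(Deque<CompensationTask> compensationList) {
        List<String> log = new ArrayList<>();
        while (!compensationList.isEmpty()) {
            log.add(compensationList.pop().undo());
        }
        return log;
    }

    public static void main(String[] args) {
        Deque<CompensationTask> list = new ArrayDeque<>();
        // As each activity commits, its undo task is pushed onto the list.
        list.push(() -> "cancel-reservation");  // undo for activity 1
        list.push(() -> "refund-payment");      // undo for activity 2
        System.out.println(compensate(list));   // most recent work undone first
    }
}
```

Injecting each undo task into the normal scheduler, as suggested above,
would get us retry and failover of the compensation run for free.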
The structure I'm describing above looks a little like this, I think:
http://paulrussell.dyndns.org/images/work/workflow/werkflow-with-clustering.png
Everything that supports clustering should be 'bolted on' around
Werkflow as services rather than being part of the core, otherwise
we'll penalize people who don't want silly things like clustering.
At runtime, everything would be controlled by the scheduler (which I
/think/ is pretty much what happens now); a process would start by the
core scheduling a task for the first activity. The scheduler would then
resurrect this 'task' whenever (and wherever) it is ready to do so. The
task would invoke the activity via the core. The last thing the core
would do is schedule the next activity. Control then returns to the
scheduler which commits the transaction and the whole process starts
over. This is illustrated in the following diagram.
http://paulrussell.dyndns.org/images/work/workflow/werkflow-simple-sequence.png
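The scheduler loop described above, including the retry-on-failure
behaviour, can be simulated in a few lines. This is only a single-node
sketch (the re-enqueue on failure stands in for the transaction rolling
back the message 'get'), but it shows the intended semantics:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class SchedulerLoopDemo {

    // One scheduler pass per task: take it from the (durable) queue, run
    // its activity, and schedule the successor -- all notionally in one
    // transaction. If the activity throws, we put the task back, as if the
    // message had never left the queue, and it gets retried later.
    public static List<String> run(Queue<String> tasks, boolean failFirstAttempt) {
        List<String> log = new ArrayList<>();
        boolean failNext = failFirstAttempt;
        while (!tasks.isEmpty()) {
            String task = tasks.poll();          // 'receive' within the tx
            try {
                if (failNext) {
                    failNext = false;
                    throw new RuntimeException("node died mid-activity");
                }
                log.add("ran " + task);
                if (task.equals("activity-1")) {
                    tasks.add("activity-2");     // schedule next, same tx
                }
            } catch (RuntimeException e) {
                tasks.add(task);                 // rollback: message 'never left'
                log.add("retrying " + task);
            }
        }
        return log;
    }

    public static void main(String[] args) {
        Queue<String> tasks = new ArrayDeque<>();
        tasks.add("activity-1");
        // First attempt at activity-1 fails; it is retried, then the
        // process runs to completion.
        System.out.println(run(tasks, true));
    }
}
```

In the clustered version the queue is JMS, the try/catch is the
container's transaction boundary, and 'later' may be on a different
node entirely.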
There are a couple of neat things about this:
* If the activity fails for any reason (the database is down,
connectivity fails, some other kind of failure), then the fact that the
message came from a queue and was read within the scope of a
transaction comes to the rescue -- it's as if the message never left
the queue, it'll simply get picked up again later and retried.
* Equally, if for some reason the 'outgoing' transition can't be
scheduled, the transaction will roll back all the state changes /and/
the original message 'get' so it'll be like the whole unit of work was
never executed. It'll get retried later.
* If one of the nodes in a cluster gets shut down, any process running
on that box will transparently start running on another node (subject
to in-doubt transactions, I guess; it's too late to remember exactly
how 2PC works right now ;)
What do you guys think about this? Am I talking cobblers? If you guys
agree with what I'm talking about here, my next steps would likely be
to start looking in detail at how easy it would be to perform the
dependency inversion on the scheduler. I don't think this should be
/too/ painful, and would be a good way for me to finish learning the
innards of Werkflow!
Let me know what you think!
Paul
--
Paul Russell
***@paulrussell.org