Workflow Mode¶
dr-provision implements a Workflow system to automate the tasks needed to provision and decommission systems. The Workflow system is the result of several other dr-provision components interacting. The rest of this section covers those components in more detail.
The Models¶
Tasks¶
The basic unit of work that dr-provision sequences to drive Machines through a workflow is a Task. Individual Tasks are executed against Machines by creating a Job for them.
Tasks contain individual Templates that are expanded for a Machine whenever a Job is created. Each Template expands to either a script to be executed (if the Path parameter is empty or not present) or a file to be placed on the filesystem at the location obtained by template-expanding the Path parameter (if the Path parameter is not empty).
Jobs¶
A Job tracks the execution history of a Task against a specific Machine. A Job is created every time a Task is executed against a Machine, and each Job keeps track of its own state of execution. The history of what has been executed (including all log output from scripts) is stored as a chain of Jobs, and the exit status of the Job determines what the Machine Agent will do next.
Stages¶
A Stage is used to provide a list of Tasks that should be run on a Machine along with the BootEnv the tasks should be run in.
Workflows¶
A Workflow is used to provide a list of Stages that a Machine should run through to get to a desired end state. When a Workflow is added to a Machine, it renders the BootEnv and Stage for the Machine read-only and replaces the task list on the Machine with one that will step through all the Stages, BootEnvs, and Tasks needed to drive the Machine through the Workflow.
How They Work Together¶
Machine Agent (client side)¶
The Machine Agent runs on the Client and is responsible for executing tasks and rebooting the Machine as needed. It is structured as a finite state machine for increased reliability and auditability. The Machine Agent always starts in the AGENT_INIT state.
AGENT_INIT¶
Initializes the Agent with a fresh copy of the Machine data, marks the current Job for the machine as failed if it is created or running, and creates an event stream that receives events for that Machine from dr-provision. If an error was recorded, the Agent prints it to stderr and then clears it out.
If an error occurs during this, the agent will sleep for a bit and transition back to AGENT_INIT, otherwise it will transition to AGENT_WAIT_FOR_RUNNABLE.
AGENT_WAIT_FOR_RUNNABLE¶
Waits for the Machine to be both Available and Runnable, and for either the machine Context to equal the context the Agent is paying attention to, or for the machine BootEnv to change. Once those conditions are met, the Agent transitions to AGENT_REBOOT if the machine changed BootEnv and the Agent is running in the empty ("") context, to AGENT_EXIT if the Agent received a termination signal, to AGENT_INIT if there was an error waiting for the state change, and to AGENT_RUN_TASK otherwise.
AGENT_RUN_TASK¶
Tries to create a new Job to run on the machine. If there was an error creating the Job, the Agent transitions back to AGENT_INIT. If no Job was created, the Agent transitions to AGENT_CHANGE_STAGE if the Machine does not have a Workflow, and to AGENT_WAIT_FOR_CHANGE_STAGE if it does.
If a Job was created, the Agent attempts to execute all the steps in the Task for which the Job was created, and updates the Job depending on the exit status of the steps.
If there was an error executing the Job, the agent will transition back to AGENT_INIT.
If the Job signalled that a reboot is needed, the Agent transitions to AGENT_REBOOT.
If the Job signalled that the system should be powered off, the Agent transitions to AGENT_POWEROFF.
If the Job signalled that the Agent should stop processing Jobs, the Agent transitions to AGENT_EXIT.
Otherwise, the Agent transitions to AGENT_WAIT_FOR_RUNNABLE.
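The transitions out of AGENT_RUN_TASK can be sketched as a small decision function. This is only an illustration: the function name and its 0/1 flag arguments are hypothetical, and it assumes the conditions are checked in the order they are listed above, which the text does not state explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: maps the outcome of a Job to the Agent's next state.
# Arguments are illustrative flags (0 or 1), checked in the listed order:
#   $1 = error executing the Job   $2 = reboot requested
#   $3 = poweroff requested        $4 = stop requested
next_state_after_job() {
    local job_error=$1 reboot=$2 poweroff=$3 stop=$4
    if   (( job_error )); then echo "AGENT_INIT"
    elif (( reboot ));    then echo "AGENT_REBOOT"
    elif (( poweroff ));  then echo "AGENT_POWEROFF"
    elif (( stop ));      then echo "AGENT_EXIT"
    else                       echo "AGENT_WAIT_FOR_RUNNABLE"
    fi
}

# A Job that ran without error and asked for a reboot.
next_state_after_job 0 1 0 0   # prints AGENT_REBOOT
```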
AGENT_WAIT_FOR_CHANGE_STAGE¶
Waits for the Machine to be Available, and for any of the following fields on the Machine to change:
- CurrentTask
- Tasks
- Runnable
- BootEnv
- Stage
- Context
Once those conditions are met, follows the same rules as AGENT_WAIT_FOR_RUNNABLE.
AGENT_EXIT¶
Causes the Agent to cleanly shut down.
AGENT_REBOOT¶
Reboots the system if running in the empty ("") context, otherwise exits the Agent.
AGENT_POWEROFF¶
Cleanly shuts the system down if running in the empty ("") context, otherwise exits the Agent.
Reboot! Using Agent State Changes in Scripts¶
These functions are implemented in the community content shared template, which you can access by adding {{ template "prelude.tmpl" . }} to your content.
By adding this library, you can call the following functions:
- exit: exit the task normally
- exit_incomplete: exit the job in an incomplete state
- exit_reboot: exit the job and trigger a reboot
- exit_shutdown: exit the job and trigger a system shutdown
- exit_stop: exit the job and stop the Agent
- exit_incomplete_reboot: mark the job as incomplete and reboot
- exit_incomplete_shutdown: mark the job as incomplete and shut down
The above exit functions are implemented by calling exit with various bits set in the exit code:
- exit 0 means the job succeeded
- exit 1 means the job failed
- exit 16 stops the agent
- exit 32 triggers a shutdown
- exit 64 triggers a reboot
- exit 128 means the job is incomplete
Combined exit statuses are achieved by adding the status codes together:
- exit 192 (128 + 64) means the job is incomplete AND the system should reboot
- exit 160 (128 + 32) means the job is incomplete AND the system should shut down
- exit 65 (64 + 1) means the job failed AND the system should reboot
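Because each status occupies its own bit, a combined code can be taken apart again with bitwise AND. The following is a minimal sketch of that decoding; describe_exit_code is a hypothetical name and is not part of prelude.tmpl, and the "unlisted" label for codes that set none of the documented bits is an assumption of this sketch.

```shell
#!/usr/bin/env bash
# Illustrative decoder for the exit-code bits listed above.
# describe_exit_code is a hypothetical helper, not part of prelude.tmpl.
describe_exit_code() {
    local code=$1 parts=()
    if (( code == 0 )); then
        echo "succeeded"
        return 0
    fi
    if (( code & 1 ));   then parts+=("failed"); fi
    if (( code & 128 )); then parts+=("incomplete"); fi
    if (( code & 16 ));  then parts+=("stop"); fi
    if (( code & 32 ));  then parts+=("shutdown"); fi
    if (( code & 64 ));  then parts+=("reboot"); fi
    # Non-zero codes with none of the documented bits are not covered above.
    if (( ${#parts[@]} == 0 )); then parts+=("unlisted"); fi
    echo "${parts[*]}"
}

describe_exit_code 192   # prints: incomplete reboot
describe_exit_code 65    # prints: failed reboot
```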
When writing tasks, take care that they do not exit with any of the above bits set by accident. This usually happens when set -e is present in a shell script and a command exits with a non-zero exit code that happens to have one of the above bits set. If this is the case, rework your script logic to catch that exit code and exit appropriately. A construct like failing-command || exit 1 is generally sufficient.
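One way to apply that advice consistently is a small wrapper that collapses every non-zero status to 1. This is a sketch only: run_or_fail is a hypothetical helper, and my-command in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command and collapse any non-zero status to 1,
# so a stray status such as 64 or 192 never becomes the script's exit code.
run_or_fail() {
    "$@" || return 1
}

# Usage inside a task script (my-command is a placeholder):
#   set -e
#   run_or_fail my-command --flag
```

With set -e active, a failure inside run_or_fail ends the script with status 1, which the Agent treats as a plain job failure rather than a reboot or shutdown request.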
Idempotent behavior¶
In DRP, take extra care to make sure tasks are idempotent. This is especially important for tasks that use the reboot functions described above. Often, when an operator needs to control a reboot, the operating environment is Sledgehammer. Sledgehammer runs in memory, so everything you previously configured is gone after a reboot, and when the system boots again it has to be restored to its state before the reboot. To keep your task from getting stuck in a reboot loop, create a param of type boolean, check its value at the start of the task to decide whether the work still needs to be done, and set it at the end of the task just before triggering the reboot. Below is a pseudo example.
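    #!/usr/bin/env bash
    {{template "prelude.tmpl" .}}
    # Skip the work if a previous run already set the param.
    if [[ {{.Param "reboot-test-skip"}} == true ]]; then
        echo "Skipping task because param says so"
        exit 0
    fi
    echo "running the thing"
    # Record that the work is done before rebooting, so the next run skips it.
    drpcli machines set "$RS_UUID" param reboot-test-skip to true
    exit_reboot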
Following the pattern outlined above ensures that the task is idempotent.
dr-provision (server side)¶
In dr-provision, the machine Agent relies on these API endpoints to perform its work:
- GET from /api/v3/machines/<machine-uuid> to get a fresh copy of the Machine during AGENT_INIT.
- PATCH to /api/v3/machines/<machine-uuid> to update the machine Stage and BootEnv during AGENT_CHANGE_STAGE.
- POST to /api/v3/jobs to retrieve the next Job to run during AGENT_RUN_TASK.
- PATCH to /api/v3/jobs/<job-uuid> to update Job status during AGENT_RUN_TASK and during AGENT_INIT.
- PUT to /api/v3/jobs/<job-uuid>/log to update the job log during AGENT_RUN_TASK.
- UPGRADE to /api/v3/ws to create the EventStream websocket that receives Events for the Machine from dr-provision. Each Event contains a copy of the Machine state at the point in time that the event was created.
Retrieving the next Job¶
Out of all those endpoints, the one that does the most work is the POST /api/v3/jobs endpoint, which is responsible for figuring out what (if any) is the next Job that should be provided to the Machine Agent. It encapsulates the following logic:
- dr-provision receives an incoming POST on /api/v3/jobs that contains a Job with the Machine and Context fields filled out.
  - If the Machine does not exist, the endpoint returns an Unprocessable Entity HTTP status code.
  - If the Machine is not Runnable and Available, the endpoint returns a Conflict status code.
  - If the Machine has no more runnable Tasks (as indicated by CurrentTask being greater than or equal to the length of the Machine Tasks list), or the current Context on the Machine is not equal to the Context of the new Job, the endpoint returns a No Content status code, indicating to the Machine Agent that there are no more tasks to run.
- dr-provision retrieves the CurrentJob for the Machine. If the Machine does not have a CurrentJob, we create a fake one in the Failed state and use that as CurrentJob for the rest of this process.
- dr-provision tentatively sets nextTask to CurrentTask + 1.
- If the CurrentTask is set to -1 or points to a stage: or bootenv: entry in the machine Task list, we mark the CurrentTask as failed if it is not already failed or created.
- If CurrentTask is set to -1, we update it to 0 and set nextTask to 0.
- If CurrentTask points to a stage:, context:, or bootenv: entry in the Tasks list, we roll forward on the Tasks list until we get to an entry that does not contain a stage:, context:, or bootenv: entry, gathering machine changes as we go. If the changes we gather result in any changes to the Machine object, we generate a new Job encapsulating all the changes we gathered, set it to the finished state, save the gathered machine changes and the job, and skip to the final step in this list.
- Depending on the State of the CurrentJob, we take one of the following actions:
  - incomplete: This indicates that CurrentJob did not fail, but it also did not finish. dr-provision returns CurrentJob unchanged, along with the Accepted status code.
  - finished: This indicates that the CurrentJob finished without error, and dr-provision should create a new Job for the next Task in the Tasks list. dr-provision sets CurrentTask to nextTask.
  - failed: This indicates that the CurrentJob failed. Since updating a Job to the failed state automatically makes the Machine not Runnable, something else has intervened to make the machine Runnable again. dr-provision will create a new Job for the current Task in the Tasks list.
- dr-provision creates a new Job for the Task in the Tasks list pointed to by CurrentTask in the created state. The Machine CurrentJob is updated with the UUID of the new Job. The new Job and the Machine are saved.
- If the entry in the Tasks list pointed to by CurrentTask starts with action:, then the rest of the entry is interpreted as either a plugin:action_name or as an action_name. dr-provision will try to invoke the requested action_name on the machine (optionally using the specified plugin). If the plugin invocation succeeds, the results of the invocation are saved in the log, the Job is set to finished, and returned along with the Created HTTP status code. If the plugin invocation fails for any reason, the reason it failed is saved in the log along with any diagnostic output from the plugin, the job is set to failed, and nothing is returned along with the NoContent status code.
- If the new Job is in the created state, it is returned along with the Created HTTP status code, otherwise nothing is returned along with the NoContent status code.
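The precondition checks in the first step, and the branch on the State of CurrentJob, can be sketched as two small functions. This is an illustration under stated assumptions rather than the server's actual code: the function names and argument layout are hypothetical, and the strings "true" and "false" stand in for the boolean Machine fields.

```shell
#!/usr/bin/env bash
# Sketch of the precondition checks for POST /api/v3/jobs.
# All argument names are illustrative:
#   $1 machine exists   $2 Runnable   $3 Available
#   $4 CurrentTask index   $5 length of the Tasks list
#   $6 Context on the Machine   $7 Context on the incoming Job
job_precheck_status() {
    local exists=$1 runnable=$2 available=$3
    local current_task=$4 task_count=$5 machine_ctx=$6 job_ctx=$7
    if [[ $exists != true ]]; then
        echo 422        # Unprocessable Entity
    elif [[ $runnable != true || $available != true ]]; then
        echo 409        # Conflict
    elif (( current_task >= task_count )) || [[ $machine_ctx != "$job_ctx" ]]; then
        echo 204        # No Content: nothing to run in this Context
    else
        echo continue   # go on to the CurrentJob handling
    fi
}

# Maps the State of CurrentJob to the action described in the list above.
current_job_action() {
    case $1 in
        incomplete) echo "return CurrentJob unchanged with 202 Accepted" ;;
        finished)   echo "set CurrentTask to nextTask and create a new Job" ;;
        failed)     echo "create a new Job for the current Task" ;;
        *)          echo "state not covered by this sketch" ;;
    esac
}

job_precheck_status true true true 3 3 "" ""   # prints 204
```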
Changing the Workflow on a Machine¶
Changing a Workflow on the Machine has the following effects:
- The Stages in the Workflow are expanded to create a new Tasks list. Each Stage gets expanded into a list as follows:
  - stage:<stageName>
  - bootenv:<bootEnvName> if the Stage specifies a non-empty BootEnv.
  - The Tasks list in the Stage
  The Tasks list on the Machine is replaced with the results of the above expansion.
- The CurrentTask index is set directly to -1.
- The Stage and BootEnv fields become read-only from the API. Instead, they will change in accordance with any stage: and bootenv: elements in the Task list resulting from expanding the Stages in the Workflow. Any Stage changes that happen during processing a Workflow do not affect the Tasks list or the CurrentTask index.
- The Context field on the Machine is set to the value of the BaseContext Meta field on the Machine, or the empty string if that Meta field does not exist on the Machine.
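The Stage expansion described in the first item can be sketched as follows. The function name is hypothetical, and the stage, BootEnv, and task names in the usage lines are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: expand one Stage into Tasks-list entries, one per line.
#   $1 = Stage name, $2 = BootEnv name (may be empty), remaining args = Tasks
expand_stage() {
    local stage=$1 bootenv=$2
    shift 2
    echo "stage:${stage}"
    # The bootenv: entry is emitted only for a non-empty BootEnv.
    if [[ -n $bootenv ]]; then
        echo "bootenv:${bootenv}"
    fi
    local task
    for task in "$@"; do
        echo "$task"
    done
}

# Hypothetical two-stage Workflow; concatenating the output of each Stage
# gives the new Tasks list for the Machine.
expand_stage discover sledgehammer gather-inventory
expand_stage configure "" set-hostname install-packages
```

The second Stage has no BootEnv, so its expansion contains only the stage: entry followed by its tasks.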
Removing a Workflow from a Machine¶
To remove a Workflow from a Machine, set the Workflow field to the empty string. The Stage field on the Machine is set to none, the Tasks list is emptied, the CurrentTask index is set back to -1, and the Context field on the Machine is set to the value of the BaseContext Meta field on the Machine, or the empty string if that Meta field does not exist on the Machine.
Changing the Stage on a Machine¶
Changing a Stage on a Machine has the following effects when done via the API and the Machine does not have a Workflow:
- The Tasks list on the Machine is replaced by the Tasks list on the Stage.
- If the BootEnv field on the Stage is not empty, it replaces the BootEnv on the Machine.
- The CurrentTask index is set to -1.
- If the Machine has a different BootEnv now, it is marked as not Runnable.
- The Context field on the Machine is set to the value of the BaseContext Meta field on the Machine, or the empty string if that Meta field does not exist on the Machine.
Resetting the CurrentTask index to -1¶
If the Machine does not have a Workflow, the CurrentTask index is simply set to -1. Otherwise, it is set to the most recent entry that would not occur in a different BootEnv from the machine's current BootEnv. In both cases, the Context field on the Machine is reset appropriately for the Task position.