
Workflow Mode

dr-provision implements a Workflow system to automate the various tasks needed to provision and decommission systems. The workflow system is the result of several other components in dr-provision interacting; the rest of this section goes over those parts in more detail.

The Models

Tasks

The basic unit of work that dr-provision sequences to drive Machines through a workflow is a Task. Individual Tasks are executed against Machines by creating a Job for them.

Tasks contain individual Templates that are expanded for a Machine whenever a Job is created. Each of these Templates expands to either a script to be executed (if the Path parameter is empty or not present) or a file to be placed on the filesystem at the location indicated by template-expanding the Path parameter (if Path is not empty).
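
As a minimal sketch of what this looks like in practice, the Task below carries two Templates: one with no Path (executed as a script) and one with a Path (written out as a file). The task name, file path, and the example/setting Param are hypothetical, and the sketch assumes drpcli's usual pattern of creating objects from JSON on stdin.

# Hypothetical Task with one script Template and one file Template.
drpcli tasks create - <<'EOF'
{
  "Name": "example-task",
  "Templates": [
    {
      "Name": "run-something",
      "Contents": "#!/usr/bin/env bash\necho \"this expands to a script and is executed\"\n"
    },
    {
      "Name": "write-a-file",
      "Path": "/etc/example.conf",
      "Contents": "setting={{.Param \"example/setting\"}}\n"
    }
  ]
}
EOF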

Jobs

A Job is used to track the execution history of Tasks against a specific Machine. A Job is created every time a Task is executed against a Machine, and Jobs keep track of their state of execution. The history of what has been executed (including all log output from scripts) is stored as a chain of Jobs, and the exit status of a Job determines what the Machine Agent will do next.
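
For day-to-day inspection, drpcli can list and show that Job history. A minimal sketch, assuming $RS_UUID holds the Machine's UUID and that your drpcli version supports the usual Key=Value list filters:

# List the Jobs that have been created for a machine, then inspect one.
drpcli jobs list Machine=$RS_UUID
drpcli jobs show <job-uuid>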

Stages

A Stage is used to provide a list of Tasks that should be run on a Machine along with the BootEnv the tasks should be run in.
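
A minimal sketch of a Stage, assuming the hypothetical example-task from earlier and the stock sledgehammer BootEnv; the field names follow the Stage model described above.

# Hypothetical Stage that runs example-task inside sledgehammer.
drpcli stages create - <<'EOF'
{
  "Name": "example-stage",
  "BootEnv": "sledgehammer",
  "Tasks": [ "example-task" ]
}
EOF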

Workflows

A Workflow is used to provide a list of Stages that a Machine should run through to get to a desired end state. When a Workflow is added to a Machine, it renders the BootEnv and Stage for the Machine read-only and replaces the task list on the Machine with one that will step through all the Stages, BootEnvs, and Tasks needed to drive the Machine through the Workflow.
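
A minimal sketch of defining a Workflow from Stages and attaching it to a Machine; the workflow and stage names are hypothetical, and drpcli machines update merges the supplied JSON into the Machine object.

# Hypothetical Workflow built from two Stages, then assigned to a Machine.
drpcli workflows create - <<'EOF'
{
  "Name": "example-workflow",
  "Stages": [ "example-stage", "another-stage" ]
}
EOF
drpcli machines update <machine-uuid> '{ "Workflow": "example-workflow" }'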

How They Work Together

Machine Agent (client side)

The Machine Agent runs on the Client and is responsible for executing tasks and rebooting the Machine as needed. It is structured as a finite state machine for increased reliability and auditability. The Machine Agent always starts in the AGENT_INIT state.

AGENT_INIT

Initializes the Agent with a fresh copy of the Machine data, marks the current Job for the Machine as failed if it is in the created or running state, and creates an event stream that receives events for that Machine from dr-provision. If an error was recorded, the Agent prints it to stderr and then clears it.

If an error occurs during initialization, the Agent sleeps briefly and transitions back to AGENT_INIT; otherwise it transitions to AGENT_WAIT_FOR_RUNNABLE.

AGENT_WAIT_FOR_RUNNABLE

Waits for the Machine to be both Available and Runnable, and for either the machine Context to equal the context the Agent is paying attention to, or for the machine BootEnv to change. Once those conditions are met, the Agent transitions to:

  • AGENT_REBOOT if the Machine changed BootEnv and the Agent is running in the empty ("") context,
  • AGENT_EXIT if the Agent received a termination signal,
  • AGENT_INIT if there was an error waiting for the state change, and
  • AGENT_RUN_TASK otherwise.

AGENT_RUN_TASK

Tries to create a new Job to run on the Machine. If there was an error creating the Job, the Agent transitions back to AGENT_INIT. If no Job was created, the Agent transitions to AGENT_CHANGE_STAGE if the Machine does not have a Workflow, and to AGENT_WAIT_FOR_STAGE_CHANGE if it does.

If a Job was created, the Agent attempts to execute all the steps in the Task for which the Job was created, and updates the Job depending on the exit status of the steps.

If there was an error executing the Job, the agent will transition back to AGENT_INIT.

If the Job signalled that a reboot is needed, the Agent transitions to AGENT_REBOOT.

If the Job signalled that the system should be powered off, the Agent transitions to AGENT_POWEROFF.

If the Job signalled that the Agent should stop processing Jobs, the Agent transitions to AGENT_EXIT.

Otherwise, the Agent transitions to AGENT_WAIT_FOR_RUNNABLE.

AGENT_WAIT_FOR_STAGE_CHANGE

Waits for the Machine to be Available, and for any of the following fields on the Machine to change:

  • CurrentTask
  • Tasks
  • Runnable
  • BootEnv
  • Stage
  • Context

Once those conditions are met, follows the same rules as AGENT_WAIT_FOR_RUNNABLE.

AGENT_EXIT

Causes the Agent to shut down cleanly.

AGENT_REBOOT

Reboots the system if running in the empty ("") context, otherwise exits the Agent.

AGENT_POWEROFF

Cleanly shuts the system down if running in the empty ("") context, otherwise exits the Agent.

Reboot! Using Agent State Changes in Scripts

The exit helper functions described below are implemented in a shared template in the community content, and are made available by adding {{ template "prelude.tmpl" . }} to your content.

By adding this library, you can call the following functions:

  • exit to exit the task normally
  • exit_incomplete to exit the job in an incomplete state
  • exit_reboot to exit the job and trigger a reboot
  • exit_shutdown to exit the job and trigger a system shutdown
  • exit_stop to exit the job and stop the Agent
  • exit_incomplete_reboot to mark the job as incomplete and reboot
  • exit_incomplete_shutdown to mark the job as incomplete and shut down

The above exit functions are implemented by calling exit with various bits set in the exit code:

  • exit 0 means the job succeeded
  • exit 1 means the job failed
  • exit 16 stops the agent
  • exit 32 triggers a shutdown
  • exit 64 triggers a reboot
  • exit 128 means the job is incomplete

Mixed exit statuses are achieved by adding the status codes together:

  • exit 192 means the job is incomplete AND the system should reboot
  • exit 160 means the job is incomplete AND the system should shut down
  • exit 65 means the job failed AND the system should reboot
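
For example, the incomplete-and-reboot combination can be produced either with the helper function or with the raw bits; both lines below (in a task that has included prelude.tmpl) have the same effect.

exit_incomplete_reboot    # helper from prelude.tmpl
exit $((128 + 64))        # raw bits: 128 (incomplete) + 64 (reboot) = 192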

Take care when writing tasks that they do not exit with any of the above bits set by accident. This usually happens when set -e is present in a shell script and a command exits with a non-zero exit code that happens to have one of the above bits set. If this is the case, rework your script logic to catch that exit code and exit appropriately; a construct like failing-command || exit 1 is generally sufficient.
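
A short sketch of the pitfall and the guard; some-command-that-may-fail and optional-check are placeholders for whatever your task actually runs.

#!/usr/bin/env bash
set -e
# Under 'set -e', any command that exits non-zero ends the script with that
# code, and the Agent interprets bits 16/32/64/128 as directives. Guard
# commands that may legitimately fail so the script stays in control of its
# own exit status.
some-command-that-may-fail || exit 1   # report a plain job failure
optional-check || true                 # or ignore the failure entirely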

Idempotent behavior

In DRP, extra care should be taken to make sure tasks are idempotent. This is especially important with regard to the reboot behavior described above. Often, when an operator needs to control a reboot, the operating environment is Sledgehammer. Sledgehammer runs in memory, so when you reboot, everything you previously configured in that environment is gone and must be re-established after the system boots again. To prevent your task from getting stuck in a reboot loop, create a Param of type boolean, set that Param at the end of your task just before rebooting, and check the value of the Param at the start of the task to decide whether the work has already been done. Below is a pseudo example.

#!/usr/bin/env bash
# Pull in the exit helpers (exit_reboot, exit_incomplete, and so on).
{{template "prelude.tmpl" .}}
# If a previous run already did the work and rebooted, skip it this time.
if [[ {{.Param "reboot-test-skip"}} == true ]]; then
  echo "Skipping task because param says so"
  exit 0
fi
echo "running the thing"
# Record that the work is done before rebooting so the task stays idempotent.
drpcli machines set "$RS_UUID" param reboot-test-skip to true
exit_reboot

Following the pattern outlined above will ensure that the task is idempotent.

dr-provision (server side)

In dr-provision, the Machine Agent relies on these API endpoints to perform its work (a minimal curl sketch follows the list):

  • GET from /api/v3/machines/<machine-uuid> to get a fresh copy of the Machine during AGENT_INIT.
  • PATCH to /api/v3/machines/<machine-uuid> to update the machine Stage and BootEnv during AGENT_CHANGE_STAGE.
  • POST to /api/v3/jobs to retrieve the next Job to run during AGENT_RUN_TASK.
  • PATCH to /api/v3/jobs/<job-uuid> to update Job status during AGENT_RUN_TASK and during AGENT_INIT.
  • PUT to /api/v3/jobs/<job-uuid>/log to update the job log during AGENT_RUN_TASK.
  • UPGRADE to /api/v3/ws to create the EventStream websocket that receives Events for the Machine from dr-provision. Each Event contains a copy of the Machine state at the point in time that the event was created.
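
A minimal curl sketch of two of those calls, assuming $RS_ENDPOINT points at the dr-provision server, $RS_TOKEN holds a valid API token, and $RS_UUID is the Machine's UUID (the Machine Agent normally does all of this for you; the content type on the log upload is an assumption):

# Fetch a fresh copy of the Machine, as the Agent does during AGENT_INIT.
curl -ks -H "Authorization: Bearer $RS_TOKEN" \
  "$RS_ENDPOINT/api/v3/machines/$RS_UUID"

# Send a line of output to a Job's log, as the Agent does during AGENT_RUN_TASK.
echo "hello from a task" | curl -ks -X PUT \
  -H "Authorization: Bearer $RS_TOKEN" \
  -H "Content-Type: application/octet-stream" \
  --data-binary @- "$RS_ENDPOINT/api/v3/jobs/<job-uuid>/log"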

Retrieving the next Job

Out of all those endpoints, the one that does the most work is the POST /api/v3/jobs endpoint, which is responsible for figuring out what the next Job (if any) to provide to the Machine Agent should be. It encapsulates the following logic (a request sketch follows the list):

  1. dr-provision receives an incoming POST on /api/v3/jobs that contains a Job with the Machine and Context fields filled out.

    If the Machine does not exist, the endpoint returns an Unprocessable Entity HTTP status code.

    If the Machine is not Runnable and Available, the endpoint returns a Conflict status code.

    If the Machine has no more runnable Tasks (as indicated by CurrentTask being greater than or equal to the length of the Machine Tasks list), or the current Context on the Machine is not equal to the Context of the new Job, the endpoint returns a No Content status code, indicating to the Machine Agent that there are no more tasks to run.

  2. dr-provision retrieves the CurrentJob for the Machine. If the Machine does not have a CurrentJob, we create a fake one in the Failed state and use that as CurrentJob for the rest of this process.

  3. dr-provision tentatively sets nextTask to CurrentTask + 1.
  4. If the CurrentTask is set to -1 or points to a stage: or bootenv: entry in the machine Task list, we mark the CurrentJob as failed if it is not already failed or created.
  5. If CurrentTask is set to -1, we update it to 0 and set nextTask to 0.
  6. If CurrentTask points to a stage:, context:, or bootenv: entry in the Tasks list, we roll forward through the Tasks list until we reach an entry that is not a stage:, context:, or bootenv: entry, gathering machine changes as we go. If the changes we gather result in any changes to the Machine object, we generate a new Job encapsulating all of those changes, set it to the finished state, save the gathered machine changes and the Job, and skip to the final step in this list.
  7. Depending on the State of the CurrentJob, we take one of the following actions:
    • incomplete: This indicates that CurrentJob did not fail, but it also did not finish. dr-provision returns CurrentJob unchanged, along with the Accepted status code.
    • finished: This indicates that the CurrentJob finished without error, and dr-provision should create a new Job for the next Task in the Tasks list. dr-provision sets CurrentTask to nextTask.
    • failed: This indicates that the CurrentJob failed. Since updating a Job to the failed state automatically makes the Machine not Runnable, something else has intervened to make the machine Runnable again. dr-provision will create a new Job for the current Task in the Tasks list.
  8. dr-provision creates a new Job for the Task in the Tasks list pointed to by CurrentTask in the created state. The Machine CurrentJob is updated with the UUID of the new Job. The new Job and the Machine are saved.
  9. If the entry in the Tasks list pointed to by CurrentTask starts with action:, then the rest of the entry is interpreted as either a plugin:action_name or an action_name. dr-provision will try to invoke the requested action_name on the Machine (optionally using the specified plugin). If the invocation succeeds, the results are saved in the Job log, the Job is set to finished, and it is returned along with the Created HTTP status code. If the invocation fails for any reason, the reason it failed is saved in the Job log along with any diagnostic output from the plugin, the Job is set to failed, and nothing is returned along with the NoContent status code.
  10. If the new Job is in the created state, it is returned along with Created HTTP status code, otherwise nothing is returned along with the NoContent status code.
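
The status codes above map directly onto the Agent's behavior. A hedged curl sketch of the same request, branching on the HTTP status code, with $RS_ENDPOINT, $RS_TOKEN, and $RS_UUID assumed to be set as before:

# POST a Job skeleton with only Machine and Context filled in, as the Agent does.
status=$(curl -ks -o /tmp/next-job.json -w '%{http_code}' -X POST \
  -H "Authorization: Bearer $RS_TOKEN" -H "Content-Type: application/json" \
  -d "{\"Machine\": \"$RS_UUID\", \"Context\": \"\"}" \
  "$RS_ENDPOINT/api/v3/jobs")
case "$status" in
  201) echo "Created: a new Job is in /tmp/next-job.json; run it" ;;
  202) echo "Accepted: the current Job is incomplete; run it again" ;;
  204) echo "No Content: nothing left to run" ;;
  409) echo "Conflict: the Machine is not Runnable and Available" ;;
  *)   echo "unexpected status: $status" ;;
esac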

Changing the Workflow on a Machine

Changing a Workflow on the Machine has the following effects:

  • The Stages in the Workflow are expanded to create a new Tasks list. Each Stage gets expanded into a List as follows:

    • stage:<stageName>
    • bootenv:<bootEnvName> if the Stage specifies a non-empty BootEnv.
    • The Tasks list in the Stage

    The Tasks list on the Machine is replaced with the results of the above expansion (see the sketch after this list).

  • The CurrentTask index is set directly to -1.

  • The Stage and BootEnv fields become read-only from the API. Instead, they will change in accordance with any stage: and bootenv: elements in the Task list resulting from expanding the Stages in the Workflow. Any Stage changes that happen while processing a Workflow do not affect the Tasks list or the CurrentTask index.
  • The Context field on the Machine is set to the value of the BaseContext Meta field on the Machine, or the empty string if that Meta field does not exist on the Machine.
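
As a hypothetical illustration of that expansion: a Workflow whose Stages are example-stage (BootEnv sledgehammer, Tasks task-a and task-b) and another-stage (no BootEnv, Task task-c) would produce the Tasks list shown in the comment below. The real result can be inspected in the Tasks field of the Machine.

# Expanded Tasks list for the hypothetical Workflow described above:
#   stage:example-stage
#   bootenv:sledgehammer
#   task-a
#   task-b
#   stage:another-stage
#   task-c
drpcli machines show <machine-uuid>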

Removing a Workflow from a Machine

To remove a workflow from a Machine, set the Workflow field to the empty string. The Stage field on the Machine is set to none, the Tasks list is emptied, the CurrentTask index is set back to -1, and the Context field on the Machine is set to the value of the BaseContext Meta field on the Machine, or the empty string if that Meta field does not exist on the Machine.
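
For example, something like the following clears the Workflow via drpcli; dr-provision then resets the Stage, Tasks, CurrentTask, and Context fields as described above.

drpcli machines update <machine-uuid> '{ "Workflow": "" }'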

Changing the Stage on a Machine

Changing a Stage on a Machine has the following effects when done via the API and the Machine does not have a Workflow:

  • The Tasks list on the Machine is replaced by the Tasks list on the Stage.
  • If the BootEnv field on the Stage is not empty, it replaces the BootEnv on the Machine.
  • The CurrentTask index is set to -1.
  • If the Machine has a different BootEnv now, it is marked as not Runnable.
  • The Context field on the Machine is set to the value of the BaseContext Meta field on the Machine, or the empty string if that Meta field does not exist on the Machine.
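
A one-line sketch of making that change (the stage name is hypothetical, and the Machine must not have a Workflow):

drpcli machines update <machine-uuid> '{ "Stage": "example-stage" }'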

Resetting the CurrentTask index to -1

If the Machine does not have a Workflow, the CurrentTask index is simply set to -1. Otherwise. it is set to the most recent entry that would not occur in a different BootEnv from the machine's current BootEnv. In both cases, the Context field on the Machine is reset appropriately for the Task position.