Etsy Icon>

Code as Craft

Introducing Arbiter: A Utility for Generating Oozie Workflows main image

Introducing Arbiter: A Utility for Generating Oozie Workflows

  image

At Etsy we have been using Apache Oozie for managing our production workflows on Hadoop for several years. We’ve even recently started using Oozie for managing our ad hoc Hadoop jobs as well. Oozie has worked very well for us and we currently have several dozen distinct workflows running in production. However, writing these workflows by hand has been a pain point for us. To address this, we have created Arbiter, a utility for generating Oozie workflows.

The Problem

Oozie workflows are specified in XML. The Oozie documentation has an extensive overview of writing workflows, but there are a few things that are helpful to know. A workflow begins with the start node:

Each job or other task in the workflow is an action node within a workflow. There are some built-in actions for running MapReduce jobs, standard Java main classes, etc. and you can also define custom action types. This is an example action:

Each action defines a transition to take upon success and a (possibly) different transition to take upon failure:

To have actions run in parallel, a fork node can be used. All the actions specified in the fork will be run in parallel:

After these actions there must be a join node to wait for all the forked actions to finish: Finally, a workflow ends by transitioning to either the end or kill nodes, for a successful or unsuccessful result, respectively:

Here is a complete example of one of our shorter workflows

Having the workflows be defined in XML has been very helpful. We have several validation and visualization tools in multiple languages that can parse the XML and produce useful results without being tightly coupled to Oozie itself. However, the XML is not as useful for the people that work with it. First, it is very verbose. Each new action adds about 20 lines of XML to the workflow, much of which is boilerplate. As a result, our workflows average around 200 lines and the largest is almost 1800 lines long. This also makes it hard for someone to read the workflow and understand what the workflow does and the flow of execution.

Next, defining the flow of execution can be tricky. It is natural to think about the dependencies between actions. Oozie workflows, however, are not specified in terms of these dependencies. The workflow author must satisfy these dependencies by configuring the workflow to run the actions in the proper order. For simple workflows this may not be a problem, but can quickly become complex. Moreover, the author must manually manage parallelism by inserting forks and joins. This makes modifying the workflow more complex. We have found that it’s easy to miss adding an action to a fork, resulting in an orphaned action that doesn’t get run. Another common problem we’ve had with forks is that a single-action fork is considered invalid by Oozie, which means removing the second-last action from a fork requires removing the fork and join entirely.

Introducing Arbiter

Arbiter was created to solve these problems. XML is very amenable to being produced automatically, so there is the opportunity to write the workflows in another format and produce the final workflow definition. We considered several options, but ultimately settled on YAML. There are robust YAML parsers in many languages and we considered it easier for people to read than JSON. We also considered a Scala-based DSL, but we wanted to stick with a markup language for language-agnostic parsing.

Writing Workflows

Here is the same example workflow from above written in Arbiter’s YAML format:

The translation of the YAML to XML is highly dependent on the configuration given to Arbiter, which we will cover in the next section. However, there are several points to consider now. First, the YAML definition is only about 20% of the length of the XML. Since the workflow definition is much shorter, it’s easier for someone to read it and understand what the workflow does. In addition, none of the flow control nodes need to be manually specified. Arbiter will insert the start, end, and kill nodes in the correct locations. Forks and joins will also be inserted when actions can be run in parallel.

Most importantly, however, the workflow author can directly specify the dependencies between actions, instead of the order of execution. Arbiter will handle ordering the actions in such a way to satisfy all the given dependencies.

In addition to the standard workflow actions, Arbiter allows you to define an “error handler” action. It will automatically insert this action before any transitions to the end or kill nodes in the workflow. We use this to send an email alert with details about the success and failure of the workflow actions. If these are omitted the workflow will transition directly to the end or kill nodes as appropriate.

Configuration

The mapping between a YAML workflow definition and the final XML is controlled by configuration files. These are also specified in YAML. Here is an example configuration file to accompany the example workflow given above:

The key part of the configuration file is the actionTypes setting. Each action type will map to a certain action type in the XML workflow. However, multiple Arbiter action types can map to the same Oozie action type, such as the screamapillar and rollup action types both mapping to the Oozie java action type. This allows you to have meaningful action types in the YAML workflow definitions without the overhead of actually creating custom Oozie action types. Let’s review the parts of an action type definition:

The tag key defines the action type tag in the workflow XML. This can be one of the built-in action types like java, or a custom Oozie action type. Arbiter does not need to be made aware of custom Oozie action types. The name key defines the name of this action type, which will be used to set the type of actions in the workflow definition. If the Oozie action type accepts configuration properties from the workflow XML, these are controlled by the configurationPosition and properties keys. properties defines the actual configuration properties that will be applied to every action of this type, and configurationPosition defines where in the generated XML for the action the configuration tag should be placed. The defaultArgs key defines the default elements of the generated XML for actions of this type. The keys are the names of the XML tags, and the values are lists of the values for that tag. Even tags that can appear only once must be specified as a list.

You can also define properties to be populated from values set in the workflow definition. Any string surrounded by $$ will be interpolated in this way. $$rollup_file$$ and $$category$$ are examples of doing so in this configuration file. These will be populated with the values of the rollup_file and category keys from a rollup action in the workflow definition.

Using this configuration file, we could write an action like the following in the YAML workflow definition: Arbiter would then translate this action to the following XML: Arbiter also allows you to specify the name of the kill node and the message it logs with the killName and killMessage properties.

How Arbiter Generates Workflows

Arbiter builds a directed graph of all the actions from the workflow definition it is processing. The vertices of the graph are the actions and the edges are dependencies. The direction of the edge is from the dependency to the dependent action to represent the desired flow of execution. Oozie workflows are required to be acyclic, so if a cycle is detected Arbiter will throw an exception.

The directed graph that Arbiter builds will be made up of one or more weakly connected components. This is the graph from the example workflow above, which has two such components:

An example of a graph that is input to Arbiter

Each of these components is processed independently. First, any vertices with no incoming edges are removed from the graph and inserted into a new result graph. If there is more than one vertex removed Arbiter will also insert a fork/join pair to run them in parallel. Having removed those vertices, the original component will now have been split into one or more new weakly connected components. Each of these components is then recursively processed in this same way.

Once every component has been processed, Arbiter then combines these independent components until it has produced a complete graph. Since these components were initially not connected, they can be run in parallel. If there is more than component, Arbiter will insert a fork/join pair. This results in the following graph for the example workflow, showing the ok transitions between nodes:

An example workflow graph produced by Arbiter

This algorithm biases Arbiter towards satisfying the dependencies between actions over achieving optimal parallelism. In general this algorithm still produces good parallelism, but in certain cases (such as a workflow with one action that depends on every other action), it can degenerate to a fairly linear flow. While it is a conservative choice, this algorithm has still worked out well for most of our workflows and has the advantage of being straightforward to follow in case the generated workflow is incorrect or unusual.

Once this process has finished all the flow control nodes will be present in the workflow graph. Arbiter can then translate this into the XML using the provided configuration files.

Get Arbiter

Arbiter is now available on Github! We’ve been using Arbiter internally already and it’s been very useful for us. If you’re using Oozie we hope Arbiter will be similarly useful for you and welcome any feedback or contributions you have!