
Creating a bulletproof workflow

Written by Marko Marinkovic in Platform on February 17th, 2017

The importance of a bulletproof workflow

We’ve interviewed some of Seven Bridges’ experienced bioinformaticians to collect their tips and best practices for creating a bulletproof workflow on the Platform. Many bioinformaticians will have experienced the frustration of inexplicably failed analyses, incorrect outputs, or inconsistent results. When dealing with complex tools and gigabytes or even terabytes of data, it is particularly important to minimize the potential for human error in your analyses.

The Seven Bridges Platform brings the power of cloud infrastructure to your bioinformatics analyses. You can add your own tools to the Platform and build them into workflows in combination with Seven Bridges’ publicly available tools. A workflow is a chain that is only as strong as its weakest link. To create a stable and consistent workflow, care needs to be taken when wrapping user-uploaded tools and when piping tools together into a single functional unit.

Know thy tool(s)

Each workflow consists of a series of tools that have their own properties and behaviors. Before wrapping a tool for use on the Seven Bridges Platform, you should know how the tool works, what it expects as its inputs, what its available parameters are, whether the tool’s behavior and performance depend on the types and sizes of inputs, and any other details that might affect the way the tool functions with other tools on the Platform. We recommend that you test the tool locally first, and only wrap it for use on the Platform once you have confirmed that it meets your requirements and are familiar with its inputs and outputs.

It’s also important at this stage to assess the hardware requirements of the tool (CPU, memory and disk space), at least for the input file size, type, and parameters of your intended use case. This type of information will be important for selecting an adequate instance type on the Platform. To read more on choosing instance types, see the post Making efficient use of compute resources.
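As a rough illustration of assessing hardware requirements during local testing, the following Python sketch runs a command and reports its wall-clock time and peak memory. It is a minimal sketch under stated assumptions: the helper name profile_command is hypothetical, the approach relies on the Unix-only resource module from the standard library, and the stand-in command should be replaced with your real tool invocation and test inputs. It does not measure disk usage.

```python
import resource
import subprocess
import sys
import time


def profile_command(cmd):
    """Run cmd locally and report exit code, wall time and peak memory.

    Hypothetical helper for local testing only. Unix-only, because it
    depends on the standard-library `resource` module.
    """
    start = time.perf_counter()
    completed = subprocess.run(cmd, check=False)
    wall_seconds = time.perf_counter() - start

    # ru_maxrss is the largest peak resident set size among all child
    # processes this interpreter has waited for so far, so profile one
    # command per Python process for a clean number.
    # Units: kilobytes on Linux, bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024

    return {
        "exit_code": completed.returncode,
        "wall_seconds": wall_seconds,
        "peak_memory_mb": peak / divisor,
    }


if __name__ == "__main__":
    # Stand-in command that allocates about 50 MB; replace it with the
    # real tool and representative input files.
    report = profile_command(
        [sys.executable, "-c", "data = bytearray(50_000_000)"]
    )
    print(report)
```

Repeating the measurement with small, typical, and large inputs gives a sense of how the requirements scale before you commit to specific CPU and memory values.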

Wrap and test tools separately

In the context of the Seven Bridges Platform, “wrapping” a tool is the process of specifying details of its input and output ports, its command options, and the CPU and memory resources it requires, so the tool can be used on the Platform. This specification is made using the Tool Editor, which generates a Common Workflow Language (CWL) description of the tool that the Platform reads in order to run it. Before including a tool in a workflow, make sure it is wrapped properly by running it on its own on the Platform. To do this, you’ll have to supply all the tool’s input files, including any that would otherwise be generated by an upstream tool in a workflow. When running the tool on its own, we recommend testing it with several sets of input files that differ in size and other characteristics. Then, when you are sure that the tool works the way it is intended to, you can start connecting it to other tools.
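To give a feel for what a CWL tool description contains, here is a minimal Python sketch that builds one as a dictionary and prints it as JSON. It is illustrative only: it uses CWL v1.0 field names, which may differ from what the Tool Editor actually generates, and the tool name my_tool, the --threads option, the port names, and the *.out output pattern are all hypothetical.

```python
import json


def build_tool_description(base_command, cores, ram_mb):
    """Return a minimal CWL v1.0 CommandLineTool description as a dict.

    Illustrative sketch: the ports and options below belong to a
    hypothetical tool, not to any real program.
    """
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "baseCommand": base_command,
        # CPU and memory the tool needs; ramMin is given in mebibytes.
        "requirements": [
            {
                "class": "ResourceRequirement",
                "coresMin": cores,
                "ramMin": ram_mb,
            }
        ],
        # Input ports: one required file and one optional integer option.
        "inputs": {
            "input_file": {
                "type": "File",
                "inputBinding": {"position": 1},
            },
            "threads": {
                "type": "int?",
                "inputBinding": {"prefix": "--threads"},
            },
        },
        # Output port: collect files matching a glob pattern.
        "outputs": {
            "result": {
                "type": "File",
                "outputBinding": {"glob": "*.out"},
            }
        },
    }


if __name__ == "__main__":
    description = build_tool_description(["my_tool"], cores=2, ram_mb=4096)
    print(json.dumps(description, indent=2))
```

The point of the sketch is the shape of the description: a base command, resource requirements, and typed input and output ports are the same details you specify when wrapping a tool.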

Here are some more things that you might want to pay attention to when wrapping a tool:

If a tool requires a secondary file, such as an index file, to be supplied along with the “primary” file on an input port, make sure that the input port is configured to find and read the secondary file.
Make sure to specify correct CPU and memory requirements for your tool. The Platform’s scheduling algorithm will use the values supplied for these fields in the Tool Editor to select a suitable computation instance to run the tool on. If the values you enter are lower than the tool requires given your specified inputs, the tool will fail. On the other hand, if the values you enter are too high, you may end up paying for compute resources you are not using.
Find edge cases that cause the tool to fail (if any). To do this, use input files that differ in size and other specific characteristics, as well as different values of input parameters.
If there is a parameter that will cause the tool to misbehave unless it is given a specific value, define this default value and “lock” it, so it cannot be changed on execution. However, make sure to provide an explanation of why you have locked the parameter in the tool description. Locking is also good practice for “technical” parameters that do not directly impact the results provided by the tool.
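One way to organize the search for edge cases is to enumerate every combination of test input and parameter value up front. The following Python sketch does that; the function name build_test_matrix, the file names, and the parameter names are hypothetical, and actually running each case on the Platform is left out.

```python
from itertools import product


def build_test_matrix(input_files, parameter_grid):
    """Return one test case per combination of input file and parameters.

    parameter_grid maps a parameter name to the list of values to try.
    Hypothetical helper: it only plans the runs, it does not execute them.
    """
    names = sorted(parameter_grid)
    value_combinations = product(*(parameter_grid[name] for name in names))
    cases = []
    for input_file, values in product(input_files, value_combinations):
        cases.append({"input": input_file, "params": dict(zip(names, values))})
    return cases


if __name__ == "__main__":
    # Hypothetical test inputs covering empty, small and large files.
    inputs = ["empty.fastq", "small.fastq", "large.fastq"]
    grid = {"threads": [1, 8], "min_quality": [0, 20, 40]}
    for case in build_test_matrix(inputs, grid):
        print(case)
```

With three inputs, two thread counts, and three quality thresholds, this plans 18 runs. Recording which of them fail points directly at the parameters that are worth locking.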

Assemble the workflow section by section

A step-by-step approach makes debugging a lot easier. It is always a good idea to create smaller “pieces” of a workflow that include just a small number of tools. This way you will be able to see how the tools operate together and whether there are any errors or unexpected behaviors that need to be addressed. Again, to test the sub-workflow, provide test input files and see if this part of the workflow provides correct and expected outputs. If so, it is ready to be connected to the other pieces of the workflow.

Connect all the pieces into a single workflow

When putting all the components together, pay attention to the following features:

Stage input. This option on the Tool Editor allows you to make a tool’s output files available in the working directory of the next downstream tool in the workflow. Learn more about stage input and its common use cases.
Metadata inheritance. Several tools require their input files to be annotated with appropriate metadata values. It is therefore important to set up your tools so that output files inherit metadata from input files. Learn more on configuring tools’ handling of metadata in the documentation on tool output ports.
Secondary files. These files are usually index files that allow faster random access to a file containing genomic data. Some tools require index files to be present along with the data file; to ensure that index files are always provided, we recommend including a suitable indexing tool upstream of the tool that needs them. This might slightly extend the execution time of a workflow, but will prevent it from failing if the required secondary file is not present in the project.
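When testing locally, a quick check that each data file has its index next to it can catch missing secondary files early. The Python sketch below is a minimal illustration, not a Platform feature: the function name find_index is hypothetical, and the suffix table only covers a few common naming conventions, so adjust it to what your tools expect.

```python
from pathlib import Path

# Assumed naming conventions for index files; extend as needed.
INDEX_SUFFIXES = {
    ".bam": [".bam.bai", ".bai"],
    ".vcf.gz": [".vcf.gz.tbi"],
    ".fasta": [".fasta.fai"],
}


def find_index(data_file):
    """Return the path of an index file next to data_file, or None.

    Hypothetical helper: looks in the same directory for a file whose
    name follows one of the conventions in INDEX_SUFFIXES.
    """
    path = Path(data_file)
    for suffix, index_suffixes in INDEX_SUFFIXES.items():
        if path.name.endswith(suffix):
            stem = path.name[: -len(suffix)]
            for index_suffix in index_suffixes:
                candidate = path.with_name(stem + index_suffix)
                if candidate.exists():
                    return candidate
            return None
    # Unknown file type: no index convention to check.
    return None


if __name__ == "__main__":
    # Hypothetical file names used only for demonstration.
    for name in ["sample.bam", "variants.vcf.gz", "reference.fasta"]:
        index = find_index(name)
        print(name, "->", index if index else "index missing")
```

A missing index reported here is a hint that the workflow needs an indexing step upstream of the tool that consumes the file.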

After thorough testing and incremental assembly, your workflow should be bulletproof. Good luck!

Don’t forget that Seven Bridges’ interdisciplinary support team are on hand at all times to help you debug your analyses if you need them. Don’t hesitate to click the Get support button on the Platform if you’d like a helping hand or second opinion.

What next?

To read about more advanced techniques for Platform use, see our series Tool wrapping tips and tricks.
