Memoization for large-scale genomic analysis allows researchers and bioinformaticians to restart from a point of failure by reusing existing outputs. This functionality is critically important given the size and complexity of genomic data and the impact a failure has on workflow efficiency and overall cost.
The Seven Bridges Platform enables memoization by caching results at the most basic level: that of individual tools. Each time a tool is run, its inputs are noted in a ledger along with the resultant outputs. If the tool is subsequently called with the same inputs, instead of re-running the tool, the memoized results are simply read from the ledger and served back. Thus, the nature and complexity of workflows play no part in how tools access the cache: as soon as a tool has finished running in one workflow, its cached results become available to every other workflow that invokes it with the same inputs.
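The ledger idea can be sketched in a few lines of Python. This is a simplified illustration of tool-level memoization, not the platform's actual implementation; the function and variable names here are hypothetical.

```python
import hashlib
import json

# The ledger: maps a content hash of (tool, inputs) to the recorded outputs.
ledger = {}

def cache_key(tool_id, inputs):
    """Derive a stable key from the tool's identity and its exact inputs."""
    payload = json.dumps({"tool": tool_id, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_tool(tool_id, inputs, compute):
    """Serve memoized outputs if this (tool, inputs) pair was seen before;
    otherwise run the tool and record the result in the ledger."""
    key = cache_key(tool_id, inputs)
    if key in ledger:
        return ledger[key]          # cache hit: no re-run
    outputs = compute(inputs)       # the expensive tool execution
    ledger[key] = outputs
    return outputs
```

Because the key depends only on the tool and its inputs, any later caller presenting the same pair gets a cache hit, regardless of which workflow the original run belonged to.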
Let’s take a look at how memoization serves as a safeguard to your research.
Reuse computation across multiple related workflows
Memoization enables researchers to avoid duplicate computations across related workflows. For example, a sequence alignment step is typically expensive, can take several hours to complete, and is often a common component of many workflows. If one workflow is run on a particular pair of files from a sequencer, and a second, different workflow containing the same alignment step is subsequently run on the same files, the Seven Bridges Platform will detect this and fetch the memoized results from the tool cache, reducing the running time from hours to seconds.
Quick recovery from mistaken parameters
Sometimes users make mistakes when setting off tasks. Perhaps someone, in a bit of a hurry to beat the Friday traffic, puts in a wrong parameter and sets off a large batch run. On Saturday morning they take a peek at what the tasks have been doing and discover, a thousand dollars later, a mistaken parameter value.
On the Seven Bridges Platform, the user can simply abort the entire batch task, then use the convenient “edit and run task” button to clone the task, set the correct parameter, and restart within a matter of seconds. With memoization turned on, all the finished steps will be skipped and the workflow will carry on from where it was interrupted.
Explore a large parameter set
Many research projects involve a large set of parameters. Exploring many different parameter combinations is costly and time-consuming, but often vital. Memoization enables more efficient exploration by allowing researchers to run a workflow with one set of parameters and then rerun it repeatedly with changed values. Memoized results reduce the time and cost of each rerun, ultimately widening the range of parameters that can be explored.
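The effect on a parameter sweep can be sketched with a toy two-step workflow: an expensive alignment step followed by a variant caller whose parameter is being swept. This is an illustrative analogy only (using Python's built-in `functools.lru_cache` to stand in for the platform cache); the step names are hypothetical.

```python
import functools

align_runs = 0  # count how many times the expensive step actually executes

@functools.lru_cache(maxsize=None)      # stands in for the tool cache
def align(reads):
    global align_runs
    align_runs += 1
    return f"{reads}.bam"               # placeholder for an hours-long job

def call_variants(bam, min_quality):
    return f"variants({bam}, q>={min_quality})"

# Sweep the caller's quality threshold; alignment runs only once and every
# subsequent combination reuses the cached result.
results = {q: call_variants(align("sample.fq"), q) for q in (10, 20, 30)}
```

Only the step whose parameter changed is ever recomputed, so the marginal cost of each extra parameter combination is the cost of the downstream steps alone.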
Efficient development and debugging
Memoization is very useful in the workflow development phase because it makes workflows re-entrant: execution can recover from a failed or stopped task and, after any changes are applied, carry on from the point of last success.
Using the Seven Bridges graphical tool and workflow editor, the “update app” feature, and the “edit and rerun” feature on the task launch page, a developer can alter individual parts of a workflow and repeatedly rerun a test task quickly and cost-effectively until all workflow components have been debugged to satisfaction.
Because the cached results are available as soon as a tool has finished running, even a running task can be cloned with a new set of changes. Often a developer will want to keep the original task running to completion to make sure all the downstream sub-graphs affected by the changes still work, even as fixes to other issues are implemented.
Test workflows during refactoring
When changes to a workflow should not change the results, a memoized re-run can act as a quick sanity check. If the changed components are expected to produce the same results as their older versions, it follows that only those changed components should end up being re-computed. Components downstream of them should receive the same inputs, which allows them to be skipped. An execution where a downstream sub-graph is also recomputed indicates that the changes altered the outputs. This allows efficient “unit tests” of workflows that incur no additional compute or storage cost.
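The sanity check above boils down to an output comparison: if a refactored component's output is byte-identical to its predecessor's, every downstream step sees the same inputs and can be served from cache. A minimal sketch, with hypothetical function names:

```python
import hashlib

def output_hash(data: bytes) -> str:
    """Content hash of a component's output, used as the cache-lookup key
    for the steps downstream of it."""
    return hashlib.sha256(data).hexdigest()

def downstream_must_recompute(old_output: bytes, new_output: bytes) -> bool:
    """True if the refactored component changed its output: downstream steps
    then receive new inputs and are re-executed rather than skipped."""
    return output_hash(old_output) != output_hash(new_output)
```

In other words, watching whether the downstream sub-graph gets skipped or re-run is itself the test result.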