Slicer3:Large scale experiment control brainstorming


Goal

To provide Slicer3 with a mechanism for submitting, monitoring, and summarizing large scale experiments that utilize Slicer3 modules, particularly the Command Line Modules. This page summarizes our thoughts, requirements, and experiments to date, mostly accomplished during the March 2007 Slicer3 MiniRetreat.

There are two introductory use cases that we wish to support:

  1. Slicer3 is used interactively to select a set of parameters for an algorithm or workflow on a single dataset. Then, these parameters are applied to N datasets non-interactively.
  2. Slicer3 is used interactively to select a subset of parameters for an algorithm or workflow on a single dataset. Then, the remaining parameter space is searched non-interactively. Various parameter space evaluation techniques could be employed, the simplest of which is to sample the space of (param1, param2, param3).
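As a minimal sketch of the simplest evaluation technique in the second use case, the following enumerates every point of a regular grid over (param1, param2, param3). The parameter values are placeholders chosen only for illustration, and the helper name sample_parameter_space is hypothetical.

```python
from itertools import product


def sample_parameter_space(grid):
    """Yield one dict per point of the full Cartesian grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


# Placeholder values, for illustration only.
grid = {
    "param1": [1, 2],
    "param2": [0.1, 0.5],
    "param3": ["a", "b", "c"],
}

points = list(sample_parameter_space(grid))
print(len(points))  # one non-interactive job per grid point
print(points[0])
```

Each resulting dict would correspond to one non-interactive job. Smarter search strategies would replace the exhaustive product with a sampler of their own.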

Note that with the above two use cases, we are only trying to address large scale experiment control from the standpoint of what it means to Slicer3. We are not trying to solve the general case of large scale experiment control.

Assumptions and restrictions

  1. Computing configuration.
    We shall support a variety of computing infrastructures which include
    1. single computer systems,
    2. clusters,
    3. grids (optional)
  2. Access to compute nodes.
    We shall have no direct access to the compute nodes. All job submissions shall be to some sort of submit node. An exception may be made when operating on a single computer system configuration.
  3. Staged data
    The compute nodes shall mount a filesystem outside of the node on which data is staged. We are not providing Slicer3 with the mechanisms to stage data. We assume that all data is staged outside of Slicer3.
  4. Staged programs
    The compute nodes shall have access to the Slicer3 processing modules. Like the case for data, the processing modules are staged outside of the Slicer3 environment.
  5. Experiment scheduling
    A given experiment shall result in one or more processing jobs being submitted to the computing resources.
  6. Job submission
    Submitting a job to the computing infrastructure shall result in a job submission token such that the job can be
    1. monitored for status: scheduled, running, completed
    2. terminated
  7. Experiment control
    We shall be able to monitor an experiment to see its status.
    We shall be able to interrupt an experiment. This may involve removing jobs from the queue and terminating jobs in process.
    We shall be able to resume an experiment without re-running the entire experiment. Previously terminated jobs will be resubmitted. Previously completed jobs will not be rerun.
    We shall be able to rerun an experiment, overwriting previous results.
  8. Job execution robustness
    Jobs terminating unsuccessfully shall be automatically resubmitted to the computing environment upon the experiment designer's request. Jobs may be resubmitted zero times, K times, or until successful.
  9. Platform
    The cluster head node is expected to be a Linux machine with standard packages installed.
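The experiment control and robustness requirements above can be sketched as two small decisions: whether a failed job may be resubmitted, and which jobs to submit when an experiment is resumed or rerun. This is only an illustration under our own assumptions. The status strings "terminated" and "not submitted", the mode names, and both function names are hypothetical.

```python
def may_resubmit(resubmissions_so_far, max_resubmissions):
    """Retry policy: 0 = never, K = at most K times, None = until successful."""
    if max_resubmissions is None:
        return True
    return resubmissions_so_far < max_resubmissions


def jobs_to_submit(statuses, mode):
    """Pick the jobs to submit for an interrupted experiment.

    statuses maps a job name to its last known status.
    mode "resume" skips completed jobs; mode "rerun" submits everything.
    """
    if mode == "rerun":
        return list(statuses)
    return [name for name, status in statuses.items()
            if status in ("terminated", "not submitted")]


statuses = {"ex1": "completed", "ex2": "terminated", "summary": "not submitted"}
print(jobs_to_submit(statuses, "resume"))
print(jobs_to_submit(statuses, "rerun"))
```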

Components

To address the application scenario, we considered the following:

Completion of a Single Execution Step

Given that any job submission may fail (for example, if the cluster or a node goes down) we need to be able to distinguish the following cases when starting a job:

  1. job has not been started
  2. job has been started (i.e. we are reattaching to a running job)
    2a. job is still running
    2b. job is no longer running
  3. job has completed successfully

We want to provide the following capabilities:

  • the job can be started (i.e. we have a back-end for different job control systems)
  • we can check the status of jobs (we save the job token)
  • we can restart a job that has died
  • we don't start a job if it is already running
  • we don't re-run a job that already has completed successfully
  • we can clean up the state so the whole job can be re-run
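The cases and capabilities above amount to a small decision table. The sketch below classifies a job from three observations and maps each state to an action. The observation names (has_token, is_running, succeeded) and the action strings are our own placeholders, not part of any existing Slicer3 code.

```python
from enum import Enum


class JobState(Enum):
    NOT_STARTED = 1
    RUNNING = 2      # case 2a: started and still running
    DIED = 3         # case 2b: started but no longer running
    COMPLETED = 4


# What the controller should do in each state.
ACTIONS = {
    JobState.NOT_STARTED: "start",
    JobState.RUNNING: "reattach",
    JobState.DIED: "restart",
    JobState.COMPLETED: "skip",
}


def classify(has_token, is_running, succeeded):
    """Derive the job state from a saved token and the back-end's answers."""
    if succeeded:
        return JobState.COMPLETED
    if not has_token:
        return JobState.NOT_STARTED
    return JobState.RUNNING if is_running else JobState.DIED


state = classify(has_token=True, is_running=False, succeeded=False)
print(state.name, ACTIONS[state])
```

Cleaning up for a full re-run then means discarding the saved token and the success marker so that every job classifies as NOT_STARTED again.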

Overall Experiment Control

We would like tools to provide the following controls:

  • Be able to start an experiment
  • Be able to check the status of an experiment that is running
  • Be able to confirm that the experiment has completed
  • Be able to restart an experiment in the middle if the cluster crashed or jobs failed.

Thought experiments

Below are a few thought experiments that explore how the above needs can be addressed.

Makefiles + the Looping Launcher

We considered the running of the experiment as a dependency graph, and realized that the 'make' utility is a powerful system for expressing and resolving dependencies, and that it has parallel execution features with the -j flag. We'd like to be able to say 'make Experiment' and have the results generated. Make is good at re-running only what is needed.

We looked at three issues:

  1. Makefiles are not friendly to write (particularly with large numbers of dependencies), so some helper utilities would be needed. We anticipate 100s or 1000s of file dependencies in an experiment, which make can handle.
  2. We needed a 'looping launcher' (a.k.a. loopy) that would submit a job and monitor it for successful completion. If the job fails, the launcher resubmits the job for a pre-set number of times.
  3. Make works on the dependencies at the file level, whereas some steps in an experiment may write several files or write to a database. So a wrapper utility is needed to monitor the job status and create a file that make can use to determine dependencies. The loopy command includes this functionality.

Example Makefile Using the Looping Launcher

The following example illustrates the concept on a simple example of 2 input data sets, each of which needs to be segmented before a summary statistics program can be run. We simulated the cluster submit with a program called 'randomFail' which accepts a 'probability of failure' argument so we can test the approach's ability to recover from individual job failures. This makefile would be run on the head node of a cluster.

In this case, the loopy program accepts two important parameters:

--retries tells loopy how many times to try running the job before considering it a failure
-d is the "done file" which tells loopy that its target command has completed successfully

Internally the loopy code (in slicer3 svn here) starts the job and monitors it for completion. Loopy saves the job id associated with a given job in a file named after the "done file" with a .started suffix appended. This allows subsequent invocations of loopy to determine what state the job is in. The .started file is deleted when the job completes successfully.

LOOPY=../../loopy --retries 10
SEGMENTER=./randomFail --p-fail 0.0 --delay 2000
STATISTICS=./randomFail --p-fail 0.0 --delay 2000


all: data/summary-statistics.out
        @echo job complete

data/summary-statistics.out: data/ex1.out data/ex2.out
        @echo making summary
        $(LOOPY) -d data/summary-statistics.out "$(STATISTICS) -o data/summary-statistics.out"

data/ex1.out:
        @echo making ex1
        $(LOOPY) -d data/ex1.out "$(SEGMENTER) -o data/ex1.out"
data/ex2.out:
        @echo making ex2
        $(LOOPY) -d data/ex2.out "$(SEGMENTER) -o data/ex2.out"

clean:
        rm -f data/ex1.out data/ex2.out data/summary-statistics.out
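To make the done-file and .started convention concrete, here is a rough Python sketch of a retrying launcher. It is not the loopy implementation. It runs the command as a local process, stores the process id where loopy would store the job id, and assumes that success means a zero exit code plus an existing done file. The function name run_with_retries is hypothetical.

```python
import os
import subprocess
import sys
import tempfile


def run_with_retries(command, done_file, retries):
    """Run command (an argv list) until done_file exists or attempts run out."""
    started_file = done_file + ".started"
    if os.path.exists(done_file):
        return True  # already completed successfully; do not re-run
    for _ in range(retries):
        proc = subprocess.Popen(command)
        with open(started_file, "w") as marker:
            marker.write(str(proc.pid))  # stand-in for a cluster job id
        proc.wait()
        if proc.returncode == 0 and os.path.exists(done_file):
            os.remove(started_file)  # the marker is removed only on success
            return True
    return False


# Usage: a child process that simply writes the done file.
workdir = tempfile.mkdtemp()
done = os.path.join(workdir, "ex1.out")
writer = [sys.executable, "-c",
          "import sys; open(sys.argv[1], 'w').write('ok')", done]
ok = run_with_retries(writer, done, retries=3)
print(ok, os.path.exists(done + ".started"))
```

A leftover .started file after a failed run is what would let a later invocation notice that the job was started but never finished.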

Utility to help Generate Experiment Makefiles

A set of experiment control helper code (in slicer3 svn here) was prototyped. This code, called ETC for now, allows you to make simple high level looping constructs to build up the dependencies needed for a complex experiment.

In the following example, 50 experimental subjects are generated where each one must complete successfully before the summary statistics can be executed. The ETC code generates a Makefile.ETC which uses the loopy code internally to implement a robust and restartable experiment.


source ../../ETC.tcl

::ETC::SetProject ExampleSegmentation

::ETC::SetWorkingDirectory .
for {set i 0} {$i < 50} {incr i} {
  lappend SubjectList ex$i
}

foreach Subject $SubjectList {
  ::ETC::Schedule -name EM$Subject "./randomFail -d 1000 -p 0.6 -o data/$Subject"
}

::ETC::Schedule -name Summary -depends EM* ""
::ETC::RootTask Summary

::ETC::Generate Makefile.ETC
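For readers who want to see the shape of the generated file, the following Python sketch emits a Makefile with the same structure as the hand-written example for an arbitrary subject list. It is not the ETC code, and the real Makefile.ETC may look different. The function name generate_makefile and its defaults are our own choices, copied from the hand-written Makefile.

```python
def generate_makefile(subjects,
                      loopy="../../loopy --retries 10",
                      segmenter="./randomFail --p-fail 0.0 --delay 2000",
                      statistics="./randomFail --p-fail 0.0 --delay 2000"):
    """Return Makefile text: one rule per subject plus a summary rule."""
    outputs = [f"data/{subject}.out" for subject in subjects]
    summary = "data/summary-statistics.out"
    lines = [
        f"LOOPY={loopy}",
        f"SEGMENTER={segmenter}",
        f"STATISTICS={statistics}",
        "",
        f"all: {summary}",
        "",
        # The summary depends on every per-subject output.
        f"{summary}: {' '.join(outputs)}",
        f'\t$(LOOPY) -d {summary} "$(STATISTICS) -o {summary}"',
        "",
    ]
    for out in outputs:
        lines += [f"{out}:",
                  f'\t$(LOOPY) -d {out} "$(SEGMENTER) -o {out}"',
                  ""]
    lines += ["clean:", f"\trm -f {' '.join(outputs)} {summary}"]
    return "\n".join(lines) + "\n"


makefile_text = generate_makefile([f"ex{i}" for i in range(50)])
print(makefile_text.splitlines()[6][:60])
```

Recipe lines are emitted with a leading tab, which make requires.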

Note: here is a project that implements some similar concepts in Python: [1]

BatchMake

BatchMake allows for large scale experiments to be designed using a scripting language similar to CMake scripts. BatchMake provides a number of looping constructs which can be used to design experiments and parameter searches:

  • foreach
  • sequence
  • randomize
  • fornfold

Here is a BatchMake script to search the parameter space of a median filter:

SetApp(median @'Median Filter')
SetAppOption(median.inputVolume 'c:/projects/I2/Insight/Testing/Data/Input/cthead1.png')

Set(kernels '1,1,1' '2,2,1' '3,3,1' '4,4,1' '5,5,1')
Set(outVolumePrefix 'c:/projects/Temp/Slicer3/median')

foreach(kernel ${kernels})
  RegEx(kernelText ${kernel} ',' REPLACE '_')
  SetAppOption(median.outputVolume ${outVolumePrefix}${kernelText}.png)
  SetAppOption(median.neighborhood ${kernel})

  Run(output ${median})

endforeach(kernel)
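To spell out what the loop above expands to, this Python sketch builds one command line per kernel size. The executable name and the --neighborhood flag are taken from the module's XML description shown on this page. The exact argument order BatchMake produces is an assumption on our part, and expand_runs is a hypothetical helper.

```python
kernels = ["1,1,1", "2,2,1", "3,3,1", "4,4,1", "5,5,1"]
input_volume = "c:/projects/I2/Insight/Testing/Data/Input/cthead1.png"
out_prefix = "c:/projects/Temp/Slicer3/median"


def expand_runs(kernels, input_volume, out_prefix):
    """Return one argv list per kernel, mirroring the foreach body."""
    runs = []
    for kernel in kernels:
        kernel_text = kernel.replace(",", "_")  # RegEx(... ',' REPLACE '_')
        output_volume = f"{out_prefix}{kernel_text}.png"
        runs.append(["MedianImageFilter", "--neighborhood", kernel,
                     input_volume, output_volume])
    return runs


runs = expand_runs(kernels, input_volume, out_prefix)
for argv in runs:
    print(" ".join(argv))
```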

We have extended the ModuleDescription library in Slicer3 to generate a BatchMake XML Application Wrapper from a ModuleDescription object. This allows Slicer3 Command Line Modules to be loaded into BatchMake and used as BatchMake application objects in BatchMake scripts. This code has yet to be integrated into Slicer3 permanently because there are a number of design decisions outstanding. Here is the ModuleDescription XML file that Slicer uses:

<?xml version="1.0" encoding="utf-8"?>
<executable>
  <category>
  Filtering.Denoising
  </category>
  <title>
  Median Filter
  </title>
  <description>
The MedianImageFilter is commonly used as a robust approach for
noise reduction. This filter is particularly efficient against
"salt-and-pepper" noise. In other words, it is robust to the presence
of gray-level outliers. MedianImageFilter computes the value of each output
pixel as the statistical median of the neighborhood of values around the
corresponding input pixel.
  </description>
  <version>0.1.0.$Revision: 2085 $(alpha)</version>
  <documentation-url></documentation-url>
  <license></license>
  <contributor>Bill Lorensen</contributor>
  <acknowledgements>This command module was derived from Insight/Examples/Filtering/MedianImageFilter (copyright) Insight Software Consortium</acknowledgements>
  <parameters>
    <label>Median Filter Parameters</label>
    <description>Parameters for the median filter</description>
    <integer-vector>
      <name>neighborhood</name>
      <longflag>--neighborhood</longflag>
      <description>The size of the neighborhood in each dimension</description>
      <label>Neighborhood Size</label>
      <default>1,1,1</default>
    </integer-vector>
  </parameters>
  <parameters>
    <label>IO</label>
    <description>Input/output parameters</description>
    <image>
      <name>inputVolume</name>
      <label>Input Volume</label>
      <channel>input</channel>
      <index>0</index>
      <description>Input volume to be filtered</description>
    </image>
    <image>
      <name>outputVolume</name>
      <label>Output Volume</label>
      <channel>output</channel>
      <index>1</index>
      <description>Output filtered</description>
    </image>
  </parameters>
</executable>

and here is the resulting BatchMake XML Application Wrapper:

<?xml version="1.0" encoding="utf-8"?>
<BatchMakeApplicationWrapper>
  <BatchMakeApplicationWrapperVersion>1.0</BatchMakeApplicationWrapperVersion>
  <Module>
    <Name>Median Filter</Name>
    <Version>0.1.0.$Revision: 2085 $(alpha)</Version>
    <Path>c:/projects/Slicer3-clean-net2005/bin/RelWithDebInfo/../../lib/Slicer3/Plugins/RelWithDebInfo/MedianImageFilter.exe</Path>
    <Parameters>
      <Param>
        <Type>1</Type>
        <Name>neighborhood.flag</Name>
        <Value>--neighborhood</Value>
        <Parent>0</Parent>
        <External>0</External>
        <Optional>1</Optional>
      </Param>
      <Param>
        <Type>4</Type>
        <Name>neighborhood</Name>
        <Value>1,1,1</Value>
        <Parent>1</Parent>
        <External>0</External>
        <Optional>0</Optional>
      </Param>
      <Param>
        <Type>0</Type>
        <Name>inputVolume</Name>
        <Value></Value>
        <Parent>0</Parent>
        <External>1</External>
        <Optional>0</Optional>
      </Param>
      <Param>
        <Type>0</Type>
        <Name>outputVolume</Name>
        <Value></Value>
        <Parent>0</Parent>
        <External>2</External>
        <Optional>0</Optional>
      </Param>
    </Parameters>
  </Module>
</BatchMakeApplicationWrapper>
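As a rough illustration of the information a wrapper generator has to read from the module description, the sketch below parses an abbreviated copy of the median filter XML with Python's standard ElementTree module. It is not the ModuleDescription library, and it does not attempt to reproduce the numeric Type, Parent and External codes of the wrapper. The function name list_parameters is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the module description, for illustration only.
MODULE_XML = """<executable>
  <title>Median Filter</title>
  <parameters>
    <label>Median Filter Parameters</label>
    <integer-vector>
      <name>neighborhood</name>
      <longflag>--neighborhood</longflag>
      <default>1,1,1</default>
    </integer-vector>
  </parameters>
  <parameters>
    <label>IO</label>
    <image>
      <name>inputVolume</name>
      <channel>input</channel>
      <index>0</index>
    </image>
    <image>
      <name>outputVolume</name>
      <channel>output</channel>
      <index>1</index>
    </image>
  </parameters>
</executable>"""


def list_parameters(xml_text):
    """Collect type, name, flag, default and channel of every parameter."""
    root = ET.fromstring(xml_text)
    params = []
    for group in root.findall("parameters"):
        for elem in group:
            if elem.find("name") is None:
                continue  # skip the group's own <label>/<description>
            params.append({
                "type": elem.tag,
                "name": elem.findtext("name"),
                "flag": elem.findtext("longflag"),
                "default": elem.findtext("default"),
                "channel": elem.findtext("channel"),
            })
    return params


params = list_parameters(MODULE_XML)
for param in params:
    print(param)
```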

BatchMake and the Computing Infrastructure

  • What is needed to make BatchMake submit to a cluster?
  • To a grid?

BatchMake and Job Control

  1. Can BatchMake terminate a job?
    • Yes
  2. Can BatchMake resubmit a job until it completes successfully?
    • Even better: it works with Condor's DAG utility to produce a directed acyclic graph of operations on your data, based on your BatchMake script. That is, no additional work is needed on your part to ensure effective, deterministic processing of your data on a grid.

BatchMake and Experiment Control

Can BatchMake interrupt, continue, and rerun an experiment?

  • Much of this is inherent in Condor and has been extended by BatchMake's use of Condor's DAG utility and BatchMake's condorWatcher. Furthermore, if an executable is compiled with Condor's library (no code changes needed), a job can even be moved from one node on a grid to another without having to recompute: Condor's library provides a core-dump/load-like facility for stopping and starting processes without losing intermediate results.
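To give a feel for what a Condor DAG description looks like, the following Python sketch writes one for the two-segmentation-plus-summary example used earlier on this page. It uses the DAGMan keywords JOB, RETRY and PARENT ... CHILD. The job names and the .submit file names are hypothetical, and this is not necessarily the file BatchMake itself generates.

```python
def write_dag(segment_jobs, summary_job, retries):
    """Return DAGMan text: segmentations first, then the summary job."""
    lines = []
    for name in segment_jobs + [summary_job]:
        lines.append(f"JOB {name} {name}.submit")  # hypothetical submit files
        lines.append(f"RETRY {name} {retries}")    # resubmit on failure
    # The summary may only run after every segmentation has succeeded.
    lines.append(f"PARENT {' '.join(segment_jobs)} CHILD {summary_job}")
    return "\n".join(lines) + "\n"


dag_text = write_dag(["ex1", "ex2"], "summary", retries=10)
print(dag_text)
```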