Writing pipelines

In this chapter, you will learn how to write pipelines for Compi. A pipeline is composed of (i) a set of tasks, some of which can depend on others, and (ii) a set of parameters that the user provides when running the pipeline.

Once your pipeline is defined, Compi provides you with a powerful multithreaded running environment and a command-line user interface to run it.

The XML file for pipelines

Pipelines are defined in an XML file, where tasks and parameters are declared. Task code is written in Bash by default, although other languages can be used. The pipeline version must also be specified in the version tag.

Tasks can declare dependencies on other tasks. For example, a task generate-report (which creates an HTML summary of an analysis) depends on analyze-data (which runs an R script over data), because the reporting task formats the file generated by the analysis task. Therefore, analyze-data must run before generate-report.
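As a minimal sketch of the dependency described above, the two tasks could be declared as follows. The script and file names (analysis.R, make-report.sh, data.csv, results.txt, report.html) are hypothetical and only serve as illustration; the after attribute is what expresses the dependency.

```xml
<tasks>
    <!-- Hypothetical analysis step: runs an R script over the data -->
    <task id="analyze-data">
        Rscript analysis.R data.csv > results.txt
    </task>
    <!-- Runs only after analyze-data has finished -->
    <task id="generate-report" after="analyze-data">
        ./make-report.sh results.txt > report.html
    </task>
</tasks>
```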

A pipeline example

Here is a pipeline example showing the main features of Compi.

<?xml version="1.0" encoding="UTF-8"?>
<pipeline xmlns="http://www.sing-group.org/compi/pipeline-1.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <version>1.0</version>
    <params>
        <param name="name" shortName="n">Your name</param>
        <param name="output" shortName="o">Output file</param>
    </params>

    <tasks>
        <task id="greetings" params="name output">
            echo "Hi ${name}" > ${output}
        </task>
        <task id="bye" after="greetings" params="name output"
          interpreter="/usr/bin/perl -e &quot;${task_code}&quot;">
            my $filename = $ENV{'output'};
            open(my $fh, '>>', $filename) or die "Could not open file '$filename' $!";
            print $fh "bye ".$ENV{'name'}."\n"
        </task>
    </tasks>

    <!-- optional part -->
    <metadata>
      <task-description id="greetings">A task to greet you!</task-description>
      <task-description id="bye">A task for saying goodbye!</task-description>
    </metadata>
</pipeline>

This small example defines a pipeline with two tasks (greetings and bye) and two parameters (name and output). The first task writes a greeting with the name provided in the name parameter and saves it to the output file given in the output parameter. The second task is a Perl task that appends a goodbye message to the same output file.

Defining pipeline parameters

Pipeline parameters are options that the user can set when running a pipeline. Tasks receive these parameters as environment variables. For example, consider the following pipeline:

<?xml version="1.0" encoding="UTF-8"?>
<pipeline xmlns="http://www.sing-group.org/compi/pipeline-1.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <version>1.0</version>
    <params>
        <param name="yourName" shortName="">Your name</param>
    </params>
    <tasks>
        <task id="greetings" params="yourName">
            echo "Hi ${yourName}"
        </task>
    </tasks>
</pipeline>

In this example, the pipeline defines a parameter yourName, and the greetings task makes use of it.

The <param> and <flag> elements

The <param> and <flag> elements each define a parameter of the pipeline. A <param> can have a default value and can be global, that is, every task has access to it without listing it in the params attribute of the <task> element (see Defining tasks). A <flag> is a special type of parameter that does not require a value: it is either present or absent as an environment variable, with no concrete value. A <flag> cannot have a default value.

Attribute | Description | Mandatory
name | A name for the parameter. To be compatible with environment variable names, this name can only contain letters, digits, and underscores, and cannot start with a digit. Parameter names must be unique in the file. | YES
shortName | An alternative, and normally shorter, name for the parameter. Parameter short names must be unique in the file. | NO
defaultValue | A default value for the parameter. | NO
global | A boolean value (“true” or “false”) indicating whether this parameter is global. Global parameters are always passed to all tasks, without the need to list them in the params attribute of every task. | NO

Here is an example of a parameters section:

<?xml version="1.0" encoding="UTF-8"?>
<pipeline xmlns="http://www.sing-group.org/compi/pipeline-1.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <version>1.0</version>
    <params>
        <param name="yourName" shortName="n" global="true" defaultValue="anonymous">Your name</param>
        <flag name="sayGoodBye" shortName="g">Do you want to say goodbye?</flag>
    </params>
    <tasks>
        <task id="greetings">
            echo "Hi ${yourName}"
        </task>
        <task id="goodbye"
              params="yourName sayGoodBye" if="[ -v sayGoodBye ]"
              after="greetings">
            echo "Goodbye ${yourName}"
        </task>
    </tasks>
</pipeline>

Defining tasks

Simple tasks: the <task> element

Tasks are defined inside the <tasks> element. A <task> element contains a piece of runnable code (in Bash by default). Alternatively, the code can be loaded from the file specified in the src attribute, whose location is relative to the pipeline XML file. When the task runs, parameters are passed to it as environment variables.

In addition, <task> elements contain the following attributes:

Attribute | Description | Mandatory
id | The ID for the task. This must be a valid NCName. | YES
after | List of tasks that must finish before this task can start. Task IDs can be separated by whitespace or commas. | NO
params | List of parameters that this task will use. Parameters cannot be referenced by their shortName. Only global parameters and those listed here are passed to the task. Values should be separated by whitespace. | NO
interpreter | A command to be run instead of the task code, which can be used to interpret the task code. See Custom interpreters. | NO
if | A command to be run just before the task is about to run. If the command’s return status is different from 0, the task will be skipped. | NO
src | The location of the file (relative to the pipeline XML file) that contains the task code. | NO
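The following sketch combines several of the attributes listed above in one pipeline. It is an illustration under assumptions, not an example taken from Compi itself: the input parameter, the scripts/download.sh file, and the sorted.txt file are hypothetical names.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<pipeline xmlns="http://www.sing-group.org/compi/pipeline-1.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <version>1.0</version>
    <params>
        <param name="input" shortName="i">Input file</param>
    </params>
    <tasks>
        <!-- Task code loaded from a hypothetical script, relative to this XML file -->
        <task id="download" src="scripts/download.sh"></task>
        <!-- Skipped unless the if command returns 0, i.e. the input file exists -->
        <task id="preprocess" params="input" if="[ -f ${input} ]">
            sort ${input} > sorted.txt
        </task>
        <!-- Waits for both tasks above; the IDs are separated by a comma -->
        <task id="summarize" after="download,preprocess">
            wc -l sorted.txt
        </task>
    </tasks>
</pipeline>
```

A pipeline like this can be checked with compi validate before running it, as described in Validating a pipeline.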

Parallel iterative tasks: the <foreach> element

Foreach tasks are a special type of task. When a foreach task is run, its code is launched several times in parallel over a collection of elements.

There are several types of collections to iterate over: a list of values, a range of numbers, a set of files from a directory, the output lines of a Bash command, and so on.

Attribute | Description | Mandatory
of | The type of collection to iterate over. Possible values: list (a comma-separated list of values); range (a number interval specified as <low>:<high>, e.g. “1:10”); file (all files under a given directory, recursively); param (the name of a parameter whose value is a comma-separated list of values); command (a command whose output lines are the values to iterate over). | YES
in | The source of the collection elements to iterate over. You can reference pipeline parameters here as ${parameter}; they are replaced with their actual values. | YES
as | Name of the loop parameter to use in the task code. | YES
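As a small sketch of how the of, in, and as attributes fit together, the fragment below declares one range foreach and one list foreach with fixed values. The task IDs and the echoed messages are made up for illustration.

```xml
<tasks>
    <!-- Iterates over the number interval 1:4; each value is available as ${chunk} -->
    <foreach id="process-chunks" of="range" in="1:4" as="chunk">
        echo "Processing chunk ${chunk}"
    </foreach>
    <!-- Iterates over a fixed comma-separated list; each value is available as ${person} -->
    <foreach id="greet" of="list" in="Alice,Bob,Carol" as="person">
        echo "Hi ${person}"
    </foreach>
</tasks>
```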

param example

If you want a param foreach that iterates over all items of a given collection, and the collection is a pipeline parameter, then the in attribute must be specified as follows: in="${parameter}".

Try the foreach-items.xml pipeline with:

compi run -o -p foreach-items.xml -- --items_list "A, B, C"

A list foreach works the same way, but the list of items is a fixed value.

file example

If you want a file foreach that iterates over all files under a given directory, and the source directory is a pipeline parameter, then the in attribute must be specified as follows: in="${parameter}".

Try the foreach-file-in-data-dir.xml pipeline with:

compi run -o -p foreach-file-in-data-dir.xml -- --data_dir "/path/to/dir"
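The contents of foreach-file-in-data-dir.xml are not reproduced here. As a rough sketch, a pipeline accepting the --data_dir option used above might look like the following; the task ID, the loop variable name, and the task code are assumptions, and the exact value held by ${file} (full path or file name) should be checked against a real run.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<pipeline xmlns="http://www.sing-group.org/compi/pipeline-1.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <version>1.0</version>
    <params>
        <param name="data_dir" shortName="d">Directory with the input files</param>
    </params>
    <tasks>
        <!-- One iteration per file found under the directory given in data_dir -->
        <foreach id="show-file" of="file" in="${data_dir}" as="file">
            echo "Processing file: ${file}"
        </foreach>
    </tasks>
</pipeline>
```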

command example

In a command foreach, the in attribute contains a command whose output lines are the values to iterate over (each line is one element).

Use this ZIP file to run the provided pipeline with:

compi run -o -p foreach-command.xml -- --file_with_items foreach-command-input.txt
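The foreach-command.xml file itself is distributed in the ZIP file and is not shown here. A minimal sketch of a pipeline compatible with the --file_with_items option above, assuming the input file holds one item per line and using cat as the command, could be:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<pipeline xmlns="http://www.sing-group.org/compi/pipeline-1.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <version>1.0</version>
    <params>
        <param name="file_with_items" shortName="f">File with one item per line</param>
    </params>
    <tasks>
        <!-- Each output line of the cat command becomes one iteration -->
        <foreach id="print-item" of="command" in="cat ${file_with_items}" as="item">
            echo "Item: ${item}"
        </foreach>
    </tasks>
</pipeline>
```

The task ID and the loop variable name are illustrative; any command that prints one element per line could replace cat.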

Iteration dependencies between foreach tasks

You can define an “iteration dependency” between two foreach tasks, so that the first iteration of the dependent foreach waits only for the first iteration of the foreach it depends on, the second waits only for the second, and so on. For example:

<!-- samples is a parameter with values such as
"case-1,case-2,control-1,control-2" -->
<foreach id="preprocess" of="param" in="samples" as="sample">
  preprocess.sh ${sample}.csv > ${sample}.preprocessed.csv
</foreach>
<foreach id="analyze" of="param" in="samples" as="sample" after="*preprocess">
  analyze.sh ${sample}.preprocessed.csv
</foreach>

Please note the * character in after="*preprocess", which indicates that the iterations of the second foreach will wait only for their respective iteration of the first foreach.

Note

It is mandatory that all foreach tasks have the same number of iterations if you want to establish an “iteration dependency” between them.

Defining tasks metadata

You can optionally define task metadata to describe the objectives of each task, which allows Compi to generate user documentation.

Task metadata is defined inside the <metadata> element. A <task-description> element contains a brief description of the task objectives. Its id attribute indicates the task that the description refers to.

<?xml version="1.0" encoding="UTF-8"?>
<pipeline xmlns="http://www.sing-group.org/compi/pipeline-1.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <!-- ... -->

    <!-- optional part -->
    <metadata>
      <task-description id="greetings">A task to greet you!</task-description>
      <task-description id="bye">A task for saying goodbye!</task-description>
    </metadata>

</pipeline>

Validating a pipeline

Run the following command to validate the pipeline.xml file:

compi validate -p pipeline.xml

Viewing the pipeline as a graph

Run the following command to export the graph defined by the pipeline.xml pipeline as an image.

compi export-graph -p pipeline.xml -o pipeline.png -f png

If you also want to draw the parameters in the graph, try the --draw-task-params or --draw-pipeline-params options.