Writing pipelines
*****************
In this chapter, you will learn how to write pipelines for Compi. A pipeline
is composed of (*i*) a set of tasks, some of which may depend on others, and
(*ii*) a set of parameters that the user provides to run the pipeline.
Once your pipeline is defined, Compi provides a multithreaded running
environment and a command-line user interface to run it.

The XML file for pipelines
==========================
Pipelines are defined in an XML file, where tasks and parameters are declared.
Task code is written in Bash by default, although other languages can be used
(see :ref:`custom_interpreters`). The pipeline version must also be specified
in the ``version`` element.

Tasks can depend on other tasks. For example, a task ``generate-report``
(which creates an HTML summary of an analysis) depends on ``analyze-data``
(which runs an R script over the data) because the reporting task formats the
file generated by the analysis task. Therefore, ``analyze-data`` must run
before ``generate-report``.
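
Such a dependency could be declared with the ``after`` attribute (described in
`Defining tasks`_), roughly as in the following sketch. The script and file
names (``analysis.R``, ``results.csv``, ``make-report.sh``, ``report.html``)
are hypothetical and only illustrate the idea:

.. code-block:: xml

   <tasks>
     <!-- Runs first: produces the file that the report needs -->
     <task id="analyze-data">
       Rscript analysis.R > results.csv
     </task>
     <!-- Starts only after analyze-data has finished -->
     <task id="generate-report" after="analyze-data">
       ./make-report.sh results.csv > report.html
     </task>
   </tasks>
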
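
A pipeline example
==================

Here is a pipeline example showing the main features of Compi.

.. code-block:: xml

   <pipeline xmlns="http://www.sing-group.org/compi/pipeline-1.0">
     <version>1.0</version>
     <params>
       <param name="name" shortName="n">Your name</param>
       <param name="output" shortName="o">Output file</param>
     </params>
     <tasks>
       <task id="greetings" params="name output">echo "Hi ${name}" > ${output}</task>
       <task id="bye" after="greetings" params="name output"
             interpreter="/usr/bin/perl -e &quot;${task_code}&quot;">
         my $filename = $ENV{'output'};
         open(my $fh, '>>', $filename) or die "Could not open file '$filename' $!";
         print $fh "bye ".$ENV{'name'}."\n"
       </task>
     </tasks>
     <metadata>
       <task-description id="greetings">A task to greet you!</task-description>
       <task-description id="bye">A task for saying goodbye!</task-description>
     </metadata>
   </pipeline>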

This small example defines a pipeline with two tasks (``greetings`` and
``bye``) and two parameters (``name`` and ``output``). The first task writes a
greeting with the name provided in the ``name`` parameter and saves it to the
output file given in the ``output`` parameter. The second task is a Perl task
that appends a goodbye message to the same output file.

Defining pipeline parameters
============================
Pipeline parameters are options that the user can set when running a pipeline.
Tasks receive the parameters they use as **environment variables**. For
example, consider the following pipeline:

.. code-block:: xml

   <pipeline xmlns="http://www.sing-group.org/compi/pipeline-1.0">
     <version>1.0</version>
     <params>
       <param name="yourName" shortName="n">Your name</param>
     </params>
     <tasks>
       <task id="greetings" params="yourName">echo "Hi ${yourName}"</task>
     </tasks>
   </pipeline>

In this example, the pipeline defines a parameter ``yourName``, which the
``greetings`` task uses.

The ``<param>`` and ``<flag>`` elements
---------------------------------------
The ``<param>`` and ``<flag>`` elements each define a parameter of the pipeline.
A ``<param>`` can have a default value and can be global, that is, every task
has access to it without listing it in the ``params`` attribute of the
``<task>`` element (see `Defining tasks`_). A ``<flag>`` is a special type of
parameter that takes no value: it is either present or absent as an
environment variable, with no concrete value. A ``<flag>`` cannot have a
default value.

+--------------+--------------------------------------------------+-----------+
| Attribute    | Description                                      | Mandatory |
+==============+==================================================+===========+
| name         | A name for the parameter. To be compatible with  | YES       |
|              | environment variable names, this name can only   |           |
|              | contain letters, digits, underscores, and can    |           |
|              | not start with a digit.                          |           |
|              | Parameter names must be unique in the file.      |           |
+--------------+--------------------------------------------------+-----------+
| shortName    | An alternative, and normally shorter, name for   | NO        |
|              | the parameter.                                   |           |
|              | Parameter short names must be unique in the      |           |
|              | file.                                            |           |
+--------------+--------------------------------------------------+-----------+
| defaultValue | A default value for the parameter.               | NO        |
+--------------+--------------------------------------------------+-----------+
| global       | A boolean value ("true" or "false") indicating   | NO        |
|              | that this parameter is global. Global parameters |           |
|              | are always passed to all tasks, without the need |           |
|              | of specifying them in the ``params`` attribute   |           |
|              | of every task.                                   |           |
+--------------+--------------------------------------------------+-----------+
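
Here is an example of a parameters section:

.. code-block:: xml

   <pipeline xmlns="http://www.sing-group.org/compi/pipeline-1.0">
     <version>1.0</version>
     <params>
       <param name="yourName" shortName="n" global="true">Your name</param>
       <flag name="sayGoodbye" shortName="g">Do you want to say goodbye?</flag>
     </params>
     <tasks>
       <task id="greetings">echo "Hi ${yourName}"</task>
       <task id="bye" after="greetings" params="sayGoodbye"
             if="[ -v sayGoodbye ]">echo "Goodbye ${yourName}"</task>
     </tasks>
   </pipeline>

Defining tasks
==============

Simple tasks: the ``<task>`` element
------------------------------------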
Tasks are defined inside the ``<tasks>`` element. A ``<task>`` element contains
a piece of runnable code (written in Bash by default). Alternatively, the code
can be loaded from the file specified in the ``src`` attribute, whose
location is relative to the pipeline XML file. When the task runs, parameters
**are passed as environment variables**.

In addition, ``<task>`` elements accept the following attributes:

+--------------+-------------------------------------------------+-----------+
| Attribute    | Description                                     | Mandatory |
+==============+=================================================+===========+
| id           | The ID for the task. This must be a valid       | YES       |
|              | NCName_.                                        |           |
+--------------+-------------------------------------------------+-----------+
| after        | List of tasks that should end before this task  | NO        |
|              | can be started. The list can be separated by    |           |
|              | whitespaces or commas.                          |           |
+--------------+-------------------------------------------------+-----------+
| params       | List of parameters that this task will use. The | NO        |
|              | parameters cannot be identified by their        |           |
|              | shortName.                                      |           |
|              | Only global parameters and those indicated here |           |
|              | are passed to the task.                         |           |
|              | Values should be separated by whitespaces.      |           |
+--------------+-------------------------------------------------+-----------+
| interpreter  | A command to be run instead of the task code,   | NO        |
|              | which can be used to interpret the task code.   |           |
|              | See :ref:`custom_interpreters`.                 |           |
+--------------+-------------------------------------------------+-----------+
| if           | A command to be run just before the task is     | NO        |
|              | about to run. If the command's return status    |           |
|              | is different from 0, the task will be skipped.  |           |
+--------------+-------------------------------------------------+-----------+
| src          | The location of the file (relative to the       | NO        |
|              | pipeline XML file) that contains the task code. |           |
+--------------+-------------------------------------------------+-----------+
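
As an illustration of the ``if`` and ``src`` attributes, the sketch below
declares a task whose code lives in an external script and which is skipped
when its input file does not exist. The parameter name ``input``, the task ID
and the path ``scripts/analyze.sh`` are hypothetical, and the sketch assumes
that the ``if`` command can read the task parameters as environment variables:

.. code-block:: xml

   <params>
     <param name="input" shortName="i">Input file</param>
   </params>
   <tasks>
     <!-- Code is loaded from scripts/analyze.sh (relative to the pipeline XML).
          The task is skipped if the test command returns a non-zero status. -->
     <task id="analyze" params="input" if="[ -f ${input} ]"
           src="scripts/analyze.sh"/>
   </tasks>
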

Parallel iterative tasks: the ``<foreach>`` element
---------------------------------------------------

A special type of task is the *foreach* task. When a *foreach* task is run,
its code is launched several times in parallel over a collection of elements.
There are several types of collection to iterate over: a list of values, a
range of numbers, a set of files from a directory, the output lines of a Bash
command, etc.

+--------------+-------------------------------------------------+-----------+
| Attribute    | Description                                     | Mandatory |
+==============+=================================================+===========+
| of           | The type of collection to iterate over. The     | YES       |
|              | possible values are:                            |           |
|              |                                                 |           |
|              | * ``list``: a comma-separated list of values    |           |
|              | * ``range``: a number interval specified as     |           |
|              |   ``start:end``, e.g. ``1:10``                  |           |
|              | * ``file``: all files under a given directory   |           |
|              |   (recursively)                                 |           |
|              | * ``param``: the name of a parameter whose      |           |
|              |   value is a comma-separated list of values     |           |
|              | * ``command``: a command whose output lines     |           |
|              |   are the values to iterate over                |           |
+--------------+-------------------------------------------------+-----------+
| in           | The source from which the collection elements   | YES       |
|              | are taken. Pipeline parameters can be used here |           |
|              | as ``${parameter}``; they are replaced with     |           |
|              | their actual values.                            |           |
+--------------+-------------------------------------------------+-----------+
| as           | Name of the loop parameter to use in the task   | YES       |
|              | code.                                           |           |
+--------------+-------------------------------------------------+-----------+
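
For instance, a ``range`` foreach might look like the following sketch. The
task ID and the loop parameter name ``i`` are made up for illustration, and
the sketch makes no assumption about whether the upper bound is included:

.. code-block:: xml

   <tasks>
     <!-- One iteration per number in the interval; ${i} holds the current value -->
     <foreach id="count" of="range" in="1:10" as="i">
       echo "Iteration ${i}"
     </foreach>
   </tasks>
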

``param`` example
^^^^^^^^^^^^^^^^^

If you want a ``param`` foreach that iterates over all the items of a
collection, and the collection is given as a pipeline parameter, the ``in``
attribute must reference it as follows: ``in="${parameter}"``.

Try the `foreach-items.xml <_static/resources/foreach-items.xml>`_ pipeline with:

.. code-block:: shell

   compi run -o -p foreach-items.xml -- --items_list "A, B, C"

A ``list`` foreach works the same way, but the list of items is a fixed value.
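
A minimal sketch of a ``list`` foreach, with a hypothetical task ID and loop
parameter name, where the items are written directly in the ``in`` attribute:

.. code-block:: xml

   <tasks>
     <!-- Iterates over the fixed values A, B and C -->
     <foreach id="print-items" of="list" in="A,B,C" as="item">
       echo "Item: ${item}"
     </foreach>
   </tasks>
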

``file`` example
^^^^^^^^^^^^^^^^

If you want a ``file`` foreach that iterates over all files under a given
directory, and the source directory is a pipeline parameter, the ``in``
attribute must reference it as follows: ``in="${parameter}"``.

Try the `foreach-file-in-data-dir.xml <_static/resources/foreach-file-in-data-dir.xml>`_ pipeline with:

.. code-block:: shell

   compi run -o -p foreach-file-in-data-dir.xml -- --data_dir "/path/to/dir"
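
A pipeline run this way could be structured along the following lines. This is
an illustrative sketch, not the contents of ``foreach-file-in-data-dir.xml``:
only the ``data_dir`` parameter name is taken from the command above, while the
task ID, the loop parameter name and the task code are assumptions:

.. code-block:: xml

   <params>
     <param name="data_dir" shortName="d">Directory with the input files</param>
   </params>
   <tasks>
     <!-- One iteration per file found under the directory given in data_dir -->
     <foreach id="process-files" of="file" in="${data_dir}" as="file"
              params="data_dir">
       echo "Processing ${file}"
     </foreach>
   </tasks>
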

``command`` example
^^^^^^^^^^^^^^^^^^^

In a ``command`` foreach, the ``in`` attribute holds a command whose output
lines are the values to iterate over (each line is one element).

Use `this ZIP file <_static/resources/foreach-command-input.zip>`_ to run the provided pipeline with:

.. code-block:: shell

   compi run -o -p foreach-command.xml -- --file_with_items foreach-command-input.txt
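
As a rough idea of how such a pipeline might be written, the sketch below reads
the items with ``cat``. It is not the contents of ``foreach-command.xml``: only
the ``file_with_items`` parameter name comes from the command above, and the
task ID, the loop parameter name and the task code are assumptions:

.. code-block:: xml

   <params>
     <param name="file_with_items" shortName="f">File with one item per line</param>
   </params>
   <tasks>
     <!-- Each output line of the command becomes one iteration -->
     <foreach id="process-items" of="command" in="cat ${file_with_items}"
              as="item" params="file_with_items">
       echo "Item: ${item}"
     </foreach>
   </tasks>
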

Iteration dependencies between ``foreach`` tasks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can define an *iteration dependency* between two ``foreach`` tasks, so that
each iteration of the dependent ``foreach`` waits only for the corresponding
iteration of the ``foreach`` it depends on. For example:

.. code-block:: xml

   <foreach id="preprocess" of="list" in="sample1,sample2,sample3" as="sample">
     preprocess.sh ${sample}.csv > ${sample}.preprocessed.csv
   </foreach>
   <foreach id="analyze" after="*preprocess" of="list"
            in="sample1,sample2,sample3" as="sample">
     analyze.sh ${sample}.preprocessed.csv
   </foreach>

Note the ``*`` character in ``after="*preprocess"``, which indicates that each
iteration of the second ``foreach`` waits only for its respective iteration of
the first ``foreach``.

.. note::

   All ``foreach`` tasks linked by an iteration dependency must have the same
   number of iterations.

Defining tasks metadata
=======================

To describe the purpose of each task, so that Compi can generate user
documentation, you can optionally define task metadata.

Task metadata is defined inside the ``<metadata>`` element. A
``<task-description>`` element contains a brief description of what the task
does. Its ``id`` attribute indicates the task being described.

.. code-block:: xml

   <metadata>
     <task-description id="greetings">A task to greet you!</task-description>
     <task-description id="bye">A task for saying goodbye!</task-description>
   </metadata>

.. _NCName: http://www.datypic.com/sc/xsd/t-xsd_ID.html

Validating a pipeline
=====================

Run the following command to validate the ``pipeline.xml`` file:

.. code-block:: bash

   compi validate -p pipeline.xml

Viewing the pipeline as a graph
===============================

Run the following command to export the graph defined by the ``pipeline.xml``
pipeline as an image:

.. code-block:: bash

   compi export-graph -p pipeline.xml -o pipeline.png -f png

.. figure:: images/writing/pipeline.png
   :align: center

To also draw the task parameters, use the ``--draw-task-params`` or
``--draw-pipeline-params`` options.