Custom runners
**************
What are Compi `runners`
========================
By default, Compi runs task code by spawning local processes. With `runners`,
task code is handed to custom-made scripts which are in charge of running
it, for example, by submitting a job to a queue (e.g. Slurm, SGE) or by using
Docker images.
Runners are passed to the main ``compi run`` command using the ``-r``
parameter.
Creating a custom runner
========================
Like pipelines, runners are defined in XML. Individual runners are defined
using the ``runner`` tag inside the ``runners`` tag. The ``tasks`` attribute
is used to specify the list of tasks (comma-separated) that the corresponding
runner must execute.
.. code-block:: xml

    <runners>
        <runner tasks="task-1,task-2">
            /bin/sh -c "${task_code}"
        </runner>
    </runners>
The runner code will have the following environment variables provided by Compi:

- ``task_id``: contains the id of the task being executed.
- ``task_code``: contains the code (defined in the ``pipeline.xml``) of the task being executed.
- ``task_params``: contains the list of parameters associated with the task being executed.
- ``i``: in the case of ``foreach`` tasks, the iteration value.
- As in regular Compi tasks, the task parameter variables are also defined.
A simple example
----------------
Consider the following XML of the greetings pipeline:
.. code-block:: xml

    <pipeline>
        <version>1.0</version>
        <params>
            <param name="yourName">Your name</param>
            <flag name="sayGoodBye">Do you want to say goodbye?</flag>
        </params>
        <tasks>
            <task id="greetings" params="yourName">echo "Hi ${yourName}"</task>
            <task id="goodbye" after="greetings" if="[ -v sayGoodBye ]" params="yourName sayGoodBye">echo "Goodbye ${yourName}"</task>
        </tasks>
    </pipeline>
And the following runners file where one runner is defined for the two
pipeline tasks:
.. code-block:: xml

    <runners>
        <runner tasks="greetings,goodbye">
            echo -e "[${task_id}] \n\tyourName: ${yourName} \n\tcode: ${task_code} \n\tparams: ${task_params}" >> /tmp/runner-output
            /bin/sh -c "${task_code}"
        </runner>
    </runners>
This runner prints task information (using the runner environment variables)
to a file (``/tmp/runner-output``) and then runs the task using the shell
interpreter. This example can be executed with:
.. code-block:: console

    compi run -p pipeline.xml -r runner.xml -o -- --sayGoodBye
    cat /tmp/runner-output
Examples of useful runners
==========================
Generic Docker runner
---------------------
Let's suppose the following pipeline with one task to align a FASTA file using
Clustal Omega:
.. code-block:: xml

    <pipeline>
        <version>1.0</version>
        <params>
            <param name="workingDir">Working directory.</param>
            <param name="input">Input file.</param>
            <param name="output">Output file.</param>
            <param name="clustalomega">Clustal Omega executable.</param>
        </params>
        <tasks>
            <task id="align" params="workingDir input output clustalomega">${clustalomega} -i ${workingDir}/${input} -o ${workingDir}/${output}</task>
        </tasks>
    </pipeline>
One may want to run this task using a Docker runner which runs the same task
code inside a Docker container where the Clustal Omega executable is available.
The following runners file shows a runner to do this:
.. code-block:: xml

    <runners>
        <runner tasks="align">
            envs=$(for param in $task_params; do echo -n "-e $param "; done)
            docker run --rm $envs -v ${workingDir}:${workingDir} --entrypoint /bin/bash pegi3s/clustalomega -c "${task_code}"
        </runner>
    </runners>
The key points of this generic Docker runner are:

- The first line builds a variable with the list of parameters that should be passed to the Docker container as environment variables.
- The second line runs the Docker image, passing this list of environment variables and mounting the directory that contains the input and output files of the command.
- Since this particular Clustal Omega image has an entrypoint defined, it must be overridden in order to run the desired task code.
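The expansion performed by the first runner line can be checked in isolation. The following sketch simulates the ``task_params`` variable with the four parameter names of the pipeline above (stand-in values, not provided by a real Compi run):

.. code-block:: shell

    # Simulate the task_params variable that compi would provide (assumed values)
    task_params="workingDir input output clustalomega"

    # Same expansion as the first line of the runner: one -e flag per parameter
    envs=$(for param in $task_params; do echo -n "-e $param "; done)
    echo "$envs"

Each parameter name becomes a ``-e name`` flag, so ``docker run`` forwards the corresponding environment variables (already exported by Compi) into the container.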
Generic Slurm runner
--------------------
The following runners file shows a generic Slurm runner:
.. code-block:: xml

    <runners>
        <runner tasks="align">
            tmpfile=$(mktemp /tmp/compi-task-code.XXXXXXXX)
            echo "#!/bin/bash" >> ${tmpfile}
            echo ${task_code} >> ${tmpfile}
            chmod u+x ${tmpfile}
            srun -c 1 -p main --export ALL -o /tmp/task-1.log -e /tmp/task-1.err -J task_1 bash ${tmpfile}
        </runner>
    </runners>
Some parameters of ``srun`` may need to be adjusted for each specific
cluster, but this is what a generic Slurm runner may look like. The
``--export`` parameter must be used to export all the environment variables to
the process that will be executed; this is necessary because the task
parameters are declared as environment variables.
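The script-file trick used by this runner can be tried locally by replacing ``srun`` with a plain ``bash`` invocation. Here ``task_code`` and ``yourName`` are stand-in values for what Compi would provide:

.. code-block:: shell

    # Stand-in values for what compi would provide to the runner
    export yourName="World"
    task_code='echo "Hi ${yourName}"'

    # Write the task code into a temporary, executable script
    tmpfile=$(mktemp /tmp/compi-task-code.XXXXXXXX)
    echo "#!/bin/bash" >> ${tmpfile}
    echo ${task_code} >> ${tmpfile}
    chmod u+x ${tmpfile}

    # On a cluster this line would be: srun ... bash ${tmpfile}
    bash ${tmpfile}   # prints: Hi World
    rm ${tmpfile}

Because the task parameters are exported environment variables, the generated script can reference them (``${yourName}``) without any extra plumbing.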
Generic SSH runner
--------------------
The following runners file shows a generic SSH runner, which executes the task code on a given SSH host.
A trust relationship between the client machine (where Compi runs) and the remote host is assumed (e.g. passwordless login with SSH public-key authentication).
.. code-block:: xml

    <runners>
        <runner tasks="align"><![CDATA[
            remote_host="192.168.1.108" # set here the remote machine
            remote_user="lipido" # set here the remote user

            # copy the compi environment to a file
            envfile=$(mktemp /tmp/compi-env.XXXXXX)
            for param in $task_params; do
                export -p | sed -n -e "/^declare -x $param/,/^declare -x/ p" | sed \$d >> $envfile
                echo "export $param" >> $envfile
            done
            scp $envfile ${remote_user}@${remote_host}:${envfile}

            task_code_with_env="source $envfile; $task_code"
            ssh ${remote_user}@${remote_host} "$task_code_with_env"
        ]]></runner>
    </runners>
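The environment hand-off can be verified without a remote host by sourcing the generated file in a fresh shell. The sketch below builds, for a single stand-in parameter (``yourName``), an env file equivalent to what the ``export``/``sed`` pipeline produces, and runs a stand-in task code the way ``ssh`` would:

.. code-block:: shell

    # Env file equivalent to the runner's export/sed output for one parameter
    envfile=$(mktemp /tmp/compi-env.XXXXXX)
    echo 'declare -x yourName="World"' >> $envfile
    echo 'export yourName' >> $envfile

    # Run the task code in a fresh shell, as the runner does over ssh
    task_code='echo "Hi ${yourName}"'
    task_code_with_env="source $envfile; $task_code"
    bash -c "$task_code_with_env"   # prints: Hi World
    rm $envfile

Prefixing the task code with ``source $envfile`` is what makes the Compi parameter variables visible on the remote side.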
Generic AWS runner
--------------------
Based on the previous generic SSH runner, here is a more complex one. This runner runs the task code over SSH on an Amazon Linux virtual machine.
In order to do that, the runner is in charge of creating the instance, if it is not already available, and of waiting for the SSH protocol to become available.
After that, the task code is run on the instance via SSH. This runner uses the ``flock`` utility to ensure that only one execution of the runner
launches the Amazon instance, while the other executions only run the SSH part.
.. code-block:: xml

    <runners>
        <runner tasks="align"><![CDATA[
            image_id="ami-xxxxxxxx" # set here the AMI to use
            remote_user="ec2-user" # set here the remote user
            private_key_file="/path/to/key.pem" # set here the private key file
            lock_file="/tmp/compi-aws-${image_id}.lock"
            touch "$lock_file"

            # only one parallel execution of the runner launches the instance
            ( flock 99
                echo "Is the instance available?"
                OUT=$(aws ec2 describe-instances --filters "Name=tag-value,Values=compi-aws-${image_id}" Name=instance-state-name,Values=running --output text)
                if [ -z "$OUT" ]; then
                    echo "No, it is not available; launching it"
                    # launch the instance here with 'aws ec2 run-instances', tagging it
                    # with compi-aws-${image_id}, and obtain its host (remote_host);
                    # then wait until the SSH daemon is reachable
                    READY=''
                    while [ -z "$READY" ]; do
                        sleep 5
                        set +e
                        OUT=$(ssh -o StrictHostKeyChecking=no -o BatchMode=yes ${remote_user}@${remote_host} exit 2>&1 | grep 'Permission denied' )
                        [[ $? = 0 ]] && READY='ready'
                        echo "READY is $READY"
                        set -e
                    done
                    echo "Ready"
                else
                    echo "Yes, it is available"
                fi
            ) 99<"$lock_file"

            # Here we assume that the instance is up and running
            # Obtain the instance details and host
            OUT=$(aws ec2 describe-instances --filters "Name=tag-value,Values=compi-aws-${image_id}" Name=instance-state-name,Values=running --output text)
            remote_host=$(echo "$OUT" | grep ASSOCIATION | head -n 1 | cut -f 3)

            # copy the compi environment to a file
            envfile=$(mktemp /tmp/compi-env.XXXXXX)
            for param in $task_params; do
                export -p | sed -n -e "/^declare -x $param/,/^declare -x/ p" | sed \$d >> $envfile
                echo "export $param" >> $envfile
            done
            scp -o StrictHostKeyChecking=no -i ${private_key_file} $envfile ${remote_user}@${remote_host}:${envfile}

            task_code_with_env="source $envfile; $task_code"
            ssh -o StrictHostKeyChecking=no -i ${private_key_file} ${remote_user}@${remote_host} "$task_code_with_env"
        ]]></runner>
    </runners>
.. _runner_maximum_tasks:
Controlling the maximum number of parallel tasks at task-level
--------------------------------------------------------------
The ``--num-tasks/-n`` parameter of Compi allows setting the maximum number of tasks that can be run in parallel.
Nevertheless, imagine you need to run a very heavy ``foreach`` task, or a ``foreach`` task that cannot be run in parallel when there
are not enough resources (e.g. GPUs). Using ``--num-tasks`` to control this ``foreach`` task would also apply the same limit
to the other pipeline tasks, and this may not be desirable.
The maximum number of such parallel tasks can be controlled at task level using a custom pipeline runner like the following, which:

1. Requires a pipeline parameter ``max_tasks`` that specifies the maximum number of parallel tasks allowed for ``task``.
2. Creates a shared lock file that is used to store the current number of parallel tasks. This file is shared between all parallel tasks.
3. Waits until it can access the lock file (using ``flock`` to avoid concurrency issues) and checks whether there is still room for one more task (i.e. the current number is less than ``max_tasks``). When there is, it increments the number and creates a slot file (used by the task process to be aware of the slot).
4. Runs the task code as necessary.
5. Updates the current number of parallel tasks in the shared lock file.
.. code-block:: xml

    <runners>
        <runner tasks="task"><![CDATA[
            lock_file=/tmp/compi-parallel-tasks.lock
            [ -f ${lock_file} ] || echo 0 > ${lock_file}
            id="${task_id}-$$" # identifier of this runner execution
            slot_available=/tmp/compi-slot-${id}

            # wait until there is room for one more task
            sleep_time=0
            while [ ! -f ${slot_available} ]; do
                sleep ${sleep_time}
                ( flock 99
                    count=$(cat ${lock_file})
                    if [ ${count} -lt ${max_tasks} ]; then
                        echo $((count+1)) > ${lock_file}
                        echo $((count+1)) > ${slot_available}
                    fi
                ) 99<"$lock_file"
                sleep_time=5
            done
            echo "Got a ticket for ${id}"

            /bin/sh -c "${task_code}"

            # release the slot
            rm ${slot_available}
            ( flock 99
                count=$(cat ${lock_file})
                echo $((count-1)) > ${lock_file}
            ) 99<"$lock_file"
        ]]></runner>
    </runners>
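The ``flock``-protected increment at the core of this runner can be exercised on its own. The sketch below initializes a counter file at 0 and performs one guarded increment (``max_tasks`` is a stand-in value):

.. code-block:: shell

    # A lock file holding the current number of running tasks (starts at 0)
    lock_file=$(mktemp /tmp/compi-tasks.XXXXXX)
    echo 0 > "$lock_file"
    max_tasks=4

    # Atomically increment the counter only if there is room for one more task
    ( flock 99
        count=$(cat "$lock_file")
        if [ "$count" -lt "$max_tasks" ]; then
            echo $((count+1)) > "$lock_file"
        fi
    ) 99<"$lock_file"

    cat "$lock_file"   # prints: 1

Because the read-check-write sequence runs inside a subshell holding an exclusive ``flock`` on file descriptor 99, concurrent runner executions cannot interleave their updates to the counter.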
Try `this example XML pipeline <_static/resources/runner-for-max-parallel-tasks.zip>`_ with the following command (``max_tasks`` has a default value of 4):
.. code-block:: shell

    compi run -o -n 20 -r runner.xml