Custom runners
**************
What are Compi `runners`
========================
By default, Compi runs task code by spawning local processes. With `runners`,
task code is handed to custom-made scripts which are in charge of running
it, for example, by submitting a job to a queue (e.g. Slurm, SGE) or by using
Docker images.
Runners are passed to the main ``compi run`` command using the ``-r``
parameter.
Creating a custom runner
========================
Like pipelines, runners are defined in XML. Individual runners are defined
using the ``runner`` tag inside the ``runners`` tag. The ``tasks`` attribute
is used to specify the list of tasks (comma-separated) that the corresponding
runner must execute.
.. code-block:: xml

    <runners>
        <runner tasks="task-1,task-2">
            /bin/sh -c "${task_code}"
        </runner>
    </runners>
The runner code will have the following environment variables provided by Compi:

- ``task_id``: contains the id of the task being executed.
- ``task_code``: contains the code (defined in the ``pipeline.xml``) of the task being executed.
- ``task_params``: contains the list of parameters associated with the task being executed.
- ``i``: in the case of ``foreach`` tasks, the iteration value.
- As in regular Compi tasks, the task parameter variables are also defined.
A simple example
----------------
Consider the following XML of the greetings pipeline:
.. code-block:: xml

    <pipeline>
        <version>1.0</version>
        <params>
            <param name="yourName">Your name</param>
            <flag name="sayGoodBye">Do you want to say goodbye?</flag>
        </params>
        <tasks>
            <task id="greetings" params="yourName">echo "Hi ${yourName}"</task>
            <task id="goodbye" after="greetings" if="[ -v sayGoodBye ]" params="yourName sayGoodBye">echo "Goodbye ${yourName}"</task>
        </tasks>
    </pipeline>
And the following runners file where one runner is defined for the two
pipeline tasks:
.. code-block:: xml

    <runners>
        <runner tasks="greetings,goodbye">
            echo -e "[${task_id}] \n\tyourName: ${yourName} \n\tcode: ${task_code} \n\tparams: ${task_params}" >> /tmp/runner-output
            /bin/sh -c "${task_code}"
        </runner>
    </runners>
This runner prints task information (using the runner environment variables)
to a file (``/tmp/runner-output``) and then runs the task using the shell
interpreter. This example can be executed with:
.. code-block:: console

    compi run -p pipeline.xml -r runner.xml -o -- --sayGoodBye
    cat /tmp/runner-output
Examples of useful runners
==========================
Generic Docker runner
---------------------
Let's suppose the following pipeline with one task to align a FASTA file using
Clustal Omega:
.. code-block:: xml

    <pipeline>
        <version>1.0</version>
        <params>
            <param name="workingDir">Working directory.</param>
            <param name="input">Input file.</param>
            <param name="output">Output file.</param>
            <param name="clustalomega">Clustal Omega executable.</param>
        </params>
        <tasks>
            <task id="align" params="workingDir input output clustalomega">${clustalomega} -i ${workingDir}/${input} -o ${workingDir}/${output}</task>
        </tasks>
    </pipeline>
One may want to run this task using a Docker runner which runs the same task
code inside a Docker container where the Clustal Omega executable is available.
The following runners file shows a runner to do this:
.. code-block:: xml

    <runners>
        <runner tasks="align">
            envs=$(for param in $task_params; do echo -n "-e $param "; done)
            docker run --rm $envs -v ${workingDir}:${workingDir} --entrypoint /bin/bash pegi3s/clustalomega -c "${task_code}"
        </runner>
    </runners>
The key points of this generic Docker runner are:

- The first line builds a variable with the list of parameters that should be passed to the Docker container as environment variables.
- The second line runs the Docker image, passing this list of environment variables and mounting the directory that contains the input and output files of the command.
- Since this particular Clustal Omega image has an entrypoint defined, it must be overridden in order to run the desired task code.
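The expansion performed by the first runner line can be checked in isolation. The following sketch simulates the ``task_params`` variable with the four parameter names of the pipeline above (stand-in values, not provided by a real Compi run):

.. code-block:: shell

    # Simulate the task_params variable that compi would provide (assumed values)
    task_params="workingDir input output clustalomega"

    # Same expansion as the first line of the runner: one -e flag per parameter
    envs=$(for param in $task_params; do echo -n "-e $param "; done)
    echo "$envs"

Each parameter name becomes a ``-e name`` flag, so ``docker run`` forwards the corresponding environment variables (already exported by Compi) into the container.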
Generic Slurm runner
--------------------
The following runners file shows a generic Slurm runner:
.. code-block:: xml

    <runners>
        <runner tasks="align">
            tmpfile=$(mktemp /tmp/compi-task-code.XXXXXXXX)
            echo "#!/bin/bash" >> ${tmpfile}
            echo ${task_code} >> ${tmpfile}
            chmod u+x ${tmpfile}
            srun -c 1 -p main --export ALL -o /tmp/task-1.log -e /tmp/task-1.err -J task_1 bash ${tmpfile}
        </runner>
    </runners>
Some parameters of ``srun`` may need to be adjusted for each specific
cluster, but this is what a generic Slurm runner may look like. The
``--export`` parameter must be used to export all the environment variables to
the process that will be executed; this is necessary because the task
parameters are declared as environment variables.
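The script-file trick used by this runner can be tried locally by replacing ``srun`` with a plain ``bash`` invocation. Here ``task_code`` and ``yourName`` are stand-in values for what Compi would provide:

.. code-block:: shell

    # Stand-in values for what compi would provide to the runner
    export yourName="World"
    task_code='echo "Hi ${yourName}"'

    # Write the task code into a temporary, executable script
    tmpfile=$(mktemp /tmp/compi-task-code.XXXXXXXX)
    echo "#!/bin/bash" >> ${tmpfile}
    echo ${task_code} >> ${tmpfile}
    chmod u+x ${tmpfile}

    # On a cluster this line would be: srun ... bash ${tmpfile}
    bash ${tmpfile}   # prints: Hi World
    rm ${tmpfile}

Because the task parameters are exported environment variables, the generated script can reference them (``${yourName}``) without any extra plumbing.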
Generic SSH runner
--------------------
The following runners file shows a generic SSH runner, which executes the task code on a given SSH host.
A trust relationship between the client machine (where Compi runs) and the remote host is assumed (e.g. passwordless login with SSH public-key authentication).
.. code-block:: xml

    <runners>
        <runner tasks="align"><![CDATA[
            remote_host="192.168.1.108" # set here the remote machine
            remote_user="lipido" # set here the remote user

            # copy the compi environment to a file
            envfile=$(mktemp /tmp/compi-env.XXXXXX)
            for param in $task_params; do
                export -p | sed -n -e "/^declare -x $param/,/^declare -x/ p" | sed \$d >> $envfile
                echo "export $param" >> $envfile
            done
            scp $envfile ${remote_user}@${remote_host}:${envfile}

            task_code_with_env="source $envfile; $task_code"
            ssh ${remote_user}@${remote_host} "$task_code_with_env"
        ]]></runner>
    </runners>
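The environment hand-off can be verified without a remote host by sourcing the generated file in a fresh shell. The sketch below builds, for a single stand-in parameter (``yourName``), an env file equivalent to what the ``export``/``sed`` pipeline produces, and runs a stand-in task code the way ``ssh`` would:

.. code-block:: shell

    # Env file equivalent to the runner's export/sed output for one parameter
    envfile=$(mktemp /tmp/compi-env.XXXXXX)
    echo 'declare -x yourName="World"' >> $envfile
    echo 'export yourName' >> $envfile

    # Run the task code in a fresh shell, as the runner does over ssh
    task_code='echo "Hi ${yourName}"'
    task_code_with_env="source $envfile; $task_code"
    bash -c "$task_code_with_env"   # prints: Hi World
    rm $envfile

Prefixing the task code with ``source $envfile`` is what makes the Compi parameter variables visible on the remote side.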
Generic AWS runner
--------------------
Based on the previous generic SSH runner, here is a more complex one. This runner runs the task code over SSH on an Amazon Linux virtual machine.
In order to do that, the runner is in charge of creating the instance, if it is not already available, and of waiting for the SSH protocol to become available.
After that, the task code is run on the instance via SSH. This runner uses the ``flock`` utility to ensure that only one execution of the runner
launches the Amazon instance, while the other executions only run the SSH part.
.. code-block:: xml

    <runners>
        <runner tasks="align"><![CDATA[
            image_id="ami-xxxxxxxx" # set here the AMI to use
            remote_user="ec2-user" # set here the remote user
            private_key_file="/path/to/key.pem" # set here the private key file
            lock_file="/tmp/compi-aws-${image_id}.lock"
            touch "$lock_file"

            # only one parallel execution of the runner launches the instance
            ( flock 99
                echo "Is the instance available?"
                OUT=$(aws ec2 describe-instances --filters "Name=tag-value,Values=compi-aws-${image_id}" Name=instance-state-name,Values=running --output text)
                if [ -z "$OUT" ]; then
                    echo "No, it is not available; launching it"
                    # launch the instance here with 'aws ec2 run-instances', tagging it
                    # with compi-aws-${image_id}, and obtain its host (remote_host);
                    # then wait until the SSH daemon is reachable
                    READY=''
                    while [ -z "$READY" ]; do
                        sleep 5
                        set +e
                        OUT=$(ssh -o StrictHostKeyChecking=no -o BatchMode=yes ${remote_user}@${remote_host} exit 2>&1 | grep 'Permission denied' )
                        [[ $? = 0 ]] && READY='ready'
                        echo "READY is $READY"
                        set -e
                    done
                    echo "Ready"
                else
                    echo "Yes, it is available"
                fi
            ) 99<"$lock_file"

            # Here we assume that the instance is up and running
            # Obtain the instance details and host
            OUT=$(aws ec2 describe-instances --filters "Name=tag-value,Values=compi-aws-${image_id}" Name=instance-state-name,Values=running --output text)
            remote_host=$(echo "$OUT" | grep ASSOCIATION | head -n 1 | cut -f 3)

            # copy the compi environment to a file
            envfile=$(mktemp /tmp/compi-env.XXXXXX)
            for param in $task_params; do
                export -p | sed -n -e "/^declare -x $param/,/^declare -x/ p" | sed \$d >> $envfile
                echo "export $param" >> $envfile
            done
            scp -o StrictHostKeyChecking=no -i ${private_key_file} $envfile ${remote_user}@${remote_host}:${envfile}

            task_code_with_env="source $envfile; $task_code"
            ssh -o StrictHostKeyChecking=no -i ${private_key_file} ${remote_user}@${remote_host} "$task_code_with_env"
        ]]></runner>
    </runners>
.. _runner_maximum_tasks:
Controlling the maximum number of parallel tasks at task-level
--------------------------------------------------------------
The ``--num-tasks/-n`` parameter of Compi allows setting the maximum number of tasks that can be run in parallel.
Nevertheless, imagine you need to run a very heavy ``foreach`` task, or a ``foreach`` task that cannot be run in parallel when there
are not enough resources (e.g. GPUs). Using ``--num-tasks`` to control this ``foreach`` task would also apply the same limit
to the other pipeline tasks, and this may not be desirable.
The maximum number of such parallel tasks can be controlled at task level using a custom pipeline runner like the following, which:

1. Requires a pipeline parameter ``max_tasks`` that specifies the maximum number of parallel tasks allowed for ``task``.
2. Creates a shared lock file that is used to store the current number of parallel tasks. This file is shared between all parallel tasks.
3. Waits until it can access the lock file (using ``flock`` to avoid concurrency issues) and checks whether there is still room for one more task (i.e. the current number is less than ``max_tasks``). When there is, it increments the number and creates a slot file (used by the task process to be aware of the slot).
4. Runs the task code as necessary.
5. Updates the current number of parallel tasks in the shared lock file.
.. code-block:: xml

    <runners>
        <runner tasks="task"><![CDATA[
            lock_file=/tmp/compi-parallel-tasks.lock
            [ -f ${lock_file} ] || echo 0 > ${lock_file}
            id="${task_id}-$$" # identifier of this runner execution
            slot_available=/tmp/compi-slot-${id}

            # wait until there is room for one more task
            sleep_time=0
            while [ ! -f ${slot_available} ]; do
                sleep ${sleep_time}
                ( flock 99
                    count=$(cat ${lock_file})
                    if [ ${count} -lt ${max_tasks} ]; then
                        echo $((count+1)) > ${lock_file}
                        echo $((count+1)) > ${slot_available}
                    fi
                ) 99<"$lock_file"
                sleep_time=5
            done
            echo "Got a ticket for ${id}"

            /bin/sh -c "${task_code}"

            # release the slot
            rm ${slot_available}
            ( flock 99
                count=$(cat ${lock_file})
                echo $((count-1)) > ${lock_file}
            ) 99<"$lock_file"
        ]]></runner>
    </runners>
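The ``flock``-protected increment at the core of this runner can be exercised on its own. The sketch below initializes a counter file at 0 and performs one guarded increment (``max_tasks`` is a stand-in value):

.. code-block:: shell

    # A lock file holding the current number of running tasks (starts at 0)
    lock_file=$(mktemp /tmp/compi-tasks.XXXXXX)
    echo 0 > "$lock_file"
    max_tasks=4

    # Atomically increment the counter only if there is room for one more task
    ( flock 99
        count=$(cat "$lock_file")
        if [ "$count" -lt "$max_tasks" ]; then
            echo $((count+1)) > "$lock_file"
        fi
    ) 99<"$lock_file"

    cat "$lock_file"   # prints: 1

Because the read-check-write sequence runs inside a subshell holding an exclusive ``flock`` on file descriptor 99, concurrent runner executions cannot interleave their updates to the counter.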
Try `this example XML pipeline <_static/resources/runner-for-max-parallel-tasks.zip>`_ with the following command (``max_tasks`` has a default value of 4):
.. code-block:: shell

    compi run -o -n 20 -r runner.xml