Simulation batch-processing with Snakemake

Snakemake is a Python-based workflow management tool originating from bioinformatics that makes it possible to create reproducible and scalable data analysis workflows. Here, Snakemake is used to facilitate the batch-processing of a collection of Computational Fluid Dynamics (CFD) simulation setups by leveraging three of its main strengths:

Portability: The integration of Apptainer/Singularity/Docker makes it possible to provide the simulation software and all other software dependencies in the form of a container image.
Scalability: For single- and multicore execution on workstations and computer clusters, only the total number of cores to be used or the number of jobs to be submitted in parallel has to be specified.
Reporting: HTML reports containing plots from all simulations can be generated for sharing results with collaborators.

Furthermore, the workflow described here introduces an efficient way to conduct parameter studies by means of case templating, eliminating the need for maintaining multiple parameterized sub-versions of a simulation case.

Prerequisites

Installation

Note that a standalone installation of Snakemake will not work. The system described here depends on an installation of the multiphasepy package, which you can obtain from PyPI. Installing the package automatically installs the correct minimum version of Snakemake together with a plugin for execution on High Performance Computer (HPC) clusters. Please refer to the installation guideline for further details.

Case collection structure requirements

It is recommended that all simulation setups are located in a cases subdirectory.

|--- cases                       # subdirectory containing simulation setups
|   |--- someSetup
|   |--- anotherSetup
|   |--- subdirectory
|       |--- yetAnotherSetup

If the case collection is already configured to run as a workflow, you should see the following files:

|--- profiles                    # Define how to run the workflow (PC/HPC),
|   |--- default
|   |   |--- config.yaml         # Container selection, Snakemake settings, etc.
|   |--- slurm
|       |--- config.yaml         # Partition, walltime, etc.
|--- workflow                    # Internal scripts for running the workflow
|--- workflow.yml                # List of cases to run

If this is not the case, the workflow system first needs to be enabled using the command line utility mpyworkflow.

Simulation setup structure requirements

The purpose of the workflow is to enable convenient batch-processing of a larger collection of simulation setups. For the workflow to function, the individual setups must feature an executable script that contains all commands for

running a case, Allrun, i.e. pre-processing, solution and post-processing

While not mandatory, it is advisable to also provide scripts for

cleaning a case, Allclean, i.e. for resetting the case to its original state
updating reference solutions, Allupdate, which must copy the new results to the validation/reference directory at the level of the case
validating results, Allvalidate, i.e. comparing results against reference solutions

which allows the case collection to be developed into a validation database.

Another requirement is that all PNG files created during post-processing are stored in a case-level directory called postProcessing/report, from which Snakemake will gather images/plots for the report. Plotting scripts must be written accordingly. This directory is generated automatically for each case when running the workflow.
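
The directory convention above can be checked for a single case with a few lines of Python. This is only a minimal sketch and not part of the workflow itself: the helper name report_images is hypothetical, the file names in the demonstration are made up, and the sketch assumes that only PNG files placed directly in postProcessing/report (not in nested subdirectories) are of interest.

```python
import tempfile
from pathlib import Path


def report_images(case_dir):
    """Return the sorted PNG files found directly in <case_dir>/postProcessing/report."""
    report_dir = Path(case_dir) / "postProcessing" / "report"
    if not report_dir.is_dir():
        return []
    return sorted(report_dir.glob("*.png"))


# Demonstration on a throwaway case directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    report_dir = Path(tmp) / "someSetup" / "postProcessing" / "report"
    report_dir.mkdir(parents=True)
    (report_dir / "residuals.png").touch()
    (report_dir / "notes.txt").touch()  # not a PNG, therefore ignored
    found = [p.name for p in report_images(Path(tmp) / "someSetup")]
    print(found)
```

Running such a check before generating a report is a quick way to spot plotting scripts that write their output to a different location.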

Regular setups

Regular setups are setups that do not require further parameterization and can run stand-alone also outside of the workflow using the corresponding Allrun script.

Template setups

To allow efficient parameter studies, cases may also be provided in the form of templates, featuring a top-level caseParameterTable.ecsv file in the Astropy ECSV format that lists all case variations with the corresponding parameters and their units, e.g.

# %ECSV 0.9
# ---
# datatype:
# - {name: case, datatype: string}
# - {name: floatParam, unit: kg*m / s^2, datatype: float64}
# - {name: intParam, unit: Pa, datatype: int16}
# - {name: stringParam, datatype: string}
case   floatParam  intParam stringParam
case1  1.0         1        one
case2  2.0         2        two

Any ASCII file provided with a setup can then be converted into a template by adding the ending .jinja to it and filling it with placeholders for the parameters rather than actual values. For using this system in combination with the OpenFOAM Foundation Software, it is recommended to add a top-level caseParameterDict.jinja file in the well-known dictionary format:

FoamFile
{
    format      ascii;
    class       dictionary;
    object      caseParameterDict;
}
// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //

case        {{case}};

floatParam  {{floatParam}};
intParam    {{intParam}};
stringParam {{stringParam}};

// ************************************************************************* //

By using slash syntax, the values from this dictionary can be picked up in any subdictionary of the case, e.g.

FoamFile
{
    format      ascii;
    class       dictionary;
    object      caseParameterDict;
}
// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //

floatParam  ${${FOAM_CASE}/caseParameterDict!floatParam};
intParam    ${${FOAM_CASE}/caseParameterDict!intParam};
stringParam ${${FOAM_CASE}/caseParameterDict!stringParam};

// ************************************************************************* //

When executing the workflow, files with the ending .jinja will be rendered according to the cases selected from the caseParameterTable.ecsv. The .jinja ending is removed in that process. Copier is used in the background as the template rendering engine.
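
To illustrate what rendering means here, the following sketch substitutes the placeholders of a small template with the values of one table row. It is a deliberately naive stand-in that uses only the Python standard library: it does not call Copier or Astropy, it ignores the ECSV header metadata (units and data types), and the helper names read_rows and render are hypothetical.

```python
import re

TABLE = """\
# %ECSV 0.9
# ---
# datatype:
# - {name: case, datatype: string}
# - {name: floatParam, unit: kg*m / s^2, datatype: float64}
case   floatParam
case1  1.0
case2  2.0
"""

TEMPLATE = "case        {{case}};\nfloatParam  {{floatParam}};\n"


def read_rows(text):
    """Naively parse the data part of a whitespace-delimited table.

    Lines starting with '#' (the ECSV header) are skipped, so all values
    are returned as plain strings.
    """
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    names = lines[0].split()
    return [dict(zip(names, ln.split())) for ln in lines[1:]]


def render(template, params):
    """Replace every {{name}} placeholder by the matching parameter value.

    An unknown placeholder name raises a KeyError.
    """
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: params[m.group(1)], template)


rows = {row["case"]: row for row in read_rows(TABLE)}
rendered = render(TEMPLATE, rows["case2"])
print(rendered)
```

In the actual workflow, the rendered text would be written to a file of the same name without the .jinja ending.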

Note that the workflow renders any number of cases on the fly. If you wish to render only a single case to run it standalone, use the command line utility mpycopy provided by the multiphasepy package

mpycopy --case <case-name> <template-directory> <destination-directory>

which copies the case and fills in parameter values. For more information see mpycopy --help.

Execution

Configuration

“What” to process

The cases to be included in a batch-process are listed in the file workflow.yml, together with the target directory to which they are copied for execution. Note that the include dictionary must reflect the directory structure of the case collection:

target_dir: run
include:
  cases:
    regular_case: true
    template_case:
      case1: true
      case2: true
      case3: false # this case is ignored in the workflow
      # case4: true # this case is ignored as well
      ...
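
How such a nested include dictionary translates into a flat selection of cases can be sketched as follows. This is an illustration only: the function name selected_cases is hypothetical, the dictionary is written out as a Python literal instead of being loaded from workflow.yml, and the slash-joined paths merely serve as labels here.

```python
def selected_cases(include, prefix=""):
    """Yield the paths of all entries that are switched on with `true`."""
    for name, value in include.items():
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(value, dict):
            # Subdirectory or template case: descend one level.
            yield from selected_cases(value, path)
        elif value is True:
            yield path


# Python equivalent of the include dictionary shown above.
include = {
    "cases": {
        "regular_case": True,
        "template_case": {"case1": True, "case2": True, "case3": False},
    }
}
selection = list(selected_cases(include))
print(selection)
```

Entries set to false and entries that are commented out (and therefore absent from the dictionary) do not show up in the selection.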

If the case collection is very large, the include dictionary within the workflow.yml can be created automatically using the command line utility mpyworkflow provided by the multiphasepy package

mpyworkflow collect <directory_containing_cases>

which walks recursively through the directory and also includes all variants of template cases. For more information see mpyworkflow collect --help.
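
The idea of such a recursive collection can be sketched in plain Python. The sketch below is not the implementation of mpyworkflow collect: it assumes that a case is recognized by the presence of an Allrun script, it does not expand the variants of template cases, and the function name collect is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def collect(root):
    """Build a nested include dictionary for all directories containing Allrun."""
    root = Path(root)
    include = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        if "Allrun" not in filenames:
            continue
        dirnames.clear()  # do not descend into a case directory
        parts = Path(dirpath).relative_to(root.parent).parts
        node = include
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = True
    return include


# Demonstration on a throwaway directory tree mirroring the layout above.
with tempfile.TemporaryDirectory() as tmp:
    for case in ("someSetup", "subdirectory/yetAnotherSetup"):
        case_dir = Path(tmp) / "cases" / case
        case_dir.mkdir(parents=True)
        (case_dir / "Allrun").touch()
    collected = collect(Path(tmp) / "cases")
    print(collected)
```

The resulting dictionary has the same nesting as the include entry of the workflow.yml, with every case switched on.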

There is also a single_timestep: true option which can be added to the workflow.yml to quickly test a batch before its actual submission. Note that this currently only works in combination with OpenFOAM Foundation software.

“How” to process

Profiles are used to configure how batch runs are executed. There is a default profile (profiles/default/config.yaml) which is always loaded and contains settings that apply irrespective of the execution environment (PC/HPC), e.g. to specify the software environment used for executing jobs.

Containers

It is possible to use the Apptainer container runtime by setting

use-apptainer: true
apptainer-args: "--home $HOME -B $(readlink -f $PWD)"
# Note that the setting "--home $HOME -B $(readlink -f $PWD)" will mount your
# home directory and the current working directory to the container, even if it
# is a link to a network storage. If mounting of other filesystems is necessary,
# e.g. on a computer cluster, add them like "-B $(readlink -f $PWD),/scratch".

config:
    container: "oras://<registry>/<namespace>/<tag>:<image>.sif"

Note that Apptainer can also process Docker images, e.g. by

config:
    container: "docker://<registry>/<namespace>/<tag>:<image>"

The container image used for execution can also be specified at the case level by adding a file case.yml containing

workflow:
    container: "oras://<different_registry>/<different_namespace>/<different_tag>:<different_image>.sif"

To forcibly turn off the use of a container for a specific case, you can also set the container configuration to null or false:

workflow:
    container: null # or false

Further, the entry can be a template parameter in a case.yml.jinja file

workflow:
    container: {{image}}

whereby the use of a separate image for every variant of a template setup is possible by adding the name of the image to the caseParameterTable.ecsv file.
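
The precedence between the default container and the case-level setting described above can be summarized in a short sketch. The resolution order is inferred from the text, the function name resolve_container is hypothetical, and the image names are placeholders; the configurations are passed as Python dictionaries rather than read from the YAML files.

```python
def resolve_container(default_image, case_config):
    """Return the image to use for a case, or None if no container is used.

    default_image: value of config/container in the default profile.
    case_config:   parsed content of the case-level case.yml (may be empty).
    """
    workflow = case_config.get("workflow") or {}
    if "container" not in workflow:
        return default_image  # no case-level setting: keep the default
    image = workflow["container"]
    if image is None or image is False:
        return None  # container explicitly turned off for this case
    return image  # case-level image overrides the default


DEFAULT = "docker://example.org/namespace/default-image"  # placeholder name

print(resolve_container(DEFAULT, {}))
print(resolve_container(DEFAULT, {"workflow": {"container": None}}))
print(resolve_container(DEFAULT, {"workflow": {"container": "docker://example.org/other"}}))
```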

If the use of Apptainer is deactivated, then the execution environment is determined by the environment provided at the start of a workflow.

Note that Apptainer by default takes over all system environment variables, so activating a software environment, e.g., for OpenFOAM Foundation software before starting a workflow can lead to conflicts. It is possible to suppress this behavior by adding

apptainer-args: "--cleanenv"

as an option to Apptainer, but this may be undesirable if system environment variables are actually desired in the container run (e.g. Slurm variables for multi-node execution).

Environment Variables

It is possible to set environment variables for the entire workflow by setting

config:
  container: ...
  env_var:
    NAME: "value"

in the profiles/default/config.yaml configuration file.

The environment variables will be available for every rule of the workflow.

License management options

Some CFD software requires the user to manage licenses for their simulations. Most commonly, this management takes the form of a license server that issues a license for the simulation to run. If no licenses are available at the time of the request, the simulation will crash.

To avoid any inconvenience, the workflow integrates a rule that manages license availability:

local execution: The workflow will sleep until a license is available.
remote execution: The workflow will requeue the allrun group until a license is available.
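
The waiting behaviour for local execution can be pictured as a simple polling loop. The following is a minimal sketch under stated assumptions, not the rule shipped with the workflow: the license check is represented by a callable that returns the text printed by the license command ("True" or "False"), and the function name wait_for_license as well as the polling interval are hypothetical.

```python
import time


def wait_for_license(check, poll_seconds=60.0, sleep=time.sleep):
    """Block until check() reports "True" and return the number of attempts.

    check: callable returning the output of the license command as a string.
    sleep: injectable for testing; defaults to time.sleep.
    """
    attempts = 1
    while check().strip() != "True":
        sleep(poll_seconds)
        attempts += 1
    return attempts


# Demonstration with canned answers instead of a real license command.
answers = iter(["False", "False", "True\n"])
pauses = []
attempts = wait_for_license(lambda: next(answers), poll_seconds=5, sleep=pauses.append)
print(attempts, pauses)
```

In a real setting, the callable would wrap the configured license command, for example via subprocess.run with capture_output=True.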

The license checking is limited to the Allrun rule. To work properly, the license checker requires some configuration. It is also necessary to have built the workflow with the require_license flag. For an existing workflow, the functionality can be added by

copier recopy -a .workflow-copier-answers.yml

and answering the questions accordingly.

License workflow configuration

license_command: in the workflow.yml, provide the license command that is executed to check whether a license is available:

license_command: 'license_checker'

The command should return “True” if enough licenses are available, “False” otherwise.

Additional predefined options can be passed to the license command for better context:

license_command: 'license_checker "${log}" "${software}"'

Available options are:

log: path to the check_license.log file.
software: name of the case simulation software.
license_server: url of the license server.
cores: number of cores the simulation will run on.

Tip: If the license command is a bash script, it is highly recommended to use strict mode.

license_server: in the workflow.yml, provide the license server for each simulation software needed:

license_server:
    "Simcenter STAR-CCM+": 1999@starccm.server
    "Ansys Fluent": 1055@fluent.server

The key of the license server should be the name of the simulation software that is configured in the case.yml file.

License case configuration

Each case can be configured using the case.yml file to use the license checker:

...
simulation:
    software: "Simcenter STAR-CCM+"
    require_license: yes
...

software: name of the software that is used to identify the license server.

The default value for the software key is determined by the case type. Available case types are: ‘base’, ‘OpenFOAM’, ‘Simcenter STAR-CCM+’, ‘Ansys Fluent’.

require_license: whether the case requires license checking (default: false).

Additional Snakemake options

In the file profiles/default/config.yaml any command line option of Snakemake can be added in order to make the actual command for starting a workflow more compact.

Operating on High Performance Computer (HPC) clusters

For executing the batch run on an HPC system that uses the Slurm workload manager, there is a separate profile (profiles/slurm/config.yaml) to specify the partition to be used, among other things.

General command sequence (for local execution)

Quick start

To copy the cases listed in workflow.yml to the target directory and batch-process the case-level Allrun scripts of all cases, execute

snakemake -c all

which will utilize all cores of your machine. Note that Snakemake always requires you to explicitly specify the number of cores for any command. This is to enforce a conscious choice of the resources used.

Step by step

The workflow definition is organized into so-called rules, whose execution is triggered by

snakemake <rule> -c <number_of_cores>

The rules in this workflow are named according to the various All*-scripts that are provided with each case. To batch-process the case-level Allrun scripts of all cases listed in workflow.yml, execute

snakemake Allrun -c <number_of_cores>

The Allrun rule is the default rule, hence the command

snakemake -c <number_of_cores>

is synonymous. It will copy (render) the cases into the target_dir directory and run them, using the supplied number of cores. If the individual cases require fewer cores, several simulations will run in parallel. On the other hand, setups asking for more cores than provided in total are scaled down. Note that retrieving and adjusting the number of cores from the simulation setup currently only works for OpenFOAM Foundation software. The workflow automatically reads and possibly adjusts the numberOfSubdomains entry in ${FOAM_CASE}/system/decomposeParDict.
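
The scaling-down step can be illustrated with a small sketch that reads numberOfSubdomains from the text of a decomposeParDict and caps it at the number of available cores. This is an assumption about how such an adjustment could be done with regular expressions, not the mechanism used by the workflow; the function names and the sample dictionary content are hypothetical.

```python
import re

ENTRY = re.compile(r"^(\s*numberOfSubdomains\s+)(\d+)(\s*;)", re.MULTILINE)


def requested_cores(decompose_par_dict):
    """Read numberOfSubdomains from decomposeParDict text (1 if absent)."""
    match = ENTRY.search(decompose_par_dict)
    return int(match.group(2)) if match else 1


def scale_down(decompose_par_dict, total_cores):
    """Cap numberOfSubdomains at total_cores and return (cores, new text)."""
    cores = min(requested_cores(decompose_par_dict), total_cores)
    text = ENTRY.sub(rf"\g<1>{cores}\g<3>", decompose_par_dict)
    return cores, text


SAMPLE = "numberOfSubdomains 8;\nmethod scotch;\n"  # made-up dictionary content

cores, text = scale_down(SAMPLE, total_cores=4)
print(cores)
print(text)
```

A setup that asks for no more cores than provided is left unchanged, so several such cases can run side by side within the total core budget.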

Other case-level scripts are batch-processed in a similar manner, e.g.

snakemake Allvalidate -c <number_of_cores>
snakemake Allupdate -c <number_of_cores>

or

snakemake Allclean -c <number_of_cores>

If a case doesn’t feature an Allvalidate or Allupdate script, its execution is simply omitted for this case and no error is reported.

It is also possible to only initialize cases by

snakemake init -c <number_of_cores>

which will copy (render) the cases into the target_dir directory.

If execution of a rule was successful, triggering it again will not do anything. You can force re-running with the --forceall, -F option, e.g.

snakemake Allrun -F -c <number_of_cores>

Using

snakemake --report -c 1

an HTML report can be generated which gathers PNG files from case-level postProcessing/report subdirectories of all cases included in the workflow. This will generate a file named report.html. Alternatively, you can specify the name of the file by

snakemake --report <alternative_name>.html -c 1

This command also works if triggered for a failed workflow or in a separate shell for a running workflow, i.e. to generate a premature report. However, the workflow must be past the initialization stage, i.e.

snakemake init -c <number_of_cores>

must be complete. Note that, as a case database grows, the number of PNG files that are marked for inclusion in the report may become too large for a self-contained HTML file. You will notice this when the report takes too long to load in your browser. In this case, recreate the report by

snakemake --report <report_name>.zip -c 1

which generates a zip archive containing the <report_name>.html file, which links to the actual PNG files in a separate directory data.

Remote execution on a Slurm HPC cluster

For configuring the remote execution, specify the partition and the maximum wall-time per job in the profiles/slurm/config.yaml.

Compared to a local execution, the command sequence for remote execution only differs in two aspects:

The profiles/slurm/config.yaml configuration file must be selected through the --profile option. Note that profiles/default/config.yaml is always loaded, so the settings in profiles/slurm/config.yaml apply additionally. An alternative profile could be created for use with other batch systems like PBS.
The -j, --jobs option is now required; it specifies the maximum number of jobs submitted in parallel.

The set of commands to run the workflow can simply be issued on the submit node, i.e.

snakemake --profile profiles/slurm -j <number_of_jobs>

Cancelling the execution with Ctrl+c sends an scancel to the individual jobs submitted by Snakemake.

Background Execution using tmux

To be able to log out while a workflow is running, use the terminal multiplexer tmux. Create a separate session for every workflow.

tmux new-session -s <sessionName>
snakemake --profile profiles/slurm init -j <number_of_jobs>
snakemake --profile profiles/slurm -j <number_of_jobs>

Detach from the session by pressing Ctrl+b, then d. You can then log out. In order to stop the workflow, reattach to the session

tmux attach -t <sessionName>

and cancel the script execution by pressing Ctrl+c. The current tmux session can be killed with Ctrl+d or exit. Note that jobs submitted to an HPC system by Snakemake will not be cancelled by killing a tmux session. Always cancel the Snakemake process first.

Existing sessions can be listed with their id by

tmux ls

To reconnect to an existing session use:

tmux attach-session -t <session-id>

To kill an existing session use:

tmux kill-session -t <session-id>

For more in-depth usage, please refer to the tmux documentation.

Debugging

The setting

keep-going: true      # Keeps workflow alive if a single job fails.

in profiles/default/config.yaml causes the workflow to continue running even if individual jobs fail. Snakemake will report about possible errors, e.g.

Exiting because a job execution failed. Look below for error messages
Error in rule Allrun_case:
    message: None
    jobid: 7
    input: run/<path_to_case>/.init
    output: run/<path_to_case>/.Allrun
    log: run/<path_to_case>/log.Allrun (check log file(s) for error details)
Errors occurred. Run `snakemake --summary -c 1 <rule, e.g. Allrun> | grep missing`
to get a list of failed cases.
Then check the corresponding log files.
Complete log(s): .snakemake/log/<timestamp>.snakemake.log
WorkflowError:
At least one job did not complete successfully.

The output in the shell: block is irrelevant, as it just points to the general script for executing the All<script> at the case level. To debug a case, consult the corresponding log.All<script> as indicated by:

    log: run/case/log.Allrun (check log file(s) for error details)

The case-level All<script> stops executing at the first error that occurs. The error message from the problematic command is shown in log.All<script> unless the command writes to its own log file, e.g. if started with the OpenFOAM Foundation software run function runApplication. In the latter case, consult the individual log file.

There is also the command

snakemake --summary -c <number_of_cores> <rule, e.g. Allrun>

which can be filtered by

snakemake --summary -c <number_of_cores> | grep missing

to give a list of failed cases and the corresponding log files.

Problematic cases can be fixed in-place, i.e. in the target_dir directory. Then simply restart the rule for which the error occurred, e.g.

  snakemake Allupdate -c <number_of_cores>

whereby only the failed jobs for this rule will be restarted.

If the execution time of previous rules has been negligible, it is better to delete the problematic case from the target_dir directory and to repair it in the source cases directory, to make the fix persistent.

Afterwards, simply restart the workflow

  snakemake -c <number_of_cores>

and rerun follow-up rules. Again, only cases that are missing from the target_dir directory are initialized and executed.

Sending a custom command to all cases of a batch run

In some circumstances it may be useful to send the same command to all cases within a batch run. For this purpose, add the corresponding command to the workflow.yml file, as for example

custom_command: "( cd validation && ./createGraphs )"

and execute it by (possibly in a separate terminal)

snakemake custom_command -c <number_of_cores>

The above example will regenerate plots for all cases included in the batch run, provided that plot creation is handled by a script validation/createGraphs at the case level. This is particularly useful when reference solutions are updated and a new report is desired. Other use cases for OpenFOAM Foundation software include stopping simulations gracefully by

custom_command: "touch stop"

which requires that the cases are set up with the stopAtFile functionObject:

#includeFunc    stopAtFile(action=writeNow)

If not, the same may be achieved by

custom_command: "foamDictionary system/controlDict -entry stopAt -set writeNow"