Simulation batch-processing with Snakemake
Snakemake is a Python-based tool originating from bioinformatics that allows the creation of reproducible and scalable data analysis workflows. Here, Snakemake is used to facilitate the batch-processing of a collection of Computational Fluid Dynamics (CFD) simulation setups by leveraging three of its main strengths:
Portability: The integration of Apptainer/Singularity/Docker makes it possible to provide the simulation software and all other software dependencies in the form of a container image.
Scalability: For single- and multicore execution on workstations and computer clusters, only the total number of cores to be used or jobs to be submitted in parallel has to be specified.
Reporting: HTML reports containing plots from all simulations can be generated for sharing results with collaborators.
Furthermore, the workflow described here introduces an efficient way to conduct parameter studies by means of case templating, eliminating the need for maintaining multiple parameterized sub-versions of a simulation case.
Prerequisites
Installation
Note that a standalone installation of Snakemake will not work. The system
described here depends on an installation of the
multiphasepy package. You can obtain it from PyPI.
Installing the package automatically installs the correct minimum version of
Snakemake together with a plugin for execution on High Performance Computer
(HPC) clusters. Please refer to the
installation guideline
for further details.
Case collection structure requirements
It is recommended that all simulation setups are located in a cases
subdirectory.
|--- cases # subdirectory containing simulation setups
| |--- someSetup
| |--- anotherSetup
| |--- subdirectory
| |--- yetAnotherSetup
If the case collection is already configured to run as a workflow, you should see the following files:
|--- profiles # Define how to run the workflow (PC/HPC),
| |--- default
| | |--- config.yaml # Container selection, Snakemake settings, etc.
| |--- slurm
| |--- config.yaml # Partition, walltime, etc.
|--- workflow # Internal scripts for running the workflow
|--- workflow.yml # List of cases to run
If this is not the case, the workflow system first needs to be enabled using the
command line utility mpyworkflow.
Simulation setup structure requirements
The purpose of the workflow is to enable convenient batch-processing of a larger collection of simulation setups. For the workflow to function, the individual setups must feature an executable script that contains all commands for
running a case: Allrun, i.e. pre-processing, solution and post-processing.
While not mandatory, it is advisable to also provide scripts for
cleaning a case: Allclean, i.e. for resetting the case to its original state,
updating reference solutions: Allupdate, which must copy the new results to the validation/reference directory at the level of the case,
validating results: Allvalidate, i.e. comparing results against reference solutions,
which allows the case collection to be developed into a validation database.
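The script requirements above can be checked before a batch run is started. The following is a minimal sketch and not part of multiphasepy: the function name check_case_scripts is hypothetical, and it assumes that the scripts sit at the top level of the case directory.

```python
import os
from pathlib import Path

REQUIRED = ("Allrun",)                               # mandatory for every case
OPTIONAL = ("Allclean", "Allupdate", "Allvalidate")  # advisable, not mandatory


def check_case_scripts(case_dir):
    """Return (missing_required, missing_optional) script names for a case.

    A script counts as present only if it is a regular file with the
    executable bit set. (Hypothetical helper, for illustration only.)
    """
    case = Path(case_dir)

    def is_executable(name):
        path = case / name
        return path.is_file() and os.access(path, os.X_OK)

    missing_required = [n for n in REQUIRED if not is_executable(n)]
    missing_optional = [n for n in OPTIONAL if not is_executable(n)]
    return missing_required, missing_optional
```

A case with a non-empty first list cannot be processed by the workflow. A non-empty second list only means that the corresponding rules will have nothing to do for that case.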
Another requirement is that all PNG files created during post-processing are
stored in a case-level directory called postProcessing/report, from which
Snakemake will gather images/plots for the report. Plotting scripts must be
written accordingly. This directory is generated automatically for each case
when running the workflow.
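A plotting script can follow this convention with a small helper. The sketch below only illustrates the directory layout: the function names report_path and find_report_images are hypothetical, and how Snakemake actually collects the images is defined by the internal workflow scripts, not by this code.

```python
from pathlib import Path


def report_path(case_dir, name):
    """Return the path for a PNG in the case-level report directory.

    The directory is created if needed, so the helper also works when a
    case is run standalone outside of the workflow.
    """
    report_dir = Path(case_dir) / "postProcessing" / "report"
    report_dir.mkdir(parents=True, exist_ok=True)
    return report_dir / f"{name}.png"


def find_report_images(target_dir):
    """List all PNG files below any <case>/postProcessing/report, sorted."""
    return sorted(Path(target_dir).glob("**/postProcessing/report/*.png"))
```

A plotting script would then save its figure to, for example, report_path(".", "residuals") when executed from within the case directory.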
Regular setups
Regular setups are setups that do not require further parameterization and can
also run standalone outside of the workflow using the corresponding Allrun
script.
Template setups
To allow efficient parameter studies, cases may also be provided in the form of
templates, featuring a top-level caseParameterTable.ecsv file in the
Astropy ECSV format
that lists all case variations with the corresponding parameters and their
units, e.g.
# %ECSV 0.9
# ---
# datatype:
# - {name: case, datatype: string}
# - {name: floatParam, unit: kg*m / s^2, datatype: float64}
# - {name: intParam, unit: Pa, datatype: int16}
# - {name: stringParam, datatype: string}
case floatParam intParam stringParam
case1 1.0 1 one
case2 2.0 2 two
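To illustrate how such a table maps case names to parameter sets, here is a deliberately simplified sketch using only the Python standard library. It skips the commented header and assumes space-delimited data with the case name in the first column. A complete reader (for example the ECSV support in Astropy) also evaluates units and data types, which this sketch does not. The function name read_case_table is hypothetical.

```python
def read_case_table(text):
    """Parse the data part of a space-delimited ECSV table into dicts.

    Header lines starting with '#' (units, datatypes) are skipped, so all
    values are returned as strings. Keys of the result are the entries of
    the first column, i.e. the case names.
    """
    rows = [
        line.split()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    header, data = rows[0], rows[1:]
    return {row[0]: dict(zip(header, row)) for row in data}
```

For the table above, read_case_table(text)["case2"]["stringParam"] yields the string two.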
Any ASCII file provided with a setup can then be converted into a
template by adding the ending
.jinja to it and filling it with placeholders for the parameters rather than
actual values. For using this system in combination with the OpenFOAM Foundation
Software, it is recommended to add a top-level caseParameterDict.jinja file in
the well-known dictionary format:
FoamFile
{
format ascii;
class dictionary;
object caseParameterDict;
}
// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
case {{case}};
floatParam {{floatParam}};
intParam {{intParam}};
stringParam {{stringParam}};
// ************************************************************************* //
By using slash syntax, the values from this dictionary can be picked up in any subdictionary of the case, e.g.
FoamFile
{
format ascii;
class dictionary;
object caseParameterDict;
}
// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
floatParam ${${FOAM_CASE}/caseParameterDict!floatParam};
intParam ${${FOAM_CASE}/caseParameterDict!intParam};
stringParam ${${FOAM_CASE}/caseParameterDict!stringParam};
// ************************************************************************* //
When executing the workflow, files with the ending .jinja will be rendered
using the parameter values from the caseParameterTable.ecsv for the selected
case. The .jinja ending is eliminated in that process and the file becomes a
regular dictionary.
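The rendering step can be pictured as follows. This sketch is a simplified stand-in for the Jinja template engine, using a regular expression from the standard library: it only handles plain {{name}} placeholders and none of Jinja's other features. The function names render and rendered_name are hypothetical.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template_text, params):
    """Replace every {{name}} placeholder with the value from params.

    Raises KeyError for a placeholder without a matching parameter.
    """
    return PLACEHOLDER.sub(lambda m: str(params[m.group(1)]), template_text)


def rendered_name(filename):
    """Drop the .jinja ending; other file names are returned unchanged."""
    suffix = ".jinja"
    return filename[: -len(suffix)] if filename.endswith(suffix) else filename
```

OpenFOAM expansions such as ${FOAM_CASE} use single braces and are therefore left untouched by the pattern.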
Note that the workflow renders any number of cases on the fly. If you wish to render only a single case to run it standalone, use the command line utility mpycopy provided by the multiphasepy package:
mpycopy --case <case-name> <template-directory> <destination-directory>
which copies the case and fills in parameter values. For more information see
mpycopy --help.
Execution
Configuration
“What” to process
The cases to be included in a batch-process are listed in the file
workflow.yml, together with the target directory to which they are copied
for execution. Note that the include dictionary must reflect the directory
structure of the case collection:
target_dir: run
include:
cases:
regular_case: true
template_case:
case1: true
case2: true
case3: false # this case is ignored in the workflow
# case4: true # this case is ignored as well
...
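How the nested include dictionary translates into a flat list of cases can be sketched in a few lines. This is an illustration of the structure only, not the workflow's internal code: the function name selected_cases is hypothetical, and the dictionary is written as a Python literal instead of being read from workflow.yml.

```python
def selected_cases(include, prefix=""):
    """Flatten a nested include dictionary into relative case paths.

    Nested dictionaries are treated as directories (or template setups),
    True marks a case to run, and entries set to False are skipped.
    """
    paths = []
    for name, value in include.items():
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(value, dict):
            paths.extend(selected_cases(value, path))
        elif value is True:
            paths.append(path)
    return paths


include = {
    "cases": {
        "regular_case": True,
        "template_case": {"case1": True, "case2": True, "case3": False},
    }
}
print(selected_cases(include))
# ['cases/regular_case', 'cases/template_case/case1', 'cases/template_case/case2']
```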
If the case collection is very large, the include dictionary within the
workflow.yml can be created automatically using a function provided by the
multiphasepy package
mpyworkflow collect <directory_containing_cases>
which walks recursively through the directory and also includes all variants
of template cases. For more information see mpyworkflow collect --help.
There is also a single_timestep: true option which can be added to the
workflow.yml to quickly test a batch before its actual submission. Note
that this currently only works in combination with OpenFOAM Foundation software.
“How” to process
Profiles
are used to configure how batch runs are executed. There is a default
profile (profiles/default/config.yaml) which is always loaded and contains
settings that apply irrespective of the execution environment (PC/HPC), e.g. to
specify the software environment used for executing jobs.
Containers
It is possible to use the Apptainer container runtime by setting
use-apptainer: true
apptainer-args: "--home $HOME -B $(readlink -f $PWD)"
# Note that the setting "--home $HOME -B $(readlink -f $PWD)" will mount your
# home directory and the current working directory to the container, even if it
# is a link to a network storage. If mounting of other filesystems is necessary,
# e.g. on a computer cluster, add them like "-B $(readlink -f $PWD),/scratch".
config:
container: "oras://<registry>/<namespace>/<tag>:<image>.sif"
Note that Apptainer can also process Docker images, e.g. by
config:
container: "docker://<registry>/<namespace>/<tag>:<image>"
The container image used for execution can also be specified at the case level
by adding a file case.yml containing
workflow:
container: "oras://<different_registry>/<different_namespace>/<different_tag>:<different_image>.sif"
To forcibly turn off the use of a container for a specific case, you can also set the container configuration to null or false:
workflow:
container: null # or false
Further, the entry can be a template parameter in a case.yml.jinja file
workflow:
container: {{image}}
whereby the use of a separate image for every variant of a template setup is
possible by adding the name of the image to the caseParameterTable.ecsv file.
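The precedence between the profile-level and the case-level container setting described above can be summarized in a short sketch. It assumes that case.yml has already been parsed into a dictionary. The function name resolve_container is hypothetical and the code is not taken from the workflow itself.

```python
def resolve_container(default_container, case_config=None):
    """Return the container image for a case, or None to run without one.

    default_container: value of config.container from the default profile.
    case_config: parsed content of case.yml as a dict (None if absent).
    """
    workflow = (case_config or {}).get("workflow") or {}
    if "container" not in workflow:
        return default_container  # no case-level entry: keep the default
    container = workflow["container"]
    if container is None or container is False:
        return None               # null/false: container use is turned off
    return container              # case-level image takes precedence
```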
If the use of Apptainer is deactivated, then the execution environment is determined by the environment provided at the start of a workflow.
Note that Apptainer by default takes over all system environment variables, so activating a software environment, e.g., for OpenFOAM Foundation software before starting a workflow can lead to conflicts. It is possible to suppress this behavior by adding
apptainer-args: "--cleanenv"
as an option to Apptainer, but this may be undesirable if system environment variables are actually desired in the container run (e.g. Slurm variables for multi-node execution).
Environment Variables
It is possible to set environment variables for the entire workflow by setting
config:
container: ...
env_var:
NAME: "value"
in the profiles/default/config.yaml configuration file.
The environment variables will be available for every rule of the workflow.
License management options
Some CFD software requires the user to manage licenses for their simulations. Most commonly, this management takes the form of a license server that can issue a license for the simulation to run. If no license is available at the time of the request, the simulation will crash.
To avoid any inconvenience, the workflow integrates a rule that manages license availability:
local execution: The workflow will sleep until a license is available.
remote execution: The workflow will requeue the allrun group until a license is available.
The license checking is limited to the Allrun rule. To work properly, the license checker requires some configuration. It is also necessary to have built the workflow with the require_license flag. For an existing workflow, the functionality can be added by
mpyworkflow update .
and answering the questions accordingly.
License workflow configuration
license_command: in the workflow.yml provide the license command that
will be executed to check if a license is available:
license_command: 'license_checker'
The command should return “True” if enough licenses are available, “False” otherwise.
Additional predefined options can be passed to the license command for better context:
license_command: 'license_checker "${log}" "${software}"'
Available options are:
log: path to the check_license.log file.
software: name of the case simulation software.
license_server: URL of the license server.
cores: number of cores the simulation will run on.
Tip: If the license command is a bash script, it is highly recommended to use strict mode.
license_server: in the workflow.yml, provide the license server for each
simulation software needed:
license_server:
"Simcenter STAR-CCM+": 1999@starccm.server
"Ansys Fluent": 1055@fluent.server
The key of the license server should be the name of the simulation software that is configured in the case.yml file.
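A license command can be any executable. As a rough sketch, a Python-based checker could look like the one below. Several points are assumptions: that the answer is passed back by printing "True" or "False" to standard output, that the script is called as license_checker "${license_server}" "${software}", and that a site-specific query function exists. The function query_free_licenses is a hypothetical placeholder and must be replaced with a call to the vendor's license utility.

```python
import sys


def license_answer(free_licenses, required_licenses=1):
    """Return the string expected from the license command."""
    return "True" if free_licenses >= required_licenses else "False"


def query_free_licenses(license_server, software):
    """Hypothetical placeholder: ask the license server for free licenses."""
    raise NotImplementedError("site-specific license query goes here")


def main(argv):
    # Assumed call: license_checker "${license_server}" "${software}"
    if len(argv) < 3:
        print("False")  # fail closed if the arguments are incomplete
        return
    license_server, software = argv[1], argv[2]
    try:
        free = query_free_licenses(license_server, software)
    except Exception:
        free = 0  # treat any query failure as "no license available"
    print(license_answer(free))


if __name__ == "__main__":
    main(sys.argv)
```

Answering "False" on any failure is a deliberate choice in this sketch: a simulation that waits unnecessarily is usually cheaper than one that crashes for lack of a license.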
License case configuration
Each case can be configured using the case.yml file to use the license
checker:
...
simulation:
software: "Simcenter STAR-CCM+"
require_license: yes
...
software: name of the software that is used to identify the license server.
The default value for the software key is determined by the case type. Available case types are: ‘base’, ‘OpenFOAM’, ‘Simcenter STAR-CCM+’, ‘Ansys Fluent’.
require_license: whether the case requires license checking (default: false).
Additional Snakemake options
In the file profiles/default/config.yaml any command line option
of Snakemake can be added in order to make the actual command for starting a
workflow more compact.
Operating on High Performance Computer (HPC) clusters
For executing the batch run on an HPC system that uses the Slurm workload
manager, there is a separate profile (profiles/slurm/config.yaml) to specify
the partition to be used, among other things.
General command sequence (for local execution)
Quick start
To copy the cases listed in workflow.yml to the target directory and
batch-process the case-level Allrun scripts of all cases, execute
snakemake -c all
which will utilize all cores of your machine. Note that Snakemake always requires you to explicitly specify the number of cores for any command. This is to enforce a conscious choice of the resources used.
Step by step
The workflow definition is organized into so-called rules, whose execution is triggered by
snakemake <rule> -c <number_of_cores>
The rules in this workflow are named according to the various All*-scripts
that are provided with each case. To batch-process the case-level Allrun
scripts of all cases listed in workflow.yml, execute
snakemake Allrun -c <number_of_cores>
The Allrun rule is the default rule, hence the command
snakemake -c <number_of_cores>
is synonymous. It will copy (render) the cases into the target_dir directory
and run them, using the supplied number of cores. If the individual cases
require fewer cores, several simulations will run in parallel. On the other hand,
setups asking for more cores than provided in total are scaled down. Note that
retrieving and adjusting the number of cores from the simulation setup currently
only works for OpenFOAM Foundation software. The workflow automatically reads
and possibly adjusts the numberOfSubdomains entry in
${FOAM_CASE}/system/decomposeParDict.
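The scaling behavior described above can be illustrated with a small sketch. It is not the workflow's implementation: the function names are hypothetical, the dictionary is searched with a simple regular expression instead of a real OpenFOAM parser, and the parallel count assumes that all cases request the same number of cores.

```python
import re


def read_subdomains(decompose_par_dict_text):
    """Return the numberOfSubdomains entry as int, or None if absent."""
    match = re.search(
        r"^\s*numberOfSubdomains\s+(\d+)\s*;",
        decompose_par_dict_text,
        flags=re.MULTILINE,
    )
    return int(match.group(1)) if match else None


def effective_cores(requested, available):
    """Scale a request down to the total number of cores provided."""
    return max(1, min(requested, available))


def max_parallel_cases(requested, available):
    """Number of equally sized cases that fit side by side."""
    return available // effective_cores(requested, available)
```

With -c 8, a case asking for 2 cores leaves room for four such cases at once, while a case asking for 16 cores would be scaled down to 8.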
Other case-level scripts are batch-processed in a similar manner, e.g.
snakemake Allvalidate -c <number_of_cores>
snakemake Allupdate -c <number_of_cores>
or
snakemake Allclean -c <number_of_cores>
If a case doesn’t feature an Allvalidate or Allupdate script, its execution
is simply omitted for this case and no error is reported.
It is also possible to only initialize cases by
snakemake init -c <number_of_cores>
which will copy (render) the cases into the target_dir directory.
If execution of a rule was successful, triggering it again will not do anything.
You can force re-running with the --forceall, -F option, e.g.
snakemake Allrun -F -c <number_of_cores>
Using
snakemake --report -c 1
an HTML report can be generated which gathers PNG files from case-level
postProcessing/report subdirectories of all cases included in the workflow.
This will generate a file named report.html. Alternatively, you can specify
the name of the file by
snakemake --report <alternative_name>.html -c 1
This command also works if triggered for a failed workflow or in a separate shell for a running workflow, i.e. to generate a premature report. However, the workflow must be past the initialization stage, i.e.
snakemake init -c <number_of_cores>
must be complete. Note that, as a case database grows, the number of PNG files that are marked for inclusion in the report may become too large for a self-contained HTML file. You will notice this when the report takes too long to load in your browser. In this case, recreate the report by
snakemake --report <report_name>.zip -c 1
which generates a zip archive containing the <report_name>.html file
with links to the actual PNG files in a separate directory data.
Remote execution on a Slurm HPC cluster
For configuring the remote execution, specify the partition and the maximum
wall-time per job in the profiles/slurm/config.yaml.
Compared to a local execution, the command sequence for remote execution only differs in two aspects:
The profiles/slurm/config.yaml configuration file must be selected through the --profile option. Note that profiles/default/config.yaml is always loaded, so the settings in profiles/slurm/config.yaml apply additionally. An alternative profile could be created for use with other batch systems like PBS.
The -j, --jobs option is now needed, which specifies the maximum number of jobs submitted in parallel.
The set of commands to run the workflow can simply be issued on the submit node, i.e.
snakemake --profile profiles/slurm -j <number_of_jobs>
Cancelling the execution with Ctrl+c sends an scancel to the individual jobs
submitted by Snakemake.
Background Execution using tmux
To be able to log out while a workflow is running, use the terminal multiplexer tmux. Create a separate
session for every workflow.
tmux new-session -s <sessionName>
snakemake --profile profiles/slurm init -j <number_of_jobs>
snakemake --profile profiles/slurm -j <number_of_jobs>
Detach from the session by pressing Ctrl+b, then d. You can then log out.
In order to stop the workflow, reattach to the session
tmux attach -t <sessionName>
and cancel the script execution by pressing Ctrl+c. The current tmux session
can be killed with Ctrl+d or exit. Note that jobs submitted to an HPC system
by Snakemake will not be cancelled by killing a tmux session. Always cancel the
Snakemake process first.
Existing sessions can be listed with their id by
tmux ls
To reconnect to an existing session use:
tmux attach-session -t <session-id>
To kill an existing session use:
tmux kill-session -t <session-id>
For more in-depth usage, please refer to the documentation.
Debugging
The setting
keep-going: true # Keeps workflow alive if a single job fails.
in profiles/default/config.yaml causes the workflow to continue running even
if individual jobs fail. Snakemake will report possible errors, e.g.
Exiting because a job execution failed. Look below for error messages
Error in rule Allrun_case:
message: None
jobid: 7
input: run/<path_to_case>/.init
output: run/<path_to_case>/.Allrun
log: run/<path_to_case>/log.Allrun (check log file(s) for error details)
Errors occurred. Run `snakemake --summary -c 1 <rule, e.g. Allrun> | grep missing`
to get a list of failed cases.
Then check the corresponding log files.
Complete log(s): .snakemake/log/<timestamp>.snakemake.log
WorkflowError:
At least one job did not complete successfully.
The output in the shell: block is irrelevant, as it just points to the general
script for executing the All<script> at the case level. To debug a case,
consult the corresponding log.All<script> as indicated by:
log: run/case/log.Allrun (check log file(s) for error details)
The case-level All<script> stops executing at the first error that occurs. The
error message from the problematic command is shown in log.All<script> unless
the command writes to its own log file, e.g. if started with the OpenFOAM
Foundation software run function runApplication. In the latter case, consult
the individual log file.
There is also the command
snakemake --summary -c <number_of_cores> <rule, e.g. Allrun>
which can be filtered by
snakemake --summary -c <number_of_cores> | grep missing
to give a list of failed cases and the corresponding log files.
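The same filtering can be done from a Python script, for instance to post-process the list of failed cases. The sketch below simply mirrors the grep call. The function names are hypothetical, and nothing is assumed about the column layout of the summary beyond the word "missing" appearing in the lines of interest.

```python
import subprocess


def missing_entries(summary_text):
    """Return the summary lines that contain 'missing', like `grep missing`."""
    return [line for line in summary_text.splitlines() if "missing" in line]


def failed_cases(rule="Allrun", cores=1):
    """Run `snakemake --summary` for a rule and keep the 'missing' lines.

    Must be called from the directory in which the workflow is started.
    """
    result = subprocess.run(
        ["snakemake", "--summary", "-c", str(cores), rule],
        capture_output=True,
        text=True,
        check=True,
    )
    return missing_entries(result.stdout)
```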
Problematic cases can be fixed in-place, i.e. in the target_dir directory.
Then simply restart the rule for which the error occurred, e.g.
snakemake Allupdate -c <number_of_cores>
whereby only the failed jobs for this rule will be restarted.
If the execution time of previous rules has been negligible, it is better to
delete the problematic case from the target_dir directory and to repair it in
the source cases directory, to make the fix persistent.
Afterwards, simply restart the workflow
snakemake -c <number_of_cores>
and rerun follow-up rules. Again, only cases that are missing from the
target_dir directory are initialized and executed.
Sending a custom command to all cases of a batch run
In some circumstances it may be useful to send the same command to all cases
within a batch run. For this purpose, add the corresponding command to the
workflow.yml file, as for example
custom_command: "( cd validation && ./createGraphs )"
and execute it by (possibly in a separate terminal)
snakemake custom_command -c <number_of_cores>
The above example will regenerate plots for all cases included in the batch run,
provided the plot creation is handled by a script validation/createGraphs
at the case-level. This is particularly useful when reference solutions are
updated and a new report is desired. Other use cases for OpenFOAM Foundation
software include stopping simulations gracefully by
custom_command: "touch stop"
which requires that the cases are set up with the stopAtFile functionObject:
#includeFunc stopAtFile(action=writeNow)
If not, the same may be achieved by
custom_command: "foamDictionary system/controlDict -entry stopAt -set writeNow"
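Conceptually, a custom command amounts to running the same shell command once per case. The following sketch shows that idea outside of Snakemake, under the assumption that the command is executed with the case directory as working directory, which is what the examples above suggest. The function name run_in_cases is hypothetical, and unlike the workflow it runs the cases one after another.

```python
import subprocess


def run_in_cases(command, case_dirs):
    """Run a shell command in every case directory; return the exit codes.

    The result maps each case directory (as a string) to the exit code of
    the command, so failures can be spotted afterwards.
    """
    results = {}
    for case_dir in case_dirs:
        completed = subprocess.run(command, shell=True, cwd=case_dir)
        results[str(case_dir)] = completed.returncode
    return results
```

Calling run_in_cases("touch stop", case_dirs) would create the stop file in each listed case directory.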