# Simulation batch-processing with Snakemake

[Snakemake](https://snakemake.github.io/) is a Python-based tool originating from bioinformatics that allows you to create reproducible and scalable data analysis workflows. Here, Snakemake is used to facilitate the batch-processing of a collection of Computational Fluid Dynamics (CFD) simulation setups by leveraging three of its main strengths:

* **Portability**: The integration of [Apptainer](https://apptainer.org/)/[Singularity](https://sylabs.io/singularity/)/[Docker](https://www.docker.com/) allows the simulation software and all other software dependencies to be provided in the form of a container image.
* **Scalability**: For single- and multicore execution on workstations and computer clusters, only the total number of cores to be used or the number of jobs to be submitted in parallel has to be specified.
* **Reporting**: HTML reports containing plots from all simulations can be generated for sharing results with collaborators.

Furthermore, the workflow described here introduces an efficient way to conduct

* **Parameter studies** by means of case templating, eliminating the need to maintain multiple parameterized sub-versions of a simulation case.

## Prerequisites

### Installation

Note that a standalone installation of Snakemake will not work. The system described here depends on an installation of the `multiphasepy` package, which you can obtain from [PyPI](https://pypi.org). Installing the package automatically installs the correct minimum version of Snakemake together with a plugin for execution on High Performance Computing (HPC) clusters. Please refer to the [installation guideline](https://multiphase-python-repository-by-hzdr.readthedocs.io/en/latest/installation.html) for further details.

### Case collection structure requirements

It is recommended that all simulation setups are located in a `cases` subdirectory:
```shell
|--- cases                  # subdirectory containing simulation setups
|    |--- someSetup
|    |--- anotherSetup
|    |--- subdirectory
|         |--- yetAnotherSetup
```

If the case collection is already configured to run as a workflow, you should see the following files:

```shell
|--- profiles               # Define how to run the workflow (PC/HPC)
|    |--- default
|    |    |--- config.yaml  # Container selection, Snakemake settings, etc.
|    |--- slurm
|         |--- config.yaml  # Partition, walltime, etc.
|--- workflow               # Internal scripts for running the workflow
|--- workflow.yml           # List of cases to run
```

If this is not the case, the workflow system first needs to be enabled using the command-line utility [`mpyworkflow`](cli-tools/mpyworkflow).

### Simulation setup structure requirements

The purpose of the workflow is to enable convenient batch-processing of a larger collection of simulation setups. For the workflow to function, the individual setups must feature an executable script that contains all commands for

* **running a case**, `Allrun`, i.e. pre-processing, solution and post-processing.

While not mandatory, it is advisable to also provide scripts for

* **cleaning a case**, `Allclean`, i.e. resetting the case to its original state,
* **updating reference solutions**, `Allupdate`, which must copy the new results to the `validation/reference` directory at the level of the case,
* **validating results**, `Allvalidate`, i.e. comparing results against reference solutions.

Together, these scripts allow the case collection to be developed into a validation database.

Another requirement is that all PNG files created during post-processing are stored in a case-level directory called `postProcessing/report`, from which Snakemake gathers images/plots for the report. Plotting scripts must be written accordingly. This directory is generated automatically for each case when running the workflow.
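
As an illustration only, the following bash sketch shows what minimal `Allupdate` and `Allvalidate` logic could look like. It assumes, hypothetically, that the case writes the data to be validated into `postProcessing/validation`; that directory name and the helper names `update_reference` and `validate_results` are not part of the workflow. The comparison is an exact file-by-file `diff`, whereas real validation usually needs a tolerance-based comparison of numerical data.

```shell
#!/bin/bash
# Hypothetical helpers for case-level Allupdate/Allvalidate scripts.
# Assumption: the case writes the quantities to be validated into
# "postProcessing/validation" (not prescribed by the workflow; adapt it).
set -euo pipefail

results_dir="postProcessing/validation"
reference_dir="validation/reference"

# Allupdate: copy the new results to the case-level validation/reference dir.
update_reference() {
  local case_dir=$1
  mkdir -p "$case_dir/$reference_dir"
  cp -R "$case_dir/$results_dir"/. "$case_dir/$reference_dir"/
}

# Allvalidate: exact comparison against the reference solution.
# Returns non-zero (and prints the differences) if the results deviate.
validate_results() {
  local case_dir=$1
  diff -r "$case_dir/$reference_dir" "$case_dir/$results_dir"
}
```

A case-level `Allupdate` script could, for example, source these helpers and call `update_reference .` from within the case directory; `Allvalidate` would call `validate_results .` in the same way.
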
#### Regular setups

Regular setups do not require further parameterization and can also run stand-alone, outside of the workflow, using the corresponding `Allrun` script.

#### Template setups

To allow efficient parameter studies, cases may also be provided in the form of templates. A template features a top-level `caseParameterTable.ecsv` file in the [Astropy ECSV](https://docs.astropy.org/en/latest/io/ascii/ecsv.html) format that lists all case variations with the corresponding parameters and their units, e.g.

```text
# %ECSV 0.9
# ---
# datatype:
# - {name: case, datatype: string}
# - {name: floatParam, unit: kg*m / s^2, datatype: float64}
# - {name: intParam, unit: Pa, datatype: int16}
# - {name: stringParam, datatype: string}
case floatParam intParam stringParam
case1 1.0 1 one
case2 2.0 2 two
```

Any ASCII file provided with a setup can then be converted into a [template](https://jinja.palletsprojects.com/en/stable/) by appending the extension `.jinja` to its name and filling it with placeholders for the parameters instead of actual values. For using this system in combination with OpenFOAM Foundation software, it is recommended to add a top-level `caseParameterDict.jinja` file in the well-known dictionary format:

```shell
FoamFile
{
    format      ascii;
    class       dictionary;
    object      caseParameterDict;
}
// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //

case            {{case}};
floatParam      {{floatParam}};
intParam        {{intParam}};
stringParam     {{stringParam}};

// ************************************************************************* //
```

By using [slash syntax](https://github.com/OpenFOAM/OpenFOAM-dev/commit/6c8732), the values from this dictionary can be picked up in any subdictionary of the case, e.g.
```shell
FoamFile
{
    format      ascii;
    class       dictionary;
    object      caseParameterDict;
}
// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //

floatParam      ${${FOAM_CASE}/caseParameterDict!floatParam};
intParam        ${${FOAM_CASE}/caseParameterDict!intParam};
stringParam     ${${FOAM_CASE}/caseParameterDict!stringParam};

// ************************************************************************* //
```

When executing the workflow, files with the extension `.jinja` are rendered using the parameter values from `caseParameterTable.ecsv` for the selected case. The `.jinja` extension is removed in the process and the file becomes a regular dictionary.

Note that the workflow renders any number of cases on the fly. If you wish to render only a single case to run it standalone, use the command provided by the multiphasepy package

```shell
mpycopy --case
```

which copies the case and fills in the parameter values. For more information, see `mpycopy --help`.

## Execution

### Configuration

#### "What" to process

The cases to be included in a batch process are listed in the file `workflow.yml`, together with the target directory to which they are copied for execution. Note that the `include` dictionary must reflect the directory structure of the case collection:

```yaml
target_dir: run
include:
  cases:
    regular_case: true
    template_case:
      case1: true
      case2: true
      case3: false    # this case is ignored in the workflow
      # case4: true   # this case is ignored as well
...
```

If the case collection is very large, the `include` dictionary within `workflow.yml` can be created automatically using the command provided by the multiphasepy package

```shell
mpyworkflow collect
```

which walks recursively through the directory and also includes all variants of template cases. For more information, see `mpyworkflow collect --help`.

There is also a `single_timestep: true` option that can be added to `workflow.yml` to quickly test a batch before its actual submission.
Note that this currently only works in combination with OpenFOAM Foundation software.

#### "How" to process

[Profiles](https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles) are used to configure how batch runs are executed. There is a default profile (`profiles/default/config.yaml`) that is always loaded and contains settings that apply irrespective of the execution environment (PC/HPC), e.g. to specify the software environment used for executing jobs.

##### Containers

The Apptainer container runtime can be used by setting

```yaml
use-apptainer: true
apptainer-args: "--home $HOME -B $(readlink -f $PWD)"
# Note that the setting "--home $HOME -B $(readlink -f $PWD)" will mount your
# home directory and the current working directory to the container, even if it
# is a link to a network storage. If mounting of other filesystems is necessary,
# e.g. on a computer cluster, add them like "-B $(readlink -f $PWD),/scratch".
config:
  container: "oras:////:.sif"
```

Apptainer can also process Docker images, e.g.

```yaml
config:
  container: "docker:////:"
```

The container image used for execution can also be specified at the case level by adding a file `case.yml` containing

```yaml
workflow:
  container: "oras:////:.sif"
```

To turn off the use of a container for a specific case, set the container entry to null or false:

```yaml
workflow:
  container: null  # or false
```

Further, the entry can be a template parameter in a `case.yml.jinja` file

```yaml
workflow:
  container: {{image}}
```

so that a separate image can be used for every variant of a template setup by adding the name of the image to the `caseParameterTable.ecsv` file.

If Apptainer is deactivated, the execution environment is determined by the environment active at the start of the workflow.

Note that Apptainer by default passes all host environment variables into the container, so activating a software environment (e.g. for OpenFOAM Foundation software) before starting a workflow can lead to conflicts. This behavior can be suppressed by adding

```yaml
apptainer-args: "--cleanenv"
```

as an option to Apptainer, but this may be undesirable if host environment variables are actually needed inside the container (e.g. Slurm variables for multi-node execution).

##### Environment variables

Environment variables for the entire workflow can be set with

```yaml
config:
  container: ...
  env_var:
    NAME: "value"
```

in the `profiles/default/config.yaml` configuration file. These environment variables are available to every rule of the workflow.

##### License management options

Some CFD software requires users to manage licenses for their simulations. Most commonly, this takes the form of a license server that issues a license for each simulation run. If no license is available at the time of the request, the simulation crashes. To avoid this, the workflow integrates a rule that manages license availability:

* **local execution**: The workflow sleeps until a license is available.
* **remote execution**: The workflow requeues the Allrun group until a license is available.

License checking is limited to the Allrun rule. To work properly, the license checker requires some configuration. It is also necessary to have built the workflow with the **require_license** flag. For an existing workflow, the functionality can be added by running

```shell
mpyworkflow update .
```

and answering the questions accordingly.

###### License workflow configuration

**license_command**: In `workflow.yml`, provide the command that is executed to check whether a license is available:

```yaml
license_command: 'license_checker'
```

The command should return "True" if enough licenses are available and "False" otherwise.
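
As a minimal, hypothetical sketch of such a license command, the bash script below follows the strict-mode recommendation and takes the predefined options `log`, `software`, `license_server` and `cores` (described below) as positional arguments. The script name `license_checker`, the environment variable `FREE_LICENSES` and the one-license-per-run rule are assumptions for illustration; the stub `free_licenses` must be replaced by a query against your actual license manager. It also assumes that the workflow evaluates the printed `True`/`False` string.

```shell
#!/bin/bash
# license_checker: hypothetical sketch of a license command for workflow.yml.
# Prints "True" if enough licenses are free and "False" otherwise.
# Expected call: license_checker "${log}" "${software}" "${license_server}" "${cores}"
set -euo pipefail   # bash strict mode

# Placeholder query (assumption): replace with the status tool of your license
# manager, queried for the given server. Here the free-license count is read
# from FREE_LICENSES so that the sketch runs without a license server.
free_licenses() {
  local server=$1   # would be passed to the real query
  echo "${FREE_LICENSES:-0}"
}

check_license() {
  local log=$1 software=$2 server=$3 cores=$4
  local needed=1    # assumption: one license per run, independent of $cores
  local free
  free=$(free_licenses "$server")
  # Append a timestamped line to the check_license.log file.
  printf '%s %s@%s: %s free, %s needed (%s cores)\n' \
    "$(date '+%Y-%m-%dT%H:%M:%S')" "$software" "$server" "$free" "$needed" "$cores" >> "$log"
  if (( free >= needed )); then echo "True"; else echo "False"; fi
}

# Run only when called with the four expected arguments.
if (( $# == 4 )); then
  check_license "$@"
fi
```

Made executable and placed on the `PATH`, it could then be referenced as `license_command: 'license_checker "${log}" "${software}" "${license_server}" "${cores}"'`.
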

Additional predefined options can be passed to the license command for better context:

```yaml
license_command: 'license_checker "${log}" "${software}"'
```

Available options are:

* **log**: path to the `check_license.log` file.
* **software**: name of the case simulation software.
* **license_server**: address of the license server.
* **cores**: number of cores the simulation will run on.

> **Tip:** If the license command is a bash script, it is highly recommended to use strict mode (`set -euo pipefail`).

**license_server**: In `workflow.yml`, provide the license server for each simulation software needed:

```yaml
license_server:
  "Simcenter STAR-CCM+": 1999@starccm.server
  "Ansys Fluent": 1055@fluent.server
```

> The key of the license server should be the name of the simulation software as configured in the `case.yml` file.

###### License case configuration

Each case can be configured to use the license checker via its `case.yml` file:

```yaml
...
simulation:
  software: "Simcenter STAR-CCM+"
  require_license: yes
...
```

**software**: Name of the software, used to identify the license server.

> The default value of the `software` key is determined by the case type. Available case types are: 'base', 'OpenFOAM', 'Simcenter STAR-CCM+', 'Ansys Fluent'.

**require_license**: Whether the case requires license checking (default: false).

##### Additional Snakemake options

Any [command line option](https://snakemake.readthedocs.io/en/stable/executing/cli.html) of Snakemake can be added to the file `profiles/default/config.yaml` to make the actual command for starting a workflow more compact.

##### Operating on High Performance Computing (HPC) clusters

For executing the batch run on an HPC system that uses the Slurm workload manager, there is a separate profile (`profiles/slurm/config.yaml`) to specify the partition to be used, among other things.
### General command sequence (for local execution)

#### Quick start

To copy the cases listed in `workflow.yml` to the target directory and batch-process the case-level `Allrun` scripts of all cases, execute

```shell
snakemake -c all
```

which will utilize all cores of your machine. Note that Snakemake always requires you to explicitly specify the number of cores for any command. This enforces a conscious choice of the resources used.

#### Step by step

The workflow definition is organized into so-called *rules*, whose execution is triggered by

```shell
snakemake -c
```

The rules in this workflow are named after the various `All*` scripts that are provided with each case. To batch-process the case-level `Allrun` scripts of all cases listed in `workflow.yml`, execute

```shell
snakemake Allrun -c
```

The `Allrun` rule is the default rule, hence the command

```shell
snakemake -c
```

is synonymous. It copies (renders) the cases into the `target_dir` directory and runs them, using the supplied number of cores. If the individual cases require fewer cores, several simulations run in parallel. Conversely, setups asking for more cores than provided in total are scaled down. Note that retrieving and adjusting the number of cores from the simulation setup currently only works for OpenFOAM Foundation software. The workflow automatically reads and, if necessary, adjusts the `numberOfSubdomains` entry in `${FOAM_CASE}/system/decomposeParDict`.

Other case-level scripts are batch-processed in a similar manner, e.g.

```shell
snakemake Allvalidate -c
snakemake Allupdate -c
```

or

```shell
snakemake Allclean -c
```

If a case doesn't feature an `Allvalidate` or `Allupdate` script, its execution is simply skipped for this case and no error is reported.

It is also possible to only initialize cases by

```shell
snakemake init -c
```

which copies (renders) the cases into the `target_dir` directory.
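
For OpenFOAM cases, the core-count adjustment mentioned above can be pictured with the following bash sketch. It is not the workflow's own implementation; the function name `cap_subdomains` is hypothetical, and it assumes a decomposition method such as `scotch` that needs no method-specific coefficients (with e.g. `simple`, the `n` vector would have to be changed as well).

```shell
#!/bin/bash
# Hypothetical sketch: cap numberOfSubdomains in an OpenFOAM decomposeParDict
# to the number of available cores. Illustration only, not the workflow's
# implementation. Assumes a method like scotch that needs no extra coefficients.
cap_subdomains() {
  local dict=$1 max_cores=$2 n
  # Read the value from a line like "numberOfSubdomains 8;".
  n=$(awk '$1 == "numberOfSubdomains" { gsub(";", "", $2); print $2; exit }' "$dict")
  if (( n > max_cores )); then
    # Rewrite the entry in place, keeping a backup as <dict>.bak.
    sed -i.bak "s/^\([[:space:]]*numberOfSubdomains[[:space:]]*\)[0-9]*[[:space:]]*;/\1${max_cores};/" "$dict"
    n=$max_cores
  fi
  echo "$n"   # number of subdomains (cores) the case will use
}
```

For instance, `cap_subdomains system/decomposeParDict 4` run inside a case would print the number of subdomains the case then uses, rewriting the entry only if it exceeded 4.
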

If execution of a rule was successful, triggering it again will not do anything. You can force re-running with the `--forceall`/`-F` option, e.g.

```shell
snakemake Allrun -F -c
```

An HTML report that gathers the PNG files from the case-level `postProcessing/report` subdirectories of all cases included in the workflow can be generated using

```shell
snakemake --report -c 1
```

This generates a file named `report.html`. Alternatively, you can specify the name of the file by

```shell
snakemake --report .html -c 1
```

This command also works for a failed workflow or, in a separate shell, for a running workflow, e.g. to generate a preliminary report. However, the workflow must be past the initialization stage, i.e.

```shell
snakemake init -c
```

must be complete.

Note that, as a case database grows, the number of PNG files marked for inclusion in the report may become too high for a self-contained HTML file. You will notice this when the report takes too long to load in your browser. In this case, recreate the report by

```shell
snakemake --report report_name.zip -c 1
```

which generates a zip archive containing the `.html` file, which links to the actual PNG files in a separate directory `data`.

### Remote execution on a Slurm HPC cluster

To configure the remote execution, specify the partition and the maximum wall time per job in `profiles/slurm/config.yaml`. Compared to a local execution, the command sequence for remote execution only differs in two aspects:

* The `profiles/slurm/config.yaml` configuration file must be selected through the `--profile` option. Note that `profiles/default/config.yaml` is always loaded, so the settings in `profiles/slurm/config.yaml` apply additionally. An alternative profile could be created for use with other batch systems like PBS.
* The `-j`/`--jobs` option is now needed, which specifies the maximum number of jobs submitted in parallel.

The commands to run the workflow can simply be issued on the submit node, i.e.

```shell
snakemake --profile profiles/slurm -j
```

Cancelling the execution with `Ctrl+c` sends an `scancel` to the individual jobs submitted by Snakemake.

### Background execution using tmux

To be able to log out while the workflow is running, use the terminal multiplexer `tmux`. Create a separate session for every workflow:

```shell
tmux new-session -s
snakemake --profile profiles/slurm init -j
snakemake --profile profiles/slurm -j
```

Detach from the session by pressing `Ctrl+b`, then `d`. You can then log out. In order to stop the workflow, reattach to the session

```shell
tmux attach -t
```

and cancel the script execution by pressing `Ctrl+c`. The current tmux session can be killed with `Ctrl+d` or `exit`. Note that jobs submitted to an HPC system by Snakemake will not be cancelled by killing a tmux session. Always cancel the Snakemake process first.

Existing sessions can be listed with their IDs by

```shell
tmux ls
```

To reconnect to an existing session, use:

```shell
tmux attach-session -t
```

To kill an existing session, use:

```shell
tmux kill-session -t
```

For more in-depth usage, please refer to the [tmux documentation](https://github.com/tmux/tmux/wiki).

### Debugging

The setting

```yaml
keep-going: true  # Keeps workflow alive if a single job fails.
```

in `profiles/default/config.yaml` causes the workflow to continue running even if individual jobs fail. Snakemake reports errors, e.g.

```shell
Exiting because a job execution failed. Look below for error messages
Error in rule Allrun_case:
    message: None
    jobid: 7
    input: run//.init
    output: run//.Allrun
    log: run//log.Allrun (check log file(s) for error details)
Errors occurred.
Run `snakemake --summary -c 1 | grep missing` to get a list of failed cases.
Then check the corresponding log files.
Complete log(s): .snakemake/log/.snakemake.log
WorkflowError: At least one job did not complete successfully.
```

The output in the `shell:` block is irrelevant, as it just points to the general script for executing the `All