# Simulation batch-processing with Snakemake

[Snakemake](https://snakemake.github.io/) is a Python-based tool originating
from bioinformatics that allows the creation of reproducible and scalable data
analysis workflows. Here, Snakemake is used to facilitate the batch-processing
of a collection of Computational Fluid Dynamics (CFD) simulation setups by
leveraging three of its main strengths:

* **Portability**: The integration of [Apptainer](https://apptainer.org/)/
  [Singularity](https://sylabs.io/singularity/)/
  [Docker](https://www.docker.com/) makes it possible to provide the simulation
  software and all other software dependencies in the form of a container image.
* **Scalability**: For single- and multicore execution on workstations and
  computer clusters only the total number of cores to be used or jobs to be
  submitted in parallel has to be specified.
* **Reporting**: HTML reports containing plots from all simulations can be
  generated for sharing results with collaborators.

Furthermore, the workflow described here introduces an efficient way to conduct

* **Parameter studies** by means of case templating, eliminating the need for
  maintaining multiple parameterized sub-versions of a simulation case.

## Prerequisites

### Installation

Note that a standalone installation of Snakemake will not work. The system
described here depends on an installation of the
`multiphasepy` package. You can obtain it from [PyPI](https://pypi.org).
Installing the package automatically installs the correct minimum version of
Snakemake together with a plugin for execution on High Performance Computer
(HPC) clusters. Please refer to the
[installation guideline](https://multiphase-python-repository-by-hzdr.readthedocs.io/en/latest/installation.html)
for further details.

### Case collection structure requirements

It is recommended that all simulation setups are located in a `cases`
subdirectory.

```shell
|--- cases                       # subdirectory containing simulation setups
|   |--- someSetup
|   |--- anotherSetup
|   |--- subdirectory
|       |--- yetAnotherSetup
```

If the case collection is already configured to run as a workflow, you should
see the following files:

```shell
|--- profiles                    # Define how to run the workflow (PC/HPC),
|   |--- default
|   |   |--- config.yaml         # Container selection, Snakemake settings, etc.
|   |--- slurm
|       |--- config.yaml         # Partition, walltime, etc.
|--- workflow                    # Internal scripts for running the workflow
|--- workflow.yml                # List of cases to run
```

If this is not the case, the workflow system first needs to be enabled using the
command line utility [`mpyworkflow`](cli-tools/mpyworkflow).

### Simulation setup structure requirements

The purpose of the workflow is to enable convenient batch-processing of a larger
collection of simulation setups. For the workflow to function, the individual
setups must feature an executable script that contains all commands for

* **running a case**, `Allrun`, i.e. pre-processing, solution and
  post-processing

While not mandatory, it is advisable to also provide scripts for

* **cleaning a case**, `Allclean`, i.e. for resetting the case to its original
  state
* **updating reference solutions**, `Allupdate`, which must copy the new
  results to the `validation/reference` directory at the level of the case
* **validating results**, `Allvalidate`, i.e. comparing results against
  reference solutions

These scripts allow the case collection to be developed into a validation
database.

Another requirement is that all PNG files created during post-processing are
stored in a case-level directory called `postProcessing/report`, from which
Snakemake will gather images/plots for the report. Plotting scripts must be
written accordingly. This directory is generated automatically for each case
when running the workflow.
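
A plotting script can derive the output location from the case directory. The
following minimal Python sketch assumes a hypothetical helper `report_file`
(not part of `multiphasepy`) and a made-up plot name `velocityProfile`. It
returns a PNG path inside `postProcessing/report` and creates the directory if
it is missing, so that the script can also be used outside of the workflow:

```python
import tempfile
from pathlib import Path


def report_file(case_dir, name):
    """Return the PNG path for a report image inside the given case."""
    report_dir = Path(case_dir) / "postProcessing" / "report"
    # The workflow creates this directory itself; creating it here as well
    # keeps the plotting script usable for stand-alone runs.
    report_dir.mkdir(parents=True, exist_ok=True)
    return report_dir / f"{name}.png"


# Demonstration in a temporary directory standing in for a case directory.
with tempfile.TemporaryDirectory() as case_dir:
    target = report_file(case_dir, "velocityProfile")
    directory_created = target.parent.is_dir()
    relative_target = target.relative_to(case_dir)

print(relative_target.as_posix())
```

With Matplotlib, a figure could then be stored by
`fig.savefig(report_file(".", "velocityProfile"))`.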

#### Regular setups

Regular setups are setups that do not require further parameterization and can
also be run stand-alone outside of the workflow using the corresponding
`Allrun` script.

#### Template setups

To allow efficient parameter studies, cases may also be provided in the form of
templates, featuring a top-level `caseParameterTable.ecsv` file in the
[Astropy ECSV](https://docs.astropy.org/en/latest/io/ascii/ecsv.html) format
that lists all case variations with the corresponding parameters and their
units, e.g.

```text
# %ECSV 0.9
# ---
# datatype:
# - {name: case, datatype: string}
# - {name: floatParam, unit: kg*m / s^2, datatype: float64}
# - {name: intParam, unit: Pa, datatype: int16}
# - {name: stringParam, datatype: string}
case   floatParam  intParam stringParam
case1  1.0         1        one
case2  2.0         2        two
```
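
To illustrate how such a table maps case names to parameter sets, the following
sketch reads the data rows with the Python standard library only. It is a
simplified stand-in, not the reader used by the workflow: the function name
`read_case_table` is hypothetical, header metadata (data types and units) is
ignored, all values are kept as strings, and quoted values containing spaces
are not supported. In practice, such a file can be read with Astropy, e.g.
`Table.read("caseParameterTable.ecsv", format="ascii.ecsv")`.

```python
ECSV_TEXT = """\
# %ECSV 0.9
# ---
# datatype:
# - {name: case, datatype: string}
# - {name: floatParam, unit: kg*m / s^2, datatype: float64}
# - {name: intParam, unit: Pa, datatype: int16}
# - {name: stringParam, datatype: string}
case   floatParam  intParam stringParam
case1  1.0         1        one
case2  2.0         2        two
"""


def read_case_table(text):
    """Return {case name: {parameter name: value as string}}."""
    # Header lines start with '#'; the first remaining line holds column names.
    rows = [
        line.split()
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    column_names, data_rows = rows[0], rows[1:]
    table = {}
    for row in data_rows:
        record = dict(zip(column_names, row))
        # The 'case' column identifies the variant; the rest are parameters.
        table[record.pop("case")] = record
    return table


table = read_case_table(ECSV_TEXT)
print(table["case2"])
```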

Any ASCII file provided with a setup can then be converted into a
[template](https://jinja.palletsprojects.com/en/stable/) by adding the ending
`.jinja` to it and filling it with placeholders for the parameters rather than
actual values. For using this system in combination with the OpenFOAM Foundation
Software, it is recommended to add a top-level `caseParameterDict.jinja` file in
the well-known dictionary format:

```cpp
FoamFile
{
    format      ascii;
    class       dictionary;
    object      caseParameterDict;
}
// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //

case        {{case}};

floatParam  {{floatParam}};
intParam    {{intParam}};
stringParam {{stringParam}};

// ************************************************************************* //
```

By using [slash syntax](https://github.com/OpenFOAM/OpenFOAM-dev/commit/6c8732),
the values from this dictionary can be picked up in any subdictionary of the
case, e.g.

```cpp
FoamFile
{
    format      ascii;
    class       dictionary;
    object      caseParameterDict;
}
// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //

floatParam  ${${FOAM_CASE}/caseParameterDict!floatParam};
intParam    ${${FOAM_CASE}/caseParameterDict!intParam};
stringParam ${${FOAM_CASE}/caseParameterDict!stringParam};

// ************************************************************************* //
```

When executing the workflow, files with the ending `.jinja` will be rendered
using the parameter values from the `caseParameterTable.ecsv` for the selected
case. The `.jinja` ending is eliminated in that process and the file becomes a
regular dictionary.
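
The rendering step can be pictured with the following sketch. It is only an
illustration under stated assumptions: the workflow relies on Jinja, whereas
this stand-in uses a regular expression from the Python standard library and
handles nothing but plain `{{name}}` placeholders. The function names `render`
and `rendered_name` are hypothetical.

```python
import re
from pathlib import Path

# Matches plain placeholders such as {{floatParam}} or {{ floatParam }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template_text, parameters):
    """Replace every {{name}} placeholder by the value of parameters[name]."""
    return PLACEHOLDER.sub(lambda m: str(parameters[m.group(1)]), template_text)


def rendered_name(template_path):
    """Drop the last suffix, which is assumed to be .jinja."""
    return Path(template_path).with_suffix("").name


template_text = "floatParam  {{floatParam}};\nstringParam {{stringParam}};\n"
# Parameter values of one table row, here the variant named case2.
case2 = {"floatParam": 2.0, "intParam": 2, "stringParam": "two"}

rendered = render(template_text, case2)
print(rendered_name("caseParameterDict.jinja"))
print(rendered)
```

A placeholder without a matching parameter raises a `KeyError` in this sketch,
which makes missing table columns visible early.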

Note that the workflow renders any number of cases on the fly. If you wish to
render only a single case to run it standalone, use the command line utility
`mpycopy` provided by the `multiphasepy` package

```shell
mpycopy --case <case-name> <template-directory> <destination-directory>
```

which copies the case and fills in parameter values. For more information see
`mpycopy --help`.

## Execution

### Configuration

#### "What" to process

The cases to be included in a batch-process are listed in the file
`workflow.yml`, together with the target directory to which they are copied
for execution. Note that the `include` dictionary must reflect the directory
structure of the case collection:

```yaml
target_dir: run
include:
  cases:
    regular_case: true
    template_case:
      case1: true
      case2: true
      case3: false # this case is ignored in the workflow
      # case4: true # this case is ignored as well
      ...
```
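
How the nested `include` dictionary translates into a list of selected cases
can be sketched in Python as follows. This is an assumption-based illustration,
not the workflow's own code: the function name `enabled_cases` is hypothetical,
nested keys are joined into path-like identifiers, and only entries set to
`true` are considered selected.

```python
def enabled_cases(include, prefix=""):
    """Return path-like identifiers of all entries that are set to True."""
    paths = []
    for name, value in include.items():
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(value, dict):
            # A nested dictionary is a subdirectory or a template case.
            paths.extend(enabled_cases(value, path))
        elif value is True:
            paths.append(path)
        # Entries set to False (or anything else) are skipped.
    return paths


# The include dictionary from above, as obtained after parsing the YAML file.
include = {
    "cases": {
        "regular_case": True,
        "template_case": {"case1": True, "case2": True, "case3": False},
    }
}

selected = enabled_cases(include)
print(selected)
```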

If the case collection is very large, the `include` dictionary within the
`workflow.yml` can be created automatically using the command line utility
`mpyworkflow` provided by the `multiphasepy` package

```shell
mpyworkflow collect <directory_containing_cases>
```

which walks recursively through the directory and also includes all variants
of template cases. For more information see `mpyworkflow collect --help`.
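
The idea behind such a recursive collection can be sketched in Python. This is
not the implementation of `mpyworkflow collect`, only an illustration under the
assumption that a directory counts as a case as soon as it contains an `Allrun`
script. The function name `collect_cases` is hypothetical, and variants of
template cases are not expanded here.

```python
import os
import tempfile
from pathlib import Path


def collect_cases(root):
    """Build a nested include dictionary for all cases found below root."""
    root = Path(root)
    include = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        if "Allrun" in filenames:
            # Keys mirror the directory structure, starting at root itself.
            parts = Path(dirpath).relative_to(root.parent).parts
            node = include
            for part in parts[:-1]:
                node = node.setdefault(part, {})
            node[parts[-1]] = True
            dirnames.clear()  # do not descend into a case directory
    return include


# Demonstration with a temporary case collection.
with tempfile.TemporaryDirectory() as tmp:
    for case in ("someSetup", "subdirectory/yetAnotherSetup"):
        case_dir = Path(tmp, "cases", case)
        case_dir.mkdir(parents=True)
        (case_dir / "Allrun").touch()
    include = collect_cases(Path(tmp, "cases"))

print(include)
```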

There is also a `single_timestep: true` option which can be added to the
`workflow.yml` to quickly test a batch before its actual submission. Note
that this currently only works in combination with OpenFOAM Foundation software.

#### "How" to process

[Profiles](https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles)
are used to configure how batch runs are executed. There is a default
profile (`profiles/default/config.yaml`) which is always loaded and contains
settings that apply irrespective of the execution environment (PC/HPC), e.g. to
specify the software environment used for executing jobs.

##### Containers

It is possible to use the Apptainer container runtime by setting

```yaml
use-apptainer: true
apptainer-args: "--home $HOME -B $(readlink -f $PWD)"
# Note that the setting "--home $HOME -B $(readlink -f $PWD)" will mount your
# home directory and the current working directory to the container, even if it
# is a link to a network storage. If mounting of other filesystems is necessary,
# e.g. on a computer cluster, add them like "-B $(readlink -f $PWD),/scratch".

config:
    container: "oras://<registry>/<namespace>/<tag>:<image>.sif"
```

Note that Apptainer can also process Docker images, e.g. by

```yaml
config:
    container: "docker://<registry>/<namespace>/<tag>:<image>"
```

The container image used for execution can also be specified at the case level
by adding a file `case.yml` containing

```yaml
workflow:
    container: "oras://<different_registry>/<different_namespace>/<different_tag>:<different_image>.sif"
```

To forcibly turn off the use of a container for a specific case, you can also
set the container configuration to `null` or `false`:

```yaml
workflow:
    container: null # or false
```

Further, the entry can be a template parameter in a `case.yml.jinja` file

```yaml
workflow:
    container: {{image}}
```

whereby the use of a separate image for every variant of a template setup is
possible by adding the name of the image to the `caseParameterTable.ecsv` file.

If the use of Apptainer is deactivated, then the execution environment is
determined by the environment provided at the start of the workflow.

Note that Apptainer by default passes all system environment variables on to
the container, so activating a software environment, e.g. for OpenFOAM
Foundation software, before starting a workflow can lead to conflicts. It is
possible to suppress this behavior by adding

```yaml
apptainer-args: "--cleanenv"
```

as an option to Apptainer, but this may be undesirable if system environment
variables are actually desired in the container run (e.g. Slurm variables for
multi-node execution).

##### Environment Variables

It is possible to set environment variables for the entire workflow by setting

```yaml
config:
  container: ...
  env_var:
    NAME: "value"
```

in the `profiles/default/config.yaml` configuration file.

The environment variables will be available for every rule of the workflow.

##### License management options

Some CFD software requires the user to manage licenses for their simulations.
Most commonly, this management takes the form of a license server that issues
a license for the simulation to run. If no license is available at the time of
the request, the simulation will crash.

To avoid this, the workflow integrates a rule that manages license
availability:

* **local execution**: The workflow will sleep until a license is available.
* **remote execution**: The workflow will requeue the allrun group until a
  license is available.

License checking is limited to the `Allrun` rule. To work properly, the
license checker requires some configuration. In addition, the workflow must
have been built with the **require_license** flag. For an existing workflow,
the functionality can be added by

```shell
mpyworkflow update .
```

and answering the questions accordingly.

###### License workflow configuration

**license_command**: In the `workflow.yml`, provide the license command that
is executed to check whether a license is available:

```yaml
license_command: 'license_checker'
```

The command should return "True" if enough licenses are available, "False"
otherwise.

Additional predefined options can be passed to the license command for better
context:

```yaml
license_command: 'license_checker "${log}" "${software}"'
```

Available options are:

* **log**: path to the `check_license.log` file.
* **software**: name of the case simulation software.
* **license_server**: URL of the license server.
* **cores**: number of cores the simulation will run on.

> Tip: If the license command is a bash script, it is highly recommended to
> use strict mode (`set -euo pipefail`).

**license_server**: In the `workflow.yml`, provide the license server for each
simulation software needed:

```yaml
license_server:
    "Simcenter STAR-CCM+": 1999@starccm.server
    "Ansys Fluent": 1055@fluent.server
```

> The key of each license server entry should be the name of the simulation
> software that is configured in the `case.yml` file.

###### License case configuration

Each case can be configured using the `case.yml` file to use the license
checker:

```yaml
...
simulation:
    software: "Simcenter STAR-CCM+"
    require_license: yes
...
```

**software**: name of the software that is used to identify the license server.
> The default value for the software key is determined by the case type.
> Available case types are: 'base', 'OpenFOAM', 'Simcenter STAR-CCM+',
> 'Ansys Fluent'.

**require_license**: whether the case requires license checking (default:
false).

##### Additional Snakemake options

In the file `profiles/default/config.yaml` any [command line option](https://snakemake.readthedocs.io/en/stable/executing/cli.html)
of Snakemake can be added in order to make the actual command for starting a
workflow more compact.

##### Operating on High Performance Computer (HPC) clusters

For executing the batch run on an HPC system that uses the Slurm workload
manager, there is a separate profile (`profiles/slurm/config.yaml`) to specify
the partition to be used, among other things.

### General command sequence (for local execution)

#### Quick start

To copy the cases listed in `workflow.yml` to the target directory and
batch-process the case-level `Allrun` scripts of all cases, execute

```shell
snakemake -c all
```

which will utilize all cores of your machine. Note that Snakemake always
requires you to explicitly specify the number of cores for any command. This
is to enforce a conscious choice of the resources used.

#### Step by step

The workflow definition is organized into so-called *rules*, whose execution is
triggered by

```shell
snakemake <rule> -c <number_of_cores>
```

The rules in this workflow are named according to the various `All*`-scripts
that are provided with each case. To batch-process the case-level `Allrun`
scripts of all cases listed in `workflow.yml`, execute

```shell
snakemake Allrun -c <number_of_cores>
```

The `Allrun` rule is the default rule, hence the command

```shell
snakemake -c <number_of_cores>
```

is synonymous. It will copy (render) the cases into the `target_dir` directory
and run them, using the supplied number of cores. If the individual cases
require fewer cores, several simulations will run in parallel. On the other
hand, setups asking for more cores than provided in total are scaled down. Note that
retrieving and adjusting the number of cores from the simulation setup currently
only works for OpenFOAM Foundation software. The workflow automatically reads
and possibly adjusts the `numberOfSubdomains` entry in
`${FOAM_CASE}/system/decomposeParDict`.
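
The scaling-down step can be pictured with the following Python sketch. It is
not the workflow's implementation, only an illustration of the idea under
stated assumptions: the function name `clamp_subdomains` is hypothetical, the
dictionary is handled as plain text with a regular expression, and a missing
`numberOfSubdomains` entry is treated as a serial case.

```python
import re

# Captures the keyword with its spacing, the number and the closing semicolon.
SUBDOMAINS = re.compile(r"^(\s*numberOfSubdomains\s+)(\d+)(\s*;)", re.MULTILINE)


def clamp_subdomains(dict_text, total_cores):
    """Return (possibly adjusted dictionary text, number of cores granted)."""
    match = SUBDOMAINS.search(dict_text)
    if match is None:
        return dict_text, 1  # assumption: no entry means a serial case
    requested = int(match.group(2))
    granted = min(requested, total_cores)
    adjusted = SUBDOMAINS.sub(
        lambda m: f"{m.group(1)}{granted}{m.group(3)}", dict_text, count=1
    )
    return adjusted, granted


dict_text = "numberOfSubdomains 16;\nmethod          scotch;\n"

# A case asking for 16 cores is scaled down when only 8 are provided.
scaled_text, scaled_cores = clamp_subdomains(dict_text, total_cores=8)
# With 32 cores provided, the entry is left untouched.
same_text, same_cores = clamp_subdomains(dict_text, total_cores=32)

print(scaled_text)
```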

Other case-level scripts are batch-processed in a similar manner, e.g.

```shell
snakemake Allvalidate -c <number_of_cores>
snakemake Allupdate -c <number_of_cores>
```

or

```shell
snakemake Allclean -c <number_of_cores>
```

If a case doesn't feature an `Allvalidate` or `Allupdate` script, its execution
is simply omitted for this case and no error is reported.

It is also possible to only initialize cases by

```shell
snakemake init -c <number_of_cores>
```

which will copy (render) the cases into the `target_dir` directory.

If execution of a rule was successful, triggering it again will not do anything.
You can force re-running with the `--forceall, -F` option, e.g.

```shell
snakemake Allrun -F -c <number_of_cores>
```

Using

```shell
snakemake --report -c 1
```

an HTML report can be generated which gathers PNG files from case-level
`postProcessing/report` subdirectories of all cases included in the workflow.
This will generate a file named `report.html`. Alternatively, you can specify
the name of the file by

```shell
snakemake --report <alternative_name>.html -c 1
```

This command also works if triggered for a failed workflow or in a separate
shell for a running workflow, i.e. to generate a preliminary report. However,
the workflow must be past the initialization stage, i.e.

```shell
snakemake init -c <number_of_cores>
```

must be complete. Note that, as a case database grows, the number of PNG files
that are marked for inclusion in the report may become too large for a
self-contained HTML file. You will notice this when the report takes too long
to load in your browser. In this case, recreate the report by

```shell
snakemake --report <report_name>.zip -c 1
```

which generates a zip archive containing the `<report_name>.html` file with
links to the actual PNG files in a separate directory `data`.

### Remote execution on a Slurm HPC cluster

For configuring the remote execution, specify the partition and the maximum
wall-time per job in the `profiles/slurm/config.yaml`.

Compared to a local execution, the command sequence for remote execution only
differs in two aspects:

* The `profiles/slurm/config.yaml` configuration file must be selected through
  the `--profile` option. Note that `profiles/default/config.yaml` is always
  loaded, so the settings in `profiles/slurm/config.yaml` apply additionally.
  An alternative profile could be created for use with other batch systems like
  PBS.
* The `-j, --jobs` option is now required; it specifies the maximum number of
  jobs submitted in parallel.

The set of commands to run the workflow can simply be issued on the submit node,
i.e.

```shell
snakemake --profile profiles/slurm -j <number_of_jobs>
```

Cancelling the execution with `Ctrl+c` sends an `scancel` to the individual
jobs submitted by Snakemake.

### Background Execution using tmux

To be able to log out while a workflow is running, use the terminal
multiplexer `tmux`. Create a separate session for every workflow.

```shell
tmux new-session -s <sessionName>
snakemake --profile profiles/slurm init -j <number_of_jobs>
snakemake --profile profiles/slurm -j <number_of_jobs>
```

Detach from the session by pressing `Ctrl+b`, then `d`. You can then log out.
In order to stop the workflow, reattach to the session

```shell
tmux attach -t <sessionName>
```

and cancel the script execution by pressing `Ctrl+c`. The current tmux session
can be killed with `Ctrl+d` or `exit`. Note that jobs submitted to an HPC system
by Snakemake will not be cancelled by killing a tmux session. Always cancel the
Snakemake process first.

Existing sessions can be listed with their id by

```shell
tmux ls
```

To reconnect to an existing session use:

```shell
tmux attach-session -t <session-id>
```

To kill an existing session use:

```shell
tmux kill-session -t <session-id>
```

For more in-depth usage, please refer to the
[tmux documentation](https://github.com/tmux/tmux/wiki).

### Debugging

The setting

```yaml
keep-going: true      # Keeps workflow alive if a single job fails.
```

in `profiles/default/config.yaml` causes the workflow to continue running even
if individual jobs fail. Snakemake will report any errors, e.g.

```shell
Exiting because a job execution failed. Look below for error messages
Error in rule Allrun_case:
    message: None
    jobid: 7
    input: run/<path_to_case>/.init
    output: run/<path_to_case>/.Allrun
    log: run/<path_to_case>/log.Allrun (check log file(s) for error details)
Errors occurred. Run `snakemake --summary -c 1 <rule, e.g. Allrun> | grep missing`
to get a list of failed cases.
Then check the corresponding log files.
Complete log(s): .snakemake/log/<timestamp>.snakemake.log
WorkflowError:
At least one job did not complete successfully.
```

The output in the `shell:` block is irrelevant, as it just points to the general
script for executing the `All<script>` at the case level. To debug a case,
consult the corresponding `log.All<script>` as indicated by:

```shell
    log: run/case/log.Allrun (check log file(s) for error details)
```

The case-level `All<script>` stops executing at the first error that occurs. The
error message from the problematic command is shown in `log.All<script>` unless
the command writes to its own log file, e.g. if started with the OpenFOAM
Foundation software run function `runApplication`. In the latter case, consult
the individual log file.

There is also the command

```shell
snakemake --summary -c <number_of_cores> <rule, e.g. Allrun>
```

which can be filtered by

```shell
snakemake --summary -c <number_of_cores> | grep missing
```

to give a list of failed cases and the corresponding log files.

Problematic cases can be fixed in-place, i.e. in the `target_dir` directory.
Then simply restart the rule for which the error occurred, e.g.

```shell
snakemake Allupdate -c <number_of_cores>
```

whereby only the failed jobs for this rule will be restarted.

If the execution time of previous rules has been negligible, it is better to
delete the problematic case from the `target_dir` directory and to repair it in
the source `cases` directory, to make the fix persistent.

Afterwards, simply restart the workflow

```shell
snakemake -c <number_of_cores>
```

and rerun follow-up rules. Again, only cases that are missing from the
`target_dir` directory are initialized and executed.

### Sending a custom command to all cases of a batch run

In some circumstances it may be useful to send the same command to all cases
within a batch run. For this purpose, add the corresponding command to the
`workflow.yml` file, as for example

```yaml
custom_command: "( cd validation && ./createGraphs )"
```

and execute it by (possibly in a separate terminal)

```shell
snakemake custom_command -c <number_of_cores>
```

The above example will regenerate plots for all cases included in the batch run,
provided the plot creation is handled by a script `validation/createGraphs`
at the case level. This is particularly useful when reference solutions are
updated and a new report is desired. Other use cases for OpenFOAM Foundation
software include stopping simulations gracefully by
```yaml
custom_command: "touch stop"
```

which requires that the cases are set up with the `stopAtFile` functionObject:

```cpp
#includeFunc    stopAtFile(action=writeNow)
```

If not, the same may be achieved by

```yaml
custom_command: "foamDictionary system/controlDict -entry stopAt -set writeNow"
```