8. `ch-run`

Run a command in a Charliecloud container.

8.1. Synopsis

$ ch-run [OPTION...] IMAGE -- COMMAND [ARG1 [ARG2 [...]]]

8.2. Description

Run command COMMAND in a fully unprivileged Charliecloud container using the image specified by IMAGE, which can be: (1) a path to a directory, (2) the name of an image in ch-image storage (e.g. example.com:5050/foo) or, if the proper support is enabled, (3) a SquashFS archive. ch-run does not use any setuid or setcap helpers, even when mounting SquashFS images with FUSE.

COMMAND is the name of the program to run inside the container, which is given zero or more arguments ARGn. If COMMAND is “-” (i.e., a single hyphen), the command and initial arguments are computed as specified by ENTRYPOINT, CMD, and SHELL instructions in the Dockerfile that built the image (just like Docker). See below, as this procedure can be confusing.

Options:

-b, --bind=SRC[:DST]
Bind-mount SRC at guest DST. If DST is not specified, it defaults to the same path as on the host; i.e., the default is --bind=SRC:SRC. Can be repeated.

With a read-only image (the default), DST must exist. However, if --write or --write-fake is given, DST will be created as an empty directory (possibly with the tmpfs overmount trick described in “--bind creates mount points within un-writeable directories!”). In this case, DST must be entirely within the image itself, i.e., DST cannot enter a previous bind mount. For example, --bind /foo:/tmp/foo will fail because /tmp is shared with the host via bind-mount (unless $TMPDIR is set to something else or --private-tmp is given).

Most images have ten directories /mnt/[0-9] already available as mount points.

Symlinks in DST are followed, and absolute links can have surprising behavior. Bind-mounting happens after namespace setup but before pivoting into the container image, so absolute links use the host root. For example, suppose the image has a symlink /foo -> /mnt. Then, --bind=/bar:/foo will bind-mount on the host’s /mnt, which is inaccessible on the host because namespaces are already set up and also inaccessible in the container because of the subsequent pivot into the image. Currently, this problem is only detected when DST needs to be created: ch-run will refuse to follow absolute symlinks in this case, to avoid directory creation surprises.

-c, --cd=DIR
Initial working directory in container.

-d, --cdi[=KIND]
Inject CDI resources. If KIND ends with .json, it is interpreted as a path to a JSON resource file, and all resources in that file are injected. Any other non-empty KIND is a CDI resource kind, e.g. --cdi=nvidia.com/gpu; resource files in the CDI search path (see --cdi-dirs below) are searched. (The device name no-op wildcard all can be included, e.g. --cdi=nvidia.com/gpu=all, for compatibility.) If KIND is omitted, inject all known CDI resources. Implies --write-fake so the container image can be written.

--cdi-dirs=PATHS
Colon-separated list of directories to search for CDI JSON resource specification files. Default: CH_RUN_CDI_DIRS if set, otherwise /etc/cdi:/var/run/cdi.

--color[=WHEN]
Color logging output by log level, depending on WHEN:

By default, or if WHEN is auto, tty, or if-tty: use color if standard error is a TTY; otherwise, don’t use color.

If WHEN is yes, always, or force; or if --color is specified without an argument: always use color.

If WHEN is no, never, or none: never use color.

This uses ANSI color codes without checking any terminal databases, which should work on all modern terminals.

--env-no-expand
Don’t expand variables when using --set-env.

--feature=FEAT
If feature FEAT is enabled, exit successfully (zero); otherwise, exit unsuccessfully (non-zero). Note this just communicates the tests done by configure, rather than re-testing anything. Valid values of FEAT are:

extglob: extended globs in --unset-env

gc: conservative garbage collection using libgc

json: features that use JSON; currently only CDI (--cdi and friends)

seccomp: root emulation with seccomp (--seccomp)

squash: internal SquashFUSE image mounts using libsquashfuse

overlayfs: unprivileged overlayfs support (--write-fake)

tmpfs-xattrs: user xattrs on tmpfs

--fuse-single
Use single-threaded FUSE, rather than the default multi-threaded. If the image is not a filesystem archive (e.g., SquashFS), this option is accepted but has no effect.

-g, --gid=GID
Run as group GID within container.

--home
Bind-mount your host home directory (i.e., $HOME) at guest /home/$USER, hiding any existing image content at that path. Implies --write-fake so the mount point can be created if needed.

-j, --join
Use the same container (namespaces) as peer ch-run invocations.

--join-pid=PID
Join the namespaces of an existing process.

--join-ct=N
Number of ch-run peers (implies --join; default: see below).

--join-tag=TAG
Label for ch-run peer group (implies --join; default: see below).

-m, --mount=DIR
Use DIR for the SquashFS mount point, which must already exist. If not specified, the default is /var/tmp/$USER.ch/mnt, which will be created if needed.

--no-passwd
Use the image’s /etc/passwd and /etc/group files. (By default, temporary files are created according to the UID and GID maps for the container and bind-mounted over the image’s files.)

-q, --quiet
Be quieter; can be repeated. Incompatible with -v. See the FAQ entry “How can I control Charliecloud’s quietness or verbosity?” for details.

-s, --storage DIR
Set the storage directory. Equivalent to the same option for ch-image(1).

--seccomp
Using seccomp, intercept some system calls that would fail due to lack of privilege, do nothing, and return fake success to the calling program. This is intended for use by ch-image(1) when building images; see that man page for a detailed discussion.

--set-env, --set-env=FILE, --set-env=VAR=VALUE
Set environment variables with newline-separated file (/ch/environment within the image if not specified) or on the command line. See below for details.

--set-env0, --set-env0=FILE, --set-env0=VAR=VALUE
Like --set-env, but file is null-byte separated.

-t, --private-tmp
Mount a new tmpfs at the container’s /tmp. (By default, the host’s /tmp, or $TMPDIR if set, is bind-mounted there.)

-u, --uid=UID
Run as user UID within the container.

--unsafe
Enable various unsafe behavior. For internal use only. Seriously, stay away from this option.

--unset-env=GLOB
Unset environment variables whose names match GLOB.

-v, --verbose
Print extra chatter; can be repeated. See the FAQ entry on verbosity for details.

-w, --write
Mount image read-write. By default, the image is mounted read-only. This option should be avoided for most use cases, because changing images live, as opposed to prescriptively with a Dockerfile, destroys their provenance. Also, SquashFS images, which are the best-practice format on parallel filesystems, are read-only, so this option is unavailable for them. Instead, use --write-fake for disposable small data or bind-mount host directories with --bind.

-W, --write-fake[=SIZE]
Overlay a writeable tmpfs on top of the image. This makes the image appear read-write, but it actually remains read-only and unchanged. All data “written” to the image are discarded when the container exits.

The size of the writeable filesystem SIZE is any size specification acceptable to tmpfs, e.g. 4m for 4MiB or 50% for half of physical memory. If this option is specified without SIZE, the default is 12%. Note (1) this limit is a maximum rather than pre-allocated and (2) SIZE larger than memory can be requested without error (the failure happens later if the actual contents become too large).

This requires kernel support and there are some caveats. See section “Writeable overlay with --write-fake” below for details.

-?, --help
Print help and exit.

-V, --version
Print version and exit.

Note

Because ch-run is fully unprivileged, it is not possible to change UIDs and GIDs within the container (the relevant system calls fail). As a corollary, setuid, setgid, and setcap executables do not change their IDs or capabilities.

8.3. Determining the containerized command

8.3.1. Background

UNIX processes execute a sub-program by specifying a list of strings: the program to execute followed by its arguments, if any. (The notion of a singular “command line” is a shell thing, not a UNIX thing.)

For example, if you run the shell command cc -o foo foo.c, the shell splits that string into words using complicated rules (e.g., for Bash), yielding the list of four strings cc, -o, foo, foo.c, which UNIX can then execute.
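As a rough illustration of this word splitting, Python's standard shlex module applies POSIX-like quoting rules. It is only an approximation of a real shell: it performs no expansions (globbing, variables, and so on), and the variable names below are purely illustrative.

```python
import shlex

# Split shell command strings into the list of strings that would be executed.
# shlex approximates POSIX shell quoting; it performs no expansions.
compile_argv = shlex.split("cc -o foo foo.c")
quoted_argv = shlex.split("echo -e 'hello world'")

print(compile_argv)  # ['cc', '-o', 'foo', 'foo.c']
print(quoted_argv)   # ['echo', '-e', 'hello world']
```

Note that the quoted words hello world stay together as a single list item, with the quotes themselves removed.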

Note

If you want to learn more, the relevant libc functions are the exec(3) family; see e.g. musl’s straightforward implementation. The system call is execve(2).

ch-run(1) works the same way, except it sets up or joins a container before executing the sub-program. This brings us to the question: What is the list of strings that Charliecloud executes?

As described above, this list is given on the command line. ch-run(1) expects — after its own options and the container image (IMAGE above) — at least one argument specifying COMMAND and its arguments.

8.3.2. Default

If COMMAND is anything other than a single hyphen, the command list is simply everything after Charliecloud’s arguments, split by the shell into words. For example, the command:

$ ch-run alpine:3.17 -- echo -e 'hello world'

will execute the containerized command [echo, -e, hello world], i.e., a list of 3 strings. (The -- tells ch-run that its own arguments are finished. It is not always required, but it’s a best practice and in this case prevents ch-run from trying and failing to interpret echo’s -e argument.)

In this case, whether and how the image specifies ENTRYPOINT and CMD has no effect. This is the historical Charliecloud behavior.

8.3.3. Using `ENTRYPOINT` and/or `CMD`

If COMMAND is a single hyphen, i.e. -, then the container’s default command is used. These are specified at build time with Dockerfile instructions. ENTRYPOINT is the default command and its required arguments (possibly mediated by SHELL as described in the next section), while CMD is the default command’s default optional arguments, used if no optional arguments are given on the command line. At least one of ENTRYPOINT and CMD must be specified in the image.

Because the default command must be specifically requested, ch-run has no --entrypoint option, unlike Docker and Podman. Otherwise, ch-run should be bug-compatible with Docker with respect to these three instructions. Note that the actual behavior, described here to the best of our ability, differs somewhat from the Docker documentation.

Both ENTRYPOINT and CMD specify a list of strings, regardless of whether the “exec” or “shell” form is used (we’ll get to the difference shortly). If the instruction is absent, the corresponding list is empty. The containerized command is then the concatenation of the ENTRYPOINT list with ARGn from the command line, if any, otherwise the CMD list.

For example, suppose we have an image hello built with this Dockerfile:

FROM alpine:3.17
ENTRYPOINT [ "echo", "hello" ]
CMD [ "world" ]

By default, both ENTRYPOINT and CMD are ignored:

$ ch-run hello -- printf "trosseau whirled"
trosseau whirled

Here, the containerized command list is entirely from the command line: [printf, trosseau whirled] (2 items).

We can request use of both defaults with a COMMAND of hyphen and no ARGn:

$ ch-run hello -- -
hello world

This executes [echo, hello, world], i.e. ENTRYPOINT’s list [echo, hello] followed by CMD’s [world].

Finally, we can use the default command and required args from the image but optional arguments from the command line with a COMMAND of hyphen and at least one ARGn:

$ ch-run hello -- - unfurled

This executes [echo, hello, unfurled], i.e. ENTRYPOINT’s two-item list followed by [unfurled] from the command line.

Warning

There is no way to override an image’s non-empty CMD with an empty optional argument list.
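The rule described in this section can be modeled in a few lines of Python. This is a sketch of the documented behavior only, not ch-run's actual code; the function name, its parameters, and the error messages are hypothetical.

```python
from typing import List, Optional, Sequence


def containerized_command(
    entrypoint: Optional[Sequence[str]],
    cmd: Optional[Sequence[str]],
    argv: Sequence[str],
) -> List[str]:
    """Model the command list for `ch-run IMAGE -- argv...`.

    entrypoint and cmd are the image's ENTRYPOINT and CMD lists (None or
    empty if the instruction is absent); argv is COMMAND followed by ARGn.
    """
    if not argv:
        raise ValueError("at least COMMAND is required")
    if argv[0] != "-":
        # Default: ENTRYPOINT and CMD are ignored entirely.
        return list(argv)
    args = list(argv[1:])
    # ARGn, if any, replace CMD; otherwise CMD supplies the optional args.
    command = list(entrypoint or []) + (args if args else list(cmd or []))
    if not command:
        raise ValueError("image specifies neither ENTRYPOINT nor CMD")
    return command


# The three invocations shown above, for the image "hello".
print(containerized_command(["echo", "hello"], ["world"], ["printf", "trosseau whirled"]))
print(containerized_command(["echo", "hello"], ["world"], ["-"]))
print(containerized_command(["echo", "hello"], ["world"], ["-", "unfurled"]))
```

The warning above falls out of the model: an empty ARGn list is indistinguishable from no ARGn at all, so CMD is always used in that case.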

8.3.4. Shell form of `ENTRYPOINT` and/or `CMD`

The ENTRYPOINT and CMD instructions can be given in “shell form”, in which case the single instruction argument is combined with any prior SHELL, or [/bin/sh, -c] by default, to create a three-item list that is then processed as described above.

For example, the following two Dockerfiles are equivalent:

FROM alpine:3.17
ENTRYPOINT echo hello
CMD ["world"]

FROM alpine:3.17
ENTRYPOINT ["/bin/sh", "-c", "echo hello"]
CMD ["world"]

Use of shell-form ENTRYPOINT or CMD is discouraged because it can lead to counterintuitive behavior. For example, the following two Dockerfiles are also equivalent:

FROM alpine:3.17
ENTRYPOINT echo a 0='$0' 1='$1' 2='$2'
SHELL ["/bin/moustache", "-c"]
CMD echo b

FROM alpine:3.17
ENTRYPOINT ["/bin/sh", "-c", "echo a 0='$0' 1='$1' 2='$2'"]
CMD ["/bin/moustache", "-c", "echo b"]

Executing these defaults yields:

$ ch-run hello -- -
a 0='/bin/moustache' 1='-c' 2='echo b'

Puzzled? Us too. The executed command list is [/bin/sh, -c, echo a 0='$0' 1='$1' 2='$2', /bin/moustache, -c, echo b] (6 items), computed by concatenating the ENTRYPOINT and CMD lists, just as above. CMD’s list is then (per POSIX) arguments to the script given for -c. This behavior is not well-known because that script usually doesn’t take any arguments.

If ARGn is given on the command line, those will also become arguments to the -c script:

$ ch-run hello -- - echo c d
a 0='echo' 1='c' 2='d'

Docker does this too, despite documenting that shell-form ENTRYPOINT “will ignore any CMD or docker run command line arguments”:

$ cat Dockerfile
FROM alpine:3.17
ENTRYPOINT echo a 0=$0 1=$1 2=$2
SHELL ["/bin/moustache", "-c"]
CMD echo b
$ docker build -t hello .
[...]
Successfully tagged hello:latest
$ docker run hello
a 0=/bin/moustache 1=-c 2=echo b
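To make the mechanics concrete, here is a small Python model of how shell-form instructions expand and how the resulting list maps onto the script's positional parameters. It mirrors the Dockerfile in the Docker transcript above; the helper name shell_form is hypothetical, and this is not how ch-run or Docker is implemented internally.

```python
from typing import List, Sequence

DEFAULT_SHELL = ("/bin/sh", "-c")


def shell_form(script: str, shell: Sequence[str] = DEFAULT_SHELL) -> List[str]:
    """Expand a shell-form instruction into its exec-form list."""
    return list(shell) + [script]


# ENTRYPOINT appears before SHELL, so it uses the default shell;
# CMD appears after SHELL, so it uses /bin/moustache.
entrypoint = shell_form("echo a 0=$0 1=$1 2=$2")
cmd = shell_form("echo b", shell=["/bin/moustache", "-c"])

command = entrypoint + cmd
# For `sh -c SCRIPT NAME ARG...`, POSIX assigns NAME to $0 and ARG... to $1, $2, ...
positional = dict(zip(["$0", "$1", "$2"], command[3:]))

print(command)
print(positional)
```

Everything after the -c script, whether it came from CMD or from ARGn, lands in $0, $1, and so on, which is why the script can print it.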

8.3.5. ENTRYPOINT and CMD Comprehensive Examples

8.3.5.1. 1. No ENTRYPOINT:

1.1 No CMD:

FROM alpine:3.17
# No ENTRYPOINT or CMD

Output:

$ ch-run hello -- -
Error: No CMD, ENTRYPOINT, or command specified

1.2 CMD in shell form:

FROM alpine:3.17
CMD echo b

Output:

$ ch-run hello -- -
b

1.3 CMD in exec form:

FROM alpine:3.17
CMD ["echo", "b"]

Output:

$ ch-run hello -- -
b

8.3.5.2. 2. ENTRYPOINT in Exec Form:

2.1 No CMD:

FROM alpine:3.17
ENTRYPOINT ["echo", "a"]

Output:

$ ch-run hello -- -
a

2.2 CMD in exec form:

FROM alpine:3.17
ENTRYPOINT ["echo", "a"]
CMD ["b"]

When both the ENTRYPOINT and CMD are in exec form, the CMD is appended to the ENTRYPOINT command to create the command echo a b, which produces the output:

$ ch-run hello -- -
a b

2.3 CMD in shell form:

FROM alpine:3.17
ENTRYPOINT ["echo", "a"]
CMD echo b

When ENTRYPOINT is in exec form and CMD is in shell form, they concatenate to form the command echo a /bin/sh -c “echo b”, which produces the output:

$ ch-run hello -- -
a /bin/sh -c echo b

8.3.5.3. 3. ENTRYPOINT in Shell Form:

3.1 No CMD:

FROM alpine:3.17
ENTRYPOINT echo a

Output:

$ ch-run hello -- -
a

3.2 CMD in exec form:

FROM alpine:3.17
ENTRYPOINT echo a
CMD ["echo", "b"]

When ENTRYPOINT is in shell form and CMD is in exec form, they concatenate to form the command /bin/sh -c “echo a” echo b. The trailing echo and b become arguments to the -c script ($0 and $1), which does not use them; the shell executes echo a and then exits, producing the output:

$ ch-run hello -- -
a

3.3 CMD in shell form:

FROM alpine:3.17
ENTRYPOINT echo a
CMD echo b

When both ENTRYPOINT and CMD are in shell form, they concatenate to form the command /bin/sh -c “echo a” /bin/sh -c “echo b”. The CMD part becomes unused arguments to the first shell command, which executes echo a and then exits, producing the output:

$ ch-run hello -- -
a

8.4. Image format

ch-run supports two different image formats.

The first is a simple directory that contains a Linux filesystem tree. Such a directory can be produced by:

ch-convert directly from ch-image or another builder to a directory.
Charliecloud’s tarball workflow: build or pull the image, ch-convert it to a tarball, transfer the tarball to the target system, then ch-convert the tarball to a directory.
Manually mount a SquashFS image, e.g. with squashfuse(1) and then un-mount it after run with fusermount -u.
Any other workflow that produces an appropriate directory tree.

The second is a SquashFS image archive mounted internally by ch-run, available if it’s linked with the optional libsquashfuse_ll shared library. ch-run mounts the image filesystem, services all FUSE requests, and unmounts it, all within ch-run. By default, we use a multi-threaded FUSE; for single-threaded say --fuse-single. See --mount above to set the mount point location.

Like other FUSE implementations, Charliecloud calls the fusermount3(1) utility to mount the SquashFS filesystem. However, this executable does not need to be installed setuid root, and in fact ch-run actively suppresses its setuid bit if set (using prctl(2)).

Prior versions of Charliecloud provided wrappers for the squashfuse and squashfuse_ll SquashFS mount commands and fusermount -u unmount command. We removed these because we concluded they had minimal value-add over the standard, unwrapped commands.

Warning

Currently, Charliecloud unmounts the SquashFS filesystem when user command COMMAND’s process exits. It does not monitor any of its child processes. Therefore, if the user command spawns child processes and then exits before them (e.g., some daemons), those children will have the image unmounted from underneath them. In this case, the workaround is to mount/unmount using external tools. We expect to remove this limitation in a future version.

8.5. Host files and directories available in container via bind mounts

In addition to any directories specified by the user with --bind, ch-run has standard host files and directories that are bind-mounted in as well.

The following host files and directories are bind-mounted at the same location in the container. These give access to the host’s devices and various kernel facilities. (Recall that Charliecloud provides minimal isolation and containerized processes are mostly normal unprivileged processes.) They cannot be disabled and are required; i.e., they must exist both on host and within the image.

/dev

/proc

/sys

Optional; bind-mounted only if path exists on both host and within the image, without error or warning if not.

/etc/hosts and /etc/resolv.conf. Because Charliecloud containers share the host network namespace, they need the same hostname resolution configuration.

/etc/machine-id. Provides a unique ID for the OS installation; matching the host works for most situations. Needed to support D-Bus, some software licensing situations, and likely other use cases. See also issue #1050.

/var/lib/hugetlbfs at guest /var/opt/cray/hugetlbfs, and /var/opt/cray/alps/spool. These support Cray MPI.

Additional bind mounts that are done by default but can be disabled; see the options above.

$HOME at /home/$USER (and image /home is hidden). Makes user data and init files available.

/tmp (or $TMPDIR if set) at guest /tmp. Provides a temporary directory that persists between container runs and is shared with non-containerized application components.

Temporary files at /etc/passwd and /etc/group. Usernames and group names need to be customized for each container run.

8.6. Multiple processes in the same container with `--join`

By default, different ch-run invocations use different user and mount namespaces (i.e., different containers). While this has no impact on sharing most resources between invocations, there are a few important exceptions. These include:

ptrace(2), used by debuggers and related tools. One can attach a debugger to processes in descendant namespaces, but not sibling namespaces. The practical effect of this is that (without --join), you can’t run a command with ch-run and then attach to it with a debugger also run with ch-run.
Cross-memory attach (CMA) is used by cooperating processes to communicate by simply reading and writing one another’s memory. This is also not permitted between sibling namespaces. This affects various MPI implementations that use CMA to pass messages between ranks on the same node, because it’s faster than traditional shared memory.

--join is designed to address this by placing related ch-run commands (the “peer group”) in the same container. This is done by one of the peers creating the namespaces with unshare(2) and the others joining with setns(2).

To do so, we need to know the number of peers and a name for the group. These are specified by additional arguments that can (hopefully) be left at default values in most cases:

--join-ct sets the number of peers. The default is the value of the first of the following environment variables that is defined: OMPI_COMM_WORLD_LOCAL_SIZE, SLURM_STEP_TASKS_PER_NODE, SLURM_CPUS_ON_NODE.
--join-tag sets the tag that names the peer group. The default is environment variable SLURM_STEP_ID, if defined; otherwise, the PID of ch-run’s parent. Tags can be re-used for peer groups that start at different times, i.e., once all peer ch-run have replaced themselves with the user command, the tag can be re-used.
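The default selection described in these two items can be sketched as follows. This is an illustrative model, not ch-run's code: the function names are hypothetical, values are returned as raw strings without parsing, and a variable that is set but empty is assumed to count as defined.

```python
import os
from typing import Mapping, Optional

# Environment variables consulted for the --join-ct default, in priority order.
JOIN_CT_VARS = (
    "OMPI_COMM_WORLD_LOCAL_SIZE",
    "SLURM_STEP_TASKS_PER_NODE",
    "SLURM_CPUS_ON_NODE",
)


def default_join_ct(env: Mapping[str, str]) -> Optional[str]:
    """Return the raw value of the first defined variable, or None."""
    for name in JOIN_CT_VARS:
        if name in env:
            return env[name]
    return None


def default_join_tag(env: Mapping[str, str], parent_pid: int) -> str:
    """Return SLURM_STEP_ID if defined, else the parent PID as a string."""
    return env.get("SLURM_STEP_ID", str(parent_pid))


if __name__ == "__main__":
    print(default_join_ct(os.environ), default_join_tag(os.environ, os.getppid()))
```

Passing the environment as a mapping, rather than reading os.environ inside the functions, keeps the precedence rules easy to check with made-up values.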

Caveats:

One cannot currently add peers after the fact, for example, if one decides later to start a debugger. (This is only required for code with bugs and is thus an unusual use case.)
ch-run instances race. The winner of this race sets up the namespaces, and the other peers use the winner to find the namespaces to join. Therefore, if the user command of the winner exits, any remaining peers will not be able to join the namespaces, even if they are still active. There is currently no general way to specify which ch-run should be the winner.
If --join-ct is too high, the winning ch-run’s user command exits before all peers join, or ch-run itself crashes, IPC resources such as semaphores and shared memory segments will be leaked. These appear as files in /dev/shm/ and can be removed with rm(1).
Many of the arguments given to the race losers, such as the image path and --bind, will be ignored in favor of what was given to the winner.

8.7. Writeable overlay with `--write-fake`

If you need the image to stay read-only but appear writeable, you may be able to use --write-fake to overlay a writeable tmpfs atop the image. This requires kernel support. Specifically:

To use the feature at all, you need unprivileged overlayfs support. This is available in upstream 5.11 (February 2021), but distributions vary considerably. If you don’t have this, the container will fail to start with error “operation not permitted”.
For a fully functional overlay, you need a tmpfs that supports xattrs in the user namespace. This is available in upstream 6.6 (October 2023). If you don’t have this, most things will work fine, but some operations will fail with “I/O error”, for example creating a directory with the same path as a previously deleted directory. There will also be syslog noise about xattr problems.

(overlayfs can also use xattrs in the trusted namespace, but this requires CAP_SYS_ADMIN on the host and thus is not helpful for unprivileged containers.)
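As a first rough check, the upstream version thresholds above can be compared against the running kernel's release string. This is a heuristic sketch only: distributions may backport or disable these features, so the version number alone is not conclusive, and the function names and return labels here are hypothetical.

```python
import platform
import re
from typing import Tuple

OVERLAYFS_MIN = (5, 11)    # upstream unprivileged overlayfs
TMPFS_XATTRS_MIN = (6, 6)  # upstream user xattrs on tmpfs


def kernel_version(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '6.6.8-arch1'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def write_fake_outlook(release: str) -> str:
    """Guess --write-fake support from the upstream version alone."""
    version = kernel_version(release)
    if version < OVERLAYFS_MIN:
        return "unsupported"  # container likely fails to start
    if version < TMPFS_XATTRS_MIN:
        return "partial"      # works, but some operations may fail
    return "full"


if __name__ == "__main__":
    print(write_fake_outlook(platform.release()))
```

For what the ch-run build itself detected at configure time, the overlayfs and tmpfs-xattrs values of --feature described above are the more direct source.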

8.8. Using host resources with Container Device Interface (CDI)

ch-run can inject host resources into a container at runtime without altering the underlying image. We follow the Container Device Interface (CDI), an emerging standard for such injection.

Common use cases are shared libraries for proprietary hardware (e.g., nVidia GPUs or Cray networking) or site-specific configuration files. The resources must be compatible with the Linux distribution within the image, with libc being the most common concern.

TL;DR

In many cases, you just want all available resources. If your sysadmins have configured your host correctly, you can just say ch-run -d for that and stop reading this section.

8.8.1. CDI overview and vocabulary

A CDI resource specification file is a JSON file that prescribes image modifications made during container setup, before invoking the user command. While the intent of the standard is to make devices (i.e., hardware gadgets) available inside containers, it is quite flexible: this spec file can list device files, filesystem or bind mounts, environment variables, and arbitrary hook programs. Christopher Desiniotis gave a good talk at Container Plumbing Days 2024 introducing CDI (slides, video).

OCI hooks, which are arbitrary programs run during container setup, serve a similar purpose. In our view, CDI’s declarative approach is better, because a resource spec file gives a clear description of what is to be done rather than relying on a program that may be opaque and may make inappropriate assumptions (especially for ch-run, which is not an OCI runtime).

CDI does overload terminology in ways that we believe are confusing. Most importantly, what we refer to here as a “resource”, meaning a collection of modifications (e.g., libraries to bind-mount, environment variables to set, hooks to run, etc.), is called a “device” by CDI. We use “resource” to avoid confusion with device files (e.g. in /dev) or physical hardware (see CDI issue #246). Also, CDI refers to both the CDI standard itself as well as the JSON/YAML files describing resources as “specifications” (see CDI issue #245); we reserve “specification” or “spec” for the files and use “standard” for CDI.

Here is an example resource spec file:

{
  "cdiVersion": "0.5.0",
  "kind": "nvidia.com/gpu",
  "devices": [ {
      "name": "foo",
      "containerEdits": {
        "deviceNodes": [ { "path": "/dev/nvidia0" },
                         { "path": "/dev/dri/card0" } ],
        "hooks": [ { "hookName": "createContainer",
                     "path": "/usr/bin/nvidia-ctk",
                     "args": [ "nvidia-ctk",
                               "hook", "create-symlinks",
                               "--link", "../card0::/dev/dri/by-path/pci-0000:07:00.0-card"
                             ] } ] } } ],
  "containerEdits": {
    "env": [ "NVIDIA_VISIBLE_DEVICES=void" ],
    "deviceNodes": [ { "path": "/dev/nvidia-modeset" },
                     { "path": "/dev/nvidiactl" } ],
    "mounts": [
      { "hostPath": "/run/nvidia-fabricmanager/socket",
        "containerPath": "/run/nvidia-fabricmanager/socket",
        "options": [ "ro", "nosuid", "nodev", "bind", "noexec" ] },
      { "hostPath": "/usr/bin/nvidia-smi",
        "containerPath": "/usr/bin/nvidia-smi",
        "options": [ "ro", "nosuid", "nodev", "bind" ] },
      { "hostPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.535.161.08",
        "containerPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.535.161.08",
        "options": [ "ro", "nosuid", "nodev", "bind" ] } ],
    "hooks": [
      { "hookName": "createContainer",
        "path": "/usr/bin/nvidia-ctk",
        "args": [
          "nvidia-ctk",
          "hook", "update-ldcache",
          "--folder", "/usr/lib/x86_64-linux-gnu" ] } ] }
}

This specifies:

A single resource (CDI “device”) named foo, of kind nvidia.com/gpu, comprising:
1. Two device files to be made available in the container, /dev/nvidia0 and /dev/dri/card0.
2. One symlink to create inside the container, /dev/dri/by-path/pci-0000:07:00.0-card → ../card0.
A set of container changes to be made once regardless of which resources are selected (this example has just one resource, but real spec files typically have several), comprising:
1. One environment variable to set, NVIDIA_VISIBLE_DEVICES.
2. Two device files to be made available in the container, /dev/nvidia-modeset and /dev/nvidiactl.
3. Three bind-mounts from the host into the container: a socket (/run/nvidia-fabricmanager/socket), executable (nvidia-smi), and shared library (libcuda.so.535.161.08).
4. One hook that updates the container’s linker cache, scanning only guest directory /usr/lib/x86_64-linux-gnu.

8.8.2. Charliecloud’s implementation

Our CDI implementation differs from others in some important ways, though we believe Charliecloud is still compliant. These stem from fundamental properties of Charliecloud that clash with CDI assumptions as well as design choices. This section lists the differences that should have meaningful implications for users.

Host /dev is bind-mounted into the guest (at the same path); therefore, all of the host’s device files and ancillary files (e.g., symlinks under /dev) are available in a Charliecloud container regardless of CDI. For this reason:
- We ignore the devices field and everything within it, as well as containerEdits/deviceNodes.
- Resources are only selectable by kind, not individually. See below for details.
Charliecloud is fully unprivileged; therefore, only a subset of mount options are available. For this reason:
- Elements of mounts that do not include bind, or rbind, in options are skipped.
(rbind is treated as a synonym for bind; we use MS_BIND | MS_REC for both. We haven’t yet seen any mounts in a resource file without bind or rbind.)
The container’s ld cache is automatically updated for mounts targeting files that appear to be shared object libraries; thus, hooks to run ldconfig are unnecessary and ignored.
We chose to interpret CDI resource files as fully prescriptive, rather than mostly prescriptive plus hook programs. For this reason:
- Hooks (containerEdits/hooks) are interpreted as statements implemented by Charliecloud, like the rest of the resource file, rather than running the actual hook program. See below for details on actual hooks.
We recognize the brittleness and are monitoring the situation. However, we have not yet encountered any hooks that are both useful under Charliecloud and (in our view) merit an external program.
We try hard to minimize dependencies, and any YAML resource file could be easily converted to JSON (e.g., with yq). Therefore, ch-run has no YAML parser. For this reason:
- Only JSON resource files are supported.
Large numbers of bind-mounts with possibly-long names clutter listings (e.g., /proc/mounts or findmnt(1)) and in the extreme may cause functionality or performance problems. Resource specification files tend to declare lots of bind-mounts; e.g. the spec for one of our not-that-large systems declares 47 mounts, one per host file. For this reason:
- Any given mount target may be a symlink into some collective mount point, rather than an actual mount point, to reduce the number of mounts.
- Additional files and directories may appear in the container via these collective mounts. However, they are in locations that should not affect containerized applications, and as always filesystem permissions will be enforced.
CDI resources can set environment variables, but this is only one way that ch-run can (un)set variables. For this reason:
- Environment variables are set in the order that CDI options appear on the command line relative to other user-specified environment options, e.g. --set-env and --unset-env. See Environment variables below for details.
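Some of the filtering rules in this list can be illustrated with a short Python sketch that reads a spec and keeps only the edits described as honored: top-level env entries and mounts whose options include bind or rbind. It is a simplified model under those assumptions, not ch-run's implementation; it leaves out hooks, collective mount points, and linker cache handling, and the function name and the /hypothetical path are invented for illustration.

```python
import json
from typing import Any, Dict, List

SPEC_TEXT = """
{ "cdiVersion": "0.5.0",
  "kind": "nvidia.com/gpu",
  "devices": [ { "name": "foo",
                 "containerEdits": {
                   "deviceNodes": [ { "path": "/dev/nvidia0" } ] } } ],
  "containerEdits": {
    "env": [ "NVIDIA_VISIBLE_DEVICES=void" ],
    "deviceNodes": [ { "path": "/dev/nvidiactl" } ],
    "mounts": [
      { "hostPath": "/usr/bin/nvidia-smi",
        "containerPath": "/usr/bin/nvidia-smi",
        "options": [ "ro", "nosuid", "nodev", "bind" ] },
      { "hostPath": "/hypothetical/not-a-bind",
        "containerPath": "/hypothetical/not-a-bind",
        "options": [ "ro" ] } ] } }
"""


def applicable_edits(spec: Dict[str, Any]) -> Dict[str, List[Any]]:
    """Collect env settings and bind mounts from the top-level containerEdits."""
    # The "devices" field and any deviceNodes are deliberately not examined.
    edits = spec.get("containerEdits", {})
    mounts = []
    for mount in edits.get("mounts", []):
        options = mount.get("options", [])
        if "bind" in options or "rbind" in options:
            mounts.append((mount["hostPath"], mount["containerPath"]))
    return {"env": list(edits.get("env", [])), "mounts": mounts}


result = applicable_edits(json.loads(SPEC_TEXT))
print(result)
```

In this example the second mount is skipped because its options contain neither bind nor rbind, and nothing under devices contributes to the result.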

8.8.3. Hooks

8.8.3.1. Behavior summary

Presently, CDI hooks fall into three categories for Charliecloud:

Known hooks that we need, with behavior emulated internally (i.e., ch-run does what the hook needs, adapted for Charliecloud, rather than running the hook).
Known hooks that we don’t need; we ignore these quietly (i.e., they are logged, but at a level hidden by default).
Unknown hooks. We warn about these, because they need to be either moved into one of the first two categories or actually run. (That is, we’re still figuring out what’s needed for Charliecloud here.)

The next two sections document known hooks.

Note

nVidia Container Toolkit CDI hooks can be spelled either nvidia-ctk hook (two words) or nvidia-cdi-hook (one word, different acronym). We treat the two spellings the same.

8.8.3.2. Emulated hooks

nvidia-cdi-hook update-ldcache. This hook updates the container’s linker cache (i.e., /etc/ld.so.cache), notably using the host’s ldconfig. For now at least, we instead use the container’s ldconfig, the reasoning being that (1) the container’s linker updating its own cache is lower-risk compatibility-wise and (2) it seems unlikely that an image would be compatible with nVidia libraries and have a linker cache but no ldconfig executable.

If the image has no ldconfig, ch-run exits with an error and the container does not run. This indicates the assumption above is false, so please report this error as a bug.

8.8.3.3. Ignored hooks

nvidia-cdi-hook create-symlinks. This creates one or more symlinks. In our experience, the links created already exist in the host’s /dev or are created by ldconfig(8).
nvidia-cdi-hook chmod. This changes file permissions, but in unprivileged Charliecloud containers, the invoking user will already have access to all appropriate files.
nvidia-cdi-hook enable-cuda-compat. This is for CUDA Forward Compatibility, which lets you use a libcuda.so and CUDA build-time libraries that are newer than the kernel module (nvidia.ko). For example: (1) host has older CUDA kernel module 10.1, (2) container built with newer 11.0, (3) host has a newer libcuda.so 11.0 from somewhere. This would let us run new containers on old hosts, which seemed like a deferrable use case.
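
The three categories and the two spellings can be sketched as a small classifier. This is an illustration only, not ch-run's actual code: the function names, the return strings, and the assumption that a hook arrives as an argument vector whose first element is the program name are all hypothetical.

```python
import os

# Hook names taken from the "Emulated hooks" and "Ignored hooks" lists above.
EMULATED = {"update-ldcache"}
IGNORED = {"create-symlinks", "chmod", "enable-cuda-compat"}


def hook_name(args):
    """Return the hook name from an argument vector, or None if the vector
    does not look like an nVidia hook. Both spellings are accepted:
    ["nvidia-ctk", "hook", NAME, ...] and ["nvidia-cdi-hook", NAME, ...]."""
    if not args:
        return None
    prog = os.path.basename(args[0])
    if prog == "nvidia-ctk" and len(args) >= 3 and args[1] == "hook":
        return args[2]
    if prog == "nvidia-cdi-hook" and len(args) >= 2:
        return args[1]
    return None


def classify_hook(args):
    """Return "emulate", "ignore", or "unknown" (hypothetical labels)."""
    name = hook_name(args)
    if name in EMULATED:
        return "emulate"
    if name in IGNORED:
        return "ignore"
    return "unknown"
```

Under these assumptions, `classify_hook(["nvidia-ctk", "hook", "update-ldcache"])` and `classify_hook(["nvidia-cdi-hook", "update-ldcache"])` give the same answer, and anything not listed falls through to the warning category.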

8.9. Environment variables

Unlike most other implementations, ch-run’s baseline for the container environment is to pass through the host environment unaltered. From this starting point, the environment is altered in this order:

$HOME, $PATH, and $TMPDIR are adjusted to avoid common breakage (see below).
User-specified changes are executed in the order they appear on the command line (i.e., -d/--devices, --device, --set-env, and --unset-env, some of which can appear multiple times).
$CH_RUNNING is set.

8.9.1. Built-in environment changes

These changes are made prior to user changes, i.e., they can be altered by the user:

$HOME

If --home is specified, then your home directory is bind-mounted into the guest at /home/$USER. If your home directory has a different path on the host, an inherited $HOME will be incorrect inside the guest, which confuses lots of software, notably Spack. Thus, with --home, $HOME is set to /home/$USER. (By default, it is unchanged.)

$PATH

We append /bin to $PATH if it’s not already present. This is because newer Linux distributions replace some root-level directories, such as /bin, with symlinks to their counterparts in /usr. Some of these distributions (e.g., Fedora 24) have also dropped /bin from the default $PATH. This is a problem when the guest OS does not have a merged /usr (e.g., Debian 8 “Jessie”).
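
The $HOME and $PATH adjustments above can be sketched as follows. This is a minimal illustration on a plain dictionary, not ch-run's implementation; the function name and the handling of a missing $USER or an unset $PATH are assumptions.

```python
def builtin_env_changes(host_env, home=False):
    """Return a copy of host_env with the $HOME and $PATH adjustments applied.

    home=True stands for the --home option."""
    env = dict(host_env)
    # With --home, the home directory is mounted at /home/$USER, so $HOME
    # must point there. Without --home, $HOME is left alone.
    # Assumption: if $USER is missing, $HOME is not touched.
    if home and "USER" in env:
        env["HOME"] = "/home/" + env["USER"]
    # Append /bin unless it is already one of the search path items.
    # Assumption: an unset or empty $PATH becomes just "/bin".
    items = env["PATH"].split(":") if env.get("PATH") else []
    if "/bin" not in items:
        items.append("/bin")
    env["PATH"] = ":".join(items)
    return env
```

For example, a host environment with `PATH=/usr/bin` ends up with `PATH=/usr/bin:/bin`, while one that already lists /bin is unchanged.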

8.9.2. Setting variables with `--set-env` or `--set-env0`

The purpose of these two options is to set environment variables within the container. Values given replace any already in the environment (i.e., inherited from the host shell) or set by earlier uses of the options. These flags take an optional argument with two possible forms:

If the argument contains an equals sign (=, ASCII 61), that sets an environment variable directly. For example, to set FOO to the string value bar:
```
$ ch-run --set-env=FOO=bar ...
```
Single straight quotes around the value (', ASCII 39) are stripped, though be aware that both single and double quotes are also interpreted by the shell. For example, the following is similar to the prior example; the double quotes are removed by the shell and the single quotes are removed by ch-run:
```
$ ch-run --set-env="'BAZ=qux'" ...
```
If the argument does not contain an equals sign, it is a host path to a file containing zero or more variables using the same syntax as above (except with no prior shell processing).

With --set-env, this file contains a sequence of assignments separated by newline (\n or ASCII 10); with --set-env0, the assignments are separated by the null byte (i.e., \0 or ASCII 0). Empty assignments are ignored, and no comments are interpreted. (This syntax is designed to accept the output of printenv and be easily produced by other simple mechanisms.) The file need not be seekable.

For example:
```
$ cat /tmp/env.txt
FOO=bar
BAZ='qux'
$ ch-run --set-env=/tmp/env.txt ...
```
For directory images only (because the file is read before containerizing), guest paths can be given by prepending the image path.
If there is no argument, the file /ch/environment within the image is used. This file is commonly populated by ENV instructions in the Dockerfile. For example, equivalently to form 2:
```
$ cat Dockerfile
[...]
ENV FOO=bar
ENV BAZ=qux
[...]
$ ch-image build -t foo .
$ ch-convert foo /var/tmp/foo.sqfs
$ ch-run --set-env /var/tmp/foo.sqfs -- ...
```
(Note the image path is interpreted correctly, not as the --set-env argument.)

At present, there is no way to use files other than /ch/environment within SquashFS images.
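
The difference between the newline-separated and null-separated file forms can be sketched as below. This is only an illustration of the splitting rule; the function name is hypothetical and the UTF-8 decoding is an assumption, not something ch-run is documented to do.

```python
def split_assignments(data, null_separated=False):
    """Split the bytes of an environment file into assignment strings.

    null_separated=False mimics --set-env (newline separator);
    null_separated=True mimics --set-env0 (null-byte separator)."""
    sep = b"\0" if null_separated else b"\n"
    # Assumption: decode as UTF-8, preserving undecodable bytes.
    items = [item.decode("utf-8", "surrogateescape")
             for item in data.split(sep)]
    # Empty assignments are ignored; there is no comment syntax.
    return [item for item in items if item]
```

One reason to prefer the null-separated form is that a value may itself contain a newline: `split_assignments(b"A=1\0B=x\ny\0", null_separated=True)` keeps `B=x\ny` as a single assignment, whereas the newline form would split it in two.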

Environment variables are expanded for values that look like search paths, unless --env-no-expand is given prior to --set-env. In this case, the value is a sequence of zero or more possibly-empty items separated by colon (:, ASCII 58). If an item begins with dollar sign ($, ASCII 36), then the rest of the item is the name of an environment variable. If this variable is set to a non-empty value, that value is substituted for the item; otherwise (i.e., the variable is unset or the empty string), the item is deleted, including a delimiter colon. The purpose of omitting empty expansions is to avoid surprising behavior such as an empty element in $PATH meaning the current directory.

For example, to set HOSTPATH to the search path in the current shell (this is expanded by ch-run, though letting the shell do it happens to be equivalent):

$ ch-run --set-env='HOSTPATH=$PATH' ...

To prepend /opt/bin to this current search path:

$ ch-run --set-env='PATH=/opt/bin:$PATH' ...

To prepend /opt/bin to the search path set by the Dockerfile, as retrieved from guest file /ch/environment (here we really cannot let the shell expand $PATH):

$ ch-run --set-env --set-env='PATH=/opt/bin:$PATH' ...

Examples of valid assignment, assuming that environment variable BAR is set to bar and UNSET is unset or set to the empty string:

Assignment	Name	Value
`FOO=bar`	`FOO`	`bar`
`FOO=bar=baz`	`FOO`	`bar=baz`
`FLAGS=-march=foo -mtune=bar`	`FLAGS`	`-march=foo -mtune=bar`
`FLAGS='-march=foo -mtune=bar'`	`FLAGS`	`-march=foo -mtune=bar`
`FOO=$BAR`	`FOO`	`bar`
`FOO=$BAR:baz`	`FOO`	`bar:baz`
`FOO=`	`FOO`	empty string
`FOO=$UNSET`	`FOO`	empty string
`FOO=baz:$UNSET:qux`	`FOO`	`baz:qux` (not `baz::qux`)
`FOO=:bar:baz::`	`FOO`	`:bar:baz::`
`FOO=''`	`FOO`	empty string
`FOO=''''`	`FOO`	`''` (two single quotes)

Example invalid assignments:

Assignment	Problem
`FOO bar`	no equals separator
`=bar`	name cannot be empty

Example valid assignments that are probably not what you want:

Assignment	Name	Value	Problem
`FOO="bar"`	`FOO`	`"bar"`	double quotes aren’t stripped
`FOO=bar # baz`	`FOO`	`bar # baz`	comments not supported
`FOO=bar\tbaz`	`FOO`	`bar\tbaz`	backslashes are not special
` FOO=bar`	` FOO`	`bar`	leading space in key
`FOO= bar`	`FOO`	` bar`	leading space in value
`$FOO=bar`	`$FOO`	`bar`	variables not expanded in key
`FOO=$BAR baz:qux`	`FOO`	`qux`	variable `BAR baz` not set
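
The parsing and expansion rules illustrated by the tables above can be sketched in a few lines. This is a reading of the documented behavior, not ch-run's source: the function names are hypothetical, the sketch treats every value as a colon-separated list, and it assumes quote stripping happens before expansion.

```python
def parse_assignment(text):
    """Split NAME=VALUE at the first equals sign and strip one pair of
    single quotes around the value. Raises ValueError if invalid."""
    name, sep, value = text.partition("=")
    if not sep:
        raise ValueError("no equals separator")
    if not name:
        raise ValueError("name cannot be empty")
    if len(value) >= 2 and value.startswith("'") and value.endswith("'"):
        value = value[1:-1]
    return name, value


def expand(value, env):
    """Expand $VARIABLE items in a colon-separated value using env."""
    kept = []
    for item in value.split(":"):
        if item.startswith("$"):
            replacement = env.get(item[1:], "")
            if replacement:
                kept.append(replacement)
            # Unset or empty: the item and its delimiter are dropped.
        else:
            # Literal items, including empty ones, are kept as-is.
            kept.append(item)
    return ":".join(kept)


def assign(text, env, do_expand=True):
    """Return (name, value) for one assignment; do_expand=False stands
    for --env-no-expand."""
    name, value = parse_assignment(text)
    if do_expand:
        value = expand(value, env)
    return name, value
```

With `env = {"BAR": "bar"}`, `assign("FOO=baz:$UNSET:qux", env)` yields the value `baz:qux`, and `assign("FOO=:bar:baz::", env)` leaves the literal empty items alone, matching the corresponding table rows.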

8.9.3. Removing variables with `--unset-env`

The purpose of --unset-env=GLOB is to remove unwanted environment variables. The argument GLOB is a glob pattern (dialect fnmatch(3) with the FNM_EXTMATCH flag where supported); all variables with matching names are removed from the environment.

Warning

Because the shell also interprets glob patterns, if any wildcard characters are in GLOB, it is important to put it in single quotes to avoid surprises.

GLOB must be a non-empty string.

Example 1: Remove the single environment variable FOO:

$ export FOO=bar
$ env | fgrep FOO
FOO=bar
$ ch-run --unset-env=FOO $CH_TEST_IMGDIR/chtest -- env | fgrep FOO
$

Example 2: Hide from a container the fact that it’s running in a Slurm allocation, by removing all variables beginning with SLURM. You might want to do this to test an MPI program with one rank and no launcher:

$ salloc -N1
$ env | egrep '^SLURM' | wc
   44      44    1092
$ ch-run $CH_TEST_IMGDIR/mpihello-openmpi -- /hello/hello
[... long error message ...]
$ ch-run --unset-env='SLURM*' $CH_TEST_IMGDIR/mpihello-openmpi -- /hello/hello
0: MPI version:
Open MPI v3.1.3, package: Open MPI root@c897a83f6f92 Distribution, ident: 3.1.3, repo rev: v3.1.3, Oct 29, 2018
0: init ok cn001.localdomain, 1 ranks, userns 4026532530
0: send/receive ok
0: finalize ok

Example 3: Clear the environment completely (remove all variables):

$ ch-run --unset-env='*' $CH_TEST_IMGDIR/chtest -- env
$

Example 4: Remove all environment variables except for those prefixed with either WANTED_ or ALSO_WANTED_:

$ export WANTED_1=yes
$ export ALSO_WANTED_2=yes
$ export NOT_WANTED_1=no
$ ch-run --unset-env='!(WANTED_*|ALSO_WANTED_*)' $CH_TEST_IMGDIR/chtest -- env
WANTED_1=yes
ALSO_WANTED_2=yes
$

Note that some programs, such as shells, set some environment variables even if started with no init files:

$ ch-run --unset-env='*' $CH_TEST_IMGDIR/debian_9ch -- bash --noprofile --norc -c env
SHLVL=1
PWD=/
_=/usr/bin/env
$
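
The effect of --unset-env on an environment can be approximated with Python's fnmatch module. This is a sketch under stated assumptions: the function name is hypothetical, and Python's fnmatch has no equivalent of FNM_EXTMATCH, so extended patterns such as the `!(...)` form in Example 4 are not reproduced here.

```python
from fnmatch import fnmatchcase


def unset_env(env, glob):
    """Return a copy of env without the variables whose names match glob."""
    if not glob:
        raise ValueError("GLOB must be a non-empty string")
    # fnmatchcase() is case-sensitive on every platform.
    return {name: value for name, value in env.items()
            if not fnmatchcase(name, glob)}
```

For example, with `env = {"FOO": "bar", "SLURM_JOB_ID": "1", "PATH": "/bin"}`, the pattern `SLURM*` removes only `SLURM_JOB_ID`, and `*` removes everything.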

8.10. Examples

Run the command echo hello inside a Charliecloud container using the unpacked image at /data/foo:

$ ch-run /data/foo -- echo hello
hello

Run an MPI job that can use CMA to communicate:

$ srun ch-run --join /data/foo -- bar

8.11. Syslog

By default, ch-run logs its command line to syslog. (This can be disabled by configuring with --disable-syslog.) This includes: (1) the invoking real UID, (2) the number of command line arguments, and (3) the arguments, separated by spaces. For example:

Dec 10 18:19:08 mybox ch-run: uid=1000 args=7: ch-run -v /var/tmp/00_tiny -- echo hello "wor l}\$d"

Logging is one of the first things done during program initialization, even before command line parsing. That is, almost all command lines are logged, even if erroneous, and there is no logging of program success or failure.

Arguments are serialized with the following procedure. The purpose is to provide a human-readable reconstruction of the command line while also allowing each argument to be recovered byte-for-byte.

If an argument contains only printable ASCII bytes that are not whitespace, shell metacharacters, double quote (", ASCII 34 decimal), or backslash (\, ASCII 92), then log it unchanged.

Otherwise, (a) enclose the argument in double quotes and (b) backslash-escape double quotes, backslashes, and characters interpreted by Bash (including POSIX shells) within double quotes.

The verbatim command line typed in the shell cannot be recovered, because not enough information is provided to UNIX programs. For example, echo  'foo' is given to programs as a sequence of two arguments, echo and foo; the two spaces and single quotes are removed by the shell. The zero byte, ASCII NUL, cannot appear in arguments because it would terminate the string.
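
The serialization procedure above can be sketched as follows. This is not ch-run's code: the function names are hypothetical, and the exact set of characters logged unquoted is an assumption (a conservative allow-list rather than the real list of shell metacharacters). The escaped set covers double quote, backslash, dollar sign, and backtick; history expansion with `!` is not handled.

```python
import string

# Assumption: characters considered safe to log unquoted. Everything else,
# including whitespace and shell metacharacters, triggers quoting.
SAFE = set(string.ascii_letters + string.digits + "%+,-./:=@_")
# Characters that Bash interprets inside double quotes.
ESCAPE = set('"\\$`')


def serialize_arg(arg):
    """Return arg unchanged if it is non-empty and entirely safe; otherwise
    return it double-quoted with special characters backslash-escaped."""
    if arg and all(c in SAFE for c in arg):
        return arg
    escaped = "".join("\\" + c if c in ESCAPE else c for c in arg)
    return '"' + escaped + '"'


def log_message(uid, argv):
    """Build the message part of the log line: UID, argument count, and
    the serialized arguments separated by spaces."""
    args = " ".join(serialize_arg(a) for a in argv)
    return "uid=%d args=%d: %s" % (uid, len(argv), args)
```

Applied to the seven arguments of the example above, with the last argument being the literal string `wor l}$d`, `log_message(1000, argv)` reproduces the message portion of that syslog line.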

8.12. Exit status

If the user command is started successfully and exits normally, ch-run’s exit status is that of the user command. Otherwise, the exit status is one of:

31	Miscellaneous `ch-run` failure other than the below
49	Unable to start user command (i.e., `execvp(2)` failed)
84	SquashFUSE loop exited on signal before user command was complete
87	Feature queried by `--feature` is not available
128 + N	User command killed by signal N
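
A wrapper script might interpret these values along the following lines. This is a sketch, not part of Charliecloud; the function name and message strings are hypothetical. Note that the interpretation is inherently ambiguous, because a user command can itself exit with any of these values.

```python
import signal

# Exit statuses listed in the table above.
CH_RUN_CODES = {
    31: "miscellaneous ch-run failure",
    49: "unable to start user command",
    84: "SquashFUSE loop exited on signal before user command was complete",
    87: "feature queried by --feature is not available",
}


def describe_exit(status):
    """Return a human-readable guess at what a ch-run exit status means."""
    if status in CH_RUN_CODES:
        return CH_RUN_CODES[status]
    if status > 128:
        number = status - 128
        try:
            name = signal.Signals(number).name
        except ValueError:
            # Not a signal number known on this platform.
            name = "signal %d" % number
        return "user command killed by " + name
    return "user command exited with status %d" % status
```

For instance, a status of 143 is reported as a user command killed by SIGTERM (128 + 15).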
