
Using a config file to configure the k8s-device-plugin and gpu-feature-discovery



Kevin Klues <kklues@nvidia.com>



Last Updated: 03-June-2022


Table of Contents

Overview

Design Details

Deploying and Testing

Deployment via helm

Specifying Multiple Configuration Files

Updating Per-Node Configuration With a Node Label

Enabling gpu-feature-discovery

Overview

At present, the only way to configure the k8s-device-plugin or gpu-feature-discovery is via a set of command line flags or environment variables.

 

However, as we begin to add more sophisticated features to these components, a configuration file becomes a more appropriate way to express the complex settings these features make possible.

 

This document outlines the details of a new configuration file format and shows examples of how to deploy these components using it.

 

Design Details

Although it is possible to run the k8s-device-plugin as a standalone component, it is often run in conjunction with gpu-feature-discovery to apply resource-specific labels to nodes. In order to apply these labels effectively, gpu-feature-discovery must be aware of the configuration set up for the k8s-device-plugin. At present, only a small number of configuration options are shared between them, but as we move forward, more and more configuration options will need to be shared.

To this end, we have defined a configuration file that is common to both components. Common configuration options are presented at the top level in the configuration file, with component specific options embedded in sections specific to each component.

At present the following set of command line flags, environment variables, and default values are available when configuring the k8s-device-plugin:

Flag                       Envvar                     Default Value
--mig-strategy             $MIG_STRATEGY              "none"
--fail-on-init-error       $FAIL_ON_INIT_ERROR        true
--nvidia-driver-root       $NVIDIA_DRIVER_ROOT        "/"
--pass-device-specs        $PASS_DEVICE_SPECS         false
--device-list-strategy     $DEVICE_LIST_STRATEGY      "envvar"
--device-id-strategy       $DEVICE_ID_STRATEGY        "uuid"

 

Similarly, gpu-feature-discovery has the following flags, envvars, and default values:

Flag                       Envvar                     Default Value
--mig-strategy             $GFD_MIG_STRATEGY          "none"
--fail-on-init-error       $GFD_FAIL_ON_INIT_ERROR    true
--oneshot                  $GFD_ONESHOT               false
--no-timestamp             $GFD_NO_TIMESTAMP          false
--sleep-interval           $GFD_SLEEP_INTERVAL        60s
--output-file              $GFD_OUTPUT_FILE           /etc/kubernetes/node-feature-discovery/features.d/gfd

 

Merging these options (where appropriate) and defining a common configuration file around them results in the following format, shown here with the default value for each option:

 

version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: "envvar"
    deviceIDStrategy: "uuid"
  gfd:
    oneshot: false
    noTimestamp: false
    outputFile: /etc/kubernetes/node-feature-discovery/features.d/gfd
    sleepInterval: 60s

To use this new configuration file, both k8s-device-plugin and gpu-feature-discovery have been extended with the following flag / envvar:

 

--config-file $CONFIG_FILE ""

 

Note: The existing flags / envvars can still be used to configure each of these components. The order of precedence when applying a configuration option is (1) flag, (2) envvar, (3) config file.
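
For example, here is a minimal sketch of how this precedence plays out when the plugin binary is invoked directly. The binary name and config file path below are illustrative assumptions; only the flag, envvar, and config-file option names come from the tables above, and the config file is assumed to contain migStrategy: "none".

$ MIG_STRATEGY=single nvidia-device-plugin \
    --config-file=/tmp/config.yaml \
    --mig-strategy=mixed

# The flag wins, so the effective mig-strategy is "mixed"; without the flag,
# the envvar ("single") would take precedence over the config file ("none").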

 

Deploying and Testing

This section walks through the steps to deploy and run the k8s-device-plugin and gpu-feature-discovery components using a configuration file as described above. These instructions assume you are deploying via helm.

In general, we provide a mechanism to pass multiple configuration files to helm, with the ability to choose which configuration file should be applied to a node via a node label.

In this way, a single daemonset can be used to deploy each component, but custom configurations can be applied to different nodes throughout the cluster.

Deployment via helm

First, add the nvidia-device-plugin repository if you don’t have it already:

 

$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
$ helm repo update

 

Then, verify that v0.12.0 (or later) of the plugin is available:

 

$ helm search repo nvdp --devel
NAME                         CHART VERSION  APP VERSION  DESCRIPTION
nvdp/nvidia-device-plugin    0.12.2         0.12.2       A Helm chart for ...

 

Create a valid config file on your local filesystem, such as the following:

 

cat << EOF > /tmp/dp-example-config0.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
EOF

 

And deploy the device plugin via helm (pointing it at this config file and giving it a name):

 

$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.12.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set-file config.map.config=/tmp/dp-example-config0.yaml

 

Under the hood this will deploy a configmap associated with the plugin and put the contents of the config file into it, using the name ‘config’ as its key. It will then start the plugin such that this config gets applied when the plugin comes online.
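
If you want to verify this, you can inspect the namespace with standard kubectl commands. The exact name of the chart-generated configmap is determined by the chart, so list the configmaps first; the placeholder below is not a real name.

$ kubectl get configmaps -n nvidia-device-plugin
$ kubectl get configmap -n nvidia-device-plugin <chart-generated-name> -o yaml

The data section of that configmap should contain a single entry keyed by 'config', holding the contents of /tmp/dp-example-config0.yaml.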

 

If you don’t want the plugin’s helm chart to create the configmap for you, you can also point it at a pre-created configmap as follows:

 

$ kubectl create ns nvidia-device-plugin

 

$ kubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \
    --from-file=config=/tmp/dp-example-config0.yaml

 

$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.12.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set config.name=nvidia-plugin-configs

 

Specifying Multiple Configuration Files

As mentioned previously, multiple config files can be specified, with the ability to pick a default one and use a node label to customize which one is actually used on a node-by-node basis.

 

To do this, create a second config file with the following contents:

cat << EOF > /tmp/dp-example-config1.yaml
version: v1
flags:
  migStrategy: "mixed" # Only change from config0.yaml
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
EOF

 

And redeploy the device plugin via helm (pointing it at both configs with a specified default).

$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.12.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set config.default=config0 \
    --set-file config.map.config0=/tmp/dp-example-config0.yaml \
    --set-file config.map.config1=/tmp/dp-example-config1.yaml

As before, this can also be done with a pre-created configmap if desired:

$ kubectl create ns nvidia-device-plugin

 

$ kubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \
    --from-file=config0=/tmp/dp-example-config0.yaml \
    --from-file=config1=/tmp/dp-example-config1.yaml

 

$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.12.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set config.default=config0 \
    --set config.name=nvidia-plugin-configs

Note: If the config.default flag is not explicitly set, then a default value will be inferred from the config if one of the config names is set to ‘default’. If neither of these is set, the deployment will fail unless only a single config is provided, in which case that config is chosen as the default since there is no other option.
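
As a sketch of the inference described in the note above (assuming the same config.map.<name> syntax used earlier), naming one of the entries ‘default’ should make the explicit config.default setting unnecessary:

$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.12.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set-file config.map.default=/tmp/dp-example-config0.yaml \
    --set-file config.map.config1=/tmp/dp-example-config1.yaml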

Updating Per-Node Configuration With a Node Label

With this setup, plugins on all nodes will have config0 configured for them by default. However, the following label can be set to change which configuration is applied:

$ kubectl label nodes <node-name> --overwrite \
    nvidia.com/device-plugin.config=<config-name>

Note: This label can be applied either before or after the plugin is started in order to get the desired configuration applied on the node. Any time its value changes, the plugin will immediately be updated to start serving the desired configuration. If it is set to an unknown value, the plugin will skip reconfiguration. If it is ever unset, the plugin will fall back to the default.
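
For example, to move a node onto config1 and then later return it to the default, you can set and then remove the label (the trailing ‘-’ is standard kubectl syntax for removing a label):

$ kubectl label nodes <node-name> --overwrite \
    nvidia.com/device-plugin.config=config1

$ kubectl label nodes <node-name> nvidia.com/device-plugin.config-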

Enabling gpu-feature-discovery

As of v0.12.0, the gpu-feature-discovery helm chart is now included as a subchart of the plugin. To enable it, simply set gfd.enabled=true during helm install.

 

$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.12.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set gfd.enabled=true \
    --set config.default=config0 \
    --set-file config.map.config0=/tmp/dp-example-config0.yaml \
    --set-file config.map.config1=/tmp/dp-example-config1.yaml

Under the hood, this will also deploy node-feature-discovery, since it is a prerequisite of gpu-feature-discovery. If you already have node-feature-discovery deployed on your cluster and do not wish for it to be pulled in by this installation, you can disable it with nfd.enabled=false.
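
For example, the deployment above can be repeated against a cluster that already runs node-feature-discovery by adding that one extra flag (everything else is unchanged from the command in the previous step):

$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.12.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set gfd.enabled=true \
    --set nfd.enabled=false \
    --set config.default=config0 \
    --set-file config.map.config0=/tmp/dp-example-config0.yaml \
    --set-file config.map.config1=/tmp/dp-example-config1.yaml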

Note: The same (plugin-specific) label is used to reconfigure both the plugin and GFD.


