Using a config file to configure the k8s-device-plugin and gpu-feature-discovery
Kevin Klues <kklues@nvidia.com>
Last Updated: 03-June-2022
Table of Contents
Overview
Design Details
Deploying and Testing
    Deployment via helm
    Specifying Multiple Configuration Files
    Updating Per-Node Configuration With a Node Label
    Enabling gpu-feature-discovery
Overview
At present, the only way to configure the k8s-device-plugin or gpu-feature-discovery is via a set of command line flags or environment variables.
However, as we add more sophisticated features to these components, a configuration file becomes a more appropriate way to express the complex settings they require.
This document outlines the details of a new configuration file format and shows examples of how to deploy these components using it.
Design Details
Although it is possible to run the k8s-device-plugin as a standalone component, it is often run in conjunction with gpu-feature-discovery to apply resource-specific labels to nodes. In order to apply these labels effectively, gpu-feature-discovery must be aware of the configuration set up for the k8s-device-plugin. At present, only a small number of configuration options are shared between them, but as we move forward, more and more configuration options will need to be shared.
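To make this concrete, the labels gpu-feature-discovery applies can be inspected directly on a node. The labels and values shown below are representative examples only; the actual set depends on the hardware and configuration of the node:

$ kubectl get node <node-name> -o json | jq '.metadata.labels'
{
  ...
  "nvidia.com/gpu.count": "2",
  "nvidia.com/gpu.product": "Tesla-V100-SXM2-16GB",
  "nvidia.com/mig.strategy": "none",
  ...
}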
To this end, we have defined a configuration file that is common to both components. Common configuration options are presented at the top level in the configuration file, with component specific options embedded in sections specific to each component.
At present, the following command line flags, environment variables, and default values are available when configuring the k8s-device-plugin:
Flag                    | Envvar                 | Default Value
--mig-strategy          | $MIG_STRATEGY          | "none"
--fail-on-init-error    | $FAIL_ON_INIT_ERROR    | true
--nvidia-driver-root    | $NVIDIA_DRIVER_ROOT    | "/"
--pass-device-specs     | $PASS_DEVICE_SPECS     | false
--device-list-strategy  | $DEVICE_LIST_STRATEGY  | "envvar"
--device-id-strategy    | $DEVICE_ID_STRATEGY    | "uuid"
Similarly, gpu-feature-discovery has the following flags, envvars, and default values:
Flag                  | Envvar                  | Default Value
--mig-strategy        | $GFD_MIG_STRATEGY       | "none"
--fail-on-init-error  | $GFD_FAIL_ON_INIT_ERROR | true
--oneshot             | $GFD_ONESHOT            | false
--no-timestamp        | $GFD_NO_TIMESTAMP       | false
--sleep-interval      | $GFD_SLEEP_INTERVAL     | 60s
--output-file         | $GFD_OUTPUT_FILE        | /etc/kubernetes/node-feature-discovery/features.d/gfd
Merging these options (where appropriate) and defining a common configuration file around them results in the following (with the default values for each option shown below):
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: "envvar"
    deviceIDStrategy: "uuid"
  gfd:
    oneshot: false
    noTimestamp: false
    outputFile: /etc/kubernetes/node-feature-discovery/features.d/gfd
    sleepInterval: 60s
To use this new configuration file, both k8s-device-plugin and gpu-feature-discovery have been extended with the following flag / envvar:
Flag           | Envvar        | Default Value
--config-file  | $CONFIG_FILE  | ""
Note: The existing flags / envvars can still be used to configure each of these components. The order of precedence when applying a configuration option is (1) flag, (2) envvar, (3) config file.
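As an example of this precedence, consider the following hypothetical invocation (the binary invocation and config file path are placeholders for illustration):

# The flag wins over the envvar, which wins over the config file.
$ MIG_STRATEGY=single nvidia-device-plugin \
    --config-file=/etc/nvidia/dp-config.yaml \
    --mig-strategy=mixed
# Effective migStrategy: "mixed" (from the flag), even though the envvar says
# "single" and the config file may specify something else entirely.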
Deploying and Testing
This section walks through the steps to deploy and run the k8s-device-plugin and gpu-feature-discovery components using a configuration file as described above. These instructions assume you are deploying via helm.
In general, we provide a mechanism to pass multiple configuration files to helm, with the ability to choose which configuration file should be applied to a node via a node label.
In this way, a single daemonset can be used to deploy each component, but custom configurations can be applied to different nodes throughout the cluster.
Deployment via helm
First, add the nvidia-device-plugin repository if you don’t have it already:
$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
$ helm repo update
Then, verify that v0.12.0 (or later) of the plugin is available:
$ helm search repo nvdp --devel
NAME                       CHART VERSION  APP VERSION  DESCRIPTION
nvdp/nvidia-device-plugin  0.12.2         0.12.2       A Helm chart for ...
Create a valid config file on your local filesystem, such as the following:
cat << EOF > /tmp/dp-example-config0.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
EOF
And deploy the device plugin via helm (pointing it at this config file and giving it a name):
$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.12.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set-file config.map.config=/tmp/dp-example-config0.yaml
Under the hood, this will deploy a ConfigMap associated with the plugin and put the contents of the config file into it, using 'config' as its key. It will then start the plugin such that this config gets applied when the plugin comes online.
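You can verify this after deployment by listing the ConfigMaps in the release namespace and inspecting the one created by the chart (its exact name is generated by the chart, so list first rather than guessing it):

$ kubectl get cm -n nvidia-device-plugin
$ kubectl describe cm -n nvidia-device-plugin <configmap-name>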
If you don’t want the plugin’s helm chart to create the configmap for you, you can also point it at a pre-created configmap as follows:
$ kubectl create ns nvidia-device-plugin
$ kubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \
    --from-file=config=/tmp/dp-example-config0.yaml
$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.12.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set config.name=nvidia-plugin-configs
Specifying Multiple Configuration Files
As mentioned previously, multiple config files can be specified, with the ability to pick a default one and use a node label to customize which one is actually used on a node-by-node basis.
To do this, create a second config file with the following contents:
cat << EOF > /tmp/dp-example-config1.yaml
version: v1
flags:
  migStrategy: "mixed" # Only change from config0.yaml
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
EOF
And redeploy the device plugin via helm, this time pointing it at both config files and specifying a default:
$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.12.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set config.default=config0 \
    --set-file config.map.config0=/tmp/dp-example-config0.yaml \
    --set-file config.map.config1=/tmp/dp-example-config1.yaml
As before, this can also be done with a pre-created configmap if desired:
$ kubectl create ns nvidia-device-plugin
$ kubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \
    --from-file=config0=/tmp/dp-example-config0.yaml \
    --from-file=config1=/tmp/dp-example-config1.yaml
$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.12.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set config.default=config0 \
    --set config.name=nvidia-plugin-configs
Note: If the config.default flag is not explicitly set, a default will be inferred from the provided configs if one of them is named 'default'. If config.default is unset and no config is named 'default', the deployment will fail unless exactly one config is provided, in which case that config is chosen as the default since there is no other option.
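For example, the following hypothetical invocation omits config.default entirely and instead names one of the configs 'default', letting the chart infer it:

$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.12.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set-file config.map.default=/tmp/dp-example-config0.yaml \
    --set-file config.map.config1=/tmp/dp-example-config1.yaml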
Updating Per-Node Configuration With a Node Label
With this setup, the plugin on every node will use config0 by default. However, the following node label can be set to change which configuration is applied on a given node:
$ kubectl label nodes <node-name> --overwrite \
    nvidia.com/device-plugin.config=<config-name>
Note: This label can be applied either before or after the plugin is started; the desired configuration will be applied on the node either way. Whenever its value changes, the plugin is immediately updated to start serving the new configuration. If the label is set to an unknown value, the plugin skips reconfiguration. If the label is ever removed, the plugin falls back to the default.
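For example, to send a node back to the default configuration, remove the label using standard kubectl syntax:

$ kubectl label nodes <node-name> nvidia.com/device-plugin.config-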
Enabling gpu-feature-discovery
As of v0.12.0, the gpu-feature-discovery helm chart is included as a subchart of the plugin. To enable it, set gfd.enabled=true during helm install:
$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.12.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set gfd.enabled=true \
    --set config.default=config0 \
    --set-file config.map.config0=/tmp/dp-example-config0.yaml \
    --set-file config.map.config1=/tmp/dp-example-config1.yaml
Under the hood this will also deploy node-feature-discovery since it is a prerequisite of gpu-feature-discovery. If you already have node-feature-discovery deployed on your cluster and do not wish for it to be pulled in by this installation, you can disable it with nfd.enabled=false.
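For example, the deployment from above can be repeated with node-feature-discovery explicitly disabled (a sketch; only the nfd.enabled flag differs):

$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.12.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set gfd.enabled=true \
    --set nfd.enabled=false \
    --set config.default=config0 \
    --set-file config.map.config0=/tmp/dp-example-config0.yaml \
    --set-file config.map.config1=/tmp/dp-example-config1.yaml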
Note: The same nvidia.com/device-plugin.config node label is used to reconfigure both the plugin and GFD.
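As a quick end-to-end check (illustrative only; the label value depends on the config you switch to), pointing a node at config1 from above (migStrategy: "mixed") should cause GFD to update the corresponding MIG strategy label on that node:

$ kubectl label nodes <node-name> --overwrite \
    nvidia.com/device-plugin.config=config1
$ kubectl get node <node-name> \
    -o jsonpath='{.metadata.labels.nvidia\.com/mig\.strategy}'
mixed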