SRIOV Network device plugin for Kubernetes
SRIOV Network Device Plugin
The SRIOV network device plugin is a Kubernetes device plugin for discovering and advertising SRIOV network virtual functions (VFs) on a Kubernetes host.
Features
- Handles SRIOV capable/not-capable devices (NICs and Accelerators alike)
- Supports devices with both kernel and userspace (uio and VFIO) drivers
- Allows resource grouping using "Selector"
- User configurable resourceName
- Detects Kubelet restarts and re-registers automatically
- Detects link status (for Linux network devices) and updates the health of the associated VFs accordingly
- Extensible to support new device types with minimal effort if not already supported
To deploy workloads with SRIOV VFs, this plugin needs to work together with the following two CNI plugins:
- Multus CNI
  - Retrieves allocated network device information of a Pod
- SRIOV CNI
  - During Pod creation, plumbs the allocated SRIOV VF into the Pod's network namespace using VF information given by Multus
  - On Pod deletion, resets and releases the VF from the Pod
This implementation follows the design discussed in this proposal document.
Please follow the Multus Quick Start for multi network interface support in Kubernetes.
Supported SRIOV NICs
The following NICs were tested with this implementation. However, other SRIOV capable NICs should work as well.
- Intel® Ethernet Controller X710 Series 4x10G
  - PF driver: v2.4.6
  - VF driver: v3.5.6
  Please refer to the Intel download center for installing the latest Intel® Ethernet Controller X710 Series drivers.
- Intel® 82599ES 10 Gigabit Ethernet Controller
  - PF driver: v4.4.0-k
  - VF driver: v3.2.2-k
  Please refer to the Intel download center for installing the latest Intel® 82599ES 10 Gigabit Ethernet drivers.
- Mellanox ConnectX®-4 Lx EN Adapter
- Mellanox ConnectX®-5 Adapter
Network card drivers are available as part of the various Linux distributions and upstream.
The latest Mellanox NIC drivers can be downloaded from the Mellanox website.
Quick Start
This section explains an example deployment of the SRIOV Network device plugin in Kubernetes. The required YAML files can be found in the deployments/ directory.
Network Object CRDs
Multus uses Custom Resource Definitions (CRDs) for defining additional network attachments. These network attachment CRDs follow the standards defined by the K8s Network Plumbing Working Group (NPWG). Please refer to the Multus documentation for more information.
Build and configure Multus
- Compile Multus executable:
$ git clone https://github.com/intel/multus-cni.git
$ cd multus-cni
$ ./build
$ cp bin/multus /opt/cni/bin
- Copy the Multus configuration file from the deployments folder to the CNI configuration directory
$ cp deployments/cni-conf.json /etc/cni/net.d/
- Configure Kubernetes network CRD with Multus
$ kubectl create -f deployments/crdnetwork.yaml
Build SRIOV CNI
- Compile SRIOV-CNI (dev/k8s-deviceid-model branch):
$ git clone https://github.com/intel/sriov-cni.git
$ cd sriov-cni
$ make
$ cp build/sriov /opt/cni/bin
- Create the SRIOV Network CRD
$ kubectl create -f deployments/sriov-crd.yaml
Build and run SRIOV network device plugin
- Clone the sriov-network-device-plugin
$ git clone https://github.com/intel/sriov-network-device-plugin.git
$ cd sriov-network-device-plugin
- Build the executable binary using make
$ make
On a successful build, the sriovdp executable can be found in the ./build directory. It is recommended to run the plugin in a container or K8s Pod. The following steps cover how to build and run the Docker image of the plugin.
- Build the Docker image
$ make image
See the following sections on how to configure and run the SRIOV device plugin.
Configurations
Config parameters
This plugin creates device plugin endpoints based on the configuration given in the file /etc/pcidp/config.json. This configuration file is in JSON format as shown below:
{
    "resourceList": [
        {
            "resourceName": "intel_sriov_netdevice",
            "selectors": {
                "vendors": ["8086"],
                "devices": ["154c", "10ed"],
                "drivers": ["i40evf", "ixgbevf"]
            }
        },
        {
            "resourceName": "intel_sriov_dpdk",
            "selectors": {
                "vendors": ["8086"],
                "devices": ["154c", "10ed"],
                "drivers": ["vfio-pci"],
                "pfNames": ["enp0s0f0", "enp2s2f1"]
            }
        },
        {
            "resourceName": "mlnx_sriov_rdma",
            "isRdma": true,
            "selectors": {
                "vendors": ["15b3"],
                "devices": ["1018"],
                "drivers": ["mlx5_ib"]
            }
        }
    ]
}
"resourceList" should contain a list of config objects. Each config object may consist of the following fields:
| Field | Required | Description | Type - Accepted values | Example |
|---|---|---|---|---|
| "resourceName" | Yes | Endpoint resource name | string - must be unique and should not contain special characters | "sriov_net_A" |
| "selectors" | No | A map of device selectors | Each selector is a map of string list. | "vendors": ["8086"], "devices": ["154c", "10ed"], "drivers": ["vfio-pci"], "pfNames": ["enp2s2f0"] |
| "isRdma" | No | Mount RDMA resources | bool - boolean value true or false | "isRdma": true |
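The field rules in the table can be checked before the plugin is started. The following is a minimal sketch, not part of the plugin: the function name `validate_config` is hypothetical, and "no special characters" is interpreted here as letters, digits and underscores only, which is an assumption rather than a documented rule.

```python
import json
import re

# Assumption: "no special characters" is read as letters, digits and underscores.
NAME_PATTERN = re.compile(r"^[A-Za-z0-9_]+$")


def validate_config(text):
    """Return a list of problems found in a config.json document (empty if none)."""
    config = json.loads(text)
    if not isinstance(config, dict) or not isinstance(config.get("resourceList"), list):
        return ['"resourceList" must be a list of config objects']

    problems = []
    seen_names = set()
    for index, resource in enumerate(config["resourceList"]):
        if not isinstance(resource, dict):
            problems.append(f"entry {index}: must be a JSON object")
            continue

        # "resourceName" is required and must be unique.
        name = resource.get("resourceName")
        if not isinstance(name, str) or not NAME_PATTERN.match(name):
            problems.append(f"entry {index}: missing or invalid resourceName")
        elif name in seen_names:
            problems.append(f"entry {index}: duplicate resourceName {name!r}")
        else:
            seen_names.add(name)

        # "selectors" is optional; each selector maps to a list of strings.
        selectors = resource.get("selectors", {})
        if not isinstance(selectors, dict):
            problems.append(f"entry {index}: selectors must be a map")
        else:
            for key, values in selectors.items():
                if not (isinstance(values, list) and all(isinstance(v, str) for v in values)):
                    problems.append(f"entry {index}: selector {key!r} must be a list of strings")

        # "isRdma" is optional and must be a boolean when present.
        if not isinstance(resource.get("isRdma", False), bool):
            problems.append(f"entry {index}: isRdma must be true or false")
    return problems
```

Calling `validate_config(open("/etc/pcidp/config.json").read())` returns an empty list when the file satisfies these checks.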
Command line arguments
This plugin accepts the following optional run-time command line arguments:
$ ./sriovdp --help
Usage of ./sriovdp:
  -alsologtostderr
        log to standard error as well as files
  -config-file string
        JSON device pool config file location (default "/etc/pcidp/config.json")
  -log_backtrace_at value
        when logging hits line file:N, emit a stack trace
  -log_dir string
        If non-empty, write log files in this directory
  -logtostderr
        log to standard error instead of files
  -resource-prefix string
        resource name prefix used for K8s extended resource (default "intel.com")
  -stderrthreshold value
        logs at or above this threshold go to stderr
  -v value
        log level for V logs
  -vmodule value
        comma-separated list of pattern=N settings for file-filtered logging
Assumptions
This plugin does not bind or unbind any driver to any device, whether PF or VF. It does not create virtual functions either. Usually, the virtual functions are created at boot time when the kernel module for the device is loaded. Required device drivers can be loaded at system boot by white-listing/black-listing the right modules. However, the plugin needs to be aware of the driver type of the resources (i.e. devices) that it registers as K8s extended resources so that it can create appropriate Device Specs for the requested resource.
For example, if the driver type is uio (i.e. igb_uio.ko), there are specific device files to add to the Device Spec. For vfio-pci, the device files are different. If it is a Linux kernel network driver, there is no device file to add.
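As an illustration of why the driver type matters, the sketch below maps a driver name to the device files a container would typically need. It is not taken from the plugin's source: the function `device_files` and its parameters are hypothetical, and the paths follow the usual Linux conventions (`/dev/vfio/vfio` plus `/dev/vfio/<IOMMU group>` for VFIO, `/dev/uio<N>` for UIO), which the plugin is only assumed to use.

```python
def device_files(driver, iommu_group=None, uio_index=None):
    """Return the host device files to expose for a device bound to `driver`.

    Assumption: paths follow standard Linux VFIO/UIO naming; the plugin's
    actual Device Spec contents may differ.
    """
    if driver == "vfio-pci":
        if iommu_group is None:
            raise ValueError("vfio-pci devices need an IOMMU group number")
        # The VFIO container device plus the device's IOMMU group file.
        return ["/dev/vfio/vfio", f"/dev/vfio/{iommu_group}"]
    if driver == "igb_uio":
        if uio_index is None:
            raise ValueError("uio devices need a uio index")
        return [f"/dev/uio{uio_index}"]
    # Kernel network drivers expose a netdevice, so no device file is added.
    return []
```

For instance, `device_files("vfio-pci", iommu_group=42)` yields the two VFIO paths, while `device_files("i40evf")` yields an empty list.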
The idea here is that the user creates a resource config for each resource pool, as shown in Config parameters, by specifying the resource name and a list of resource "selectors".
The device plugin initially discovers all PCI network resources in the host and populates an initial "device list". Each "resource pool" then applies its selectors to this list and adds the devices that satisfy the selectors' constraints. Each selector narrows down the list of devices for the resource pool. Currently, the selectors are applied in the following order:
- "vendors" - The vendor hex code of device
- "devices" - The device hex code of device
- "drivers" - The driver name the device is registered with
- "pfNames" - The Physical function name
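The narrowing described above can be sketched as a chain of filters. This is an illustrative model only: the `Device` class, its field names and `select_devices` are hypothetical, and treating an absent or empty selector as "no restriction" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Device:
    address: str   # PCI address, e.g. "0000:03:02.1"
    vendor: str    # vendor hex code
    device: str    # device hex code
    driver: str    # driver the device is bound to
    pf_name: str   # name of the parent Physical function


# Selector key -> Device attribute, in the order the selectors are applied.
SELECTOR_ORDER = [
    ("vendors", "vendor"),
    ("devices", "device"),
    ("drivers", "driver"),
    ("pfNames", "pf_name"),
]


def select_devices(devices, selectors):
    """Apply each selector in turn, narrowing the initial device list."""
    selected = list(devices)
    for key, attribute in SELECTOR_ORDER:
        allowed = selectors.get(key)
        if not allowed:
            # Assumption: a missing or empty selector does not narrow the list.
            continue
        selected = [d for d in selected if getattr(d, attribute) in allowed]
    return selected
```

With the "intel_sriov_dpdk" selectors from the sample config, only devices with vendor 8086, a listed device ID, the vfio-pci driver and a listed PF name would remain in the pool.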
Workflow
- Load the device's (Physical function's, if it is SRIOV capable) kernel module and bind the driver to the PF
- Create the required Virtual functions
- Bind all VFs to the right drivers
- Create a resource config entry in /etc/pcidp/config.json
- Run the SRIOV device plugin (as a daemonset)
On a successful run, the allocatable resource list for the node should be updated with the resources discovered by the plugin, as shown below. Note that the resource name is prefixed with the value of -resource-prefix, e.g. "intel.com/sriov_net_A".
$ kubectl get node node1 -o json | jq '.status.allocatable'
{
    "cpu": "8",
    "ephemeral-storage": "169986638772",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "8Gi",
    "intel.com/sriov_net_A": "8",
    "intel.com/sriov_net_B": "8",
    "memory": "7880620Ki",
    "pods": "1k"
}
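To check this from a script, the allocatable map can be filtered by the resource prefix. The helper below is a small sketch; `plugin_resources` is a hypothetical name, and it assumes the input is the JSON object printed by the kubectl/jq command above.

```python
import json


def plugin_resources(allocatable_json, prefix="intel.com"):
    """Return {resource name: count} for extended resources under `prefix`."""
    allocatable = json.loads(allocatable_json)
    return {
        name: int(count)
        for name, count in allocatable.items()
        if name.startswith(prefix + "/")
    }
```

Applied to the output above, this would leave only the two "intel.com/..." entries, each with a count of 8.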
Example deployments
We assume that you have a working K8s cluster configured with the Multus meta plugin for multi-network support. Please see the Features and Quick Start sections for more information on the required CNI plugins.
The images directory contains an example Dockerfile and sample specs, along with build scripts to deploy the SRIOV device plugin as a daemonset. Please see README.md for building the Docker image.
# Create ConfigMap
$ kubectl create -f images/configMap.yaml
configmap/sriovdp-config created
# Create sriov-device-plugin-daemonset
$ kubectl create -f images/sriovdp-daemonset.yaml
daemonset.extensions/kube-sriov-device-plugin-amd64 created
$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system kube-sriov-device-plugin-amd64-46wpv 1/1 Running 0 4s
Example Pod specs and related network CRD YAML files for sample deployments can be found in the deployments directory.
Testing SRIOV workloads
Leave the SRIOV device plugin running and open a new terminal session for the following steps.
Deploy test Pod
$ kubectl create -f pod-tc1.yaml
pod "testpod1" created
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
sriov-device-plugin 1/1 Running 0 7h
testpod1 1/1 Running 0 3s
Verify Pod network interfaces
$ kubectl exec -it testpod1 -- ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
3: eth0@if17511: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP
link/ether 0a:58:c0:a8:4a:b1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.74.177/24 scope global eth0
valid_lft forever preferred_lft forever
17508: net0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 1000
link/ether ce:d8:06:08:e6:3f brd ff:ff:ff:ff:ff:ff
inet 10.56.217.179/24 scope global net0
valid_lft forever preferred_lft forever
Verify Pod routing table
$ kubectl exec -it testpod1 -- route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 192.168.74.1 0.0.0.0 UG 0 0 0 eth0
10.56.217.0 0.0.0.0 255.255.255.0 U 0 0 0 net0
192.168.0.0 192.168.74.1 255.255.0.0 UG 0 0 0 eth0
192.168.74.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
Pod device information
The allocated device information is exported in an environment variable of the container. The variable name is PCIDEVICE_ followed by the full extended resource name (e.g. intel.com/sriov), capitalized and with any special characters (".", "/") replaced by underscores ("_"). When multiple devices are allocated from the same extended resource pool, the device IDs are delimited with commas (",").
For example, if 2 devices are allocated from the intel.com/sriov extended resource, the allocated device information will be found in the following env variable:
PCIDEVICE_INTEL_COM_SRIOV=0000:03:02.1,0000:03:04.3
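A workload can recover its device IDs by applying the same naming rule. The following is a minimal sketch with hypothetical helper names (`device_env_name`, `allocated_devices`); it replaces only the two special characters named above ("." and "/"), which is an assumption about how other characters are treated.

```python
import os
import re


def device_env_name(resource_name):
    """Build the env variable name for an extended resource name."""
    # Replace "." and "/" with "_" and capitalize, then add the fixed prefix.
    return "PCIDEVICE_" + re.sub(r"[./]", "_", resource_name).upper()


def allocated_devices(resource_name, environ=None):
    """Return the list of device IDs allocated from `resource_name`."""
    environ = os.environ if environ is None else environ
    value = environ.get(device_env_name(resource_name), "")
    # Device IDs are comma-delimited; skip empty items.
    return [device_id for device_id in value.split(",") if device_id]
```

Inside the container, `allocated_devices("intel.com/sriov")` reads the variable from the process environment; the `environ` argument exists only to make the function easy to try outside a Pod.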
Issues and Contributing
We welcome your feedback and contributions to this project. Please see the CONTRIBUTING.md for contribution guidelines.
Copyright 2018 © Intel Corporation.