AWS Container Insights Receiver
Overview
AWS Container Insights Receiver (awscontainerinsightreceiver) is an AWS-specific receiver that supports CloudWatch Container Insights. CloudWatch Container Insights collects, aggregates, and summarizes metrics and logs from your containerized applications and microservices. Data is collected as performance log events using embedded metric format (EMF). From the EMF data, Amazon CloudWatch can create aggregated CloudWatch metrics at the cluster, node, pod, task, and service level.
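To make the embedded metric format more concrete, the following is a minimal sketch that builds one EMF-style performance log event as a Python dictionary. The `_aws` / `CloudWatchMetrics` layout follows the general EMF structure; the function name `build_emf_event` and all attribute values (cluster name, node name, instance ID, metric value) are hypothetical, and the events actually emitted by the receiver and exporter may carry more fields.

```python
import json
import time


def build_emf_event(namespace, dimensions, metrics, attributes):
    """Build one EMF-style log event as a dict (illustrative helper, not part of the receiver).

    metrics maps metric name -> (value, unit); attributes maps attribute name -> value.
    """
    event = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since the epoch
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": dimensions,
                    "Metrics": [
                        {"Name": name, "Unit": unit}
                        for name, (_, unit) in metrics.items()
                    ],
                }
            ],
        }
    }
    # Dimension values and metric values sit at the top level of the event.
    event.update(attributes)
    event.update({name: value for name, (value, _) in metrics.items()})
    return event


event = build_emf_event(
    namespace="ContainerInsights",
    dimensions=[["NodeName", "InstanceId", "ClusterName"], ["ClusterName"]],
    metrics={"node_cpu_utilization": (12.5, "Percent")},
    attributes={
        # Hypothetical values, for illustration only.
        "ClusterName": "example-cluster",
        "NodeName": "ip-10-0-0-1.ec2.internal",
        "InstanceId": "i-0123456789abcdef0",
        "Type": "Node",
    },
)
print(json.dumps(event, indent=2))
```

Each inner list under `Dimensions` names one set of dimensions for which CloudWatch can create an aggregated metric from the event.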
CloudWatch Container Insights has been supported by the ECS Agent and the CloudWatch Agent to collect infrastructure metrics for many resources such as CPU, memory, disk, and network. To migrate existing customers to OpenTelemetry, AWS Container Insights Receiver (together with CloudWatch EMF Exporter) aims to support the same CloudWatch Container Insights experience for the following platforms:
- Amazon ECS
- Amazon EKS
- Kubernetes platforms on Amazon EC2
Design of AWS Container Insights Receiver
See the design doc
Configuration
Example configuration:
receivers:
  awscontainerinsightreceiver:
    # all parameters are optional
    collection_interval: 60s
    container_orchestrator: eks
    add_service_as_attribute: true
    prefer_full_pod_name: false
    add_full_pod_name_metric_label: false
There is no need to provide any parameters since they are all optional.
collection_interval (optional)
The interval at which metrics should be collected. The default is 60 seconds.
container_orchestrator (optional)
The type of container orchestration service, e.g. eks or ecs. The default is eks.
add_service_as_attribute (optional)
Whether to add the associated service name as an attribute. The default is true.
prefer_full_pod_name (optional)
The "PodName" attribute is set based on the name of the relevant controller, such as a DaemonSet, Job, ReplicaSet, or ReplicationController. If it cannot be set that way and prefer_full_pod_name is true, the "PodName" attribute is set to the pod's own name. The default is false.
add_full_pod_name_metric_label (optional)
The "FullPodName" attribute is the pod name including its suffix. If false, the FullPodName label is not added. The default is false.
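Because every parameter is optional, the effective settings are the documented defaults overlaid with whatever the user supplies. The sketch below illustrates that merge in Python; `DEFAULTS` simply restates the defaults listed above, and `effective_config` is a hypothetical helper, not part of the collector.

```python
# Defaults as documented in this section.
DEFAULTS = {
    "collection_interval": "60s",
    "container_orchestrator": "eks",
    "add_service_as_attribute": True,
    "prefer_full_pod_name": False,
    "add_full_pod_name_metric_label": False,
}


def effective_config(user_config=None):
    """Overlay user-supplied values on the documented defaults (illustrative only)."""
    user_config = user_config or {}
    unknown = set(user_config) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {**DEFAULTS, **user_config}


# An empty configuration falls back to the defaults entirely.
print(effective_config())
# Overriding one parameter leaves the others at their defaults.
print(effective_config({"container_orchestrator": "ecs"}))
```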
Sample configuration for Container Insights
This is a sample configuration for AWS Container Insights using the awscontainerinsightreceiver and awsemfexporter for an EKS cluster:
# create namespace
apiVersion: v1
kind: Namespace
metadata:
  name: aws-otel-eks
  labels:
    name: aws-otel-eks
---
# create cwagent service account and role binding
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-otel-sa
  namespace: aws-otel-eks
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: aoc-agent-role
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "endpoints"]
    verbs: ["list", "watch"]
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["list", "watch"]
  - apiGroups: [""]
    resources: ["nodes/proxy"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["nodes/stats", "configmaps", "events"]
    verbs: ["create", "get"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["otel-container-insight-clusterleader"]
    verbs: ["get", "update"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: aoc-agent-role-binding
subjects:
  - kind: ServiceAccount
    name: aws-otel-sa
    namespace: aws-otel-eks
roleRef:
  kind: ClusterRole
  name: aoc-agent-role
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-agent-conf
  namespace: aws-otel-eks
  labels:
    app: opentelemetry
    component: otel-agent-conf
data:
  otel-agent-config: |
    extensions:
      health_check:
    receivers:
      awscontainerinsightreceiver:
    processors:
      batch/metrics:
        timeout: 60s
    exporters:
      awsemf:
        namespace: ContainerInsights
        log_group_name: '/aws/containerinsights/{ClusterName}/performance'
        log_stream_name: '{NodeName}'
        resource_to_telemetry_conversion:
          enabled: true
        dimension_rollup_option: NoDimensionRollup
        parse_json_encoded_attr_values: [Sources, kubernetes]
        metric_declarations:
          # node metrics
          - dimensions: [[NodeName, InstanceId, ClusterName]]
            metric_name_selectors:
              - node_cpu_utilization
              - node_memory_utilization
              - node_network_total_bytes
              - node_cpu_reserved_capacity
              - node_memory_reserved_capacity
              - node_number_of_running_pods
              - node_number_of_running_containers
          - dimensions: [[ClusterName]]
            metric_name_selectors:
              - node_cpu_utilization
              - node_memory_utilization
              - node_network_total_bytes
              - node_cpu_reserved_capacity
              - node_memory_reserved_capacity
              - node_number_of_running_pods
              - node_number_of_running_containers
              - node_cpu_usage_total
              - node_cpu_limit
              - node_memory_working_set
              - node_memory_limit
          # pod metrics
          - dimensions: [[PodName, Namespace, ClusterName], [Service, Namespace, ClusterName], [Namespace, ClusterName], [ClusterName]]
            metric_name_selectors:
              - pod_cpu_utilization
              - pod_memory_utilization
              - pod_network_rx_bytes
              - pod_network_tx_bytes
              - pod_cpu_utilization_over_pod_limit
              - pod_memory_utilization_over_pod_limit
          - dimensions: [[PodName, Namespace, ClusterName], [ClusterName]]
            metric_name_selectors:
              - pod_cpu_reserved_capacity
              - pod_memory_reserved_capacity
          - dimensions: [[PodName, Namespace, ClusterName]]
            metric_name_selectors:
              - pod_number_of_container_restarts
          # cluster metrics
          - dimensions: [[ClusterName]]
            metric_name_selectors:
              - cluster_node_count
              - cluster_failed_node_count
          # service metrics
          - dimensions: [[Service, Namespace, ClusterName], [ClusterName]]
            metric_name_selectors:
              - service_number_of_running_pods
          # node fs metrics
          - dimensions: [[NodeName, InstanceId, ClusterName], [ClusterName]]
            metric_name_selectors:
              - node_filesystem_utilization
          # namespace metrics
          - dimensions: [[Namespace, ClusterName], [ClusterName]]
            metric_name_selectors:
              - namespace_number_of_running_pods
      debug:
        verbosity: detailed
    service:
      pipelines:
        metrics:
          receivers: [awscontainerinsightreceiver]
          processors: [batch/metrics]
          exporters: [awsemf]
      extensions: [health_check]
---
# create Daemonset
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: aws-otel-eks-ci
  namespace: aws-otel-eks
spec:
  selector:
    matchLabels:
      name: aws-otel-eks-ci
  template:
    metadata:
      labels:
        name: aws-otel-eks-ci
    spec:
      containers:
        - name: aws-otel-collector
          image: {collector-image-url}
          env:
            #- name: AWS_REGION
            #  value: "us-east-1"
            - name: K8S_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          imagePullPolicy: Always
          command:
            - "/awscollector"
            - "--config=/conf/otel-agent-config.yaml"
          volumeMounts:
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: dockersock
              mountPath: /var/run/docker.sock
              readOnly: true
            - name: varlibdocker
              mountPath: /var/lib/docker
              readOnly: true
            - name: containerdsock
              mountPath: /run/containerd/containerd.sock
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: devdisk
              mountPath: /dev/disk
              readOnly: true
            - name: otel-agent-config-vol
              mountPath: /conf
          resources:
            limits:
              cpu: 200m
              memory: 200Mi
            requests:
              cpu: 200m
              memory: 200Mi
      volumes:
        - configMap:
            name: otel-agent-conf
            items:
              - key: otel-agent-config
                path: otel-agent-config.yaml
          name: otel-agent-config-vol
        - name: rootfs
          hostPath:
            path: /
        - name: dockersock
          hostPath:
            path: /var/run/docker.sock
        - name: varlibdocker
          hostPath:
            path: /var/lib/docker
        - name: containerdsock
          hostPath:
            path: /run/containerd/containerd.sock
        - name: sys
          hostPath:
            path: /sys
        - name: devdisk
          hostPath:
            path: /dev/disk/
      serviceAccountName: aws-otel-sa
To deploy to an EKS cluster:
kubectl apply -f config.yaml
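The metric_declarations entries in the sample configuration pair a list of metric names with one or more dimension sets. As a rough sketch of what that implies, the Python snippet below expands one declaration into (metric, dimension set) combinations. It assumes that each selector matches exactly one metric name and that every listed dimension is present on the metric; `expand_declarations` is a hypothetical helper, and the exporter's real matching logic may differ.

```python
def expand_declarations(declarations):
    """Return one (metric_name, dimension_tuple) pair per metric and dimension set."""
    pairs = []
    for decl in declarations:
        for name in decl["metric_name_selectors"]:
            for dims in decl["dimensions"]:
                pairs.append((name, tuple(dims)))
    return pairs


# The first pod metrics declaration from the sample configuration.
pod_declaration = {
    "dimensions": [
        ["PodName", "Namespace", "ClusterName"],
        ["Service", "Namespace", "ClusterName"],
        ["Namespace", "ClusterName"],
        ["ClusterName"],
    ],
    "metric_name_selectors": [
        "pod_cpu_utilization",
        "pod_memory_utilization",
        "pod_network_rx_bytes",
        "pod_network_tx_bytes",
        "pod_cpu_utilization_over_pod_limit",
        "pod_memory_utilization_over_pod_limit",
    ],
}

pairs = expand_declarations([pod_declaration])
print(len(pairs))  # 6 metric names x 4 dimension sets = 24
```

Under these assumptions, adding a dimension set to a declaration multiplies the number of metric and dimension combinations, which is worth keeping in mind when extending the sample.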
Available Metrics and Resource Attributes
Cluster
Metric | Unit |
cluster_failed_node_count | Count |
cluster_node_count | Count |
Resource Attribute |
ClusterName |
NodeName |
Type |
Version |
Sources |
Cluster Namespace
Metric | Unit |
namespace_number_of_running_pods | Count |
Resource Attribute |
ClusterName |
NodeName |
Namespace |
Type |
Version |
Sources |
kubernetes |
Cluster Service
Metric | Unit |
service_number_of_running_pods | Count |
Resource Attribute |
ClusterName |
NodeName |
Namespace |
Service |
Type |
Version |
Sources |
kubernetes |
Node
Metric | Unit |
node_cpu_limit | Millicore |
node_cpu_request | Millicore |
node_cpu_reserved_capacity | Percent |
node_cpu_usage_system | Millicore |
node_cpu_usage_total | Millicore |
node_cpu_usage_user | Millicore |
node_cpu_utilization | Percent |
node_memory_cache | Bytes |
node_memory_failcnt | Count |
node_memory_hierarchical_pgfault | Count/Second |
node_memory_hierarchical_pgmajfault | Count/Second |
node_memory_limit | Bytes |
node_memory_mapped_file | Bytes |
node_memory_max_usage | Bytes |
node_memory_pgfault | Count/Second |
node_memory_pgmajfault | Count/Second |
node_memory_request | Bytes |
node_memory_reserved_capacity | Percent |
node_memory_rss | Bytes |
node_memory_swap | Bytes |
node_memory_usage | Bytes |
node_memory_utilization | Percent |
node_memory_working_set | Bytes |
node_network_rx_bytes | Bytes/Second |
node_network_rx_dropped | Count/Second |
node_network_rx_errors | Count/Second |
node_network_rx_packets | Count/Second |
node_network_total_bytes | Bytes/Second |
node_network_tx_bytes | Bytes/Second |
node_network_tx_dropped | Count/Second |
node_network_tx_errors | Count/Second |
node_network_tx_packets | Count/Second |
node_number_of_running_containers | Count |
node_number_of_running_pods | Count |
Resource Attribute |
ClusterName |
InstanceType |
NodeName |
Type |
Version |
Sources |
kubernetes |
Node Disk IO
Metric | Unit |
node_diskio_io_serviced_async | Count/Second |
node_diskio_io_serviced_read | Count/Second |
node_diskio_io_serviced_sync | Count/Second |
node_diskio_io_serviced_total | Count/Second |
node_diskio_io_serviced_write | Count/Second |
node_diskio_io_service_bytes_async | Bytes/Second |
node_diskio_io_service_bytes_read | Bytes/Second |
node_diskio_io_service_bytes_sync | Bytes/Second |
node_diskio_io_service_bytes_total | Bytes/Second |
node_diskio_io_service_bytes_write | Bytes/Second |
Resource Attribute |
AutoScalingGroupName |
ClusterName |
InstanceId |
InstanceType |
NodeName |
EBSVolumeId |
device |
Type |
Version |
Sources |
kubernetes |
Node Filesystem
Metric | Unit |
node_filesystem_available | Bytes |
node_filesystem_capacity | Bytes |
node_filesystem_inodes | Count |
node_filesystem_inodes_free | Count |
node_filesystem_usage | Bytes |
node_filesystem_utilization | Percent |
Resource Attribute |
AutoScalingGroupName |
ClusterName |
InstanceId |
InstanceType |
NodeName |
EBSVolumeId |
device |
fstype |
Type |
Version |
Sources |
kubernetes |
Node Network
Metric | Unit |
node_interface_network_rx_bytes | Bytes/Second |
node_interface_network_rx_dropped | Count/Second |
node_interface_network_rx_errors | Count/Second |
node_interface_network_rx_packets | Count/Second |
node_interface_network_total_bytes | Bytes/Second |
node_interface_network_tx_bytes | Bytes/Second |
node_interface_network_tx_dropped | Count/Second |
node_interface_network_tx_errors | Count/Second |
node_interface_network_tx_packets | Count/Second |
Resource Attribute |
AutoScalingGroupName |
ClusterName |
InstanceId |
InstanceType |
NodeName |
Type |
Version |
interface |
Sources |
kubernetes |
Pod
Metric | Unit |
pod_cpu_limit | Millicore |
pod_cpu_request | Millicore |
pod_cpu_reserved_capacity | Percent |
pod_cpu_usage_system | Millicore |
pod_cpu_usage_total | Millicore |
pod_cpu_usage_user | Millicore |
pod_cpu_utilization | Percent |
pod_cpu_utilization_over_pod_limit | Percent |
pod_memory_cache | Bytes |
pod_memory_failcnt | Count |
pod_memory_hierarchical_pgfault | Count/Second |
pod_memory_hierarchical_pgmajfault | Count/Second |
pod_memory_limit | Bytes |
pod_memory_mapped_file | Bytes |
pod_memory_max_usage | Bytes |
pod_memory_pgfault | Count/Second |
pod_memory_pgmajfault | Count/Second |
pod_memory_request | Bytes |
pod_memory_reserved_capacity | Percent |
pod_memory_rss | Bytes |
pod_memory_swap | Bytes |
pod_memory_usage | Bytes |
pod_memory_utilization | Percent |
pod_memory_utilization_over_pod_limit | Percent |
pod_memory_working_set | Bytes |
pod_network_rx_bytes | Bytes/Second |
pod_network_rx_dropped | Count/Second |
pod_network_rx_errors | Count/Second |
pod_network_rx_packets | Count/Second |
pod_network_total_bytes | Bytes/Second |
pod_network_tx_bytes | Bytes/Second |
pod_network_tx_dropped | Count/Second |
pod_network_tx_errors | Count/Second |
pod_network_tx_packets | Count/Second |
pod_number_of_container_restarts | Count |
pod_number_of_containers | Count |
pod_number_of_running_containers | Count |
Resource Attribute |
AutoScalingGroupName |
ClusterName |
InstanceId |
InstanceType |
K8sPodName |
Namespace |
NodeName |
PodId |
Type |
Version |
Sources |
kubernetes |
pod_status |
Pod Network
Metric | Unit |
pod_interface_network_rx_bytes | Bytes/Second |
pod_interface_network_rx_dropped | Count/Second |
pod_interface_network_rx_errors | Count/Second |
pod_interface_network_rx_packets | Count/Second |
pod_interface_network_total_bytes | Bytes/Second |
pod_interface_network_tx_bytes | Bytes/Second |
pod_interface_network_tx_dropped | Count/Second |
pod_interface_network_tx_errors | Count/Second |
pod_interface_network_tx_packets | Count/Second |
Resource Attribute |
AutoScalingGroupName |
ClusterName |
InstanceId |
InstanceType |
K8sPodName |
Namespace |
NodeName |
PodId |
Type |
Version |
interface |
Sources |
kubernetes |
pod_status |
Container
Metric | Unit |
container_cpu_limit | Millicore |
container_cpu_request | Millicore |
container_cpu_usage_system | Millicore |
container_cpu_usage_total | Millicore |
container_cpu_usage_user | Millicore |
container_cpu_utilization | Percent |
container_memory_cache | Bytes |
container_memory_failcnt | Count |
container_memory_hierarchical_pgfault | Count/Second |
container_memory_hierarchical_pgmajfault | Count/Second |
container_memory_limit | Bytes |
container_memory_mapped_file | Bytes |
container_memory_max_usage | Bytes |
container_memory_pgfault | Count/Second |
container_memory_pgmajfault | Count/Second |
container_memory_request | Bytes |
container_memory_rss | Bytes |
container_memory_swap | Bytes |
container_memory_usage | Bytes |
container_memory_utilization | Percent |
container_memory_working_set | Bytes |
number_of_container_restarts | Count |
Resource Attribute |
AutoScalingGroupName |
ClusterName |
ContainerId |
ContainerName |
InstanceId |
InstanceType |
K8sPodName |
Namespace |
NodeName |
PodId |
Type |
Version |
Sources |
kubernetes |
container_status |
container_status_reason |
container_last_termination_reason |
The attribute container_status_reason is present only when container_status is in the "Waiting" or "Terminated" state. The attribute container_last_termination_reason is present only when container_status is in the "Terminated" state.
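The presence rules for the container status attributes can be written down as a small lookup. This is a sketch of the rule as stated above, not of the receiver's implementation; `expected_status_attributes` is a hypothetical helper, and "Running" is assumed here as an example of a state other than "Waiting" and "Terminated".

```python
def expected_status_attributes(container_status):
    """Return the status-related attributes expected for a given container_status."""
    attrs = {"container_status"}
    if container_status in ("Waiting", "Terminated"):
        attrs.add("container_status_reason")
    if container_status == "Terminated":
        attrs.add("container_last_termination_reason")
    return attrs


for status in ("Running", "Waiting", "Terminated"):
    print(status, sorted(expected_status_attributes(status)))
```

A consumer of these metrics should therefore treat both reason attributes as optional rather than expecting them on every container record.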
This is a sample configuration for AWS Container Insights using the awscontainerinsightreceiver and awsemfexporter for an ECS cluster to collect the instance level metrics:
receivers:
  awscontainerinsightreceiver:
    collection_interval: 10s
    container_orchestrator: ecs
processors:
  batch/metrics:
    timeout: 60s
exporters:
  awsemf:
    namespace: ContainerInsightsEC2Instance
    log_group_name: '/aws/ecs/containerinsights/{ClusterName}/performance'
    log_stream_name: 'instanceTelemetry/{ContainerInstanceId}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    parse_json_encoded_attr_values: [Sources]
    metric_declarations:
      # instance metrics
      - dimensions: [[ContainerInstanceId, InstanceId, ClusterName]]
        metric_name_selectors:
          - instance_cpu_utilization
          - instance_memory_utilization
          - instance_network_total_bytes
          - instance_cpu_reserved_capacity
          - instance_memory_reserved_capacity
          - instance_number_of_running_tasks
          - instance_filesystem_utilization
      - dimensions: [[ClusterName]]
        metric_name_selectors:
          - instance_cpu_utilization
          - instance_memory_utilization
          - instance_network_total_bytes
          - instance_cpu_reserved_capacity
          - instance_memory_reserved_capacity
          - instance_number_of_running_tasks
          - instance_cpu_usage_total
          - instance_cpu_limit
          - instance_memory_working_set
          - instance_memory_limit
  debug:
    verbosity: detailed
service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      processors: [batch/metrics]
      exporters: [awsemf, debug]
To deploy to an ECS cluster, check this doc for details.
Available Metrics and Resource Attributes
Instance
Metric | Unit |
instance_cpu_limit | Millicore |
instance_cpu_reserved_capacity | Percent |
instance_cpu_usage_system | Millicore |
instance_cpu_usage_total | Millicore |
instance_cpu_usage_user | Millicore |
instance_cpu_utilization | Percent |
instance_memory_cache | Bytes |
instance_memory_failcnt | Count |
instance_memory_hierarchical_pgfault | Count/Second |
instance_memory_hierarchical_pgmajfault | Count/Second |
instance_memory_limit | Bytes |
instance_memory_mapped_file | Bytes |
instance_memory_max_usage | Bytes |
instance_memory_pgfault | Count/Second |
instance_memory_pgmajfault | Count/Second |
instance_memory_reserved_capacity | Percent |
instance_memory_rss | Bytes |
instance_memory_swap | Bytes |
instance_memory_usage | Bytes |
instance_memory_utilization | Percent |
instance_memory_working_set | Bytes |
instance_network_rx_bytes | Bytes/Second |
instance_network_rx_dropped | Count/Second |
instance_network_rx_errors | Count/Second |
instance_network_rx_packets | Count/Second |
instance_network_total_bytes | Bytes/Second |
instance_network_tx_bytes | Bytes/Second |
instance_network_tx_dropped | Count/Second |
instance_network_tx_errors | Count/Second |
instance_network_tx_packets | Count/Second |
instance_number_of_running_tasks | Count |
Resource Attribute |
ClusterName |
InstanceType |
AutoScalingGroupName |
Type |
Version |
Sources |
ContainerInstanceId |
InstanceId |
Instance Disk IO
Metric | Unit |
instance_diskio_io_serviced_async | Count/Second |
instance_diskio_io_serviced_read | Count/Second |
instance_diskio_io_serviced_sync | Count/Second |
instance_diskio_io_serviced_total | Count/Second |
instance_diskio_io_serviced_write | Count/Second |
instance_diskio_io_service_bytes_async | Bytes/Second |
instance_diskio_io_service_bytes_read | Bytes/Second |
instance_diskio_io_service_bytes_sync | Bytes/Second |
instance_diskio_io_service_bytes_total | Bytes/Second |
instance_diskio_io_service_bytes_write | Bytes/Second |
Resource Attribute |
ClusterName |
InstanceType |
AutoScalingGroupName |
Type |
Version |
Sources |
ContainerInstanceId |
InstanceId |
EBSVolumeId |
Instance Filesystem
Metric | Unit |
instance_filesystem_available | Bytes |
instance_filesystem_capacity | Bytes |
instance_filesystem_inodes | Count |
instance_filesystem_inodes_free | Count |
instance_filesystem_usage | Bytes |
instance_filesystem_utilization | Percent |
Resource Attribute |
ClusterName |
InstanceType |
AutoScalingGroupName |
Type |
Version |
Sources |
ContainerInstanceId |
InstanceId |
EBSVolumeId |
Instance Network
Metric | Unit |
instance_interface_network_rx_bytes | Bytes/Second |
instance_interface_network_rx_dropped | Count/Second |
instance_interface_network_rx_errors | Count/Second |
instance_interface_network_rx_packets | Count/Second |
instance_interface_network_total_bytes | Bytes/Second |
instance_interface_network_tx_bytes | Bytes/Second |
instance_interface_network_tx_dropped | Count/Second |
instance_interface_network_tx_errors | Count/Second |
instance_interface_network_tx_packets | Count/Second |
Resource Attribute |
ClusterName |
InstanceType |
AutoScalingGroupName |
Type |
Version |
Sources |
ContainerInstanceId |
InstanceId |
EBSVolumeId |
Warnings
Root permissions
When using this component, the collector process needs root permissions to read the contents of the files in the following locations:
/
/var/run/docker.sock
/var/lib/docker
/run/containerd/containerd.sock
/sys
/dev/disk
This requirement comes from the fact that this component is based on cAdvisor.
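A quick way to check this requirement before starting the collector is to test whether the current process can read each location. The sketch below does that with `os.access`; `REQUIRED_PATHS` copies the list above, `unreadable_paths` is a hypothetical helper, and when running inside a container the paths would need to be adjusted to the actual mount points (for example, the sample DaemonSet mounts the host root at /rootfs).

```python
import os

# Locations listed in this section; adjust to the mount points used in your deployment.
REQUIRED_PATHS = [
    "/",
    "/var/run/docker.sock",
    "/var/lib/docker",
    "/run/containerd/containerd.sock",
    "/sys",
    "/dev/disk",
]


def unreadable_paths(paths):
    """Return the paths that are missing or not readable by the current process."""
    return [path for path in paths if not os.access(path, os.R_OK)]


problems = unreadable_paths(REQUIRED_PATHS)
if problems:
    print("Missing or not readable:", ", ".join(problems))
else:
    print("All required paths are readable.")
```

A path reported here may simply not exist on the host (for example, a node that runs containerd has no Docker socket), so the result is a hint rather than a definitive permission check.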