Node Resources

The nodeResources analyzer determines whether the nodes in the cluster have sufficient resources to run an application. In preflight checks, it helps avoid deploying a version that will not work. In support bundles, it helps diagnose cases where the available resources of a shared cluster are being reserved for cluster workloads, or where an autoscaling group is changing the resources available.

This analyzer's outcome when clause compares the condition specified with the resources present on each or all nodes. It's possible to create an analyzer that reports on either aggregate values across all nodes in the cluster or individual values of any node in the cluster.

This analyzer also supports a filters property. If provided, only the nodes that match all of the specified filters are analyzed.

Available Filters

Resource filters can be integers or strings, which are parsed as Kubernetes resource quantities. The fields come from each node's capacity and allocatable values. Note that allocatable does not mean "free" or "available": it is the portion of the node's capacity that is not reserved for system and Kubernetes components, regardless of how much running pods already consume.

Filter Name	Description
`cpuArchitecture`	The architecture of the CPU available to the node. Expressed as a string, e.g. `amd64`
`cpuCapacity`	The amount of CPU available to the node.
`cpuAllocatable`	The amount of allocatable CPU after the Kubernetes components have been started
`memoryCapacity`	The amount of memory available to the node
`memoryAllocatable`	The amount of allocatable memory after the Kubernetes components have been started
`podCapacity`	The number of pods that can be started on the node
`podAllocatable`	The number of pods that can be started on the node after Kubernetes is running
`ephemeralStorageCapacity`	The amount of ephemeral storage on the node
`ephemeralStorageAllocatable`	The amount of ephemeral storage on the node after Kubernetes is running
`matchLabel`	Specific selector label or labels the node must contain in its metadata
`matchExpressions`	A list of selector label expressions that the node needs to match in its metadata
`resourceName`	The name of the resource to filter on. This is useful for filtering on custom resources
`resourceCapacity`	The amount of the resource available to the node.
`resourceAllocatable`	The amount of allocatable resource after the Kubernetes components have been started

CPU and memory values are expressed as Kubernetes resource quantities (parsed as Go Quantities), for example 16Gi, 8Mi, 1.5m, or 5.
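
As an illustration of the quantity notation, the following sketch filters on allocatable CPU and memory. The check name, thresholds, and messages are made up for this example. It assumes that 2000m (millicores) is read as 2 cores, as in standard Kubernetes quantity notation.

```yaml
- nodeResources:
    # Hypothetical check: names, thresholds, and messages are illustrative only
    checkName: Must have 1 node with 2 allocatable cores and 8Gi allocatable memory
    filters:
      cpuAllocatable: "2000m"   # 2000 millicores, assumed equivalent to "2"
      memoryAllocatable: 8Gi    # binary suffix: 8 * 1024^3 bytes
    outcomes:
      - fail:
          when: "count() < 1"
          message: This application requires at least 1 node with 2 allocatable cores and 8Gi allocatable memory
      - pass:
          message: This cluster has a node with enough allocatable CPU and memory
```

Quoting numeric values such as "2000m" keeps YAML from guessing at their type; the analyzer then parses the string as a quantity.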

Outcomes

The when condition in an outcome of this analyzer is evaluated against the nodes that match the filters, if any filters are defined. If no filters are defined, it is evaluated against all nodes in the cluster.

The conditional in the when value supports the following:

Aggregate	Description
`count()`	The number of nodes that match the filters (default if not specified)
`sum(filterName)`	Sum of filterName across all nodes that match the filters specified
`min(filterName)`	Minimum of filterName across all nodes that match the filters specified
`max(filterName)`	Maximum of filterName across all nodes that match the filters specified
`nodeCondition(conditionType)`	Checks a node condition such as Ready, PIDPressure, etc.
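
To show how min and max differ in practice, here is a minimal sketch of two checks. The thresholds, check names, and messages are hypothetical; only the aggregate functions and filter names come from the tables above.

```yaml
- nodeResources:
    checkName: At least one node must have 8 or more cores
    outcomes:
      - fail:
          # max() looks at the largest node, so this fails only if no node reaches 8 cores
          when: "max(cpuCapacity) < 8"
          message: No node in the cluster has 8 or more cores
      - pass:
          message: At least one node has 8 or more cores

- nodeResources:
    checkName: Every node must allow at least 100 pods
    outcomes:
      - fail:
          # min() looks at the smallest node, so one undersized node is enough to fail
          when: "min(podCapacity) < 100"
          message: Every node must be able to run at least 100 pods
      - pass:
          message: Every node can run at least 100 pods
```

Use min when a requirement applies to every node, max when a single large node is enough, and sum when only the cluster-wide total matters.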

Example Analyzer Definition

apiVersion: troubleshoot.sh/v1beta2
kind: Preflight
metadata:
  name: sample
spec:
  analyzers:
    - nodeResources:
        checkName: Must have at least 3 nodes in the cluster
        outcomes:
          - fail:
              when: "count() < 3"
              message: This application requires at least 3 nodes
          - warn:
              when: "count() < 5"
              message: This application recommends at least 5 nodes.
          - pass:
              message: This cluster has enough nodes.

    - nodeResources:
        checkName: Every node in the cluster must have at least 16Gi of memory
        outcomes:
          - fail:
              when: "min(memoryCapacity) < 16Gi"
              message: All nodes must have at least 16 GB of memory
          - pass:
              message: All nodes have at least 16 GB of memory

    - nodeResources:
        checkName: Total CPU Cores in the cluster is 20 or greater
        outcomes:
          - fail:
              when: "sum(cpuCapacity) < 20"
              message: The cluster must contain at least 20 cores
          - pass:
              message: There are at least 20 cores in the cluster

    - nodeResources:
        checkName: Nodes that have at least 6 cores must also have at least 16 GB of memory
        filters:
          cpuCapacity: "6"
        outcomes:
          - fail:
              when: "min(memoryCapacity) < 16Gi"
              message: All nodes that have 6 or more cores must have at least 16 GB of memory
          - pass:
              message: All nodes with 6 or more cores have at least 16 GB of memory

    - nodeResources:
        checkName: Must have 3 nodes with at least 6 cores
        filters:
          cpuCapacity: "6"
        outcomes:
          - fail:
              when: "count() < 3"
              message: This application requires at least 3 nodes with 6 cores each
          - pass:
              message: This cluster has enough nodes with enough cores

    - nodeResources:
        checkName: Must have 1 node with 16 GB (available) memory and 5 cores (on a single node) with amd64 architecture
        filters:
          memoryAllocatable: 16Gi
          cpuArchitecture: amd64
          cpuCapacity: "5"
        outcomes:
          - fail:
              when: "count() < 1"
              message: This application requires at least 1 node with 16GB available memory and 5 cpu cores with amd64 architecture
          - pass:
              message: This cluster has a node with enough memory and cpu cores

    - nodeResources:
        checkName: Node status check
        outcomes:
          - fail:
              when: "nodeCondition(Ready) == False"
              message: "Not all nodes are online."
          - fail:
              when: "nodeCondition(Ready) == Unknown"
              message: "Not all nodes are online."
          - pass:
              message: "All nodes are online."
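
The nodeCondition function can be pointed at other standard Kubernetes node conditions in the same way. The following is a sketch, shown as a standalone item that would sit under spec.analyzers. It assumes, by analogy with the Ready example above, that the comparison holds when any node reports the given status; the check name and messages are hypothetical.

```yaml
- nodeResources:
    checkName: Node pressure check
    outcomes:
      - warn:
          # MemoryPressure and PIDPressure are standard Kubernetes node condition types
          when: "nodeCondition(MemoryPressure) == True"
          message: "At least one node reports memory pressure."
      - warn:
          when: "nodeCondition(PIDPressure) == True"
          message: "At least one node reports PID pressure."
      - pass:
          message: "No node reports memory or PID pressure."
```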

Filter by labels

Filtering by labels was introduced in KOTS 1.19.0 and Troubleshoot 0.9.42.

Labels are intended to be used to specify identifying attributes of objects that are meaningful and relevant to users, but do not directly imply semantics to the core system. Labels can be used to organize and to select subsets of objects. Troubleshoot allows users to analyze nodes that match one or more labels, for example to require a certain number of nodes with certain labels as a preflight check. Multiple filters may be specified, and a node matches only if it satisfies all of them.

    - nodeResources:
        checkName: Must have Mongo running
        filters:
          memoryAllocatable: 16Gi
          cpuCapacity: "5"
          selector:
            matchLabel:
              kubernetes.io/role: database-primary-replica
        outcomes:
          - fail:
              when: "count() < 1"
              message: Must have 1 node with 16 GB (available) memory and 5 cores (on a single node) running Mongo Operator.
          - pass:
              message: This cluster has a node with enough memory and cpu capacity running Mongo Operator.

- nodeResources:
    checkName: Must have at least 1 node with 3 cores that is not a storage, queue or control plane node
    filters:
      cpuCapacity: "3"
      selector:
        matchExpressions:
        # An AND operation will be applied to this list of expressions
        # Nodes that are not storage or queue nodes
        - key: node.kubernetes.io/role
          operator: NotIn # Other operations are In, Exists, DoesNotExist
          values:   # Matches any node whose "node.kubernetes.io/role" label is neither "storage" nor "queue"
          - storage
          - queue
        # Nodes that are not control-plane nodes
        - key: node-role.kubernetes.io/control-plane
          operator: NotIn
          values:
          - "true"
    outcomes:
      - pass:
          when: "count() >= 1"
          message: "Found {{ .NodeCount }} nodes with at least 3 CPU cores"
      - fail:
          message: "{{ .NodeCount }} nodes do not meet the minimum requirements"
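
The other operators mentioned in the comment above follow the same shape. As a minimal sketch, the following check combines In and Exists. It uses the well-known kubernetes.io/os and topology.kubernetes.io/zone node labels; the check name, threshold, and messages are hypothetical.

```yaml
- nodeResources:
    checkName: Must have at least 2 Linux nodes that carry a zone label
    filters:
      selector:
        matchExpressions:
        # Both expressions must match (AND)
        - key: kubernetes.io/os
          operator: In
          values:
          - linux
        # Exists only checks that the key is present, so no values are listed
        - key: topology.kubernetes.io/zone
          operator: Exists
    outcomes:
      - pass:
          when: "count() >= 2"
          message: "Found {{ .NodeCount }} Linux nodes with a zone label"
      - fail:
          message: "This application requires at least 2 Linux nodes with a zone label"
```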

Filter by GPU resources

The resourceName filter is used to filter on custom resources. For example, to filter on GPU resources, set resourceName to nvidia.com/gpu. The resourceCapacity and resourceAllocatable filters then filter on the capacity and allocatable amounts of that custom resource.

- nodeResources:
    checkName: Must have at least 1 node with 1 GPU
    filters:
      resourceName: nvidia.com/gpu
      resourceCapacity: "1"
    outcomes:
      - pass:
          when: "count() >= 1"
          message: "Found {{ .NodeCount }} nodes with at least 1 GPU"
      - fail:
          message: "{{ .NodeCount }} nodes do not meet the minimum requirements"

- nodeResources:
    checkName: Must have at least 4 Intel i915 GPUs in the cluster
    filters:
      resourceName: gpu.intel.com/i915
    outcomes:
      - pass:
          when: "sum(resourceAllocatable) >= 4"
          message: "The cluster has at least 4 allocatable Intel i915 GPUs"
      - fail:
          message: "This application requires at least 4 Intel i915 GPUs"

- nodeResources:
    checkName: Must have at least 3 GPU-enabled nodes in the cluster
    filters:
      resourceName: nvidia.com/gpu
    outcomes:
      - pass:
          when: "count() >= 3"
          message: "Found {{ .NodeCount }} GPU-enabled nodes"
      - fail:
          message: "This application requires at least 3 GPU-enabled nodes"
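
The resourceAllocatable filter can be used in place of resourceCapacity when the requirement is about GPUs that pods can actually request. This is a sketch under that assumption; the threshold, check name, and messages are hypothetical.

```yaml
- nodeResources:
    checkName: Must have at least 1 node with 2 allocatable GPUs
    filters:
      resourceName: nvidia.com/gpu
      # Applies to the custom resource named above, not to CPU or memory
      resourceAllocatable: "2"
    outcomes:
      - pass:
          when: "count() >= 1"
          message: "Found {{ .NodeCount }} nodes with at least 2 allocatable GPUs"
      - fail:
          message: "This application requires at least 1 node with 2 allocatable GPUs"
```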

Message Templating

To make the outcome message more informative, you can include certain values gathered by the NodeResources collector as templates. Templates are enclosed in double curly braces and prefixed with a dot, for example {{ .NodeCount }}. The following templates are available:

Template	Description
`.NodeCount`	The number of nodes that match the filter
`.CPUArchitecture`	The architecture of the CPU available to the node
`.CPUCapacity`	The amount of CPU available to the node
`.MemoryCapacity`	The amount of memory available to the node
`.PodCapacity`	The number of pods that can be started on the node
`.EphemeralStorageCapacity`	The amount of ephemeral storage on the node
`.AllocatableMemory`	The amount of allocatable memory after the Kubernetes components have been started
`.AllocatableCPU`	The amount of allocatable CPU after the Kubernetes components have been started
`.AllocatablePods`	The number of pods that can be started on the node after Kubernetes is running
`.AllocatableEphemeralStorage`	The amount of ephemeral storage on the node after Kubernetes is running

Example Analyzer Message Templating Definition

    - nodeResources:
        filters:
          cpuArchitecture: arm64
        checkName: Must have at least 3 nodes in the cluster
        outcomes:
          - fail:
              when: "count() < 3"
              message: "This application requires at least 3 {{ .CPUArchitecture }} nodes, but only {{ .NodeCount }} nodes match that filter"
          - warn:
              when: "count() < 5"
              message: This application recommends at least 5 nodes.
          - pass:
              message: This cluster has enough nodes.

    - nodeResources:
        filters:
          cpuArchitecture: arm64
          cpuCapacity: "2"
        checkName: Must have at least 3 nodes in the cluster
        outcomes:
          - fail:
              when: "count() < 3"
              message: "This application requires at least 3 {{ .CPUArchitecture }} nodes with {{ .CPUCapacity }} CPU cores, but only {{ .NodeCount }} nodes match that filter"
          - warn:
              when: "count() < 5"
              message: This application recommends at least 5 nodes.
          - pass:
              message: This cluster has enough nodes.