OpenShift Monitoring Guide

This guide provides comprehensive documentation for implementing and maintaining monitoring systems in OpenShift environments. It covers essential monitoring configurations, integration with enterprise systems, and best practices for maintaining cluster observability.

Core Health Monitoring

Effective cluster health monitoring requires regular assessment of key operational metrics and system states.

Cluster Operator Status

Monitor cluster operators using these essential commands:

# View comprehensive operator health status
oc get clusteroperators -o custom-columns=NAME:.metadata.name,VERSION:.status.versions[*].version,AVAILABLE:.status.conditions[?(@.type=="Available")].status,PROGRESSING:.status.conditions[?(@.type=="Progressing")].status,DEGRADED:.status.conditions[?(@.type=="Degraded")].status

# Identify degraded operators (clusteroperators are cluster-scoped, so no namespace flag is needed)
oc get clusteroperators -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Degraded" and .status=="True")) | .metadata.name'

Node Health Assessment

Monitor node health and resource utilization:

# Assess node status
oc get nodes -o custom-columns=NAME:.metadata.name,STATUS:.status.conditions[?(@.type=="Ready")].status,VERSION:.status.nodeInfo.kubeletVersion

# Review resource utilization
oc adm top nodes
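The Ready-status filtering pattern used for operators works for nodes as well, and can be exercised offline against a sample node list before pointing it at a live cluster. A minimal sketch with hypothetical node names (assumes jq is installed):

```shell
# Sample of `oc get nodes -o json` output (hypothetical node names)
cat > /tmp/nodes-sample.json <<'EOF'
{"items":[
  {"metadata":{"name":"worker-0"},
   "status":{"conditions":[{"type":"Ready","status":"True"}]}},
  {"metadata":{"name":"worker-1"},
   "status":{"conditions":[{"type":"Ready","status":"False"}]}}
]}
EOF

# Print only nodes whose Ready condition is not "True"
jq -r '.items[]
       | select(.status.conditions[] | select(.type=="Ready" and .status!="True"))
       | .metadata.name' /tmp/nodes-sample.json
# → worker-1
```

Against a live cluster, pipe `oc get nodes -o json` into the same jq filter to list NotReady nodes.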

Prometheus and Grafana Integration

OpenShift's monitoring stack leverages Prometheus and Grafana for metrics collection and visualization.

Prometheus Configuration

Configure Prometheus retention and storage:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: 15d
      volumeClaimTemplate:
        spec:
          storageClassName: fast
          resources:
            requests:
              storage: 100Gi
    alertmanagerMain:
      nodeSelector:
        node-role.kubernetes.io/infra: ""

Grafana Dashboard Management

Manage Grafana dashboards and access. Note that the bundled Grafana instance was read-only and was removed in OpenShift 4.11; custom dashboards require a separately deployed Grafana.

# Obtain the Grafana route (clusters that still ship the bundled instance)
oc get route grafana -n openshift-monitoring

# Stage a custom dashboard as a ConfigMap for a separately deployed Grafana
oc create configmap custom-dashboard \
  --from-file=my-dashboard.json \
  -n openshift-monitoring

Alert Management

Implement effective alert management through proper configuration and routing.

Alert Routing Configuration

Configure alert routing based on severity and team requirements. AlertmanagerConfig objects apply to user-workload monitoring, and the Secrets they reference must exist in the same namespace:

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alert-routing
  namespace: openshift-monitoring
spec:
  route:
    receiver: team-notifications
    routes:
    - matchers:
      - name: severity
        value: critical
      receiver: pager-duty
    - matchers:
      - name: severity
        value: warning
      receiver: slack
  receivers:
  - name: pager-duty
    pagerdutyConfigs:
    - serviceKey:
        name: pagerduty-key
        key: service-key
  - name: slack
    slackConfigs:
    - apiURL:
        name: slack-webhook
        key: url
      channel: '#alerts'
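The receivers above reference two Secrets, pagerduty-key and slack-webhook, which must exist in the same namespace as the AlertmanagerConfig. A sketch of the expected shape, with placeholder values to substitute before applying:

```shell
# Generate the Secret manifests referenced by the AlertmanagerConfig above
# (placeholder credentials; replace before applying to a cluster)
cat > alert-secrets.yaml <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: pagerduty-key
  namespace: openshift-monitoring
stringData:
  service-key: REPLACE_WITH_PAGERDUTY_INTEGRATION_KEY
---
apiVersion: v1
kind: Secret
metadata:
  name: slack-webhook
  namespace: openshift-monitoring
stringData:
  url: https://hooks.slack.com/services/REPLACE/ME
EOF

# Sanity check: both documents declare kind: Secret
grep -c '^kind: Secret' alert-secrets.yaml
# → 2
```

Apply with `oc apply -f alert-secrets.yaml` once real credentials are in place.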

Air-Gapped Environment Monitoring

Air-gapped environments require specific monitoring configurations to ensure functionality without external dependencies.

Metrics Storage Configuration

Implement appropriate local storage for metrics retention:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
  namespace: openshift-monitoring
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: local-storage

Internal Alert Management

Configure alert management systems for internal routing:

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: internal-alerts
spec:
  route:
    receiver: internal-webhook
  receivers:
  - name: internal-webhook
    webhookConfigs:
    - url: "http://internal-alert-manager.example.com/webhook"

Capacity Planning

Effective capacity planning requires systematic collection and analysis of resource utilization trends.

Implement systematic resource monitoring:

# Collect current CPU utilization, highest consumers first
# (the legacy Heapster flags were removed along with Heapster itself)
oc adm top nodes --sort-by=cpu

# Export historical metrics (oc create token replaces the removed `oc sa get-token`)
curl -k -H "Authorization: Bearer $(oc create token prometheus-k8s -n openshift-monitoring)" \
  https://$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath="{.spec.host}")/api/v1/query_range \
  -d 'query=sum(container_memory_usage_bytes)' \
  -d 'start=2024-01-01T00:00:00Z' \
  -d 'end=2024-01-31T23:59:59Z' \
  -d 'step=1h'
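The query_range call returns a JSON matrix of timestamped samples; jq can reduce it to a single figure for trend reports. A sketch against a sample response (the two hourly values are illustrative):

```shell
# Sample /api/v1/query_range response (two hourly samples, values in bytes)
cat > /tmp/range-sample.json <<'EOF'
{"status":"success","data":{"resultType":"matrix","result":[
  {"metric":{},"values":[[1704067200,"1073741824"],[1704070800,"2147483648"]]}
]}}
EOF

# Peak memory usage across the window, converted from bytes to GiB
jq '.data.result[0].values | map(.[1] | tonumber) | max / 1073741824' /tmp/range-sample.json
# → 2
```

Piping the real curl output through the same filter gives the peak for the exported window.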

Enterprise System Integration

Integration with enterprise monitoring systems requires careful configuration of data export and federation capabilities.

Metrics Federation

Configure Prometheus federation for external system integration. The selector below assumes a Service labeled app: federate that fronts the Prometheus /federate endpoint:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: federate
  namespace: openshift-monitoring
spec:
  endpoints:
  - interval: 30s
    port: web
    path: /federate
    params:
      'match[]':
      - '{job="kubernetes-nodes"}'
  selector:
    matchLabels:
      app: federate
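When scraping /federate manually, for example with curl while debugging, the match[] selector must be URL-encoded as a query parameter. jq's @uri filter can build the encoded string offline (a sketch, assumes jq is installed):

```shell
# URL-encode the federation selector for use as a /federate query parameter
jq -rn '"match[]=" + ("{job=\"kubernetes-nodes\"}" | @uri)'
# → match[]=%7Bjob%3D%22kubernetes-nodes%22%7D
```

Alternatively, `curl -G --data-urlencode 'match[]={job="kubernetes-nodes"}'` performs the encoding for you.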

Remote Write Configuration

Implement remote write functionality for external metric storage:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
      - url: "https://prometheus.example.com/api/v1/write"
        writeRelabelConfigs:
        - sourceLabels: [__name__]
          regex: 'container_.*'
          action: keep
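The keep action forwards only series whose name matches container_.* (Prometheus relabel regexes are fully anchored). The filter can be sanity-checked offline against candidate metric names before reconfiguring the cluster (names here are illustrative):

```shell
# Names matching the anchored relabel regex are kept; all others are
# dropped before remote write. node_load1 should not survive the filter.
printf '%s\n' container_memory_usage_bytes container_cpu_usage_seconds_total node_load1 \
  | grep -cE '^container_'
# → 2
```

Tightening the regex here is the main lever for controlling remote-write bandwidth and external storage cost.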

Operational Best Practices

Successful monitoring implementation requires adherence to established operational practices. Organizations should implement clear procedures for alert management, including defined escalation paths and response protocols. Regular review and adjustment of monitoring thresholds ensures optimal system observability while preventing alert fatigue.

Documentation of monitoring configurations and architectural decisions supports long-term maintenance and knowledge transfer. As cluster scale increases, monitoring system capacity should be reviewed and adjusted accordingly.

Establish clear ownership and maintenance responsibilities for monitoring systems, ensuring consistent oversight and timely updates to monitoring configurations as cluster requirements evolve.