OpenShift Monitoring Guide
This guide covers monitoring in OpenShift environments: checking cluster operator and node health, configuring the built-in Prometheus and Alertmanager stack, routing alerts, operating in air-gapped environments, planning capacity, and integrating with enterprise monitoring systems.
Core Health Monitoring
Cluster health checks cover two layers: the cluster operators that run platform components, and the nodes that run workloads. Check both regularly, not only when something has already failed.
Cluster Operator Status
Check cluster operator status with the following commands:
# View operator health (the column spec is quoted so the shell leaves [?(...)] alone)
oc get clusteroperators -o custom-columns='NAME:.metadata.name,VERSION:.status.versions[*].version,AVAILABLE:.status.conditions[?(@.type=="Available")].status,PROGRESSING:.status.conditions[?(@.type=="Progressing")].status,DEGRADED:.status.conditions[?(@.type=="Degraded")].status'
# Identify degraded operators (cluster operators are cluster-scoped, so no namespace flag is needed)
oc get clusteroperators -o json | jq -r '.items[] | select(any(.status.conditions[]?; .type == "Degraded" and .status == "True")) | .metadata.name'
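Both checks can be folded into one function that does not need `jq`. The following is a minimal sketch, assuming a Bash-compatible shell and an `oc` session allowed to list cluster operators; the function name `unhealthy_operators` is hypothetical. It prints operators whose Available condition is not `True` or whose Degraded condition is `True`:

```shell
# unhealthy_operators (hypothetical helper): print cluster operators whose
# Available condition is not "True" or whose Degraded condition is "True".
# Output columns: NAME AVAILABLE DEGRADED
unhealthy_operators() {
  # jsonpath prints "name,available,degraded" per operator; a missing
  # condition yields an empty field instead of shifting the columns.
  oc get clusteroperators \
    -o jsonpath='{range .items[*]}{.metadata.name}{","}{.status.conditions[?(@.type=="Available")].status}{","}{.status.conditions[?(@.type=="Degraded")].status}{"\n"}{end}' |
    awk -F',' '$2 != "True" || $3 == "True" { print $1, $2, $3 }'
}
```

Running `unhealthy_operators` prints nothing when every operator is available and not degraded.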
Node Health Assessment
Monitor node health and resource utilization:
# Assess node status (Ready condition and kubelet version)
oc get nodes -o custom-columns='NAME:.metadata.name,STATUS:.status.conditions[?(@.type=="Ready")].status,VERSION:.status.nodeInfo.kubeletVersion'
# Review current CPU and memory usage per node
oc adm top nodes
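To turn the `oc adm top nodes` output into a quick hot-spot report, a small filter can flag nodes above a utilization threshold. This sketch assumes the default column layout (`NAME`, `CPU(cores)`, `CPU%`, `MEMORY(bytes)`, `MEMORY%`); the function name `busy_nodes` and the 80% default are illustrative choices, not OpenShift defaults:

```shell
# busy_nodes (hypothetical helper): list nodes whose CPU% or MEMORY% from
# `oc adm top nodes` is at or above a threshold.
# $1: threshold in percent (default 80, an illustrative value)
busy_nodes() {
  local threshold="${1:-80}"
  oc adm top nodes | awk -v t="$threshold" '
    NR == 1 { next }                 # skip the header row
    {
      cpu = $3; mem = $5             # e.g. "88%" and "40%"
      sub(/%$/, "", cpu)
      sub(/%$/, "", mem)
      if (cpu + 0 >= t + 0 || mem + 0 >= t + 0) print $1, $3, $5
    }'
}
```

For example, `busy_nodes 90` lists only nodes at or above 90% CPU or memory. A node whose metrics are not yet available may report `<unknown>`, which the filter treats as 0 and skips.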
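Prometheus and Grafana Integration
OpenShift's platform monitoring stack uses Prometheus to collect and store metrics and Alertmanager to route alerts. Grafana was bundled for dashboard visualization through OpenShift 4.10 and removed in 4.11.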
Prometheus Configuration
Configure Prometheus retention and storage, and schedule Alertmanager on infrastructure nodes, through the cluster-monitoring-config ConfigMap (fast is a placeholder for a storage class that exists in the cluster):
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: 15d
      volumeClaimTemplate:
        spec:
          storageClassName: fast
          resources:
            requests:
              storage: 100Gi
    alertmanagerMain:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
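The 100Gi request above should match the expected ingestion rate and retention. The upstream Prometheus storage documentation gives a rule of thumb: disk space ≈ retention time in seconds × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the per-replica ingestion rate can be read from `rate(prometheus_tsdb_head_samples_appended_total[1h])`, which returns one series per Prometheus pod:

```shell
# prometheus_disk_gib (hypothetical helper): rough per-replica disk estimate
# using the upstream Prometheus rule of thumb
#   bytes = retention_seconds * samples_per_second * bytes_per_sample
# $1: retention in days
# $2: ingestion rate in samples per second
# $3: bytes per sample (default 2, the upper end of the usual 1-2 range)
# Prints the estimate in GiB, rounded up to a whole number.
prometheus_disk_gib() {
  awk -v days="$1" -v rate="$2" -v bps="${3:-2}" 'BEGIN {
    bytes = days * 86400 * rate * bps
    gib = bytes / 1073741824        # 1 GiB = 1024^3 bytes
    est = (gib == int(gib)) ? gib : int(gib) + 1
    printf "%d\n", est
  }'
}
```

For instance, `prometheus_disk_gib 15 50000` estimates 121 GiB per replica for 15 days at 50,000 samples per second. Leave additional headroom for the write-ahead log and compaction.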
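Grafana Dashboard Management
On OpenShift 4.10 and earlier, the bundled read-only Grafana instance is exposed through a route. From 4.11 onward Grafana is no longer shipped, and dashboards are shown in the web console under Observe > Dashboards. Adding a custom dashboard there relies on the console picking up ConfigMaps labeled console.openshift.io/dashboard=true in openshift-config-managed; the file must contain Grafana dashboard JSON, and support for this mechanism should be verified for your release:
# OpenShift 4.10 and earlier: obtain the Grafana route
oc get route grafana -n openshift-monitoring
# Add a custom dashboard to the web console
oc create configmap custom-dashboard \
  --from-file=my-dashboard.json \
  -n openshift-config-managed
oc label configmap custom-dashboard \
  console.openshift.io/dashboard=true \
  -n openshift-config-managed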
Alert Management
Alertmanager deduplicates, groups, and routes alerts to receivers; effective alert management depends on sending each severity to the right team and channel.
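Alert Routing Configuration
Route alerts by severity. In OpenShift, AlertmanagerConfig resources are read only from user-defined projects (with user-defined alert routing enabled), and their routes match only alerts from that project; platform alerts are routed through the alertmanager-main Secret in openshift-monitoring instead. The example below uses a hypothetical project, team-a, defines every receiver it references, and assumes the referenced Secrets exist in the same project:
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alert-routing
  namespace: team-a
spec:
  route:
    receiver: team-notifications
    routes:
      - matchers:
          - name: severity
            value: critical
        receiver: pager-duty
      - matchers:
          - name: severity
            value: warning
        receiver: slack
  receivers:
    - name: team-notifications  # default receiver; add the team's integration here
    - name: pager-duty
      pagerdutyConfigs:
        - serviceKey:
            name: pagerduty-key
            key: service-key
    - name: slack
      slackConfigs:
        - apiURL:
            name: slack-webhook
            key: url
          channel: '#alerts'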
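The AlertmanagerConfig above only references Secrets by name and key; they must exist in the same namespace before the PagerDuty and Slack receivers can work. A minimal sketch, assuming the Secret and key names used above (`pagerduty-key`/`service-key` and `slack-webhook`/`url`); the function name is hypothetical:

```shell
# create_alert_secrets (hypothetical helper): create the Secrets referenced by
# the AlertmanagerConfig receivers. Names and keys must match the selectors:
#   pagerduty-key / service-key  and  slack-webhook / url
# $1: namespace of the AlertmanagerConfig
# $2: PagerDuty service key
# $3: Slack incoming-webhook URL
create_alert_secrets() {
  oc -n "$1" create secret generic pagerduty-key \
    --from-literal=service-key="$2" &&
  oc -n "$1" create secret generic slack-webhook \
    --from-literal=url="$3"
}
```

Passing credentials as arguments leaves them in shell history; for production use, `--from-file` with a protected file avoids that.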
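Air-Gapped Environment Monitoring
The platform monitoring stack runs entirely inside the cluster, so in air-gapped environments the main adjustments are locally provisioned persistent storage for metrics and alert receivers that are reachable from inside the network.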
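Metrics Storage Configuration
Provide local persistent storage for Prometheus through the volumeClaimTemplate in cluster-monitoring-config. The Cluster Monitoring Operator creates one claim per Prometheus replica from this template; a PersistentVolumeClaim created separately in openshift-monitoring is not mounted by Prometheus. The local-storage class name is an example and must match a storage class that exists in the cluster, such as one provided by the Local Storage Operator. If cluster-monitoring-config already exists, merge these settings into its config.yaml rather than replacing it:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          storageClassName: local-storage
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 100Gi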
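After applying the storage configuration, it is worth confirming that the monitoring claims actually bound, since a missing local volume leaves a claim `Pending`. A minimal sketch; the function name is hypothetical, and it only reads PVC status in `openshift-monitoring`:

```shell
# unbound_monitoring_pvcs (hypothetical helper): list PVCs in
# openshift-monitoring whose phase is not "Bound".
# Output columns: NAME STORAGECLASS PHASE
unbound_monitoring_pvcs() {
  oc -n openshift-monitoring get pvc \
    -o jsonpath='{range .items[*]}{.metadata.name}{","}{.spec.storageClassName}{","}{.status.phase}{"\n"}{end}' |
    awk -F',' '$3 != "Bound" { print $1, $2, $3 }'
}
```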
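Internal Alert Management
Route alerts to an internal webhook that is reachable without external network access. AlertmanagerConfig is served under monitoring.coreos.com/v1alpha1, not v1, and must be created in a user-defined project; team-a is a hypothetical example:
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: internal-alerts
  namespace: team-a
spec:
  route:
    receiver: internal-webhook
  receivers:
    - name: internal-webhook
      webhookConfigs:
        - url: "http://internal-alert-manager.example.com/webhook"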
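Before relying on the internal webhook, a synthetic request can confirm that the endpoint is reachable and accepts JSON. This sketch posts a reduced subset of Alertmanager's webhook payload (`version` "4", `status`, `receiver`, `alerts`); the real payload carries more fields, such as `groupKey` and `commonLabels`, so a receiver that validates strictly may reject it. The function name and the `TestAlert` alert are hypothetical:

```shell
# send_test_alert (hypothetical helper): POST a synthetic alert to a webhook
# receiver and print the HTTP status code.
# The JSON is a reduced subset of Alertmanager's webhook payload (version "4");
# the alert name "TestAlert" is made up for this check.
# $1: webhook URL (defaults to the internal endpoint configured above)
send_test_alert() {
  local url="${1:-http://internal-alert-manager.example.com/webhook}"
  local now
  now="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  curl -sS -o /dev/null -w '%{http_code}\n' \
    -H 'Content-Type: application/json' \
    -d "{\"version\":\"4\",\"status\":\"firing\",\"receiver\":\"internal-webhook\",\"alerts\":[{\"status\":\"firing\",\"labels\":{\"alertname\":\"TestAlert\",\"severity\":\"warning\"},\"annotations\":{\"summary\":\"Synthetic test alert\"},\"startsAt\":\"$now\"}]}" \
    "$url"
}
```

The function prints the HTTP status code returned by the receiver, for example `200` on success.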
Capacity Planning
Capacity planning depends on utilization trends collected over weeks or months, not on single point-in-time readings.
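Resource Utilization Analysis
Combine point-in-time usage from oc adm top with historical data from the Prometheus query API:
# Current CPU and memory usage per node
oc adm top nodes
# Export historical memory usage through the Thanos Querier route.
# Requires a user token with access to platform metrics (for example, the
# cluster-monitoring-view role); only data within the retention period exists.
# Working set excludes reclaimable page cache; container!="" drops pod-level
# aggregate series that would otherwise be double-counted.
TOKEN="$(oc whoami -t)"
HOST="$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')"
curl -k -G -H "Authorization: Bearer $TOKEN" \
  "https://$HOST/api/v1/query_range" \
  --data-urlencode 'query=sum(container_memory_working_set_bytes{container!=""})' \
  --data-urlencode 'start=2024-01-01T00:00:00Z' \
  --data-urlencode 'end=2024-01-31T23:59:59Z' \
  --data-urlencode 'step=1h'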
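The Prometheus HTTP API rejects a range query when (end - start) / step exceeds 11,000 (the error reads "exceeded maximum resolution of 11,000 points per timeseries"), so long export windows need a coarser `step`. A minimal sketch of that arithmetic, assuming GNU `date` for parsing RFC 3339 timestamps; both function names are hypothetical:

```shell
# Helpers for sizing query_range requests. Both are hypothetical and rely on
# GNU date to parse RFC 3339 timestamps.

# range_points: number of points returned per series for start ($1),
# end ($2) and step in seconds ($3).
range_points() {
  local start_s end_s
  start_s="$(date -u -d "$1" +%s)"
  end_s="$(date -u -d "$2" +%s)"
  echo $(( (end_s - start_s) / $3 + 1 ))
}

# min_step: smallest whole-second step for which (end - start) / step
# does not exceed 11,000, the check the query_range API applies.
min_step() {
  local span
  span=$(( $(date -u -d "$2" +%s) - $(date -u -d "$1" +%s) ))
  echo $(( span / 11001 + 1 ))
}
```

For the January 2024 window used above, a one-hour step returns 744 points per series, and `min_step` reports 244 seconds as the finest step the API accepts for that window.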
Enterprise System Integration
External monitoring systems can consume cluster metrics in two ways: by scraping the Prometheus federation endpoint (pull) or by receiving samples through remote write (push).
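Metrics Federation
To have an external Prometheus pull selected series, add a scrape job to that server's configuration that targets the cluster's /federate endpoint. On recent OpenShift 4.x releases the endpoint is exposed through the prometheus-k8s-federate route in openshift-monitoring. The target host, token file, and CA file below are placeholders, and the token must belong to an identity allowed to read platform metrics (for example, one bound to cluster-monitoring-view):
scrape_configs:
  - job_name: openshift-federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job="node-exporter"}'
    scheme: https
    authorization:
      credentials_file: /etc/prometheus/openshift-token
    tls_config:
      ca_file: /etc/prometheus/openshift-ingress-ca.crt
    static_configs:
      - targets:
          - prometheus-k8s-federate-openshift-monitoring.apps.cluster.example.com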
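To check what an external collector will receive, the federation endpoint can be queried directly. This sketch assumes a logged-in user with permission to read platform metrics and a route named `prometheus-k8s-federate` (overridable as the first argument); the function name and the `node-exporter` selector are examples, and `-k` skips TLS verification, which is only appropriate for testing:

```shell
# federate_sample (hypothetical helper): fetch series from the cluster's
# federation endpoint, as an external Prometheus would when scraping it.
# Assumes `oc` is logged in with a token (oc whoami -t) for a user allowed
# to read platform metrics.
# $1: route name (default prometheus-k8s-federate; verify it exists in your release)
federate_sample() {
  local route="${1:-prometheus-k8s-federate}"
  local host
  host="$(oc -n openshift-monitoring get route "$route" -o jsonpath='{.spec.host}')"
  # -G sends match[] as a query parameter; -k skips TLS verification (testing only)
  curl -sS -G -k \
    -H "Authorization: Bearer $(oc whoami -t)" \
    --data-urlencode 'match[]={job="node-exporter"}' \
    "https://$host/federate"
}
```

The output is in the Prometheus text exposition format, one line per sample, which is what an external Prometheus ingests when it scrapes the endpoint.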
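Remote Write Configuration
Use remote write to send samples to external metric storage. The remoteWrite settings live in the same cluster-monitoring-config ConfigMap as the retention and storage settings shown earlier; merge them into the existing prometheusK8s section, because applying a second copy of the ConfigMap replaces the earlier config.yaml. This example forwards only series whose names start with container_:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
        - url: "https://prometheus.example.com/api/v1/write"
          writeRelabelConfigs:
            - sourceLabels: [__name__]
              regex: 'container_.*'
              action: keep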
Operational Best Practices
Define an escalation path and a response procedure for every alert severity that is routed to a receiver. Review alert thresholds regularly, retuning or removing alerts that fire without requiring action, so that the alerts that remain stay meaningful and alert fatigue is avoided.
Document monitoring configurations (cluster-monitoring-config, AlertmanagerConfig resources, dashboards) together with the architectural decisions behind them; this supports long-term maintenance and knowledge transfer. As the cluster grows, revisit Prometheus retention and storage size, since the number of series and the ingestion rate grow with node and pod count.
Assign clear ownership of the monitoring stack so that configuration changes keep pace with evolving cluster requirements.