Troubleshooting Guide
Common Issues
Installation Failures
Check API server health:
# Test API server health
curl -k https://api.demo.k8s.local:6443/healthz
# Verify API server version
curl -k https://api.demo.k8s.local:6443/version
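During an installation the API server often needs several minutes to come up, so a single probe can mislead. The sketch below polls the same /healthz endpoint until it returns "ok". The function name wait_for_api, the default attempt count, and the delay are illustrative assumptions, not part of any OpenShift tooling.

```shell
# Poll <base-url>/healthz until it answers "ok" or the attempts run out.
# Usage: wait_for_api <base-url> [attempts] [delay-seconds]
# wait_for_api is a hypothetical helper; defaults are arbitrary.
wait_for_api() {
    local url="$1" attempts="${2:-30}" delay="${3:-10}" i=1
    while [ "$i" -le "$attempts" ]; do
        # -k skips TLS verification (as in the commands above), -s hides
        # the progress meter, --max-time bounds each probe to 5 seconds.
        if [ "$(curl -k -s --max-time 5 "$url/healthz")" = "ok" ]; then
            echo "API server healthy after $i attempt(s)"
            return 0
        fi
        # Sleep only when another attempt remains.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    echo "API server not healthy after $attempts attempt(s)" >&2
    return 1
}

# Example:
#   wait_for_api https://api.demo.k8s.local:6443 60 5
```

The function returns non-zero on timeout, so it can gate later steps in a script.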
Check node and machine status:
oc get nodes
oc get machines
oc describe node <node-name>
Review events:
oc get events --sort-by='.metadata.creationTimestamp'
Examine operator status:
oc get clusteroperators
oc describe co <operator-name>
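On a cluster with dozens of operators, it helps to narrow the list to the ones that need attention. A minimal sketch follows, assuming the default table layout of oc get clusteroperators (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, ...) and that every row has a VERSION value. Early in an installation that column can be empty, which would shift the fields. The function name unhealthy_operators is hypothetical.

```shell
# Print the names of cluster operators that are not Available or are Degraded.
# Reads the default table output of `oc get clusteroperators` on stdin.
# Assumes column 3 is AVAILABLE and column 5 is DEGRADED.
unhealthy_operators() {
    awk 'NR > 1 && ($3 != "True" || $5 != "False") { print $1 }'
}

# Example against a live cluster:
#   oc get clusteroperators | unhealthy_operators
#   oc get clusteroperators | unhealthy_operators | xargs -r -n1 oc describe co
```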
Check machine configuration:
oc get pods -n openshift-machine-config-operator
oc logs -n openshift-machine-config-operator -l k8s-app=machine-config-server
Check installation logs:
openshift-install gather bootstrap --dir /root/cluster
Network Issues
Verify DNS resolution:
# Check API server resolution
dig api.demo.k8s.local +short
# Check internal API server resolution
dig api-int.demo.k8s.local +short
# Check application wildcard DNS
dig '*.apps.demo.k8s.local' +short
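The three lookups above can be wrapped in one pass that flags any name with no answer. This is a sketch only: check_cluster_dns is a hypothetical helper, and the label "test" under apps is an arbitrary choice, since any name below the wildcard record should resolve.

```shell
# Report which of the cluster's DNS names fail to resolve.
# Usage: check_cluster_dns <cluster-domain>
# check_cluster_dns is a hypothetical helper, not an OpenShift command.
check_cluster_dns() {
    local domain="$1" failed=0 name answer
    # "test.apps" is an arbitrary label used to exercise the wildcard record.
    for name in "api.$domain" "api-int.$domain" "test.apps.$domain"; do
        answer=$(dig +short "$name")
        if [ -n "$answer" ]; then
            echo "OK   $name"
        else
            echo "FAIL $name (no record returned)"
            failed=1
        fi
    done
    return "$failed"
}

# Example:
#   check_cluster_dns demo.k8s.local
```

A non-zero return means at least one name returned no record from the resolver that dig used.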
Check pod networking:
oc get pods -n openshift-sdn
oc logs -n openshift-sdn -l app=sdn
oc get network.config.openshift.io cluster -o yaml
Review service endpoints:
oc get endpoints -A
oc get svc -A
Test network connectivity:
oc debug node/<node-name> -- chroot /host ip addr show
oc debug node/<node-name> -- chroot /host ping <target-ip>
Resource Constraints
Check resource usage:
oc adm top nodes
oc adm top pods --containers=true --all-namespaces
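To pick out nodes under pressure from the oc adm top nodes output, a small filter is enough. The sketch below assumes the default columns (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%). The function name busy_nodes and the 80 percent default threshold are illustrative choices.

```shell
# Print nodes whose CPU% or MEMORY% is at or above a threshold.
# Reads the default table output of `oc adm top nodes` on stdin.
# Usage: busy_nodes [threshold-percent]   (hypothetical helper, default 80)
busy_nodes() {
    local threshold="${1:-80}"
    awk -v limit="$threshold" 'NR > 1 {
        cpu = $3; mem = $5
        # Strip the trailing percent sign before comparing numerically.
        sub(/%/, "", cpu); sub(/%/, "", mem)
        if (cpu + 0 >= limit || mem + 0 >= limit) print $1, $3, $5
    }'
}

# Example against a live cluster:
#   oc adm top nodes | busy_nodes 90
```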
Review quota usage:
oc get resourcequota -A
oc describe quota -n <namespace>
Monitor storage:
oc get pv,pvc --all-namespaces
oc get volumeattachment
Certificate Issues
Check certificate status:
oc get csr
oc get secret -n openshift-config
Review certificate expiration:
oc get secret -n openshift-kube-apiserver-operator kube-apiserver-to-kubelet-signer -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}'
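The annotation returns a raw timestamp, which is easier to act on as a number of days remaining. A minimal sketch, assuming the value is an RFC 3339 timestamp and that GNU date is available (the -d option is not portable to BSD or macOS date). The function name days_until is hypothetical.

```shell
# Print the whole days between now (or a given epoch) and a timestamp.
# Usage: days_until <timestamp> [now-epoch-seconds]
# Requires GNU date; days_until is a hypothetical helper.
days_until() {
    local expiry now
    expiry=$(date -u -d "$1" +%s) || return 1
    now="${2:-$(date -u +%s)}"
    # Integer division truncates toward zero; negative means already expired.
    echo $(( (expiry - now) / 86400 ))
}

# Example against a live cluster:
#   not_after=$(oc get secret -n openshift-kube-apiserver-operator \
#       kube-apiserver-to-kubelet-signer \
#       -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}')
#   days_until "$not_after"
```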
Verify API server certificates:
oc get apiserver cluster -o yaml
Authentication and Authorization
Check identity provider configuration:
oc get oauth cluster -o yaml
oc get identity
Review role bindings:
oc get clusterrolebinding
oc get rolebinding --all-namespaces
Registry Issues
Check registry status:
oc get pods -n openshift-image-registry
oc get configs.imageregistry.operator.openshift.io cluster -o yaml
Review storage configuration:
oc get pvc -n openshift-image-registry
oc describe pvc -n openshift-image-registry
Collecting Diagnostics
Gather must-gather data:
# General cluster data
oc adm must-gather
# Specific component data
oc adm must-gather --image=registry.redhat.io/rhacm2/acm-must-gather-rhel8:v2.8
Review cluster logs:
# Control plane logs
oc logs -n openshift-controller-manager deployment/controller-manager
# Node logs
oc adm node-logs <node-name> -u kubelet
# Specific pod logs
oc logs -n <namespace> <pod-name> --previous
Export cluster state:
# Common workload resources in all namespaces ("all" omits Secrets, ConfigMaps, and custom resources)
oc get all -A -o yaml > cluster-state.yaml
# Specific component state
oc get nodes -o yaml > nodes-state.yaml
oc get co -o yaml > operators-state.yaml
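Repeated exports overwrite each other when they share file names. One way to keep them apart is to write each export into a timestamped directory, as sketched below. The function name snapshot_cluster_state, the directory naming, and the added events file are assumptions made for illustration.

```shell
# Save node, operator, and event state into a new timestamped directory
# and print the directory path.
# Usage: snapshot_cluster_state [parent-dir]   (hypothetical helper)
snapshot_cluster_state() {
    local outdir
    outdir="${1:-.}/cluster-state-$(date -u +%Y%m%dT%H%M%SZ)"
    mkdir -p "$outdir" || return 1
    oc get nodes -o yaml > "$outdir/nodes-state.yaml"
    oc get co -o yaml > "$outdir/operators-state.yaml"
    oc get events -A --sort-by='.metadata.creationTimestamp' > "$outdir/events.txt"
    echo "$outdir"
}

# Example:
#   dir=$(snapshot_cluster_state /var/tmp) && tar -czf "$dir.tar.gz" -C "$dir" .
```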
Check etcd health:
oc rsh -n openshift-etcd etcd-<control-plane-node> etcdctl endpoint health
oc rsh -n openshift-etcd etcd-<control-plane-node> etcdctl endpoint status -w table
Monitor API server metrics:
oc get --raw /metrics | grep apiserver_request_duration_seconds
DRP-Specific Troubleshooting
Check DRP machine status:
drpcli machines show <machine-uuid>
Examine task execution:
drpcli tasks status <task-uuid>
drpcli tasks logs <task-uuid>
Best Practices
Maintain cluster documentation including:
Network configuration
Storage layout
Authentication setup
Custom configurations
Implement systematic log collection and retention
Create and maintain runbooks for common issues
Document configuration changes and their rationale
Establish clear escalation paths for different types of issues