Troubleshooting Guide

Common Issues

Installation Failures

Check API server health:

# Test API server health
curl -k https://api.demo.k8s.local:6443/healthz

# Verify API server version
curl -k https://api.demo.k8s.local:6443/version
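
During installation the API server can take a while to come up, so a single request may be misleading. The following is a minimal sketch of a polling wrapper around the same `/healthz` check. The function name `wait_for_api`, the attempt count, and the delay are illustrative assumptions, not part of any OpenShift tooling.

```bash
#!/usr/bin/env bash
# Poll /healthz until it returns "ok" or the attempts run out.
# wait_for_api and its defaults (30 attempts, 10 s apart) are hypothetical.
wait_for_api() {
  local url="$1" attempts="${2:-30}" delay="${3:-10}" i body
  for ((i = 1; i <= attempts; i++)); do
    # -k skips TLS verification, -s silences progress output,
    # --max-time bounds each individual request.
    body="$(curl -ks --max-time 5 "$url" || true)"
    if [[ "$body" == "ok" ]]; then
      echo "API healthy after $i attempt(s)"
      return 0
    fi
    if (( i < attempts )); then sleep "$delay"; fi
  done
  echo "API not healthy after $attempts attempt(s)" >&2
  return 1
}

# Example call:
#   wait_for_api https://api.demo.k8s.local:6443/healthz 30 10
```

A non-zero exit status means the endpoint never answered `ok`, which makes the function usable as a gate in a larger script.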

Check node and machine status:

oc get nodes
oc get machines
oc describe node <node-name>

Review events:

oc get events --sort-by='.metadata.creationTimestamp'
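
On a busy cluster the full event list is noisy. As a sketch, the same query can be narrowed to warnings across all namespaces and trimmed to the most recent entries. `recent_warnings` and its default of 20 lines are hypothetical choices for illustration.

```bash
# Show the most recent Warning events from every namespace.
# recent_warnings is a hypothetical helper; the default count is arbitrary.
recent_warnings() {
  local count="${1:-20}"
  oc get events -A --field-selector type=Warning \
    --sort-by='.metadata.creationTimestamp' | tail -n "$count"
}

# Example call:
#   recent_warnings 50
```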

Examine operator status:

oc get clusteroperators
oc describe co <operator-name>
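
To avoid scanning the whole operator table by eye, the conditions can be filtered in a script. This sketch reads the `Available` and `Degraded` conditions through JSONPath rather than parsing table columns, since the VERSION column can be empty during installation. The helper name `unhealthy_operators` is an assumption.

```bash
# Print the names of cluster operators that are not Available or are Degraded.
# unhealthy_operators is a hypothetical helper name.
unhealthy_operators() {
  oc get clusteroperators -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Available")].status}{"\t"}{.status.conditions[?(@.type=="Degraded")].status}{"\n"}{end}' |
    awk -F'\t' '$2 != "True" || $3 == "True" { print $1 }'
}

# Example call: describe every operator that needs attention.
#   for op in $(unhealthy_operators); do oc describe co "$op"; done
```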

Check machine configuration:

oc get pods -n openshift-machine-config-operator
oc logs -n openshift-machine-config-operator -l k8s-app=machine-config-server

Check installation logs:

openshift-install gather bootstrap --dir /root/cluster

Network Issues

Verify DNS resolution:

# Check API server resolution
dig api.demo.k8s.local +short

# Check internal API server resolution
dig api-int.demo.k8s.local +short

# Check application wildcard DNS (any hostname under *.apps should resolve;
# an unquoted * may be expanded by the shell)
dig test.apps.demo.k8s.local +short
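
The three lookups can be run as one pass that reports which records are missing. This is a sketch only: `check_cluster_dns` is a hypothetical helper, and `test` is an arbitrary hostname chosen to exercise the `*.apps` wildcard.

```bash
# Check the API, internal API, and wildcard application records for a cluster domain.
# check_cluster_dns and the "test" hostname are illustrative assumptions.
check_cluster_dns() {
  local domain="$1" name answer failed=0
  for name in "api.$domain" "api-int.$domain" "test.apps.$domain"; do
    # Keep only the first answer line; an empty result means no record was found.
    answer="$(dig +short "$name" | head -n 1)"
    if [[ -n "$answer" ]]; then
      echo "OK   $name -> $answer"
    else
      echo "FAIL $name"
      failed=1
    fi
  done
  return "$failed"
}

# Example call:
#   check_cluster_dns demo.k8s.local
```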

Check pod networking:

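# These commands apply to the OpenShift SDN plugin. On OVN-Kubernetes clusters,
# use the openshift-ovn-kubernetes namespace and the app=ovnkube-node label instead.
oc get pods -n openshift-sdn
oc logs -n openshift-sdn -l app=sdn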
oc get network.config.openshift.io cluster -o yaml

Review service endpoints:
```
oc get endpoints -A
oc get svc -A
```

Test network connectivity:

oc debug node/<node-name> -- chroot /host ip addr show
oc debug node/<node-name> -- chroot /host ping -c 3 <target-ip>

Resource Constraints

Check resource usage:

oc adm top nodes
oc adm top pods --containers=true --all-namespaces
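
To flag only the nodes that are under pressure, the output of `oc adm top nodes` can be filtered by percentage. The sketch below assumes the usual column order (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%). `busy_nodes` and the 80% default threshold are hypothetical.

```bash
# List nodes whose CPU% or MEMORY% is at or above a threshold.
# busy_nodes and the default threshold of 80 are illustrative assumptions.
busy_nodes() {
  local threshold="${1:-80}"
  oc adm top nodes |
    awk -v t="$threshold" 'NR > 1 {
      cpu = $3; mem = $5
      sub(/%/, "", cpu); sub(/%/, "", mem)   # strip the percent sign
      if (cpu + 0 >= t || mem + 0 >= t) print $1, $3, $5
    }'
}

# Example call:
#   busy_nodes 90
```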

Review quota usage:

oc get resourcequota -A
oc describe quota -n <namespace>

Monitor storage:

oc get pv,pvc --all-namespaces
oc get volumeattachment

Certificate Issues

Check certificate status:

oc get csr
oc get secret -n openshift-config
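
Nodes that fail to join often have certificate signing requests waiting for approval. As a sketch, pending requests can be picked out of the `oc get csr` listing by its last column (CONDITION). `pending_csrs` is a hypothetical helper name.

```bash
# Print the names of CSRs whose condition is still Pending.
# pending_csrs is a hypothetical helper; it relies on CONDITION being the last column.
pending_csrs() {
  oc get csr --no-headers | awk '$NF == "Pending" { print $1 }'
}

# Example call:
#   pending_csrs
```

After reviewing that each pending request comes from a node you expect, it can be approved with `oc adm certificate approve <csr-name>`.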

Review certificate expiration:

oc get secret -n openshift-kube-apiserver-operator kube-apiserver-to-kubelet-signer -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}'
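
The annotation holds a timestamp, which is easier to act on as a number of days remaining. The following sketch converts such a timestamp with GNU `date`. The helper name `days_until` is an assumption, and `date -d` is not available in BSD or macOS `date`.

```bash
# Convert an RFC 3339 timestamp into whole days remaining from now.
# days_until is a hypothetical helper; it requires GNU date (date -d).
# The optional second argument overrides "now" as epoch seconds.
days_until() {
  local not_after="$1" now="${2:-$(date +%s)}" expiry
  expiry="$(date -u -d "$not_after" +%s)" || return 1
  echo $(( (expiry - now) / 86400 ))
}

# Example call:
#   not_after="$(oc get secret -n openshift-kube-apiserver-operator \
#     kube-apiserver-to-kubelet-signer \
#     -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}')"
#   days_until "$not_after"
```

A negative result means the certificate has already expired.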

Verify API server certificates:
```
oc get apiserver cluster -o yaml
```

Authentication and Authorization

Check identity provider configuration:

oc get oauth cluster -o yaml
oc get identity

Review role bindings:

oc get clusterrolebinding
oc get rolebinding --all-namespaces

Registry Issues

Check registry status:

oc get pods -n openshift-image-registry
oc get configs.imageregistry.operator.openshift.io cluster -o yaml

Review storage configuration:

oc get pvc -n openshift-image-registry
oc describe pvc -n openshift-image-registry

Collecting Diagnostics

Gather must-gather data:

# General cluster data
oc adm must-gather

# Specific component data
oc adm must-gather --image=registry.redhat.io/rhacm2/acm-must-gather-rhel8:v2.8

Review cluster logs:

# Control plane logs
oc logs -n openshift-controller-manager deployment/controller-manager

# Node logs
oc adm node-logs <node-name> -u kubelet

# Specific pod logs
oc logs -n <namespace> <pod-name> --previous

Export cluster state:

# Common workload resources in all namespaces
# (oc get all omits ConfigMaps, Secrets, and custom resources)
oc get all -A -o yaml > cluster-state.yaml

# Specific component state
oc get nodes -o yaml > nodes-state.yaml
oc get co -o yaml > operators-state.yaml
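
Repeated exports overwrite each other when they share file names. One way to keep snapshots apart, sketched here, is to write each export into a timestamped directory. `export_cluster_state`, the directory naming scheme, and the `events.txt` file are illustrative assumptions.

```bash
# Write node, operator, and event state into one directory per snapshot.
# export_cluster_state and the default directory name are hypothetical.
export_cluster_state() {
  local out_dir="${1:-cluster-state-$(date -u +%Y%m%dT%H%M%SZ)}"
  mkdir -p "$out_dir" || return 1
  oc get nodes -o yaml > "$out_dir/nodes-state.yaml"
  oc get co -o yaml > "$out_dir/operators-state.yaml"
  oc get events -A --sort-by='.metadata.creationTimestamp' > "$out_dir/events.txt"
  echo "$out_dir"   # print the location so callers can archive it
}

# Example call:
#   dir="$(export_cluster_state)" && tar czf "$dir.tar.gz" "$dir"
```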

Check etcd health:

oc rsh -n openshift-etcd etcd-<control-plane-node> etcdctl endpoint health
oc rsh -n openshift-etcd etcd-<control-plane-node> etcdctl endpoint status -w table

Monitor API server metrics:

oc get --raw /metrics | grep apiserver_request_duration_seconds

DRP-Specific Troubleshooting

Check DRP machine status:
```
drpcli machines show <machine-uuid>
```

Examine task execution:

# A job is one run of a task on a machine; the machine's CurrentJob field holds its UUID
drpcli jobs show <job-uuid>
drpcli jobs log <job-uuid>

API Health Verification

Test API server health:

# Test API server health directly
curl -k https://api.demo.k8s.local:6443/healthz

# Get API server version
curl -k https://api.demo.k8s.local:6443/version

Best Practices

- Maintain cluster documentation, including:
    - Network configuration
    - Storage layout
    - Authentication setup
    - Custom configurations
- Implement systematic log collection and retention
- Create and maintain runbooks for common issues
- Document configuration changes and their rationale
- Establish clear escalation paths for different types of issues