
Troubleshooting Guide

Common Issues

Installation Failures

  1. Check API server health:

    # Test API server health
    curl -k https://api.demo.k8s.local:6443/healthz
    
    # Verify API server version
    curl -k https://api.demo.k8s.local:6443/version
    

  2. Check node and machine status:

    oc get nodes
    oc get machines
    oc describe node <node-name>
    

  3. Review events:

    oc get events --sort-by='.metadata.creationTimestamp'
    

  4. Examine operator status:

    oc get clusteroperators
    oc describe co <operator-name>
    

  5. Check machine configuration:

    oc get pods -n openshift-machine-config-operator
    oc logs -n openshift-machine-config-operator -l k8s-app=machine-config-server
    

  6. Check installation logs:

    openshift-install gather bootstrap --dir /root/cluster
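
The node and operator checks above produce long tables, and a small filter can surface only the rows that need attention. The following is a minimal sketch, assuming the default column layout of `oc get nodes` and `oc get clusteroperators`. The function names `not_ready_nodes` and `unhealthy_operators` are hypothetical helpers, not part of `oc`.

```shell
# Hypothetical helpers; both read `oc ... --no-headers` output on stdin.

# Print nodes whose STATUS column does not start with "Ready"
# (e.g. "NotReady"; "Ready,SchedulingDisabled" still counts as Ready).
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Print operators that are not Available or are Degraded.
# Assumes columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED ...
# An operator with an empty VERSION shifts the columns and will be reported.
unhealthy_operators() {
  awk '$3 != "True" || $5 != "False" { print $1 }'
}

# Usage against a live cluster:
#   oc get nodes --no-headers | not_ready_nodes
#   oc get clusteroperators --no-headers | unhealthy_operators
```

Empty output from both filters suggests the nodes and operators are healthy, in which case the bootstrap logs are the next place to look.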
    

Network Issues

  1. Verify DNS resolution:

    # Check API server resolution
    dig api.demo.k8s.local +short
    
    # Check internal API server resolution
    dig api-int.demo.k8s.local +short
    
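    # Check application wildcard DNS (any label under apps should resolve;
    # a concrete name avoids shell expansion of an unquoted "*")
    dig test.apps.demo.k8s.local +short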
    

  2. Check pod networking:

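    # For clusters using OpenShift SDN; on OVN-Kubernetes clusters,
    # inspect the openshift-ovn-kubernetes namespace instead
    oc get pods -n openshift-sdn
    oc logs -n openshift-sdn -l app=sdn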
    oc get network.config.openshift.io cluster -o yaml
    

  3. Review service endpoints:

    oc get endpoints -A
    oc get svc -A
    

  4. Test network connectivity:

    oc debug node/<node-name> -- chroot /host ip addr show
    oc debug node/<node-name> -- chroot /host ping <target-ip>
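
The DNS checks in step 1 can be wrapped in a loop that reports every name in one pass. This is a minimal sketch, assuming `dig` is installed; `check_dns` is a hypothetical helper name, and `test.apps.demo.k8s.local` is an arbitrary label chosen only to exercise the wildcard record.

```shell
# Hypothetical helper: report which names return no answer from `dig +short`.
# Returns non-zero if any name is unresolved.
check_dns() {
  rc=0
  for name in "$@"; do
    if [ -n "$(dig +short "$name")" ]; then
      echo "OK $name"
    else
      echo "UNRESOLVED $name"
      rc=1
    fi
  done
  return "$rc"
}

# Usage:
#   check_dns api.demo.k8s.local api-int.demo.k8s.local test.apps.demo.k8s.local
```

Because the function returns non-zero when any name fails, it can also gate later steps in a script.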
    

Resource Constraints

  1. Check resource usage:

    oc adm top nodes
    oc adm top pods --containers=true --all-namespaces
    

  2. Review quota usage:

    oc get resourcequota -A
    oc describe quota -n <namespace>
    

  3. Monitor storage:

    oc get pv,pvc --all-namespaces
    oc get volumeattachment
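
The usage and storage listings above can be narrowed to the entries most likely to indicate a constraint. The sketch below assumes the default column order of `oc adm top nodes` and `oc get pvc -A`. `busy_nodes` and `unbound_pvcs` are hypothetical helper names, and the 80% default threshold is an arbitrary illustration.

```shell
# Hypothetical helpers; both read `oc ... --no-headers` output on stdin.

# Print nodes whose CPU% or MEMORY% exceeds a threshold (default 80).
# Assumes columns: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
busy_nodes() {
  awk -v limit="${1:-80}" '{
    cpu = $3; mem = $5
    sub(/%/, "", cpu); sub(/%/, "", mem)
    if (cpu + 0 > limit + 0 || mem + 0 > limit + 0) print $1, $3, $5
  }'
}

# Print claims that are not Bound, as "namespace/name status".
# Assumes columns: NAMESPACE NAME STATUS ...
unbound_pvcs() {
  awk '$3 != "Bound" { print $1 "/" $2, $3 }'
}

# Usage against a live cluster:
#   oc adm top nodes --no-headers | busy_nodes 90
#   oc get pvc --all-namespaces --no-headers | unbound_pvcs
```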
    

Certificate Issues

  1. Check certificate status:

    oc get csr
    oc get secret -n openshift-config
    

  2. Review certificate expiration:

    oc get secret -n openshift-kube-apiserver-operator kube-apiserver-to-kubelet-signer -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}'
    

  3. Verify API server certificates:

    oc get apiserver cluster -o yaml
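
The expiration annotation from step 2 is a timestamp, which is easier to act on as a number of days remaining. Pending requests in the `oc get csr` listing can be isolated in a similar way. This is a sketch under two assumptions: GNU `date` is available, and CONDITION is the last column of the CSR listing. `days_until` and `pending_csrs` are hypothetical helper names.

```shell
# Hypothetical helper: whole days between "now" and an RFC 3339 timestamp.
# Requires GNU date (`date -d`); BSD/macOS date uses different flags.
# The optional second argument (epoch seconds for "now") exists for testing.
days_until() {
  expiry=$(date -u -d "$1" +%s) || return 1
  now=${2:-$(date -u +%s)}
  echo $(( (expiry - now) / 86400 ))
}

# Hypothetical helper: names of CSRs whose last column (CONDITION) is Pending.
# Reads `oc get csr --no-headers` output on stdin.
pending_csrs() {
  awk '$NF == "Pending" { print $1 }'
}

# Usage against a live cluster:
#   not_after=$(oc get secret -n openshift-kube-apiserver-operator \
#     kube-apiserver-to-kubelet-signer \
#     -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}')
#   days_until "$not_after"
#   oc get csr --no-headers | pending_csrs
```

A negative result from `days_until` means the certificate has already expired.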
    

Authentication and Authorization

  1. Check identity provider configuration:

    oc get oauth cluster -o yaml
    oc get identity
    

  2. Review role bindings:

    oc get clusterrolebinding
    oc get rolebinding --all-namespaces
    

Registry Issues

  1. Check registry status:

    oc get pods -n openshift-image-registry
    oc get configs.imageregistry.operator.openshift.io cluster -o yaml
    

  2. Review storage configuration:

    oc get pvc -n openshift-image-registry
    oc describe pvc -n openshift-image-registry
    

Collecting Diagnostics

  1. Gather must-gather data:

    # General cluster data
    oc adm must-gather
    
    # Specific component data
    oc adm must-gather --image=registry.redhat.io/rhacm2/acm-must-gather-rhel8:v2.8
    

  2. Review cluster logs:

    # Control plane logs
    oc logs -n openshift-controller-manager deployment/controller-manager
    
    # Node logs
    oc adm node-logs <node-name> -u kubelet
    
    # Specific pod logs
    oc logs -n <namespace> <pod-name> --previous
    

  3. Export cluster state:

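    # Common workload resources in all namespaces
    # (the "all" alias omits many types, such as secrets and custom resources)
    oc get all -A -o yaml > cluster-state.yaml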
    
    # Specific component state
    oc get nodes -o yaml > nodes-state.yaml
    oc get co -o yaml > operators-state.yaml
    

  4. Check etcd health:

    oc rsh -n openshift-etcd etcd-<control-plane-node> etcdctl endpoint health
    oc rsh -n openshift-etcd etcd-<control-plane-node> etcdctl endpoint status -w table
    

  5. Monitor API server metrics:

    oc get --raw /metrics | grep apiserver_request_duration_seconds
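
The raw `grep` in step 5 returns one line per label combination, which is hard to read at a glance. The sketch below totals request counts by verb instead, assuming the Prometheus text format and the `verb` label that appear on `apiserver_request_duration_seconds_count` series. `requests_by_verb` is a hypothetical helper name.

```shell
# Hypothetical helper: sum apiserver_request_duration_seconds_count by verb.
# Reads Prometheus text-format metrics on stdin; prints "verb total", sorted.
requests_by_verb() {
  awk '/^apiserver_request_duration_seconds_count\{/ {
    if (match($0, /verb="[^"]*"/)) {
      # Strip the leading verb=" (6 chars) and the trailing quote.
      verb = substr($0, RSTART + 6, RLENGTH - 7)
      total[verb] += $NF
    }
  }
  END { for (v in total) print v, total[v] }' | sort
}

# Usage against a live cluster:
#   oc get --raw /metrics | requests_by_verb
```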
    

DRP-Specific Troubleshooting

  1. Check DRP machine status:

    drpcli machines show <machine-uuid>
    

  2. Examine task execution:

    # Task runs are recorded as jobs in DRP
    drpcli jobs show <job-uuid>
    drpcli jobs log <job-uuid>
    

API Health Verification

Test API server health:

    # Test API server health directly
    curl -k https://api.demo.k8s.local:6443/healthz

    # Get API server version
    curl -k https://api.demo.k8s.local:6443/version
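
During installation or a control plane restart, a single request may fail even though the API server is about to become available. The following sketch polls the health endpoint a few times before giving up. It assumes the endpoint returns the body `ok` when healthy; `wait_for_api` and its default attempt count and delay are hypothetical choices.

```shell
# Hypothetical helper: poll an endpoint until its body is "ok" or attempts run out.
# $1 = URL, $2 = attempts (default 5), $3 = seconds between attempts (default 2).
wait_for_api() {
  url=$1; attempts=${2:-5}; delay=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -k skips TLS verification, as in the commands above; -s silences progress.
    if [ "$(curl -ks --max-time 5 "$url")" = "ok" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    # Pause only if another attempt remains.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  echo "not healthy after $attempts attempt(s)"
  return 1
}

# Usage:
#   wait_for_api https://api.demo.k8s.local:6443/healthz 10 5
```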

Best Practices

  1. Maintain cluster documentation, including:

     - Network configuration
     - Storage layout
     - Authentication setup
     - Custom configurations

  2. Implement systematic log collection and retention

  3. Create and maintain runbooks for common issues

  4. Document configuration changes and their rationale

  5. Establish clear escalation paths for different types of issues