AI Summary
This video provides a focused troubleshooting checklist for CKA Domain 2, covering cluster architecture, core components, and practical commands to diagnose issues. It walks through the control plane and worker node components, explaining their roles and how to check them when things go wrong.
Chapters
A Kubernetes cluster consists of a control plane (decision-making) and worker nodes (execution). Troubleshooting starts by identifying whether the problem is with the control plane or worker nodes.
Kubelet runs on every node, reports node status, manages containers via CRI, enforces pod specs, collects resource usage, and invokes network/storage plugins. Check its pods and endpoints when pods fail or nodes are unhealthy.
Demonstrates disabling anonymous authentication and read-only port in kubelet config. After changes, direct pod access and metrics are blocked, enhancing security.
CRI makes kubelet runtime-agnostic, supporting containerd, CRI-O, etc. Use crictl to list/inspect containers, check logs, and exec into containers for troubleshooting.
Runs on each node as a DaemonSet, implements networking rules for services. Check kube-proxy logs and local forwarding rules if services are unreachable.
Front door of the cluster; authenticates, authorizes, validates requests, runs admission controllers, and persists state to etcd. Check logs and static manifests if kubectl fails.
Three misconfigurations found: semicolon in manifest, unknown flag, wrong etcd port (23000 vs 2379). Fixed by editing manifest and correcting flags/ports.
Runs controllers to ensure actual state matches desired state. If resources aren't reconciling, check controller manager logs and watch cycles.
Found unknown flag 'sidecar-insertion' in kube-controller-manager config. Removed the line to fix the issue.
Persistent store for all cluster objects. If unhealthy, control plane loses memory. Check cluster membership, health, and write success rates.
Selects best node for each pod. If pods are pending, check scheduler logs and conditions to see why filters/ranking failed.
Effective troubleshooting requires determining whether the issue is at the control plane or node level, then examining component-specific configuration files, logs, and ports. Understanding each component's role and interaction is key to resolving cluster problems.
Clickbait Check
90% Legit: "Title accurately describes the content: a focused troubleshooting guide for CKA Domain 2."
Study Flashcards (11)
What are the two main parts of a Kubernetes cluster?
easy
Control plane (decision-making) and worker nodes (execution).
01:02
What is the role of kubelet?
medium
Runs on every node, reports node status, manages containers via CRI, enforces pod specs, collects resource usage, and invokes plugins.
01:43
What is CRI and why is it important?
medium
Container Runtime Interface; makes kubelet runtime-agnostic, allowing Kubernetes to work with containerd, CRI-O, etc.
04:58
Which tool is used to directly interact with containers via CRI?
easy
crictl (CRI-CTL).
05:24
What does kube-proxy do?
medium
Runs on each node as a DaemonSet, implements networking rules to forward traffic to correct pod endpoints.
06:10
What is the first component to check if kubectl calls fail?
easy
API server (kube-apiserver).
07:11
What are three common misconfigurations found in the API server demo?
hard
Semicolon in manifest, unknown flag, wrong etcd port (23000 instead of 2379).
11:20
What is the role of the controller manager?
medium
Runs controllers to ensure actual state matches desired state (e.g., replica sets, jobs).
11:38
What is etcd?
easy
Persistent key-value store that stores all cluster objects and configuration.
14:01
What component is responsible for scheduling pods onto nodes?
easy
kube-scheduler.
14:44
What should you check if pods are stuck in pending state?
medium
Scheduler logs and conditions to see why filters/ranking failed.
15:12
💡 Key Takeaways
API Server Unresponsive
Shows a real troubleshooting scenario where kubectl fails with 'connection refused' on port 6443.
08:16
Three Misconfigurations Found
Demonstrates systematic log analysis to uncover multiple errors in API server manifest.
11:20
Unknown Flag in Controller Manager
Highlights how a simple typo in config can break the controller manager, and how logs reveal it.
13:05
Full Transcript
[00:00] Hi everyone. Today we will cover cluster architecture for CKA Domain 2, troubleshooting, and guide you through the cluster, the processes that run in it, and exactly where to look when things go wrong. Let's dive in.
[00:16] This video is aimed at CKA candidates. If you are studying for the exam, treat this as a focused troubleshooting checklist. Understand what each component does, how they talk to each
[00:33] other, and the practical commands and checks to run when something breaks. In Domain 2, we focus on four areas: the cluster and its nodes, the core cluster components,
[00:48] monitoring and resource checks, and services and networking. I'll walk you through each area in sequence so you know both the theory and the troubleshooting steps.
[01:02] At the highest level, a Kubernetes cluster is a group of machines working together: a control plane that makes decisions and worker nodes that run your workloads.
[01:18] If you are looking at the diagram on the slide, imagine the control plane at the top and the worker nodes below it. The control plane holds components that accept requests and record
[01:30] state. Worker nodes run pods and actually execute the containers. Troubleshooting starts by asking: is the problem with the decision-making, which is the control plane, or the execution,
[01:43] which is the worker node? Now, first, the node agent, which is the kubelet. It runs on every node. It continuously reports node status to the control plane. Its core job is managing
[01:58] containers on its node: pulling images, starting and stopping containers via the container runtime, and enforcing the pod specs the API server provides. When the API server gives an instruction
[02:16] like create, update, or delete a pod, the kubelet implements it. The kubelet also collects resource usage on the node and invokes network and storage plugins
[02:28] so pods get their networking and volumes. If you need to probe the kubelet, check its pods and exposed endpoints for metrics and pod information. These are the places to look when pods fail to start or nodes report unhealthy conditions.
[02:46] Now, coming to the demo. In this demo we will disable unauthorized and unauthenticated access via the kubelet.
[02:59] Now, first, checking the kubelet service. Here you can see the configuration file location.
[03:16] Now, accessing this file, as you can see, anonymous authentication is enabled
[03:33] and the read-only port is also available. Now, first, under normal operation, if I send the get pods command as a direct request, you can see all the pod information
[03:50] because the request is not authenticated. Now, first, changing the anonymous authentication from true to false, and then for authorization we change the mode to Webhook,
[04:12] and for the read-only port, we don't want anyone to access the metrics either, so we configure 0 as the read-only port. Now, restarting the kubelet.
[04:31] As you can see, now we are not able to access the pods through direct requests, and we are not able to see the metrics either. So in this way you can enable
[04:44] kubelet security, or you can check for any misconfiguration in the kubelet configuration file, whose location is obtained by checking the kubelet service.
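The three settings changed in this demo can be checked programmatically. The following is a minimal sketch, not part of the video: it assumes the kubelet configuration has already been loaded into a plain dict, and the "before" values for the authorization mode (`AlwaysAllow`) and read-only port (`10255`) are assumptions, since the demo does not state them. The function name `audit_kubelet_config` is hypothetical.

```python
# Illustrative audit of the kubelet settings discussed in the demo.
# The config is passed in as a plain dict; loading it from the kubelet
# configuration file is deliberately left out of this sketch.

def audit_kubelet_config(config):
    """Return a list of findings for the three settings hardened in the demo."""
    findings = []

    # authentication.anonymous.enabled should be false.
    anonymous = config.get("authentication", {}).get("anonymous", {})
    if anonymous.get("enabled", False):
        findings.append("anonymous authentication is enabled")

    # authorization.mode should be Webhook (a missing mode is flagged too,
    # which is a choice of this sketch).
    mode = config.get("authorization", {}).get("mode")
    if mode != "Webhook":
        findings.append(f"authorization mode is {mode!r}, expected 'Webhook'")

    # A readOnlyPort of 0 disables the read-only port.
    if config.get("readOnlyPort", 0) != 0:
        findings.append(f"read-only port {config['readOnlyPort']} is open")

    return findings


# Assumed "before" state: only the enabled anonymous auth comes from the demo.
before = {
    "authentication": {"anonymous": {"enabled": True}},
    "authorization": {"mode": "AlwaysAllow"},
    "readOnlyPort": 10255,
}

# "After" state as configured in the demo.
after = {
    "authentication": {"anonymous": {"enabled": False}},
    "authorization": {"mode": "Webhook"},
    "readOnlyPort": 0,
}

for finding in audit_kubelet_config(before):
    print("FINDING:", finding)
print("findings after hardening:", audit_kubelet_config(after))
```

The check only reads the config; as in the demo, any change to the file still requires a kubelet restart to take effect.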
[04:58] Now, moving forward, the kubelet talks to the container runtime through a standard interface called the CRI. This makes the kubelet runtime-agnostic.
[05:10] Because of the CRI, which is the Container Runtime Interface, Kubernetes works with containerd, CRI-O, or other runtimes without changing the core logic. containerd is a common lightweight runtime you will see in modern clusters.
[05:24] For troubleshooting containers directly, use the runtime-aware CLI, for example crictl, or as we normally call it, CRI-CTL, to list the containers, inspect them, check the logs, and exec into the running containers.
[05:41] This helps you separate container-level problems from Kubernetes-level problems. In the next demo we will be using crictl commands to check the logs and check the status
[05:56] of the different containers that are running on the control plane. Now, moving forward, kube-proxy runs on each node and implements the networking rules that let services reach the correct pods.
[06:10] Think of it as the traffic director. When a service is created, kube-proxy sets up the local networking rules to forward traffic to the correct pod endpoints. In many clusters, kube-proxy runs as a pod in the kube-system namespace and is deployed
[06:26] as a DaemonSet, so every node has one, and when you delete it, it will automatically be recreated. If clients can't reach a service, check the kube-proxy logs and the local forwarding rules
[06:40] on the node. That's often where service reachability problems show up. Now, moving forward, we come to the kube-apiserver. The API server is the cluster's front door.
[06:56] Everything, like kubectl commands, controllers, and kubelets, talks to it. It authenticates, authorizes, and validates requests, runs admission controllers, and persists cluster state to the cluster store.
[07:11] When you create an object, the API server is responsible for receiving that request, first validating it, and then storing the state changes, if any are requested. If the cluster is unresponsive or kubectl calls fail, the API server is the first point to check.
[07:31] Now, for quick troubleshooting, what happens is that the client request arrives at the API server. The API server first validates, authenticates, and persists the object.
[07:43] Other components, like the scheduler and controllers, watch for the changes and react accordingly. If something stops at any step, the API server logs and the static manifests on the control plane nodes are your first checkpoint.
[07:57] Now, coming to the live demo. In this live demo we will be troubleshooting the kube-apiserver. As you can see, when I run the kubectl command it is unresponsive, and the error shows that the connection to port 6443 was refused.
[08:16] So 6443 is the port for the kube-apiserver. Now, checking crictl first and checking the containers, there is no container being
[08:28] created. Now, checking the log files, there is no log file created either. So the container has actually not run even once.
[08:48] Now, checking the logs. Now, moving to the configuration file, the manifest for the kube-apiserver.
[09:02] Now you see that in the metadata there is a semicolon used instead of a colon. Now we change it; this is the first error.
[09:14] Now let's see if we can run the kubectl command again. Let's first check the container. Now the container has been created.
[09:27] The container is available. We can check the logs of this container.
[09:39] Now, to see the logs, again checking the container and giving the proper container ID. Now you can see that there is an unknown flag: instead of "mode",
[10:01] "modus" has been written. Now we can check the configuration file and make the correction accordingly.
[10:19] Now, again checking whether the kube-apiserver container is now available through the crictl command. Now, checking the logs again.
[10:40] Now you see that there is a connection refusal and that port number 23000 has been used. Now, checking for port 23000, you can see that etcd is configured with port 23000,
[10:56] but the actual port number is 2379.
[11:08] Now, after making the changes, let's see if the kube-apiserver is responding and we are able to run the kubectl command again.
[11:20] Yes, now it is working. So there were three misconfigurations, and we were able to find them through the different kinds of logs that are available.
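The third misconfiguration, the wrong etcd port, is the kind of error a small script can catch before the API server is even started. This is a minimal sketch under stated assumptions: the argument list stands in for the `command` section of the kube-apiserver static manifest, the address `127.0.0.1` is assumed (the demo only shows the port), and the helper name `find_etcd_port_problems` is hypothetical. Only the `--etcd-servers` flag and the ports 23000 and 2379 come from the demo.

```python
from urllib.parse import urlsplit

ETCD_CLIENT_PORT = 2379  # the correct etcd port named in the demo


def find_etcd_port_problems(command_args, expected_port=ETCD_CLIENT_PORT):
    """Report every --etcd-servers URL whose port differs from expected_port."""
    problems = []
    for arg in command_args:
        if not arg.startswith("--etcd-servers="):
            continue
        # The flag takes a comma-separated list of URLs.
        for url in arg.split("=", 1)[1].split(","):
            port = urlsplit(url).port
            if port != expected_port:
                problems.append(f"{url}: port {port}, expected {expected_port}")
    return problems


# Stand-in for the manifest's command list; the host is an assumption.
broken_args = ["kube-apiserver", "--etcd-servers=https://127.0.0.1:23000"]
fixed_args = ["kube-apiserver", "--etcd-servers=https://127.0.0.1:2379"]

print(find_etcd_port_problems(broken_args))
print(find_etcd_port_problems(fixed_args))
```

A check like this only covers one flag; the semicolon and the unknown flag from the same demo still have to be found through the manifest and the container logs.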
[11:38] Now, moving forward, the controller manager runs controllers that ensure the cluster's actual state matches the desired state, for resources like ReplicaSets, Jobs, Endpoints, PersistentVolumes, ServiceAccounts, and more.
[11:52] Controllers continuously compare the current state to the desired state and make changes to converge on that state. For example, restarting pods when replicas drop.
[12:04] Controllers don't receive external requests directly; they act through the API server. If the resources aren't reconciling as expected, the controller manager logs and its watch cycles are the places
[12:19] to investigate. Now, coming to the live demo.
[12:32] Now, as we can see, all these control plane components are present in the kube-system namespace, and here the controller manager is not working.
[12:50] Now, checking the logs of the particular pod.
[13:05] In the logs we can see that there is an unknown flag, sidecar-insertion.
[13:17] Now let us open the configuration file of the kube-controller-manager and find the flag that is causing the error.
[13:32] Now, removing the line which is causing the problem and then again checking whether the pod is now available or not. As you can see, the pod is now running.
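The compare-and-converge idea can be shown with a toy reconciliation pass. This is only an illustration of the pattern, not how the real controller manager is implemented: pods are plain strings in a list, and the names `reconcile` and `new_pod_name` are hypothetical.

```python
import itertools

# Toy source of unique pod names for this sketch.
_pod_ids = itertools.count(1)


def new_pod_name():
    return f"pod-{next(_pod_ids)}"


def reconcile(desired_replicas, running_pods):
    """One pass: change running_pods in place until it matches desired_replicas."""
    actions = []
    # Too few pods: start replacements.
    while len(running_pods) < desired_replicas:
        name = new_pod_name()
        running_pods.append(name)
        actions.append(("start", name))
    # Too many pods: stop the extras.
    while len(running_pods) > desired_replicas:
        actions.append(("stop", running_pods.pop()))
    return actions


pods = ["pod-a", "pod-b", "pod-c"]
pods.remove("pod-b")          # simulate a replica dropping
actions = reconcile(3, pods)  # the pass notices the gap and starts a pod
print(actions, pods)
```

Running the pass again when the counts already match returns no actions, which is the steady state a healthy controller settles into.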
[14:01] Now, moving forward, etcd is the cluster's persistent store, the single source of truth. It stores all the objects and configuration, like pods, nodes, secrets, roles, everything.
[14:17] Any change you make is first written to etcd. Other components read or watch etcd for updates. If etcd becomes unhealthy, the control plane loses its memory, which can break scheduling,
[14:32] reconciliation, and more. For diagnosis, confirm cluster membership and health, and that writes are succeeding.
[14:44] Now, moving forward, the kube-scheduler. The scheduler is responsible for choosing the best node for each pod. It filters out nodes that don't meet the requirements, then ranks the remaining nodes and picks the
[14:58] best match based on the scoring rules. It watches the API server for unscheduled pods, then assigns them. If pods are stuck in the Pending state, scheduler conditions and logs are the right place to
[15:12] look. Check why pods fail the filters or why the ranking picked a node that later rejected the pod. So if a pod is not scheduling, the control plane component you want to check is the kube-scheduler.
[15:29] Now let us wrap up. During troubleshooting, determine whether the issue exists at the control plane or the node level.
[15:41] The first step is to determine where to look: the control plane or the node level. The control plane has multiple components, so then we have to follow the component specific to each problem.
[15:56] See their configuration files, know the configuration file locations and the specific ports, and try to interpret the logs collected from different sources.
[16:09] If this helped, like and subscribe to the channel, and good luck for
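The filter-then-rank flow described here can be sketched as a toy model. This is an assumption-laden illustration, not the real kube-scheduler: nodes and pods are plain dicts, the only filter is free CPU and memory, and the scoring rule (prefer the most CPU left after placement, memory as tie-breaker) is invented for the example.

```python
def schedule(pod, nodes):
    """Return the chosen node name, or None if no node passes the filter."""
    # Filter: keep only nodes with enough free CPU (millicores) and memory (MiB).
    feasible = [
        node for node in nodes
        if node["free_cpu"] >= pod["cpu"] and node["free_mem"] >= pod["mem"]
    ]
    if not feasible:
        return None  # in a real cluster the pod would stay Pending

    # Score: toy rule, most CPU left after placement wins, memory breaks ties.
    def score(node):
        return (node["free_cpu"] - pod["cpu"], node["free_mem"] - pod["mem"])

    return max(feasible, key=score)["name"]


nodes = [
    {"name": "node-a", "free_cpu": 500, "free_mem": 1024},   # too little CPU
    {"name": "node-b", "free_cpu": 2000, "free_mem": 4096},
    {"name": "node-c", "free_cpu": 4000, "free_mem": 256},   # too little memory
    {"name": "node-d", "free_cpu": 3000, "free_mem": 2048},
]

pod = {"cpu": 1000, "mem": 512}
big_pod = {"cpu": 8000, "mem": 512}

print(schedule(pod, nodes))      # a node that passed the filter and scored best
print(schedule(big_pod, nodes))  # None: every node was filtered out
```

The `None` case mirrors the Pending symptom: when every node is filtered out, the useful question is which filter rejected each node, which is what the scheduler's conditions and logs report.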