
Common Kubernetes Real Time Challenges | 3 Production Scenarios

0h 33m video, transcribed Jun 7, 2026
Intermediate 6 min read For: DevOps engineers and Kubernetes practitioners with basic cluster experience looking to handle production challenges.

AI Summary

DevOps engineer Abhishek discusses three real-time Kubernetes challenges in production: resource sharing across namespaces, handling OOMKilled errors, and cluster upgrades. He emphasizes that interviewers often ask about real-world problems to gauge practical experience.

[01:12]
Resource Sharing Challenge

Multiple teams share a single production cluster. Without resource quotas, a memory-leaking pod in one namespace can starve other namespaces, causing crashes.

[07:24]
Solution: Resource Quotas and Limits

Set resource quotas per namespace to limit total CPU/memory usage. Then set resource limits per pod to restrict individual pod consumption, reducing blast radius from cluster to pod.

[19:02]
Handling OOMKilled Errors

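When a pod is still OOMKilled after quotas and limits are in place, take thread dumps (kill -3) and heap dumps from the Java process and share them with developers for root cause analysis. The video names jstack for the heap dump; jstack actually prints thread dumps, and heap dumps are normally taken with jmap or jcmd.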

[25:14]
Upgrade Challenges

Cluster upgrades require careful planning: read the release notes for breaking changes, take backups, upgrade the control plane components in order (etcd, API server, scheduler), then upgrade the worker nodes one at a time by draining and cordoning the node, upgrading the kubelet, and uncordoning it so it rejoins the cluster.
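The backup and control plane steps could be sketched roughly as follows for a kubeadm cluster, one of the two cluster types the video mentions. This is an illustration under assumptions, not the speaker's manual: the target version `v1.30.0`, the snapshot path, and the etcd certificate paths (kubeadm's default locations) are placeholders, and the commands are meant to run as root on a control plane node. With kubeadm, `kubeadm upgrade apply` upgrades the control plane components together, so the ordering is handled by the tool. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
set -eu

DRY_RUN="${DRY_RUN:-1}"                                # 1 = print commands only
TARGET_VERSION="${TARGET_VERSION:-v1.30.0}"            # hypothetical target version
SNAPSHOT="${SNAPSHOT:-/var/backups/etcd-snapshot.db}"  # hypothetical backup path

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Take an etcd snapshot before touching anything.
# Certificate paths are the kubeadm defaults; adjust for other setups.
backup_etcd() {
  run env ETCDCTL_API=3 etcdctl snapshot save "$SNAPSHOT" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key
}

# Show what kubeadm would upgrade, then upgrade the control plane.
upgrade_control_plane() {
  run kubeadm upgrade plan
  run kubeadm upgrade apply "$TARGET_VERSION"
}

backup_etcd
upgrade_control_plane
```

Reviewing the dry-run output against the release notes before setting `DRY_RUN=0` fits the "detailed manual" approach described in the video.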

Mastering resource management, OOMKilled diagnosis, and upgrade procedures demonstrates real-world Kubernetes expertise that interviewers value.

Clickbait Check

95% Legit

"Title accurately promises three production scenarios and delivers detailed, practical explanations."

Tutorial Checklist

1 07:24 Create a ResourceQuota object for each namespace to limit total CPU and memory.
2 13:16 Set resource limits on each pod/deployment to cap individual pod resource usage.
3 21:25 When a pod is OOMKilled, log into the pod and take a thread dump (kill -3) and a heap dump for Java apps (the video says jstack; jmap or jcmd is the usual heap-dump tool).
4 27:31 Before upgrade, create a detailed manual including backup steps and release notes review.
5 29:58 Upgrade worker nodes: drain node, cordon/unschedule, upgrade kubelet, then uncordon and rejoin.

Study Flashcards (5)

What Kubernetes resource limits total CPU/memory for all pods in a namespace?

easy

ResourceQuota

07:54

What is the blast radius if only resource quotas are set but no pod resource limits?

medium

The blast radius is reduced to the namespace level.

12:04

What command can be used to take a thread dump in a Java pod?

medium

kill -3

21:48

What is the correct order to upgrade control plane components?

hard

etcd, then kube-apiserver, then scheduler.

29:20

What steps are performed on a worker node during upgrade?

hard

Drain, cordon/unschedule, upgrade kubelet, then uncordon and rejoin.

29:58

🔥 Best Moments

💡

Blast Radius Concept

Explains how resource quotas reduce blast radius from cluster to namespace, a key insight for cluster stability.

12:04
🤯

Thread Dump for OOMKilled

Practical debugging step: taking thread dumps and heap dumps to help developers fix memory leaks.

21:25
💡

Release Notes Importance

Emphasizes reading release notes before upgrades to avoid breaking changes, a commonly missed step.

28:07

Full Transcript


[00:00] Hello everyone my name is Abhishek and welcome back to my channel. In today's video let's talk about three real-time challenges that DevOps engineers face with their Kubernetes clusters in the production environment.

[00:16] Now this is very very important because a lot of times when you give the interviews interviewers ask about the real-time challenges that you have faced

[00:28] while working on Kubernetes and if you cannot explain the answer to this question there are chances that interviewer might feel you don't have real-time working experience with Kubernetes.

[00:42] So please watch this video till the end. I will take three real-time challenges that DevOps engineers face. Almost every DevOps engineer faces these challenges in their organization.

[00:55] I will explain them in as much detail as possible. Let's start with the most common challenge that almost every DevOps engineer faces in their organization, that is resource sharing.

[01:12] So when you have a Kubernetes cluster, it can be in the dev environment, it can be in the staging environment or it can be in the production environment.

[01:27] How do you allocate the resources of this Kubernetes cluster between multiple development teams? Right. End of the day, each development team will not have their own Kubernetes cluster in the

[01:43] production environment. Right. You have multiple microservices in your organization deploying to the same Kubernetes cluster or you have multiple project teams within your organization.

[01:59] But when you talk about a production-level Kubernetes cluster, all of them are using the same cluster. So as a DevOps engineer, how do you organize this Kubernetes cluster?

[02:13] How do you share the resources and allocate the resources of this Kubernetes cluster between the development teams or the project teams? So one obvious answer

[02:28] that you might be talking about is: Abhishek, what I will do is I will create multiple namespaces, right? So if you are talking about, let's say, an e-commerce

[02:42] application, so in the e-commerce application you have a web team which is taking care of, let's say, the login, logout, all the front-end work, and

[02:56] then you have probably the payments team which is taking care of the payments related application micro service then you have let's say the transactions team

[03:08] or the delivery shipment team so for each team you can say that Abhishek I will be creating the namespaces so here comes the actual challenge when you

[03:21] create namespaces, let's assume this Kubernetes cluster altogether, which has multiple worker nodes, all the worker nodes together, let's say, has 100 CPU and 100 GB RAM.

[03:41] So for the purpose of making it easy, assume there are five worker nodes on this Kubernetes cluster, and each worker node has 20 CPU and 20 GB RAM. This is just to keep it simple; usually it will be like 32 or 64 GB RAM per worker node.

[04:00] So now there is 100 CPU and 100 GB RAM. And if you let's say do not split the resources between the namespaces.

[04:12] So you have just created the namespaces on this Kubernetes cluster and people started deploying their services onto this Kubernetes cluster. So the payments team, if this is their namespace, they started deploying their services onto this namespace.

[04:30] Other people have deployed their services onto the namespace. Now, assume there is one service or some services of the payments within the payments namespace.

[04:46] For some reason, they are leaking memory. What does leaking memory mean? That means they are consuming memory more than required for some reason.

[04:58] There can be some issue with this application or the payments team has not performed the performance related testing of this application. And whatever the reason it is, it started leaking the memory and it started using most of the resources of this Kubernetes cluster.

[05:18] So ideally, these applications let's say have to consume 2 GB RAM. But because of the leakage of memory and because multiple requests are coming to these services,

[05:31] what it started doing is it started using more amount of resources. Probably it took 32 GB RAM. So, these three services instead of taking 6 GB RAM, just for example, they started consuming 32 GB RAM.

[05:49] And what is the result of it? The other services will start seeing fewer resources on this cluster. So if this namespace itself is taking 40 GB RAM, then all the other namespaces will only be left with 60 GB RAM.

[06:08] Because of this, maybe one of the services in any of the namespaces might not get the resources and it might crash.

[06:21] It might go into CrashLoopBackOff because of the OOMKilled issue. Why did it get OOMKilled, that is, the out of memory killed error? Because some other application

[06:39] in some other namespace is consuming more resources. Now it is the responsibility of the DevOps engineer to make sure that you provide only

[06:52] the required amount of resources to a particular namespace. So, how do you do that? How do you provide only required amount of resources to a particular namespace?

[07:06] So for that, what can we do? Just a second. So as a DevOps engineer, to solve this issue, which is the most common thing in many organizations:

[07:24] for each namespace we have to come up with a resource quota. Even I faced this challenge in one of my previous organizations.

[07:38] So that's why I am trying to address this, and probably if you explain this in your interviews, the interviewer will be very much convinced that you have real-time exposure. So for each namespace you allocate the resource quota.

[07:54] Now what does resource quota mean? If you are not aware of this concept I will explain very quickly. Resource quota is a limit that you apply to the namespace. So if the resource quota that I set for this namespace is, at the most, 15 GB, then all the services within this namespace, it can be 5 services, 10 services,

[08:27] 20 services or even 100 services, together they cannot consume more than 15 GB RAM, right? So resource quota is a limit that you set on a particular namespace.

[08:42] So because of this now the namespace will only start using the provided RAM and CPU. Resource quota can be of both the CPU and RAM.
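A namespace quota like the one described could look roughly like the sketch below. The namespace name `payments` and the 15 CPU / 15Gi figures are taken from the running example; the object name, the file name, and the choice to cap both requests and limits are assumptions for illustration. The script only writes the manifest; it applies it to a cluster when run with `APPLY=1`.

```shell
#!/bin/sh
set -eu

# Hypothetical values for illustration: namespace "payments" capped at
# 15 CPU and 15Gi of memory, matching the numbers used in the example.
NAMESPACE="payments"
MANIFEST="${NAMESPACE}-quota.yaml"

cat > "$MANIFEST" <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ${NAMESPACE}-quota
  namespace: ${NAMESPACE}
spec:
  hard:
    requests.cpu: "15"      # sum of CPU requests across all pods in the namespace
    requests.memory: 15Gi   # sum of memory requests
    limits.cpu: "15"        # sum of CPU limits
    limits.memory: 15Gi     # sum of memory limits
EOF
echo "wrote $MANIFEST"

# Applying is opt-in (APPLY=1) so the sketch can be run without a cluster.
if [ "${APPLY:-0}" = "1" ]; then
  kubectl apply -f "$MANIFEST"
  kubectl describe resourcequota "${NAMESPACE}-quota" -n "$NAMESPACE"
fi
```

Once a quota on CPU or memory is active, new pods in that namespace generally have to declare requests and limits for those resources or they may be rejected, which is one reason quotas and per-pod limits are usually rolled out together.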

[08:56] You can set the limit with respect to CPU and with respect to RAM. So as a DevOps engineer, what you will do is you will talk to the development teams, ask

[09:08] the payments team: OK, what is the amount of resources that you require for all your microservices? They cannot say 40 GB, they cannot say 60 GB, right? What they should do is, at their end, they have to perform performance benchmarking, right?

[09:28] So when you approach the development teams, they have to do something called performance benchmarking, and using the performance benchmarking they have to come up with an ideal

[09:41] number. So they will say that ok Abhishek for this particular namespace we need 15 GB RAM and 15 CPU. So that will be the resource quota that you will set on this and probably this namespace

[09:54] will say we need 20 CPU and 20 GB RAM. This namespace might say 25 CPU and 25 GB RAM. Of course as a DevOps engineer you can question them.

[10:07] You can ask them that for 10 microservices probably why would you need so much thing. But end of the day if the development team is performing the performance benchmarking

[10:19] as per the standards of your organization, you cannot question them beyond that point. You should allocate them the required amount of resources. You can set up a meeting with them and ask them the questions, but end of the day they

[10:33] will provide you the required amount of resources. So perfect, you got the required amount of resources. If all the namespaces together are exceeding 100 CPU and 100 GB RAM,

[10:47] now again it is your responsibility to scale the cluster. Probably you can add one more worker node and make sure all the namespaces get the required amount of resources. Okay Abhishek, one challenge is solved. So as you said I will set up the

[11:04] resource quota, and because of the resource quota the challenge with namespaces is solved. But what if I tell you this is only 50% of the

[11:16] challenge? Now the other 50 percent that is left is: once you have allocated, let's say, 15 CPU and 15 GB RAM for this particular namespace, okay, assume there are five

[11:33] services in this namespace and you have allocated 15 GB RAM. Again the challenge is there is one microservice, as I have explained previously, and this particular microservice is leaking memory. Now what will happen is, previously it

[11:52] impacted the entire cluster but now it will impact one particular namespace. So the impact is still there but the impact is only restricted at this point of

[12:04] time to one particular namespace of the Kubernetes cluster. So the blast radius, blast radius is basically a term that we use in the industry: when something is causing an issue,

[12:18] what is the blast radius for it? In this sense the blast radius came down from cluster to namespace. So using resource quota you have reduced the blast radius as a DevOps engineer, but the impact still covers the whole namespace.

[12:34] For that, what you will do is you will set up resource requests and resource limits. Resource limits is the important thing actually, because resource requests is basically saying what is the minimum amount of resources that are required for your pod to run.

[12:59] But in our case we are talking about resource limits. So resource quota is the limit that is set on a particular namespace. Similarly resource limit is a limit that you set on a particular pod.

[13:16] Okay. We will also do this practically in one of the troubleshooting sessions. We are doing a troubleshooting playlist. I will show practically how to set up resource quota on namespace. It is just one single command but still I will show you and similarly I will also show

[13:32] you how to set up resource limits on a particular pod. I think we have already covered this but I will try to show again. So it is just adding few fields to your pod resource or your deployment resource.

[13:47] But what is the result of this? Previously we restricted the limit on a particular namespace. With resource limits we will restrict the limit on a particular pod. So again, as a DevOps engineer, you will go back to the development team and you will take the performance benchmarking of each application or each microservice.

[14:11] Okay you will keep the data of the performance benchmarking of each microservice and when you deploy through your scripts through your CI CD or when you write the YAML files for

[14:25] this microservices what you will do this time is for each microservice you will set up resource limits. On a rough scale for easy example let's assume you have 5 microservices you will say 3 GB

[14:42] RAM for this 3 GB RAM for this 3 GB RAM for this and 3 GB RAM for this. This is the worst way of saying I know but just for the easy example. In general it will not be like that probably this microservice might need 8 CPU 8 GB RAM

[14:56] whereas this microservice might only need 1 CPU and 1 GB RAM as per the performance benchmarking. So you will just take the details and depending upon their thing you will allocate that.
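The "few fields" added to a pod for requests and limits might look like the sketch below, using the smaller 1 CPU / 1 GB example. The pod name, namespace, image, and the request values are hypothetical placeholders, not taken from the video. As with any such manifest, it is only written to disk unless the script is run with `APPLY=1`.

```shell
#!/bin/sh
set -eu

MANIFEST="payments-api-pod.yaml"   # hypothetical file name

cat > "$MANIFEST" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: payments-api              # hypothetical pod name
  namespace: payments             # hypothetical namespace
spec:
  containers:
    - name: app
      image: registry.example.com/payments-api:1.0   # placeholder image
      resources:
        requests:                 # what the scheduler reserves for the container
          cpu: "500m"
          memory: 512Mi
        limits:                   # hard cap; going over the memory limit gets the container OOMKilled
          cpu: "1"
          memory: 1Gi
EOF
echo "wrote $MANIFEST"

# Applying is opt-in (APPLY=1) so the sketch can be run without a cluster.
if [ "${APPLY:-0}" = "1" ]; then
  kubectl apply -f "$MANIFEST"
fi
```

The same `resources` block goes under each container in a Deployment's pod template; the actual numbers would come from the team's performance benchmarking.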

[15:09] Now after that, even after allocating 8 GB memory for this particular pod, if it is leaking memory then you will simply say that OK, we have already created the namespace for you, we have created

[15:25] resource limit for you. If one of your pod is causing an issue and it is impacting the other microservices within your pod then you have to handle that.

[15:38] Who has to handle that? The development team. Now how will you help the development team with that? That is our scenario number 2. I will go to that point but before that I want everyone to understand this scenario.

[15:55] This is a real time challenge where for the namespaces you will use resource quota and for each pod you will set up resource limits. So that the blast radius is now only reduced to one of your pod.

[16:12] Let me draw this again and try to explain before we move to the next scenario. So previously, when you had this Kubernetes cluster, if there was a leaking pod, the blast radius

[16:24] of that pod was the entire Kubernetes cluster. You have the namespaces, of course, but the pod started consuming resources from the cluster, from all the nodes of your cluster, because of which the entire cluster became your blast radius.

[16:42] Once you set up resource quota the blast radius has come down to a particular namespace. Now because of this pod which is leaking memory what happened is the other pods

[17:00] within the namespace might not get the resources, okay? So the blast radius is that particular namespace. If you set up resource limits on this particular pod,

[17:12] where you say out of 15 GB memory only 8 GB memory is allocated to this pod, then once it has consumed 8 GB memory, if it goes beyond that, what will

[17:25] happen is the pod crashes. Now only this pod will crash. Previously, if you did not have the resource quota, some other pod in some other namespace might

[17:38] have crashed. If you don't have resource limits, probably any other pod in this namespace might have crashed. Once you set up both the resource quota on the namespace and the resource limits on that particular pod,

[17:52] now the blast radius has come down to that particular pod itself. So this pod is crashing and this pod is going out of memory, right?

[18:07] With this we have completed our real time scenario number one. So you can explain this the same way in the interview and what is the outcome of this explanation.

[18:19] You will simply say: as a DevOps engineer, when I joined the organization there was a cluster which was shared between multiple development teams. Because of one pod which was leaking memory, the entire cluster was impacted.

[18:34] We did not know which pod was going down or why it was going down, I mean which pod was going down in which namespace because of out of memory. So what I have done as a DevOps engineer: I have set up the resource quota on the namespaces and the resource limits on the pods, and because of that I can identify which pod is creating the issue, and the blast radius comes down to one particular pod.

[19:02] Now let's move to our scenario number 2. It is just a continuation of the same scenario. In the previous case I have explained that once you set up the resource quota

[19:16] and resource limit you have identified which pod is leaking memory, and you noticed that this particular pod is going into OOMKilled, which shows up as CrashLoopBackOff.

[19:37] As I have explained, CrashLoopBackOff is just a status, it is not an error. The actual error is OOMKilled, that is, out of memory killed.

[19:50] When you do kubectl get pods you will see the status as CrashLoopBackOff, but when you do kubectl describe on that particular pod you will see that the CrashLoopBackOff is because of OOMKilled.
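The two checks just described could be wrapped in small helper functions like the ones below. The function names are made up for this sketch, and the pod and namespace in the usage line are hypothetical. The third helper is an extra shortcut, assumed rather than taken from the video: it reads the last termination reason straight from the pod status instead of scanning the `describe` output.

```shell
#!/bin/sh
set -eu

# List pods and their status; a crashing pod typically shows CrashLoopBackOff.
pod_status() {
  local ns="$1"
  kubectl get pods -n "$ns"
}

# Full detail for one pod; look for "Last State: Terminated" with "Reason: OOMKilled".
pod_detail() {
  local pod="$1" ns="$2"
  kubectl describe pod "$pod" -n "$ns"
}

# Print "<container><TAB><reason>" for the last termination of each container.
# OOMKilled means the container was killed for running out of memory,
# typically because it went over its memory limit.
last_termination_reason() {
  local pod="$1" ns="$2"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Usage (hypothetical names):
#   pod_status payments
#   last_termination_reason payments-api payments
```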

[20:04] Now in this case what will you do? As a DevOps engineer you have noticed that one of your pods went into OOMKilled.

[20:16] This is another very, very important real-time challenge that you can explain to the interviewer. Or the interviewer themselves might ask you how you will resolve the OOMKilled issue on a

[20:31] pod, or how you resolve the issue of a pod going down because of memory issues. So what you can say is: I have noticed this issue on our production

[20:46] environment, or on our dev or staging environment. I have noticed that one of my pods is going down because of the OOMKilled issue. I have already set up the

[20:59] resource quota, I have already set up the resource limits, still the pod is going into OOMKilled even after giving 8 GB RAM as per the performance benchmarking

[21:12] of the development team so what I have done is let's assume this application is a Java application this microservice is a Java microservice so you can explain

[21:25] the interviewer that I went to this pod and I have shared the thread dump and also the heap dump. So you will log into the pod; for example, a thread dump is taken

[21:48] using the kill -3 command. If you go to the pod and execute the kill -3 command you can get the thread dump. If you execute a jstack command you can get the heap dump.

[22:04] So instead of explaining the commands I'm just explaining for your easiness. It depends upon the programming language to language. In some programming languages you cannot even take the thread dumps directly. But if you take Java as an example instead of complicating by explaining the commands

[22:22] You can simply tell them that I went to this pod and I have taken the thread dumps and heap dumps. And I have shared this thread dumps and heap dumps with the development team.
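For a Java pod, collecting the dumps might look like the sketch below. One correction to the commands named above: `jstack` prints a thread dump, while a heap dump is usually taken with `jcmd <pid> GC.heap_dump` or `jmap`. The sketch assumes the JVM runs as PID 1 in the container, that the image ships the JDK tools (`jstack`, `jcmd`) and `tar` (needed by `kubectl cp`), and that the pod is still running when the dump is taken. Function names and file names are made up for illustration.

```shell
#!/bin/sh
set -eu

# Thread dump via SIGQUIT (kill -3): the JVM writes the dump to its own stdout,
# so it lands in the container log rather than in this command's output.
thread_dump_signal() {
  local pod="$1" ns="$2" pid="${3:-1}"
  kubectl exec -n "$ns" "$pod" -- sh -c "kill -3 $pid"
  kubectl logs -n "$ns" "$pod" --since=1m > "${pod}-threads.log"
}

# Thread dump via jstack, captured directly into a local file.
thread_dump_jstack() {
  local pod="$1" ns="$2" pid="${3:-1}"
  kubectl exec -n "$ns" "$pod" -- jstack "$pid" > "${pod}-threads.txt"
}

# Heap dump via jcmd, written inside the container and then copied out.
heap_dump() {
  local pod="$1" ns="$2" pid="${3:-1}"
  kubectl exec -n "$ns" "$pod" -- jcmd "$pid" GC.heap_dump /tmp/heap.hprof
  kubectl cp "${ns}/${pod}:/tmp/heap.hprof" "${pod}-heap.hprof"
}

# Usage (hypothetical names):
#   thread_dump_jstack payments-api payments
#   heap_dump payments-api payments
```

The resulting files are what gets handed to the development team for analysis.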

[22:35] As a DevOps engineer probably you cannot do much more than this. What you can do is you can just share the thread dump and heap dump with them. They will understand which thread or which part of this microservice is leaking

[22:53] the memory and they will perform their analysis. They will create a Jira ticket or they will come up with a root cause analysis, and they will say that OK, this was the issue, let me fix it and let me come with a new version

[23:06] of the application. Okay and you will deploy the new version of the application on the cluster again. So this is your scope as a DevOps engineer: when you identify an out of memory killed

[23:20] microservice remember you must have set the resource quota and the resource limit. If you haven't set the resource quota and resource limit you might be seeing out of memory killed

[23:33] Because of any other microservice consuming more resources on the cluster and this particular microservice or pod not getting the resources. Probably the pod created on the cluster beyond 200 MB the pod was not getting the resources.

[23:51] So it got crashed. So only if you have set resource quota and resource limit, which is the primary thing as a DevOps engineer, and you still see OOMKilled on a particular pod, you will just

[24:05] go to that particular pod because you have already provided it required resources as per the performance benchmarking. You will take the thread dump and heap dump different programming languages have different

[24:18] commands. In some programming languages like Golang probably you cannot take the thread dump and heap dump directly, that is a different scenario. But let's cover the most common ones: in the Java programming language you will just take the thread dump and heap dump and share

[24:32] them with the developer. This is your responsibility. After that, if you still want to know what the developer does: they will analyze the threads, they will try to see why a particular thread is leaking memory, they will perform some in-depth analysis on that thread, come up with a new version and share it with you. So this is our scenario 2. This is how we handle the OOMKilled error.

[24:58] So you can explain this scenario as well after explaining the first scenario. Now let's move to the third scenario which you know again is a very common one.

[25:14] You can explain one of the challenges that I have faced in my organization is Upgrades. You know interviewer will always be convinced if you say upgrades and the

[25:31] other two scenarios as well because they are very very common. Almost every DevOps engineer in every organization will face the challenges like our

[25:43] scenario number one, which is resource splitting or sharing the resources. OOMKilled: probably there is no organization where a DevOps engineer has not

[25:56] seen the OOMKilled error. Similarly, there is no organization and no DevOps engineer who has not performed upgrades. It's very common. Today if you are using Kubernetes 1.29 on your cluster, you will have to upgrade to 1.30 at

[26:16] some point of time. So you have to go through the upgrade process. Of course I have not created the upgrade related video on my channel. I will do that very soon where step 1 right from what is required what are the prerequisites

[26:33] I will cover each and everything practically by taking an EKS cluster or a kubeadm cluster, whichever most people will benefit from.

[26:45] I will take one of these scenarios because there are slight changes in the steps. But I will try to explain the complete upgrade process end to end. But for now at a high level you can say one of our challenges was upgrades.

[27:01] So you know it is always challenging as a DevOps engineer when it comes to upgrades we have to be very careful. So how do we overcome this challenge is we have prepared a very detailed manual.

[27:17] This is how you can explain the interviewer as a DevOps engineer who has experience in upgrades. What I have done is I have created a manual with detailed steps.

[27:31] Our Kubernetes cluster, you can explain, is a kubeadm cluster or an EKS cluster. Because I have performed a couple of upgrades till now, I have prepared an end-to-end manual with detailed steps where I have documented things like how to take a backup before performing the upgrade.

[27:53] How to go through the release notes. You know people miss this step always during the interview which is very very important. Whenever there is a new version of any particular application it can be kubernetes it can be

[28:07] Argo CD it can be Istio anything they have a release notes which is very important to read because there might be a change in 1.30 of the kubernetes cluster without reading

[28:22] release notes if you perform the upgrade your cluster might go completely down. There might be a breaking change there might be a feature that was upgraded from beta to stable.

[28:34] Who knows you might not be using that feature and that might be deprecated also. Sorry you might be using a feature that might be deprecated also. So you should always read the release notes.

[28:46] Of course there might not be that significant an impact as your cluster going down, but still if you don't read the release notes you might run into some adverse effect. So I have created a manual with detailed steps which explains how to take the backup of resources,

[29:04] how to read the release notes, what exactly are the points in the release note that might impact our cluster. I have noted that down then you will say I have divided steps per control plane components

[29:20] and the worker nodes, the data plane components. In the control plane components I have detailed the steps on how to start with etcd and then how to upgrade the version of the kube-apiserver.

[29:38] Then how to upgrade the version of the scheduler. What are the steps? What is the order? I have noted that down. And then for the worker nodes or the data plane, the steps are basically, let's say you have multiple worker nodes.

[29:58] What you usually do in the organization, you know, these are the real-time steps. First you drain a node. When you have three nodes, first you will pick a node and you will drain the node.

[30:14] Draining means you will give this particular node time, or the scheduler time, to move the pods that are running on this particular node to a different node or to different nodes

[30:27] of your Kubernetes cluster. Once the pods are moved and the node is empty they don't have any resources running on it no microservices are running on the node then you will make the node unschedulable.

[30:40] You will do both of them at the same time. Okay. So you will taint the node or you will cordon the node. These are the terms, or you will make the node unschedulable.

[30:53] And now there will be no services that will be scheduled on this node. So now kube-scheduler will only have these two nodes that are working. So you will disconnect this node.

[31:06] What you will do is you will upgrade the kubelet. So you will move it to the new version and you will install the other new packages that are required on this particular node and then you will bring this node up.

[31:22] You will join the node again back to the Kubernetes cluster and this time you will remove the unschedulable taint and any other taint that you have applied. So now this node is running with 1.29 this node is running with 1.29 but this node is

[31:38] running with 1.30. Again you will keep both of these nodes active and you will cordon this node, make it unschedulable, drain the pods, again you will perform the activity and it will come to 1.30, and similarly you will perform it on this one.
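The per-node loop described above could be scripted roughly as follows for a kubeadm cluster. This is a sketch under assumptions: SSH access to each node, an apt-based distribution, and a `KUBE_PKG` package version that is left as a placeholder. `kubectl drain` cordons the node before evicting its pods, which matches doing both "at the same time". By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
set -eu

DRY_RUN="${DRY_RUN:-1}"                      # 1 = print commands only
KUBE_PKG="${KUBE_PKG:-<package-version>}"    # placeholder: package version from your repository

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Upgrade one worker node: drain (which also cordons), upgrade, uncordon.
upgrade_worker() {
  local node="$1"
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  # Node-local steps; ssh and apt are assumptions about the environment.
  run ssh "$node" "sudo apt-get install -y kubeadm=${KUBE_PKG}"
  run ssh "$node" "sudo kubeadm upgrade node"
  run ssh "$node" "sudo apt-get install -y kubelet=${KUBE_PKG}"
  run ssh "$node" "sudo systemctl daemon-reload && sudo systemctl restart kubelet"
  # Make the node schedulable again.
  run kubectl uncordon "$node"
}

# One node at a time, so the remaining nodes keep serving traffic.
for node in "$@"; do
  upgrade_worker "$node"
done
```

Called as `sh upgrade-workers.sh node-1 node-2 node-3` (hypothetical script and node names), it walks the nodes in order and prints the plan for each.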

[31:55] You can explain the upgrade challenges that you have. You can say it is very challenging in your environment and you have prepared a detailed manual with detailed steps. So these are the three real-time challenges that every DevOps engineer faces in their organization

[32:11] with respect to Kubernetes cluster. Even if you explain 70 to 80 percentage of what I have explained, you can easily convince the interviewer that you have real time working experience on Kubernetes and you are really

[32:25] good at solving the Kubernetes challenges. Thank you so much for watching today's video. Do let me know definitely in the comment section how did you find this video.

[32:39] Now I am trying to give you as many real time things as possible. I am also doing Kubernetes troubleshooting series. So most of the Kubernetes real time things are covered on this channel. I have recently done a video on Istio as well.

[32:53] Next I will be doing a video on advanced things of Kubernetes. So your feedback is very important for me. I look forward to your feedback in the comment section. Thank you so much for watching it.

[33:05] See you all in the next video. Take care. Bye-bye.
