Kubernetes Architecture Internals
Understanding Kubernetes architecture is not academic — it is the difference between debugging a problem in minutes versus hours. When a pod is stuck in Pending, you need to know that the scheduler could not find a node with enough resources. When a service is unreachable, you need to understand how kube-proxy programs iptables rules. When etcd runs out of disk, you need to know that the entire cluster stops.
This page dissects every component of a Kubernetes cluster.
Architecture Overview
Control Plane Components
The control plane makes global decisions about the cluster (scheduling, detecting and responding to events). In production, control plane components run across multiple machines for high availability.
kube-apiserver
The API server is the front door to Kubernetes. Every operation — creating a pod, scaling a deployment, reading logs — goes through the API server. It is the only component that talks to etcd directly.
What it does:
- Validates and admits requests — checks that incoming API requests are well-formed and authorized
- Persists state to etcd — writes the desired state to the cluster's database
- Serves the Kubernetes API — RESTful HTTP API, used by kubectl, client libraries, and other components
- Watches — provides a watch mechanism so components can be notified of changes
How it processes a request:
kubectl apply -f pod.yaml
│
▼
┌──────────────────────┐
│ 1. Authentication │ Who are you? (x509 cert, bearer token, OIDC)
├──────────────────────┤
│ 2. Authorization │ Are you allowed? (RBAC, ABAC, Webhook)
├──────────────────────┤
│ 3. Admission Control│ Should we modify/reject? (Mutating, Validating)
├──────────────────────┤
│ 4. Validation │ Is the object valid? (schema, constraints)
├──────────────────────┤
│ 5. Persist to etcd │ Write the object to the database
├──────────────────────┤
│ 6. Notify watchers │ Tell scheduler, controllers about the new object
└──────────────────────┘Key configuration flags:
# Common kube-apiserver flags (managed clusters handle this for you)
--etcd-servers=https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379
--service-cluster-ip-range=10.96.0.0/12
--service-node-port-range=30000-32767
--enable-admission-plugins=NamespaceLifecycle,NodeRestriction,LimitRanger,ServiceAccount,DefaultStorageClass,ResourceQuota,MutatingAdmissionWebhook,ValidatingAdmissionWebhook
--authorization-mode=Node,RBAC
--audit-log-path=/var/log/kubernetes/audit.log
--audit-policy-file=/etc/kubernetes/audit-policy.yaml
--encryption-provider-config=/etc/kubernetes/encryption-config.yamletcd
etcd is a distributed key-value store that holds all cluster state. Every object in Kubernetes — every pod, service, secret, config map — is stored in etcd as a key-value pair.
What it stores:
/registry/pods/default/nginx-abc123
/registry/services/default/kubernetes
/registry/deployments/default/web
/registry/secrets/kube-system/admin-token
/registry/configmaps/default/app-config
/registry/events/default/nginx-abc123.event1Why etcd matters:
- If etcd loses data, you lose your entire cluster state
- If etcd is slow, the API server is slow, and everything feels broken
- etcd uses the Raft consensus algorithm — it needs a quorum (majority) of members to function
- For HA, run 3 or 5 etcd members (always odd numbers for quorum)
Performance considerations:
- etcd is I/O bound — use SSDs, always
- Default storage limit is 2 GB (configurable to 8 GB)
- Compact and defragment regularly
- Monitor
etcd_server_slow_apply_totalandetcd_disk_wal_fsync_duration_seconds
Backup and restore:
# Snapshot
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/etcd/ca.crt \
--cert=/etc/etcd/server.crt \
--key=/etc/etcd/server.key
# Verify snapshot
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-20240115.db
# Restore (stop kube-apiserver first)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-20240115.db \
--data-dir=/var/lib/etcd-restoredkube-scheduler
The scheduler watches for newly created pods that have no node assigned and selects a node for them to run on.
Scheduling algorithm:
New Pod (no node assigned)
│
▼
┌────────────────────────┐
│ 1. Filtering │ Eliminate nodes that cannot run the pod
│ - Enough CPU/memory │
│ - Node selector │
│ - Affinity rules │
│ - Taints/tolerations│
│ - PV availability │
├────────────────────────┤
│ 2. Scoring │ Rank remaining nodes
│ - Least requested │
│ - Balance resource │
│ - Spread topology │
│ - Node affinity │
│ - Image locality │
├────────────────────────┤
│ 3. Binding │ Assign pod to highest-scored node
└────────────────────────┘Filtering predicates (must all pass):
| Predicate | What It Checks |
|---|---|
PodFitsResources | Node has enough CPU and memory |
PodFitsHostPorts | Required host ports are available |
PodMatchNodeSelector | Node labels match pod's nodeSelector |
PodToleratesNodeTaints | Pod tolerates the node's taints |
CheckVolumeBinding | Required PersistentVolumes can be bound |
NoVolumeZoneConflict | Volumes are in the same zone as the node |
Scoring priorities (weighted):
| Priority | What It Favors |
|---|---|
LeastRequestedPriority | Nodes with the most free resources |
BalancedResourceAllocation | Nodes where CPU/memory usage is balanced |
SelectorSpreadPriority | Spreading pods across nodes |
InterPodAffinityPriority | Nodes matching inter-pod affinity rules |
ImageLocalityPriority | Nodes that already have the container image |
kube-controller-manager
The controller manager runs a set of controllers, each responsible for maintaining one aspect of the cluster's desired state. A controller is a control loop that watches the state of the cluster through the API server and makes changes to move the current state toward the desired state.
Key controllers:
| Controller | What It Does |
|---|---|
| Deployment controller | Creates/updates ReplicaSets to match the Deployment spec |
| ReplicaSet controller | Creates/deletes Pods to match the ReplicaSet's replicas count |
| StatefulSet controller | Manages ordered pod creation/deletion for StatefulSets |
| DaemonSet controller | Ensures one pod per node (or per matching node) |
| Job controller | Creates pods for Jobs, tracks completion |
| CronJob controller | Creates Jobs on schedule |
| Node controller | Monitors node health, marks nodes as NotReady |
| Service account controller | Creates default ServiceAccounts for namespaces |
| Endpoint controller | Populates the Endpoints object for Services |
| Namespace controller | Cleans up resources when a namespace is deleted |
| Garbage collector | Deletes dependent objects (e.g., pods when ReplicaSet is deleted) |
The control loop pattern:
for {
desired := getDesiredState() // Read from API server (Deployment spec)
current := getCurrentState() // Read from API server (actual pods)
diff := compare(desired, current)
if diff != empty {
makeChanges(diff) // Create/delete/update via API server
}
sleep(syncPeriod)
}Every controller follows this pattern. The Deployment controller does not create pods directly — it creates a ReplicaSet, and the ReplicaSet controller creates the pods. This separation of concerns is what makes Kubernetes extensible.
cloud-controller-manager
The cloud controller manager runs controllers that interact with the underlying cloud provider. It is separated from the core controller manager so that Kubernetes core code does not need cloud-specific logic.
Cloud controllers:
| Controller | What It Does |
|---|---|
| Node controller | Checks if a node still exists in the cloud after it stops responding |
| Route controller | Sets up routes in the cloud infrastructure |
| Service controller | Creates, updates, deletes cloud load balancers for Service type=LoadBalancer |
In managed Kubernetes (EKS, GKE, AKS), the cloud controller manager is handled by the provider.
Node Components
Every worker node runs these components:
kubelet
The kubelet is the primary node agent. It ensures that containers described in PodSpecs are running and healthy.
What it does:
- Registers the node with the API server
- Watches for pods assigned to its node
- Pulls container images (via the container runtime)
- Starts and monitors containers
- Reports pod and node status back to the API server
- Runs liveness, readiness, and startup probes
- Manages pod volumes and secrets
- Handles pod eviction when resources are low
Pod lifecycle from kubelet's perspective:
API Server assigns pod to this node
│
▼
kubelet receives pod spec
│
▼
Pull container images (if not cached)
│
▼
Create sandbox (pause container) for network namespace
│
▼
Set up pod networking (CNI plugin)
│
▼
Mount volumes
│
▼
Run init containers (sequentially)
│
▼
Start main containers (in parallel)
│
▼
Run startup probe (if defined)
│
▼
Run readiness/liveness probes continuously
│
▼
Report status to API serverNode allocatable resources:
Node Capacity
├── kube-reserved (kubelet, container runtime)
├── system-reserved (OS processes, sshd, etc.)
├── eviction-threshold (memory pressure buffer)
└── allocatable (available for pods)# Check allocatable resources
kubectl describe node <node-name> | grep -A 8 "Allocatable"kube-proxy
kube-proxy maintains network rules on nodes that allow network communication to pods from inside or outside the cluster. It implements the Service abstraction.
Modes:
| Mode | How It Works | Performance |
|---|---|---|
| iptables (default) | Programs iptables rules for each Service | Good for < 5,000 services |
| IPVS | Uses Linux IPVS for load balancing | Better for > 5,000 services |
| nftables | Uses nftables (newer Linux kernels) | Best performance, newest |
How a Service works (iptables mode):
Client Pod → ClusterIP (10.96.0.1:80)
│
▼
iptables DNAT rule randomly selects a backend pod
│
▼
Traffic forwarded to pod IP (10.244.1.5:8080)kube-proxy watches the API server for Service and Endpoints changes. When a new Service is created or a pod is added/removed, kube-proxy updates the iptables rules on every node.
Container Runtime
The container runtime is responsible for pulling images and running containers. Kubernetes supports any runtime that implements the Container Runtime Interface (CRI).
Common runtimes:
| Runtime | Description |
|---|---|
| containerd | Default in most distributions, extracted from Docker |
| CRI-O | Lightweight, designed specifically for Kubernetes |
| Docker (via cri-dockerd) | Legacy, adds unnecessary overhead |
Since Kubernetes 1.24, Docker is no longer directly supported as a runtime. containerd is the standard.
Admission Controllers
Admission controllers intercept requests to the API server after authentication and authorization but before the object is persisted. They can mutate (change) or validate (accept/reject) requests.
Built-in Admission Controllers
# Commonly enabled admission controllers
NamespaceLifecycle # Prevents operations on terminating namespaces
LimitRanger # Enforces LimitRange constraints
ServiceAccount # Creates default ServiceAccount for pods
DefaultStorageClass # Sets default StorageClass for PVCs
ResourceQuota # Enforces namespace resource quotas
NodeRestriction # Restricts what kubelets can modify
MutatingAdmissionWebhook # Calls external mutating webhooks
ValidatingAdmissionWebhook # Calls external validating webhooks
PodSecurity # Enforces Pod Security StandardsMutating Admission Webhooks
Mutating webhooks can modify objects before they are created:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
name: inject-sidecar
webhooks:
- name: inject-sidecar.example.com
clientConfig:
service:
name: sidecar-injector
namespace: system
path: /inject
caBundle: <base64-encoded-ca-cert>
rules:
- apiGroups: [""]
apiVersions: ["v1"]
operations: ["CREATE"]
resources: ["pods"]
namespaceSelector:
matchLabels:
sidecar-injection: enabled
admissionReviewVersions: ["v1"]
sideEffects: None
failurePolicy: FailThis pattern is used by Istio (injecting Envoy sidecars), Vault (injecting secret sidecars), and many other tools.
ValidatingAdmissionPolicy (Kubernetes 1.26+)
In-process validation without external webhooks:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
name: require-resource-limits
spec:
failurePolicy: Fail
matchConstraints:
resourceRules:
- apiGroups: [""]
apiVersions: ["v1"]
operations: ["CREATE", "UPDATE"]
resources: ["pods"]
validations:
- expression: "object.spec.containers.all(c, has(c.resources) && has(c.resources.limits))"
message: "All containers must have resource limits defined"
- expression: "object.spec.containers.all(c, c.resources.limits['memory'] != '' && c.resources.limits['cpu'] != '')"
message: "All containers must have both CPU and memory limits"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
name: require-resource-limits-binding
spec:
policyName: require-resource-limits
validationActions:
- Deny
matchResources:
namespaceSelector:
matchExpressions:
- key: kubernetes.io/metadata.name
operator: NotIn
values: ["kube-system", "kube-public"]What Happens When You Run kubectl apply
Here is the complete flow when you run kubectl apply -f deployment.yaml:
1. kubectl reads deployment.yaml and sends HTTP PUT to API server
2. API server authenticates (who is this?) via cert/token/OIDC
3. API server authorizes (can they do this?) via RBAC
4. Mutating admission webhooks run (might inject sidecars, add labels)
5. Object schema validation
6. Validating admission webhooks run (might reject if policy violated)
7. Object written to etcd
8. API server confirms creation to kubectl
9. Deployment controller sees new Deployment, creates ReplicaSet
10. ReplicaSet controller sees new ReplicaSet, creates Pod objects
11. Scheduler sees unscheduled pods, assigns nodes
12. kubelet on assigned node sees new pod, pulls image
13. kubelet starts containers via container runtime
14. kubelet runs startup probe, then readiness/liveness probes
15. Endpoints controller adds pod IP to Service endpoints
16. kube-proxy updates iptables rules on all nodes
17. Traffic can now reach the pod through the ServiceThis entire flow happens in seconds for a simple deployment. Understanding each step is what lets you debug the 1% of cases where something goes wrong.
Cluster Sizing Guidelines
| Cluster Size | etcd | Control Plane | Workers |
|---|---|---|---|
| Small (< 100 pods) | 3 members | 2 API servers | 3-10 nodes |
| Medium (< 1,000 pods) | 3-5 members | 3 API servers | 10-50 nodes |
| Large (< 5,000 pods) | 5 members | 3+ API servers | 50-200 nodes |
| Very Large (< 150,000 pods) | 5+ members | 5+ API servers | 200-5,000 nodes |
Kubernetes officially supports up to 5,000 nodes, 150,000 pods, and 300,000 containers per cluster.
What to Learn Next
- Pod Lifecycle — understand what happens inside each pod from creation to termination
- Services & Ingress — how networking works in Kubernetes
- Troubleshooting — use your architecture knowledge to debug real problems