The Kubernetes Scheduler places pods on the best available nodes. It watches for pods that lack a node assignment and runs a scheduling cycle of filtering, scoring, and binding. This article walks through its architecture, plugin model, policy configuration, and tuning tactics, with detailed code snippets and real-world scenarios, so that by the end you understand the scheduler's internals and how to customise them.
TL;DR
- Kube Scheduler watches unscheduled pods and places them on best-fit nodes.
- It runs plugins in filter, score, reserve, permit, prebind, bind phases.
- Supports node selectors, affinity, taints, tolerations, topology constraints.
- Customise via KubeSchedulerConfiguration YAML with plugin chains.
- Monitor metrics with Prometheus and tune retries and backoff.
- Use event-driven permit plugins for CI/CD gating and canary rollouts.
Kube Scheduler Overview
The Kube Scheduler runs in the control plane as a separate process. It connects to the API server and watches for pods without node assignments. It also watches nodes, persistent volumes, persistent volume claims, and storage classes to maintain an up-to-date cache. Shared informers and indexers keep this event handling efficient. The scheduling queue orders pods by priority and then by creation timestamp, so high-priority pods sit ahead of low-priority ones.
The scheduler enforces resource constraints: it honours CPU, memory, and ephemeral-storage requests when fitting pods onto nodes (limits are enforced later by the kubelet). You can shape decisions via nodeSelector, nodeAffinity, podAffinity, podAntiAffinity, and topology spread constraints. Taints and tolerations isolate workloads: pods automatically skip nodes whose taints they cannot tolerate.
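As an illustration, here is a sketch of a pod spec that combines these mechanisms; the disktype label and dedicated taint are hypothetical:
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    app: web
spec:
  containers:
  - name: web
    image: nginx
    resources:
      requests:
        cpu: "250m"
        memory: "256Mi"
  nodeSelector:
    disktype: ssd              # hypothetical node label
  tolerations:
  - key: "dedicated"           # hypothetical taint key
    operator: "Equal"
    value: "web"
    effect: "NoSchedule"
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web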
When no suitable node exists, the scheduler attempts preemption. It evaluates candidate nodes and selects the smallest set of lower-priority victim pods whose eviction would free enough resources. After eviction, the scheduler re-runs the cycle for the pending pod. If pods still fail to schedule, backoff delays retries and prevents tight loops.
How Kube Scheduler Works
The scheduling cycle comprises six main extension points: filter, score, reserve, permit, prebind, and bind. In filter, the scheduler removes nodes that violate pod constraints or lack resources. In score, it assigns numerical values to nodes based on plugin weights. Reserve temporarily marks resources on the chosen node. Permit can delay or reject binding based on external signals. Prebind runs final checks. Bind issues a binding call to the API server.
The scheduler loop runs concurrently via worker threads and uses the client-go library for API calls. Node updates trigger re-checks for waiting pods, which keeps the cached state fresh. Metrics such as scheduler_scheduling_attempt_duration_seconds and scheduler_schedule_attempts_total track performance.
Scheduling respects pod priority. When filtering finds no feasible node for a high-priority pod, the scheduler can preempt lower-priority pods. The preemption logic minimises the number and priority of victims, which helps critical workloads launch promptly.
Kube Scheduler Policies and Extensibility
You manage scheduler behaviour via the KubeSchedulerConfiguration API object. You define profiles, plugin chains, bind timeouts, and backoff parameters. The configuration supports fields such as leaderElection and clientConnection, as well as per-plugin arguments like the VolumeBinding plugin's bindTimeoutSeconds.
Sample KubeSchedulerConfiguration YAML:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
leaderElection:
  leaderElect: true
profiles:
- schedulerName: default-scheduler
  plugins:
    filter:
      enabled:
      - name: NodeResourcesFit
      - name: NodeAffinity
      - name: TaintToleration
    score:
      enabled:
      - name: NodeResourcesBalancedAllocation
      - name: InterPodAffinity
      disabled:
      - name: ImageLocality
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: LeastAllocated
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
You can disable default plugins or replace them. pluginConfig supports fine-grained arguments; for example, you can adjust the scoring strategy used for resource fitting. Queue backoff is tuned with the top-level podInitialBackoffSeconds and podMaxBackoffSeconds fields, which space out retry attempts and help under heavy load, as in the sketch below.
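A minimal sketch of backoff and scoring-scope tuning, assuming the top-level fields of the v1 component config:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
# Initial and maximum per-pod retry backoff, in seconds
podInitialBackoffSeconds: 1
podMaxBackoffSeconds: 10
# Optionally limit how many nodes are scored to cut cycle time on large clusters
percentageOfNodesToScore: 50
profiles:
- schedulerName: default-scheduler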
Scheduler Plugins
The scheduler framework uses Go-based plugins that implement the framework's extension-point interfaces (for example framework.FilterPlugin or framework.ScorePlugin). Each plugin registers for one or more extension points. You compile custom plugins into a scheduler binary, and that binary can serve multiple profiles with different plugin lists.
// myplugin.go
package myplugin

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// MyPlugin filters out nodes that lack the "high-speed" label.
type MyPlugin struct{}

// Compile-time check that MyPlugin implements the Filter extension point.
var _ framework.FilterPlugin = &MyPlugin{}

// Name returns the plugin name referenced in scheduler profiles.
func (pl *MyPlugin) Name() string { return "MyPlugin" }

// Filter allows pods only on nodes carrying the "high-speed" label.
func (pl *MyPlugin) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	node := nodeInfo.Node()
	if node == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}
	if _, ok := node.Labels["high-speed"]; ok {
		return framework.NewStatus(framework.Success)
	}
	return framework.NewStatus(framework.Unschedulable, "node is missing the high-speed label")
}
Deploy the custom scheduler:
kubectl apply -f custom-scheduler-config.yaml
kubectl create deployment custom-scheduler \
--image=custom-scheduler:latest \
--port=10251
Use this scheduler by setting spec.schedulerName in your pod YAML. The scheduler plugin chain applies filter, score, and bind logic based on your code. You can integrate external systems in the permit phase to pause binding until manual approval or CI tests finish.
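For example, a pod that opts into the custom scheduler (the schedulerName must match a profile served by your binary):
apiVersion: v1
kind: Pod
metadata:
  name: fast-pod
spec:
  schedulerName: custom-scheduler    # must match the custom profile name
  containers:
  - name: app
    image: nginx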
Real-world Use Cases
Pod priority classes ensure critical services start first. You can define classes via PriorityClass objects. The scheduler orders pending pods by their priority value and can preempt lower-priority pods when resources are scarce.
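A sketch of a PriorityClass and a pod that references it; the names and value are illustrative:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-service
value: 1000000
globalDefault: false
description: "For latency-critical services"
---
apiVersion: v1
kind: Pod
metadata:
  name: payments-api
spec:
  priorityClassName: critical-service
  containers:
  - name: api
    image: payments:latest        # illustrative image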
In hybrid cloud setups, use multiple profiles. Each profile maps to distinct VM types or GPU nodes, and pods specify the matching schedulerName. This isolates AI workloads from general compute, prevents resource contention, and optimises cost.
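A sketch of a two-profile configuration; the gpu-scheduler profile name and its bin-packing scoring strategy are illustrative assumptions:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
- schedulerName: gpu-scheduler        # hypothetical profile for GPU workloads
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated           # bin-pack accelerator nodes instead of spreading
        resources:
        - name: nvidia.com/gpu
          weight: 5
        - name: cpu
          weight: 1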
Event-driven scheduling integrates permit plugins. For example, pause binding until approval from a downstream tool. Or integrate with cloud APIs to provision new nodes on demand before binding. This dynamic flow enables just-in-time capacity allocation.
For stateful workloads, the scheduler cooperates with the StatefulSet controller and respects podManagementPolicy and volume topology. It checks the volume binding mode: Immediate or WaitForFirstConsumer. Under WaitForFirstConsumer, volumes are provisioned only once the scheduler selects a node for the pod, which ensures persistent volumes land in the correct zone.
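For example, a StorageClass that delays binding until a pod is scheduled; the provisioner shown is illustrative:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-zonal
provisioner: ebs.csi.aws.com          # illustrative CSI provisioner
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete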
Monitoring and Performance Tuning
Prometheus can scrape scheduler metrics such as scheduling-attempt latency and attempt counts. A sample alert on average scheduling latency:
# Prometheus alert rule
alert: HighSchedulingLatency
expr: rate(scheduler_scheduling_attempt_duration_seconds_sum[5m]) / rate(scheduler_scheduling_attempt_duration_seconds_count[5m]) > 0.5
for: 10m
labels:
  severity: critical
annotations:
  summary: "Scheduler latency high for 10 minutes"
Enable debug logs with --v=4. The logs show filter and score plugin decisions, which helps troubleshoot unexpected placements. Reducing the plugin list speeds up scheduling under load: each plugin adds CPU and memory overhead, so remove unused plugins to improve cycle time, as sketched below.
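As a sketch, a trimmed profile can disable all default score plugins with the "*" wildcard and re-enable only what you need; the profile name is hypothetical:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: lean-scheduler        # hypothetical profile name
  plugins:
    score:
      disabled:
      - name: "*"                      # turn off all default score plugins
      enabled:
      - name: NodeResourcesFit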
For high-scale clusters, run multiple scheduler replicas with leader election so only one scheduler is active at a time. The additional replicas stand by, reducing downtime during upgrades and failures. Since only the leader schedules, scale the active scheduler's CPU and memory with cluster size rather than relying on extra replicas for throughput.