- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Use Cases
- Test Plans
- Graduation Criteria
- (R) kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
- (R) KEP approvers have set the KEP status to `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- (R) Graduation criteria is in place
- (R) Production readiness review completed
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
This document describes the Kubernetes Scheduling Framework. The scheduling framework is a new set of "plugin" APIs being added to the existing Kubernetes Scheduler. Plugins are compiled into the scheduler, and these APIs allow many scheduling features to be implemented as plugins, while keeping the scheduling "core" simple and maintainable.
Note: Previous versions of this document proposed replacing the existing scheduler with a new implementation.
Many features are being added to the Kubernetes Scheduler, and each one makes the code larger and the logic more complex. A more complex scheduler is harder to maintain, its bugs are harder to find and fix, and users who run a custom scheduler have a hard time catching up with and integrating new changes. The current Kubernetes scheduler provides webhooks ("extenders") to extend its functionality. However, these are limited in a few ways:
- The number of extension points is limited: "Filter" extenders are called after the default predicate functions, "Prioritize" extenders after the default priority functions, and "Preempt" extenders after the default preemption mechanism. The "bind" verb of an extender is used to bind a Pod; only one extender can be the binding extender, and it performs binding instead of the scheduler. Extenders cannot be invoked at other points; for example, they cannot be called before the predicate functions run.
- Every call to an extender involves marshaling and unmarshaling JSON, and calling a webhook (an HTTP request) is slower than calling native functions.
- It is hard to inform an extender that the scheduler has aborted scheduling of a Pod. For example, suppose an extender provisions cluster resources, and the scheduler asks it to provision an instance of a resource for the pod being scheduled. If the scheduler then hits an error and decides to abort the scheduling, it is hard to communicate the error to the extender and ask it to undo the provisioning.
- Because extenders run as separate processes, they cannot use the scheduler's cache. They must either build their own cache from the API server or process only the information they receive from the default scheduler.
The above limitations hinder building high-performance and versatile scheduler features. We would ideally like an extension mechanism fast enough to allow existing features, such as predicate and priority functions, to be converted into plugins. Such plugins are compiled into the scheduler binary. Additionally, authors of custom schedulers can compile a custom scheduler using (unmodified) scheduler code and their own plugins.
- Make the scheduler more extendable.
- Make the scheduler core simpler by moving some of its features to plugins.
- Propose extension points in the framework.
- Propose a mechanism to receive plugin results and continue or abort based on the received results.
- Propose a mechanism to handle errors and communicate them to plugins.
- It is not a goal to solve all scheduler limitations, although we would like to ensure that the new framework allows us to address known limitations in the future.
- It is not a goal to provide implementation details of plugins and callback functions, such as all of their arguments and return values.
The Scheduling Framework defines new extension points and Go APIs in the Kubernetes Scheduler for use by "plugins". Plugins add scheduling behaviors to the scheduler, and are included at compile time. The scheduler's ComponentConfig will allow plugins to be enabled, disabled, and reordered. Custom schedulers can write their plugins "out-of-tree" and compile a scheduler binary with their own plugins included.
Each attempt to schedule one pod is split into two phases, the scheduling cycle and the binding cycle. The scheduling cycle selects a node for the pod, and the binding cycle applies that decision to the cluster. Together, a scheduling cycle and binding cycle are referred to as a "scheduling context". Scheduling cycles are run serially, while binding cycles may run concurrently. (See Concurrency)
A scheduling cycle or binding cycle can be aborted if the pod is determined to be unschedulable or if there is an internal error. The pod is returned to the queue and retried. If a binding cycle is aborted, it triggers the Unreserve method in the Reserve plugins.
The following picture shows the scheduling context of a pod and the extension points that the scheduling framework exposes. In this picture "Filter" is equivalent to "Predicate" and "Scoring" is equivalent to "Priority function". Plugins are registered to be called at one or more of these extension points. In the following sections we describe each extension point in the same order they are called.
One plugin may register at multiple extension points to perform more complex or stateful tasks.
These plugins are used to sort pods in the scheduling queue. A queue sort plugin essentially provides a `less(pod1, pod2)` function. Only one queue sort plugin may be enabled at a time.
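As an illustration, here is a minimal sketch of a queue sort plugin that orders pods by priority. The plugin name, the `podPriority` helper, and the assumption that `PodInfo` exposes the underlying `*v1.Pod` as a `Pod` field are hypothetical; the `Less` signature follows the interface shown later in this document.

```go
// PrioritySort is a hypothetical queue sort plugin that places
// higher-priority pods ahead of lower-priority ones in the queue.
type PrioritySort struct{}

func (*PrioritySort) Name() string { return "PrioritySort" }

// Less reports whether p1 should be dequeued before p2.
func (*PrioritySort) Less(p1, p2 *PodInfo) bool {
	return podPriority(p1.Pod) > podPriority(p2.Pod)
}

// podPriority treats pods without an explicit priority as priority 0.
func podPriority(pod *v1.Pod) int32 {
	if pod.Spec.Priority != nil {
		return *pod.Spec.Priority
	}
	return 0
}
```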
These plugins are used to pre-process info about the pod, or to check certain conditions that the cluster or the pod must meet. A pre-filter plugin must implement a `PreFilter` function; if `PreFilter` returns an error, the scheduling cycle is aborted. Note that `PreFilter` is called once per scheduling cycle.
A pre-filter plugin can implement the optional `PreFilterExtensions` interface, which defines `AddPod` and `RemovePod` methods to incrementally modify its pre-processed info. The framework guarantees that those functions will only be called after `PreFilter`, possibly on a cloned `CycleState`, and may call them more than once before calling `Filter` on a specific node.
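As a hedged sketch of that interface (the KEP does not fix the exact signatures, so the parameter lists below are illustrative):

```go
// PreFilterExtensions lets a pre-filter plugin incrementally update its
// pre-processed state when the scheduler evaluates hypothetical pod
// additions and removals, e.g. while simulating preemption.
type PreFilterExtensions interface {
	// AddPod updates the plugin's state to account for podToAdd running
	// on nodeName while podToSchedule is being scheduled.
	AddPod(state CycleState, podToSchedule, podToAdd *v1.Pod, nodeName string) *Status
	// RemovePod reverses the corresponding update for podToRemove.
	RemovePod(state CycleState, podToSchedule, podToRemove *v1.Pod, nodeName string) *Status
}
```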
These plugins are used to filter out nodes that cannot run the Pod. For each node, the scheduler will call filter plugins in their configured order. If any filter plugin marks the node as infeasible, the remaining plugins will not be called for that node. Nodes may be evaluated concurrently, and Filter may be called more than once in the same scheduling cycle.
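For example, a minimal filter plugin might reject nodes missing a required label. Everything here is hypothetical: the plugin, the label key, the `lookupNode` helper, and the `NewStatus`/`Unschedulable` status constructor; the `Filter` signature mirrors the style of the interfaces shown later.

```go
// RequiredLabelFilter is a hypothetical filter plugin that marks any node
// lacking the "example.com/schedulable" label as infeasible.
type RequiredLabelFilter struct{}

func (*RequiredLabelFilter) Name() string { return "RequiredLabelFilter" }

func (*RequiredLabelFilter) Filter(state CycleState, pod *v1.Pod, nodeName string) *Status {
	node := lookupNode(nodeName) // hypothetical helper backed by the scheduler cache
	if _, ok := node.Labels["example.com/schedulable"]; !ok {
		// A non-success status marks the node infeasible for this pod.
		return NewStatus(Unschedulable, "node is missing required label")
	}
	return nil // nil means the node passes this filter
}
```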
These plugins are called after the Filter phase, but only when no feasible nodes were found for the pod. Plugins are called in their configured order. If any PostFilter plugin marks the pod as schedulable, the remaining plugins are not called. A typical PostFilter implementation is preemption, which tries to make the pod schedulable by preempting other Pods.
This is an informational extension point for performing pre-scoring work. Plugins will be called with a list of nodes that passed the filtering phase. A plugin may use this data to update internal state or to generate logs/metrics.
These plugins have two phases:

- The first phase, called "score", is used to rank nodes that have passed the filtering phase. The scheduler calls the `Score` method of each scoring plugin for each node.
- The second phase, "normalize scoring", is used to modify scores before the scheduler computes a final ranking of nodes; in this phase, each score plugin receives the scores given by the same plugin to all nodes. `NormalizeScore` is called once per plugin per scheduling cycle, right after the "score" phase. Note that `NormalizeScore` is optional and can be provided by implementing the `ScoreExtensions` interface.
The output of a score plugin must be an integer in the range [MinNodeScore, MaxNodeScore]; if it is not, the scheduling cycle is aborted. This is the output after running the plugin's optional `NormalizeScore` function. If `NormalizeScore` is not provided, the output of `Score` must be in this range. After the optional `NormalizeScore`, the scheduler combines node scores from all plugins according to the configured plugin weights.
For example, suppose a plugin `BlinkingLightScorer` ranks nodes based on how many blinking lights they have.
```go
func (*BlinkingLightScorer) Score(state *CycleState, _ *v1.Pod, nodeName string) (int, *Status) {
	// getBlinkingLightCount is a hypothetical helper, assumed here to return
	// (int, *Status) so that its results can be returned directly.
	return getBlinkingLightCount(nodeName)
}
```
However, the maximum count of blinking lights may be small compared to `MaxNodeScore`. To fix this, `BlinkingLightScorer` should also implement `NormalizeScore`.
```go
func (*BlinkingLightScorer) NormalizeScore(state *CycleState, _ *v1.Pod, nodeScores NodeScoreList) *Status {
	// Find the highest raw score produced by Score.
	highest := 0
	for _, nodeScore := range nodeScores {
		highest = max(highest, nodeScore.Score)
	}
	if highest == 0 {
		// No node had any blinking lights; leave the zero scores as-is
		// rather than dividing by zero below.
		return nil
	}
	// Rescale so the best node gets MaxNodeScore.
	for i, nodeScore := range nodeScores {
		nodeScores[i].Score = nodeScore.Score * MaxNodeScore / highest
	}
	return nil
}
```
If either `Score` or `NormalizeScore` returns an error, the scheduling cycle is aborted.
A plugin that implements the Reserve extension has two methods, namely Reserve and Unreserve, that back two informational scheduling phases called Reserve and Unreserve, respectively. Plugins which maintain runtime state (aka "stateful plugins") should use these phases to be notified by the scheduler when resources on a node are being reserved and unreserved for a given Pod.
The Reserve phase happens before the scheduler actually binds a Pod to its designated node. It exists to prevent race conditions while the scheduler waits for the bind to succeed. The Reserve method of each Reserve plugin may succeed or fail; if one Reserve method call fails, subsequent plugins are not executed and the Reserve phase is considered to have failed. If the Reserve methods of all plugins succeed, the Reserve phase is considered successful and the rest of the scheduling cycle and the binding cycle are executed.
The Unreserve phase is triggered if the Reserve phase or a later phase fails. When this happens, the Unreserve method of all Reserve plugins will be executed in the reverse order of Reserve method calls. This phase exists to clean up the state associated with the reserved Pod.
Caution: The implementation of the Unreserve method in Reserve plugins must be idempotent and may not fail.
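To make this concrete, here is a hedged sketch of a stateful Reserve plugin. The plugin, its bookkeeping, and the method signatures are illustrative; a real implementation would also need locking, since binding cycles run concurrently.

```go
// VolumeReserver is a hypothetical stateful plugin that remembers which
// node each pod has reserved.
type VolumeReserver struct {
	reserved map[types.UID]string // pod UID -> node name
}

func (*VolumeReserver) Name() string { return "VolumeReserver" }

// Reserve records the reservation; returning a non-nil Status fails the
// Reserve phase and aborts the scheduling cycle.
func (v *VolumeReserver) Reserve(state CycleState, pod *v1.Pod, nodeName string) *Status {
	v.reserved[pod.UID] = nodeName
	return nil
}

// Unreserve undoes Reserve. Deleting an absent key is a no-op, which keeps
// the method idempotent, and it has no return value because it may not fail.
func (v *VolumeReserver) Unreserve(state CycleState, pod *v1.Pod, nodeName string) {
	delete(v.reserved, pod.UID)
}
```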
These plugins are used to prevent or delay the binding of a Pod. A permit plugin can do one of three things:

- **approve:** Once all permit plugins approve a pod, it is sent for binding.
- **deny:** If any permit plugin denies a pod, it is returned to the scheduling queue. This triggers the Unreserve method in Reserve plugins.
- **wait** (with a timeout): If a permit plugin returns "wait", the pod is kept in the permit phase until a plugin approves it. If a timeout occurs, "wait" becomes "deny" and the pod is returned to the scheduling queue, triggering the Unreserve method in the Reserve phase.
Permit plugins are executed as the last step of a scheduling cycle; however, waiting in the permit phase happens at the beginning of a binding cycle, before PreBind plugins are executed.
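A hedged sketch of a permit plugin follows. The return convention (a status plus a timeout), the `Wait`/`Unschedulable` status codes, and the quota helpers are all illustrative; the KEP does not pin down the exact signature.

```go
// QuotaGate is a hypothetical permit plugin that holds pods until an
// external quota system clears them.
func (q *QuotaGate) Permit(state CycleState, pod *v1.Pod, nodeName string) (*Status, time.Duration) {
	switch {
	case q.quotaAvailable(pod): // hypothetical helper
		return nil, 0 // approve: the pod is sent for binding
	case q.quotaPending(pod): // hypothetical helper
		// wait: keep the pod in the permit phase for up to 10 seconds;
		// on timeout, wait becomes deny and Unreserve is triggered.
		return NewStatus(Wait, "waiting for quota"), 10 * time.Second
	default:
		// deny: the pod is returned to the scheduling queue.
		return NewStatus(Unschedulable, "quota exhausted"), 0
	}
}
```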
Approving a pod binding
While any plugin can receive the list of reserved pods from the cache and approve them (see `FrameworkHandle`), we expect only the permit plugins to approve binding of reserved pods that are in the "waiting" state. Once a pod is approved, it is sent to the pre-bind phase.
These plugins are used to perform any work required before a pod is bound. For example, a pre-bind plugin may provision a network volume and mount it on the target node before allowing the pod to run there.
If any PreBind plugin returns an error, the pod is rejected and returned to the scheduling queue.
These plugins are used to bind a pod to a Node. Bind plugins will not be called until all PreBind plugins have completed. Each bind plugin is called in the configured order. A bind plugin may choose whether or not to handle the given Pod. If a bind plugin chooses to handle a Pod, the remaining bind plugins are skipped.
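The following hedged sketch shows one way a bind plugin could decline a pod so that the next bind plugin is tried; the `Skip` status code and the helpers are illustrative.

```go
// GPUBinder is a hypothetical bind plugin that only handles pods that
// request GPUs; all other pods are left to the remaining bind plugins.
func (b *GPUBinder) Bind(state CycleState, pod *v1.Pod, nodeName string) *Status {
	if !requestsGPU(pod) { // hypothetical helper
		return NewStatus(Skip, "") // decline; the next bind plugin is tried
	}
	// Bind the pod (e.g. by creating a Binding object via the API server);
	// a nil return means the pod was bound successfully.
	return b.createBinding(pod, nodeName) // hypothetical helper
}
```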
This is an informational extension point. PostBind plugins are called after a pod is successfully bound. This is the end of a binding cycle, and can be used to clean up associated resources.
There are two steps to the plugin API. First, plugins must register and get configured; then they use the extension point interfaces. Extension point interfaces have the following form:
```go
type Plugin interface {
	Name() string
}

type QueueSortPlugin interface {
	Plugin
	Less(*PodInfo, *PodInfo) bool
}

type PreFilterPlugin interface {
	Plugin
	PreFilter(CycleState, *v1.Pod) *Status
}

// ...
```
Most* plugin functions will be called with a `CycleState` argument. A `CycleState` represents the current scheduling context.
A `CycleState` will provide APIs for accessing data whose scope is the current scheduling context. Because binding cycles may execute concurrently, plugins can use the `CycleState` to make sure they are handling the right request.
The `CycleState` also provides an API similar to `context.WithValue` that can be used to pass data between plugins at different extension points. Multiple plugins can share the state or communicate via this mechanism. The state is preserved only during a single scheduling context. It is worth noting that plugins are assumed to be trusted; the scheduler does not prevent one plugin from accessing or modifying another plugin's state.
* The only exception is for queue sort plugins.
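A hedged sketch of how one plugin might use this mechanism across two extension points. The `Write`/`Read` methods and the string key are illustrative, chosen to echo the `context.WithValue` analogy; the `computeState` helper and error handling are hypothetical.

```go
const myPluginStateKey = "MyPlugin.preFilterState" // illustrative key

// PreFilter stashes pre-processed data in the CycleState for later phases.
func (p *MyPlugin) PreFilter(state CycleState, pod *v1.Pod) *Status {
	state.Write(myPluginStateKey, computeState(pod)) // hypothetical API
	return nil
}

// Filter retrieves the data stashed by PreFilter in the same scheduling context.
func (p *MyPlugin) Filter(state CycleState, pod *v1.Pod, nodeName string) *Status {
	data, err := state.Read(myPluginStateKey) // hypothetical API
	if err != nil {
		return NewStatus(Error, "pre-filter state missing")
	}
	_ = data // ... use data for the per-node feasibility check ...
	return nil
}
```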
WARNING: The data available through a `CycleState` is not valid after a scheduling context ends, and plugins should not hold references to that data longer than necessary.
While the `CycleState` provides APIs relevant to a single scheduling context, the `FrameworkHandle` provides APIs relevant to the lifetime of a plugin. This is how plugins can get a client (`kubernetes.Interface`) and `SharedInformerFactory`, or read data from the scheduler's cache of cluster state. The handle will also provide APIs to list and approve or reject waiting pods.
WARNING: `FrameworkHandle` provides access to both the kubernetes API server and the scheduler's internal cache. The two are not guaranteed to be in sync, and extreme care should be taken when writing a plugin that uses data from both of them.
Providing plugins access to the API server is necessary to implement useful features, especially when those features consume object types that the scheduler does not normally consider. Providing a `SharedInformerFactory` allows plugins to share caches safely.
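A hedged sketch of the surface described above; the method names are illustrative, not a final API.

```go
type FrameworkHandle interface {
	// ClientSet returns a client for talking to the API server.
	ClientSet() kubernetes.Interface
	// SharedInformerFactory lets plugins share informer caches safely.
	SharedInformerFactory() informers.SharedInformerFactory
	// IterateOverWaitingPods visits pods parked in the permit phase so a
	// plugin can approve or reject them.
	IterateOverWaitingPods(callback func(WaitingPod))
}
```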
Each plugin must define a constructor and add it to the hard-coded registry. For more information about constructor args, see Optional Args.
Example:
```go
type PluginFactory = func(runtime.Unknown, FrameworkHandle) (Plugin, error)

type Registry map[string]PluginFactory

func NewRegistry() Registry {
	return Registry{
		fooplugin.Name: fooplugin.New,
		barplugin.Name: barplugin.New,
		// New plugins are registered here.
	}
}
```
It is also possible to add plugins to a `Registry` object and inject that into a scheduler. See Custom Scheduler Plugins.
There are two steps to plugin initialization. First, plugins are registered. Second, the scheduler uses its configuration to decide which plugins to instantiate. If a plugin registers for multiple extension points, it is instantiated only once.
When a plugin is instantiated, it is passed config args and a `FrameworkHandle`.
There are two types of concurrency that plugin writers should consider. A plugin might be invoked several times concurrently when evaluating multiple nodes, and a plugin may be called concurrently from different scheduling contexts.
Note: Within one scheduling context, each extension point is evaluated serially.
In the main thread of the scheduler, only one scheduling cycle is processed at a time. Any extension point up to and including permit will be finished before the next scheduling cycle begins*. After the permit extension point, the binding cycle is executed asynchronously. This means that a plugin could be called concurrently from two different scheduling contexts, provided that at least one of the calls is to an extension point after permit. Stateful plugins should take care to handle these situations.
Finally, the Unreserve method in Reserve plugins may be called from either the main thread or the bind thread, depending on how the pod was rejected.
* The queue sort extension point is a special case. It is not part of a scheduling context and may be called concurrently for many pod pairs.
The scheduler's component configuration will allow for plugins to be enabled, disabled, or otherwise configured. Plugin configuration is separated into two parts.
- A list of enabled plugins for each extension point (and the order they should run in). If one of these lists is omitted, the default list will be used.
- An optional set of custom plugin arguments for each plugin. Omitting config args for a plugin is equivalent to using the default config for that plugin.
The plugin configuration is organized by extension points. A plugin that registers with multiple points must be included in each list.
```go
type KubeSchedulerConfiguration struct {
	// ... other fields
	Plugins      Plugins
	PluginConfig []PluginConfig
}

type Plugins struct {
	QueueSort  []Plugin
	PreFilter  []Plugin
	Filter     []Plugin
	PostFilter []Plugin
	PreScore   []Plugin
	Score      []Plugin
	Reserve    []Plugin
	Permit     []Plugin
	PreBind    []Plugin
	Bind       []Plugin
	PostBind   []Plugin
}

type Plugin struct {
	Name   string
	Weight int // Only valid for Score plugins
}

type PluginConfig struct {
	Name string
	Args runtime.Unknown
}
```
Example:
```json
{
  "plugins": {
    "preFilter": [
      {
        "name": "PluginA"
      },
      {
        "name": "PluginB"
      },
      {
        "name": "PluginC"
      }
    ],
    "score": [
      {
        "name": "PluginA",
        "weight": 30
      },
      {
        "name": "PluginX",
        "weight": 50
      },
      {
        "name": "PluginY",
        "weight": 10
      }
    ]
  },
  "pluginConfig": [
    {
      "name": "PluginX",
      "args": {
        "favorite_color": "#326CE5",
        "favorite_number": 7,
        "thanks_to": "thockin"
      }
    }
  ]
}
```
When specified, the list of plugins for a particular extension point is the only set enabled. If an extension point is omitted from the config, the default set of plugins is used for that extension point.
When relevant, plugin evaluation order is specified by the order the plugins appear in the configuration. A plugin that registers for multiple extension points can have different ordering at each extension point.
Plugins may receive arguments from their config with arbitrary structure. Because one plugin may appear in multiple extension points, the config lives in a separate list, `PluginConfig`.
For example,
```json
{
  "name": "ServiceAffinity",
  "args": {
    "LabelName": "app",
    "LabelValue": "mysql"
  }
}
```
```go
func NewServiceAffinity(args *runtime.Unknown, h FrameworkHandle) (Plugin, error) {
	if args == nil {
		return nil, errors.Errorf("cannot find service affinity plugin config")
	}
	if args.ContentType != "application/json" {
		return nil, errors.Errorf("cannot parse content type: %v", args.ContentType)
	}
	var config struct {
		LabelName, LabelValue string
	}
	if err := json.Unmarshal(args.Raw, &config); err != nil {
		return nil, errors.Wrap(err, "could not parse args")
	}
	//...
}
```
The current `KubeSchedulerConfiguration` kind has `apiVersion: kubescheduler.config.k8s.io/v1alpha1`. This new config format will be either `v1alpha2` or `v1beta1`. When a newer version of the scheduler parses a `v1alpha1` config, the "policy" section will be used to construct an equivalent plugin configuration.
Note: Moving `KubeSchedulerConfiguration` to `v1` is outside the scope of this design, but see also https://github.com/kubernetes/enhancements/blob/master/keps/sig-cluster-lifecycle/wgs/783-component-base/README.md and kubernetes/community#3008.
The Cluster Autoscaler will have to be changed to run Filter plugins instead of predicates. This can be done by creating a Framework instance and invoking `RunFilterPlugins`.
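A hedged sketch of that reuse; the `Framework` type and the `RunFilterPlugins`/`IsSuccess` signatures are illustrative.

```go
// nodeFitsAfterScaleUp is a hypothetical helper the Cluster Autoscaler could
// use: it runs the framework's filter plugins for pod against nodeName and
// reports whether the node would be feasible.
func nodeFitsAfterScaleUp(fw Framework, state CycleState, pod *v1.Pod, nodeName string) bool {
	status := fw.RunFilterPlugins(state, pod, nodeName) // illustrative signature
	return status.IsSuccess()
}
```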
These are just a few examples of how the scheduling framework can be used.
Functionality similar to kube-batch (sometimes called "gang scheduling") could be implemented as a plugin. For pods in a batch, the plugin would "accumulate" pods in the permit phase by using the "wait" option. Because the permit stage happens after reserve, subsequent pods will be scheduled as if the waiting pod is using those resources. Once enough pods from the batch are waiting, they can all be approved.
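A hedged sketch of the accumulate-then-approve idea; the gang-size lookup, the `WaitingPod` accessors, and the `Wait` status are all illustrative.

```go
// GangPermit is a hypothetical permit plugin implementing gang scheduling.
func (g *GangPermit) Permit(state CycleState, pod *v1.Pod, nodeName string) (*Status, time.Duration) {
	gang := gangName(pod) // hypothetical: e.g. derived from a pod label
	if g.waitingCount(gang)+1 < g.gangSize(gang) { // hypothetical bookkeeping
		// Not enough members yet: hold this pod in the permit phase.
		return NewStatus(Wait, "waiting for gang members"), 5 * time.Minute
	}
	// Quorum reached: release the waiting gang members via the
	// FrameworkHandle, then approve this pod.
	g.handle.IterateOverWaitingPods(func(wp WaitingPod) {
		if gangName(wp.GetPod()) == gang { // hypothetical accessor
			wp.Allow() // hypothetical approval method
		}
	})
	return nil, 0
}
```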
Topology-Aware Volume Provisioning can be (re)implemented as a plugin that registers for the filter and pre-bind extension points. At the filtering phase, the plugin can ensure that the pod will be scheduled in a zone that is capable of provisioning the desired volume. Then, at the PreBind phase, the plugin can provision the volume before letting the scheduler bind the pod.
The scheduling framework allows people to write custom, performant scheduler features without forking the scheduler's code. To accomplish this, developers just need to write their own `main()` wrapper around the scheduler. Because plugins must be compiled with the scheduler, writing a wrapper around `main()` is necessary in order to avoid modifying code in `vendor/k8s.io/kubernetes`.
```go
import (
	"fmt"
	"os"

	scheduler "k8s.io/kubernetes/cmd/kube-scheduler/app"
)

func main() {
	command := scheduler.NewSchedulerCommand(
		scheduler.WithPlugin("example-plugin1", ExamplePlugin1),
		scheduler.WithPlugin("example-plugin2", ExamplePlugin2))
	if err := command.Execute(); err != nil {
		fmt.Fprintf(os.Stderr, "%v\n", err)
		os.Exit(1)
	}
}
```
Note: The above code is an example, and might not match the latest implemented API.
The custom plugins would be enabled as normal plugins in the scheduler config, see Configuring Plugins.
The scheduling framework is expected to be backward compatible with the existing Kubernetes scheduler. As a result, we expect all the existing tests of the scheduler to pass during and after the framework is developed.
- Unit Tests
  - Each plugin developed for the framework is expected to have its own unit tests with reasonable coverage.
- Integration Tests
  - As we build extension points, we must add appropriate integration tests that ensure plugins registered at these extension points are invoked and that the framework processes their return values correctly.
  - If a plugin adds new functionality that didn't exist in the past, it must be accompanied by integration tests with reasonable coverage.
- End-to-end tests
  - End-to-end tests should be added for new scheduling features and plugins that interact with external components of Kubernetes. For example, if a plugin needs to interact with the API server and kubelets, end-to-end tests may be needed. End-to-end tests are not needed when integration tests can provide adequate coverage.
- Alpha (1.16)
  - Extension points for `Reserve` and `PreBind` are built.
  - Integration tests for these extension points are added.
- Beta (1.17)
  - All the extension points listed in this KEP and their corresponding tests are added.
  - Persistent dynamic volume binding logic is converted to a plugin.
- Stable (1.19)
  - Existing 'Predicate' and 'Priority' functions and preemption logic are converted to plugins.
  - No major bug in the implementation of the framework has been reported in the past three months.