SDD 0027 - Dynamic Cluster Facts
Author | Fabian Fischer
Owner | Fabian Fischer
Reviewers (SIG) | Tobias Brunner, Simon Gerber
Date | 2021-06-18
Status | implemented
Summary
This SDD documents the design of dynamic cluster facts exposed through the Lieutenant API.
Motivation
Dynamic facts are information about a Syn-managed cluster that can be determined from the state of the cluster and may change regularly. The main difference from static facts is that dynamic facts aren’t configured manually but are determined by the system itself. These dynamic facts can be used when compiling Commodore Components, which allows us to write Components that are better adapted to the cluster they’re deployed on.
Typical dynamic facts are:
- Kubernetes version
- Number of nodes
- Node details (node labels, names, etc.)
- Ingress objects (which hosts are they serving?)
Design Proposal
Dynamic facts are collected periodically by Steward and pushed to the Lieutenant API. The Lieutenant API stores these facts as part of the corresponding Cluster resource. The facts can then be accessed either through the REST API or directly from the Cluster resource.
Fact Collection
Steward periodically collects all dynamic facts in its cluster. How it collects them depends on the fact, but it usually involves reading various Kubernetes resources. The collected facts are then pushed to the Lieutenant API.
To be able to read all facts, Steward might need additional RBAC permissions.
REST API
The Lieutenant API and Steward communicate through the exposed REST API.
The actual push is performed through a PATCH of the Cluster resource.
Authorization is handled by Kubernetes RBAC and happens through the existing authentication method.
The current Cluster object in the API definition is extended to include dynamicFacts, which is of type object and can include arbitrary structured data.
Fact Store
The Lieutenant API stores the dynamic facts as part of the Cluster resource, specifically in the Cluster resource’s status. The status subresource is designed to hold the current state of an object, while the spec contains its desired state. Dynamic facts are a prime example of cluster state and as such should be stored in the status subresource.
We extend the Cluster resource with a status field facts, which contains a map of strings.
If the fact is structured, such as a list of nodes, it should be stored as JSON.
This approach is highly flexible: adding facts doesn’t require a CRD change.
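As a sketch of how this string map could be filled, assume the collected facts arrive as arbitrary Go values: plain strings are stored as-is, and structured values are JSON-encoded. ClusterStatus and toStatusFacts are hypothetical names, and the struct shows only the facts field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ClusterStatus is a hypothetical, trimmed-down Cluster status that only
// contains the facts field from this design.
type ClusterStatus struct {
	// Facts holds one entry per dynamic fact; structured facts are JSON strings.
	Facts map[string]string `json:"facts,omitempty"`
}

// toStatusFacts converts collected facts into the string map stored in the
// status. Strings are kept verbatim, everything else is JSON-encoded.
func toStatusFacts(facts map[string]any) (map[string]string, error) {
	out := make(map[string]string, len(facts))
	for name, value := range facts {
		if s, ok := value.(string); ok {
			out[name] = s
			continue
		}
		encoded, err := json.Marshal(value)
		if err != nil {
			return nil, fmt.Errorf("encoding fact %q: %w", name, err)
		}
		out[name] = string(encoded)
	}
	return out, nil
}

func main() {
	facts, err := toStatusFacts(map[string]any{
		"kubernetesVersion": "v1.20.1",
		"nodes": []map[string]any{
			{"name": "node1", "labels": map[string]string{"foo": "bar"}},
		},
	})
	if err != nil {
		panic(err)
	}
	out, _ := json.MarshalIndent(ClusterStatus{Facts: facts}, "", "  ")
	fmt.Println(string(out))
}
```

Consumers have to JSON-decode structured facts themselves, which is the readability cost weighed under Alternatives.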
Alternatives
Strongly Typed Status
The main advantage of the key-value design described earlier is its flexibility: it allows us to add and change fact types with minimal code changes. Its main disadvantage is its lack of structure, which makes the API inherently less stable and the Cluster resource status much less readable and harder to process.
An alternative to this key-value store design is to put the facts in a more rigid structure. By defining the structure of the dynamic facts as part of the Cluster resource status, we get a clear API definition.
The following could be a definition of such a status:
status:
  kubernetesVersion: v1.20.1
  nodes:
    - name: node1
      labels:
        foo: bar
    - name: node2
      labels:
        foo: bazz
  ingresses:
    - foo.vshn.ch
    - bar.vshn.net
  facts:
    foo: bar
    buzz: vshn
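For comparison, the status above could map to Go types like the following. The type names are hypothetical, and the example status is written as JSON so that it can be decoded with the standard library alone.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Node is a hypothetical typed representation of one node fact.
type Node struct {
	Name   string            `json:"name"`
	Labels map[string]string `json:"labels,omitempty"`
}

// ClusterStatus is a hypothetical strongly typed Cluster status.
type ClusterStatus struct {
	KubernetesVersion string            `json:"kubernetesVersion,omitempty"`
	Nodes             []Node            `json:"nodes,omitempty"`
	Ingresses         []string          `json:"ingresses,omitempty"`
	Facts             map[string]string `json:"facts,omitempty"`
}

// exampleStatus is the YAML status above, expressed as JSON.
const exampleStatus = `{
  "kubernetesVersion": "v1.20.1",
  "nodes": [
    {"name": "node1", "labels": {"foo": "bar"}},
    {"name": "node2", "labels": {"foo": "bazz"}}
  ],
  "ingresses": ["foo.vshn.ch", "bar.vshn.net"],
  "facts": {"foo": "bar", "buzz": "vshn"}
}`

func parseStatus(data []byte) (ClusterStatus, error) {
	var status ClusterStatus
	err := json.Unmarshal(data, &status)
	return status, err
}

func main() {
	status, err := parseStatus([]byte(exampleStatus))
	if err != nil {
		panic(err)
	}
	// Typed fields are usable directly, without decoding each fact.
	fmt.Println(status.KubernetesVersion, len(status.Nodes), status.Nodes[1].Labels["foo"])
}
```

Each new fact type would need a new field in these types and a regenerated CRD, which is the cost described next.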
The main disadvantage is that we need to change Steward, the CRD, and possibly the Lieutenant API whenever we add a new dynamic fact type. While adding a field to a CRD is generally not an issue, every new fact type still requires a coordinated change across several components.
Pulling Facts
The current design proposal uses a push model to get updated facts through the exposed REST API. Another approach would be a pull-based model with a design inspired by Prometheus.
In this design approach, the Lieutenant Operator pulls the dynamic facts from its managed clusters.
Steward exposes a simple /facts endpoint, which returns the facts as JSON.
Steward collects all necessary facts on demand when the endpoint is called.
The returned JSON is unmarshaled into the Cluster resource.
The main reason to use this pull approach is that it aligns better with Kubernetes design. It’s generally cleaner when the controller itself fetches the status of its managed resources instead of an external API pushing state into its resources. Another advantage is that pull-based systems are generally easier to debug.
The major disadvantage, however, is that the pull-based approach requires Steward to be reachable from the Lieutenant Operator. This isn’t always the case: currently we only require that Steward can reach the Lieutenant API, and firewalls often prevent us from accessing Steward directly.
TSDB
An alternative to storing the facts in the Cluster resource is to put them into a time series database.
The main advantage of this approach is that we would keep the complete history of all facts and when they changed.
The main disadvantage is that it introduces a completely new system and splits the available information about the cluster across two different locations. Further, the facts aren’t cleanly representable as a time series. Finally, the value of having the complete history of a fact is questionable; a simple last-modified timestamp would cover most use cases.