SDD 0009 - Steward Cluster Agent
Author |
Marco Fretz |
Owner |
Simon Rüegg |
Reviewers (SIG) |
|
Date |
2019-10-18 |
Status |
implemented |
Summary
The Syn cluster agent is the central component which makes a Kubernetes cluster Syn managed. See SDD 0008 - Platform Configuration Management for what that means. This component allows us to import existing clusters to make them managed. It also serves as the "phone-home" component to have access to clusters without public API access. The agent should:
In the following the name flux could be substituted by another GitOps tool (for example ArgoCD). |
Motivation
We want to easily adopt new clusters into Syn. Additionally we need a component on each cluster to enable API access even without public access. We mainly got inspired by how Rancher and Weave Cloud do this.
Goals
-
Import new clusters to make them Syn managed
-
Provide K8s API access to non-public/non-exposed clusters
Non-Goals
-
Proxy outgoing requests for the catalog git repo
-
For now, a K8s service of type
ExternalName
is created to abstract the Git hostname away. In the future a proxy might be introduced through which this git repo can be accessed
-
-
Register public, non-registered clusters (adopting)
-
In the future a adoption process could be implemented to support the agent being hosted on public cloud marketplaces. For this scenario no credentials exist for the bootstrap process and a new cluster needs to be approved in an adoption process
-
-
Exact setup of the GitOps component
Design Proposal
Bootstrapping
To enable adoption of clusters without a public API the import of a new cluster consists of a generated YAML document which needs to be applied to the cluster. The YAML document includes all necessary resources (deployment, RBAC, API info & credentials) to start the agent on the cluster. Once the agent is running, it will connect to the Syn API to register itself. If the registration is successful the agent will bootstrap a GitOps tool pointing to the cluster’s catalog repository.
Proxying
The goal is to only require one outbound HTTPS (WebSocket) connection from the cluster to Syn API. This is helpful in enterprise environments where firewalls are involved.
Incoming
To enable access to each cluster’s Kubernetes API the agent is responsible to set up a connection to the Syn API which serves as a proxy. This could be achieved by using https://github.com/inlets/inlets[inlets] or by directly using the library behind it: https://github.com/rancher/remotedialer
Rancher Remote Dialer
The Rancher remote dialer is the component which is used by Rancher to solve this problem. Also the https://github.com/inlets/inlets[inlets] project is based on this implementation and generalises it more to act as a generic WebSocket proxy.
The client connects via WebSocket to the server and authenticates (plugable) himself. If the server allows the client a static WebSocket connection is kept open (ping/pong). The server keeps track of connected clients and allows others to make HTTP calls on behalf of the client.
Outgoing (Out of Scope)
It should be possible to access the cluster’s catalog repo from the cluster via the Syn API.
For now this is out of scope and only a service of type ExternalName
will be created in order to support a future implementation of such a proxy.
Ideas: WebSocket proxy like github.com/google/huproxy or github.com/jpillora/chisel/.
Risks and Mitigations
Flux Managing Itself
If flux is managing (update/deploy) itself, this creates a risk as a bad configuration might lead to flux failing and not recovering anymore. To mitigate this, the cluster agent should continuously monitor flux and in case of a failure it should try to fix it again. The same process as used during bootstrapping should be used to redeploy flux to a working state.
Security Considerations
As the GitOps tool takes an integral part of the configuration management system it needs to be secured appropriately. Due to the nature of how we use the tool, it needs to have high privileges on the cluster. To prevent any privilege escalation, access to the GitOps tool should be locked down.
The other critical aspect is the cluster’s catalog git repo. Controlling this repo gives the same privileges the GitOps tool has. Therefore the access to it needs to be properly secured. As a result only Git via SSH is used and strict host key checking must be enabled. During the bootstrap process, the cluster agent is responsible for the key exchange:
-
The public key of the GitOps tool needs to be configured on the catalog repo
-
The catalog repo’s host key needs to be configured in the tool
This guarantees strong authentication and authorization. To further harden the setup, https://docs.fluxcd.io/en/stable/references/git-gpg.html[git commit signing and verification] could be implemented in the future: the GitOps tool verifies all commits on the catalog repo to be signed by a trusted GPG key and all commits done by it (image updates, tags) are signed with a GPG key.
Drawbacks
One drawback of this approach is that there is a component (the agent) which needs to run in the cluster. If this component fails, it might break access to the whole cluster. If the cluster is accessed directly via it’s API this would not be the case.