SDD 0009 - Steward Cluster Agent

Author

Marco Fretz

Owner

Simon Rüegg

Reviewers (SIG)

Date

2019-10-18

Status

implemented

Summary

The Syn cluster agent is the central component that makes a Kubernetes cluster Syn-managed. See SDD 0008 - Platform Configuration Management for what that means. This component allows us to import existing clusters so that they become managed. It also serves as the "phone-home" component that provides access to clusters whose API is not publicly reachable. The agent should:

Run after the cluster is created in the Inventory (later the inventory part could also be done by the agent)
Register the cluster with the Syn API (Inventory, Billing, etc.)
Provision the platform configuration management GitOps operator for the Git repo provisioned by the "Inventory"

In the following, the name Flux can be substituted with another GitOps tool (for example Argo CD).

Motivation

We want to easily adopt new clusters into Syn. Additionally, we need a component on each cluster to enable API access even when the cluster's API is not publicly reachable. The design is mainly inspired by how Rancher and Weave Cloud do this.

Goals

Import new clusters to make them Syn managed
Provide K8s API access to non-public/non-exposed clusters

Non-Goals

Proxy outgoing requests for the catalog git repo
- For now, a K8s service of type ExternalName is created to abstract the Git hostname away. In the future, a proxy might be introduced through which this Git repo can be accessed.
Register public, non-registered clusters (adopting)
- In the future, an adoption process could be implemented to support the agent being hosted on public cloud marketplaces. In this scenario no credentials exist for the bootstrap process, so a new cluster needs to be approved in an adoption process.
Exact setup of the GitOps component

Design Proposal

Bootstrapping

To enable adoption of clusters without a public API, importing a new cluster consists of generating a YAML document which needs to be applied to the cluster. The YAML document includes all necessary resources (deployment, RBAC, API info & credentials) to start the agent on the cluster. Once the agent is running, it connects to the Syn API to register itself. If the registration is successful, the agent bootstraps a GitOps tool pointing to the cluster’s catalog repository.
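As a rough illustration of what such a generated install document might contain, the sketch below renders the resource kinds listed above into one multi-document stream. All concrete names (the `syn` namespace, the `steward` resource names, the `SYN_API_URL` and `SYN_API_TOKEN` environment variables, the `cluster-admin` binding) are assumptions made for this example and are not taken from the actual implementation. Each document is emitted as single-line JSON, which YAML parsers accept, to keep the sketch free of third-party dependencies.

```python
import json


def install_manifest(api_url: str, token: str, image: str,
                     namespace: str = "syn") -> str:
    """Render everything needed to start the agent as one multi-document stream."""
    name = "steward"  # hypothetical resource name
    labels = {"app": name}
    docs = [
        {"apiVersion": "v1", "kind": "Namespace",
         "metadata": {"name": namespace}},
        {"apiVersion": "v1", "kind": "ServiceAccount",
         "metadata": {"name": name, "namespace": namespace}},
        # RBAC: binding to cluster-admin is an assumption; the agent needs
        # enough privileges to deploy the GitOps tool.
        {"apiVersion": "rbac.authorization.k8s.io/v1",
         "kind": "ClusterRoleBinding",
         "metadata": {"name": name},
         "roleRef": {"apiGroup": "rbac.authorization.k8s.io",
                     "kind": "ClusterRole", "name": "cluster-admin"},
         "subjects": [{"kind": "ServiceAccount", "name": name,
                       "namespace": namespace}]},
        # Credentials live in a Secret, not in the Deployment spec.
        {"apiVersion": "v1", "kind": "Secret",
         "metadata": {"name": name, "namespace": namespace},
         "stringData": {"token": token}},
        {"apiVersion": "apps/v1", "kind": "Deployment",
         "metadata": {"name": name, "namespace": namespace},
         "spec": {
             "replicas": 1,
             "selector": {"matchLabels": labels},
             "template": {
                 "metadata": {"labels": labels},
                 "spec": {
                     "serviceAccountName": name,
                     "containers": [{
                         "name": name,
                         "image": image,
                         "env": [
                             {"name": "SYN_API_URL", "value": api_url},
                             {"name": "SYN_API_TOKEN", "valueFrom": {
                                 "secretKeyRef": {"name": name,
                                                  "key": "token"}}},
                         ],
                     }],
                 },
             },
         }},
    ]
    # Prefix every document with the YAML document separator.
    return "".join("---\n" + json.dumps(doc) + "\n" for doc in docs)
```

The resulting text could then be applied on the target cluster, for example by piping it to `kubectl apply -f -`.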

GitOps

The agent should bootstrap a GitOps tool and connect it to the cluster’s catalog repo. In order to do that, the agent needs to provide the public part of an SSH key pair to the Syn API and receives the Git repo URL and the Git host's public key in return.
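The exchange described above could be sketched as follows. The JSON field names (`cluster`, `sshPublicKey`, `gitRepo.url`, `gitRepo.hostKey`) are hypothetical and only serve to show which data flows in which direction; the real Syn API schema may differ.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GitRepoInfo:
    """What the agent gets back from the API (hypothetical shape)."""
    url: str
    host_key: str


def build_register_request(cluster_id: str, ssh_public_key: str) -> str:
    """Build the registration body; only the public key ever leaves the cluster."""
    if "PRIVATE KEY" in ssh_public_key:
        raise ValueError("refusing to send private key material")
    return json.dumps({"cluster": cluster_id,
                       "sshPublicKey": ssh_public_key.strip()})


def parse_register_response(body: str) -> GitRepoInfo:
    """Extract the catalog repo URL and the Git host's public key."""
    repo = json.loads(body)["gitRepo"]
    return GitRepoInfo(url=repo["url"], host_key=repo["hostKey"])
```

The agent would hand `url` and `host_key` to the GitOps tool it deploys, while the private key stays in a secret on the cluster.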

Manifests

The manifests which the agent applies should be as minimal as possible: just enough to get the GitOps tool running once. The GitOps tool will then configure itself according to the catalog.

Proxying

The goal is to require only one outbound HTTPS (WebSocket) connection from the cluster to the Syn API. This is helpful in enterprise environments where firewalls are involved.

Incoming

To enable access to each cluster’s Kubernetes API, the agent is responsible for setting up a connection to the Syn API, which serves as a proxy. This could be achieved by using https://github.com/inlets/inlets[inlets] or by directly using the library behind it: https://github.com/rancher/remotedialer

Rancher Remote Dialer

The Rancher remote dialer is the component Rancher uses to solve this problem. The https://github.com/inlets/inlets[inlets] project is also based on this implementation and generalises it to act as a generic WebSocket proxy.

The client connects via WebSocket to the server and authenticates itself (the authentication mechanism is pluggable). If the server accepts the client, a persistent WebSocket connection is kept open (ping/pong). The server keeps track of connected clients and allows others to make HTTP calls on behalf of the client.
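To make the roles clearer, here is a toy in-process model of that behaviour. It is not the remotedialer API and opens no real WebSocket: a connected client is represented by a plain callback, and the class and method names (`TunnelServer`, `connect`, `request`) are invented for this sketch.

```python
from typing import Callable, Dict

# A connected client is modelled as a callback: (method, path) -> response body.
Handler = Callable[[str, str], str]
# Pluggable authentication: (client_id, token) -> allowed?
Authorizer = Callable[[str, str], bool]


class TunnelServer:
    """Tracks connected clients and forwards calls through their tunnel."""

    def __init__(self, authorizer: Authorizer) -> None:
        self._authorizer = authorizer
        self._sessions: Dict[str, Handler] = {}

    def connect(self, client_id: str, token: str, handler: Handler) -> bool:
        """Called when a client dials in; rejected clients get no session."""
        if not self._authorizer(client_id, token):
            return False
        self._sessions[client_id] = handler
        return True

    def disconnect(self, client_id: str) -> None:
        """Called when the connection drops (for example a missed pong)."""
        self._sessions.pop(client_id, None)

    def request(self, client_id: str, method: str, path: str) -> str:
        """Make an HTTP-like call that is executed on the client side."""
        handler = self._sessions.get(client_id)
        if handler is None:
            raise ConnectionError(f"cluster {client_id!r} is not connected")
        return handler(method, path)
```

In this model, a user of the Syn API would call `request("cluster-a", "GET", "/version")` and the call would be answered by whatever the agent on `cluster-a` registered as its handler, typically a forwarder to the local Kubernetes API.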

Outgoing (Out of Scope)

It should be possible to access the cluster’s catalog repo from the cluster via the Syn API. For now this is out of scope and only a service of type ExternalName will be created in order to support a future implementation of such a proxy.
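A Service of type ExternalName simply maps a cluster-internal DNS name to an external hostname. The sketch below builds such a manifest as a Python dictionary; the service name, namespace, and Git hostname passed in are placeholders chosen for the example.

```python
import json


def external_name_service(name: str, namespace: str, git_host: str) -> dict:
    """Build a Service that aliases an in-cluster name to the Git host."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "ExternalName",
            # In-cluster lookups of <name>.<namespace>.svc resolve to this host.
            "externalName": git_host,
        },
    }


if __name__ == "__main__":
    # Placeholder values, not taken from the actual implementation.
    print(json.dumps(external_name_service("catalog-git", "syn",
                                           "git.example.com"), indent=2))
```

Because workloads only reference the service name, a later proxy implementation could replace the ExternalName target without changing the GitOps tool's configuration.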

Ideas: WebSocket proxy like github.com/google/huproxy or github.com/jpillora/chisel/.

Risks and Mitigations

Flux Managing Itself

If Flux manages (updates/deploys) itself, there is a risk that a bad configuration causes Flux to fail without being able to recover. To mitigate this, the cluster agent should continuously monitor Flux and, in case of a failure, try to repair it. The same process as used during bootstrapping should be used to redeploy Flux to a working state.
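A minimal sketch of such a monitoring loop is shown below. The health check and the redeploy step are passed in as callbacks because their concrete implementation is not specified here; the function name, the interval, and the `max_checks` parameter (added so the loop can be bounded in tests) are assumptions of this example.

```python
import time
from typing import Callable, Optional


def watch_gitops(is_healthy: Callable[[], bool],
                 redeploy: Callable[[], None],
                 interval_s: float = 60.0,
                 max_checks: Optional[int] = None,
                 sleep: Callable[[float], None] = time.sleep) -> int:
    """Periodically check the GitOps tool and re-run the bootstrap on failure.

    Returns the number of redeploys performed. With max_checks=None the
    loop runs forever, as a long-lived agent would.
    """
    redeploys = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if not is_healthy():
            # Reuse the same procedure as the initial bootstrap.
            redeploy()
            redeploys += 1
        checks += 1
        sleep(interval_s)
    return redeploys
```

Reusing the bootstrap routine as the `redeploy` callback keeps a single code path for both the first installation and later repairs.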

Security Considerations

As the GitOps tool is an integral part of the configuration management system, it needs to be secured appropriately. Due to the nature of how we use the tool, it needs to have high privileges on the cluster. To prevent any privilege escalation, access to the GitOps tool should be locked down.

The other critical aspect is the cluster’s catalog Git repo. Controlling this repo gives the same privileges the GitOps tool has. Therefore, access to it needs to be properly secured. As a result, only Git via SSH is used and strict host key checking must be enabled. During the bootstrap process, the cluster agent is responsible for the key exchange:

The public key of the GitOps tool needs to be configured on the catalog repo
The catalog repo’s host key needs to be configured in the tool

This guarantees strong authentication and authorization. To further harden the setup, https://docs.fluxcd.io/en/stable/references/git-gpg.html[git commit signing and verification] could be implemented in the future: the GitOps tool verifies that all commits on the catalog repo are signed by a trusted GPG key, and all commits made by the tool itself (image updates, tags) are signed with a GPG key.
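On the cluster side, the key exchange described in this section boils down to a `known_hosts` entry for the catalog repo's host and an SSH invocation that refuses unknown hosts. The sketch below shows one way to produce both, using the standard OpenSSH options `StrictHostKeyChecking` and `UserKnownHostsFile` and Git's `GIT_SSH_COMMAND` environment variable. The helper names and file paths are assumptions, and how a particular GitOps tool consumes these values is not covered.

```python
import shlex


def known_hosts_line(host: str, host_key: str, port: int = 22) -> str:
    """Format a known_hosts entry; host_key is '<key type> <base64 key>'."""
    # OpenSSH writes non-default ports as [host]:port.
    pattern = host if port == 22 else f"[{host}]:{port}"
    return f"{pattern} {host_key.strip()}\n"


def git_ssh_command(identity_file: str, known_hosts_file: str) -> str:
    """Build a GIT_SSH_COMMAND value that enforces strict host key checking."""
    args = [
        "ssh",
        "-i", identity_file,
        "-o", "IdentitiesOnly=yes",
        # Fail instead of prompting or trusting on first use.
        "-o", "StrictHostKeyChecking=yes",
        # Only trust the host key received during bootstrapping.
        "-o", f"UserKnownHostsFile={known_hosts_file}",
    ]
    return " ".join(shlex.quote(a) for a in args)
```

With this in place, a connection to a host presenting any key other than the one delivered during bootstrapping fails before any repository data is exchanged.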

Drawbacks

One drawback of this approach is that there is a component (the agent) which needs to run in the cluster. If this component fails, it might break access to the whole cluster. If the cluster were accessed directly via its API, this would not be the case.

Alternatives

Inspired by Banzai Cloud Pipeline: instead of running an agent component on the cluster, the cluster could be imported directly via the cluster’s API. A prerequisite for this to work is that the cluster’s API is publicly available.

References

Rancher agent - https://github.com/rancher/rancher/tree/master/pkg/agent/cluster
Rancher remote dialer - https://github.com/rancher/remotedialer
Weave Cloud launcher - https://github.com/weaveworks/launcher