Helm for Kubernetes. Datree for keeping the cluster secure and healthy.
This chapter continues the Helm exploration journey started in the previous ones:
The input for this chapter is here. The setup covers two environments (dev and prod), each consisting of a helmfile that installs two Helm charts: Argo CD to implement GitOps, and a hand-crafted Argo CD Deployment chart that instructs Argo CD to install three so-called “business important” Helm charts (RabbitMQ maintained by Bitnami, a hand-crafted chart with a naive REST API server communicating with RabbitMQ, and a utility chart with a TLS certificate used for HTTPS access via Ingress). All details are in the previous chapters.
With all the tools and utilities the Kubernetes ecosystem offers at the moment, it is pretty easy to throw more and more seemingly useful components into the cluster, which leads to cluster inflation. On the other side, companies tend to prioritize business-value features and forget to allocate enough time for maintenance, tech-debt fixes, and proper refactoring of the solution. The implications are not much fun: auditing best practices and security principles only after a disaster has happened is too late.
Datree is a Kubernetes cluster health monitoring tool that tells you how good your cluster is. Datree can also enforce a policy of not deploying misconfigured resources. Probably the most valuable feature of Datree is that it suggests concrete steps to take in the cluster to make it safer. With this roadmap on Datree’s dashboard, the team can see progress over time: the initial health score gradually improves as people implement the suggested actions. Also, Datree’s monitoring jobs keep observing the cluster to discover newly introduced weaknesses.
According to the documentation, the most convenient way to install Datree is its Helm chart. As this utility is better classified as a non-functional one, the best place for it in my solution is the helmfile. Below is the content of the helmfile after adding Datree:
#4..#5: the Datree Helm repository
#26..#32: details of the Datree installation. #31 references a values file with a few Datree customizations based on the installation documentation
#2: the unique Datree token used to send metrics to Datree’s backend; it is obtained as part of the initial sign-up to Datree.
#3: the current cluster context from the kubectl config.
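For orientation, here is a minimal sketch of what these additions might look like. The repository URL, chart name, and value keys are assumptions based on Datree’s installation documentation, and the line positions will not match the numbered lines above, so the real values should be taken from the docs. The helmfile fragment:

```yaml
# helmfile.yaml (fragment, illustrative)
repositories:
  - name: datree-webhook                              # assumed repository name
    url: https://datreeio.github.io/admission-webhook-datree

releases:
  - name: datree
    namespace: datree
    chart: datree-webhook/datree-admission-webhook    # assumed chart name
    values:
      - datree.values.yaml                            # hypothetical path to the values file
```

And the referenced values file might look roughly like this:

```yaml
# datree.values.yaml (fragment, illustrative; key names are assumptions)
datree:
  token: "<DATREE_TOKEN>"     # issued during the initial Datree sign-up
  clusterName: "minikube"     # taken from `kubectl config current-context`
```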
Everything is in place to deploy the new Helm chart with the helmfile sync command, which results in:
The following diagram is taken from the Datree documentation and shows how metrics are collected:
For me as a user, this process is fully transparent, and the current health score of the cluster can be found on Datree’s site under my account (https://app.datree.io/overview):
The letter “B” on a dirty-green background is my current health level (more on grades is here), which is not too bad indeed.
“Scanned namespaces” on the left side includes every namespace: Kubernetes native ones (kube-public, kube-node-lease), minikube’s plugins (ingress-nginx), 3rd parties (argo), and business-value ones (prod-medium, dev-medium). It makes sense to limit the scope to only the resources I have been developing. Luckily, Datree can easily be customized to skip some resources. Pressing the “Ignore namespaces” button opens the documentation. Let’s create a ConfigMap and apply it to exclude the bigger part of the namespaces:
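For illustration, one possible shape of that ConfigMap, assuming the skiplist format (namespace;kind;name, regex allowed) described in the documentation opened by that button; the ConfigMap name, namespace, and data key are assumptions and should be double-checked against the actual docs:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: webhook-scanning-filters   # name assumed from the Datree docs
  namespace: datree
data:
  skiplist: |
    kube-system;(.*);(.*)
    kube-public;(.*);(.*)
    kube-node-lease;(.*);(.*)
    ingress-nginx;(.*);(.*)
    argo;(.*);(.*)
```

Applying it with kubectl apply -f excludes the system and third-party namespaces from the scans, leaving essentially the business namespaces (dev-medium, prod-medium) in scope.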
The cluster needs to be rescanned, and the “Scan cluster manually” command comes to the rescue. After it is applied, the dashboard looks like this:
Time to look at the Failed Rules section.
The first violated rule is Prevent containers from escalating privileges. Clicking on it reveals that this happens in the dev-medium and prod-medium namespaces (they are sibling environments that differ only by secrets). Click on either one to see the details and expand an entry (webapp in the following screenshot):
And the screen even has quick-fix suggestions in the bottom-right corner!
Let’s open the documentation by clicking the button; it navigates to https://hub.datree.io/built-in-rules/prevent-escalating-privileges and contains an example of how to fix the Deployment YAML file.
Equipped with this example, let’s improve the webapp chart by adding two lines here:
#14..#15: fix as simple as adding two lines
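For reference, a sketch of what those two lines typically look like in the container section of the webapp Deployment template; the surrounding fields are illustrative, only the securityContext part matters:

```yaml
# templates/deployment.yaml of the webapp chart (fragment, illustrative)
      containers:
        - name: webapp
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          securityContext:                    # the two added lines
            allowPrivilegeEscalation: false
```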
For RabbitMQ, since Bitnami maintains this chart, a search through its documentation gives a hint that its values files need to be amended (as we have two environments, the change has to be made in two folders):
#12..#13: fix as simple as adding two lines.
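In the Bitnami RabbitMQ values files, the equivalent setting lives under containerSecurityContext; the exact key layout depends on the chart version and on whether the chart is consumed as a dependency, so treat this fragment as an assumption:

```yaml
# RabbitMQ values file for one environment (fragment, illustrative)
containerSecurityContext:
  allowPrivilegeEscalation: false
```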
All changes need to be committed to the git repo, and Argo CD will reconcile them into the k8s cluster in a couple of minutes. The inserted lines can be found in the Argo CD UI:
Let’s force Datree to recollect the metrics with:
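What works for me is recreating the scan job from the CronJob installed by the Datree chart; the CronJob name below is an assumption, so list the CronJobs in the datree namespace first:

```sh
# Look up the actual name of the scan CronJob installed by the Datree chart
kubectl get cronjobs -n datree

# Spawn a one-off job from it to trigger an immediate cluster scan
kubectl create job --from=cronjob/datree-cluster-scan-cronjob \
  "manual-scan-$(date +%s)" -n datree
```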
Back on the Datree dashboard, this violation has gone (there are only 6 left):
The score trend is ‘increased’, judging by the green arrow pointing up!
To the right of the arrow, there is the Policy block showing ‘Starter’, giving a hint that only very basic rules are being checked.
Let’s make one more rule, ‘Ensure each container has a read-only root filesystem’, pass. Datree’s documentation suggests setting one more property in the very same section as the previous fix:
#14: only this line needs to be added.
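A sketch of the resulting securityContext in the webapp container (illustrative):

```yaml
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true    # the single added line
```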
Let’s fix both (the RabbitMQ and webapp charts), do the next git commit, and let Argo CD run its reconcile cycle.
In a couple of minutes, we observe in Argo CD that the webapp deployment has been successfully updated, but RabbitMQ has a problem synchronizing (it stays in the Progressing status):
Let’s look into Lens to get the Pod status and logs:
It seems the read-only filesystem fix for RabbitMQ does not work straight away (the log contains the line rabbitmq 22:39:01.34 ERROR ==> The variable RABBITMQ_COMBINED_CERT_PATH must be set to either an existant file or a non-existant file in a writable directory. with a suggestion).
For this Datree demo chapter, it is quicker to revert the RabbitMQ change only.
Again, force Datree to recollect the metrics by recreating the Datree job in the cluster.
There is no more Ensure each container has a read-only root filesystem rule violation for the webapp deployment (it still holds for RabbitMQ, as it is a complex chart and the fix is out of scope).
Let’s fix one more rule violation, ‘Ensure Deployment has more than one replica configured’. Clicking this violation in Datree shows that only the webapp Pod is not OK:
Let’s check that there is indeed one webapp Pod in each of the environment namespaces:
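A quick way to verify it from the command line (the label selector is an assumption about how the webapp chart labels its Pods):

```sh
# One Pod per environment is expected at this point
kubectl get pods -n dev-medium  -l app.kubernetes.io/name=webapp
kubectl get pods -n prod-medium -l app.kubernetes.io/name=webapp
```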
This time the fix is as simple as going to the Helm chart, finding the Kubernetes Deployment YAML definition, and increasing the replica value from one to two. The modification also needs to be committed to the git repo, followed by waiting a couple of minutes until Argo CD applies it to the cluster.
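The change itself, sketched against the webapp Deployment definition (whether the value is hard-coded or exposed through values is up to the chart):

```yaml
# webapp Deployment (fragment, illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 2    # was 1; satisfies "more than one replica configured"
```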
In a couple of minutes (if patience is low, it is possible to force Argo CD to reconcile from the UI or CLI):
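For the CLI path, a hedged example with the argocd CLI (the application name is an assumption; the real one can be found with argocd app list):

```sh
# Trigger an immediate sync instead of waiting for the polling interval
argocd app list
argocd app sync dev-medium-webapp    # application name is an assumption
```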
The last step is to manually push Datree to evaluate the cluster health (the same job recreation as before):
In a couple of minutes, the Datree dashboard reports one less rule violation (it was six before, now it is five) and one more passed rule:
That was the second step toward a healthier and more resilient Kubernetes cluster. There is a certain element of a game, or even a competition, in getting to the stage with a minimum number of violations, eventually seeing score ‘A’, evaluating the cluster against a non-starter policy, and improving the situation even more.
All sources can be found on GitHub.
It is important to cover the non-functional requirements of the Kubernetes ecosystem. More and more solutions appear, maturing and adding more and more features. Of course, a cool idea is valuable in itself, but simplicity of use is a good investment as well. Datree is surely one of the very good services in this area.