You are probably as excited about containers as I was when I first heard about them a few years ago. In the past, we would deploy our multi-container apps with Docker Compose to a Docker Swarm and have it manage our containerized apps. This worked, but it also added a lot of complexity. I started looking for new ways to manage my containers and came across Kubernetes. I just went … wow! I have written a few posts about Kubernetes before, but they were mostly general discussions. This post is going to be a long one, from planning all the way to deployment. I will explain why I chose this particular infrastructure as we move through it.
When it comes to Kubernetes for the enterprise, a thorough plan is required, so let us walk through the planning steps first.
Planning
Cluster size
Depending on the workload your organization requires, the size of your cluster may vary. The decision comes down to two options:
- Many nodes, each with fewer resources
- Fewer nodes, each with more resources
Let’s say your workload requires 48 GB of RAM, 12 CPU cores and 1 TB of storage. Your decision is whether to split your cluster into many nodes, say 6 nodes (each with 8 GB of RAM and 2 cores), or 12 nodes (each with 4 GB of RAM and 1 core).
Having many nodes is harder to manage but gives you flexibility and high availability. If one of your nodes goes down, you still have the others to run your workload. The same applies to pods and containers inside the cluster.
This post helps you make a better decision regarding your cluster size; it goes into more depth on why you would choose one option over the other.
I am going to deploy a cluster with two node pools to Azure:
- The default node pool is Linux-based and is used mainly by the cluster’s internal components. It has a single node (Standard_D2s_v3 VM size), and I will also run the monitoring tools on this node.
- A Windows node pool (also Standard_D2s_v3), mainly for Windows containers.
Both node pools are backed by virtual machine scale sets, allowing me to scale each pool up to 5 nodes. You can configure a higher maximum at deployment time or change it later.
Since we are using the CNI networking type, it will be easy to add another node pool when needed.
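As a rough illustration, here is a minimal Terraform sketch of these two node pools (resource and variable names such as azurerm_resource_group.aks, azurerm_subnet.aks and var.windows_admin_password are placeholders, and exact arguments may vary between azurerm provider versions):

resource "azurerm_kubernetes_cluster" "aks" {
  name                = "aks-cluster"
  location            = azurerm_resource_group.aks.location
  resource_group_name = azurerm_resource_group.aks.name
  dns_prefix          = "aks-cluster"

  # Default Linux pool: VMSS-backed, autoscaling between 1 and 5 nodes.
  default_node_pool {
    name                = "default"
    vm_size             = "Standard_D2s_v3"
    type                = "VirtualMachineScaleSets"
    enable_auto_scaling = true
    node_count          = 1 # initial size
    min_count           = 1
    max_count           = 5
    vnet_subnet_id      = azurerm_subnet.aks.id
  }

  # A Windows node pool requires a Windows admin profile on the cluster ...
  windows_profile {
    admin_username = "aksadmin"
    admin_password = var.windows_admin_password
  }

  # ... and the Azure CNI network plugin (the full ranges are covered in the Networking section).
  network_profile {
    network_plugin = "azure"
  }

  identity {
    type = "SystemAssigned"
  }
}

# Second pool for Windows containers, also VMSS-backed and capped at 5 nodes.
resource "azurerm_kubernetes_cluster_node_pool" "windows" {
  name                  = "win"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  os_type               = "Windows"
  vm_size               = "Standard_D2s_v3"
  enable_auto_scaling   = true
  node_count            = 1
  min_count             = 1
  max_count             = 5
  vnet_subnet_id        = azurerm_subnet.aks.id
}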
Networking
I wrote a post a few months ago about deploying a Kubernetes cluster into an existing Azure Virtual Network using the kubenet networking type. This met my requirements at the time, but I then realized it also had some limitations. One of them was the flow of network traffic: because pods had to communicate with the outside world via NAT, traffic was slower and harder to troubleshoot.
It was also difficult to scale, and impossible to add a Windows node pool to the cluster, since AKS does not support Windows node pools with kubenet. This is the main reason I wanted to take the CNI approach.
Here is a summary based on my experience (some points are taken from the Microsoft docs):
Basic Networking (kubenet)
The Kubenet networking option is the default configuration for AKS cluster creation. Nodes get an IP address from the Azure virtual network subnet. Pods receive an IP address from a logically different address space to the Azure virtual network subnet of the nodes. Network address translation (NAT) is then configured so that the pods can reach resources on the Azure virtual network. The source IP address of the traffic is NAT’d to the node’s primary IP address.
Nodes use the Kubenet Kubernetes plugin. Only nodes receive a routable IP address, and pods use NAT to communicate with other resources outside the AKS cluster. This approach greatly reduces the number of IP addresses that you need to reserve in your network space for pods to use.
My previous setup used this model, which introduces several disadvantages:
- Nodes and pods are placed on different IP subnets. User Defined Routing (UDR) and IP forwarding are used to route traffic between pods and nodes, and this additional routing may reduce network performance.
- Connections to existing on-premises networks and peering to other Azure virtual networks are complex.
- It is difficult to expand once it is joined to our on-prem network.
Advanced Networking (CNI)
With Azure Container Network Interface (CNI):
- Every pod gets an IP address from the subnet and can be accessed directly.
- These IP addresses must be unique across your network space and must be planned in advance.
- Each node has a configuration parameter for the maximum number of pods that it supports. The equivalent number of IP addresses per node are then reserved up front for that node.
- It requires more planning, as it can otherwise lead to IP address exhaustion or the need to rebuild the cluster in a larger subnet as application demands grow.
My preference is the Advanced Networking model, which calls for a clear network design that allows:
- Communication between Kubernetes nodes, pods and containers
- Communication from Kubernetes to on-prem resources
A few notes on my existing environment:
- Because CNI requires more IP addresses for AKS components, a separate Virtual Network should be created rather than reusing the existing one, unless more address space is added to it.
- The current cluster runs in a subnet with a maximum of 28 IPs, which makes it impossible to add a Windows node pool, since that requires Advanced Networking.
- That cluster was configured to use the Basic model (due to the limited IP address space).
In general, there are five components we need to consider when configuring the network for the cluster; these are described in the documentation on the Microsoft website:
- Virtual network
- Subnet
- Kubernetes service address range (Kubernetes CIDR)
- Kubernetes DNS service IP address
- Docker bridge address (Docker CIDR)
I started the networking from scratch. I went to the Azure portal and created a Virtual Network with an address space of 10.240.0.0/16, then asked the network team in my organization to allow traffic to flow from this network to on-prem. This solved my IP space issue: it gives me room to grow the cluster up to the full /16 (65,534 usable addresses, each component with its own IP), although I am sure I will not use that many. I ended up splitting the network into smaller chunks, starting from the right-most one, 10.240.240.0/20. If you take 10.240.0.0/16 and split it into /20s, you get 16 subnets in total, each with 4,096 addresses (minus the network and broadcast addresses, that is 4,094 usable). I took the last one, the 16th – 10.240.240.0/20 – which keeps my range well clear of anyone else carving subnets from the start of the address space.
This 16th subnet is consumed mainly by our own cluster pieces – nodes, pods, services, load balancers and so on. We also need to set aside an IP address range for Kubernetes’s internal components.
Source: https://kubernetes.io/docs/concepts/overview/components/
Anything that is not one of our own definitions is internal to Kubernetes; the Kubernetes control plane components are a good example. They need their own IP range for their internal services. So let’s carve the network up further: my plan is to give this range roughly 500 IPs – 10.240.238.0/23 – which is more than enough.
For the Docker bridge, I’ll just use the default – 172.17.0.1/16
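Putting the plan together, a minimal Terraform sketch of the network pieces could look like this (resource names are hypothetical, the DNS service IP of 10.240.238.10 is an arbitrary pick inside the service range, and the service range must not overlap any subnet the cluster actually uses):

resource "azurerm_virtual_network" "aks" {
  name                = "vnet-aks"
  resource_group_name = azurerm_resource_group.aks.name
  location            = azurerm_resource_group.aks.location
  address_space       = ["10.240.0.0/16"]
}

# The 16th /20 of the /16 - nodes, pods and internal load balancers live here.
resource "azurerm_subnet" "aks" {
  name                 = "snet-aks"
  resource_group_name  = azurerm_resource_group.aks.name
  virtual_network_name = azurerm_virtual_network.aks.name
  address_prefixes     = ["10.240.240.0/20"]
}

# On the azurerm_kubernetes_cluster resource, the remaining ranges from the list above:
#   network_profile {
#     network_plugin     = "azure"             # Azure CNI
#     service_cidr       = "10.240.238.0/23"   # Kubernetes service address range
#     dns_service_ip     = "10.240.238.10"     # Kubernetes DNS service IP, inside service_cidr
#     docker_bridge_cidr = "172.17.0.1/16"     # Docker bridge default
#   }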
Monitoring
Kubernetes comes with a default dashboard. It is a starting point for monitoring your cluster components, but it is minimal. When you decide on your cluster size, allow enough resources so you can set up better monitoring tools such as Prometheus. Here are a few options, to name just some:
- Azure Monitor, Log Analytics workspaces and Application Insights
- Prometheus for scraping cluster metrics and Grafana for visualizing those metrics in dashboards
- ELK stack (ElasticSearch, Logstash & Kibana)
- Prometheus with EK (ElasticSearch, Kibana)
I am sure there are many others, but these are the popular ones.
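If you go with the Azure Monitor option, enabling container insights from Terraform is roughly the following sketch (the workspace name and retention are placeholders; the addon block sits on the azurerm_kubernetes_cluster resource):

resource "azurerm_log_analytics_workspace" "aks" {
  name                = "law-aks-monitoring"
  location            = azurerm_resource_group.aks.location
  resource_group_name = azurerm_resource_group.aks.name
  sku                 = "PerGB2018"
  retention_in_days   = 30
}

# On the azurerm_kubernetes_cluster resource:
#   addon_profile {
#     oms_agent {
#       enabled                    = true
#       log_analytics_workspace_id = azurerm_log_analytics_workspace.aks.id
#     }
#   }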
If, for some reason, you want dedicated storage for your monitoring tools, you can also deploy a dedicated NFS server. This is a typical case for Prometheus, where you mount the persistent disk of the pod that runs it onto a shared network file system. That way, the storage can be shared across pods in the cluster.
Remember, our nodes in AKS use a temporary disk, and data is lost when the node goes down or is destroyed. Many people configure their pods to mount persistent storage such as Azure Disks or Azure Files instead.
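As a sketch of that pattern with the Terraform Kubernetes provider (the claim name, namespace and size are hypothetical; managed-premium is one of the storage classes AKS ships with and is backed by Azure Disks, which mount to a single pod at a time – for storage shared across pods you would use Azure Files or NFS with ReadWriteMany instead):

resource "kubernetes_persistent_volume_claim" "prometheus_data" {
  metadata {
    name      = "prometheus-data"
    namespace = "monitoring"
  }

  spec {
    access_modes       = ["ReadWriteOnce"]
    storage_class_name = "managed-premium" # Azure Disk backed class built into AKS

    resources {
      requests = {
        storage = "50Gi"
      }
    }
  }
}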
Another approach is to use one or more separate nodes (rather than a replica of pods within the default node pool) inside the cluster, just for monitoring. This is a good fit if you plan to use the ELK stack: a dedicated ElasticSearch database for storing your cluster logs and metrics, a Beats shipper (Filebeat, Metricbeat, etc.) and Kibana for visualization. Whether this is worth it depends on the size of your cluster – ELK is built on Java, and the JVM is a resource-hungry engine.
Administration
An enterprise cluster requires a high level of security. Security is not only about networking; it also covers access to the cluster’s components themselves. For my AKS cluster, I am going to configure Role-Based Access Control (RBAC) integrated with Azure Active Directory, so that only users within a certain security group can access the cluster via kubectl.
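On the cluster resource, the AAD-integrated RBAC configuration looks roughly like this sketch (this assumes a recent 2.x azurerm provider with managed AAD integration; the group object ID variable is a placeholder):

# Inside the azurerm_kubernetes_cluster resource:
role_based_access_control {
  enabled = true

  azure_active_directory {
    managed = true
    # AAD security group whose members may administer the cluster (placeholder variable)
    admin_group_object_ids = [var.aks_admins_group_object_id]
  }
}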
My cluster components will not have public IP addresses; the only way to reach them publicly is through an ingress controller behind a load balancer that allows traffic from outside the organization. I can, of course, still reach the nodes from inside the company network.
There are other security factors to consider as well, such as network security groups, firewalls and user-defined routes. These can be addressed at a later stage.
Deployment
A complex cluster deployment needs a good tool to automate the job. I am using Terraform to deploy the cluster. The nice thing about this tool is that it keeps track of the deployment state, so you can add more resources to the cluster without having to click around the Azure portal. Terraform is a great example of Infrastructure as Code; a few others are Ansible, Puppet and SaltStack.
With Terraform, we can use version control to keep track of our deployments, and rolling everything back (tearing it down) is as easy as running “terraform destroy” and hitting enter.
We can group our cluster resources into many modules.
For my cluster, I am setting up the following modules, and they will execute in this order. A few of them are placeholders only; I will deploy those at a later stage. A minimal root configuration sketch follows the list.
- aks_networking
- aks_azure_monitoring
- aks_cluster
- aks_azuread_rbac
- aks_namespaces
- aks_dashboard
- aks_elk_provisioning (later stage)
- aks_ingress_controller (later stage)
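A minimal root configuration that wires some of these modules together might look like the following sketch (module paths, variables and output names are hypothetical; the execution order comes from the references between modules rather than their position in the file):

module "aks_networking" {
  source        = "../modules/aks_networking"
  address_space = "10.240.0.0/16"
  subnet_prefix = "10.240.240.0/20"
}

module "aks_azure_monitoring" {
  source = "../modules/aks_azure_monitoring"
}

module "aks_cluster" {
  source                     = "../modules/aks_cluster"
  vnet_subnet_id             = module.aks_networking.subnet_id          # creates the dependency on networking
  log_analytics_workspace_id = module.aks_azure_monitoring.workspace_id # and on monitoring
}

module "aks_azuread_rbac" {
  source     = "../modules/aks_azuread_rbac"
  cluster_id = module.aks_cluster.cluster_id
}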
The folder structure looks like this:
D:.
├───config
│ │ main.tf
│ │ out.plan
│ │ provider.tf
│ │
│ └───.terraform
│ │ terraform.tfstate
│ │
│ ├───modules
│ │ │ modules.json
│ │ │
│ │ └───kubernetes_dashboard
│ │ └───terraform-kubernetes-dashboard-0.9.0
│ │ │ LICENSE.md
│ │ │ local.tf
│ │ │ main.tf
│ │ │ output.tf
│ │ │ README.md
│ │ │ variables.tf
│ │ │ versions.tf
│ │ │
│ │ └───resources
│ │ origin.yaml
│ │
│ └───plugins
│ └───windows_amd64
│ lock.json
│ terraform-provider-azurerm_v2.14.0_x5.exe
│ terraform-provider-kubernetes_v1.11.3_x4.exe
│
├───create
│ │ main.tf
│ │ out.plan
│ │ output.tf
│ │ provider.tf
│ │ variables.tf
│ │
│ └───.terraform
│ │ terraform.tfstate
│ │
│ ├───modules
│ │ modules.json
│ │
│ └───plugins
│ └───windows_amd64
│ lock.json
│ terraform-provider-azurerm_v2.14.0_x5.exe
│ terraform-provider-random_v2.2.1_x4.exe
│
└───modules
├───aks_azuread_rbac
│ main.tf
│ variables.tf
│
├───aks_azure_monitoring
│ main.tf
│ output.tf
│ variables.tf
│
├───aks_cluster
│ main.tf
│ output.tf
│ variables.tf
│
├───aks_elk_provisioning
│ main.tf
│ output.tf
│ variables.tf
│
├───aks_ingress_controller
│ main.tf
│ output.tf
│ variables.tf
│
├───aks_namespaces
│ main.tf
│ output.tf
│
└───aks_networking
main.tf
output.tf
variables.tf
Pay attention to how the deployment is run. I split the run into two parts: one for the cluster creation (the create folder) and one for the configuration (the config folder). You can add further parts if you need to. Remember what I said about maintaining the deployment state, so please bear that in mind.
You can find the source code for these deployments via my GitHub repo here.
Configuration
You may want to put other configurations on your task list as well, e.g. creating namespaces and image-pull secrets when integrating with Azure Container Registry (ACR).
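For example, a namespace plus an ACR image-pull secret can be sketched with the Kubernetes provider like this (the namespace name and the ACR/service-principal variables are placeholders):

resource "kubernetes_namespace" "apps" {
  metadata {
    name = "apps"
  }
}

# Image-pull secret so deployments in this namespace can pull images from ACR.
resource "kubernetes_secret" "acr_pull" {
  metadata {
    name      = "acr-pull-secret"
    namespace = kubernetes_namespace.apps.metadata[0].name
  }

  type = "kubernetes.io/dockerconfigjson"

  data = {
    ".dockerconfigjson" = jsonencode({
      auths = {
        (var.acr_login_server) = {
          username = var.acr_client_id
          password = var.acr_client_secret
          auth     = base64encode("${var.acr_client_id}:${var.acr_client_secret}")
        }
      }
    })
  }
}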
These configurations can be done later, as long as you keep your Terraform deployment state consistent. This is why we keep everything under version control (Git). You can also use Azure Storage as a remote backend for the Terraform state. I am using both.
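Keeping the state in Azure Storage only takes a backend block in the Terraform configuration, roughly like this sketch (the resource group, storage account, container and key names are placeholders and must exist before terraform init):

terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "tfstateaks"
    container_name       = "tfstate"
    key                  = "aks-cluster.tfstate"
  }
}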
What’s next
If you have your monitoring tools set up properly, you may want to scale your cluster up and down according to the workload requirements. The nice thing about using CNI with a clear networking plan is that we can add new nodes to the cluster easily. Visualization tools such as Grafana or Kibana can tell you how much RAM/CPU your nodes are consuming and send alert emails when things go wrong, so you can adjust your cluster as demand grows without too much trouble. At this point, our pods are using the temporary disks of their nodes. If you want data to survive pod failures, you can add another Terraform module for persistent storage, or simply manage it the usual way with kubectl.
With the cluster set up in Azure, our development and deployment can now be automated. I can start setting up CI/CD pipelines in any of my Azure DevOps repos and have them deploy to the cluster automatically when new updates are made.