Enabling logs and alerting in AWS EKS cluster - CloudWatch and fluent-bit
In this post I will share my experience enabling and configuring logging in an EKS cluster and creating alerts that send a notification when a specific event appears in the logs.
Logs are a fundamental component of our environments: they provide useful information that helps to debug issues and to identify events that can affect the security of the application. Logs should be enabled for each component, from the infrastructure level to the application level. This brings some challenges, such as where to store the logs, what kind of events to log, how to search for events, and what to do with the logs afterwards.
This is part #1, in which I show how to enable control plane logging and container logging in an EKS cluster; in part #2 I will show how to create alerts using the log groups created here.
Before starting, it is important to mention that logs can contain private data like user information, keys, passwords, etc. For this reason, logs should be encrypted at rest and access to them should be restricted.
Kubernetes Control Plane logging
The Kubernetes architecture can be divided into a control plane and worker nodes. The control plane contains the components that manage the cluster: etcd, the API Server, the Scheduler, and the Controller Manager. Almost every action performed in the cluster passes through the API Server, which logs each event.
AWS EKS manages the control plane for us, deploying and operating the necessary components. By default, EKS doesn't have logging enabled, so actions from our side are required. Enabling EKS control plane logging is an easy task: you need to know which components to log and enable them. You can enable logs for the API server, audit, authenticator, controller manager, and scheduler.
In my opinion, the audit and authenticator logs are the most useful because they record the actions performed in our cluster and help us understand the origin of those actions and the requests generated by the IAM authenticator.
With Terraform you can use the following code to create a simple cluster with audit, api, authenticator, and scheduler logs enabled.
resource "aws_eks_cluster" "kube_cluster" {
name = "test-cluster"
role_arn = aws_iam_role.role-eks.arn
version = "1.22"
enabled_cluster_log_types = ["audit", "api", "authenticator","scheduler"]
vpc_config {
subnet_ids = ["sub-1234","sub-5678"]
endpoint_private_access = true
endpoint_public_access = true
}
}
Logs are stored in AWS CloudWatch Logs; the log group is created automatically following the name structure /aws/eks/<cluster-name>/cluster. Inside the group you can find a log stream for each component that you enabled, for example:
authenticator-123abcd
kube-apiserver-123abcd
By default, the log group created by AWS has no encryption and no retention period set. I recommend creating the log group yourself, specifying a KMS key, and setting an expiration time for the logs; Kubernetes generates a considerable volume of logs, which increases the size of the group and can impact billing.
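If you use Terraform, a minimal sketch like the following can pre-create that log group with encryption and retention before the cluster starts writing to it (the retention value and the aws_kms_key.logs_key resource are assumptions for illustration):
resource "aws_cloudwatch_log_group" "eks_control_plane" {
  # EKS writes control plane logs to /aws/eks/<cluster-name>/cluster
  name              = "/aws/eks/test-cluster/cluster"
  retention_in_days = 30                        # assumption: adjust to your needs
  kms_key_id        = aws_kms_key.logs_key.arn  # assumption: an existing KMS key
}
Keep in mind that the KMS key policy must allow the CloudWatch Logs service to use the key, and the group has to exist before the cluster is created so EKS reuses it instead of creating an unencrypted one.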
Kubernetes Containers logging
The steps above only enable logging for the control plane; to send logs generated by the applications running in the containers, a log aggregator is necessary. In this case, I will use Fluent Bit.
Fluent Bit runs as a DaemonSet in the cluster and sends logs to CloudWatch Logs, creating the log groups according to the configuration specified in the Kubernetes manifests.
It is important to mention that AWS has created a Docker image for the DaemonSet, which can be found in this link.
AWS describes how to run the DaemonSet with a few commands, but I will use Kubernetes manifests that can be stored in our repository so that Argo CD or Flux CD can automate the deployments.
The following steps show the manifests needed to create the objects Kubernetes requires to send container logs to CloudWatch. You must have access to the cluster and create the resources with kubectl (kubectl apply -f manifest-name.yml).
1. Namespace creation
A Kubernetes namespace is necessary; the name amazon-cloudwatch will be used for this. You can change the name, but make sure to use the same one in the following steps.
apiVersion: v1
kind: Namespace
metadata:
  name: amazon-cloudwatch
  labels:
    name: amazon-cloudwatch
2. ConfigMap for aws-fluent-bit general configs
This ConfigMap specifies some general settings for Fluent Bit and for AWS, for instance the cluster name AWS uses to create the log groups. In this case, I don't want to expose an HTTP server for Fluent Bit and I will read the logs from the tail; more information about this can be found here.
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-general-configs ## you can use a different name, make sure to use the same in the following steps
  namespace: amazon-cloudwatch
data:
  cluster.name: ${CLUSTERNAME}
  http.port: ""
  http.server: "Off"
  logs.region: ${AWS_REGION}
  read.head: "Off"
  read.tail: "On"
3. Service Account, Cluster Role and Role Binding
Some permissions are required to send logs from the DaemonSet to CloudWatch. You can attach a role to the worker nodes or use a service account with an IAM role; in this case, I will create an IAM role and associate it with a service account. The following Terraform code creates the role with its trust policy.
resource "aws_iam_role" "iam-role-fluent-bit" {
name = "role-fluent-bit-test"
force_detach_policies = true
max_session_duration = 3600
path = "/"
assume_role_policy = jsonencode({
{
Version= "2012-10-17"
Statement= [
{
Effect= "Allow"
Principal= {
Federated= "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.${REGION}.amazonaws.com/id/${EKS_OIDCID}"
}
Action= "sts:AssumeRoleWithWebIdentity"
Condition= {
StringEquals= {
"oidc.eks.${REGION}.amazonaws.com/id/${EKS_OIDCID}:aud": "sts.amazonaws.com",
"oidc.eks.${REGION}.amazonaws.com/id/${EKS_OIDCID}:sub": "system:serviceaccount:${AWS_CLOUDWATCH_NAMESPACE}:${EKS-SERVICE_ACCOUNT-NAME}"
}
}
}
]
}
})
}
- EKS_OIDCID: the OpenID Connect provider ID of your cluster; you can get it from the cluster information or from Terraform outputs.
- AWS_CLOUDWATCH_NAMESPACE: the namespace created in step 1, in this case amazon-cloudwatch.
- ACCOUNT_ID: the AWS account number where the cluster was created.
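If the cluster is managed in the same Terraform code, a small sketch like this (assuming the aws_eks_cluster.kube_cluster resource from the first example) can derive those values instead of hard-coding them:
data "aws_caller_identity" "current" {}

locals {
  # OIDC issuer without the https:// prefix, e.g. "oidc.eks.us-east-1.amazonaws.com/id/ABCDEF123456"
  eks_oidc_issuer = replace(aws_eks_cluster.kube_cluster.identity[0].oidc[0].issuer, "https://", "")
  account_id      = data.aws_caller_identity.current.account_id
}
local.eks_oidc_issuer then takes the place of oidc.eks.${REGION}.amazonaws.com/id/${EKS_OIDCID} in the trust policy, and local.account_id the place of ACCOUNT_ID.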
The role needs a policy with permissions to create log groups and put log events in CloudWatch; you can use the following code to create the policy and attach it to the IAM role created above.
resource "aws_iam_policy" "policy_sa_logs" {
name = "policy-sa-fluent-bit-logs"
path = "/"
description = "policy for EKS Service Account fluent-bit "
policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"cloudwatch:PutMetricData",
"ec2:DescribeVolumes",
"ec2:DescribeTags",
"logs:PutLogEvents",
"logs:DescribeLogStreams",
"logs:DescribeLogGroups",
"logs:CreateLogStream",
"logs:CreateLogGroup",
"logs:PutRetentionPolicy"
],
"Resource": "arn:aws:logs:${REGION}:${ACCOUNT_ID}:*:*"
}
]
}
EOF
}
######## Policy attachment to IAM role ########
resource "aws_iam_role_policy_attachment" "policy-attach" {
role = aws_iam_role.iam-role-fluent-bit.name
policy_arn = aws_iam_policy.policy_sa_logs.arn
}
Once the role has been created, the service account can be created; you can use the following Kubernetes manifest for that, replacing the IAM_ROLE variable with the ARN of the role created previously.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: amazon-cloudwatch
  annotations:
    eks.amazonaws.com/role-arn: "${IAM_ROLE}"
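If the role comes from the Terraform code above, a small output (the output name here is just an example) makes it easy to get the ARN for that annotation:
output "fluent_bit_role_arn" {
  # ARN to paste into the eks.amazonaws.com/role-arn annotation
  value = aws_iam_role.iam-role-fluent-bit.arn
}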
With the service account ready, you need to create a cluster role and bind it to the service account; the following manifests can be used for that.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluent-bit-role
rules:
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
  - apiGroups: [""]
    resources:
      - namespaces
      - pods
      - pods/logs
      - nodes
      - nodes/proxy
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-role
subjects:
  - kind: ServiceAccount
    name: fluent-bit
    namespace: amazon-cloudwatch
4. ConfigMap for fluent-bit configurations
A ConfigMap is used to specify the detailed configuration for Fluent Bit. AWS already defines a configuration, but you can add custom settings. The following link shows the configuration defined by AWS; as you can see, the first objects in that YAML are the manifests defined in the previous steps, so in this step you just need to define the ConfigMap named fluent-bit-config. I don't want to paste the whole manifest here because it is rather long and would complicate the reading of this post.
With this ConfigMap, Fluent Bit will create the log groups shown in the table below; you also have the option to create them with Terraform and specify encryption and a retention period (I recommend this way; see the Terraform sketch at the end of this step).
| CloudWatch Log Group Name | Source of the logs (path inside the container) |
|---|---|
| /aws/containerinsights/Cluster_Name/application | All log files in /var/log/containers |
| /aws/containerinsights/Cluster_Name/host | Logs from /var/log/dmesg, /var/log/secure, and /var/log/messages |
| /aws/containerinsights/Cluster_Name/dataplane | The logs in /var/log/journal for kubelet.service, kubeproxy.service, and docker.service |
If you analyze the ConfigMap you can see the INPUTS for each source mentioned in the table.
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: amazon-cloudwatch
  labels:
    k8s-app: fluent-bit
... OTHER CONFIGS
  ### here is the INPUT configuration for application logs
  application-log.conf: |
    [INPUT]
        Name              tail
        Tag               application.*
        Exclude_Path      /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        Path              /var/log/containers/*.log
... OTHER CONFIGS
    [OUTPUT]
        Name                cloudwatch_logs
        Match               application.*
        region              $${AWS_REGION}
        log_group_name      /aws/containerinsights/$${CLUSTER_NAME}/application
        log_stream_prefix   $${HOST_NAME}-
        auto_create_group   false
        extra_user_agent    container-insights
        log_retention_days  ${logs_retention_period}
The OUTPUT section in the previous manifest defines the CloudWatch log group configuration that Fluent Bit will use: you can specify whether the log group should be created automatically, the prefix for the log streams, the group name, and the retention period for the logs. If you are creating the log groups with Terraform, you should set the auto_create_group option to false.
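As a minimal Terraform sketch (the retention value and the aws_kms_key.logs_key reference are assumptions for illustration), the three Container Insights log groups from the table can be pre-created with encryption and retention:
resource "aws_cloudwatch_log_group" "container_insights" {
  for_each = toset(["application", "host", "dataplane"])

  name              = "/aws/containerinsights/test-cluster/${each.key}"
  retention_in_days = 30                        # assumption: adjust to your needs
  kms_key_id        = aws_kms_key.logs_key.arn  # assumption: an existing KMS key
}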
5. DaemonSet Creation
This is the last step. AWS also provides the manifest to create the DaemonSet; in this link you can find it at the bottom of the file. As I mentioned, I don't want to paste the whole file here; you can copy and paste the content or edit the file if you have custom configurations.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: amazon-cloudwatch
  labels:
    k8s-app: fluent-bit
... OTHER CONFIGS
Once you have completed the above steps, you can validate that the DaemonSet is running correctly (kubectl get daemonset fluent-bit -n amazon-cloudwatch), and if everything is OK you should see the log groups in the AWS console with events sent by the fluent-bit DaemonSet.