Kubernetes has evolved from an emerging technology to the operating system of the cloud. According to the 2025 CNCF Annual Cloud Native Survey, 82% of container users now run Kubernetes in production, up from 66% in 2023. Overall adoption has reached 96%, meaning nearly all organisations either use, pilot, or evaluate Kubernetes. Two out of every three clusters are now hosted in the cloud, and 66% of organisations hosting generative AI models use Kubernetes to manage their inference workloads.
Yet the gap between deploying Kubernetes and running it well in production remains significant. Many organisations underestimate the operational complexity involved in security hardening, observability, cost management, and day-two operations. This guide distils the critical lessons learned from real-world production deployments and provides actionable best practices for teams at every stage of their Kubernetes journey.
Container fundamentals and Kubernetes architecture
Before diving into production practices, it is essential to understand the foundational building blocks. Containers, popularised by Docker, package an application with all its dependencies into a standardised unit that runs consistently across any environment. Unlike virtual machines, containers share the host operating system kernel, making them lightweight and fast to start. This efficiency is what makes running hundreds or thousands of containers on a single cluster practical.
Kubernetes orchestrates these containers across a cluster of machines. Its architecture consists of a control plane (comprising the API server, etcd for state storage, the scheduler, and controller managers) and worker nodes (running the kubelet, container runtime, and kube-proxy). Pods, the smallest deployable units, contain one or more containers that share network and storage resources. Deployments manage the desired state of pod replicas, Services provide stable network endpoints, and Ingress controllers handle external traffic routing.
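To make these primitives concrete, the sketch below shows a minimal Deployment and Service for a hypothetical web application; the image name, labels, and ports are placeholders rather than a prescribed configuration.

```yaml
# Deployment: declares the desired state of three identical web pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  labels:
    app: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
---
# Service: a stable, cluster-internal endpoint in front of those pods.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
```

An Ingress resource would then route external HTTP traffic to this Service.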
Understanding these primitives is crucial because production issues frequently trace back to misunderstandings about how Kubernetes schedules pods, manages networking, or handles storage. Teams that invest in deep Kubernetes knowledge, rather than treating it as a black box, are consistently more successful at running production workloads.
Managed Kubernetes: EKS vs AKS vs GKE
Most organisations opt for managed Kubernetes services rather than running self-managed clusters. The three major providers each bring distinct advantages. Amazon EKS dominates in AWS-centric environments, integrating deeply with IAM for authentication and authorisation, Fargate for serverless pod execution, CloudWatch for logging, and the broader AWS ecosystem. EKS supports the widest range of third-party integrations and is the managed Kubernetes service most frequently cited in job postings.
Azure AKS is the natural choice for Microsoft-ecosystem organisations. It integrates with Microsoft Entra ID (formerly Azure Active Directory) for identity management, supports RBAC and private clusters, and offers Standard and Premium tiers with SLA-backed reliability and two-year support cycles. AKS provides strong DevOps integration through Azure DevOps and GitHub Actions, and its pricing model includes a free control plane tier that can reduce costs for smaller deployments.
Google GKE benefits from Google's heritage as the birthplace of Kubernetes. It offers the purest Kubernetes experience with features like Binary Authorization for supply chain security, Shielded GKE Nodes for hardened virtual machines, GKE Dataplane V2 with eBPF-based networking, and GKE Autopilot, which automatically manages the underlying infrastructure. GKE is often considered the most innovative of the three, with new Kubernetes features appearing first on the platform.
The optimal choice depends on your existing cloud relationships, team expertise, operational preferences, and compliance requirements. All three platforms are mature, production-ready, and support enterprise compliance certifications including SOC 2, ISO 27001, HIPAA, and PCI DSS. Many organisations adopt a multi-cloud strategy, running clusters on two or more providers for resilience and avoiding vendor lock-in.
Security hardening for production clusters
Kubernetes security requires a defence-in-depth approach spanning the cluster, the workloads, and the supply chain. Pod Security Standards (replacing Pod Security Policies, which were deprecated and then removed in Kubernetes 1.25) define three levels: Privileged, Baseline, and Restricted. Production workloads should run at the Restricted level wherever possible, which enforces running as non-root, dropping all Linux capabilities, applying a default seccomp profile, and preventing privilege escalation; pairing this with read-only root filesystems hardens workloads further.
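As an illustration, the Restricted profile can be enforced per namespace through Pod Security Admission labels, and workloads then need a security context that satisfies it. The snippet below is a minimal sketch with a hypothetical namespace and placeholder image:

```yaml
# Enforce (and warn about) the Restricted Pod Security Standard in one namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: payments                                    # hypothetical namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
---
# Pod- and container-level settings that pass the Restricted profile,
# plus a read-only root filesystem as extra hardening.
apiVersion: v1
kind: Pod
metadata:
  name: api
  namespace: payments
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: api
      image: registry.example.com/api:1.0.0          # placeholder image
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
```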
Network policies are essential for implementing microsegmentation within the cluster. By default, all pods can communicate with all other pods, which violates the principle of least privilege. Implementing network policies to restrict pod-to-pod communication to only what is explicitly required dramatically reduces the blast radius of a compromised workload. The use of Kubernetes security tools has surged from under 35% in 2022 to over 50% recently, reflecting growing maturity in this area.
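A common pattern, sketched below with a hypothetical namespace and labels, is a default-deny policy followed by explicit allow rules:

```yaml
# Default-deny: pods in this namespace can neither send nor receive traffic
# unless another policy explicitly allows it.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
---
# Allow only frontend pods to reach the API pods on port 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```

Note that a blanket egress deny also blocks DNS, so in practice you would add an explicit allow rule for DNS lookups towards kube-system.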
Role-Based Access Control (RBAC) governs who can perform what actions within the cluster. Follow the principle of least privilege rigorously: use namespace-scoped roles rather than cluster-wide roles, avoid granting wildcard permissions, and regularly audit RBAC configurations for excessive access. Integrate with your organisation's identity provider for authentication, and implement just-in-time access for privileged operations.
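As an example of a narrowly scoped grant, the sketch below gives a hypothetical identity-provider group read-only access to workloads in a single namespace:

```yaml
# Namespace-scoped role: read-only access to common workload resources.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-viewer
  namespace: payments
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "pods/log", "services", "deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
---
# Bind the role to a group supplied by the external identity provider.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-viewer-binding
  namespace: payments
subjects:
  - kind: Group
    name: payments-developers                        # hypothetical IdP group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-viewer
  apiGroup: rbac.authorization.k8s.io
```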
Supply chain security is increasingly critical. Scan container images for vulnerabilities in your CI/CD pipeline, sign images using tools like Cosign, and enforce signature verification at admission time using Binary Authorization or Kyverno policies. Use minimal base images like distroless or Alpine to reduce the attack surface, and keep images updated with the latest security patches. Encrypt secrets at rest using KMS integration and consider external secrets management solutions like HashiCorp Vault.
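Admission-time signature verification with Kyverno might look roughly like the sketch below, assuming images are signed with a Cosign key pair; the registry path is a placeholder, and the exact policy schema should be checked against the Kyverno version you run:

```yaml
# Kyverno policy: reject pods whose images lack a valid Cosign signature.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: Enforce
  webhookTimeoutSeconds: 30
  rules:
    - name: require-cosign-signature
      match:
        any:
          - resources:
              kinds: ["Pod"]
      verifyImages:
        - imageReferences:
            - "registry.example.com/*"               # placeholder registry
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      <your Cosign public key>
                      -----END PUBLIC KEY-----
```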
Monitoring, observability, and GitOps deployment
Production Kubernetes demands comprehensive observability across three pillars: metrics, logs, and traces. Prometheus has become the standard for metrics collection, running as a time-series database that scrapes metrics from pods and cluster components. Grafana provides visualisation through dashboards and alerting, while tools like Alertmanager route notifications to the appropriate teams. For distributed tracing, Jaeger or Tempo help teams understand request flows across microservices, identifying latency bottlenecks and failure points.
Key metrics to monitor include pod CPU and memory utilisation, request latencies at the service level, error rates, pod restart counts, node health and capacity, persistent volume usage, and certificate expiration dates. Implement alerts for critical conditions like pods in CrashLoopBackOff, nodes NotReady, persistent volume pressure, and certificate expiration warnings. Use the four golden signals of site reliability engineering, namely latency, traffic, errors, and saturation, as the foundation for your monitoring strategy.
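If you run the Prometheus Operator together with kube-state-metrics (an assumption, not a requirement), an alert for the CrashLoopBackOff condition mentioned above could be expressed roughly as follows:

```yaml
# Alert when any container has been stuck in CrashLoopBackOff for 10 minutes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workload-alerts
  namespace: monitoring
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodCrashLooping
          expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} is crash looping"
```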
GitOps has emerged as the preferred deployment methodology for Kubernetes, with Git serving as the single source of truth for the desired state of your cluster. ArgoCD, a CNCF-graduated project, provides a rich web interface for visualising and managing application deployments, making it accessible to operations teams and developers alike. Flux CD, also CNCF-graduated, takes a Kubernetes-native approach where everything is defined as Custom Resource Definitions and reconciled continuously.
Both tools support automated synchronisation from Git repositories, drift detection, and rollback capabilities. ArgoCD is generally preferred for its visual interface and application-centric workflow, while Flux excels in multi-tenancy scenarios and platform engineering use cases. Regardless of the tool, GitOps provides audit trails through Git history, repeatable deployments, and the ability to recover from failures by simply reverting a Git commit.
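To illustrate the model, here is a minimal ArgoCD Application that keeps a cluster namespace in sync with a Git repository; the repository URL, path, and namespaces are hypothetical:

```yaml
# ArgoCD Application: continuously reconcile manifests from Git into the cluster.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/platform-config.git   # placeholder repo
    targetRevision: main
    path: apps/web/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: web
  syncPolicy:
    automated:
      prune: true        # remove resources that were deleted from Git
      selfHeal: true     # revert manual drift back to the state in Git
```

The prune and selfHeal flags are what turn drift detection into automatic remediation.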
Cost optimisation and when Kubernetes is overkill
Kubernetes cost management is a persistent challenge, particularly in cloud environments where over-provisioned resources translate directly to wasted spend. Start with right-sizing: set resource requests and limits on every pod based on actual usage data, not guesswork. Tools like Kubecost, the Kubernetes Vertical Pod Autoscaler, and cloud provider cost explorers help identify over-provisioned workloads. Implement the Horizontal Pod Autoscaler to scale workloads based on demand and the Cluster Autoscaler to adjust node counts accordingly.
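For instance, requests and limits on a container combined with a CPU-based Horizontal Pod Autoscaler might look like the sketch below; all numbers are placeholders to be replaced with values derived from your own usage data:

```yaml
# Requests drive scheduling and bin-packing; limits cap consumption.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.0.0      # placeholder image
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
---
# Scale between 2 and 10 replicas, targeting 70% average CPU utilisation.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```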
Use spot or preemptible instances for fault-tolerant workloads to achieve savings of 60 to 90 percent compared to on-demand pricing. Implement namespace resource quotas to prevent runaway consumption. Schedule non-critical workloads during off-peak hours and consider shutting down development and staging environments outside business hours. For stateful workloads, carefully evaluate whether Kubernetes operators like those for PostgreSQL, Redis, or Elasticsearch provide sufficient reliability compared to managed database services.
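A namespace ResourceQuota along the lines of the sketch below (the values are illustrative) caps what a single team or environment can consume:

```yaml
# Cap the total requests, limits, and pod count for one namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: payments
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
```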
Despite its widespread adoption, Kubernetes is not the right choice for every workload. For simple applications with predictable traffic, small teams without dedicated platform expertise, or workloads that run efficiently on serverless platforms like AWS Lambda or Cloud Run, the operational overhead of Kubernetes may outweigh its benefits. The infrastructure complexity, learning curve, and ongoing maintenance costs are significant, and organisations should honestly assess whether the benefits of container orchestration justify these investments for their specific use cases.
As a general guideline, Kubernetes becomes increasingly valuable when you have multiple teams deploying independently, microservices architectures with many components, workloads requiring auto-scaling and self-healing, or a need for consistent deployment practices across environments. For organisations running fewer than ten services with a small engineering team, simpler alternatives often deliver better outcomes.
How Shady AS can help
Running Kubernetes in production requires expertise spanning infrastructure, security, observability, and application architecture. At Shady AS SRL, our Brussels-based team has deep experience helping organisations design, deploy, and operate production Kubernetes environments on all major cloud providers. From initial architecture design and managed service selection to security hardening, GitOps implementation, and cost optimisation, we provide the guidance and hands-on support needed to run Kubernetes reliably.
Whether you are migrating your first workloads to Kubernetes, implementing a platform engineering practice, or optimising an existing production environment, our consultants bring the practical experience to accelerate your journey and avoid common pitfalls. Contact us today to discuss your containerisation strategy and learn how we can help you build a production-grade Kubernetes platform.