AWS (Recap)

what we built, why we built it in this order, and what honest practice looks like

Proud of myself

Disclaimers:

  1. Opinions expressed in this post (and in all my posts) are, unless otherwise specified, solely those of the author — me. They do not reflect the views, policies, or positions of any organization, employer, or affiliated group.

  2. I've strived for accuracy throughout this piece. If you catch any errors, please reach out — I'd be grateful for the feedback and happy to make updates!



Hook

Structural truth about intensive training formats: you can cover a lot of ground in a week, but covering ground is not the same as owning it. There is a fundamental difference between understanding the concepts and having debugged a failing rollout at 11pm the night before a production release. The gap between "I understand this" and "I can do this under pressure" is filled only by practice — and ten days of excellent training doesn't close it.

Ten days of training, a lot of diagrams drawn on whiteboards, and five instructors who clearly knew their material. I would like to thank them warmly; I learnt a great deal from them.

With this article, I would like to take a step back and look at what we built, what was (imho) missing, and what the next step is.... Music please!



ToC

  1. The platform we built
  2. Why in this order
  3. Ansible: the piece that's missing
  4. What we didn't cover
  5. Where to go from here
  6. Conclusion



The platform we built

This 10-day training series produced something concrete: Django REST APIs handling patient analyses and clinical consultation data, running on a K8s cluster on AWS, secured at every layer, deployable in any environment in minutes, and fully observable.

Here is what that means in practice:

  • A VPC in a single region, with public subnets for load balancers and private subnets for everything else
  • An EKS cluster provisioned with managed node groups, with etcd encrypted at rest using a KMS key
  • IAM roles for every component that needs AWS access, IRSA or EKS Pod Identity for pods
  • RBAC scoped to namespaces: the genomics team can only touch the genomics namespace, the clinical trials team theirs
  • Network policies enforcing zero-trust between pods: no pod talks to another unless explicitly allowed
  • Secrets stored in AWS Secrets Manager, synced into Kubernetes at runtime by the External Secrets Operator
  • The entire infrastructure provisioned by Terraform — reproducible in any environment
  • The Django REST APIs packaged as Helm charts
  • The charts stored in ECR alongside the Docker images, accessible to any teammate with the right IAM permissions
  • Platform instrumented with OpenTelemetry: metrics sent to Prometheus, logs to Loki, traces to Jaeger, all visualised in Grafana
  • SLOs defined for the REST APIs, with Alertmanager firing when the error budget burns too fast
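
To make the zero-trust bullet concrete, here is a minimal sketch of what such a network policy could look like. The namespace, labels, and port are hypothetical placeholders, not taken from the training material:

```yaml
# Sketch: only the Django API pods may reach the database pods
# in the genomics namespace; everything else is denied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-db
  namespace: genomics          # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: postgres            # hypothetical label on the DB pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: django-api  # only the API pods may connect
      ports:
        - protocol: TCP
          port: 5432
```

Pair this with a default-deny policy in the namespace, and "no pod talks to another unless explicitly allowed" becomes enforceable rather than aspirational.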



What we built, why in this order

Layers on layers on layers. Each layer depended on the one before it.

Foundations — VPC, compute, storage, regions. Before anything else, you need to understand the environment your workloads will run in. Network topology is a first-class architectural decision.

Identity and access — IAM, OIDC, KMS, CloudTrail. IAM roles and encryption are established before the cluster exists, because the cluster will inherit them from day one. It's difficult to retroactively secure a system that was built without security in mind.

Orchestration — Kubernetes, EKS, pods, deployments, services. To run workloads at scale, self-heal them, and expose them, without managing individual servers.

Workload security — RBAC, network policies, pod security, secrets management. You have a running cluster. Now lock down what runs inside it. The defaults are permissive by design; the hardening is your responsibility.

Automation — Terraform, Helm. You have built & secured a cluster you understand. Now make it reproducible. The goal is to never click through the console again — for infrastructure or for deployments.

Observability — Prometheus, Loki, Jaeger, OpenTelemetry, SLOs. The platform runs, is secured, and is automated. Now make it visible. You cannot maintain what you cannot see.

Mental model: build → orchestrate → secure → automate → observe. Each verb assumes the previous one is done. You cannot automate what you have not secured, and you cannot observe what you have not deployed. Phew!



Ansible: the piece that's missing

Somewhere during the training, we covered Ansible. I chose not to cover it as part of this series.

Yes, Ansible is useful and complementary to Terraform, Helm & Kubernetes, but not strictly essential. If you are using fully managed Kubernetes services (like AWS EKS or GKE) where the cloud provider manages the underlying node configuration, the need for Ansible is reduced.

On Kubernetes, the configuration management layer is largely handled by Helm values, ConfigMaps, and Secrets. Ansible's strengths shine on EC2 instances, bastion hosts, and hybrid environments — which were not the primary focus of this training.

To be clear, I'm not saying Ansible is worthless; it is absolutely worth learning. The official documentation and Jeff Geerling's Ansible for DevOps book are the right starting points.



What we didn't cover

The instructors gave the best they could, and no one can cover everything in ten days. Still, the following are the topics I wish had been addressed:

CI/CD pipelines — how does a git push become a Docker image in ECR and a Helm upgrade on the cluster? I know a thing or two about this topic, but I wish the instructors had said a few words about it.
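
For the curious, here is a hedged sketch of what such a pipeline could look like with GitHub Actions. The job names, chart path, and the ECR registry variable are placeholders I made up for illustration:

```yaml
# Hypothetical pipeline: git push -> Docker image in ECR -> helm upgrade
name: deploy
on:
  push:
    branches: [main]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push image to ECR
        run: |
          aws ecr get-login-password --region eu-west-3 \
            | docker login --username AWS --password-stdin "$ECR_REGISTRY"
          docker build -t "$ECR_REGISTRY/django-api:$GITHUB_SHA" .
          docker push "$ECR_REGISTRY/django-api:$GITHUB_SHA"
      - name: Deploy with Helm
        run: |
          helm upgrade --install django-api ./chart \
            --set image.tag="$GITHUB_SHA"
```

The core idea: the commit SHA travels from git to the image tag to the Helm release, so every running pod is traceable back to a commit.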

GitOps — ArgoCD and Flux take Helm one step further: Git becomes the single source of truth, and the cluster continuously reconciles itself against what's in the repository. This is how most mature platform teams deploy today.
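
A minimal sketch of the GitOps idea with an ArgoCD Application manifest — the repository URL, chart path, and namespace are hypothetical:

```yaml
# Sketch: ArgoCD watches a repo and keeps the cluster in sync with it.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: django-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform-config  # placeholder repo
    path: charts/django-api
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: genomics
  syncPolicy:
    automated:
      prune: true     # delete resources removed from Git
      selfHeal: true  # revert manual drift on the cluster
```

With `selfHeal` on, a kubectl edit made directly on the cluster gets reverted: the repository, not the cluster, is authoritative.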

Service mesh — Istio, Linkerd, and Cilium add mTLS between services, fine-grained traffic management, and advanced observability at the network layer. Once your platform has dozens of services, a service mesh stops being optional.

Backup and restore — Velero backs up Kubernetes resources and persistent volumes. What happens when someone deletes the wrong namespace in production?
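
As a hedged illustration, a Velero Schedule for nightly backups could look like this (the namespace and retention are assumptions, not training content):

```yaml
# Sketch: nightly backup of the genomics namespace, kept for 30 days.
# Restoring would then be roughly:
#   velero restore create --from-backup <backup-name>
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: genomics-nightly
  namespace: velero
spec:
  schedule: "0 2 * * *"   # every night at 02:00
  template:
    includedNamespaces:
      - genomics
    ttl: 720h0m0s         # 30-day retention
```
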

Multi-region and disaster recovery — what happens when eu-west-3 has an incident? For a platform handling patient data, this is as much a compliance question as a technical one.

Image scanning in CI — Trivy and Snyk scan Docker images for known vulnerabilities before they ever reach the cluster. Security in the build pipeline, not just at runtime.
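
A sketch of what that gate could look like as a CI step using the Trivy GitHub Action (the image reference is a placeholder):

```yaml
# Hypothetical CI step: fail the build on HIGH/CRITICAL vulnerabilities.
- name: Scan image with Trivy
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: "${{ env.ECR_REGISTRY }}/django-api:${{ github.sha }}"
    severity: HIGH,CRITICAL
    exit-code: "1"   # non-zero exit fails the pipeline
```

The point is where the check runs: a vulnerable image is rejected before it is pushed, not discovered after it is serving traffic.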



Where to go from here

Then, explore ArgoCD. Point it at a Helm chart repository in ECR and let it manage deployments declaratively. That's the missing link between Day 6 and a production-grade deployment workflow.

Finally, set up the observability stack from Day 8 and use it to actually find out why and when something broke. I'll be honest: today, if something breaks, the only thing I have to sort it out is the logs.

Planning the next step



Conclusion

Ten days of excellent content, delivered by people who clearly knew what they were talking about — thanks a million to them. I know the vocabulary; I still need to practice, a lot, to own the skills. Writing these articles was the first step in that direction. Now I need a credit card.



More on this topic

You know the drill.... see ya o/

CI/CD and GitOps:

Service mesh:

Cost and autoscaling:

Backup and resilience:

Going deeper on what was covered:


