Senior Cloud & Platform Engineer

(DevOps / SRE)

Let’s Work Together
I design and operate reliable, observable platforms and help teams ship safely at scale


Brief History

Cloud Platform Engineer with 10+ years of experience designing and operating scalable, secure, and reliable cloud platforms across AWS, Azure, and GCP.

Focused on building platforms that scale teams, not just systems.

I specialise in platform engineering and Kubernetes ecosystems: building the shared infrastructure, tooling, and standards that let engineering teams move fast and operate reliably at scale.

Over the years, I have worked across e-commerce and data-intensive platforms, shaping platform architecture and strategy, driving engineering standardisation, and helping organisations navigate cloud adoption, large-scale migrations, and operational maturity. I focus on turning complex, loosely defined problems into scalable, automated, production-ready platforms, and on influencing how engineering organisations build and operate those platforms long-term.

Here are some highlights of my profile:

Cloud Platform Engineer (10+ years)
Multi-cloud: AWS, Azure, GCP
Kubernetes & cloud-native platforms
Infrastructure as Code: Terraform, Pulumi & AWS CDK (Python)
Platform leadership, technical strategy & cross-org standardisation

Observability: Datadog, Dynatrace, OpenTelemetry
Reliability engineering & SRE practices
Technical mentoring & driving engineering standards across teams
MSc in Cybersecurity
Writing about DevOps & cloud on Medium


Experience

10+ Years Building Cloud Platforms at Scale

Senior Cloud Engineer

European Online Retail Platform

Designed and evolved a Kubernetes-based cloud platform for a large European e-commerce organisation, enabling teams to build and operate services at scale.

Owned platform architecture and roadmap for a shared AWS EKS platform used across 70+ teams.
Scaled the platform to support ~60% workload growth while maintaining 99.99% uptime.
Reduced cloud spend by ~20% through FinOps practices and platform optimisation.

Senior Site Reliability Engineer

Online Auctions & Automotive Platform

Drove reliability engineering and platform modernisation for a high-traffic auctions and automotive platform, improving system stability and engineering practices org-wide.

Defined and rolled out observability standards (monitoring, logging, alerting) across critical services.
Led cloud migration and containerisation initiatives, reducing operational toil across teams.
Established incident response and post-incident review practices to drive continuous reliability improvements.

Senior DevOps Engineer / Associate Technical Lead – DevOps

Foodservice & Supply Chain Technology

Led platform and DevOps engineering for large-scale foodservice and supply chain systems, driving architectural standards and mentoring engineers across multiple teams.

Led containerisation and cloud-native migration across multiple services, coordinating with product and engineering teams.
Defined and enforced infrastructure-as-code standards using Terraform and CI/CD across the engineering organisation.
Mentored engineers on DevOps practices and drove reliability improvements across production environments.

DevOps Engineer / Senior DevOps Engineer

Analytics & Machine Learning Products

Designed and operated cloud infrastructure for analytics and machine learning products serving customers across multiple industries, growing from engineer into a senior ownership role.

Established CI/CD pipelines, environment management, and configuration standards for multiple product teams.
Introduced SRE practices — monitoring, alerting, and on-call processes — from the ground up.
Partnered closely with data science and engineering teams to accelerate delivery and improve platform stability.

Systems Engineer

Travel & Enterprise Solutions

Supported mission-critical enterprise applications in the travel and hospitality space, maintaining high availability and contributing to release engineering improvements.

Managed application deployments and production operations for high-availability travel systems.
Streamlined release processes by collaborating with development and QA teams on deployment automation.
Built strong foundations in Linux, networking, and scripting — applied across all subsequent platform engineering roles.

Associate Application Support Engineer

Capital Markets & Trading Platforms

Started my career supporting capital markets and trading platforms, working closely with customers and engineering teams.

Provided application support and troubleshooting for trading systems.
Helped investigate incidents and performance issues.
Built a foundation in financial technology and market data.

Projects & Case Studies

Selected impact stories

A few examples of real projects where I designed, debugged, and improved cloud platforms with measurable impact.

Fixing EKS Cluster Autoscaler after AL2023 migration (IRSA + RBAC)

Role: Senior Cloud Engineer for a high-traffic e-commerce EKS platform.

During our migration from Amazon Linux 2 (AL2) to Amazon Linux 2023 (AL2023), the EKS Cluster Autoscaler suddenly stopped scaling: pods were stuck in Pending and logs showed “Failed to get nodes from apiserver: Unauthorized”. The stricter instance metadata (IMDS) defaults on AL2023 broke our previous assumption that the autoscaler could “borrow” the node IAM role.

Identified that the autoscaler was implicitly using the node IAM role via EC2 instance metadata, which no longer worked with AL2023 defaults.
Moved the autoscaler to a dedicated IAM Role for Service Accounts (IRSA) with least-privilege AWS permissions.
Created a Kubernetes service account + RBAC role so the autoscaler had exactly the cluster permissions it needed (nodes, pods, leases, etc.).
Cleaned up legacy permissions on the node IAM role to remove hidden dependency on metadata and reduce blast radius.
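To make the IRSA step above concrete, here is a minimal Python sketch of the two pieces that tie a pod to a dedicated IAM role: the role's trust policy and the annotated Kubernetes service account. It is an illustration, not the stack used in the project. The account ID, OIDC provider ID, role name, namespace and service account name are all placeholder assumptions.

```python
import json


def irsa_trust_policy(account_id, oidc_provider, namespace, service_account):
    """Build an IAM trust policy that lets exactly one service account assume a role.

    oidc_provider is the cluster's OIDC issuer without the https:// prefix.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Pin the role to a single namespace/service account pair.
                        f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                        f"{oidc_provider}:aud": "sts.amazonaws.com",
                    }
                },
            }
        ],
    }


def service_account_manifest(namespace, name, role_arn):
    """Build the ServiceAccount manifest carrying the IRSA role annotation."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"eks.amazonaws.com/role-arn": role_arn},
        },
    }


if __name__ == "__main__":
    # Placeholder values (assumptions for illustration only).
    oidc = "oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLE"
    policy = irsa_trust_policy("123456789012", oidc, "kube-system", "cluster-autoscaler")
    print(json.dumps(policy, indent=2))
```

The trust policy only answers "who may assume this role". What the autoscaler may do in AWS still comes from the permissions attached to the role, and what it may do inside the cluster still comes from Kubernetes RBAC.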

Impact: Restored safe, predictable autoscaling on AL2023 in non-production before touching production, and created a reusable IRSA + RBAC pattern for cluster controllers (Cluster Autoscaler, ExternalDNS, load balancer controllers) across the organisation.

Read the full story on Medium »

Key lessons & tech stack

Move critical controllers (like Cluster Autoscaler) to IRSA or Pod Identity before changing AMIs.
Separate concerns: IRSA for AWS APIs, Kubernetes RBAC for what the pod can do inside the cluster.
Treat AMI upgrades as application changes: test in non-production with cordon/drain and synthetic scale-up/scale-down runs.

Why IRSA here: For this incident we used IRSA as the fastest safe fix: the cluster already had an OIDC provider, the Helm chart supported the “service account + annotation” pattern, and our AWS CDK stack had IRSA helpers. Pod Identity stays on the roadmap for new clusters where we can design the model from day one.

Tech stack: AWS EKS, Amazon Linux 2 & Amazon Linux 2023, Kubernetes Cluster Autoscaler, IAM Roles for Service Accounts (IRSA), Kubernetes RBAC, EKS OIDC, Terraform / AWS CDK (Python), Datadog.

Tag-based log retention in Datadog – giving ownership back to teams

Role: Senior Cloud & DevOps Engineer, leading Datadog governance for a multi-team engineering organisation.

Our Datadog logs setup “worked”, but ownership and costs were blurry. Dozens of teams shipped logs with inconsistent tags and ad-hoc indexes, making it hard to see who owned which volume, how long data stayed, and why costs kept creeping up.

Designed a tag-based, retention-first indexing strategy where teams choose their retention while the platform enforces guardrails.
Defined mandatory tags on every log: team, costcenter, appgroup, env and retention.
Replaced per-team indexes with shared “retention lanes” (3 / 7 / 15 / 30 / 90 days) driven entirely by tags.
Introduced a temporary “punishment lane” with short retention and a quota for untagged or badly tagged logs.
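The routing rule behind the retention lanes can be sketched in a few lines of Python. This is a local model of the decision, not Datadog configuration. The fallback index name and the idea that the retention tag holds a plain number of days are assumptions made for the example.

```python
MANDATORY_TAGS = ("team", "costcenter", "appgroup", "env", "retention")
ALLOWED_RETENTION_DAYS = (3, 7, 15, 30, 90)
FALLBACK_INDEX = "index-untagged-temporary"  # hypothetical name for the temporary lane


def route_log(tags: dict[str, str]) -> str:
    """Return the index a log would land in, based only on its tags."""
    # Any missing or empty mandatory tag sends the log to the temporary lane.
    if any(not tags.get(tag) for tag in MANDATORY_TAGS):
        return FALLBACK_INDEX
    try:
        days = int(tags["retention"])
    except ValueError:
        return FALLBACK_INDEX
    # Only the agreed retention values get a proper lane.
    if days not in ALLOWED_RETENTION_DAYS:
        return FALLBACK_INDEX
    return f"index-retention-period-{days:02d}"


if __name__ == "__main__":
    good = {"team": "checkout", "costcenter": "cc-42", "appgroup": "shop",
            "env": "prod", "retention": "30"}
    print(route_log(good))                       # fully tagged log
    print(route_log({**good, "retention": "45"}))  # retention value not allowed
```

Keeping the rule this small is the point of the design: a team changes its retention by changing one tag, and anything that does not follow the contract is visibly routed to the short-lived lane.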

Impact: Made log retention an explicit product team decision instead of a central bottleneck, improved cost transparency and paved the way for full IaC ownership of Datadog log indexes and enforcement rules.

Read the full story on Medium »

Index model, guardrails & automation

Created strict retention indexes such as index-retention-period-03, -07, -15, -30, -90 matching only fully tagged logs with allowed retention values.
Added a 7-day temporary index with a daily quota for logs missing mandatory tags, giving teams short-term visibility but a strong incentive to fix tagging.
Built a global Datadog monitor that detects logs with missing tags or invalid retention and alerts the platform team.
Implemented a LogsIndexManager module in Pulumi (Python) to manage indexes, routing rules and (optionally) index order and enforcement.
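As a rough illustration of the index model, the sketch below generates an ordered list of index definitions as plain Python data. It is not the LogsIndexManager module itself. The filter query strings, the fallback index name and the daily limit value are assumptions, and a real module would hand such specs to the pulumi-datadog provider rather than print them.

```python
from dataclasses import dataclass
from typing import Optional

REQUIRED_TAGS = ("team", "costcenter", "appgroup", "env")
RETENTION_DAYS = (3, 7, 15, 30, 90)


@dataclass(frozen=True)
class IndexSpec:
    name: str
    filter_query: str
    retention_days: int
    daily_limit: Optional[int] = None  # None means no quota


def build_index_specs(fallback_daily_limit: int) -> list[IndexSpec]:
    """Return index specs in evaluation order: strict lanes first, catch-all last."""
    # Assumed query style: "tag:*" requires the tag to be present on the log.
    required = " ".join(f"{tag}:*" for tag in REQUIRED_TAGS)
    specs = [
        IndexSpec(
            name=f"index-retention-period-{days:02d}",
            filter_query=f"{required} retention:{days}",
            retention_days=days,
        )
        for days in RETENTION_DAYS
    ]
    # Catch-all temporary index: 7 days of retention plus a daily quota.
    specs.append(
        IndexSpec(
            name="index-untagged-temporary",  # hypothetical name
            filter_query="*",
            retention_days=7,
            daily_limit=fallback_daily_limit,
        )
    )
    return specs


if __name__ == "__main__":
    for spec in build_index_specs(fallback_daily_limit=1_000_000):  # placeholder quota
        print(spec)
```

The order of the list matters, because a log is stored in the first index whose filter matches it. That is why the catch-all entry has to come last.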

Tech stack: Datadog logs & monitors, tag-based routing and indexes, Pulumi (pulumi-datadog), AWS workloads (EKS/Lambda/EC2), shared tagging model for logs, metrics and traces across 70+ engineering teams.

Migrating container images from GCP to AWS ECR safely and repeatably

Role: Platform/DevOps Engineer leading a registry migration from Google Artifact Registry / Container Registry to AWS ECR.

As part of a wider platform move to AWS, dozens of image repositories had to move from GCP (with hierarchical paths like eu.gcr.io/project/app/service) to AWS ECR, which uses flatter repositories and tags. A naive “pull & push” risked overwriting tags or losing the original structure.

Designed a deterministic mapping from GCP’s hierarchical image paths to ECR repositories and tags (for example project-app-service:1.2.3).
Built a Python CLI that discovers tags, pulls from GCP, retags, pushes to ECR and validates digests to ensure images are identical.
Added tag filtering (semver awareness, prefix filters, --limit) and a safe dry-run mode with clear logging.
Included retry logic and optional cleanup so teams could migrate repositories one by one with confidence.
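A minimal sketch of the mapping and tag-filtering ideas above, in Python. It is not the CLI itself. The function names, the default tag and the exact semver pattern are assumptions made for the example.

```python
import re

# Plain MAJOR.MINOR.PATCH with an optional leading "v" (an assumed, simplified pattern).
SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def map_image(source: str) -> tuple[str, str]:
    """Map 'eu.gcr.io/project/app/service:1.2.3' to ('project-app-service', '1.2.3')."""
    name, sep, tag = source.rpartition(":")
    if not sep or "/" in tag:
        # No tag given (or the colon belonged to a registry port): assume 'latest'.
        name, tag = source, "latest"
    _registry, _, path = name.partition("/")
    if not path:
        raise ValueError(f"expected registry/path[:tag], got {source!r}")
    # Flatten the hierarchy into a single repository name.
    return path.replace("/", "-").lower(), tag


def select_tags(tags, prefix="", semver_only=False, limit=None):
    """Filter tags by prefix and optionally by semver, newest version first."""
    chosen = [t for t in tags if t.startswith(prefix)]
    if semver_only:
        chosen = [t for t in chosen if SEMVER.match(t)]
        chosen.sort(
            key=lambda t: tuple(int(p) for p in SEMVER.match(t).groups()),
            reverse=True,
        )
    return chosen[:limit]  # a limit of None keeps everything


if __name__ == "__main__":
    print(map_image("eu.gcr.io/project/app/service:1.2.3"))
    print(select_tags(["1.2.3", "latest", "1.10.0", "1.9.1"], semver_only=True, limit=2))
```

One caveat with this simple flattening: two different source paths such as a/b-c and a-b/c would map to the same repository name, so a real tool should detect collisions before pushing anything.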

Impact: Enabled teams to migrate image repositories without accidentally overwriting tags or losing traceability, and produced a reusable migration tool that can be shared or open-sourced for similar GCP → ECR moves.

Read the full story on Medium »

Migration workflow & tech stack

Discover images and tags from GCP, then normalise hierarchical image names into ECR-compatible repository + tag pairs using a deterministic mapping.
For each selected tag: pull from GCP → retag to the mapped ECR repository/tag → push to ECR → compare digests.
On digest match, optionally clean up local images and, if desired, the source images in GCP.
Log every action (discovery, mapping, pull, push, validation) with clear, human-readable output so teams can audit exactly how GCP paths were translated into ECR.
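The per-tag workflow above could look roughly like the following Python sketch, which defaults to a dry run. It only assumes the standard docker pull, tag and push commands. The get_digest callable is a hypothetical hook for whatever digest lookup the real tool uses, and the registry hostnames are placeholders.

```python
import subprocess


def migrate_tag(source_ref, target_ref, *, dry_run=True,
                run=subprocess.run, get_digest=None):
    """Pull, retag and push one image tag, returning a human-readable log.

    In dry-run mode nothing is executed; the log shows what would happen.
    """
    steps = [
        ["docker", "pull", source_ref],
        ["docker", "tag", source_ref, target_ref],
        ["docker", "push", target_ref],
    ]
    log = []
    for cmd in steps:
        prefix = "DRY-RUN" if dry_run else "RUN"
        log.append(f"{prefix}: {' '.join(cmd)}")
        if not dry_run:
            run(cmd, check=True)  # stop on the first failing command

    # Validate only after a real push, and only if a digest lookup was supplied.
    if not dry_run and get_digest is not None:
        src, dst = get_digest(source_ref), get_digest(target_ref)
        if src != dst:
            raise RuntimeError(f"digest mismatch: {src} != {dst}")
        log.append(f"OK: digests match ({src})")
    return log


if __name__ == "__main__":
    # Placeholder references (assumptions for illustration only).
    source = "eu.gcr.io/project/app/service:1.2.3"
    target = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/project-app-service:1.2.3"
    for line in migrate_tag(source, target):
        print(line)
```

Passing run and get_digest as parameters keeps the workflow testable without a Docker daemon, and makes the dry-run output identical in shape to the real run, which helps when teams audit a migration before executing it.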

Tech stack: Python, Docker CLI, Google Artifact Registry / Container Registry, AWS ECR, AWS CLI, bash automation and CI integration where needed. The tool encapsulates the GCP hierarchical naming model and the flatter AWS ECR repository/tag model so teams don’t have to think about it on every migration.

Writing & Talks

From my blog & podcast

I write about cloud, DevOps, and platform engineering on Medium, and occasionally join podcasts to share lessons from real-world migrations and incidents.