Brief History

Cloud Platform Engineer with 10+ years of experience designing and operating scalable, secure, and reliable cloud platforms across AWS, Azure, and GCP.

Focused on building platforms that scale teams, not just systems.

I specialize in platform engineering and Kubernetes ecosystems — building the shared infrastructure, tooling, and standards that let engineering teams move fast and operate reliably at scale.

Over the years, I have worked across e-commerce and data-intensive platforms, shaping platform architecture and strategy, driving engineering standardisation, and helping organisations navigate cloud adoption, large-scale migrations, and operational maturity. I focus on turning complex, loosely defined problems into scalable, automated, production-ready platforms, and on influencing how engineering organisations build and operate those platforms over the long term.

Here are some highlights of my profile:


  • Cloud Platform Engineer (10+ years)
  • Multi-cloud: AWS, Azure, GCP
  • Kubernetes & cloud-native platforms
  • Infrastructure as Code: Terraform, Pulumi & AWS CDK (Python)
  • Platform leadership, technical strategy & cross-org standardisation
  • Observability: Datadog, Dynatrace, OpenTelemetry
  • Reliability engineering & SRE practices
  • Technical mentoring & driving engineering standards across teams
  • MSc in Cybersecurity
  • Writing about DevOps & cloud on Medium


Experience

10+ Years Building Cloud Platforms at Scale

Senior Cloud Engineer

European Online Retail Platform

Designed and evolved a Kubernetes-based cloud platform for a large European e-commerce organisation, enabling teams to build and operate services at scale.

  • Owned platform architecture and roadmap for a shared AWS EKS platform used across 70+ teams.
  • Scaled the platform to support ~60% workload growth while maintaining 99.99% uptime.
  • Reduced cloud spend by ~20% through FinOps practices and platform optimisation.

Senior Site Reliability Engineer

Online Auctions & Automotive Platform

Drove reliability engineering and platform modernisation for a high-traffic auctions and automotive platform, improving system stability and engineering practices org-wide.

  • Defined and rolled out observability standards (monitoring, logging, alerting) across critical services.
  • Led cloud migration and containerisation initiatives, reducing operational toil across teams.
  • Established incident response and post-incident review practices to drive continuous reliability improvements.

Senior DevOps Engineer / Associate Technical Lead – DevOps

Foodservice & Supply Chain Technology

Led platform and DevOps engineering for large-scale foodservice and supply chain systems, driving architectural standards and mentoring engineers across multiple teams.

  • Led containerisation and cloud-native migration across multiple services, coordinating with product and engineering teams.
  • Defined and enforced infrastructure-as-code standards using Terraform and CI/CD across the engineering organisation.
  • Mentored engineers on DevOps practices and drove reliability improvements across production environments.

DevOps Engineer / Senior DevOps Engineer

Analytics & Machine Learning Products

Designed and operated cloud infrastructure for analytics and machine learning products serving customers across multiple industries, growing from engineer to senior ownership.

  • Established CI/CD pipelines, environment management, and configuration standards for multiple product teams.
  • Introduced SRE practices — monitoring, alerting, and on-call processes — from the ground up.
  • Partnered closely with data science and engineering teams to accelerate delivery and improve platform stability.

Systems Engineer

Travel & Enterprise Solutions

Supported mission-critical enterprise applications in the travel and hospitality space, maintaining high availability and contributing to release engineering improvements.

  • Managed application deployments and production operations for high-availability travel systems.
  • Streamlined release processes by collaborating with development and QA teams on deployment automation.
  • Built strong foundations in Linux, networking, and scripting — applied across all subsequent platform engineering roles.

Associate Application Support Engineer

Capital Markets & Trading Platforms

Started my career supporting capital markets and trading platforms, working closely with customers and engineering teams.

  • Provided application support and troubleshooting for trading systems.
  • Helped investigate incidents and performance issues.
  • Built a foundation in financial technology and market data.
Projects & Case Studies

Selected impact stories

A few examples of real projects where I designed, debugged, and improved cloud platforms with measurable impact.

Fixing EKS Cluster Autoscaler after AL2023 migration (IRSA + RBAC)

Role: Senior Cloud Engineer for a high-traffic e-commerce EKS platform.

During our migration from Amazon Linux 2 (AL2) to Amazon Linux 2023 (AL2023), the EKS Cluster Autoscaler suddenly stopped scaling: pods were stuck in Pending and logs showed “Failed to get nodes from apiserver: Unauthorized”. The tighter metadata behaviour on AL2023 broke our previous assumption that the autoscaler could “borrow” the node IAM role.

  • Identified that the autoscaler was implicitly using the node IAM role via EC2 instance metadata, which no longer worked with AL2023 defaults.
  • Moved the autoscaler to a dedicated IAM Role for Service Accounts (IRSA) with least-privilege AWS permissions.
  • Created a Kubernetes service account + RBAC role so the autoscaler had exactly the cluster permissions it needed (nodes, pods, leases, etc.).
  • Cleaned up legacy permissions on the node IAM role to remove hidden dependency on metadata and reduce blast radius.
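The IRSA part of the steps above can be sketched roughly as follows. This is a minimal illustration, not the actual CDK or Terraform code from the project: it only builds the IAM trust policy and the annotated service account as plain Python dictionaries. The function names, the account ID, the OIDC issuer URL, the namespace and the service account name are all placeholders chosen for the example.

```python
import json


def irsa_trust_policy(account_id: str, oidc_issuer: str,
                      namespace: str, service_account: str) -> dict:
    """Trust policy that lets exactly one Kubernetes service account assume the role."""
    # IAM refers to the OIDC issuer without the https:// scheme.
    issuer = oidc_issuer
    if issuer.startswith("https://"):
        issuer = issuer[len("https://"):]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{issuer}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Pin the role to one namespace/service account pair.
                        f"{issuer}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                        f"{issuer}:aud": "sts.amazonaws.com",
                    }
                },
            }
        ],
    }


def service_account_manifest(namespace: str, name: str, role_arn: str) -> dict:
    """Service account carrying the annotation that links it to the IAM role."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"eks.amazonaws.com/role-arn": role_arn},
        },
    }


# Placeholder values for illustration only.
policy = irsa_trust_policy(
    account_id="111122223333",
    oidc_issuer="https://oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLE",
    namespace="kube-system",
    service_account="cluster-autoscaler",
)
policy_json = json.dumps(policy, indent=2)
```

The `sub` condition is what keeps the role scoped to a single service account, so other pods on the same node cannot borrow it the way they could borrow the node role.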

Impact: Restored safe, predictable autoscaling on AL2023 in non-production before touching production, and created a reusable IRSA + RBAC pattern for controllers such as Cluster Autoscaler, ExternalDNS and load balancer controllers across the organisation.

Read the full story on Medium »

Key lessons & tech stack
  • Move critical controllers (like Cluster Autoscaler) to IRSA or Pod Identity before changing AMIs.
  • Separate concerns: IRSA for AWS APIs, Kubernetes RBAC for what the pod can do inside the cluster.
  • Treat AMI upgrades as application changes: test in non-production with cordon/drain and synthetic scale-up/scale-down runs.
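To make the "separate concerns" lesson concrete, here is a small sketch that keeps the two permission sets apart: an IAM policy for AWS API calls and Kubernetes RBAC rules for in-cluster access. Both are illustrative subsets written for this example, not the complete policy or ClusterRole used on the platform or shipped upstream, and the helper `aws_services` is a hypothetical name.

```python
# AWS side: what the autoscaler may call on AWS APIs (attached to the IRSA role).
# Illustrative subset; a real policy would likely scope the write actions further.
AUTOSCALER_IAM_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:DescribeAutoScalingInstances",
                "ec2:DescribeLaunchTemplateVersions",
            ],
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": [
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup",
            ],
            "Resource": "*",
        },
    ],
}

# Kubernetes side: what the pod may do inside the cluster (bound to its service account).
# Illustrative subset covering the resources named above: nodes, pods, leases.
AUTOSCALER_RBAC_RULES = [
    {"apiGroups": [""], "resources": ["nodes", "pods"], "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["coordination.k8s.io"], "resources": ["leases"],
     "verbs": ["get", "create", "update"]},
]


def aws_services(policy: dict) -> set:
    """Return the AWS service prefixes a policy touches, for a quick review."""
    return {
        action.split(":", 1)[0]
        for statement in policy["Statement"]
        for action in statement["Action"]
    }


services = aws_services(AUTOSCALER_IAM_POLICY)
```

Keeping the two documents separate makes reviews simpler: an AWS reviewer only needs the IAM policy, and a cluster reviewer only needs the RBAC rules.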

Why IRSA here: IRSA was the fastest safe fix for this incident. The cluster already had an OIDC provider, the Helm chart supported the “service account + annotation” pattern, and our AWS CDK stack had IRSA helpers. Pod Identity stays on the roadmap for new clusters where we can design the model from day one.

Tech stack: AWS EKS, Amazon Linux 2 & Amazon Linux 2023, Kubernetes Cluster Autoscaler, IAM Roles for Service Accounts (IRSA), Kubernetes RBAC, EKS OIDC, Terraform / AWS CDK (Python), Datadog.

Tag-based log retention in Datadog – giving ownership back to teams

Role: Senior Cloud & DevOps Engineer, leading Datadog governance for a multi-team engineering organisation.

Our Datadog logs setup “worked”, but ownership and costs were blurry. Dozens of teams shipped logs with inconsistent tags and ad-hoc indexes, making it hard to see who owned which volume, how long data stayed, and why costs kept creeping up.

  • Designed a tag-based, retention-first indexing strategy where teams choose their retention while the platform team enforces guardrails.
  • Defined mandatory tags on every log: team, costcenter, appgroup, env and retention.
  • Replaced per-team indexes with shared “retention lanes” (3 / 7 / 15 / 30 / 90 days) driven entirely by tags.
  • Introduced a temporary “punishment lane” with short retention and a quota for untagged or badly tagged logs.
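The routing rule described above can be sketched as a small classifier. This is only a model of the decision logic in plain Python; in practice the routing happens through Datadog index filters, not application code. The lane names and the function name `choose_lane` are hypothetical.

```python
MANDATORY_TAGS = ("team", "costcenter", "appgroup", "env", "retention")
ALLOWED_RETENTION_DAYS = (3, 7, 15, 30, 90)
FALLBACK_LANE = "untagged-short-retention"  # hypothetical name for the "punishment lane"


def choose_lane(tags: dict) -> str:
    """Pick a retention lane for a log based only on its tags."""
    # Any missing or empty mandatory tag sends the log to the fallback lane.
    if any(not tags.get(tag) for tag in MANDATORY_TAGS):
        return FALLBACK_LANE
    try:
        days = int(tags["retention"])
    except ValueError:
        return FALLBACK_LANE
    # Only the agreed retention periods get a dedicated lane.
    if days not in ALLOWED_RETENTION_DAYS:
        return FALLBACK_LANE
    return f"retention-{days:02d}d"


well_tagged = {
    "team": "checkout",
    "costcenter": "cc-123",
    "appgroup": "payments",
    "env": "prod",
    "retention": "7",
}
lane = choose_lane(well_tagged)  # "retention-07d"
```

Because the lane depends only on tags, a team changes its retention by changing one tag value, without asking the platform team for a new index.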

Impact: Made log retention an explicit product team decision instead of a central bottleneck, improved cost transparency and paved the way for full IaC ownership of Datadog log indexes and enforcement rules.

Read the full story on Medium »

Index model, guardrails & automation
  • Created strict retention indexes such as index-retention-period-03, -07, -15, -30, -90 matching only fully tagged logs with allowed retention values.
  • Added a 7-day temporary index with a daily quota for logs missing mandatory tags, giving teams short-term visibility but a strong incentive to fix tagging.
  • Built a global Datadog monitor that detects logs with missing tags or invalid retention and alerts the platform team.
  • Implemented a LogsIndexManager module in Pulumi (Python) to manage indexes, routing rules and (optionally) index order and enforcement.
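A rough sketch of how such index definitions could be generated from one list of retention periods is shown below. It deliberately stops at plain data and does not call pulumi-datadog, so it is not the LogsIndexManager module itself. The `IndexSpec` type, the fallback index name, the daily limit value and the exact filter query strings are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

REQUIRED_TAGS = ("team", "costcenter", "appgroup", "env")
RETENTION_DAYS = (3, 7, 15, 30, 90)


@dataclass(frozen=True)
class IndexSpec:
    name: str
    filter_query: str
    retention_days: int
    daily_limit: Optional[int] = None  # None means no quota


def retention_indexes() -> List[IndexSpec]:
    """One strict index per allowed retention value, matching only fully tagged logs."""
    # Assumed query form: every required tag must be present with some value.
    tagged = " ".join(f"{tag}:*" for tag in REQUIRED_TAGS)
    return [
        IndexSpec(
            name=f"index-retention-period-{days:02d}",
            filter_query=f"{tagged} retention:{days}",
            retention_days=days,
        )
        for days in RETENTION_DAYS
    ]


def all_indexes(fallback_daily_limit: int) -> List[IndexSpec]:
    """Strict indexes first, then a catch-all 7-day index with a daily quota."""
    fallback = IndexSpec(
        name="index-temporary-untagged",  # hypothetical name
        filter_query="*",
        retention_days=7,
        daily_limit=fallback_daily_limit,
    )
    return retention_indexes() + [fallback]


indexes = all_indexes(fallback_daily_limit=10_000_000)  # placeholder quota
index_order = [spec.name for spec in indexes]
```

The order matters because a log lands in the first index whose filter it matches, so the catch-all index has to come last.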

Tech stack: Datadog logs & monitors, tag-based routing and indexes, Pulumi (pulumi-datadog), AWS workloads (EKS/Lambda/EC2), shared tagging model for logs, metrics and traces across 70+ engineering teams.

Migrating container images from GCP to AWS ECR safely and repeatably

Role: Platform/DevOps Engineer leading a registry migration from Google Artifact Registry / Container Registry to AWS ECR.

As part of a wider platform move to AWS, dozens of image repositories had to move from GCP (with hierarchical paths like eu.gcr.io/project/app/service) to AWS ECR, which uses flatter repositories and tags. A naive “pull & push” risked overwriting tags or losing the original structure.

  • Designed a deterministic mapping from GCP’s hierarchical image paths to ECR repositories and tags (for example project-app-service:1.2.3).
  • Built a Python CLI that discovers tags, pulls from GCP, retags, pushes to ECR and validates digests to ensure images are identical.
  • Added tag filtering (semver awareness, prefix filters, --limit) and a safe dry-run mode with clear logging.
  • Included retry logic and optional cleanup so teams could migrate repositories one by one with confidence.
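The mapping and tag filtering can be illustrated with a short sketch. It is not the CLI itself, only one plausible way to implement the two ideas: flattening a hierarchical GCP path into an ECR repository name and selecting which tags to migrate. The function names and the exact semver pattern are assumptions.

```python
import re
from typing import List, Optional, Tuple

SEMVER = re.compile(r"^v?\d+\.\d+\.\d+$")


def map_image(source: str) -> Tuple[str, str]:
    """Map 'eu.gcr.io/project/app/service:1.2.3' to ('project-app-service', '1.2.3')."""
    path, sep, tag = source.rpartition(":")
    # Require an explicit tag; a ':' that belongs to a registry port does not count.
    if not sep or not tag or "/" in tag:
        raise ValueError(f"expected an explicit tag in {source!r}")
    _registry, _, repo_path = path.partition("/")
    if not repo_path:
        raise ValueError(f"expected a repository path in {source!r}")
    return repo_path.replace("/", "-").lower(), tag


def select_tags(tags: List[str], *, prefix: str = "", semver_only: bool = False,
                limit: Optional[int] = None) -> List[str]:
    """Filter tags by prefix and (optionally) semver, newest semver first."""
    selected = [t for t in tags if t.startswith(prefix)]
    if semver_only:
        selected = [t for t in selected if SEMVER.match(t)]
        # Sort numerically so that 1.10.0 ranks above 1.2.3.
        selected.sort(key=lambda t: tuple(int(p) for p in t.lstrip("v").split(".")),
                      reverse=True)
    return selected if limit is None else selected[:limit]


repo, tag = map_image("eu.gcr.io/project/app/service:1.2.3")
newest = select_tags(["1.2.3", "latest", "1.10.0", "dev-1"], semver_only=True, limit=1)
```

One caveat with a flattening scheme like this: two different source paths (for example `a/b-c` and `a-b/c`) would map to the same repository name, so a real tool would probably want to detect such collisions before pushing.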

Impact: Enabled teams to migrate image repositories without accidentally overwriting tags or losing traceability, and produced a reusable migration tool that can be shared or open-sourced for similar GCP → ECR moves.

Read the full story on Medium »

Migration workflow & tech stack
  • Discover images and tags from GCP, then normalise hierarchical image names into ECR-compatible repository + tag pairs using a deterministic mapping.
  • For each selected tag: pull from GCP → retag to the mapped ECR repository/tag → push to ECR → compare digests.
  • On digest match, optionally clean up local images and, if desired, the source images in GCP.
  • Log every action (discovery, mapping, pull, push, validation) with clear, human-readable output so teams can audit exactly how GCP paths were translated into ECR.
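The per-tag workflow above could look roughly like this. The sketch builds the Docker CLI commands as a plan first, so a dry run can print exactly what would happen without touching any registry. How the digests are obtained is left out on purpose; `plan_migration`, `execute` and `digests_match` are hypothetical names, and the registry hostnames are placeholders.

```python
import shlex
import subprocess
from typing import List


def plan_migration(source_ref: str, ecr_registry: str, repo: str, tag: str) -> List[List[str]]:
    """Return the pull -> retag -> push commands for one image tag."""
    target_ref = f"{ecr_registry}/{repo}:{tag}"
    return [
        ["docker", "pull", source_ref],
        ["docker", "tag", source_ref, target_ref],
        ["docker", "push", target_ref],
    ]


def execute(plan: List[List[str]], dry_run: bool = True) -> List[str]:
    """Run each command, or only render it when dry_run is set."""
    rendered = []
    for command in plan:
        rendered.append(shlex.join(command))
        if not dry_run:
            # check=True stops the migration at the first failing step.
            subprocess.run(command, check=True)
    return rendered


def digests_match(source_digest: str, target_digest: str) -> bool:
    """Compare two 'sha256:...' digests, ignoring case and surrounding whitespace."""
    src = source_digest.strip().lower()
    dst = target_digest.strip().lower()
    return bool(src) and src == dst


plan = plan_migration(
    source_ref="eu.gcr.io/project/app/service:1.2.3",
    ecr_registry="111122223333.dkr.ecr.eu-west-1.amazonaws.com",  # placeholder account/region
    repo="project-app-service",
    tag="1.2.3",
)
log_lines = execute(plan, dry_run=True)
```

Separating planning from execution keeps the dry-run output and the real run identical by construction, which helps when teams audit how a GCP path was translated.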

Tech stack: Python, Docker CLI, Google Artifact Registry / Container Registry, AWS ECR, AWS CLI, bash automation and CI integration where needed. The tool encapsulates the GCP hierarchical naming model and the flatter AWS ECR repository/tag model so teams don’t have to think about it on every migration.

Writing & Talks

From my blog & podcast

I write about cloud, DevOps and platform engineering on Medium, and occasionally join podcasts to share lessons from real-world migrations and incidents.

Podcast / YouTube talk

A recent conversation where I talk about my work, platform engineering and lessons learned.

Tech stack

Tools I Use

  • AWS: Cloud Platform
  • Azure: Cloud Platform
  • GCP: Cloud Platform
  • Docker: Containers
  • Kubernetes: Container Orchestration
  • ServiceNow: Incident & Change Management
  • Jira Service Management: Incident & Change Management
  • xMatters: Incident Escalation
  • CrowdStrike: Security Scan
  • Git: SCM, CI/CD
  • Jenkins: Automation / CI-CD
  • Datadog: Monitoring
  • Terraform: Infrastructure as Code

Education

School & University


School Education – 2011

Completed primary and secondary education at Royal College, Colombo 7, Sri Lanka.



Bachelor of Science – 2016

BSc (Hons) in Information Technology from Sri Lanka Institute of Information Technology (SLIIT).



Master of Science – 2019

MSc in Information Technology – Cyber Security from Sri Lanka Institute of Information Technology (SLIIT).



Get in touch



Contact Me


Location:
Rotterdam, South Holland, Netherlands

Email:

info@dilshanwijesooriya.me