Brief History

I am a Senior Cloud & Platform Engineer (DevOps & SRE) with 9+ years of experience building and operating platforms on AWS, GCP and Azure. I focus on reliable cloud-native platforms and Kubernetes ecosystems, backed by Infrastructure as Code and strong observability – helping product teams ship faster, stay resilient and keep cloud costs under control.

Over the last decade, I have worked with e-commerce, fintech and analytics teams, designing and operating cloud platforms that power real-world products. Recently my work has spanned AWS, Azure and GCP, using Terraform and AWS CDK in Python, Kubernetes where it fits, and observability tooling such as Datadog and Dynatrace.

I enjoy turning loosely defined problems into automated, reliable platforms – from greenfield designs to migrations, cost optimisation and incident response.

Here are some highlights of my profile:


  • Senior Cloud & DevOps Engineer (9+ years)
  • Hands-on with AWS, Azure & GCP
  • Infrastructure as Code: Terraform & AWS CDK (Python)
  • Writes about DevOps & cloud on Medium
  • Strong incident management and on-call experience
  • MSc in Cybersecurity
  • DevOps mindset – close collaboration with developers
  • Microservices & Kubernetes platforms
  • Change & release management in regulated environments
  • CI/CD automation and platform governance


Experience

Who I am and what I do

Senior Cloud Engineer

European Online Retail Platform

Designed and operated a Kubernetes-based cloud platform for a large European e-commerce organisation, hosting high-traffic web and backend services.

  • Deployed and evolved AWS EKS using Infrastructure as Code (Terraform and AWS CDK in Python).
  • Supported ~60% workload growth while maintaining 99.99% uptime.
  • Reduced cloud spend by ~20% through rightsizing, cleanup and autoscaling improvements.

Site Reliability Engineer

Online Auctions & Automotive Platform

Worked as an SRE for an online auctions and automotive platform, focusing on reliability, performance and modernising legacy systems.

  • Implemented monitoring, logging and alerting for critical services.
  • Helped drive cloud migration and containerisation efforts.
  • Supported incident response and post-incident reviews.

Associate Technical Lead – DevOps

Global Payments & Fintech

Led DevOps initiatives for a global payments and fintech organisation, building secure, scalable infrastructure for payment and merchant services.

  • Designed CI/CD pipelines and environments for payment services.
  • Strengthened security and compliance posture of cloud workloads.
  • Collaborated with engineering and product on deployment strategies.

Senior DevOps Engineer

Foodservice & Supply Chain Technology

Supported large-scale foodservice and supply chain systems, modernising infrastructure and improving deployment workflows.

  • Migrated services to containerised and cloud-native architectures.
  • Automated infrastructure and deployments with Terraform and CI/CD.
  • Improved reliability and observability across multiple environments.

DevOps Engineer / Senior DevOps Engineer

Analytics & Machine Learning Products

Built and operated cloud infrastructure for analytics and machine learning products used by customers across multiple industries.

  • Managed CI/CD, environments and configuration for product teams.
  • Introduced monitoring, logging and alerting as part of SRE practices.
  • Worked closely with data science and engineering teams.

Systems Engineer

Travel & Enterprise Solutions

Worked on enterprise systems in the travel and hospitality space, supporting mission-critical applications and infrastructure.

  • Supported application deployments and production operations.
  • Collaborated with developers and QA on release processes.
  • Gained strong foundations in Linux, networking and automation.

Associate Application Support Engineer

Capital Markets & Trading Platforms

Started my career supporting capital markets and trading platforms, working closely with customers and engineering teams.

  • Provided application support and troubleshooting for trading systems.
  • Helped investigate incidents and performance issues.
  • Built a foundation in financial technology and market data.
Projects & Case Studies

Selected impact stories

A few examples of real projects where I designed, debugged and improved cloud platforms with measurable impact.

Fixing EKS Cluster Autoscaler after AL2023 migration (IRSA + RBAC)

Role: Senior Cloud Engineer for a high-traffic e-commerce EKS platform.

During our migration from Amazon Linux 2 (AL2) to Amazon Linux 2023 (AL2023), the EKS Cluster Autoscaler suddenly stopped scaling: pods were stuck in Pending and logs showed “Failed to get nodes from apiserver: Unauthorized”. The tighter metadata behaviour on AL2023 broke our previous assumption that the autoscaler could “borrow” the node IAM role.

  • Identified that the autoscaler was implicitly using the node IAM role via EC2 instance metadata, which no longer worked with AL2023 defaults.
  • Moved the autoscaler to a dedicated IAM Role for Service Accounts (IRSA) with least-privilege AWS permissions.
  • Created a Kubernetes service account + RBAC role so the autoscaler had exactly the cluster permissions it needed (nodes, pods, leases, etc.).
  • Cleaned up legacy permissions on the node IAM role to remove hidden dependency on metadata and reduce blast radius.

Impact: Restored safe, predictable autoscaling on AL2023 in non-production before touching production, and created a reusable IRSA + RBAC pattern for other controllers (Cluster Autoscaler, ExternalDNS, load balancer controllers) across the organisation.

Read the full story on Medium »

Key lessons & tech stack
  • Move critical controllers (like Cluster Autoscaler) to IRSA or Pod Identity before changing AMIs.
  • Separate concerns: IRSA for AWS APIs, Kubernetes RBAC for what the pod can do inside the cluster.
  • Treat AMI upgrades as application changes: test in non-production with cordon/drain and synthetic scale-up/scale-down runs.

Why IRSA here: For this incident we used IRSA as the fastest safe fix: the cluster already had an OIDC provider, the Helm chart supported the “service account + annotation” pattern, and our AWS CDK stack had IRSA helpers. Pod Identity stays on the roadmap for new clusters where we can design the model from day one.
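The heart of the IRSA setup is the IAM role's trust policy: the cluster's OIDC provider may assume the role, but only on behalf of one specific service account. A minimal sketch of building that trust document in Python (account ID, OIDC provider URL, namespace and service-account name are illustrative placeholders):

```python
import json

def irsa_trust_policy(account_id: str, oidc_provider: str,
                      namespace: str, service_account: str) -> str:
    """Build the IAM trust policy that lets exactly one Kubernetes
    service account assume the role via the cluster's OIDC provider.
    All identifiers passed in are placeholders for illustration."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    # Scope the role to one service account in one namespace.
                    f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                    # Tokens must be intended for STS.
                    f"{oidc_provider}:aud": "sts.amazonaws.com",
                }
            },
        }],
    }, indent=2)
```

The `sub` condition is what removes the "borrowed node role" problem: even if other pods run on the same node, only the annotated service account can exchange its token for these AWS credentials.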

Tech stack: AWS EKS, Amazon Linux 2 & Amazon Linux 2023, Kubernetes Cluster Autoscaler, IAM Roles for Service Accounts (IRSA), Kubernetes RBAC, EKS OIDC, Terraform / AWS CDK (Python), Datadog.

Tags-based log retention in Datadog – giving ownership back to teams

Role: Senior Cloud & DevOps Engineer, leading Datadog governance for a multi-team engineering organisation.

Our Datadog logs setup “worked”, but ownership and costs were blurry. Dozens of teams shipped logs with inconsistent tags and ad-hoc indexes, making it hard to see who owned which volume, how long data stayed, and why costs kept creeping up.

  • Designed a tags-based, retention-first indexing strategy where teams choose their retention while platform enforces guardrails.
  • Defined mandatory tags on every log: team, costcenter, appgroup, env and retention.
  • Replaced per-team indexes with shared “retention lanes” (3 / 7 / 15 / 30 / 90 days) driven entirely by tags.
  • Introduced a temporary “punishment lane” with short retention and a quota for untagged or badly tagged logs.
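The routing decision described above can be sketched as a small pure function. The lane names follow the `index-retention-period-NN` scheme used here; the name of the temporary "punishment lane" index is an assumption for illustration:

```python
MANDATORY_TAGS = {"team", "costcenter", "appgroup", "env", "retention"}
ALLOWED_RETENTION_DAYS = {3, 7, 15, 30, 90}

# Hypothetical name for the quota-limited, short-retention lane.
PUNISHMENT_LANE = "index-retention-temporary-07"

def route_log(tags: dict) -> str:
    """Pick the target index for a log based on its tags.
    Fully tagged logs with an allowed retention value go to their
    retention lane; everything else lands in the punishment lane
    until the owning team fixes its tagging."""
    if MANDATORY_TAGS - tags.keys():
        return PUNISHMENT_LANE  # one or more mandatory tags missing
    try:
        days = int(tags["retention"])
    except (TypeError, ValueError):
        return PUNISHMENT_LANE  # retention tag is not a number
    if days not in ALLOWED_RETENTION_DAYS:
        return PUNISHMENT_LANE  # retention value outside the guardrails
    return f"index-retention-period-{days:02d}"
```

For example, `route_log({"team": "checkout", "costcenter": "cc-12", "appgroup": "web", "env": "prod", "retention": "15"})` routes to `index-retention-period-15`, while any log missing a mandatory tag falls into the temporary lane.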

Impact: Made log retention an explicit product team decision instead of a central bottleneck, improved cost transparency and paved the way for full IaC ownership of Datadog log indexes and enforcement rules.

Read the full story on Medium »

Index model, guardrails & automation
  • Created strict retention indexes such as index-retention-period-03, -07, -15, -30, -90 matching only fully tagged logs with allowed retention values.
  • Added a 7-day temporary index with a daily quota for logs missing mandatory tags, giving teams short-term visibility but a strong incentive to fix tagging.
  • Built a global Datadog monitor that detects logs with missing tags or invalid retention and alerts the platform team.
  • Implemented a LogsIndexManager module in Pulumi (Python) to manage indexes, routing rules and (optionally) index order and enforcement.

Tech stack: Datadog logs & monitors, tag-based routing and indexes, Pulumi (pulumi-datadog), AWS workloads (EKS/Lambda/EC2), shared tagging model for logs, metrics and traces across 70+ engineering teams.

Migrating container images from GCP to AWS ECR safely and repeatably

Role: Platform/DevOps Engineer leading a registry migration from Google Artifact Registry / Container Registry to AWS ECR.

As part of a wider platform move to AWS, dozens of image repositories had to move from GCP (with hierarchical paths like eu.gcr.io/project/app/service) to AWS ECR, which uses flatter repositories and tags. A naive “pull & push” risked overwriting tags or losing the original structure.

  • Designed a deterministic mapping from GCP’s hierarchical image paths to ECR repositories and tags (for example project-app-service:1.2.3).
  • Built a Python CLI that discovers tags, pulls from GCP, retags, pushes to ECR and validates digests to ensure images are identical.
  • Added tag filtering (semver awareness, prefix filters, --limit) and a safe dry-run mode with clear logging.
  • Included retry logic and optional cleanup so teams could migrate repositories one by one with confidence.
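The deterministic mapping at the heart of the tool can be sketched as follows. The host detection and `-` separator are assumptions based on the `project-app-service:1.2.3` example above; the sketch does not handle digest (`@sha256:`) references:

```python
def gcp_to_ecr(image_ref: str) -> tuple[str, str]:
    """Map a hierarchical GCP image reference such as
    'eu.gcr.io/project/app/service:1.2.3' to a flat AWS ECR
    (repository, tag) pair such as ('project-app-service', '1.2.3').
    Deterministic: the same source always maps to the same target."""
    # Split off the tag; default to 'latest' if none was given.
    path, _, tag = image_ref.partition(":")
    tag = tag or "latest"
    segments = path.split("/")
    # Drop the registry host (eu.gcr.io, gcr.io, *-docker.pkg.dev, ...),
    # recognisable by the dot in its name.
    if "." in segments[0]:
        segments = segments[1:]
    # Join the remaining hierarchy with '-' to form a flat repo name.
    return "-".join(segments), tag
```

One caveat worth noting: because `/` collapses to `-`, source paths whose segments themselves contain hyphens could in theory collide, which is why the real tool validates digests after pushing rather than trusting the name mapping alone.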

Impact: Enabled teams to migrate image repositories without accidentally overwriting tags or losing traceability, and produced a reusable migration tool that can be shared or open-sourced for similar GCP → ECR moves.

Read the full story on Medium »

Migration workflow & tech stack
  • Discover images and tags from GCP, then normalise hierarchical image names into ECR-compatible repository + tag pairs using a deterministic mapping.
  • For each selected tag: pull from GCP → retag to the mapped ECR repository/tag → push to ECR → compare digests.
  • On digest match, optionally clean up local images and, if desired, the source images in GCP.
  • Log every action (discovery, mapping, pull, push, validation) with clear, human-readable output so teams can audit exactly how GCP paths were translated into ECR.
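Digest validation is what makes the workflow safe to repeat. In registry terms an image digest is the SHA-256 of the manifest bytes exactly as the registry serves them, so the check reduces to comparing two hashes, sketched here under the assumption that the raw manifest bytes were fetched unmodified from each registry's API:

```python
import hashlib

def manifest_digest(manifest_bytes: bytes) -> str:
    """Compute an OCI-style digest over raw manifest bytes.
    The bytes must come straight from the registry API; any
    re-serialisation would change the hash."""
    return "sha256:" + hashlib.sha256(manifest_bytes).hexdigest()

def same_image(src_manifest: bytes, dst_manifest: bytes) -> bool:
    """The image survived the migration intact iff both registries
    serve byte-identical manifests, hence identical digests."""
    return manifest_digest(src_manifest) == manifest_digest(dst_manifest)
```

Note that `docker pull`/`docker push` can rewrite manifests in transit (for example, converting media types), so a digest mismatch is a signal to investigate rather than automatic proof of corruption.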

Tech stack: Python, Docker CLI, Google Artifact Registry / Container Registry, AWS ECR, AWS CLI, bash automation and CI integration where needed. The tool encapsulates the GCP hierarchical naming model and the flatter AWS ECR repository/tag model so teams don’t have to think about it on every migration.

Writing & Talks

From my blog & podcast

I write about Cloud, DevOps and platform engineering on Medium, and occasionally join podcasts to share lessons from real-world migrations and incidents.

Medium · Blog post

Tags-Based Retention in Datadog: How We Gave Log Ownership Back to Teams (Without Blowing the Budget)

11/24/2025

For a long time, our Datadog logs setup looked… fine. Log...

Read on Medium

Medium · Blog post

Broadcom’s Bitnami Restrictions: Why It Matters — and What We Can Do About It

9/8/2025

Bitnami: A Quick Trip Down Memory Lane — If you’ve been in the DevOps or deployment...

Read on Medium

Medium · Blog post

We Broke Our EKS Cluster Autoscaler During Amazon AL2023 Migration (and Fixed It) — Here’s What We Learned

7/15/2025

When we set out to migrate our EKS nodes from Amazon ...

Read on Medium

Podcast / YouTube talk

A recent conversation where I talk about my work, platform engineering and lessons learned.

Tech stack

Tools I Used

  • AWS – Cloud Platform
  • Azure – Cloud Platform
  • GCP – Cloud Platform
  • Docker – Containers
  • Kubernetes – Container Orchestration
  • ServiceNow – Incident & Change Management
  • Jira Service Management – Incident & Change Management
  • xMatters – Incident Escalation
  • CrowdStrike – Security Scanning
  • Git – SCM, CI/CD
  • Jenkins – Automation / CI/CD
  • Datadog – Monitoring
  • Terraform – Infrastructure as Code
Education

School & University


School Education – 2011

Completed primary and secondary education at Royal College, Colombo 7, Sri Lanka.



Bachelor of Science – 2016

BSc (Hons) in Information Technology from Sri Lanka Institute of Information Technology (SLIIT).



Master of Science – 2019

MSc in Information Technology – Cyber Security from Sri Lanka Institute of Information Technology (SLIIT).



Get in touch



Contact Me


Location:
Rotterdam, South Holland, Netherlands

Email:

info@dilshanwijesooriya.me