Site Reliability Engineer with Google-scale training and hands-on implementation leadership. 8 years in technical operations, progressing from Google's IT Resident program to SRE, now designing greenfield AWS infrastructure for an early-stage AI platform. Expert in translating large-scale platform concepts (Borg, RAPID) into industry-standard implementations (Kubernetes, GitHub Actions, ECS Fargate). Proven track record of end-to-end infrastructure ownership, cloud migrations, and building resilient systems that deliver measurable business impact.
Designing and building AWS infrastructure from the ground up for an early-stage AI-powered real estate platform
- Designed and built multi-account AWS infrastructure from scratch using Pulumi (Python), with layered project isolation for networking, identity, operations, and service deployment.
- Created a convention-driven service platform that auto-provisions IAM roles, container registries, secrets, and databases from a single service declaration, reducing per-service onboarding to minimal configuration.
- Built factory-based ECS orchestration connecting compute, database, and identity layers through shared naming contracts and validated configuration with resource tier presets.
- Migrated multiple services from Lambda to ECS Fargate using a standardized factory pattern, establishing the deployment model for staging and production environments.
- Built a cross-account release pipeline with OIDC federation, build-once-deploy-by-digest promotion, automated rollback, and composite GitHub Actions.
- Deployed AWS WAF with managed rule groups, moved from count mode to block mode after traffic analysis. Implemented per-service IAM with zero-trust deploy roles and no wildcard permissions.
- Integrated Datadog observability with golden signals dashboards and auto-provisioned monitors per service.
Applying SRE principles to lead a complete infrastructure transformation and modernization
- Led enterprise cloud migration from hybrid AWS (ECS/EC2) to fully containerized Azure AKS, migrating business-critical CPA applications serving 700+ users with zero data loss.
- Designed and implemented end-to-end infrastructure across three Azure AKS environments (staging, production, monitoring) using Infrastructure as Code principles.
- Built a comprehensive Kubernetes deployment framework with standardized manifests for service configuration management.
- Established DevSecOps practices, implementing CI/CD pipelines, security scanning, and automated deployment workflows with GitHub Actions and Terraform.
- Built enterprise-grade monitoring infrastructure using Thanos and Grafana, achieving a 90% reduction in incident resolution time (from days to <3 hours).
- Delivered 99.95% uptime during peak tax season, with only 45 minutes of total downtime across 4 months, supporting mission-critical financial applications.
- Created a comprehensive observability platform providing the first-ever complete visibility into application performance and business metrics.
- Adapted Google-internal practices to industry-standard tooling (Borg → Kubernetes, RAPID → GitHub Actions, Automon → Prometheus/Grafana).
- Migrated and operationalized a real-time data pipeline built on Debezium CDC, Redpanda (Kafka-compatible), and MongoDB, implementing full observability and monitoring across the streaming infrastructure.
Maintained reliability and performance for Google Cloud Platform's distributed logging infrastructure
- Orchestrated operations for Google's global Cloud Logging infrastructure, ensuring continuous availability for millions of GCP customers.
- Maintained 99.95% SLO compliance through targeted alert tuning and efficient incident response.
- Led critical incident response during on-call rotations, implementing mitigation strategies and conducting comprehensive post-mortems.
- Managed large-scale deployments using Google's RAPID workflow, troubleshooting complex distributed system issues across global infrastructure.
- Utilized Borg cluster management to scale Cloud Logs microservices dynamically, optimizing resource allocation across thousands of machines.
- Collaborated with development teams to communicate technical challenges and implement long-term reliability improvements.
- Created standardization templates for monitoring dashboard migrations, enabling teams to meet compliance requirements while maintaining operational visibility.
- Modernized critical infrastructure by migrating batch processes to unified frameworks, improving maintainability and operational efficiency.
- Established testing protocols for service integrations, documented dependencies, and created simulation environments for complex distributed systems.
Google's structured program for developing technical talent, with a fixed 2-year term leading to internal career advancement
- Supported Google's global infrastructure of 100,000+ Linux, macOS, and Windows machines, maintaining enterprise productivity systems.
- Contributed to the Armada fleet management system, automating deployment processes and improving operational efficiency.
- Created automation tooling in Go to eliminate manual configuration steps in deployment processes, gaining hands-on programming experience while reducing operational overhead.
- Created Bash automation solutions for existing manual processes, reducing operational overhead and human error.