Site Reliability Engineer with Google-scale training and hands-on implementation leadership. 5+ years progressing from Google's IT Resident program to SRE, now applying enterprise-level reliability practices to business-critical infrastructure. Expert in translating large-scale platform concepts (Borg, RAPID) into industry-standard implementations (Kubernetes, GitHub Actions). Proven track record of end-to-end infrastructure ownership, cloud migrations, and building resilient systems that deliver measurable business impact.
Applying SRE principles to lead a complete infrastructure transformation and modernization
- Led enterprise cloud migration from hybrid AWS (ECS/EC2) to fully containerized Azure AKS, migrating business-critical CPA applications serving 700+ users with zero data loss.
- Designed and implemented end-to-end infrastructure across three Azure AKS environments (staging, production, monitoring) using Infrastructure as Code principles.
- Built a comprehensive Kubernetes deployment framework with standardized manifests for service configuration management.
- Established DevSecOps practices, implementing CI/CD pipelines, security scanning, and automated deployment workflows with GitHub Actions and Terraform.
- Built enterprise-grade monitoring infrastructure using Thanos and Grafana, achieving a 90% reduction in incident resolution time (from days to <3 hours).
- Delivered 99.95% uptime during peak tax season, with only 45 minutes of total downtime across 4 months, supporting mission-critical financial applications.
- Created a comprehensive observability platform, giving the organization its first complete view of application performance and business metrics.
- Adapted Google-internal practices to industry-standard equivalents (Borg → Kubernetes, RAPID → GitHub Actions, Automon → Prometheus/Grafana).
- Migrated and operationalized a real-time data pipeline built on Debezium CDC, Redpanda (Kafka-compatible streaming), and MongoDB, implementing full observability and monitoring across the streaming infrastructure.
Maintained reliability and performance for Google Cloud Platform's distributed logging infrastructure
- Orchestrated operations for Google's global Cloud Logging infrastructure, ensuring continuous availability for millions of GCP customers.
- Improved SLO compliance through alert tuning and efficient incident response, keeping services within the error budgets of their 99.9% SLO targets.
- Led critical incident response: served in on-call rotations, implemented mitigation strategies, and conducted comprehensive post-mortems.
- Managed large-scale deployments using Google's RAPID workflow, troubleshooting complex distributed system issues across global infrastructure.
- Used Borg cluster management to scale Cloud Logging microservices dynamically, optimizing resource allocation across thousands of machines.
- Collaborated with development teams to communicate technical challenges and implement long-term reliability improvements.
- Created standardization templates for monitoring dashboard migrations, enabling teams to meet compliance requirements while maintaining operational visibility.
- Modernized critical infrastructure by migrating batch processes to unified frameworks, improving maintainability and operational efficiency.
- Established testing protocols for service integrations, documented dependencies, and created simulation environments for complex distributed systems.
Google's structured program for developing technical talent, with a fixed 2-year term leading to internal career advancement
- Supported Google's global infrastructure of 100,000+ Linux, macOS, and Windows machines, maintaining enterprise productivity systems.
- Contributed to the Armada fleet management system, automating deployment processes and improving operational efficiency.
- Created automation tooling in Golang to eliminate manual configuration steps in deployment processes, gaining hands-on programming experience while reducing operational overhead.
- Created Bash automation solutions for existing manual processes, reducing operational overhead and human error.