Jean Haley

Site Reliability Engineer
413-727-9804 | jean@haley.nyc | Brooklyn, NY 11217
Professional Summary
Cloud and DevOps Specialist

Site Reliability Engineer with Google-scale training and hands-on implementation leadership. 5+ years progressing from Google's IT Resident program to SRE, now applying enterprise-level reliability practices to business-critical infrastructure. Expert in translating large-scale platform concepts (Borg, RAPID) into industry-standard implementations (Kubernetes, GitHub Actions). Proven track record of end-to-end infrastructure ownership, cloud migrations, and building resilient systems that deliver measurable business impact.

Experience
DevSecOps Lead - DevTech
Kaufman Rossin, Miami, FL | Apr 2024 - Present

Applying SRE principles to lead a complete infrastructure transformation and modernization

99.95% uptime peak season | 700+ users zero-downtime migration | 90% faster incident resolution
Infrastructure and Leadership
  • Led enterprise cloud migration from hybrid AWS (ECS/EC2) to fully containerized Azure AKS, migrating business-critical CPA applications serving 700+ users with zero data loss.
  • Designed and implemented end-to-end infrastructure across three Azure AKS environments (staging, production, monitoring) using Infrastructure as Code principles.
  • Built a comprehensive Kubernetes deployment framework with standardized manifests for service configuration management.
  • Established DevSecOps practices implementing CI/CD pipelines, security scanning, and automated deployment workflows using GitHub Actions and Terraform.
Observability and Reliability
  • Built enterprise-grade monitoring infrastructure using Thanos and Grafana, achieving a 90% reduction in incident resolution time (from days to <3 hours).
  • Delivered 99.95% uptime during peak tax season with only 45 minutes total downtime across 4 months, supporting mission-critical financial applications.
  • Created a comprehensive observability platform providing the first-ever complete visibility into application performance and business metrics.
Technology Translation
  • Adapted Google-internal practices to industry-standard tooling (Borg → Kubernetes, RAPID → GitHub Actions, Automon → Prometheus/Grafana).
  • Migrated and operationalized a real-time data pipeline built on Debezium CDC, Redpanda (Kafka-compatible), and MongoDB, implementing full observability and monitoring across the streaming infrastructure.
Site Reliability Engineer - Cloud Logging
Google, New York, NY | Mar 2020 - Apr 2023

Maintained reliability and performance for Google Cloud Platform's distributed logging infrastructure

Millions of GCP customers served | error budget compliance against 99.9% SLOs
Platform Operations
  • Orchestrated operations for Google's global Cloud Logging infrastructure, ensuring continuous availability for millions of GCP customers.
  • Improved SLO compliance through strategic alert tuning and efficient incident response, keeping services within their error budgets against 99.9% targets.
  • Led critical incident response: served in on-call rotations, implemented mitigation strategies, and conducted comprehensive post-mortems.
System Reliability
  • Managed large-scale deployments using Google's RAPID workflow, troubleshooting complex distributed system issues across global infrastructure.
  • Utilized Borg cluster management to scale Cloud Logs microservices dynamically, optimizing resource allocation across thousands of machines.
  • Collaborated with development teams to communicate technical challenges and implement long-term reliability improvements.
Process Improvement
  • Created standardization templates for monitoring dashboard migrations, enabling teams to meet compliance requirements while maintaining operational visibility.
  • Modernized critical infrastructure by migrating batch processes to unified frameworks, improving maintainability and operational efficiency.
  • Established testing protocols for service integrations, documenting dependencies and creating simulation environments for complex distributed systems.
IT Resident - Corporate Engineering Support
Google, New York, NY | Mar 2018 - Mar 2020

Google's structured program for developing technical talent, with a fixed 2-year term leading to internal career advancement

100K+ machines global fleet
Enterprise Fleet Management
  • Supported Google's global infrastructure of 100,000+ Linux, macOS, and Windows machines, maintaining enterprise productivity systems.
  • Contributed to the Armada fleet management system, automating deployment processes and improving operational efficiency.
Development and Automation
  • Created automation tooling in Golang to eliminate manual configuration steps in deployment processes, gaining hands-on programming experience while reducing operational overhead.
  • Created Bash automation solutions for existing manual processes, reducing operational overhead and human error.