Jean Haley

Professional Summary

Cloud and DevOps Specialist

Site Reliability Engineer with Google-scale training and hands-on implementation leadership. 8 years in technical operations, progressing from Google's IT Resident program to SRE, now designing greenfield AWS infrastructure for an early-stage AI platform. Expert in translating large-scale platform concepts (Borg, RAPID) into industry-standard implementations (Kubernetes, GitHub Actions, ECS Fargate). Proven track record of end-to-end infrastructure ownership, cloud migrations, and building resilient systems that deliver measurable business impact.

Experience

Site Reliability Engineer - Infrastructure

House-IQ | Jan 2026 - Present

Designing and building AWS infrastructure from the ground up for an early-stage AI-powered real estate platform

Infrastructure and Platform Engineering

Designed and built multi-account AWS infrastructure from scratch using Pulumi (Python), with layered project isolation for networking, identity, operations, and service deployment.
Created a convention-driven service platform that auto-provisions IAM roles, container registries, secrets, and databases from a single service declaration, reducing per-service onboarding to minimal configuration.
Built factory-based ECS orchestration connecting compute, database, and identity layers through shared naming contracts and validated configuration with resource tier presets.

CI/CD and Security

Migrated multiple services from Lambda to ECS Fargate using a standardized factory pattern, establishing the deployment model for staging and production environments.
Built a cross-account release pipeline with OIDC federation, build-once-deploy-by-digest promotion, automated rollback, and composite GitHub Actions.
Deployed AWS WAF with managed rule groups, tuned from count to block mode after traffic analysis. Implemented per-service IAM with zero-trust deploy roles and no wildcard permissions.
Integrated Datadog observability with golden signals dashboards and auto-provisioned monitors per service.

DevSecOps Lead - DevTech

Kaufman Rossin, Miami, FL | Apr 2024 - Dec 2025

Applied SRE principles to lead a complete infrastructure transformation and modernization

99.95% uptime (peak season) | 90% faster incident resolution

Infrastructure and Leadership

Led enterprise cloud migration from hybrid AWS (ECS/EC2) to fully containerized Azure AKS, migrating business-critical CPA applications serving 700+ users with zero data loss.
Designed and implemented end-to-end infrastructure across three Azure AKS environments (staging, production, monitoring) using Infrastructure as Code principles.
Built a comprehensive Kubernetes deployment framework with standardized manifests for service configuration management.
Established DevSecOps practices implementing CI/CD pipelines, security scanning, and automated deployment workflows using GitHub Actions and Terraform.

Observability and Reliability

Built enterprise-grade monitoring infrastructure using Thanos and Grafana, achieving a 90% reduction in incident resolution time (from days to <3 hours).
Delivered 99.95% uptime during peak tax season with only 45 minutes total downtime across 4 months, supporting mission-critical financial applications.
Created a comprehensive observability platform providing the first-ever complete visibility into application performance and business metrics.

Technology Translation

Adapted Google-internal practices to industry-standard equivalents (Borg → Kubernetes, RAPID → GitHub Actions, Automon → Prometheus/Grafana).
Migrated and operationalized a real-time data pipeline built on Debezium CDC, Redpanda (Kafka-compatible), and MongoDB, with full observability across the streaming infrastructure.

Site Reliability Engineer - Cloud Logging

Google, New York, NY | Mar 2020 - Apr 2023

Maintained reliability and performance for Google Cloud Platform's distributed logging infrastructure

99.95% SLA (Cloud Logging) | Exabyte-scale global infrastructure

Platform Operations

Orchestrated operations for Google's global Cloud Logging infrastructure, ensuring continuous availability for millions of GCP customers.
Maintained 99.95% SLO compliance through strategic alert tuning and efficient incident response.
Led critical incident response during on-call rotations, implementing mitigation strategies and conducting comprehensive post-mortems.

System Reliability

Managed large-scale deployments using Google's RAPID workflow, troubleshooting complex distributed system issues across global infrastructure.
Utilized Borg cluster management to scale Cloud Logs microservices dynamically, optimizing resource allocation across thousands of machines.
Collaborated with development teams to communicate technical challenges and implement long-term reliability improvements.

Process Improvement

Created standardization templates for monitoring dashboard migrations, enabling teams to meet compliance requirements while maintaining operational visibility.
Modernized critical infrastructure by migrating batch processes to unified frameworks, improving maintainability and operational efficiency.
Established testing protocols for service integrations, documenting dependencies and creating simulation environments for complex distributed systems.

IT Resident - Corporate Engineering Support

Google, New York, NY | Mar 2018 - Mar 2020

Google's structured program for developing technical talent, with a fixed 2-year term leading to internal career advancement

100K+ machines (global fleet)

Enterprise Fleet Management

Supported Google's global infrastructure of 100,000+ Linux, macOS, and Windows machines, maintaining enterprise productivity systems.
Contributed to the Armada fleet management system, automating deployment processes and improving operational efficiency.

Development and Automation

Created automation tooling in Go to eliminate manual configuration steps in deployment processes, gaining hands-on programming experience.
Created Bash automation for existing manual processes, reducing operational overhead and human error.

Programming

Databases

Version Control

Containers

Monitoring

CI/CD

Pub/Sub

Security

Practices

AI Tooling

Personal Projects