Senior Site Reliability Engineer, Tenant Services: Geo
GitLab
Job Description
Site Reliability Engineers (SREs) at GitLab are a blend of pragmatic operators and software craftspeople who apply sound engineering principles and mature automation to our operating environments.
In this role, you will join the Tenant Services: Geo team. Geo is a GitLab feature that replicates data to a warm standby for migrations and disaster recovery. You will be responsible for supporting GitLab Dedicated customer migrations and Geo-related escalations. You will help evolve a low-risk cutover model while improving tooling and observability to make migrations faster, safer, and more predictable.
What You’ll Do
Migration Execution: Execute Dedicated Geo migrations and cutovers end-to-end, including planning, validation, execution, and post-cutover cleanup.
On-Call & Coverage: Join the team’s shift and weekend rotation for Dedicated cutovers (EMEA/US hours) and participate in the SaaS SRE on-call rotation for GitLab.com.
Operational Improvement: Prepare environments, perform data hygiene checks, and handle Geo-related escalations from Support and internal partners.
Automation & Tooling: Design and maintain automation, tooling, and runbooks using Ansible, Chef, Terraform, GitLab CI/CD, and Kubernetes.
Observability: Build monitoring, alerting, and dashboards in Prometheus and Grafana to detect symptoms early and track migration success rates and SLOs.
Collaboration: Partner with the core Geo team, Support, and Infrastructure teams on capacity planning and reliability improvements.
Incident Management: Contribute to readiness reviews and root cause analyses (RCAs), turning learnings into automated solutions.
Toil Reduction: Proactively identify and automate repetitive operational tasks to simplify workflows.
What You’ll Bring
Distributed Systems: Experience operating highly available distributed systems at scale, ideally in a SaaS environment with customer-facing SLAs.
Cloud Expertise: Hands-on experience with GCP or AWS, including networking, storage, and managed services.
Containerization: Experience with Kubernetes and its ecosystem (e.g., Helm).
Infrastructure as Code: Proficiency with Terraform, Ansible, or Chef.
Software Engineering: Strong programming skills in Go or Ruby and proficiency with scripting (Shell, Python).
Observability: Experience with Prometheus, Grafana, and logging stacks to troubleshoot performance issues.
Data Integrity: Exposure to data replication, backup/restore, or migration scenarios where downtime risk must be carefully managed.
Communication: Ability to engage directly with enterprise customers and provide clear written updates in an asynchronous environment.
It’s a Plus If You Have:
Experience with disaster recovery technologies.
Experience with managed/hosted environments (GitLab Dedicated) and compliance-sensitive customers (SOC 2, ISO).
Hands-on experience with PostgreSQL replication and cutover workflows.
Compensation & Benefits
How GitLab Supports Full-Time Employees:
Flexible Paid Time Off
Equity Compensation & Employee Stock Purchase Plan
Growth and Development Fund
Parental Leave & Home Office Support
Team Member Resource Groups
Experience Level
Senior Level