Senior DevOps / SRE Engineer (Contractor)

Senior | Contract - 6 Months
Location: Fully remote (US / LATAM)
Complete this form to apply

About the Engagement

We are seeking a Senior DevOps / SRE Engineer to take ownership of CI/CD pipelines, GitOps infrastructure, Kubernetes operations, and reliability engineering practices supporting a multi-service production platform. This role is critical in enabling safe, frequent deployments and ensuring rapid, structured recovery when incidents occur.

You will work closely with Platform Engineering, Data Platform, and Product squads to ensure teams can ship confidently and operate their services without unnecessary operational burden.

Role Breakdown

  • CI/CD and GitOps (80%): GitHub Actions, ArgoCD, deployment safety and rollback patterns
  • Kubernetes and Infrastructure Operations (80%): EKS cluster operations, Terraform, Atlantis
  • Reliability and Observability (80%): SLOs, Grafana, incident response, on-call
  • Security and Platform Integrity (80%): HashiCorp Vault, supply chain security, OPA
  • Operational Enablement (20%): Runbook automation, cross-team reliability practices, AI-augmented delivery

Core Responsibilities

  • Design, build, and maintain CI/CD pipelines across all repositories using reusable GitHub Actions workflows
  • Own ArgoCD GitOps configuration and manage application promotion from staging to production
  • Implement deployment safety mechanisms, environment protections, and automated rollback patterns
  • Operate and upgrade the EKS cluster, including node groups, Karpenter provisioners, and cluster add-ons
  • Maintain Terraform infrastructure across all environments via Atlantis PR-driven workflows
  • Define and maintain SLOs, alerting rules, and Grafana dashboards across platform services
  • Lead incident response and drive structured post-incident review processes
  • Operate and maintain HashiCorp Vault, including auth backends, policies, and secret engines
  • Implement supply chain security controls: image scanning, signing, SBOM generation, and OPA policy enforcement
  • Partner with Security Engineering on network policy, egress controls, and compliance requirements

Operational & Enablement Responsibilities

  • Automate repeatable operational work and eliminate manual remediation through tooling and runbook automation
  • Proactively document and maintain runbooks as systems evolve
  • Use AI tooling to draft infrastructure code and runbook content, validating outputs against security and compliance standards before merging
  • Partner with product and engineering teams to strengthen reliability practices and reduce developer workflow friction
  • Communicate clearly and effectively during incidents in a calm, factual, and action-oriented manner

Required Experience

  • Proven ownership of production-grade CI/CD, GitOps, and Kubernetes operations for multi-service platforms
  • Experience operating and upgrading Kubernetes clusters (EKS preferred), including autoscaling with Karpenter
  • Strong experience managing infrastructure-as-code at scale using Terraform, including PR-driven workflows with Atlantis
  • Demonstrated track record in SLO definition, alert tuning, dashboard design, incident response, and post-incident reviews
  • Experience operating HashiCorp Vault and implementing security controls in delivery pipelines
  • Strong cross-functional collaboration skills, enabling multiple squads to deploy safely and independently

Technical Skills

  • Kubernetes: Expert cluster operations, node group management, Karpenter, RBAC, PodDisruptionBudgets, topology spread constraints
  • GitOps: ArgoCD Application/Project management, sync policies, drift detection, automated rollback
  • CI/CD: GitHub Actions (reusable workflows, matrix builds, secrets handling, environment protection rules, deployment gates)
  • Infrastructure as Code: Terraform at production scale (module design, S3 state + DynamoDB locking, Atlantis apply workflows)
  • Service Mesh: Istio traffic management, mTLS, AuthorizationPolicy, circuit breaking, observability integration
  • Autoscaling: KEDA and Karpenter (event-driven autoscaling, Spot instance management, bin-packing, interruption handling)
  • Observability: Prometheus, Grafana (dashboard-as-code), Loki, Tempo, Alertmanager
  • Secrets Management: HashiCorp Vault (auth backends, dynamic secrets, PKI, audit logs)
  • Supply Chain Security: Trivy, Cosign, SBOM generation, OPA/Gatekeeper, Cilium network policy
  • Scripting: Strong Python and Bash for automation, tooling, and runbook automation

Generative AI & Agentic Systems

  • Integrates AI-powered quality gates into CI/CD pipelines, including automated code review bots, LLM-assisted security scanning, and AI-generated change risk summaries
  • Uses AI agents to accelerate Terraform modules, Kubernetes manifests, and Helm chart scaffolding, validating all outputs against security and compliance standards
  • Applies AI-assisted techniques in incident response: log correlation, runbook step suggestions, and drafting post-incident reports from structured incident data
  • Contributes to Prompt Execution Sandbox and Agent Gateway infrastructure requirements from a reliability and security perspective
  • Uses AI tooling to enhance SLO analysis, alert tuning, and capacity planning modelling

Ways of Working

  • Automation-First: Automates all repeatable work, prioritising reliability and eliminating manual fixes wherever possible
  • Proactive Documentation: Maintains current, structured runbooks and documentation before incidents occur
  • AI-Augmented Delivery: Leverages AI tooling to accelerate delivery while maintaining strict validation against security and compliance policies
  • SLO-Driven Reliability: Treats SLOs as firm commitments and raises reliability risks before they manifest as incidents
  • Structured Incident Communication: Communicates clearly during incidents and ensures disciplined follow-through via post-incident reviews

Interview Process

  • Round 1: Meet the founders
  • Round 2: Technical interview
  • Round 3: Short technical take-home exercise
  • Round 4: Interview with the customer

Why Adroit Cloud Consulting?

At Adroit, we are committed to fostering an environment where excellence is not just encouraged — it's expected. We offer the opportunity to work with a team of highly skilled professionals who are passionate about technology and innovation. With a flexible working environment and the support to grow your career, Adroit is the ideal place for ambitious engineers looking to make a significant impact.