Senior Site Reliability Engineer
Offer summary

(Summary generated by AI based on the full job description)

The project focuses on monitoring and maintaining the reliability of an infrastructure platform supporting AI services, Java APIs, and frontend applications. Key technologies include Kubernetes (AKS), Terraform, Azure (ACR, Key Vault, Virtual Networks), Prometheus, Grafana, GitHub Actions, and ArgoCD. Main responsibilities cover defining and maintaining SLOs and SLIs, incident response, automation, Kubernetes infrastructure management, and development of observability and CI/CD tooling. The project emphasizes toil reduction and production environment stability.

You can start ASAP

Company: Webellian Sp. z o.o.

from: 26 June 2026
to: 26 July 2026
B2B contract (full-time): salary not specified
Contract of employment: salary not specified

Offer parameters

level: senior
working mode: hybrid
Warszawa, Mokotów, Domaniewska 45

Requirements

Expected technologies

Microsoft Azure
Kubernetes
Terraform
Prometheus
Grafana
Python
GitHub Actions
ArgoCD

Optional technologies

Bicep

Operating system

Windows

Our requirements

  • 5+ years of professional experience in site reliability engineering, DevOps, or platform engineering roles.
  • Strong Kubernetes experience: cluster operations, networking (Ingress, network policies), storage, autoscaling, and hands-on troubleshooting across production environments.
  • Solid Infrastructure as Code experience with Terraform; familiarity with Bicep or ARM templates is a plus.
  • Production experience with Azure cloud services: AKS, ACR, Key Vault, Azure Monitor, Application Insights, Virtual Networks, and Private Endpoints.
  • Strong observability experience: Prometheus, Grafana, centralized logging, alerting configuration, and distributed tracing instrumentation.
  • Working knowledge of SLO/SLI methodology: error budget principles, reliability target setting, and capacity planning.
  • Structured incident management experience: on-call ownership, blameless post-incident review, and runbook authorship.
  • Scripting and automation proficiency in Python or bash for toil elimination and operational tooling.
  • Strong CI/CD experience: GitHub Actions and ArgoCD or equivalent GitOps tooling.

Optional

  • Kubernetes certifications: CKA or CKAD.
  • Experience supporting AI or ML infrastructure workloads: GPU scheduling, model serving platforms, or inference pipeline operations.
  • Exposure to chaos engineering practices and fault injection testing.
  • FinOps experience: reserved capacity planning, resource right-sizing programs, and cost attribution per team or workload.
  • Service mesh experience (Istio, Linkerd) for traffic management and reliability patterns.
  • Experience in regulated industries (insurance, finance, healthcare) where auditability, change traceability, and secure-by-default operations are standard practice.

Your responsibilities

  • Define, instrument, and maintain SLOs and SLIs for platform components; own error budget tracking and produce regular reliability reports for hub leadership.
  • Serve on the on-call rotation as the infrastructure escalation tier; lead incident response for cluster-level, network-level, and storage failures; chair blameless post-incident reviews.
  • Implement and operate Kubernetes infrastructure (AKS): cluster lifecycle management, networking, resource quotas, autoscaling configuration, and multi-tenancy patterns across spoke namespaces.
  • Develop Infrastructure as Code (Terraform) to provision and manage Azure resources with consistency, auditability, and repeatable rollback capability.
  • Build and maintain observability infrastructure: Prometheus, Grafana, Azure Monitor, and Application Insights; own alerting rules, dashboards, and distributed tracing coverage across platform components.
  • Perform capacity planning and cost-aware resource management: right-size node pools, tune vertical and horizontal pod autoscalers, and identify resource waste across namespaces.
  • Identify and eliminate toil: automate repetitive operational tasks through scripting and tooling; measure and track toil reduction over time.
  • Maintain platform reliability procedures: rolling upgrades, backup and recovery testing, disaster recovery runbooks, and change freeze coordination.
  • Contribute to CI/CD pipelines and GitOps tooling (GitHub Actions, ArgoCD) from a reliability and deployment safety perspective; work with the Platform Team on release gates and rollback mechanisms.
  • Collaborate with the Run & Change team on incident SLA targets and operational procedures; work with Security Engineers on infrastructure hardening and vulnerability remediation.

About the project

As a Site Reliability Engineer within the Advanced Analytics Team, you will join the Infra team to own the reliability and operational health of the platform. You will define and maintain service level objectives, drive incident response at the infrastructure layer, and systematically eliminate operational toil through automation. You will work closely with Platform Engineers, Security Engineers, and the Run & Change team to ensure the platform meets its reliability commitments across production workloads spanning AI services, Java APIs, and frontend applications.

Ways of Working

  • Comfortable in agile, iterative delivery environments with personal ownership and accountability for platform reliability.
  • Clear communicator across global, cross-functional stakeholders; able to translate technical reliability metrics into business impact for non-technical audiences.
  • Proactive learner with pragmatic adoption of AI-assisted developer tools (e.g., GitHub Copilot, Claude Code) to improve automation coverage and delivery velocity.

This is how we organize our work

    This is how we work

    agile

    This is how we work on a project

    • Continuous Deployment
    • Continuous Integration
    • DevOps
    • issue tracking tools
    • testing environments

    Join a growing team of dedicated professionals! We love to share knowledge and grow excellence, speak our minds without playing politics, and enjoy spending time together. If you share our passions, we want to meet you! So go ahead and apply.

    What we offer

    • Contract under Polish law: B2B or Umowa o Pracę
    • Benefits such as private medical care, group insurance, Multisport card
    • English classes available
    • Opportunity to work with excellent professionals
    • High standards of work and focus on the quality of code
    • New technologies in use
    • Continuous learning and growth
    • International team
    • Pinball, PlayStation & much more (on-site)

    Benefits

    • sharing the costs of sports activities
    • private medical care
    • life insurance
    • remote work opportunities
    • fruit
    • video games at work
    • coffee / tea
    • drinks
    • parking space for employees
    • leisure zone
    • Pinball, PlayStation & much more
    • English classes

    Recruitment stages

    1. A quick phone call with our Recruiter.
    2. Online technical interview, testing your skills.
    3. Second interview, face-to-face, with your potential supervisor.
    4. Feedback.

    Webellian Sp. z o.o.

    Webellian is a well-established Digital Transformation and IT consulting company committed to creating a positive impact for our clients. We strive to make a meaningful difference in diverse sectors such as insurance, banking, healthcare, retail, and manufacturing. Our passion for cutting-edge and disruptive technologies, as well as our shared values and strong principles, are what motivate us. We are a community of engineers and senior advisors who work with our clients across industries, playing a deep and meaningful role in accelerating and realizing their vision and strategy.
