Site Reliability Engineer

Klaviyo

(Boston, Massachusetts)
Full Time
Job Posting Details
About Klaviyo
Klaviyo is email marketing reinvented for ecommerce stores on Shopify, BigCommerce and Magento. From personalized newsletters to automated emails such as abandoned-cart reminders, order follow-ups and personalized thank-yous, Klaviyo makes it easy for stores to set up great email marketing without the need for expensive systems and lots of people.
Summary
Site Reliability Engineering (SRE) is essentially what you get when you treat system operations as a software problem. The mission of the Site Reliability Engineering team is to ensure uninterrupted service for Klaviyo customers and to act as a force multiplier for Klaviyo product teams so they can deliver better software faster. Klaviyo is a high-growth, technology-driven company that is passionate about the user experience of its application and the well-orchestrated operation of its service infrastructure. The SRE team works on its own initiatives to build foundational backend services, and also builds tooling and automation that allow product teams to release and scale their software predictably. SREs are team players who embed themselves within product teams to advance the architecture and performance of software systems and to train their peers in topics such as debugging distributed systems, building self-healing capabilities, and eking out every drop of performance possible.
As a Site Reliability Engineer, you will have ownership of foundational Klaviyo services and a big impact on our product teams. Klaviyo’s infrastructure, event processing, and team have grown 300% year over year, so there are always new skills to learn and technical challenges to solve the right way. This position is full-time and based in Boston.
Responsibilities
* Design, write and deliver software to improve the availability, scalability, latency, and efficiency of Klaviyo’s services.
* Perform quantitative analysis to understand high-impact events that break Klaviyo functionality, and manage the cross-functional effort to resolve those events.
* Solve problems relating to mission-critical services and build automation to prevent problem recurrence, with the goal of automating the response to all non-exceptional service conditions.
* Engage in service capacity planning and demand forecasting, software performance analysis and system tuning.
* Uncover and advocate for preventative, upstream solutions with internal stakeholders, external vendors and dependencies.
* Confidently make informed, data-driven decisions in a fast-paced environment with competing priorities.
* Identify and drive opportunities to improve operational workflows.
* Conduct periodic on-call duties.
* Educate other Klaviyo engineers on the best practices for building and operating highly reliable systems.
Ideal Candidate
**Requirements:**
* BA or BS degree in Computer Science, a related field, or equivalent experience.
* Technical, engineering or quantitative background.
* Proven experience with Linux (we run Ubuntu) and all layers of the networking stack. You should be confident administering and debugging production Linux systems.
* Experience working on team software projects.
* Experience in one or more of: Python, Ruby, Go.
* Familiarity with running and scaling distributed software systems (load balancing, high availability, systems monitoring, etc.).
**Bonus Points:**
* Expertise in designing, analyzing and troubleshooting high-traffic, large-scale distributed systems.
* Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way.
* Experience with Amazon Web Services (AWS) or similar cloud compute offerings.
* Networking: knowledge and understanding of network theory, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
* Experience with building and scaling highly reliable distributed Python systems (we use Django extensively).
* Experience with instrumenting and monitoring production systems (Nagios, StatsD/Graphite, APM, etc.).
* Systematic problem-solving approach, coupled with a strong sense of ownership and drive.
