Site Reliability Engineer

Thumbtack

(San Francisco, California)

Full Time

Job Posting Details

About Thumbtack

Thumbtack is a local services marketplace that connects customers who need to get things done with local, skilled professionals who can help. From plumbers and painters to DJs and personal trainers, Thumbtack helps millions of customers find the right professional for their project in over 1,000 categories.

Summary

We're looking for an exceptionally talented engineer to help manage our growing infrastructure, ensuring our site stays up and performs well, and refining our processes for operating our production systems. Working closely with the rest of our engineering team, you'll have a great deal of authority in designing and implementing the hardware and software systems we use to host, manage, and monitor our production environment.

Thumbtack's infrastructure has always been managed by our small team and a single SRE, and while there haven't been any major disasters, we recognize it's time to take our operations to the next level. Our Python deploys could be much smoother, our monitoring could be more systematic and accessible, and our alerting could be much less noisy. We are actively moving our infrastructure from dedicated hardware to the AWS cloud to improve development speed and make our platform more scalable.

Here at Thumbtack, we’re building the easiest way for people to hire local pros for projects big and small. From house painting to personal training and everything in between, we bring customers the right pros for all of life’s projects.

Responsibilities

We're looking for someone to work with our nascent engineering operations team and push us forward. As an authority on operations, you'll be empowered to:

* Help plan and execute how we manage and monitor our platform as it grows
* Continually look for new ways to make our systems more reliable and easier to manage, incorporating third-party tools when available and writing software of your own when nothing else fits the bill
* Anticipate performance bottlenecks and provision new hardware as necessary
* Expand your skills and expertise as our systems continue to grow and develop

**Our current infrastructure**

* Our platform operates primarily on a few dozen dedicated Linux machines on RHEL, Ubuntu, and Debian, all managed via Puppet. We additionally run a small number of machines and services on AWS
* Our main data stores are Postgres (website backend) and Mongo (internal analytics). We also make use of DynamoDB, Riak, and Memcached
* We use DataDog, New Relic, Munin, Graphite and a handful of custom tools for monitoring and alerting
* We practice continuous deployment using a custom one-click deployment system written in Python (Fabric). Auxiliary systems are deployed directly via Puppet


Ideal Candidate

* You’re an expert in Linux administration, security, and configuration management
* You have deep knowledge of the steps involved in serving a web request, including a strong understanding of TCP/IP, and experience dealing with the corresponding infrastructure components
* You’re fanatical about monitoring
* You enjoy diagnosing and fixing misbehaving and underperforming Linux servers
* You’re fluent with the shell and comfortable writing tools in Python to automate our operations and development processes
* Experience with AWS is a plus
* Experience tuning database performance is a plus
* You’re comfortable working with a great deal of autonomy
* You’re excited to continually learn, grow, and share knowledge
