Site Reliability Engineer

Thumbtack

(San Francisco, California)
Full Time
Job Posting Details
About Thumbtack
Thumbtack is a local services marketplace that connects customers who need to get things done with local, skilled professionals who can help. From plumbers and painters to DJs and personal trainers, Thumbtack helps millions of customers find the right professional for their project in over 1,000 categories.
Summary
Here at Thumbtack, we’re building the easiest way for people to hire local pros for projects big and small. From house painting to personal training and everything in between, we bring customers the right pros for all of life’s projects. We're looking for an exceptionally talented engineer to help manage our growing infrastructure, ensure our site stays up and performs well, and refine our processes for operating our production systems. Working closely with the rest of our engineering team, you'll have a great deal of authority in designing and implementing the hardware and software systems we use to host, manage, and monitor our production environment. Thumbtack's infrastructure has always been managed by our small team and a single SRE, and while there haven't been any major disasters, we recognize it's time to take our operations to the next level. Our Python deploys could be much smoother, our monitoring could be more systematic and accessible, and our alerting could be much less noisy. We are actively moving our infrastructure from dedicated hardware to the AWS cloud to improve development speed and make our platform more scalable.
Responsibilities
We're looking for someone to work with our nascent engineering operations team and push us forward. As an authority on operations, you'll be empowered to:
* Help plan and execute how we manage and monitor our platform as it grows
* Continually look for new ways to make our systems more reliable and easier to manage, incorporating third-party tools when available and writing software of your own when nothing else fits the bill
* Anticipate performance bottlenecks and provision new hardware as necessary
* Expand your skills and expertise as our systems continue to grow and develop

**Our current infrastructure**
* Our platform operates primarily on a few dozen dedicated Linux machines running RHEL, Ubuntu, and Debian, all managed via Puppet. We additionally run a small number of machines and services on AWS
* Our main data stores are Postgres (website backend) and Mongo (internal analytics). We also make use of DynamoDB, Riak, and Memcached
* We use DataDog, New Relic, Munin, Graphite, and a handful of custom tools for monitoring and alerting
* We practice continuous deployment using a custom one-click deployment system written in Python (Fabric). Auxiliary systems are deployed directly via Puppet
Ideal Candidate
* You’re an expert in Linux administration, security, and configuration management
* You have a deep knowledge of the steps involved in serving a web request, including a strong understanding of TCP/IP, and experience dealing with the corresponding infrastructure components
* You’re fanatical about monitoring
* You enjoy diagnosing and fixing misbehaving and underperforming Linux servers
* You’re fluent with the shell and comfortable writing tools in Python to automate our operations and development processes
* Experience with AWS is a plus
* Experience tuning database performance is a plus
* You’re comfortable working with a great deal of autonomy
* You’re excited to continually learn, grow, and share knowledge
