Senior Site Reliability Engineer for Transactions


(San Francisco, California)
Full Time
Job Posting Details
About Twilio
Twilio's mission is to fuel the future of communications. Developers and businesses use Twilio to make communications relevant and contextual by embedding messaging, voice and video capabilities directly into their software applications. Founded in 2008, Twilio has over 650 employees, with headquarters in San Francisco and other offices in Bogotá, Dublin, Hong Kong, London, Madrid, Mountain View, Munich, New York City, Singapore and Tallinn.
As a Senior SRE, you will be a core contributor and face some of the most complex challenges in distributed data systems at scale.
* Create a resilient and highly operable production environment in AWS with 24x7 availability, high performance, scalability, and zero-downtime releases.
* Manage large MySQL database clusters and NoSQL systems such as Redis, DynamoDB, and Cassandra.
* Manage regional deployments and set up disaster recovery of Kafka data pipelines, systems, and stores in AWS.
* Collaborate with engineers to create a continuous delivery environment and processes.
* Instrument and monitor the health and availability of services, with fault detection, alerting, triage, and recovery (automated and manual).
* Work closely with Twilio's cloud infrastructure, orchestration, and security teams to help implement company-wide security and operability initiatives and to provide tooling requirements.
* Manage performance (with benchmarking and monitoring of vital metrics), plan capacity, and resolve performance problems affecting service levels.
* Write scripts and runbooks to automate procedures.
* Enable auto-scaling.
Ideal Candidate
* Your background is that of a Senior Engineer with considerable experience in a highly complex technical operations environment with cloud-based services.
* At least 5 years of experience building complex distributed systems, with a focus on reliability, high availability, performance, scalability, capacity planning, backup and recovery, business continuity planning, and automation of everything.
* Strong AWS experience in a production environment.
* Experience managing and automating the configuration of MySQL database clusters.
* Hands-on experience with cloud infrastructure technologies, including continuous integration tools, configuration management, and systems monitoring and alerting tools.
* Experience managing systems in distributed regions, in the cloud or on-site.
* Adept at troubleshooting and administering Linux systems, dealing with networking issues, and fine-tuning instrumentation and alerting systems.
* Demonstrated experience with agile processes, continuous integration, test automation, and release management.
* Significant development experience in at least one modern scripting language, preferably Python.
* Exceptional communication and troubleshooting skills.
* Preferably, experience operating a high-load data pipeline and exposure to technologies such as Kafka, Kinesis, Spark, S3, and Redshift.
* Preferably, experience managing NoSQL systems such as Redis, DynamoDB, and Cassandra.
* Experience securing distributed systems. You understand the purpose of reasonable security techniques and the tradeoff with operational efficiency.

