Site Reliability Engineer (SRE) - X Command Center

101 X Corp.

Full-time

Remote friendly (US - CA - Palo Alto United States of America)

Are you prepared to join the X team and help build the ultimate real-time information-sharing app, revolutionizing how people connect? At X, we’re on a mission to become the trusted global digital public square, committed to protecting freedom of speech and building the future of unlimited interactivity. Our goal is to empower every user to freely create and share ideas, fostering open public discourse without barriers. Join us in shaping this thrilling journey where your contribution will be invaluable to our success!

Location:

US: Palo Alto, New York or Los Angeles

Dublin (IE)

Salary Range:

$120,000 to $297,000

Who We Are:

X serves our community of users and customers by preserving free expression and choice, fostering limitless interactivity, and creating a marketplace for economic success.

We are a lean, high-impact team operating in a "reverse startup" mode – originally a very large tech company with a vast number of employees and users, we now operate with a fraction of the headcount while maintaining the same high level of impact. Our mission is to enhance the reliability and performance of our diverse service areas, including the core product, ads, video, and payments. We work cross-functionally and employ tools like large language models and distributed tracing to tackle a variety of Site Reliability Engineering challenges, from outsmarting bad actors to improving the resiliency of our services.

What You'll Do:

As a Site Reliability Engineer on the Command Center Team, you will:

Triage and Troubleshoot Complex Issues: Analyze and resolve critical system issues at massive scale, ensuring high availability and reliability of our services.
Continuous Improvement: Develop and implement strategies to continuously improve system performance, reliability, and resiliency. Apply techniques to detect and remediate bot and bad actor behavior. Optimize services for peak utilization and performance.
Software Development: Create and maintain software for load testing, failure detection, traffic management, and data analysis using Python, Go, Scala, JavaScript, and Superset.
Incident Management: Serve as a primary incident manager for the entire site, providing clear and effective communication and leading engineering teams to mitigate impact and ensure timely resolution.
Enhance SLA/SLO Understanding: Continually refine service-level objectives (SLOs) across the stack, ensuring we stay within our error budgets and meet or exceed user expectations.
User Experience Measurement: Implement high-fidelity metrics to more accurately measure and improve the user experience across our constantly evolving set of services.
Service Dependency Analysis: Use distributed tracing to understand and manage service dependencies, facilitate debugging, and reduce latency.
Cross-Functional Collaboration: Work closely with various teams including Product, Infrastructure, and Safety. We leverage our lean structure to drive significant impact across all areas of the business.

Who You Are:

Highly self-motivated team player.
Enjoy approaching complex problems, thinking critically, and prototyping solutions in a dynamic and fast-paced environment without needing constant supervision.
Strong debugging, documentation and communication skills.
Availability for occasional travel to San Francisco HQ.

Qualifications:

Bachelor's degree or above in Computer Science, Engineering, or related field.
2+ years of experience in large-scale software development with a specific focus on site reliability engineering.
Profound understanding of computer science fundamentals, including data structures, algorithms, and concurrency principles.
Expertise with observability and monitoring, incident management, load testing, microservice architecture and design patterns, distributed systems, data visualization tools, and SQL-like query languages.
Proficiency in one or more object-oriented programming languages (e.g. Scala, Java, C++). Additional knowledge of Python or Golang will be considered a significant asset.
Strong knowledge of Unix/Linux system administration at scale.

This job is closed.