(Remote) Sr Site Reliability Engineer
Santa Ana, California-Remote; Boise, Idaho-Remote; Charlotte, North Carolina-Remote; Chicago, Illinois-Remote; Dallas, Texas-Remote; Des Moines, Iowa-Remote; Fort Myers, Florida-Remote; Houston, Texas-Remote; Irvine, California-Remote; Jacksonville, Florida-Remote; Madison, Wisconsin-Remote; Minneapolis, Minnesota-Remote; New York, New York-Remote; Phoenix, Arizona-Remote; Portland, Oregon-Remote; Sacramento, California-Remote; Salt Lake City, Utah-Remote; San Antonio, Texas-Remote; San Francisco, California-Remote; Seattle, Washington-Remote; South Orange, New Jersey
Who We Are
Join a team that puts its People First! Since 1889, First American (NYSE: FAF) has held an unwavering belief in its people. They are passionate about what they do, and we are equally passionate about fostering an environment where all feel welcome, supported, and empowered to be innovative and reach their full potential. Our inclusive, people-first culture has earned our company numerous accolades, including being named to the Fortune 100 Best Companies to Work For® list for eight consecutive years. We have also earned awards as a best place to work for women, diversity and LGBTQ+ employees, and have been included on more than 50 regional best places to work lists. First American will always strive to be a great place to work, for all. For more information, please visit www.careers.firstam.com.
What We Do
We are looking for a Senior Site Reliability Engineer to support the reliability of First American's mission-critical software systems. This transformative role involves automating IT infrastructure tasks and driving SRE best practices, tools, and processes. The ideal candidate exhibits a growth mindset, proactively monitors systems, and responds to incidents to ensure an optimal user experience.
What You’ll Do
Maintain and improve reliability of core software systems.
Prioritize customer satisfaction in all efforts.
Continuously learn and adapt to new technologies and methodologies.
Collaborate effectively with stakeholders and other engineers.
Quickly respond to changes and resolve issues.
Take accountability for issue resolution and prevention.
Utilize automation tools to streamline processes and minimize manual intervention.
What You’ll Bring (At least 5-7 years' experience)
Bachelor's degree in Computer Science, Information Technology, or equivalent education and experience.
Expertise in application performance monitoring, observability, and proactive alert correlation, including monitoring containers and failure-based alerting.
Skilled in defining service level objectives, measuring service level indicators, and setting up error budgets.
Strong understanding of SRE practices: incident response, change/release management, capacity planning, infrastructure automation, elastic environments, chaos engineering and blameless postmortems.
Successful in improving CI/CD pipelines and build/release processes.
Experienced in creating an SRE adoption framework and onboarding procedures.
Cloud Computing Platform: AWS (Lambda, EC2, ECS, EKS, Fargate, RDS, S3, DynamoDB, SQS)
Monitoring and Logging Tools: AppDynamics, Splunk, ELK Stack, Datadog, Prometheus, AWS CloudWatch/X-Ray
Networking Technology: Protocols, Load Balancers, Firewalls
Programming: C# .NET, PowerShell, Python, YAML
Code Repos: Azure Repos, GitHub
Infrastructure as code: Terraform, Ansible
Automation Tools: Jenkins, Chef, Puppet
Pay Range: $87,945 - $182,655 Annually
This hiring range is a reasonable estimate of the base pay range for this position at the time of posting. Pay is based on a number of factors which may include job-related knowledge, skills, experience, business requirements and geographic location.