(Remote) Senior Observability and Monitoring Engineer
Santa Ana, California-Remote; Boise, Idaho-Remote; Charlotte, North Carolina-Remote; Chicago, Illinois-Remote; Dallas, Texas-Remote; Des Moines, Iowa-Remote; Fort Myers, Florida-Remote; Houston, Texas-Remote; Irvine, California-Remote; Jacksonville, Florida-Remote; Madison, Wisconsin-Remote; Minneapolis, Minnesota-Remote; New York, New York-Remote; Phoenix, Arizona-Remote; Portland, Oregon-Remote; Sacramento, California-Remote; Salt Lake City, Utah-Remote; San Antonio, Texas-Remote; San Francisco, California-Remote; Seattle, Washington-Remote; South Orange, New Jersey
Who We Are
Join a team that puts its People First! Since 1889, First American (NYSE: FAF) has held an unwavering belief in its people. They are passionate about what they do, and we are equally passionate about fostering an environment where all feel welcome, supported, and empowered to be innovative and reach their full potential. Our inclusive, people-first culture has earned our company numerous accolades, including being named to the Fortune 100 Best Companies to Work For® list for eight consecutive years. We have also earned awards as a best place to work for women, diversity and LGBTQ+ employees, and have been included on more than 50 regional best places to work lists. First American will always strive to be a great place to work, for all. For more information, please visit www.careers.firstam.com.
What We Do
** Remote Work Welcome **
First American is seeking a Senior Observability and Monitoring Engineer who will play a pivotal role in ensuring reliability, robustness, and performance of First American's mission-critical software systems. This transformative role focuses on implementing, managing, and optimizing observability solutions to gain deep insight into system behavior, troubleshoot issues proactively, and enhance overall operational efficiency. The ideal candidate will exhibit a growth and automation mindset.
Measure application health and performance against baselines to anticipate failures.
Define service level objectives and supporting service level indicators to capture baselines.
Automate application observability and reporting wherever practical.
Improve predictive incident response by using automated solutions for issue resolution where applicable and maintaining a well-defined process flow for human intervention.
Instill repeatable patterns across First American's portfolio of applications, ensuring consistent practices are in place.
Influence and train software teams on observability and instrumentation, including adopting observability frameworks.
Identify key processes that involve toil-based activities and develop plans to remediate through automation.
Document and implement incident response processes and procedures to drive consistent mitigation and remediation in case of failure.
Address application architecture needs, pushing towards solutions that are fault tolerant, resilient, and easy to manage.
What You’ll Bring
Proven experience with observability tooling, application performance monitoring, infrastructure monitoring and log management.
Proficient in configuring alerting rules and automated responses to trigger actions when predefined thresholds or anomalies are detected.
Scripting and automation skills for customizing and extending observability solutions.
Strong knowledge of cloud platforms and container orchestration.
Skilled in defining service level objectives, measuring service level indicators, and setting up error budgets.
Strong understanding of SRE practices: incident response, change/release management, capacity planning, infrastructure automation, elastic environments, chaos engineering and blameless postmortems.
Excellent problem-solving skills, attention to detail, and strong communication abilities.
Cloud Computing Platform: AWS (Lambda, EC2, ECS, EKS, Fargate, RDS, S3, DynamoDB, SQS)
Observability: OpenTelemetry, AppDynamics, Grafana, ELK Stack, AWS CloudWatch and X-Ray
Programming/Scripting: C# .NET, PowerShell, Python, YAML, Bash
Code Repos: Azure Repos, GitHub
Infrastructure as code: Terraform, Ansible
Pay Range: $108,240 - $191,125 Annually
This hiring range is a reasonable estimate of the base pay range for this position at the time of posting. Pay is based on a number of factors which may include job-related knowledge, skills, experience, business requirements and geographic location.