(Remote) Senior Observability and Monitoring Engineer
Santa Ana, California-Remote; Boise, Idaho-Remote; Charlotte, North Carolina-Remote; Chicago, Illinois-Remote; Dallas, Texas-Remote; Des Moines, Iowa-Remote; Fort Myers, Florida-Remote; Houston, Texas-Remote; Irvine, California-Remote; Jacksonville, Florida-Remote; Madison, Wisconsin-Remote; Minneapolis, Minnesota-Remote; New York, New York-Remote; Phoenix, Arizona-Remote; Portland, Oregon-Remote; Sacramento, California-Remote; Salt Lake City, Utah-Remote; San Antonio, Texas-Remote; San Francisco, California-Remote; Seattle, Washington-Remote; South Orange, New Jersey
Who We Are
Join a team that puts its People First! Since 1889, First American (NYSE: FAF) has held an unwavering belief in its people. They are passionate about what they do, and we are equally passionate about fostering an environment where all feel welcome, supported, and empowered to be innovative and reach their full potential. Our inclusive, people-first culture has earned our company numerous accolades, including being named to the Fortune 100 Best Companies to Work For® list for eight consecutive years. We have also earned awards as a best place to work for women, diversity and LGBTQ+ employees, and have been included on more than 50 regional best places to work lists. First American will always strive to be a great place to work, for all. For more information, please visit www.careers.firstam.com.
What We Do
Job Summary
** Remote Work Welcome **
First American is seeking a Senior Observability and Monitoring Engineer who will play a pivotal role in ensuring reliability, robustness, and performance of First American's mission-critical software systems. This transformative role focuses on implementing, managing, and optimizing observability solutions to gain deep insight into system behavior, troubleshoot issues proactively, and enhance overall operational efficiency. The ideal candidate will exhibit a growth and automation mindset.
The Opportunity
Measure application health and performance against baselines to anticipate failures.
Define service level objectives and supporting service level indicators to capture baselines.
Automate application observability and reporting wherever practical.
Improve predictive incident response, using automated solutions for issue resolution where applicable and a well-defined process flow for human intervention.
Instill repeatable patterns across First American's portfolio of applications to ensure consistent practices are in place.
Influence and train software teams on observability and instrumentation, including adopting observability frameworks.
Identify key processes that involve toil-based activities and develop plans to remediate through automation.
Document and implement incident response processes and procedures to drive consistent mitigation and remediation in case of failure.
Address application architecture needs, pushing towards solutions that are fault tolerant, resilient, and easy to manage.
What You’ll Bring
Proven experience with observability tooling, application performance monitoring, infrastructure monitoring and log management.
Proficient in configuring alerting rules and automated responses to trigger actions when predefined thresholds or anomalies are detected.
Scripting and automation skills for customizing and extending observability solutions.
Strong knowledge of cloud platforms and container orchestration.
Skilled in defining service level objectives, measuring service level indicators, and setting up error budgets.
Strong understanding of SRE practices: incident response, change/release management, capacity planning, infrastructure automation, elastic environments, chaos engineering and blameless postmortems.
Excellent problem-solving skills, attention to detail, and strong communication abilities.
Technology Stack:
Cloud Computing Platform: AWS (Lambda, EC2, ECS, EKS, Fargate, RDS, S3, DynamoDB, SQS)
Observability: OpenTelemetry, AppDynamics, Grafana, ELK Stack, AWS CloudWatch and X-Ray
Programming/Scripting: C# .NET, PowerShell, Python, YAML, Bash
Code Repos: Azure Repos, GitHub
Infrastructure as code: Terraform, Ansible
Pay Range: $108,240 - $191,125 Annually
This hiring range is a reasonable estimate of the base pay range for this position at the time of posting. Pay is based on a number of factors which may include job-related knowledge, skills, experience, business requirements and geographic location.
What We Offer
By choice, we don’t simply accept individuality – we embrace it, we support it, and we thrive on it! Our People First Culture celebrates diversity, equity and inclusion not simply because it’s the right thing to do, but also because it’s the key to our success. We are proud to foster an authentic and inclusive workplace For All. You are free and encouraged to bring your entire, unique self to work. First American is an equal opportunity employer in every sense of the term.
Based on eligibility, First American offers a comprehensive benefits package including medical, dental, vision, 401k, PTO/paid sick leave and other great benefits like an employee stock purchase plan.
Related Content
The REconomy Podcast
First American’s economic podcast examining the forces that influence real estate, housing and affordability, featuring First American Chief Economist Mark Fleming, Ph.D. and Deputy Chief Economist Odeta Kushi.
Fortune 100 List for 8 Straight Years
Proud to be ranked number 59 on the 2023 Fortune 100 list.
Great Place To Work
We are proud to be a Great Place to Work Certified company for 9 years straight.