Blue Shield of California’s mission is to ensure all Californians have access to high-quality health care at a sustainably affordable price. We are transforming health care in a way that truly serves our nonprofit mission by lowering costs, improving quality, and enhancing the member and physician experience.
To fulfill our mission, we must ensure a diverse, equitable, and inclusive environment where all employees can be their authentic selves and fully contribute to meet the needs of the multifaceted communities we serve. Our comprehensive approach to diversity, equity, and inclusion combines a focus on our people, processes, and systems with a deep commitment to promoting social justice and health equity through our products, business practices, and presence as a corporate citizen.
Blue Shield has received awards and recognition for being a certified Great Place to Work, a best place to work for LGBTQ equality, a leading disability employer, one of the best companies for women to advance, one of the Bay Area’s top companies in volunteering & giving, and one of the world’s most ethical companies. Here at Blue Shield of California, we are striving to make a positive change across our industry and the communities we live in – join us!
BSC’s healthcare apps support space has unique challenges that need your SRE experience as we bridge Dev and Ops. As an SRE, you thrive in the space between Dev and Ops, playing a critical role as the eyes, ears, and technical glue for our Apps Support function. BSC is moving into a new Cloud Services infrastructure space, operating and delivering at scale across multiple geographically dispersed data centers that serve millions of Blue Shield of California members, health providers, and partners.
We partner with Application Development teams, L3 App Support teams, business stakeholders, Incident Management teams, and architects to help ensure that high-quality products are developed, monitored, and quickly recoverable in Production from code logic or system faults. We run the majority of our systems on a mix of open source, vendor-licensed, and internally developed tools that handle system configuration management, provisioning, software deployment, logging, and monitoring. You’ll learn these tools and have opportunities to improve them or build your own bombproof versions that best suit business needs. Our team is collaborative. We work closely with the development teams we support to deliver the best results. Good ideas are heard, and data-driven results are what get recognized and rewarded.
In this role, you will:
- Have the opportunity to build and own, from scratch, the end-to-end support function focused on application monitoring, availability, and performance for our mission-critical application services. You will build automation to prevent incidents and to automate remediation when service incidents occur. The idea of heavy manual process makes you physically ill, and you’ll stop at nothing to continuously optimize how we support monitoring and uptime for our applications in Production.
- Solve application support issues and develop risk prevention solutions using data analytics, partnering with development teams to improve end-to-end application monitoring through code and automation. You must be a mature and dynamic influencer who works with developers to bring in industry best practices. You are an experimenter in your field of expertise, creating full infrastructure stacks that best support high application availability and reliability for our Portfolios.
- Design and build an SRE function that owns application availability and performance and manages them through automation and proactive/predictive alerts, using data analytics toolsets to identify areas of improvement for Dev and Ops teams. Implement comprehensive service monitoring to ensure uptime and performance, including synthetic monitoring, real user traffic monitoring, application performance monitoring, system-level monitoring, and dashboards.
- Define, measure, and meet SLAs/SLOs focusing on availability, performance, incidents, and chronic quality issues. Arm developers with deeper insights into application performance and service health issues to reduce MTTA and MTTR.
Your Knowledge and Experience
- Requires a bachelor’s degree in Computer Science or equivalent experience (software development or production operations experience in a large-scale environment)
- Requires at least 10 years of prior relevant experience
- Solid understanding of DevOps release delivery flows and how they matured from traditional waterfall, monolithic operational models
- Experience operating in full Agile CI/CD DevOps pipeline environments (dev, test, build, merge, deployments) and familiarity with modernized toolsets that enable these areas
- Experience with network traffic monitoring, file transfer tracing and beaconing implementation is required
- Experience with and exposure to telemetry/observability tools is required: Splunk, Datadog, AppDynamics, New Relic, Prometheus, OpenTelemetry, Sentry, Logstash, Grafana
- Exposure to emerging technologies in ML/AIOps is a huge plus
- Experience in cloud app implementations on Azure or Google Cloud
- Experience with some “Big Data” technologies: Kafka, Elasticsearch, NoSQL stores
- Ability to partner with release engineers/SCMs and understand the impacts of branches and code merges
- Ability to understand and empathize with the technical and functional needs of quality engineers, and to value continuous testing
- Excellent communication and audience-awareness
- Honest. We hold ourselves to the highest ethical and integrity standards. We build trust by doing what we say we’re going to do and by acknowledging and correcting where we fall short
- Human. We strive to be our authentic selves, listening and communicating effectively, and showing empathy towards others by walking in their shoes
- Courageous. We stand up for what we believe in and are committed to the hard work necessary to achieve our ambitious goals