SiriusXM

As part of the ERE (Ecosystem Reliability Engineering) team, I built, deployed, and maintained testing infrastructure and observability tools to keep all SiriusXM services running reliably with high uptime. Reduced platform service alerts by 83%, cut error log volume by 2M per week, and made alerts more actionable for developers. Implemented performance bottleneck detection and anomaly detection mechanisms, and improved autoscaling resiliency.

SiriusXM is a large project hosted mostly on AWS and managed with TypeScript-based CDK. It uses Datadog as its main observability tool and an on-call rota for production deployments.