
RESPONSIBILITIES:
• Design and maintain end-to-end monitoring for critical services using Dynatrace (APM, Real User Monitoring, Synthetic, Davis AI, Smartscape) and GCP Cloud Monitoring (metrics, alerting policies, SLOs/SLIs, uptime checks, dashboards).
• Build service maps, dependency models, and problem detection in Dynatrace; tune Davis AI problem rules and reduce alert noise through thresholds, baselining, and tagging.
• Implement SLOs/SLIs with error budgets; continuously review burn rates and align alerting to customer impact.
• Partner with application teams to instrument code paths (e.g., Dynatrace OneAgent), trace distributed transactions, and validate golden signals (latency, traffic, errors, saturation).
Logging, Analytics & Insights (Splunk, Power BI)
• Create and optimize Splunk data models, indexes, sourcetypes, ingestion pipelines, and SPL searches; build actionable dashboards for NOC/SRE/Engineering.
• Develop operational analytics and executive reporting in Power BI (data modeling, DAX/Measures, scheduled refresh) to track reliability KPIs, incident trends, MTTR/MTTD, SLO compliance, and capacity signals.
• Establish governance for data quality, field extractions, and retention to ensure fast, accurate investigations.
Incident Management & Problem Management
• Lead incident response (Sev1/Sev2): run bridges, coordinate SMEs, communicate status and timelines, and drive mitigation and customer updates.
• Maintain runbooks, decision trees, and standard operating procedures; ensure blameless post-incident reviews (PIRs) with clear RCA, corrective actions, and preventative measures.
• Track and close problem tickets tied to recurring failure modes; verify effectiveness of fixes via metrics and error budgets.
Reliability Engineering & Automation (Light Coding)
• Use light coding/scripting to automate recurring tasks: alert tuning, data enrichment, log parsing, playbook triggers, service health checks.
• Build small utilities or bots for on-call workflows (e.g., auto-triage, context gathering, incident timelines).
• Contribute to observability standards and best practices (naming, tags, SLIs, alert policies), and mentor teams on instrumenting for reliability.
Must Have Skills:
• 1+ years of experience in full-stack development with Java.
• 2+ years of experience in production operations/observability using Dynatrace and Splunk in high-availability environments.
• Exposure to SPL (Splunk) and Dynatrace (APM/RUM/Synthetic), including alert design, dashboards, and noise reduction.
• Proven incident commander experience for Sev1/Sev2 with clear comms, stakeholder management, and PIR facilitation.
Nice-To-Have Skills:
• Coding/scripting for automation and data manipulation (e.g., Python or PowerShell; Go/Bash a plus).
• Recent hands-on experience with GCP operations: Cloud Monitoring, Cloud Logging, Alerting Policies, Uptime Checks, SLOs/SLIs; familiarity with Error Reporting/Trace is a plus.
• Solid understanding of service reliability concepts: golden signals, SLOs/error budgets, capacity and saturation, graceful degradation.
• Strong analytical mindset with a bias toward measurable outcomes (MTTD/MTTR, alert volume, SLO compliance).
Soft Skills:
• Must possess excellent verbal and written communication skills, as well as strong problem-solving and analytical skills.
• Must be a self-starter, reliable, highly motivated, results-oriented, customer-focused, and attentive to detail.
• Excellent analytical, documentation, and organizational skills.
• Ability to manage multiple priorities and deliver within deadlines.
• Strong communication and stakeholder engagement skills.