Jr. Full Stack Developer - Java, Splunk, Dynatrace at FP Inc.

RESPONSIBILITIES:

• Design and maintain end-to-end monitoring for critical services using Dynatrace (APM, Real User Monitoring, Synthetic, Davis AI, Smartscape) and GCP Cloud Monitoring (metrics, alerting policies, SLOs/SLIs, uptime checks, dashboards).

• Build service maps, dependency models, and problem detection in Dynatrace; tune Davis AI problem rules and reduce alert noise through thresholds, baselining, and tagging.

• Implement SLOs/SLIs with error budgets; continuously review burn rates and align alerting with customer impact.

• Partner with application teams to instrument code paths (e.g., Dynatrace OneAgent), trace distributed transactions, and validate golden signals (latency, traffic, errors, saturation).

Logging, Analytics & Insights (Splunk, Power BI)

• Create and optimize Splunk data models, indexes, sourcetypes, ingestion pipelines, and SPL searches; build actionable dashboards for NOC/SRE/Engineering.

• Develop operational analytics and executive reporting in Power BI (data modeling, DAX/Measures, scheduled refresh) to track reliability KPIs, incident trends, MTTR/MTTD, SLO compliance, and capacity signals.

• Establish governance for data quality, field extractions, and retention to ensure fast, accurate investigations.

Incident Management & Problem Management

• Lead incident response (Sev1/Sev2): run bridges, coordinate SMEs, communicate status/timelines, drive mitigation and customer updates.

• Maintain runbooks, decision trees, and standard operating procedures; ensure blameless post-incident reviews (PIRs) with clear RCA, corrective actions, and preventative measures.

• Track and close problem tickets tied to recurring failure modes; verify effectiveness of fixes via metrics and error budgets.

Reliability Engineering & Automation (Light Coding)

• Use light coding/scripting to automate recurring tasks: alert tuning, data enrichment, log parsing, playbook triggers, service health checks.

• Build small utilities or bots for on-call workflows (e.g., auto-triage, context gathering, incident timelines).

• Contribute to observability standards and best practices (naming, tags, SLIs, alert policies), and mentor teams on instrumenting for reliability.

Must-Have Skills:

• 1+ years of full stack development experience with Java.

• 2+ years of production operations/observability experience using Dynatrace and Splunk in high-availability environments.

• Exposure to SPL (Splunk) and Dynatrace (APM/RUM/Synthetic), including alert design, dashboards, and noise reduction.

• Proven incident commander experience for Sev1/Sev2 with clear comms, stakeholder management, and PIR facilitation.

Nice-To-Have Skills:

• Coding/scripting for automation and data manipulation (e.g., Python or PowerShell; Go/Bash a plus).

• Recent hands-on experience with GCP operations: Cloud Monitoring, Cloud Logging, Alerting Policies, Uptime Checks, SLOs/SLIs; familiarity with Error Reporting/Trace is a plus.

• Solid understanding of service reliability concepts: golden signals, SLOs/error budgets, capacity and saturation, graceful degradation.

• Strong analytical mindset with a bias toward measurable outcomes (MTTD/MTTR, alert volume, SLO compliance).

Soft Skills:

• Must possess excellent verbal and written communication skills, as well as strong problem-solving and analytical skills.

• Must be a self-starter who is reliable, highly motivated, results-oriented, customer-focused, and attentive to detail.

• Excellent analytical, documentation, and organizational skills.

• Ability to manage multiple priorities and deliver within deadlines.

• Strong communication and stakeholder engagement skills.