Grafana Cloud Observability Platform Engineer
Top Skills: Production expertise across the full Grafana stack: Mimir, Loki, Tempo, Alloy, Beyla, Grafana Application Observability, Unified Alerting. Strong PromQL, LogQL, and TraceQL authoring skills; able to write recording rules and SLO queries from scratch. OpenTelemetry practitioner: OTLP, collectors, and SDK/agent instrumentation for at least three of Java, .NET, Go, Python, Node.js. eBPF-based auto-instrumentation experience with Beyla (or an equivalent such as Pixie or Cilium Tetragon) in a production context. Experience integrating Grafana alerts into ServiceNow Event Management (native inbound integration, not webhook-only patterns); familiarity with ServiceNow ITOM, AIOps event correlation, and CMDB CI attachment. Multi-environment hosting fluency (on-prem, AWS, Azure) and Linux/Windows host agent deployment at scale. Dashboard-as-code and GitOps patterns (Grafana provisioning, Terraform provider, or Grizzly). Excellent written communication: solution architecture documents, runbooks, and stakeholder-facing status reporting.
Role Summary: Own the end-to-end technical design, build, and operationalization of the Grafana Cloud observability platform for a 50-application estate spanning Java, .NET, Go, Python, and Node.js workloads hosted across on-premises data centres, AWS, and Azure. The SME serves as the senior technical authority across all eight in-scope Grafana Cloud modules and is accountable for instrumentation strategy, alerting design, dashboarding standards, and integration into ServiceNow ITOM via native Event Management. Scope is application-level observability only; server and network health remain on SolarWinds, and URL/synthetic monitoring remains on Uptrends.
Key Responsibilities:
- Platform architecture and configuration across all eight in-scope Grafana Cloud modules.
- Tenancy and access design: organizations, folders, teams, role-based access control, dashboard variables, template links, and annotations.
- Application instrumentation strategy by technology stack.
- Log pipeline engineering via Alloy: structured JSON, Log4j/Logback, Serilog, NLog, Windows Event Log, Winston, Pino, and loguru, with parsing rules tuned per stack and LogQL-based dashboards and alerts.
- Alerting design: PromQL/LogQL/TraceQL rules, severity taxonomy, grouping, routing, and notification policies.
- Single Pane of Glass: design and deliver a tiered SPoG that surfaces Grafana application telemetry alongside contextual links to SolarWinds and Uptrends.
- Business Dashboards and Reporting: partner with the Dashboard Lead to define the KPI taxonomy and ensure dashboard-as-code patterns and version control.
- ServiceNow ITOM integration: co-own the design and review of the Grafana-to-ServiceNow Event Management (native inbound integration) flow: event allow-list governance ("deny by default"), enrichment, deduplication, AIOps correlation, automated incident creation with severity mapping and assignment group rules, CMDB CI attachment, and ServiceNow-as-master incident state.
- Quality assurance authority across all technical deliverables: solution architecture document, instrumentation runbooks, dashboard and alert library, and integration test results.
- Phased delivery execution: Mobilise & Client-Application Foundation (ML1) → Onboarding of 40 Simple apps (ML2) → Medium/Complex apps + ITOM Integration (ML2-3) → SPoG, Dashboards & Reporting (ML3-4) → Stabilisation, KT, and post-deployment support (ML4).
- Knowledge transfer: produce platform operating procedures and conduct a structured handover to the client's run team.
Required Skills & Experience:
- 7+ years in observability/monitoring engineering with deep, recent hands-on Grafana Cloud experience (not just OSS Grafana).
- Production expertise across the full Grafana stack: Mimir, Loki, Tempo, Alloy, Beyla, Grafana Application Observability, Unified Alerting.
- Strong PromQL, LogQL, and TraceQL authoring skills; able to write recording rules and SLO queries from scratch.
- OpenTelemetry practitioner: OTLP, collectors, and SDK/agent instrumentation for at least three of Java, .NET, Go, Python, Node.js.
- eBPF-based auto-instrumentation experience with Beyla (or an equivalent such as Pixie or Cilium Tetragon) in a production context.
- Experience integrating Grafana alerts into ServiceNow Event Management (native inbound integration, not webhook-only patterns); familiarity with ServiceNow ITOM, AIOps event correlation, and CMDB CI attachment.
- Multi-environment hosting fluency (on-prem, AWS, Azure) and Linux/Windows host agent deployment at scale.
- Dashboard-as-code and GitOps patterns (Grafana provisioning, Terraform provider, or Grizzly).
- Excellent written communication: solution architecture documents, runbooks, and stakeholder-facing status reporting.
Nice to Have:
- Grafana Certified Professional or equivalent vendor certification.
- Prior experience in a regulated utility, energy, or critical-infrastructure environment.
- Familiarity with SolarWinds and Uptrends (sufficient to design clean boundaries with retained tooling, not to administer them).
- Experience with ServiceNow CSDM and Service Mapping governance.
- Exposure to FinOps for observability: cardinality control, log volume management, and retention tuning in Mimir/Loki.
Out of Scope for This Role:
- Server health and network monitoring (owned by SolarWinds).
- URL/synthetic endpoint monitoring (owned by Uptrends).
- ServiceNow ITSM workflow ownership: the incident lifecycle remains with the client's ITSM/ITOM team; this role designs the integration, not the downstream process.