Government Careers
  • Grafana Observability SME

  • Omni Inclusive
  • New York, New York 10001 United States

Grafana Cloud Observability Platform Engineer

Top Skills: Production expertise across the full Grafana stack: Mimir, Loki, Tempo, Alloy, Beyla, Grafana Application Observability, Unified Alerting. Strong PromQL, LogQL, and TraceQL authoring skills; able to write recording rules and SLO queries from scratch. OpenTelemetry practitioner: OTLP, collectors, SDK/agent instrumentation for at least three of Java, .NET, Go, Python, Node.js. eBPF-based auto-instrumentation experience with Beyla (or an equivalent such as Pixie or Cilium Tetragon) in a production context. Experience integrating Grafana alerts into ServiceNow Event Management (native inbound integration, not webhook-only patterns); familiarity with ServiceNow ITOM, AIOps event correlation, and CMDB CI attachment. Multi-environment hosting fluency (on-prem, AWS, Azure) and Linux/Windows host agent deployment at scale. Dashboard-as-code and GitOps patterns (Grafana provisioning, Terraform provider, or Grizzly). Excellent written communication: solution architecture documents, runbooks, and stakeholder-facing status reporting.

Role Summary: Own the end-to-end technical design, build, and operationalization of the Grafana Cloud observability platform for a 50-application estate spanning Java, .NET, Go, Python, and Node.js workloads hosted across on-premises data centres, AWS, and Azure. The SME serves as the senior technical authority across all eight in-scope Grafana Cloud modules and is accountable for instrumentation strategy, alerting design, dashboarding standards, and integration into ServiceNow ITOM via native Event Management. Scope is application-level observability only: server and network health remain on SolarWinds, and URL/synthetic monitoring remains on Uptrends.

Key Responsibilities:

  • Platform architecture and configuration across all eight in-scope Grafana Cloud modules.
  • Tenancy and access design: organizations, folders, teams, role-based access control, dashboard variables, template links, and annotations.
  • Application instrumentation strategy by technology stack.
  • Log pipeline engineering via Alloy: structured JSON, Log4j/Logback, Serilog, NLog, Windows Event Log, Winston, Pino, and loguru, with parsing rules tuned per stack and LogQL-based dashboards and alerts.
  • Alerting design: PromQL/LogQL/TraceQL rules, severity taxonomy, grouping, routing, and notification policies.
  • Single Pane of Glass: design and deliver a tiered SPoG that surfaces Grafana application telemetry alongside contextual links to SolarWinds and Uptrends.
  • Business Dashboards and Reporting: partner with the Dashboard Lead to define KPI taxonomy and ensure dashboard-as-code patterns and version control.
  • ServiceNow ITOM integration: co-own the design and review of the Grafana → ServiceNow Event Management (native inbound integration) flow, covering event allow-list governance ("deny by default"), enrichment, deduplication, AIOps correlation, automated incident creation with severity mapping and assignment group rules, CMDB CI attachment, and ServiceNow-as-master incident state.
  • Quality assurance authority across all technical deliverables: solution architecture document, instrumentation runbooks, dashboard and alert library, integration test results.
  • Phased delivery execution: Mobilise & Client → Application Foundation (ML1) → Onboarding of 40 Simple apps (ML2) → Medium/Complex apps + ITOM Integration (ML2–3) → SPoG, Dashboards & Reporting (ML3–4) → Stabilisation, KT, and post-deployment support (ML4).
  • Knowledge transfer: produce platform operating procedures and conduct structured handover to the client's run team.

Required Skills & Experience:

  • 7+ years in observability/monitoring engineering with deep, recent hands-on Grafana Cloud experience (not just OSS Grafana).
  • Production expertise across the full Grafana stack: Mimir, Loki, Tempo, Alloy, Beyla, Grafana Application Observability, Unified Alerting.
  • Strong PromQL, LogQL, and TraceQL authoring skills; able to write recording rules and SLO queries from scratch.
  • OpenTelemetry practitioner: OTLP, collectors, SDK/agent instrumentation for at least three of Java, .NET, Go, Python, Node.js.
  • eBPF-based auto-instrumentation experience with Beyla (or an equivalent such as Pixie or Cilium Tetragon) in a production context.
  • Experience integrating Grafana alerts into ServiceNow Event Management (native inbound integration, not webhook-only patterns); familiarity with ServiceNow ITOM, AIOps event correlation, and CMDB CI attachment.
  • Multi-environment hosting fluency (on-prem, AWS, Azure) and Linux/Windows host agent deployment at scale.
  • Dashboard-as-code and GitOps patterns (Grafana provisioning, Terraform provider, or Grizzly).
  • Excellent written communication: solution architecture documents, runbooks, and stakeholder-facing status reporting.

Nice to Have:

  • Grafana Certified Professional or equivalent vendor certification.
  • Prior experience in a regulated utility, energy, or critical-infrastructure environment.
  • Familiarity with SolarWinds and Uptrends (sufficient to design clean boundaries with retained tooling, not to administer them).
  • Experience with ServiceNow CSDM and Service Mapping governance.
  • Exposure to FinOps for observability: cardinality control, log volume management, retention tuning in Mimir/Loki.

Out of Scope for This Role:

  • Server health and network monitoring (owned by SolarWinds).
  • URL/synthetic endpoint monitoring (owned by Uptrends).
  • ServiceNow ITSM workflow ownership: the incident lifecycle remains with the client's ITSM/ITOM team; this role designs the integration, not the downstream process.
