observability-monitoring

Metrics collection, logging infrastructure, distributed tracing, SLO implementation, and monitoring dashboards

View on GitHub
Author Seth Hobson
Namespace @wshobson/claude-code-workflows
Category operations
Version 1.2.1
Stars 27,261
Downloads 73

Installation

npx claude-plugins install @wshobson/claude-code-workflows/observability-monitoring

Contents

Folders: agents, commands, skills

Included Skills

This plugin includes 4 skill definitions:

### distributed-tracing

> Implement distributed tracing with Jaeger and Tempo to track requests across microservices and identify performance bottlenecks. Use when debugging microservices, analyzing request flows, or implementing observability for distributed systems.

<details>
<summary>View skill definition</summary>

# Distributed Tracing

Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.

## Purpose

Track requests across distributed systems to understand latency, dependencies, and failure points.

## When to Use

## Distributed Tracing Concepts

### Trace Structure

```
Trace (Request ID: abc123)
Span (frontend) [100ms]
Span (api-gateway) [80ms]
  ├→ Span (auth-service) [10ms]
  └→ Span (user-service) [60ms]
      └→ Span (database) [40ms]
```
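
To make the span hierarchy concrete, here is a minimal Python sketch that models the example trace as a tree and computes each span's self time, meaning its duration minus the time spent in its direct children. The `Span` class and the `self_times` helper are hypothetical names for illustration and are not part of Jaeger or Tempo. The sketch also assumes that `api-gateway` is a child of `frontend` and that sibling spans do not overlap in time.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Span:
    """One timed operation in a trace (hypothetical model, not a Jaeger type)."""
    name: str
    duration_ms: int
    children: List["Span"] = field(default_factory=list)


def self_time_ms(span: Span) -> int:
    """Duration not attributed to direct children (assumes children do not overlap)."""
    return span.duration_ms - sum(child.duration_ms for child in span.children)


def self_times(root: Span) -> Dict[str, int]:
    """Walk the span tree depth-first and collect self time per span name."""
    result = {root.name: self_time_ms(root)}
    for child in root.children:
        result.update(self_times(child))
    return result


# The example trace from the diagram, with api-gateway assumed to sit under frontend.
trace = Span("frontend", 100, [
    Span("api-gateway", 80, [
        Span("auth-service", 10),
        Span("user-service", 60, [Span("database", 40)]),
    ]),
])

breakdown = self_times(trace)
```

Under these assumptions the `database` span carries the largest self time (40 ms), which is the kind of bottleneck a trace view is meant to surface.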

### Key Components

## Jaeger Setup

### Kubernetes Deployment

```bash
# Deploy Jaeger Operator
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability

# Deploy Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
  ingress:
    enabled: true
EOF
```

### Docker Compose

```yaml
version: "3.8"
services:
```

...(truncated)

</details>

### grafana-dashboards

> Create and manage production Grafana dashboards for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.

<details>
<summary>View skill definition</summary>

# Grafana Dashboards

Create and manage production-ready Grafana dashboards for comprehensive system observability.

## Purpose

Design effective Grafana dashboards for monitoring applications, infrastructure, and business metrics.

## When to Use

- Visualize Prometheus metrics
- Create custom dashboards
- Implement SLO dashboards
- Monitor infrastructure
- Track business KPIs

## Dashboard Design Principles

### 1. Hierarchy of Information

```
┌─────────────────────────────────────┐
│ Critical Metrics (Big Numbers)      │
├─────────────────────────────────────┤
│ Key Trends (Time Series)            │
├─────────────────────────────────────┤
│ Detailed Metrics (Tables/Heatmaps)  │
└─────────────────────────────────────┘
```


### 2. RED Method (Services)

- **Rate** - Requests per second
- **Errors** - Error rate
- **Duration** - Latency/response time
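
As a rough illustration of what the three RED signals measure, the Python sketch below derives them from a list of raw request records. The `Request` record, the `red_summary` function, and the choice of treating HTTP status 500 and above as an error are assumptions made for this example. A real dashboard would normally compute these values with queries against the metrics backend rather than in application code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """One observed request (hypothetical record shape for this sketch)."""
    duration_s: float
    status: int


def red_summary(requests: List[Request], window_s: float) -> Dict[str, float]:
    """Compute Rate, Errors, and Duration for the requests seen in one window."""
    total = len(requests)
    if total == 0:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p95_duration_s": 0.0}
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank p95: the smallest sample with at least 95% of samples at or below it.
    rank = (95 * total + 99) // 100
    return {
        "rate_rps": total / window_s,
        "error_ratio": errors / total,
        "p95_duration_s": durations[rank - 1],
    }


# 20 requests in a 10-second window; the two slowest return HTTP 500.
sample = [Request(duration_s=i / 100, status=500 if i > 18 else 200) for i in range(1, 21)]
summary = red_summary(sample, window_s=10.0)
```

For this sample the summary works out to 2 requests per second, a 10% error ratio, and a 95th-percentile duration of 0.19 seconds.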

### 3. USE Method (Resources)

- **Utilization** - % time resource is busy
- **Saturation** - Queue length/wait time
- **Errors** - Error count
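
The USE signals can be sketched the same way for a single resource. The function name `use_summary` and its inputs (busy time, sampled queue lengths, an error counter) are hypothetical and chosen only to mirror the three definitions listed here.

```python
from typing import Dict, List


def use_summary(busy_s: float, window_s: float,
                queue_samples: List[int], error_count: int) -> Dict[str, float]:
    """Compute Utilization, Saturation, and Errors for one resource over one window."""
    # Utilization: share of the window during which the resource was busy.
    utilization_pct = 100.0 * busy_s / window_s
    # Saturation: average queue length across the samples taken in the window.
    saturation = sum(queue_samples) / len(queue_samples) if queue_samples else 0.0
    return {
        "utilization_pct": utilization_pct,
        "saturation_avg_queue": saturation,
        "errors": float(error_count),
    }


# A disk that was busy for 45 of 60 seconds, with three queue-length samples.
disk = use_summary(busy_s=45.0, window_s=60.0, queue_samples=[0, 2, 4], error_count=3)
```

Here the disk is 75% utilized with an average queue length of 2, so it is busy but not yet heavily saturated.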

## Dashboard Structure

### API Monitoring Dashboard

```json
{
  "dashboard": {
    "title": "API Monitoring",
    "tags": ["api", "production"],
    "timezone": "browser",
    "refresh": "30s",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 0, "y": 0, "w": 1
```

...(truncated)

</details>

### prometheus-configuration

> Set up Prometheus for comprehensive metric collection, storage, and monitoring of infrastructure and applications. Use when implementing metrics collection, setting up monitoring infrastructure, or configuring alerting systems.

<details>
<summary>View skill definition</summary>

# Prometheus Configuration

Complete guide to Prometheus setup, metric collection, scrape configuration, and recording rules.

## Purpose

Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications.

## When to Use

- Set up Prometheus monitoring
- Configure metric scraping
- Create recording rules
- Design alert rules
- Implement service discovery

## Prometheus Architecture

```
┌──────────────┐
│ Applications │ ← Instrumented with client libraries
└──────┬───────┘
       │ /metrics endpoint
       ↓
┌──────────────┐
│ Prometheus   │ ← Scrapes metrics periodically
│   Server     │
└──────┬───────┘
       │
       ├─→ AlertManager (alerts)
       ├─→ Grafana (visualization)
       └─→ Long-term storage (Thanos/Cortex)
```
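
To show what Prometheus actually reads when it scrapes an application, the following Python sketch serves a single counter in the text exposition format using only the standard library. It is a simplified stand-in for an official client library, which a real service would normally use. The `COUNTERS` dictionary, the `render_metrics` and `serve` helpers, and port 8000 are assumptions for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict

# In-process counter values keyed by metric name (hypothetical application state).
COUNTERS: Dict[str, float] = {"http_requests_total": 0.0}


def render_metrics(counters: Dict[str, float]) -> str:
    """Render counters as Prometheus text exposition lines."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers GET /metrics with the current counter values."""

    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, serving /metrics on the given port (assumed value)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint. A scrape job pointed at that port would then collect `http_requests_total` on every scrape interval.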


## Installation

### Kubernetes with Helm

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# The volume size is set through the chart's storageSpec volume claim template.
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
```

### Docker Compose

```yaml
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--stora
```

...(truncated)

</details>

### slo-implementation

> Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or measuring service performance.

<details>
<summary>View skill definition</summary>

# SLO Implementation

Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

## Purpose

Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.

## When to Use

- Define service reliability targets
- Measure user-perceived reliability
- Implement error budgets
- Create SLO-based alerts
- Track reliability goals

## SLI/SLO/SLA Hierarchy

```
SLA (Service Level Agreement)
  ↓ Contract with customers
SLO (Service Level Objective)
  ↓ Internal reliability target
SLI (Service Level Indicator)
  ↓ Actual measurement
```


## Defining SLIs

### Common SLI Types

#### 1. Availability SLI

```promql
# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
```

#### 2. Latency SLI

```promql
# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
```

#### 3. Durability SLI

```promql
# Successful writes / Total writes
sum(storage_writes_successful_total)
/
sum(storage_writes_total)
```
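
Each SLI above is a ratio of good events to total events, and the error budget follows from comparing that ratio with the SLO. The Python sketch below works through that arithmetic for the availability case. The function names and the request counts are hypothetical, chosen so the numbers are easy to check by hand.

```python
def availability_sli(good: int, total: int) -> float:
    """Successful requests / total requests; an empty window counts as fully available."""
    return good / total if total else 1.0


def error_budget_remaining(good: int, total: int, slo: float) -> float:
    """Fraction of the error budget still unspent (1.0 untouched, 0.0 or below exhausted)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total == 0:
        return 1.0
    allowed_failures = (1.0 - slo) * total  # failures the SLO tolerates
    actual_failures = total - good          # failures observed
    return 1.0 - actual_failures / allowed_failures


# Hypothetical 28-day counts: 1,000,000 requests, 500 of them failed.
sli = availability_sli(good=999_500, total=1_000_000)
remaining = error_budget_remaining(good=999_500, total=1_000_000, slo=0.999)
```

With a 99.9% SLO, one million requests tolerate about 1,000 failures, so 500 observed failures leave roughly half of the budget unspent.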

Reference: See `references/slo-definitions.md`

## Setting SLO Targets

### Availability SLO Examples

| SLO % | Downtime/Month | Downtime/Year |
| ----- | -------------- | ------------- |
| 99%   | 7.2 hours      | 3.65 days     |
| 99.9% | 43.2 minutes   | 8.76 hours    |
99.9

…(truncated)
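
The downtime figures listed for the 99% and 99.9% targets follow from multiplying the period length by the allowed failure fraction, using a 30-day month and a 365-day year. The short Python sketch below reproduces that arithmetic. The helper name `allowed_downtime` is hypothetical.

```python
from datetime import timedelta

MONTH = timedelta(days=30)   # month length implied by the listed figures
YEAR = timedelta(days=365)   # year length implied by the listed figures


def allowed_downtime(slo_percent: float, period: timedelta) -> timedelta:
    """Downtime permitted in the period while still meeting the SLO."""
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    return period * (1.0 - slo_percent / 100.0)


monthly_99 = allowed_downtime(99.0, MONTH)    # about 7.2 hours
monthly_999 = allowed_downtime(99.9, MONTH)   # about 43.2 minutes
yearly_999 = allowed_downtime(99.9, YEAR)     # about 8.76 hours
```

Each additional nine cuts the allowance by a factor of ten, which is why tighter targets leave little room for planned maintenance.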


Tags: operations, observability, monitoring, metrics, logging, tracing, slo, prometheus, grafana