Tracing #

Logs tell you what happened. Metrics tell you how often and how slowly. But when a slow request crosses 10 microservices, logs and metrics cannot answer the most important question: where is the bottleneck, and why is this request slower than the others? Distributed tracing fills that gap by recording the full journey of a request through the system: from the user's browser to the API gateway, to service A, to service B, to the database, and back. Ansible automates the deployment of the tracing infrastructure, so this visibility is available from day one.

Distributed Tracing Architecture #

Applications            OpenTelemetry Collector          Backend
  ├── service-a ──┐
  ├── service-b ──┼──► OTLP (gRPC/HTTP) ──► Jaeger/Tempo ──► Grafana
  └── service-c ──┘     (batching,              (storage,      (visualization,
                         sampling,               query)         correlation)
                         export)

The OpenTelemetry Collector is the central component: it receives traces from applications, processes them (sampling, enrichment), and forwards them to the backend. Keeping export configuration out of the applications makes it possible to swap backends without changing application code.

Installing the OpenTelemetry Collector #

# roles/otel-collector/tasks/main.yml
---
- name: Add the OpenTelemetry repository
  apt_repository:
    repo: "deb [trusted=yes] https://packages.observiq.com/debian/ stable main"
    state: present

- name: Install OpenTelemetry Collector
  apt:
    name: otelcol-contrib
    state: present

- name: Deploy the OTEL Collector configuration
  template:
    src: otel-collector-config.yml.j2
    dest: /etc/otelcol-contrib/config.yaml
    owner: otelcol-contrib
    group: otelcol-contrib
    mode: '0640'
    validate: "otelcol-contrib validate --config=%s"
  notify: Restart otel-collector

- name: Start the OTEL Collector
  systemd:
    name: otelcol-contrib
    state: started
    enabled: true
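
The configuration task above notifies a handler named `Restart otel-collector`, which the role has to define somewhere. A minimal sketch follows, assuming the conventional `handlers/main.yml` location inside the role; the file path is an assumption, not something the original role shows.

```yaml
# roles/otel-collector/handlers/main.yml (assumed path)
---
# The handler name must match the string used in "notify" exactly
- name: Restart otel-collector
  systemd:
    name: otelcol-contrib
    state: restarted
```

Handlers run once at the end of the play, so the collector restarts only when the rendered configuration actually changed.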

{# templates/otel-collector-config.yml.j2 #}
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  # Also accept traces from Jaeger clients (for compatibility with legacy applications)
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
    send_batch_max_size: 2000

  # Sampling to reduce volume in production
  probabilistic_sampler:
    sampling_percentage: {{ otel_sampling_percentage | default(10) }}

  # Add resource attributes
  resource:
    attributes:
      - key: deployment.environment
        value: "{{ env }}"
        action: insert
      - key: host.name
        value: "{{ inventory_hostname }}"
        action: insert

exporters:
  otlp/tempo:
    endpoint: "{{ tempo_host }}:4317"
    tls:
      insecure: true

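  # Also send to Jaeger if it is still in use. The dedicated jaeger exporter was
  # removed from recent otelcol-contrib releases; Jaeger accepts OTLP natively,
  # so a second OTLP exporter is used here (port 4317 assumed).
  otlp/jaeger:
    endpoint: "{{ jaeger_host }}:4317"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp, jaeger]
      # Sample first, then enrich, and batch last so dropped spans are never batched
      processors: [probabilistic_sampler, resource, batch]
      # Add otlp/jaeger to this list to export to Jaeger as well
      exporters: [otlp/tempo]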
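
The template above relies on several variables that have to come from the inventory. The sketch below shows one way to supply them; the file name and every value are hypothetical placeholders rather than settings from the original setup.

```yaml
# group_vars/all.yml (hypothetical file and values, adjust to your inventory)
env: production                      # rendered into deployment.environment
otel_sampling_percentage: 10         # overrides the default(10) in the template
tempo_host: tempo.internal.example   # placeholder hostname
jaeger_host: jaeger.internal.example # placeholder hostname
```

`inventory_hostname` does not need to be defined because Ansible sets it for every host.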

Installing Grafana Tempo #

Tempo is Grafana's tracing backend and integrates tightly with the Loki and Prometheus stack:

# roles/tempo/tasks/main.yml
---
- name: Create the tempo user
  user:
    name: tempo
    system: true
    shell: /usr/sbin/nologin
    home: /var/lib/tempo
    create_home: true

- name: Download Tempo binary
  get_url:
    url: "https://github.com/grafana/tempo/releases/download/v{{ tempo_version }}/tempo_{{ tempo_version }}_linux_amd64.tar.gz"
    dest: /tmp/tempo.tar.gz
    checksum: "sha256:{{ tempo_checksum }}"

- name: Extract and install
  unarchive:
    src: /tmp/tempo.tar.gz
    dest: /usr/local/bin/
    include: ['tempo']
    remote_src: true

- name: Deploy the Tempo configuration
  template:
    src: tempo-config.yml.j2
    dest: /etc/tempo/config.yml
    owner: tempo
    group: tempo
    mode: '0640'
  notify: Restart Tempo
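
The tasks above write to `/etc/tempo` and notify a `Restart Tempo` handler, but the role as shown neither creates that directory nor defines a service. The following is a minimal sketch of the missing pieces, assuming Tempo runs as a plain systemd service; the unit file contents and the handler file location are assumptions. The directory task would need to run before the configuration is deployed.

```yaml
# Hypothetical additions to roles/tempo/tasks/main.yml
- name: Create Tempo config and data directories
  file:
    path: "{{ item }}"
    state: directory
    owner: tempo
    group: tempo
    mode: '0750'
  loop:
    - /etc/tempo
    - /var/lib/tempo/traces
    - /var/lib/tempo/wal

- name: Install a minimal systemd unit for Tempo
  copy:
    dest: /etc/systemd/system/tempo.service
    mode: '0644'
    content: |
      [Unit]
      Description=Grafana Tempo
      After=network-online.target

      [Service]
      User=tempo
      Group=tempo
      ExecStart=/usr/local/bin/tempo -config.file=/etc/tempo/config.yml
      Restart=on-failure

      [Install]
      WantedBy=multi-user.target
  notify: Restart Tempo

- name: Start and enable Tempo
  systemd:
    name: tempo
    state: started
    enabled: true
    daemon_reload: true

# roles/tempo/handlers/main.yml (assumed location)
- name: Restart Tempo
  systemd:
    name: tempo
    state: restarted
    daemon_reload: true
```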

{# templates/tempo-config.yml.j2 #}
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

ingester:
  trace_idle_period: 10s
  max_block_bytes: 1_000_000
  max_block_duration: 5m

compactor:
  compaction:
    compaction_window: 1h
    max_block_bytes: 100_000_000
    block_retention: {{ tempo_retention | default('720h') }}  # defaults to 30 days

storage:
  trace:
    backend: local
    local:
      path: /var/lib/tempo/traces
    wal:
      path: /var/lib/tempo/wal
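
Once the configuration is in place, a readiness check helps catch a broken deployment before Grafana is pointed at it. This sketch polls Tempo's `/ready` endpoint on the `http_listen_port` configured above; it assumes Tempo is already running as a service on the target host, and the retry counts are arbitrary.

```yaml
# Hypothetical verification task, run on the Tempo host
- name: Wait until Tempo reports ready
  uri:
    url: "http://127.0.0.1:3200/ready"
    status_code: 200
  register: tempo_ready
  until: tempo_ready.status == 200
  retries: 12   # up to about one minute in total
  delay: 5
```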

Integrating Tempo with Grafana #

To correlate traces, metrics, and logs in Grafana, add a Tempo datasource and configure its trace-to-logs and trace-to-metrics links:

# roles/grafana/tasks/tempo-datasource.yml
---
- name: Deploy the Tempo datasource
  template:
    src: datasource-tempo.yml.j2
    dest: /etc/grafana/provisioning/datasources/tempo.yml
    owner: grafana
    group: grafana
    mode: '0640'
  notify: Restart Grafana

{# templates/datasource-tempo.yml.j2 #}
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://{{ tempo_host }}:3200
    jsonData:
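      # Correlation: from a trace to its logs in Loki
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: -1m
        spanEndTimeShift: 1m
        filterByTraceID: true
        filterBySpanID: false
        # customQuery must be true for the query below to be used;
        # $$ escapes $ so Grafana does not treat it as an environment variable
        customQuery: true
        query: '{job="$${__span.tags["service.name"]}"} |= "$${__trace.traceId}"'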

      # Correlation: from a trace to its metrics in Prometheus
      tracesToMetrics:
        datasourceUid: prometheus
        spanStartTimeShift: -2m
        spanEndTimeShift: 2m
        queries:
          - name: "Request rate"
            query: 'rate(http_requests_total{service="$${__span.tags["service.name"]}"}[5m])'

      # Service graph built from trace data
      serviceMap:
        datasourceUid: prometheus

      # Node graph view
      nodeGraph:
        enabled: true
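
The Tempo datasource covers the jump from a trace to its logs. The reverse direction, from a log line to its trace, is handled by derived fields on the Loki datasource. The sketch below shows that side under two assumptions: a hypothetical `loki_host` variable, and log lines that carry the trace ID as `trace_id=<hex>`.

```yaml
{# templates/datasource-loki.yml.j2 (hypothetical file name) #}
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://{{ loki_host }}:3100
    jsonData:
      derivedFields:
        - name: TraceID
          # Assumes log lines contain trace_id=<hex>; adjust to your log format
          matcherRegex: 'trace_id=(\w+)'
          # $$ escapes $ in Grafana provisioning files
          url: '$${__value.raw}'
          # Must match the uid of the Tempo datasource
          datasourceUid: tempo
```

The `uid: loki` value matters because `tracesToLogsV2.datasourceUid` in the Tempo datasource refers to it.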

Application Instrumentation #

For a Python application using the OpenTelemetry SDK:

# roles/app-tracing/tasks/main.yml
---
- name: Install OpenTelemetry Python packages
  pip:
    name:
      - opentelemetry-api
      - opentelemetry-sdk
      - opentelemetry-exporter-otlp
      - opentelemetry-instrumentation-flask
      - opentelemetry-instrumentation-sqlalchemy
      - opentelemetry-instrumentation-requests
    virtualenv: "{{ app_venv }}"

- name: Deploy the OpenTelemetry configuration for the application
  template:
    src: otel-config.env.j2
    dest: "{{ app_dir }}/.otel.env"
    owner: "{{ app_user }}"
    mode: '0640'
  notify: Restart application

{# templates/otel-config.env.j2 #}
OTEL_SERVICE_NAME={{ app_name }}
OTEL_EXPORTER_OTLP_ENDPOINT=http://{{ otel_collector_host }}:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_RESOURCE_ATTRIBUTES=service.version={{ app_version }},deployment.environment={{ env }},host.name={{ inventory_hostname }}
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG={{ otel_sampling_rate | default('0.1') }}
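
The environment file only takes effect if the application process actually loads it. One way to do that, sketched here, is a systemd drop-in; this assumes the application runs as a systemd service named after `app_name`, which the original role does not state. The SDK reads these variables only when the application sets up OpenTelemetry, either in code or through auto-instrumentation.

```yaml
# Hypothetical additions to roles/app-tracing/tasks/main.yml
- name: Create the systemd drop-in directory
  file:
    path: "/etc/systemd/system/{{ app_name }}.service.d"
    state: directory
    mode: '0755'

- name: Load .otel.env into the service environment
  copy:
    dest: "/etc/systemd/system/{{ app_name }}.service.d/otel.conf"
    mode: '0644'
    content: |
      [Service]
      EnvironmentFile={{ app_dir }}/.otel.env

- name: Reload systemd so the drop-in takes effect
  systemd:
    daemon_reload: true
```

The service still has to be restarted afterwards before the new variables reach the process.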

Sampling Strategy #

In high-traffic production systems, storing every trace is too expensive. Use tail-based sampling, which decides after a trace has completed, so that errors and slow requests are always kept:

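# Sampling strategy in the OTEL Collector
processors:
  tail_sampling:
    decision_wait: 10s
    # Traces held in memory while awaiting a decision; should be at least
    # expected_new_traces_per_sec * decision_wait (1000 * 10 = 10000)
    num_traces: 50000
    expected_new_traces_per_sec: 1000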
    policies:
      # Always keep traces that contain an error
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Always keep slow traces (> 1 second)
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000

      # Sample 5% of normal traces
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
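
A processor has no effect until it is listed in a pipeline. The sketch below shows how the traces pipeline from the collector configuration might look with `tail_sampling` taking the place of `probabilistic_sampler`; the ordering is a suggestion, not something the original configuration specifies.

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp, jaeger]
      # tail_sampling replaces probabilistic_sampler; batch stays last
      processors: [tail_sampling, resource, batch]
      exporters: [otlp/tempo]
```

Two caveats apply. Tail sampling can only judge a trace if all of its spans reach the same collector instance, so a multi-collector setup needs trace-aware load balancing in front. The tail sampler also sees only what the SDK exports: with `OTEL_TRACES_SAMPLER_ARG` at 0.1, about 90% of traces, including errors, are dropped before they reach the collector, so raising it to 1.0 is worth considering when the "keep all errors" policy matters.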

Summary #

Distributed tracing complements logs and metrics by showing where the bottleneck occurs as a request travels across many services.
The OpenTelemetry Collector is the middle layer that receives traces from applications and forwards them to the backend, so the backend can be swapped without changing application code.
Grafana Tempo integrates with Loki and Prometheus: one click on a trace jumps to the logs or metrics from the same time window.
The tracesToLogsV2 and tracesToMetrics settings in the Tempo datasource are the key to correlating the three pillars of observability in Grafana.
Use tail-based sampling in production: keep all error traces and slow traces, and sample a small fraction of normal traces to reduce storage cost.
Pass OTEL configuration to applications via environment variables, so changes to sampling or endpoints require no application code changes.
