Distributed Tracing and Trace Analysis on Kubernetes: Achieving End-to-End Observability
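1. Distributed Tracing Overview
Distributed tracing follows a single request as it passes through multiple services, recording each hop so that performance bottlenecks and points of failure can be located.
1.1 Tracing Architecture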
    ┌─────────────────────────────────────────────────────────────────┐
    │                       Distributed Tracing                       │
    │                                                                 │
    │  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐     │
    │  │  Service A   │     │  Service B   │     │  Service C   │     │
    │  │    (Span)    │     │    (Span)    │     │    (Span)    │     │
    │  └──────┬───────┘     └──────┬───────┘     └──────┬───────┘     │
    │         │                    │                    │             │
    │         ▼                    ▼                    ▼             │
    │  ┌─────────────────────────────────────────────────────────┐    │
    │  │                 Trace collection layer                  │    │
    │  │  - Jaeger    - Zipkin    - OpenTelemetry Collector      │    │
    │  └─────────────────────────────────────────────────────────┘    │
    │                              │                                  │
    │                              ▼                                  │
    │  ┌─────────────────────────────────────────────────────────┐    │
    │  │               Trace storage and analysis                │    │
    │  │  - Elasticsearch    - Cassandra    - Query Service      │    │
    │  └─────────────────────────────────────────────────────────┘    │
    │                              │                                  │
    │                              ▼                                  │
    │  ┌─────────────────────────────────────────────────────────┐    │
    │  │                    Visualization UI                     │    │
    │  │  - Jaeger UI    - Grafana    - Zipkin UI                │    │
    │  └─────────────────────────────────────────────────────────┘    │
    └─────────────────────────────────────────────────────────────────┘

1.2 Tracing Concepts
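| Concept | Description |
|---|---|
| Trace | The complete path of one request through the system |
| Span | A single unit of work within a trace |
| Span Context | The tracing information propagated across service boundaries |
| Trace ID | Uniquely identifies the whole trace |
| Span ID | Uniquely identifies a single span |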
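To make the relationship between these concepts concrete, here is a minimal Python sketch of how a span context can travel between services. It assumes the W3C Trace Context `traceparent` header layout (`version-traceid-parentid-flags`), which the original text does not name; the `SpanContext` class and the helper functions are hypothetical names chosen for illustration, not part of any tracing library.

```python
import re
import secrets
from dataclasses import dataclass

# Assumed header layout (W3C Trace Context): version-traceid-parentid-flags
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


@dataclass(frozen=True)
class SpanContext:
    """Hypothetical container for the data passed between services."""
    trace_id: str  # 32 hex characters, shared by every span in the trace
    span_id: str   # 16 hex characters, unique to one span
    sampled: bool  # whether this trace is being recorded


def new_root_context() -> SpanContext:
    """Start a new trace with a fresh trace ID and span ID."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), True)


def child_of(parent: SpanContext) -> SpanContext:
    """A child span keeps the parent's trace ID and gets its own span ID."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def to_traceparent(ctx: SpanContext) -> str:
    """Serialize the context into a header value for an outgoing request."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def from_traceparent(header: str) -> SpanContext:
    """Parse the header value received by the downstream service."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent header: {header!r}")
    _version, trace_id, span_id, flags = match.groups()
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))
```

In this sketch, Service A would call `to_traceparent(new_root_context())` and send the result as a request header; Service B would call `from_traceparent` on it and then `child_of` to create its own span under the same trace ID.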
2. Jaeger Configuration
2.1 Installing Jaeger
    kubectl create namespace observability

    # Install the Jaeger Operator: CRD, service account, RBAC, and the operator itself
    kubectl apply -n observability -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/crds/jaegertracing.io_jaegers_crd.yaml
    kubectl apply -n observability -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/service_account.yaml
    kubectl apply -n observability -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/role.yaml
    kubectl apply -n observability -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/role_binding.yaml
    kubectl apply -n observability -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/operator.yaml

    # Create a Jaeger instance
    kubectl apply -n observability -f - <<EOF
    apiVersion: jaegertracing.io/v1
    kind: Jaeger
    metadata:
      name: my-jaeger
    EOF

2.2 Jaeger Ingress Configuration
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: jaeger-ingress
      namespace: observability
    spec:
      rules:
        - host: jaeger.example.com
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: my-jaeger-query
                    port:
                      # The Jaeger query service serves the UI on 16686, not 80
                      number: 16686

3. OpenTelemetry Configuration
3.1 Installing the OpenTelemetry Collector
    apiVersion: opentelemetry.io/v1alpha1
    kind: OpenTelemetryCollector
    metadata:
      name: my-collector
      namespace: observability
    spec:
      config: |
        receivers:
          otlp:
            protocols:
              grpc:
              http:
          jaeger:
            protocols:
              grpc:
              thrift_http:
          zipkin:
        processors:
          batch:
        exporters:
          otlp:
            # Collector service of the my-jaeger instance created in section 2.1
            endpoint: my-jaeger-collector:4317
            tls:
              insecure: true
        service:
          pipelines:
            traces:
              receivers: [otlp, jaeger, zipkin]
              processors: [batch]
              exporters: [otlp]

3.2 Integrating Applications with OpenTelemetry
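    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      # A Deployment needs a selector that matches the Pod template labels
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
            - name: app
              image: my-app:latest
              env:
                - name: OTEL_SERVICE_NAME
                  value: my-app
                - name: OTEL_EXPORTER_OTLP_ENDPOINT
                  # The OpenTelemetry Operator names the Service <name>-collector
                  value: http://my-collector-collector.observability.svc:4318
                - name: OTEL_RESOURCE_ATTRIBUTES
                  value: service.name=my-app,deployment.environment=production

4. Instrumenting Application Code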
4.1 Python Tracing Code
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    # Batch spans and export them over OTLP/HTTP.
    # When the endpoint is set in code, it must include the /v1/traces path.
    provider = TracerProvider()
    processor = BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://collector:4318/v1/traces")
    )
    provider.add_span_processor(processor)

    # Register the provider globally, then obtain a tracer for this module.
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer(__name__)

    # Everything inside the with-block is recorded as one span.
    with tracer.start_as_current_span("my-operation") as span:
        span.set_attribute("operation.type", "database")
        span.set_attribute("database.name", "postgres")

4.2 Go Tracing Code
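    package main

    import (
        "context"
        "log"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    func main() {
        ctx := context.Background()

        // Export spans over OTLP/gRPC without TLS.
        exporter, err := otlptracegrpc.New(ctx,
            otlptracegrpc.WithEndpoint("collector:4317"),
            otlptracegrpc.WithInsecure(),
        )
        if err != nil {
            log.Fatalf("failed to create OTLP exporter: %v", err)
        }

        tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
        // Flush any buffered spans before the process exits.
        defer func() { _ = tp.Shutdown(ctx) }()
        otel.SetTracerProvider(tp)

        tracer := otel.Tracer("my-app")
        _, span := tracer.Start(ctx, "my-operation")
        defer span.End()

        span.SetAttributes(
            attribute.String("operation.type", "api"),
            attribute.String("api.path", "/users"),
        )
    }

5. Querying and Analyzing Traces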
5.1 Jaeger Query API
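The endpoints in this section return JSON, so they are easy to consume from a script. The following Python sketch builds the trace search URL and reduces a response to a short summary. It assumes the response wraps traces in a `data` array, that each trace carries `traceID` and `spans`, and that each span has a `duration` in microseconds; check these assumptions against your Jaeger version. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def traces_url(base_url: str, service: str, limit: int = 10) -> str:
    """Build the trace search URL for one service."""
    query = urlencode({"service": service, "limit": limit})
    return f"{base_url.rstrip('/')}/api/traces?{query}"


def summarize(payload: dict) -> list:
    """Reduce each trace to its ID, span count, and longest span in milliseconds.

    Assumed response shape: {"data": [{"traceID": ..., "spans": [{"duration": <microseconds>}]}]}
    """
    rows = []
    for item in payload.get("data") or []:
        spans = item.get("spans", [])
        longest_us = max((s.get("duration", 0) for s in spans), default=0)
        rows.append({
            "trace_id": item.get("traceID"),
            "spans": len(spans),
            "longest_ms": longest_us / 1000,
        })
    return rows


def fetch_summaries(base_url: str, service: str, limit: int = 10) -> list:
    """Call the query service and summarize the traces it returns."""
    with urlopen(traces_url(base_url, service, limit), timeout=10) as resp:
        return summarize(json.load(resp))
```

Keeping `summarize` separate from the HTTP call means the parsing logic can be exercised against a saved response without a running Jaeger instance.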
    # Search traces for a service
    curl -X GET "http://jaeger/api/traces?service=my-app&limit=10"

    # Fetch a specific trace
    curl -X GET "http://jaeger/api/traces/{trace-id}"

    # List known services
    curl -X GET "http://jaeger/api/services"

5.2 Trace Query Configuration
    apiVersion: jaegertracing.io/v1
    kind: Jaeger
    metadata:
      name: my-jaeger
    spec:
      query:
        options:
          log-level: debug
          query.max-traces: 1000

6. Tracing Best Practices
6.1 Sampling Strategy Configuration
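    apiVersion: opentelemetry.io/v1alpha1
    kind: OpenTelemetryCollector
    metadata:
      name: my-collector
    spec:
      # Fragment: only the processor is shown; receivers, exporters, and pipelines are as in section 3.1
      config: |
        processors:
          tail_sampling:
            # Wait long enough for the spans of a trace to arrive before deciding
            decision_wait: 10s
            # Number of traces held in memory while waiting for a decision
            num_traces: 50000
            expected_new_traces_per_sec: 100
            policies:
              # Keep every trace that contains an error
              - name: error-sampling
                type: status_code
                status_code:
                  status_codes:
                    - ERROR
              # Keep 50% of traces slower than one second
              - name: slow-traces
                type: and
                and:
                  and_sub_policy:
                    - name: latency-over-1s
                      type: latency
                      latency:
                        threshold_ms: 1000
                    - name: half-of-matches
                      type: probabilistic
                      probabilistic:
                        sampling_percentage: 50

6.2 Trace Attribute Conventions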
| Category | Attribute names | Description |
|---|---|---|
| Service | service.name | Service name |
| Operation | operation.name | Operation name |
| HTTP | http.method, http.path, http.status_code | HTTP request information |
| Database | db.system, db.name, db.operation | Database information |
| Error | error.type, error.message | Error information |
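A naming convention like the one in this table is easier to keep when it is checked automatically. The Python sketch below flags attribute keys that are not lowercase and dot-separated, or whose namespace is not one of the categories listed above. The `check_attributes` function and the `ALLOWED_PREFIXES` list are hypothetical; adapt the list to whatever convention your team settles on.

```python
import re

# Keys must be lowercase, dot-separated, with snake_case segments, e.g. http.status_code
_KEY_PATTERN = re.compile(r"[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+")

# Namespaces taken from the table above (an assumption about what a team would allow)
ALLOWED_PREFIXES = ("service.", "operation.", "http.", "db.", "error.")


def check_attributes(attributes: dict) -> list:
    """Return a list of human-readable problems; an empty list means all keys conform."""
    problems = []
    for key in attributes:
        if not _KEY_PATTERN.fullmatch(key):
            problems.append(f"{key}: not lowercase dot.separated")
        elif not key.startswith(ALLOWED_PREFIXES):
            problems.append(f"{key}: unknown namespace")
    return problems
```

A check like this could run in unit tests or in a code-review hook against the attribute dictionaries passed to `span.set_attribute` calls.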
6.3 Trace Visualization
    apiVersion: grafana.integreatly.org/v1beta1
    kind: GrafanaDashboard
    metadata:
      name: tracing-dashboard
    spec:
      json: |
        {
          "title": "Distributed Tracing Dashboard",
          "panels": [
            {
              "type": "graph",
              "title": "Trace Duration",
              "targets": [
                {
                  "expr": "avg(jaeger_trace_duration_seconds)",
                  "legendFormat": "Average Duration"
                }
              ]
            }
          ]
        }

7. Integrating Distributed Tracing with Monitoring
7.1 Prometheus Metrics Integration
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: jaeger-monitor
    spec:
      selector:
        matchLabels:
          app.kubernetes.io/name: jaeger
      endpoints:
        - port: metrics
          interval: 30s

7.2 Trace Alerting Configuration
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: tracing-alerts
    spec:
      groups:
        - name: tracing.rules
          rules:
            - alert: HighTraceDuration
              expr: avg(jaeger_trace_duration_seconds) > 5
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "High trace duration detected"

8. Summary
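Key points for distributed tracing in practice:
- Choose the right tool: Jaeger, Zipkin, or OpenTelemetry
- Code integration: add tracing code on critical paths
- Sampling strategy: set sampling rates according to your needs
- Attribute conventions: standardize span attribute names
- Visualization: configure Grafana dashboards
- Alerting: define tracing-related alert rules
Add tracing to critical business paths first, and review trace data regularly to guide performance optimization.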
