ADOT Integration Guide — AWS Distro for OpenTelemetry
This document is the integration guide for ADOT (AWS Distro for OpenTelemetry).
It covers metrics and traces collection through an ECS sidecar, and the integration with AMP and X-Ray.
1) ADOT Architecture
1.1) ECS Task Layout
┌─────────────────────────────────────────────────────────────────────┐
│                              ECS Task                               │
│                                                                     │
│  ┌──────────────────┐             ┌──────────────────────────────┐  │
│  │  Spring Boot     │             │  ADOT Collector              │  │
│  │  + ADOT Agent    │             │  (Sidecar)                   │  │
│  │                  │             │                              │  │
│  │  :8080/actuator  │◀──scrape────│  prometheus receiver         │  │
│  │  /prometheus     │    (30s)    │                              │  │
│  │                  │             │                              │  │
│  │  OTLP push ──────┼────────────▶│  otlp receiver               │  │
│  │  (:4317 gRPC)    │             │  (:4317, :4318)              │  │
│  └──────────────────┘             │                              │  │
│                                   │  awsecscontainermetrics      │  │
│                                   │  (ECS container metrics)     │  │
│                                   └──────────────┬───────────────┘  │
└──────────────────────────────────────────────────┼──────────────────┘
                                                   │
                       ┌───────────────────────────┼───────────────────┐
                       ▼                           ▼                   ▼
                   AWS X-Ray                Amazon Managed        CloudWatch
                   (Traces)                   Prometheus            (Logs)
                                               (Metrics)
1.2) The Three Receivers
| Receiver | Mode | Collects | Exports to |
|---|---|---|---|
| prometheus | Pull (scrape every 30s) | /actuator/prometheus | AMP |
| otlp | Push (gRPC :4317) | ADOT Agent auto-instrumentation | X-Ray, AMP |
| awsecscontainermetrics | Internal (30s) | ECS Task CPU/Memory | AMP |
2) Spring Boot Configuration
2.1) application.yml
management:
  endpoints:
    web:
      exposure:
        # Expose the endpoints that the ADOT Collector scrapes
        include: health,info,metrics,prometheus
      base-path: /actuator
  endpoint:
    prometheus:
      enabled: true
  metrics:
    export:
      prometheus:
        # Scraped by the ADOT Collector.
        # Spring Boot 2.x key; Boot 3.x uses management.prometheus.metrics.export.enabled
        enabled: true
    # ⚠️ Required: tags that identify the service
    # Used to filter by application and environment when querying AMP
    tags:
      application: ${spring.application.name}
      environment: ${spring.profiles.active:local}
    # Histogram distribution settings (SLO-based)
    distribution:
      percentiles-histogram:
        http.server.requests: true
      percentiles:
        http.server.requests: 0.5,0.95,0.99
      slo:
        http.server.requests: 100ms,500ms,1s,5s
2.2) Dependencies (build.gradle)
dependencies {
    // Micrometer Prometheus registry (for the ADOT scrape)
    implementation 'io.micrometer:micrometer-registry-prometheus'
    // OpenTelemetry API (optional: only for manual instrumentation)
    // Usually unnecessary because the ADOT Agent instruments automatically
    // implementation 'io.opentelemetry:opentelemetry-api'
}
3) Prometheus Receiver (Metrics Pull)
3.1) ADOT Config Example
# adot-config.yaml (infrastructure project)
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'application-metrics'
          scrape_interval: 30s
          static_configs:
            - targets: ['localhost:${APP_PORT}']
          metrics_path: /actuator/prometheus
          # Filter: keep only the metrics we need.
          # Series with any other prefix (for example downstream_* from section 6) are dropped.
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: '(http_.*|jvm_.*|hikaricp_.*|application_.*|business_.*)'
              action: keep
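To check which series the `keep` rule above lets through, the following Java sketch applies the same expression to a few metric names. It assumes Prometheus relabel semantics, where the regex has to match the entire label value, which is what `Matcher.matches()` does as well. The class name `KeepFilter` and the sample metric names are illustrative, not part of the ADOT configuration.

```java
import java.util.regex.Pattern;

final class KeepFilter {
    // Same expression as the metric_relabel_configs regex in section 3.1.
    private static final Pattern KEEP =
        Pattern.compile("(http_.*|jvm_.*|hikaricp_.*|application_.*|business_.*)");

    // Relabel regexes are anchored at both ends, so use matches() rather than find().
    static boolean keeps(String metricName) {
        return KEEP.matcher(metricName).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http_server_requests_seconds_count",
            "business_orders_total",
            "downstream_redis_latency_seconds_count", // hypothetical series name
            "process_cpu_usage"
        };
        for (String name : samples) {
            System.out.println(name + " -> " + (keeps(name) ? "kept" : "dropped"));
        }
    }
}
```

Under these assumptions a `downstream_*` series is dropped, so any new prefix has to be added to the expression before its metrics can reach AMP.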
3.2) Collected Metrics
| Metric | Description | Notes |
|---|---|---|
| http_server_requests_seconds_count | HTTP request count | Request counter |
| http_server_requests_seconds_sum | Total HTTP request time | For latency calculation |
| http_server_requests_seconds_bucket | HTTP request histogram | For percentile calculation |
| jvm_memory_used_bytes | JVM memory usage | Heap / non-heap |
| jvm_gc_pause_seconds | GC pause time | GC monitoring |
| hikaricp_connections_active | Active DB connections | Pool monitoring |
| application_* | Custom application metrics | Developer-defined |
| business_* | Custom business metrics | Developer-defined |
3.3) PromQL Query Examples
# Request rate over the last 5 minutes
rate(http_server_requests_seconds_count[5m])
# 5xx error rate
rate(http_server_requests_seconds_count{status=~"5.."}[5m])
# P99 latency
histogram_quantile(0.99,
  sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri)
)
# Error ratio (%)
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
  / sum(rate(http_server_requests_seconds_count[5m])) * 100
# Filter by service
http_server_requests_seconds_count{application="spring-standards-api", environment="prod"}
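The P99 query depends on how `histogram_quantile` estimates a value from cumulative buckets, so the bucket boundaries from section 2.1 (100ms, 500ms, 1s, 5s) bound its precision. The Java sketch below reproduces the basic idea: find the bucket that contains the target rank, then interpolate linearly inside it. It is a simplified illustration that assumes at least one finite bucket and `q` in (0, 1]; the class name `HistogramQuantile` and the sample counts are made up for the example.

```java
final class HistogramQuantile {
    // upperBounds: ascending "le" bounds in seconds; the last one is +Inf.
    // cumulativeCounts: number of observations <= the matching bound.
    static double estimate(double q, double[] upperBounds, double[] cumulativeCounts) {
        int last = upperBounds.length - 1;
        double total = cumulativeCounts[last];
        if (total == 0) {
            return Double.NaN; // no observations, nothing to estimate
        }
        double rank = q * total;
        int i = 0;
        while (i < last && cumulativeCounts[i] < rank) {
            i++;
        }
        if (i == last) {
            // The rank falls into the +Inf bucket: report the highest finite bound.
            return upperBounds[last - 1];
        }
        double lowerBound = (i == 0) ? 0.0 : upperBounds[i - 1];
        double countBelow = (i == 0) ? 0.0 : cumulativeCounts[i - 1];
        double inBucket = cumulativeCounts[i] - countBelow;
        // Linear interpolation between the bucket's lower and upper bound.
        return lowerBound + (upperBounds[i] - lowerBound) * (rank - countBelow) / inBucket;
    }

    public static void main(String[] args) {
        // Bounds follow the slo setting in section 2.1; the counts are invented.
        double[] bounds = {0.1, 0.5, 1.0, 5.0, Double.POSITIVE_INFINITY};
        double[] counts = {900, 980, 990, 1000, 1000};
        System.out.println("p99 estimate (s): " + estimate(0.99, bounds, counts));
        System.out.println("p50 estimate (s): " + estimate(0.50, bounds, counts));
    }
}
```

Because the result is interpolated, an estimate can never be finer than the bucket it lands in, which is why the SLO boundaries should sit at the latencies you intend to alert on.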
4) OTLP Receiver (Traces/Metrics Push)
4.1) ADOT Agent Configuration
ECS Task Definition:
{
  "containerDefinitions": [
    {
      "name": "app",
      "image": "${ECR_REPOSITORY}:${IMAGE_TAG}",
      "portMappings": [
        { "containerPort": 8080, "protocol": "tcp" }
      ],
      "environment": [
        {
          "name": "JAVA_TOOL_OPTIONS",
          "value": "-javaagent:/opt/aws-opentelemetry-agent.jar"
        },
        {
          "name": "OTEL_EXPORTER_OTLP_ENDPOINT",
          "value": "http://localhost:4317"
        },
        {
          "name": "OTEL_SERVICE_NAME",
          "value": "${SERVICE_NAME}"
        },
        {
          "name": "OTEL_RESOURCE_ATTRIBUTES",
          "value": "service.namespace=${NAMESPACE},deployment.environment=${ENVIRONMENT}"
        },
        {
          "name": "OTEL_TRACES_SAMPLER",
          "value": "parentbased_traceidratio"
        },
        {
          "name": "OTEL_TRACES_SAMPLER_ARG",
          "value": "0.1"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/${SERVICE_NAME}",
          "awslogs-region": "${AWS_REGION}",
          "awslogs-stream-prefix": "ecs"
        }
      }
    },
    {
      "name": "adot-collector",
      "image": "public.ecr.aws/aws-observability/aws-otel-collector:latest",
      "essential": true,
      "command": ["--config=/etc/ecs/ecs-default-config.yaml"],
      "portMappings": [
        { "containerPort": 4317, "protocol": "tcp" },
        { "containerPort": 4318, "protocol": "tcp" }
      ]
    }
  ]
}
4.2) Auto-Instrumented Attributes (Traces)
| Span Attribute | Description | Example |
|---|---|---|
| http.method | HTTP method | GET, POST |
| http.status_code | Response status code | 200, 404, 500 |
| http.route | Request route | /api/v1/orders/{id} |
| http.url | Full URL | https://api.example.com/… |
| db.system | Database system | mysql, postgresql |
| db.name | Database name | spring_standards |
| db.statement | SQL query | SELECT * FROM … |
| rpc.service | gRPC service | OrderService |
| rpc.method | gRPC method | CreateOrder |
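4.3) Sampling Strategies
| Strategy | Description | Recommended for |
|---|---|---|
| always_on | Collect every trace | Development, debugging |
| always_off | Disable tracing | Load testing |
| traceidratio | Ratio-based sampling | Production |
| parentbased_traceidratio | Follow the parent's decision; apply the ratio to root spans | Production (recommended) |
# 10% sampling (recommended for production)
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1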
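The difference between `traceidratio` and `parentbased_traceidratio` is easier to see in code. The Java sketch below is a simplified model, not the OpenTelemetry SDK implementation: a root request is sampled when the low 64 bits of its trace ID fall below a bound derived from the ratio, and a request that arrives with a parent span simply inherits the parent's decision. The class `RatioSampler` and its method names are hypothetical.

```java
final class RatioSampler {
    private final long upperBound;

    RatioSampler(double ratio) {
        // Trace IDs whose low bits (sign bit cleared) are below this bound get sampled.
        this.upperBound = (long) (ratio * Long.MAX_VALUE);
    }

    // traceId: 32 hex characters, as in the W3C traceparent header.
    boolean sampleRoot(String traceId) {
        long low = Long.parseUnsignedLong(traceId.substring(16), 16);
        return (low & Long.MAX_VALUE) < upperBound;
    }

    // parentSampled == null means there is no parent span (a root request).
    boolean shouldSample(String traceId, Boolean parentSampled) {
        if (parentSampled != null) {
            return parentSampled;    // parent-based: follow the caller's decision
        }
        return sampleRoot(traceId);  // ratio-based: deterministic per trace ID
    }

    public static void main(String[] args) {
        RatioSampler sampler = new RatioSampler(0.1);
        String lowId = "00000000000000000000000000000001";
        String highId = "0000000000000000ffffffffffffffff";
        System.out.println("root, low id:  " + sampler.shouldSample(lowId, null));
        System.out.println("root, high id: " + sampler.shouldSample(highId, null));
        System.out.println("child of sampled parent: " + sampler.shouldSample(highId, true));
    }
}
```

Since the decision depends only on the trace ID, every service that sees the same root trace reaches the same answer, and the parent-based wrapper keeps a trace from being cut off halfway through a call chain.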
5) ECS Container Metrics
5.1) ADOT Config
# adot-config.yaml
receivers:
  awsecscontainermetrics:
    collection_interval: 30s
5.2) Collected Metrics
| Metric | Description | Use |
|---|---|---|
| ecs.task.cpu.utilized | Task CPU usage | Auto Scaling |
| ecs.task.cpu.reserved | Task CPU reservation | Capacity planning |
| ecs.task.memory.utilized | Task memory usage | OOM detection |
| ecs.task.memory.reserved | Task memory reservation | Capacity planning |
| ecs.container.network.rx_bytes | Network bytes received | Traffic monitoring |
| ecs.container.network.tx_bytes | Network bytes sent | Traffic monitoring |
| ecs.container.storage.read_bytes | Storage bytes read | I/O monitoring |
| ecs.container.storage.write_bytes | Storage bytes written | I/O monitoring |
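6) Implementing Custom Metrics
6.1) Business Metrics
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import org.springframework.stereotype.Component;

@Component
public class BusinessMetrics {
    private final MeterRegistry meterRegistry;
    // Mutable holders read by the cart gauges, one per userId
    private final Map<String, AtomicInteger> cartItems = new ConcurrentHashMap<>();

    public BusinessMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    // Counter: number of orders
    public void incrementOrderCount(String status) {
        Counter.builder("business.orders.total")
            .description("Total number of orders")
            .tag("status", status) // success, failed, cancelled
            .register(meterRegistry)
            .increment();
    }

    // Timer: payment processing time
    public void recordPaymentDuration(Duration duration, String method) {
        Timer.builder("business.payment.duration")
            .description("Payment processing duration")
            .tag("method", method) // card, bank_transfer, virtual_account
            .register(meterRegistry)
            .record(duration);
    }

    // Gauge: number of items in the cart.
    // A gauge is sampled on every scrape, so it must read a mutable holder;
    // registering it with a captured constant would freeze the first value.
    // Caution: the userId tag creates one time series per user.
    public void recordCartItems(String userId, int count) {
        cartItems.computeIfAbsent(userId, id -> {
            AtomicInteger holder = new AtomicInteger();
            Gauge.builder("business.cart.items", holder, AtomicInteger::get)
                .description("Number of items in cart")
                .tag("userId", id)
                .register(meterRegistry);
            return holder;
        }).set(count);
    }
}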
6.2) Downstream Metrics
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.time.Duration;
import org.springframework.stereotype.Component;

@Component
public class DownstreamMetrics {
    private final MeterRegistry meterRegistry;

    public DownstreamMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    // Timer: Redis response time (start the sample before the call)
    public Timer.Sample startRedisTimer() {
        return Timer.start(meterRegistry);
    }

    // Stop the sample after the call and record it under the given operation
    public void stopRedisTimer(Timer.Sample sample, String operation) {
        sample.stop(Timer.builder("downstream.redis.latency")
            .description("Redis operation latency")
            .tag("operation", operation) // GET, SET, DEL
            .register(meterRegistry));
    }

    // Timer: external API response time
    public void recordExternalApiLatency(Duration duration, String service, String endpoint) {
        Timer.builder("downstream.external_api.latency")
            .description("External API call latency")
            .tag("service", service) // payment-gateway, notification-service
            .tag("endpoint", endpoint)
            .register(meterRegistry)
            .record(duration);
    }
}
6.3) Recommended Metric Naming
| Category | Prefix | Examples |
|---|---|---|
| Business | business.* | business.orders.total, business.payment.duration |
| Downstream | downstream.* | downstream.redis.latency, downstream.db.latency |
| Scheduler | scheduler.* | scheduler.job.runs.total, scheduler.job.duration |
| Cache | cache.* | cache.hit.ratio, cache.size |
| Queue | queue.* | queue.messages.count, queue.processing.duration |
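The dotted names in this table are what the Java code registers, but the series that reach AMP use Prometheus-style names. The Java sketch below approximates that mapping: dots become underscores, counters end in `_total`, and timers get a `_seconds` base unit. It is a rough model of Micrometer's Prometheus naming convention written for illustration; the class `PrometheusNames` is hypothetical, and the names actually exposed at /actuator/prometheus remain the source of truth.

```java
final class PrometheusNames {
    enum MeterType { COUNTER, TIMER, GAUGE }

    // Approximate the exposed Prometheus name for a dotted meter name.
    static String toPrometheus(String meterName, MeterType type) {
        String name = meterName.replace('.', '_');
        if (type == MeterType.COUNTER && !name.endsWith("_total")) {
            name += "_total";      // counters carry a _total suffix
        } else if (type == MeterType.TIMER && !name.endsWith("_seconds")) {
            name += "_seconds";    // timers are exported in seconds
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(toPrometheus("business.orders.total", MeterType.COUNTER));
        System.out.println(toPrometheus("business.payment.duration", MeterType.TIMER));
        System.out.println(toPrometheus("downstream.redis.latency", MeterType.TIMER));
        System.out.println(toPrometheus("cache.size", MeterType.GAUGE));
    }
}
```

Under this mapping a timer such as `downstream.redis.latency` surfaces as `downstream_redis_latency_seconds_*`, so the `keep` regex in section 3.1 needs a matching `downstream_.*` alternative before those series are collected.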
7) AMP (Amazon Managed Service for Prometheus)
7.1) AMP Workspace and IAM Role (Terraform)
# AMP Workspace
resource "aws_prometheus_workspace" "main" {
  alias = "${var.project_name}-${var.environment}"
  tags = {
    Environment = var.environment
    Project     = var.project_name
  }
}

# IAM role that lets the ADOT Collector write to AMP
resource "aws_iam_role" "adot_collector" {
  name = "${var.project_name}-adot-collector"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ecs-tasks.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role_policy" "adot_amp_write" {
  name = "amp-remote-write"
  role = aws_iam_role.adot_collector.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "aps:RemoteWrite"
        ]
        Resource = aws_prometheus_workspace.main.arn
      }
    ]
  })
}
7.2) ADOT Exporter Configuration
# adot-config.yaml
exporters:
  prometheusremotewrite:
    endpoint: ${AMP_ENDPOINT}/api/v1/remote_write
    auth:
      # sigv4auth must also be declared under extensions: and enabled in service.extensions
      authenticator: sigv4auth
  awsxray:
    region: ${AWS_REGION}
8) X-Ray Configuration
8.1) IAM Permissions
# X-Ray permissions for the ADOT Collector
resource "aws_iam_role_policy" "adot_xray" {
  name = "xray-write"
  role = aws_iam_role.adot_collector.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "xray:PutTraceSegments",
          "xray:PutTelemetryRecords",
          "xray:GetSamplingRules",
          "xray:GetSamplingTargets"
        ]
        Resource = "*"
      }
    ]
  })
}
8.2) X-Ray Sampling Rules
{
  "version": 2,
  "rules": [
    {
      "description": "Health check - no sampling",
      "host": "*",
      "http_method": "GET",
      "url_path": "/actuator/health",
      "fixed_target": 0,
      "rate": 0
    },
    {
      "description": "Default sampling",
      "host": "*",
      "http_method": "*",
      "url_path": "*",
      "fixed_target": 1,
      "rate": 0.1
    }
  ],
  "default": {
    "fixed_target": 1,
    "rate": 0.05
  }
}
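How these rules interact is clearer when the selection step is written out. The Java sketch below assumes first-match semantics (rules are tried in file order and the `default` block is used only when nothing matches) and treats `*` as the only wildcard, which is narrower than real X-Ray pattern matching. It models rule selection only, not the `fixed_target` reservoir. The class `SamplingRules` and its members are hypothetical.

```java
import java.util.List;

final class SamplingRules {
    // One entry of the "rules" array; only the fields this sketch needs.
    static final class Rule {
        final String httpMethod;
        final String urlPath;
        final int fixedTarget;
        final double rate;

        Rule(String httpMethod, String urlPath, int fixedTarget, double rate) {
            this.httpMethod = httpMethod;
            this.urlPath = urlPath;
            this.fixedTarget = fixedTarget;
            this.rate = rate;
        }
    }

    // Values copied from the JSON in section 8.2, in file order.
    static final List<Rule> RULES = List.of(
        new Rule("GET", "/actuator/health", 0, 0.0),
        new Rule("*", "*", 1, 0.1));
    static final Rule DEFAULT = new Rule("*", "*", 1, 0.05);

    // Simplification: "*" matches anything, any other pattern must match exactly.
    static boolean matches(String pattern, String value) {
        return pattern.equals("*") || pattern.equals(value);
    }

    // The first rule whose method and path both match wins.
    static Rule select(String method, String path) {
        for (Rule rule : RULES) {
            if (matches(rule.httpMethod, method) && matches(rule.urlPath, path)) {
                return rule;
            }
        }
        return DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println("health check rate: " + select("GET", "/actuator/health").rate);
        System.out.println("order API rate:    " + select("POST", "/api/v1/orders").rate);
    }
}
```

If first-match semantics hold, the catch-all "Default sampling" rule matches every request that is not a health check, so the 5% `default` block would only take effect if that rule were removed or narrowed.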
9) Alerting Rules (AMP)
9.1) Application Alerts
# alerting-rules.yml (infrastructure project)
groups:
  - name: application-alerts
    rules:
      # 5xx error ratio > 5%
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
          / sum(rate(http_server_requests_seconds_count[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High 5xx error rate (> 5%)"
          description: "Service has high error rate"
      # P99 latency > 5s
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_server_requests_seconds_bucket[5m])) by (le, application)
          ) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency > 5s"
      # JVM heap usage > 90%
      - alert: HighHeapUsage
        expr: |
          jvm_memory_used_bytes{area="heap"}
          / jvm_memory_max_bytes{area="heap"} > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "JVM heap usage > 90%"
      # DB connection pool close to exhaustion
      - alert: DBConnectionPoolExhausted
        expr: |
          hikaricp_connections_active
          / hikaricp_connections_max > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "DB connection pool > 90% utilized"
9.2) ECS Infrastructure Alerts
# alerting-rules.yml (continued, under groups:)
  - name: infrastructure-alerts
    rules:
      # ECS Task CPU > 80%
      - alert: HighCPUUsage
        expr: |
          ecs_task_cpu_utilized / ecs_task_cpu_reserved > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ECS Task CPU > 80%"
      # ECS Task Memory > 80%
      - alert: HighMemoryUsage
        expr: |
          ecs_task_memory_utilized / ecs_task_memory_reserved > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ECS Task Memory > 80%"
10) Data Flow Summary
| Source | Receiver | Destination | Purpose |
|---|---|---|---|
| /actuator/prometheus | prometheus | AMP | Application metrics |
| ADOT Agent (JVM) | otlp | X-Ray | Distributed tracing |
| ADOT Agent (JVM) | otlp | AMP | Auto-instrumented metrics |
| ECS Task Metadata | awsecscontainermetrics | AMP | Container resource metrics |
| stdout (JSON) | awslogs | CloudWatch Logs | Logs |
11) Checklist
Spring Boot configuration
ECS Task Definition
IAM permissions
12) Related Documents
Author: Development Team
Last updated: 2025-12-05
Version: 1.0.0