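Prometheus Metrics Monitoring
Prometheus is an open-source monitoring solution that is easy to install, easy to use, and easy to extend.
What metrics do we need?
Most developers are familiar with DDD and TDD, but MDD may be less well known. MDD stands for Metrics-Driven Development: the development process is driven by metrics, and practical metrics drive fast, precise, fine-grained software iteration. MDD aims to quantify and optimize everything that can be measured. This brings visibility to the whole development process, helps the people involved make decisions quickly and accurately, and lets them detect and fix problems as soon as errors occur. Following MDD, key metrics should be considered as early as the requirements phase, and once the application is live the metrics are used to understand its current state and keep optimizing it.
Several metrics-based methodologies are worth knowing:
- Google's four golden signals: Latency, Traffic, Errors, Saturation
- The USE method (Brendan Gregg, Netflix): Utilization, Saturation, Errors
- Weave Cloud's RED method: Rate, Errors, Duration
Adding Prometheus to Spring Boot
Add the Prometheus-related dependencies to the Spring Boot project: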
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-core</artifactId>
</dependency>
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
Expose the Prometheus endpoint in application.yaml:
management:
  endpoints:
    web:
      exposure:
        include: info,health,prometheus
The equivalent configuration in application.properties:
spring.application.name=actuator-prometheus
server.port=10001
# Base path of the management endpoints; /actuator is the default
management.endpoints.web.base-path=/actuator
# Port of the management endpoints
management.server.port=10002
# Expose the prometheus endpoint
management.endpoints.web.exposure.include=prometheus
# Enable the Prometheus export; true is the default
management.metrics.export.prometheus.enabled=true
# Add a global tag to every metric: a tag named application whose value is spring.application.name
management.metrics.tags.application=${spring.application.name}
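Adding metrics
API for adding a metric (metric.name is a placeholder; the metric name should not be left empty):
Metrics.counter("metric.name", tags).increment();
Metrics in Spring Boot 2.0 support multiple tags per metric.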
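To make the effect of multiple tags concrete, here is a minimal sketch in plain Java. It is a toy registry, not Micrometer's implementation, and the class name, metric name, and tag values are all made up for illustration. The point it shows is that every distinct combination of metric name and tag values becomes its own series in the scraped output, regardless of the order in which the tags are passed.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.DoubleAdder;

/** Toy registry for illustration only; this is not Micrometer's implementation. */
final class TaggedCounterSketch {
    // One running total per unique series key (metric name + sorted tags).
    private final Map<String, DoubleAdder> series = new ConcurrentHashMap<>();

    /** Builds a Prometheus-style key such as name{k1="v1",k2="v2",}. */
    static String seriesKey(String name, String... keyValues) {
        if (keyValues.length % 2 != 0) {
            throw new IllegalArgumentException("tags must be key/value pairs");
        }
        // Sort tags by key so the same tag set always yields the same key.
        Map<String, String> sorted = new TreeMap<>();
        for (int i = 0; i < keyValues.length; i += 2) {
            sorted.put(keyValues[i], keyValues[i + 1]);
        }
        StringBuilder sb = new StringBuilder(name);
        if (!sorted.isEmpty()) {
            sb.append('{');
            for (Map.Entry<String, String> e : sorted.entrySet()) {
                sb.append(e.getKey()).append("=\"").append(e.getValue()).append("\",");
            }
            sb.append('}');
        }
        return sb.toString();
    }

    /** Adds 1 to the series identified by the name and tags. */
    void increment(String name, String... keyValues) {
        series.computeIfAbsent(seriesKey(name, keyValues), k -> new DoubleAdder()).add(1.0);
    }

    /** Returns the current total of a series, or 0 if it was never incremented. */
    double count(String name, String... keyValues) {
        DoubleAdder adder = series.get(seriesKey(name, keyValues));
        return adder == null ? 0.0 : adder.sum();
    }

    public static void main(String[] args) {
        TaggedCounterSketch registry = new TaggedCounterSketch();
        registry.increment("orders_created_total", "channel", "web", "region", "eu");
        registry.increment("orders_created_total", "region", "eu", "channel", "web");
        registry.increment("orders_created_total", "channel", "app", "region", "eu");
        // Two series are printed: channel="web" with 2.0 and channel="app" with 1.0.
        registry.series.forEach((key, total) -> System.out.println(key + " " + total.sum()));
    }
}
```

Because each tag combination is a separate series, tag values with unbounded cardinality (user IDs, raw URLs) are usually best avoided.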
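Once the application is running, the metrics can be viewed in a browser at http://localhost:8080/actuator/prometheus (if management.server.port is set as in the properties above, use that port, 10002, instead of 8080):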
# HELP jvm_threads_live_threads The current number of live threads including both daemon and non-daemon threads
# TYPE jvm_threads_live_threads gauge
jvm_threads_live_threads 21.0
# HELP process_uptime_seconds The uptime of the Java virtual machine
# TYPE process_uptime_seconds gauge
process_uptime_seconds 1947.48
# HELP process_start_time_seconds Start time of the process since unix epoch.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.673243010006E9
# HELP executor_queue_remaining_tasks The number of additional elements that this queue can ideally accept without blocking
# TYPE executor_queue_remaining_tasks gauge
executor_queue_remaining_tasks{name="applicationTaskExecutor",} 2.147483647E9
# HELP tomcat_sessions_rejected_sessions_total
# TYPE tomcat_sessions_rejected_sessions_total counter
tomcat_sessions_rejected_sessions_total 0.0
......
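Each non-comment line in this output has the shape `name{label="value",...} value`, while lines starting with `#` carry the HELP and TYPE metadata. The following is a rough sketch of reading one such line in plain Java. The class name is hypothetical, and the parser makes simplifying assumptions: no trailing timestamp, no escaped quotes inside label values, and only numeric formats that `Double.parseDouble` accepts (so `+Inf` is not handled).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Simplified parser for one line of the Prometheus text format (illustrative only). */
final class ExpositionLineSketch {
    // Matches label pairs such as name="applicationTaskExecutor".
    private static final Pattern LABEL =
            Pattern.compile("([A-Za-z_][A-Za-z0-9_]*)=\"([^\"]*)\"");

    final String name;
    final Map<String, String> labels;
    final double value;

    private ExpositionLineSketch(String name, Map<String, String> labels, double value) {
        this.name = name;
        this.labels = labels;
        this.value = value;
    }

    /** Returns null for blank lines and for # HELP / # TYPE comment lines. */
    static ExpositionLineSketch parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return null;
        }
        // The sample value is the text after the last space (no timestamp assumed).
        int lastSpace = trimmed.lastIndexOf(' ');
        double value = Double.parseDouble(trimmed.substring(lastSpace + 1));
        String series = trimmed.substring(0, lastSpace).trim();

        int brace = series.indexOf('{');
        String name = brace < 0 ? series : series.substring(0, brace);
        Map<String, String> labels = new LinkedHashMap<>();
        if (brace >= 0) {
            Matcher m = LABEL.matcher(series.substring(brace));
            while (m.find()) {
                labels.put(m.group(1), m.group(2));
            }
        }
        return new ExpositionLineSketch(name, labels, value);
    }

    public static void main(String[] args) {
        ExpositionLineSketch sample = parse(
                "executor_queue_remaining_tasks{name=\"applicationTaskExecutor\",} 2.147483647E9");
        System.out.println(sample.name + " " + sample.labels + " " + sample.value);
    }
}
```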
Metrics in detail
Process
- process_uptime_seconds: how long the process has been running
# HELP process_uptime_seconds The uptime of the Java virtual machine
# TYPE process_uptime_seconds gauge
process_uptime_seconds 35.901
- process_start_time_seconds: the moment the process started, in seconds since the Unix epoch
# HELP process_start_time_seconds Start time of the process since unix epoch.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.673313858767E9
- process_files_max_files: maximum number of files the process may open
- process_files_open_files: number of files currently open
- process_cpu_usage: CPU usage of the JVM process
# HELP process_cpu_usage The "recent cpu usage" for the Java Virtual Machine process
# TYPE process_cpu_usage gauge
process_cpu_usage 0.009232165838729317
System
- system_cpu_count: number of CPUs available to the JVM
# HELP system_cpu_count The number of processors available to the Java virtual machine
# TYPE system_cpu_count gauge
system_cpu_count 16.0
- system_cpu_usage: CPU usage of the whole system
# HELP system_cpu_usage The "recent cpu usage" of the system the application is running in
# TYPE system_cpu_usage gauge
system_cpu_usage 0.09081119051122843
- system_load_average_1m: system load average over the last minute
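Several of the process and system gauges above correspond to values the JDK already exposes through `java.lang.management`. The sketch below reads comparable values directly; it is not Micrometer's code, and the class and method names are made up for illustration. CPU usage is left out because it relies on a non-standard `com.sun.management` extension.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;

/** Reads values comparable to the process_* and system_* gauges (illustrative only). */
final class ProcessSystemSketch {
    /** Comparable to process_uptime_seconds: JVM uptime converted from ms to seconds. */
    static double uptimeSeconds() {
        return ManagementFactory.getRuntimeMXBean().getUptime() / 1000.0;
    }

    /** Comparable to process_start_time_seconds: start time since the Unix epoch. */
    static double startTimeSeconds() {
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        return runtime.getStartTime() / 1000.0;
    }

    /** Comparable to system_cpu_count. */
    static int cpuCount() {
        return Runtime.getRuntime().availableProcessors();
    }

    /** Comparable to system_load_average_1m; negative when the platform cannot report it. */
    static double loadAverage1m() {
        return ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
    }

    public static void main(String[] args) {
        System.out.println("uptime seconds     = " + uptimeSeconds());
        System.out.println("start time seconds = " + startTimeSeconds());
        System.out.println("cpu count          = " + cpuCount());
        System.out.println("load average (1m)  = " + loadAverage1m());
    }
}
```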
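HTTP requests
- http_server_requests_seconds: time spent handling HTTP requests, exposed as a summary with _count and _sum series (it is a duration, not a requests-per-second value)
- http_server_requests_seconds_max: maximum time spent handling a single HTTP request
```text
# HELP http_server_requests_seconds Duration of HTTP server request handling
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/actuator/prometheus",} 14.0
http_server_requests_seconds_sum{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/actuator/prometheus",} 1.6270625
# HELP http_server_requests_seconds_max Duration of HTTP server request handling
# TYPE http_server_requests_seconds_max gauge
http_server_requests_seconds_max{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/actuator/prometheus",} 0.0
```
The data above shows 14 requests in total (http_server_requests_seconds_count) that took 1.6270625 seconds in total (http_server_requests_seconds_sum), which is about 0.116 seconds per request on average. The maximum request time is reported by http_server_requests_seconds_max; Micrometer tracks this maximum over a recent time window, so it can read 0.0 when no request fell inside that window.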
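**QPS statistics**
```text
sum(rate(http_server_requests_seconds_count{application="prometheus-example"}[10s]))
sum(rate(http_server_requests_seconds_count{instance="$instance", application="$application", uri!~".*actuator.*"}[5m]))
```
- rate: measures the growth trend; the reported metric must be a Counter (it only ever increases). rate needs at least two samples inside the range, so a [10s] range only returns data when the scrape interval is well below 10 seconds.
- irate: similar to rate. The difference is that rate gives the average growth rate over the whole range and therefore cannot show bursts (momentary peaks) inside the window, while irate computes the growth rate from the last two samples in the range vector. With a large range irate ignores most of the window, so its result can deviate considerably from the overall trend.
- sum: adds the series together; suited to aggregate statistics.
**Latency statistics**
Besides QPS, the other metric that is watched most often is RT (response time). The average RT of the endpoints above is obtained by combining two metrics:
```text
sum(rate(http_server_requests_seconds_sum{application="prometheus-example"}[10s])) / sum(rate(http_server_requests_seconds_count{application="prometheus-example"}[10s]))
```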
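The difference between rate and irate, and the way average latency falls out of dividing two rates, can be sketched with a few lines of arithmetic. This is a simplified model in plain Java with made-up sample data and hypothetical names. Real Prometheus additionally handles counter resets and extrapolates to the edges of the range, which this sketch does not attempt.

```java
/** Simplified model of rate(), irate() and average latency (illustrative only). */
final class RateSketch {
    private static void requireWindow(double[] timestampsSec, double[] values) {
        if (timestampsSec.length != values.length || values.length < 2) {
            throw new IllegalArgumentException("need at least two samples with matching timestamps");
        }
    }

    /** Average per-second increase between the first and last sample in the window. */
    static double rate(double[] timestampsSec, double[] values) {
        requireWindow(timestampsSec, values);
        int n = values.length;
        return (values[n - 1] - values[0]) / (timestampsSec[n - 1] - timestampsSec[0]);
    }

    /** Per-second increase between the last two samples only. */
    static double irate(double[] timestampsSec, double[] values) {
        requireWindow(timestampsSec, values);
        int n = values.length;
        return (values[n - 1] - values[n - 2]) / (timestampsSec[n - 1] - timestampsSec[n - 2]);
    }

    /** Average seconds per request in the window: rate(sum) / rate(count). */
    static double averageLatency(double[] timestampsSec, double[] sumValues, double[] countValues) {
        return rate(timestampsSec, sumValues) / rate(timestampsSec, countValues);
    }

    public static void main(String[] args) {
        // Hypothetical scrapes every 15 s; traffic bursts in the last interval.
        double[] t = {0, 15, 30, 45, 60};
        double[] count = {0, 15, 30, 45, 360};    // like ..._seconds_count
        double[] sum = {0, 1.5, 3.0, 4.5, 36.0};  // like ..._seconds_sum

        System.out.println("rate  = " + rate(t, count));   // 6.0 requests/s over the window
        System.out.println("irate = " + irate(t, count));  // 21.0 requests/s in the last interval
        System.out.println("avg latency = " + averageLatency(t, sum, count) + " s");
    }
}
```

In this data the burst is smoothed to 6 requests per second by rate, while irate reports the 21 requests per second of the final interval, which is why irate suits spike detection and rate suits trends and alerting.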
JVM monitoring
Buffers
- jvm_buffer_count_buffers: number of buffers in the pool
- jvm_buffer_memory_used_bytes: memory used by the buffer pool
- jvm_buffer_total_capacity_bytes: total capacity of the buffers in the pool
```text
# HELP jvm_buffer_total_capacity_bytes An estimate of the total capacity of the buffers in this pool
# TYPE jvm_buffer_total_capacity_bytes gauge
jvm_buffer_total_capacity_bytes{id="direct",} 82632.0
jvm_buffer_total_capacity_bytes{id="mapped",} 0.0
# HELP jvm_buffer_memory_used_bytes An estimate of the memory that the Java virtual machine is using for this buffer pool
# TYPE jvm_buffer_memory_used_bytes gauge
jvm_buffer_memory_used_bytes{id="direct",} 82632.0
jvm_buffer_memory_used_bytes{id="mapped",} 0.0
# HELP jvm_buffer_count_buffers An estimate of the number of buffers in the pool
# TYPE jvm_buffer_count_buffers gauge
jvm_buffer_count_buffers{id="direct",} 11.0
jvm_buffer_count_buffers{id="mapped",} 0.0
```
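The `direct` and `mapped` ids in this output match the names of the JDK's buffer pool beans. As a small sketch (hypothetical class name, not Micrometer's code), the same three numbers can be read from `BufferPoolMXBean`:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

/** Lists the JVM buffer pools and their count, used memory and capacity (illustrative only). */
final class BufferPoolSketch {
    /** Returns the platform buffer pools, typically including "direct" and "mapped". */
    static List<BufferPoolMXBean> pools() {
        return ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
    }

    /** Formats one pool in a form comparable to the jvm_buffer_* series. */
    static String describe(BufferPoolMXBean pool) {
        return String.format("id=%s count=%d usedBytes=%d capacityBytes=%d",
                pool.getName(), pool.getCount(), pool.getMemoryUsed(), pool.getTotalCapacity());
    }

    public static void main(String[] args) {
        for (BufferPoolMXBean pool : pools()) {
            System.out.println(describe(pool));
        }
    }
}
```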
Class information
- jvm_classes_loaded_classes: number of classes currently loaded
- jvm_classes_unloaded_classes_total: total number of classes unloaded since the JVM started
```text
# HELP jvm_classes_loaded_classes The number of classes that are currently loaded in the Java virtual machine
# TYPE jvm_classes_loaded_classes gauge
jvm_classes_loaded_classes 8976.0
# HELP jvm_classes_unloaded_classes_total The total number of classes unloaded since the Java virtual machine has started execution
# TYPE jvm_classes_unloaded_classes_total counter
jvm_classes_unloaded_classes_total 0.0
```
GC information
- jvm_gc_live_data_size_bytes: size of the long-lived heap memory pool after reclamation
- jvm_gc_max_data_size_bytes: maximum size of the long-lived heap memory pool
- jvm_gc_memory_allocated_bytes_total: memory allocated in the young generation between collections (cumulative)
- jvm_gc_memory_promoted_bytes_total: memory promoted to the old generation (cumulative)
- jvm_gc_pause_seconds: time spent in GC pauses
- jvm_gc_pause_seconds_max: maximum time spent in a GC pause
```text
# HELP jvm_gc_memory_allocated_bytes_total Incremented for an increase in the size of the (young) heap memory pool after one GC to before the next
# TYPE jvm_gc_memory_allocated_bytes_total counter
jvm_gc_memory_allocated_bytes_total 2.41696768E8
# HELP jvm_gc_memory_promoted_bytes_total Count of positive increases in the size of the old generation memory pool before GC to after GC
# TYPE jvm_gc_memory_promoted_bytes_total counter
jvm_gc_memory_promoted_bytes_total 1.229668E7
# HELP jvm_gc_live_data_size_bytes Size of long-lived heap memory pool after reclamation
# TYPE jvm_gc_live_data_size_bytes gauge
jvm_gc_live_data_size_bytes 1.7648168E7
# HELP jvm_gc_max_data_size_bytes Max size of long-lived heap memory pool
# TYPE jvm_gc_max_data_size_bytes gauge
jvm_gc_max_data_size_bytes 2.751463424E9
# HELP jvm_gc_overhead_percent An approximation of the percent of CPU time used by GC activities over the last lookback period or since monitoring began, whichever is shorter, in the range [0..1]
# TYPE jvm_gc_overhead_percent gauge
jvm_gc_overhead_percent 0.0
# HELP jvm_gc_pause_seconds Time spent in GC pause
# TYPE jvm_gc_pause_seconds summary
jvm_gc_pause_seconds_count{action="end of major GC",cause="Metadata GC Threshold",} 1.0
jvm_gc_pause_seconds_sum{action="end of major GC",cause="Metadata GC Threshold",} 0.057
jvm_gc_pause_seconds_count{action="end of minor GC",cause="Metadata GC Threshold",} 1.0
jvm_gc_pause_seconds_sum{action="end of minor GC",cause="Metadata GC Threshold",} 0.009
jvm_gc_pause_seconds_count{action="end of minor GC",cause="Allocation Failure",} 1.0
jvm_gc_pause_seconds_sum{action="end of minor GC",cause="Allocation Failure",} 0.011
# HELP jvm_gc_pause_seconds_max Time spent in GC pause
# TYPE jvm_gc_pause_seconds_max gauge
jvm_gc_pause_seconds_max{action="end of major GC",cause="Metadata GC Threshold",} 0.0
jvm_gc_pause_seconds_max{action="end of minor GC",cause="Metadata GC Threshold",} 0.0
jvm_gc_pause_seconds_max{action="end of minor GC",cause="Allocation Failure",} 0.0
```
Memory information
- Committed memory: jvm_memory_committed_bytes
- Maximum memory: jvm_memory_max_bytes
- Used memory: jvm_memory_used_bytes
```text
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="PS Survivor Space",} 1.3478848E7
jvm_memory_used_bytes{area="heap",id="PS Old Gen",} 1.765636E7
jvm_memory_used_bytes{area="heap",id="PS Eden Space",} 4.5597968E7
jvm_memory_used_bytes{area="nonheap",id="Metaspace",} 4.7203528E7
jvm_memory_used_bytes{area="nonheap",id="Code Cache",} 1.7431232E7
jvm_memory_used_bytes{area="nonheap",id="Compressed Class Space",} 6161240.0
# HELP jvm_memory_committed_bytes The amount of memory in bytes that is committed for the Java virtual machine to use
# TYPE jvm_memory_committed_bytes gauge
jvm_memory_committed_bytes{area="heap",id="PS Survivor Space",} 1.4680064E7
jvm_memory_committed_bytes{area="heap",id="PS Old Gen",} 1.31596288E8
jvm_memory_committed_bytes{area="heap",id="PS Eden Space",} 1.95035136E8
jvm_memory_committed_bytes{area="nonheap",id="Metaspace",} 5.0552832E7
jvm_memory_committed_bytes{area="nonheap",id="Code Cache",} 1.8284544E7
jvm_memory_committed_bytes{area="nonheap",id="Compressed Class Space",} 6774784.0
# HELP jvm_memory_max_bytes The maximum amount of memory in bytes that can be used for memory management
# TYPE jvm_memory_max_bytes gauge
jvm_memory_max_bytes{area="heap",id="PS Survivor Space",} 1.4680064E7
jvm_memory_max_bytes{area="heap",id="PS Old Gen",} 2.751463424E9
jvm_memory_max_bytes{area="heap",id="PS Eden Space",} 1.343225856E9
jvm_memory_max_bytes{area="nonheap",id="Metaspace",} -1.0
jvm_memory_max_bytes{area="nonheap",id="Code Cache",} 2.5165824E8
jvm_memory_max_bytes{area="nonheap",id="Compressed Class Space",} 1.073741824E9
# HELP jvm_memory_usage_after_gc_percent The percentage of long-lived heap pool used after the last GC event, in the range [0..1]
# TYPE jvm_memory_usage_after_gc_percent gauge
jvm_memory_usage_after_gc_percent{area="heap",pool="long-lived",} 0.006417079669673268
```
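A common use of these series is the ratio of used to maximum memory per pool. Note that jvm_memory_max_bytes is -1.0 for Metaspace, which means the maximum is undefined, so a ratio cannot be computed for that pool. The sketch below (hypothetical class and method names, not Micrometer's code) shows that guard and lists the same pools through the JDK's `MemoryPoolMXBean`. With the PS Old Gen values from the sample, 1.765636E7 used out of 2.751463424E9, the ratio is about 0.0064, which in this dump happens to equal the reported jvm_memory_usage_after_gc_percent.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

/** Per-pool memory usage with a guard for an undefined maximum (illustrative only). */
final class MemoryPoolSketch {
    /** Returns used / max, or NaN when the maximum is undefined (reported as -1). */
    static double usageRatio(long usedBytes, long maxBytes) {
        return maxBytes <= 0 ? Double.NaN : (double) usedBytes / maxBytes;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            String area = pool.getType() == MemoryType.HEAP ? "heap" : "nonheap";
            System.out.printf("area=%s id=%s used=%d committed=%d max=%d ratio=%.4f%n",
                    area, pool.getName(), usage.getUsed(), usage.getCommitted(),
                    usage.getMax(), usageRatio(usage.getUsed(), usage.getMax()));
        }
    }
}
```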
Thread information
- Daemon threads: jvm_threads_daemon_threads
- Live threads: jvm_threads_live_threads
- Peak thread count: jvm_threads_peak_threads
- Threads by state: jvm_threads_states_threads
```text
# HELP jvm_threads_states_threads The current number of threads
# TYPE jvm_threads_states_threads gauge
jvm_threads_states_threads{state="runnable",} 6.0
jvm_threads_states_threads{state="blocked",} 0.0
jvm_threads_states_threads{state="waiting",} 12.0
jvm_threads_states_threads{state="timed-waiting",} 3.0
jvm_threads_states_threads{state="new",} 0.0
jvm_threads_states_threads{state="terminated",} 0.0
# HELP jvm_threads_live_threads The current number of live threads including both daemon and non-daemon threads
# TYPE jvm_threads_live_threads gauge
jvm_threads_live_threads 21.0
# HELP jvm_threads_peak_threads The peak live thread count since the Java virtual machine started or peak was reset
# TYPE jvm_threads_peak_threads gauge
jvm_threads_peak_threads 21.0
# HELP jvm_threads_daemon_threads The current number of live daemon threads
# TYPE jvm_threads_daemon_threads gauge
jvm_threads_daemon_threads 17.0
```
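The thread gauges can be reproduced approximately with the JDK's `ThreadMXBean`, which helps when checking what a particular number means. This is a sketch with hypothetical names, not Micrometer's code; the per-state breakdown is a snapshot, so the counts can shift between two calls.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

/** Counts JVM threads overall and per state (illustrative only). */
final class ThreadStateSketch {
    /** Returns a count for every Thread.State, including states with zero threads. */
    static Map<Thread.State, Integer> countByState() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread.State state : Thread.State.values()) {
            counts.put(state, 0);
        }
        for (ThreadInfo info : bean.getThreadInfo(bean.getAllThreadIds())) {
            if (info != null) { // null means the thread ended in the meantime
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        System.out.println("live   = " + bean.getThreadCount());
        System.out.println("daemon = " + bean.getDaemonThreadCount());
        System.out.println("peak   = " + bean.getPeakThreadCount());
        System.out.println("states = " + countByState());
    }
}
```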
### Logs
- Number of log events written: logback_events_total
```text
# HELP logback_events_total Number of events that made it to the logs
# TYPE logback_events_total counter
logback_events_total{level="warn",} 0.0
logback_events_total{level="debug",} 0.0
logback_events_total{level="error",} 0.0
logback_events_total{level="trace",} 0.0
logback_events_total{level="info",} 6.0
```
RabbitMQ
- Messages published: rabbitmq_published_total
- Messages consumed: rabbitmq_consumed_total
- Messages rejected: rabbitmq_rejected_total
- Messages acknowledged: rabbitmq_acknowledged_total
- Number of channels: rabbitmq_channels
- Number of connections: rabbitmq_connections
Spring Integration
- Number of channels: spring_integration_channels
- Number of handlers: spring_integration_handlers
- Time spent sending messages: spring_integration_send_seconds
- Maximum time spent sending a message: spring_integration_send_seconds_max
Tomcat information
Global information
- Total number of errors: tomcat_global_error_total
- Total bytes received: tomcat_global_received_bytes_total
- Total bytes sent: tomcat_global_sent_bytes_total
- Maximum request processing time: tomcat_global_request_max_seconds
- Request processing time: tomcat_global_request_seconds
Session information
- Currently active sessions: tomcat_sessions_active_current_sessions
- Maximum number of active sessions: tomcat_sessions_active_max_sessions
- Longest time a session stayed alive: tomcat_sessions_alive_max_seconds
- Total sessions created: tomcat_sessions_created_sessions_total
- Total sessions expired: tomcat_sessions_expired_sessions_total
- Total sessions rejected: tomcat_sessions_rejected_sessions_total
Thread information
- Busy threads: tomcat_threads_busy_threads
- Configured maximum number of threads: tomcat_threads_config_max_threads
- Current number of threads: tomcat_threads_current_threads