telegraf and nvidia_gpu_exporter: collecting NVIDIA GPU monitoring data
telegraf queries the GPU devices on the local machine (/dev/nvidia*) through the nvidia-smi binary.
nvidia_gpu_exporter
1. Installation
VERSION=1.3.1
wget https://github.com/utkuozdemir/nvidia_gpu_exporter/releases/download/v${VERSION}/nvidia_gpu_exporter_${VERSION}_linux_x86_64.tar.gz
tar -xvzf nvidia_gpu_exporter_${VERSION}_linux_x86_64.tar.gz
mv nvidia_gpu_exporter /usr/bin
nvidia_gpu_exporter --help
2. Running
nvidia_gpu_exporter --web.listen-address=:9835 --web.telemetry-path=/metrics --nvidia-smi-command=nvidia-smi --log.level=info --query-field-names=AUTO --log.format=logfmt
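Once the exporter is running, a quick sanity check is to fetch the telemetry path and count the GPU samples it returns. The helper below is a minimal sketch: `count_prefixed_samples` is a hypothetical name, and the `nvidia_smi_` prefix in the usage comment is an assumption about the exporter's metric names, so adjust it if your build reports something else.

```shell
# count_prefixed_samples PREFIX
# Reads Prometheus text-format metrics on stdin and prints how many lines
# start with PREFIX. Comment lines ("# HELP", "# TYPE") begin with "#",
# so they never match a metric-name prefix.
count_prefixed_samples() {
    awk -v prefix="$1" 'index($0, prefix) == 1 { n++ } END { print n + 0 }'
}

# Usage (assumes the listen address and path from the run command above):
#   curl -s http://127.0.0.1:9835/metrics | count_prefixed_samples nvidia_smi_
```

A result of 0 suggests the exporter is up but nvidia-smi returned nothing usable; raising `--log.level` to `debug` is a reasonable next step.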
3. Flag reference
usage: nvidia_gpu_exporter [<flags>]

Flags:
  -h, --help
      Show context-sensitive help (also try --help-long and --help-man).
  --web.config.file=""
      [EXPERIMENTAL] Path to configuration file that can enable TLS or authentication.
  --web.listen-address=":9835"
      Address to listen on for web interface and telemetry.
  --web.telemetry-path="/metrics"
      Path under which to expose metrics.
  --nvidia-smi-command="nvidia-smi"
      Path or command to be used for the nvidia-smi executable
  --query-field-names="AUTO"
      Comma-separated list of the query fields. You can find out possible fields by running `nvidia-smi --help-query-gpus`. The value `AUTO` will automatically detect the fields to query.
  --log.level=info
      Only log messages with the given severity or above. One of: [debug, info, warn, error]
  --log.format=logfmt
      Output format of log messages. One of: [logfmt, json]
  --version
      Show application version.
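4. Metrics reference
telegraf NVIDIA plugin (official documentation)
1. Process overview:
telegraf queries the GPU devices on the local machine (/dev/nvidia*) through the nvidia-smi binary.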
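Because the plugin depends on the device nodes being present, it can help to confirm they exist before touching the telegraf configuration. The following is a minimal sketch, assuming the per-GPU nodes follow the usual `nvidia0`, `nvidia1`, ... naming; `count_gpu_nodes` is a hypothetical helper, not part of telegraf or the driver.

```shell
# count_gpu_nodes [DIR]
# Prints how many per-GPU device nodes (nvidia0, nvidia1, ...) exist in DIR,
# which defaults to /dev. Control nodes such as nvidiactl or nvidia-uvm are
# not counted because they do not start with "nvidia" followed by a digit.
count_gpu_nodes() {
    dir="${1:-/dev}"
    n=0
    for node in "$dir"/nvidia[0-9]*; do
        # When the glob matches nothing it stays literal, so test existence.
        if [ -e "$node" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Usage:
#   count_gpu_nodes          # counts under /dev
#   nvidia-smi -L            # lists the GPUs the driver itself reports
```

If the count is 0 while a GPU is installed, the driver is probably not loaded, and telegraf will have nothing to report either.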
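2. Install telegraf: see the official guide
3. Configuration changes:
3.1 Locate the configuration file
The location can be found from the command that was used to start the collector:
- Binary install: ps -ef | grep telegraf, or systemctl cat telegraf
- Kubernetes deployment: inspect the Pod's YAML directly
For example: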
/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
Edit /etc/telegraf/telegraf.conf, or add a new configuration file telegraf-nvidia.conf under the /etc/telegraf/telegraf.d directory.
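Reading the path out of a long process command line by eye is error-prone, so the lookup can be scripted. This is a minimal sketch: `config_path_from_cmdline` is a hypothetical helper that only handles the space-separated `-config PATH` / `--config PATH` form shown in the example command, not `--config=PATH`.

```shell
# config_path_from_cmdline "CMDLINE"
# Prints the value that follows the first -config / --config flag in a
# telegraf command line. Returns 1 and prints nothing if the flag is absent.
config_path_from_cmdline() {
    # Split the command line on whitespace into positional parameters.
    # (Paths containing spaces or glob characters are not handled.)
    set -- $1
    while [ $# -gt 0 ]; do
        case "$1" in
            -config|--config)
                echo "${2:-}"
                return 0
                ;;
        esac
        shift
    done
    return 1
}

# Usage (Linux procps syntax, an assumption about the host):
#   config_path_from_cmdline "$(ps -o args= -C telegraf)"
```

Note that `-config-directory` is deliberately not matched, so the helper returns the main file rather than the drop-in directory.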
3.2 Add the nvidia_smi section to the configuration file:
[[inputs.nvidia_smi]]
3.3 Other options supported by nvidia_smi:
[[inputs.nvidia_smi]]
  ## Path to nvidia-smi; change this if the binary is installed elsewhere
  # bin_path = "/usr/bin/nvidia-smi"
  ## Polling timeout
  # timeout = "5s"
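The drop-in approach from step 3.1 can be scripted so the main telegraf.conf stays untouched. The sketch below writes the section shown above into a directory of your choice; `write_nvidia_dropin` is a hypothetical helper, and the file name telegraf-nvidia.conf simply follows the suggestion in step 3.1.

```shell
# write_nvidia_dropin DIR
# Creates DIR/telegraf-nvidia.conf containing a minimal nvidia_smi input.
# The optional settings stay commented out so telegraf uses its defaults.
write_nvidia_dropin() {
    cat > "$1/telegraf-nvidia.conf" <<'EOF'
[[inputs.nvidia_smi]]
  ## Uncomment and adjust if nvidia-smi is not at the default location.
  # bin_path = "/usr/bin/nvidia-smi"
  # timeout = "5s"
EOF
}

# Usage:
#   write_nvidia_dropin /etc/telegraf/telegraf.d
```

After writing the file, running `telegraf --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d --input-filter nvidia_smi --test` should print the gathered GPU metrics once without sending them to any output, which is a convenient way to validate the change before restarting the service.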
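3.4 How telegraf ships the data:
telegraf's outputs plugins can actively push data to a database. Common backends such as Prometheus, InfluxDB, and VictoriaMetrics are supported, and the http and SQL outputs can upload directly to a relational database or an HTTP server.
When pushing, the collected data must be serialized into the format the target database expects.
Reference configuration for telegraf pushing to Prometheus:
[[outputs.http]]
  url = "http://192.168.122.1:33090/api/v1/write"
  method = "POST"
  data_format = "prometheusremotewrite"
  ## Replace with the actual account and password for your setup
  # username = "username"
  # password = "xxxxxxx"
  [outputs.http.headers]
    Content-Type = "application/x-protobuf"
    Content-Encoding = "snappy"
    X-Prometheus-Remote-Write-Version = "0.1.0"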
Reference configuration for telegraf exposing a port and waiting for Prometheus to scrape it:
[[outputs.prometheus_client]]
  listen = ":9273"
  ## Use HTTP Basic Authentication.
  # basic_username = "Foo"
  # basic_password = "Bar"
  ## If set, the IP Ranges which are allowed to access metrics.
  ## ex: ip_range = ["192.168.0.0/24", "192.168.1.0/30"]
  # ip_range = []
  ## Path to publish the metrics on.
  # path = "/metrics"
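With the prometheus_client output enabled, the endpoint can be inspected directly before Prometheus is pointed at it. The following is a minimal sketch for reducing the text exposition output to a list of distinct metric names; `list_metric_names` is a hypothetical helper, and the port and path in the usage comment are taken from the configuration above.

```shell
# list_metric_names
# Reads Prometheus text-format metrics on stdin and prints the unique
# metric names, one per line, sorted. Comment and blank lines are skipped;
# everything from the first "{" or space onward (labels, value) is dropped.
list_metric_names() {
    awk '!/^#/ && NF { sub(/[{ ].*/, ""); print }' | sort -u
}

# Usage:
#   curl -s http://127.0.0.1:9273/metrics | list_metric_names | grep '^nvidia_smi_'
```

Saving this output to a file gives a concrete record of what a given host really exports.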
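4. Troubleshooting common problems:
- Confirm that the nvidia-smi path is correct and that the command returns normally
- telegraf runs as the telegraf user by default; check whether the configuration file is restricted to root, which would keep that user from reading it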
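Both checks above can be combined into one small preflight function. This is a sketch under stated assumptions: `preflight` is a hypothetical helper, and it only tests the user who runs it, so it is meaningful only when executed as the telegraf user.

```shell
# preflight COMMAND CONFIG_FILE
# Prints one line per problem found and returns 1 if there is any;
# prints nothing and returns 0 when both checks pass.
preflight() {
    rc=0
    # COMMAND may be a bare name looked up on PATH or an absolute path.
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "missing command: $1"
        rc=1
    fi
    # -r tests readability for the current user only.
    if [ ! -r "$2" ]; then
        echo "unreadable config: $2"
        rc=1
    fi
    return "$rc"
}

# Usage, from a shell running as the telegraf user:
#   preflight /usr/bin/nvidia-smi /etc/telegraf/telegraf.conf
```

For a one-off manual check without the helper, `sudo -u telegraf test -r /etc/telegraf/telegraf.conf && echo readable` covers the permission case.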
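5. Metrics reference
Note:
The list below comes mainly from the official documentation; compared against the metrics actually collected, it shows differences.
A brief check points to two causes:
- some fields are of type string, which Prometheus cannot store;
- the documentation on the GitHub master branch has not been updated, and even after switching to the 1.35 version tag the documented metrics still do not match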
- measurement: nvidia_smi (prefix)
- tags
  - name (type of GPU e.g. GeForce GTX 1070 Ti)
  - compute_mode (The compute mode of the GPU e.g. Default)
  - index (The port index where the GPU is connected to the motherboard e.g. 1)
  - pstate (Overclocking state for the GPU e.g. P0)
  - uuid (A unique identifier for the GPU e.g. GPU-f9ba66fc-a7f5-94c5-da19-019ef2f9c665)
- fields (remember to add the nvidia_smi prefix when querying: curl http://127.0.0.1:9090/api/v1/query?query=nvidia_smi_fan_speed)
  - fan_speed (integer, percentage)
  - fbc_stats_session_count (integer)
  - fbc_stats_average_fps (integer)
  - fbc_stats_average_latency (integer)
  - memory_free (integer, MiB)
  - memory_used (integer, MiB)
  - memory_total (integer, MiB)
  - memory_reserved (integer, MiB)
  - retired_pages_multiple_single_bit (integer)
  - retired_pages_double_bit (integer)
  - retired_pages_blacklist (string)
  - retired_pages_pending (string)
  - remapped_rows_correctable (int)
  - remapped_rows_uncorrectable (int)
  - remapped_rows_pending (string)
  - remapped_rows_failure (string)
  - power_draw (float, W)
  - temperature_gpu (integer, degrees C)
  - utilization_gpu (integer, percentage)
  - utilization_memory (integer, percentage)
  - utilization_encoder (integer, percentage)
  - utilization_decoder (integer, percentage)
  - pcie_link_gen_current (integer)
  - pcie_link_width_current (integer)
  - encoder_stats_session_count (integer)
  - encoder_stats_average_fps (integer)
  - encoder_stats_average_latency (integer)
  - clocks_current_graphics (integer, MHz)
  - clocks_current_sm (integer, MHz)
  - clocks_current_memory (integer, MHz)
  - clocks_current_video (integer, MHz)
  - driver_version (string)
  - cuda_version (string)
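Since the field names in the documentation omit the measurement prefix, it is easy to query a name that does not exist. A minimal sketch that builds the instant-query URL with the prefix applied is shown here; `prom_query_url` is a hypothetical helper, and the Prometheus address in the usage comment is the one from the curl example in the field list.

```shell
# prom_query_url BASE_URL FIELD
# Prints the Prometheus instant-query URL for a telegraf nvidia_smi field,
# adding the "nvidia_smi_" measurement prefix that the bare field name lacks.
# A trailing slash on BASE_URL is tolerated.
prom_query_url() {
    printf '%s/api/v1/query?query=nvidia_smi_%s\n' "${1%/}" "$2"
}

# Usage:
#   curl -s "$(prom_query_url http://127.0.0.1:9090 temperature_gpu)"
```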
Metrics actually collected in testing
- nvidia_smi_clocks_current_graphics
- nvidia_smi_clocks_current_memory
- nvidia_smi_clocks_current_sm
- nvidia_smi_clocks_current_video
- nvidia_smi_encoder_stats_average_fps
- nvidia_smi_encoder_stats_average_latency
- nvidia_smi_encoder_stats_session_count
- nvidia_smi_fan_speed
- nvidia_smi_fbc_stats_average_fps
- nvidia_smi_fbc_stats_average_latency
- nvidia_smi_fbc_stats_session_count
- nvidia_smi_memory_free
- nvidia_smi_memory_reserved
- nvidia_smi_memory_total
- nvidia_smi_memory_used
- nvidia_smi_pcie_link_gen_current
- nvidia_smi_pcie_link_width_current
- nvidia_smi_power_draw
- nvidia_smi_remapped_rows_correctable
- nvidia_smi_remapped_rows_uncorrectable
- nvidia_smi_temperature_gpu
- nvidia_smi_utilization_decoder
- nvidia_smi_utilization_encoder
- nvidia_smi_utilization_gpu
- nvidia_smi_utilization_jpeg
- nvidia_smi_utilization_memory
- nvidia_smi_utilization_ofa
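The gap between the documented fields and the collected metrics can be checked mechanically instead of by eye. The sketch below assumes both lists have been saved as plain files with one full metric name per line; `missing_metrics` is a hypothetical helper.

```shell
# missing_metrics DOCUMENTED_FILE COLLECTED_FILE
# Each file holds one metric name per line. Prints the names that appear in
# DOCUMENTED_FILE but not in COLLECTED_FILE, in documented order.
missing_metrics() {
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file.
    # grep exits 1 when nothing is printed, which is not an error here.
    grep -Fxv -f "$2" "$1" || true
}

# Usage:
#   missing_metrics documented.txt collected.txt   # documented but absent
#   missing_metrics collected.txt documented.txt   # collected but undocumented
```

Run against the two lists in this section, the first direction would be expected to surface the string-typed fields such as nvidia_smi_driver_version and nvidia_smi_cuda_version, and the second direction nvidia_smi_utilization_jpeg and nvidia_smi_utilization_ofa.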
