nvidia_gpu_exporter

1. Installation

VERSION=1.3.1
wget https://github.com/utkuozdemir/nvidia_gpu_exporter/releases/download/v${VERSION}/nvidia_gpu_exporter_${VERSION}_linux_x86_64.tar.gz
tar -xvzf nvidia_gpu_exporter_${VERSION}_linux_x86_64.tar.gz
mv nvidia_gpu_exporter /usr/bin
nvidia_gpu_exporter --help

2. Running

nvidia_gpu_exporter --web.listen-address=:9835 --web.telemetry-path=/metrics --nvidia-smi-command=nvidia-smi --log.level=info --query-field-names=AUTO --log.format=logfmt
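
To run the exporter in the background, a minimal systemd unit sketch is shown below; the unit file path, service user, and binary location are assumptions and should be adapted to the actual environment. The flags are the same as in the command above.

# /etc/systemd/system/nvidia_gpu_exporter.service (assumed path)
[Unit]
Description=NVIDIA GPU Prometheus exporter
After=network-online.target

[Service]
# Assumes the binary was moved to /usr/bin in the install step
ExecStart=/usr/bin/nvidia_gpu_exporter --web.listen-address=:9835
Restart=always

[Install]
WantedBy=multi-user.target

After starting, the endpoint can be verified with curl:

systemctl daemon-reload
systemctl enable --now nvidia_gpu_exporter
curl -s http://localhost:9835/metrics | head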

3. Flag reference

usage: nvidia_gpu_exporter [<flags>]

Flags:
  -h, --help                Show context-sensitive help (also try --help-long and --help-man).
      --web.config.file=""  [EXPERIMENTAL] Path to configuration file that can enable TLS or authentication.
      --web.listen-address=":9835"
                            Address to listen on for web interface and telemetry.
      --web.telemetry-path="/metrics"
                            Path under which to expose metrics.
      --nvidia-smi-command="nvidia-smi"
                            Path or command to be used for the nvidia-smi executable
      --query-field-names="AUTO"
                            Comma-separated list of the query fields. You can find out possible fields by running `nvidia-smi --help-query-gpus`. The value `AUTO` will
                            automatically detect the fields to query.
      --log.level=info      Only log messages with the given severity or above. One of: [debug, info, warn, error]
      --log.format=logfmt   Output format of log messages. One of: [logfmt, json]
      --version             Show application version.

4. Metrics reference
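
The exporter maps the fields reported by nvidia-smi to Prometheus metrics, so the exact metric names depend on the GPU, driver, and detected query fields. A quick way to list what a given installation actually exposes (assuming the default port used above):

curl -s http://localhost:9835/metrics | grep '^nvidia_' | cut -d'{' -f1 | sort -u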

Telegraf NVIDIA plugin (official documentation)

1. Overview:

Telegraf queries the local GPU devices (/dev/nvidia*) through the nvidia-smi binary.

2. Install Telegraf: refer to the official guide (a minimal example follows)
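
A minimal install sketch for a Debian/Ubuntu host, assuming the InfluxData package repository has already been added per the official guide; package names and service management differ on other distributions:

sudo apt-get update
sudo apt-get install telegraf
sudo systemctl enable --now telegraf
telegraf --version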

3. Configuration changes:

3.1 Locate the configuration file

It can be found from the startup command used when the collector was installed:
  • Binary install: ps -ef | grep telegraf or systemctl cat telegraf
  • Kubernetes deployment: inspect the Pod YAML directly

For reference:

/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d

Edit the /etc/telegraf/telegraf.conf file, or add a new configuration file such as telegraf-nvidia.conf under the /etc/telegraf/telegraf.d directory.

3.2 Add the nvidia_smi input to the configuration file:

[[inputs.nvidia_smi]]
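
After adding this block, the plugin can be verified without writing to any output by running Telegraf in test mode (paths follow the example startup command above):

telegraf --test --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d --input-filter nvidia_smi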

3.3 Other nvidia_smi options:

[[inputs.nvidia_smi]]
  ## Path to the nvidia-smi binary; change this if nvidia-smi is installed elsewhere
  # bin_path = "/usr/bin/nvidia-smi"
  ## Polling timeout
  # timeout = "5s"

3.4 Telegraf data handling:

Telegraf's output plugins can actively push data to a database: commonly used backends such as Prometheus, InfluxDB, and VictoriaMetrics are supported, and the http and sql outputs can also write directly to a relational database or an HTTP server.

When pushing data, the collected metrics must be formatted into the format expected by the target database.

The traditional pull model, where a port is exposed for Prometheus to scrape, is also supported.

Reference config for Telegraf actively pushing to Prometheus (remote write):
[[outputs.http]]
  url = "http://192.168.122.1:33090/api/v1/write"
  method = "POST"
  data_format = "prometheusremotewrite"
  ## Change to the actual username and password if authentication is configured
  # username = "username"
  # password = "xxxxxxx"
  [outputs.http.headers]
    Content-Type = "application/x-protobuf"
    Content-Encoding = "snappy"
    X-Prometheus-Remote-Write-Version = "0.1.0"

Reference config for exposing a port and waiting passively for Prometheus to scrape:
[[outputs.prometheus_client]]
  listen = ":9273"
  ## Use HTTP Basic Authentication.
  # basic_username = "Foo"
  # basic_password = "Bar"
  ## If set, the IP Ranges which are allowed to access metrics.
  ##   ex: ip_range = ["192.168.0.0/24", "192.168.1.0/30"]
  # ip_range = []
  ## Path to publish the metrics on.
  # path = "/metrics"
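
For the pull model, the Prometheus server also needs a scrape job pointing at the Telegraf listener. A minimal sketch is shown below; the job name and target address are placeholders to adapt:

# prometheus.yml on the Prometheus server (scrape side)
scrape_configs:
  - job_name: "telegraf-nvidia"            # hypothetical job name
    static_configs:
      - targets: ["<telegraf-host>:9273"]  # port from the listen setting above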

4. Troubleshooting:

  1. Confirm that the nvidia-smi path is correct and that the command returns output normally.

  2. Telegraf runs as the telegraf user by default; confirm whether the configuration file is restricted to root-only permissions (see the checks below).
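
A quick way to check both items is to run the relevant commands as the telegraf user; the paths below follow the defaults mentioned above:

sudo -u telegraf nvidia-smi
sudo -u telegraf telegraf --test --config /etc/telegraf/telegraf.conf --input-filter nvidia_smi
ls -l /etc/telegraf/telegraf.conf /etc/telegraf/telegraf.d/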

5. Metrics reference

Note:

The list below comes mainly from the official documentation; comparison shows that it differs from the metrics actually collected.

A brief check points to two main causes:

  1. Some metrics are string-typed, which Prometheus does not support storing;
  2. The documentation on the GitHub master branch has not been updated, and even after switching to the 1.35 release tag, the documented metrics are still inconsistent with what is collected.

  • measurement: nvidia_smi (prefix)
    • tags
      • name (type of GPU e.g. GeForce GTX 1070 Ti)
      • compute_mode (The compute mode of the GPU e.g. Default)
      • index (The port index where the GPU is connected to the motherboard e.g. 1)
      • pstate (Overclocking state for the GPU e.g. P0)
      • uuid (A unique identifier for the GPU e.g. GPU-f9ba66fc-a7f5-94c5-da19-019ef2f9c665)
    • fields (note: add the nvidia_smi prefix when querying, e.g. curl http://127.0.0.1:9090/api/v1/query?query=nvidia_smi_fan_speed)
      • fan_speed (integer, percentage)
      • fbc_stats_session_count (integer)
      • fbc_stats_average_fps (integer)
      • fbc_stats_average_latency (integer)
      • memory_free (integer, MiB)
      • memory_used (integer, MiB)
      • memory_total (integer, MiB)
      • memory_reserved (integer, MiB)
      • retired_pages_multiple_single_bit (integer)
      • retired_pages_double_bit (integer)
      • retired_pages_blacklist (string)
      • retired_pages_pending (string)
      • remapped_rows_correctable (int)
      • remapped_rows_uncorrectable (int)
      • remapped_rows_pending (string)
      • remapped_rows_failure (string)
      • power_draw (float, W)
      • temperature_gpu (integer, degrees C)
      • utilization_gpu (integer, percentage)
      • utilization_memory (integer, percentage)
      • utilization_encoder (integer, percentage)
      • utilization_decoder (integer, percentage)
      • pcie_link_gen_current (integer)
      • pcie_link_width_current (integer)
      • encoder_stats_session_count (integer)
      • encoder_stats_average_fps (integer)
      • encoder_stats_average_latency (integer)
      • clocks_current_graphics (integer, MHz)
      • clocks_current_sm (integer, MHz)
      • clocks_current_memory (integer, MHz)
      • clocks_current_video (integer, MHz)
      • driver_version (string)
      • cuda_version (string)

Metrics actually collected in testing (a sample query follows the list)

- nvidia_smi_clocks_current_graphics
- nvidia_smi_clocks_current_memory
- nvidia_smi_clocks_current_sm
- nvidia_smi_clocks_current_video
- nvidia_smi_encoder_stats_average_fps
- nvidia_smi_encoder_stats_average_latency
- nvidia_smi_encoder_stats_session_count
- nvidia_smi_fan_speed
- nvidia_smi_fbc_stats_average_fps
- nvidia_smi_fbc_stats_average_latency
- nvidia_smi_fbc_stats_session_count
- nvidia_smi_memory_free
- nvidia_smi_memory_reserved
- nvidia_smi_memory_total
- nvidia_smi_memory_used
- nvidia_smi_pcie_link_gen_current
- nvidia_smi_pcie_link_width_current
- nvidia_smi_power_draw
- nvidia_smi_remapped_rows_correctable
- nvidia_smi_remapped_rows_uncorrectable
- nvidia_smi_temperature_gpu
- nvidia_smi_utilization_decoder
- nvidia_smi_utilization_encoder
- nvidia_smi_utilization_gpu
- nvidia_smi_utilization_jpeg
- nvidia_smi_utilization_memory
- nvidia_smi_utilization_ofa
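
As a usage example, GPU memory usage can be computed as a percentage from the collected metrics; the Prometheus address follows the earlier curl example and is an assumption for your environment:

curl -sG 'http://127.0.0.1:9090/api/v1/query' \
  --data-urlencode 'query=nvidia_smi_memory_used / nvidia_smi_memory_total * 100'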
