nvidia_gpu_exporter

1. Installation

VERSION=1.3.1
wget https://github.com/utkuozdemir/nvidia_gpu_exporter/releases/download/v${VERSION}/nvidia_gpu_exporter_${VERSION}_linux_x86_64.tar.gz
tar -xvzf nvidia_gpu_exporter_${VERSION}_linux_x86_64.tar.gz
mv nvidia_gpu_exporter /usr/bin
nvidia_gpu_exporter --help

2. Running

nvidia_gpu_exporter --web.listen-address=:9835 --web.telemetry-path=/metrics --nvidia-smi-command=nvidia-smi --log.level=info --query-field-names=AUTO --log.format=logfmt
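
To run the exporter in the background, a minimal systemd unit sketch is shown below; the unit file path, service user, and binary location are assumptions and should be adapted to the actual environment. The flags are the same as in the command above.

# /etc/systemd/system/nvidia_gpu_exporter.service (assumed path)
[Unit]
Description=NVIDIA GPU Prometheus exporter
After=network-online.target

[Service]
# Assumes the binary was moved to /usr/bin in the install step
ExecStart=/usr/bin/nvidia_gpu_exporter --web.listen-address=:9835
Restart=always

[Install]
WantedBy=multi-user.target

After starting, the endpoint can be verified with curl:

systemctl daemon-reload
systemctl enable --now nvidia_gpu_exporter
curl -s http://localhost:9835/metrics | head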

3. Flag reference

usage: nvidia_gpu_exporter [<flags>]

Flags:
  -h, --help                Show context-sensitive help (also try --help-long and --help-man).
      --web.config.file=""  [EXPERIMENTAL] Path to configuration file that can enable TLS or authentication.
      --web.listen-address=":9835"
                            Address to listen on for web interface and telemetry.
      --web.telemetry-path="/metrics"
                            Path under which to expose metrics.
      --nvidia-smi-command="nvidia-smi"
                            Path or command to be used for the nvidia-smi executable
      --query-field-names="AUTO"
                            Comma-separated list of the query fields. You can find out possible fields by running `nvidia-smi --help-query-gpus`. The value `AUTO` will
                            automatically detect the fields to query.
      --log.level=info      Only log messages with the given severity or above. One of: [debug, info, warn, error]
      --log.format=logfmt   Output format of log messages. One of: [logfmt, json]
      --version             Show application version.

4. Metrics reference
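
The exporter maps the fields reported by nvidia-smi to Prometheus metrics, so the exact metric names depend on the GPU, driver, and detected query fields. A quick way to list what a given installation actually exposes (assuming the default port used above):

curl -s http://localhost:9835/metrics | grep '^nvidia_' | cut -d'{' -f1 | sort -u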

Telegraf NVIDIA plugin (official documentation)

1. Overview:

Telegraf queries the local GPU devices (/dev/nvidia*) through the nvidia-smi binary.

2. Install Telegraf: refer to the official guide (a minimal example follows)
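
A minimal install sketch for a Debian/Ubuntu host, assuming the InfluxData package repository has already been added per the official guide; package names and service management differ on other distributions:

sudo apt-get update
sudo apt-get install telegraf
sudo systemctl enable --now telegraf
telegraf --version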

3. Configuration changes:

3.1 Locate the configuration file

It can be found from the startup command used when the collector was installed:
  • Binary install: ps -ef | grep telegraf or systemctl cat telegraf
  • Kubernetes deployment: inspect the Pod YAML directly

For reference:

/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d

Edit the /etc/telegraf/telegraf.conf file, or add a new configuration file such as telegraf-nvidia.conf under the /etc/telegraf/telegraf.d directory.

3.2 Add the nvidia_smi input to the configuration file:

[[inputs.nvidia_smi]]
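
After adding this block, the plugin can be verified without writing to any output by running Telegraf in test mode (paths follow the example startup command above):

telegraf --test --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d --input-filter nvidia_smi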

3.3 Other nvidia_smi options:

[[inputs.nvidia_smi]]
  ## Path to the nvidia-smi binary; change this if nvidia-smi is installed elsewhere
  # bin_path = "/usr/bin/nvidia-smi"
  ## Polling timeout
  # timeout = "5s"

3.4 Telegraf data handling:

Telegraf's output plugins can actively push data to a database: commonly used backends such as Prometheus, InfluxDB, and VictoriaMetrics are supported, and the http and sql outputs can also write directly to a relational database or an HTTP server.

When pushing data, the collected metrics must be formatted into the format expected by the target database.

The traditional pull model, where a port is exposed for Prometheus to scrape, is also supported.

Reference config for Telegraf actively pushing to Prometheus (remote write):
[[outputs.http]]
  url = "http://192.168.122.1:33090/api/v1/write"
  method = "POST"
  data_format = "prometheusremotewrite"
  ## Change to the actual username and password if authentication is configured
  # username = "username"
  # password = "xxxxxxx"
  [outputs.http.headers]
    Content-Type = "application/x-protobuf"
    Content-Encoding = "snappy"
    X-Prometheus-Remote-Write-Version = "0.1.0"

Reference config for exposing a port and waiting passively for Prometheus to scrape:
[[outputs.prometheus_client]]
  listen = ":9273"
  ## Use HTTP Basic Authentication.
  # basic_username = "Foo"
  # basic_password = "Bar"
  ## If set, the IP Ranges which are allowed to access metrics.
  ##   ex: ip_range = ["192.168.0.0/24", "192.168.1.0/30"]
  # ip_range = []
  ## Path to publish the metrics on.
  # path = "/metrics"
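
For the pull model, the Prometheus server also needs a scrape job pointing at the Telegraf listener. A minimal sketch is shown below; the job name and target address are placeholders to adapt:

# prometheus.yml on the Prometheus server (scrape side)
scrape_configs:
  - job_name: "telegraf-nvidia"            # hypothetical job name
    static_configs:
      - targets: ["<telegraf-host>:9273"]  # port from the listen setting above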

4. Troubleshooting:

  1. Confirm that the nvidia-smi path is correct and that the command returns output normally.

  2. Telegraf runs as the telegraf user by default; confirm whether the configuration file is restricted to root-only permissions (see the checks below).
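
A quick way to check both items is to run the relevant commands as the telegraf user; the paths below follow the defaults mentioned above:

sudo -u telegraf nvidia-smi
sudo -u telegraf telegraf --test --config /etc/telegraf/telegraf.conf --input-filter nvidia_smi
ls -l /etc/telegraf/telegraf.conf /etc/telegraf/telegraf.d/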

5. Metrics reference

Note:

The list below comes mainly from the official documentation; comparison shows that it differs from the metrics actually collected.

A brief check points to two main causes:

  1. Some metrics are string-typed, which Prometheus does not support storing;
  2. The documentation on the GitHub master branch has not been updated, and even after switching to the 1.35 release tag, the documented metrics are still inconsistent with what is collected.

  • measurement: nvidia_smi (prefix)
    • tags
      • name (type of GPU e.g. GeForce GTX 1070 Ti)
      • compute_mode (The compute mode of the GPU e.g. Default)
      • index (The port index where the GPU is connected to the motherboard e.g. 1)
      • pstate (Overclocking state for the GPU e.g. P0)
      • uuid (A unique identifier for the GPU e.g. GPU-f9ba66fc-a7f5-94c5-da19-019ef2f9c665)
    • fields (note: add the nvidia_smi prefix when querying, e.g. curl http://127.0.0.1:9090/api/v1/query?query=nvidia_smi_fan_speed)
      • fan_speed (integer, percentage)
      • fbc_stats_session_count (integer)
      • fbc_stats_average_fps (integer)
      • fbc_stats_average_latency (integer)
      • memory_free (integer, MiB)
      • memory_used (integer, MiB)
      • memory_total (integer, MiB)
      • memory_reserved (integer, MiB)
      • retired_pages_multiple_single_bit (integer)
      • retired_pages_double_bit (integer)
      • retired_pages_blacklist (string)
      • retired_pages_pending (string)
      • remapped_rows_correctable (int)
      • remapped_rows_uncorrectable (int)
      • remapped_rows_pending (string)
      • remapped_rows_failure (string)
      • power_draw (float, W)
      • temperature_gpu (integer, degrees C)
      • utilization_gpu (integer, percentage)
      • utilization_memory (integer, percentage)
      • utilization_encoder (integer, percentage)
      • utilization_decoder (integer, percentage)
      • pcie_link_gen_current (integer)
      • pcie_link_width_current (integer)
      • encoder_stats_session_count (integer)
      • encoder_stats_average_fps (integer)
      • encoder_stats_average_latency (integer)
      • clocks_current_graphics (integer, MHz)
      • clocks_current_sm (integer, MHz)
      • clocks_current_memory (integer, MHz)
      • clocks_current_video (integer, MHz)
      • driver_version (string)
      • cuda_version (string)

Metrics actually collected in testing (a sample query follows the list)

- nvidia_smi_clocks_current_graphics
- nvidia_smi_clocks_current_memory
- nvidia_smi_clocks_current_sm
- nvidia_smi_clocks_current_video
- nvidia_smi_encoder_stats_average_fps
- nvidia_smi_encoder_stats_average_latency
- nvidia_smi_encoder_stats_session_count
- nvidia_smi_fan_speed
- nvidia_smi_fbc_stats_average_fps
- nvidia_smi_fbc_stats_average_latency
- nvidia_smi_fbc_stats_session_count
- nvidia_smi_memory_free
- nvidia_smi_memory_reserved
- nvidia_smi_memory_total
- nvidia_smi_memory_used
- nvidia_smi_pcie_link_gen_current
- nvidia_smi_pcie_link_width_current
- nvidia_smi_power_draw
- nvidia_smi_remapped_rows_correctable
- nvidia_smi_remapped_rows_uncorrectable
- nvidia_smi_temperature_gpu
- nvidia_smi_utilization_decoder
- nvidia_smi_utilization_encoder
- nvidia_smi_utilization_gpu
- nvidia_smi_utilization_jpeg
- nvidia_smi_utilization_memory
- nvidia_smi_utilization_ofa
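
As a usage example, GPU memory usage can be computed as a percentage from the collected metrics; the Prometheus address follows the earlier curl example and is an assumption for your environment:

curl -sG 'http://127.0.0.1:9090/api/v1/query' \
  --data-urlencode 'query=nvidia_smi_memory_used / nvidia_smi_memory_total * 100'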
