The Ultimate Beginner's Guide to Prometheus (Part 5): No More Excuses for Not Knowing Prometheus! (Long read, ongoing series)

Author: 新盟教育 (Xinmeng Education) | Published: 2022-01-27

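Operations in Practice

Getting started

This is a "hello, world"-style walkthrough that shows how to quickly install, configure, and run a simple Prometheus demo.

You will download Prometheus and run it locally, write a configuration file that makes it scrape itself and a sample application, and then use queries, rules, and graphs to work with the collected samples.

Downloading and running Prometheus

Download the release for your platform, then extract it and change into the directory: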

tar zxvf prometheus-*.tar.gz
cd prometheus-*

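Before starting Prometheus, we need to configure it.

Configuring Prometheus to monitor itself

Prometheus collects samples from its targets by scraping them over HTTP. Since it exposes its own metrics the same way, it can also scrape itself and monitor its own health.

A Prometheus server that only scrapes itself is not very useful in practice, but it makes a good starting demo. Save the following Prometheus configuration as prometheus.yml: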

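global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

  # Attach these labels to every sample when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration for scraping Prometheus itself.
scrape_configs:
  # The job name is added as a job label to every scraped sample, along with
  # an instance label holding the target's host:port.
  - job_name: 'prometheus'

    # Override the global default and scrape this job's targets every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']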

For a complete list of configuration options, see the configuration documentation.
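To make the override behaviour concrete, here is a minimal Python sketch of how the effective scrape interval of each job could be derived from the configuration above. The configuration is written out as a plain dict rather than parsed from YAML, the job named other is hypothetical, and only a few duration units are handled; this illustrates the rule, it is not how Prometheus itself is implemented.

```python
import re

# Units accepted by this sketch; Prometheus durations support more (e.g. d, w, y).
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert a simple duration such as '15s' or '5m' to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def effective_scrape_intervals(config):
    """Map each job name to the interval it is scraped at, in seconds."""
    # Prometheus falls back to 1m when global.scrape_interval is not set.
    default = config.get("global", {}).get("scrape_interval", "1m")
    return {
        job["job_name"]: parse_duration(job.get("scrape_interval", default))
        for job in config.get("scrape_configs", [])
    }


# The configuration above, written as a plain Python dict.
config = {
    "global": {"scrape_interval": "15s"},
    "scrape_configs": [
        {
            "job_name": "prometheus",
            "scrape_interval": "5s",
            "static_configs": [{"targets": ["localhost:9090"]}],
        },
        # Hypothetical second job without an override, for contrast.
        {"job_name": "other", "static_configs": [{"targets": ["localhost:9100"]}]},
    ],
}

intervals = effective_scrape_intervals(config)
print(intervals)
```

The prometheus job is scraped every 5 seconds because of its own scrape_interval, while a job without an override inherits the global 15 seconds.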

Starting Prometheus

Start Prometheus, pointing it at the configuration file you just created:

./prometheus --config.file=prometheus.yml

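Prometheus should now be running. Open http://localhost:9090 in a browser to see its web interface.

You can also open http://localhost:9090/metrics to fetch the latest samples of all the metrics Prometheus exposes about itself.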
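The /metrics page is plain text, one sample per line. As a rough illustration of its shape, the following Python sketch parses a hand-written excerpt into (name, labels, value) tuples. The excerpt and its values are made up, and the parser is deliberately simplified: it ignores optional timestamps and does not unescape label values.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>.*)\})?'               # optional {label="value",...}
    r'\s+(?P<value>\S+)'                     # sample value
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse text-format samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Hand-written excerpt in the text exposition format (values are made up).
EXAMPLE = '''
# HELP prometheus_target_interval_length_seconds Actual intervals between scrapes.
# TYPE prometheus_target_interval_length_seconds summary
prometheus_target_interval_length_seconds{interval="5s",quantile="0.5"} 4.9998
prometheus_target_interval_length_seconds{interval="5s",quantile="0.99"} 5.0012
prometheus_target_interval_length_seconds_count{interval="5s"} 42
go_goroutines 31
'''

samples = parse_metrics(EXAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```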

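Using the expression browser

To use Prometheus's built-in expression browser, navigate to http://localhost:9090/graph and choose the "Console" view within the "Graph" tab.

One of the metrics Prometheus exports about itself is prometheus_target_interval_length_seconds, the actual amount of time between two consecutive scrapes. Enter it into the expression console: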

prometheus_target_interval_length_seconds

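This should return a number of different time series (along with the latest value recorded for each), all named prometheus_target_interval_length_seconds but with different label sets. The labels identify different latency quantiles and target group intervals.

If we are only interested in the 99th percentile latencies, we can filter the results with this query:

prometheus_target_interval_length_seconds{quantile="0.99"}

To count the number of returned time series, you can write: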

count(prometheus_target_interval_length_seconds)
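Conceptually, a selector picks the series whose labels match, and count() reports how many were picked. The following Python sketch mimics that on a tiny in-memory list; the label values are illustrative rather than real output, and only equality matchers are modelled.

```python
# Each series is a metric name plus a set of labels; values are omitted here.
# The label values below are illustrative, not real output.
SERIES = [
    {"__name__": "prometheus_target_interval_length_seconds", "interval": "5s", "quantile": "0.5"},
    {"__name__": "prometheus_target_interval_length_seconds", "interval": "5s", "quantile": "0.9"},
    {"__name__": "prometheus_target_interval_length_seconds", "interval": "5s", "quantile": "0.99"},
    {"__name__": "go_goroutines"},
]


def select(series, name, **equals):
    """Return series with the given metric name whose labels equal the matchers."""
    return [
        s for s in series
        if s.get("__name__") == name
        and all(s.get(label) == value for label, value in equals.items())
    ]


name = "prometheus_target_interval_length_seconds"
all_series = select(SERIES, name)                   # bare metric name
p99_only = select(SERIES, name, quantile="0.99")    # {quantile="0.99"}
print(len(all_series), len(p99_only))               # count(...) of each selection
```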

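Using the graphing interface

To graph expressions, navigate to http://localhost:9090/graph and use the "Graph" tab.

For example, enter the following expression to graph the per-second rate at which chunks were created, computed over the last minute: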

rate(prometheus_tsdb_head_chunks_created_total[1m])
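For intuition about what a per-second rate over a window means, here is a simplified Python sketch. It divides the counter's increase by the elapsed time between the first and last sample in the window and treats a drop in value as a counter reset. It is an approximation only: Prometheus's rate() additionally extrapolates to the window boundaries, and the sample data is made up.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a window of (timestamp, value) samples.

    Counter resets (a value lower than the previous one) are treated as a
    restart from zero. Unlike Prometheus's rate(), this does not extrapolate
    to the window boundaries.
    """
    if len(samples) < 2:
        return None                      # not enough points in the window
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur   # reset: count from 0
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Made-up samples of a counter scraped every 5 seconds over one minute.
window = [(t, 100 + 2 * (t // 5)) for t in range(0, 65, 5)]
print(simple_rate(window))               # increase of 24 over 60 seconds
```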

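Starting up some sample targets

The Go client library includes an example that exports RPC call latencies for three services.

You need a working Go development environment for this demo. Download the Prometheus Go client library, build its example, and start three example services, listening on ports 8080, 8081, and 8082.

You can now open http://localhost:8080/metrics, http://localhost:8081/metrics, and http://localhost:8082/metrics in a browser to see the metrics each one exposes.

Configuring Prometheus to monitor the sample targets

Now we will configure Prometheus to scrape these three targets. We group all three endpoints into one job called example-random.

However, imagine that the first two endpoints are production targets while the third is a test instance. To model this, we attach a group label to each group of endpoints: group="production" for the first group and group="test" for the second: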

scrape_configs:
  - job_name: 'example-random'

    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'

      - targets: ['localhost:8082']
        labels:
          group: 'test'

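Go to the expression browser and query rpc_durations_seconds to verify that every series Prometheus scraped from these targets carries a group label, and that the label takes only two values: production and test.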
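The way these labels end up on every series can be sketched in a few lines of Python. The job is written as a plain dict mirroring the example-random configuration, and the function simply lists the label set each target would contribute; it is an illustration of the idea, not Prometheus's actual relabeling pipeline.

```python
def expand_targets(job):
    """List the label set attached to every series scraped from each target."""
    expanded = []
    for group in job["static_configs"]:
        for target in group["targets"]:
            labels = {"job": job["job_name"], "instance": target}
            labels.update(group.get("labels", {}))   # e.g. group="production"
            expanded.append(labels)
    return expanded


# The example-random job from the configuration above, as a Python dict.
job = {
    "job_name": "example-random",
    "scrape_interval": "5s",
    "static_configs": [
        {"targets": ["localhost:8080", "localhost:8081"], "labels": {"group": "production"}},
        {"targets": ["localhost:8082"], "labels": {"group": "test"}},
    ],
}

target_labels = expand_targets(job)
for labels in target_labels:
    print(labels)
```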

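Configuring rules for aggregating scraped data into new time series

The queries above work fine in this small example, but with a very large number of series, computing them on the fly becomes the bottleneck: queries that aggregate over thousands of time series get slower and slower.

To improve performance, Prometheus lets you configure recording rules in a rule file, which precompute an expression and store the result as a new, persisted time series.

Let's continue with the per-second rate of the example RPCs (rpc_durations_seconds_count). Suppose there are many instances and we want the rate over a 5-minute window, averaged across all instances while preserving the job and service dimensions. We can write this as: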

avg(rate(rpc_durations_seconds_count[5m])) by (job, service)
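To see what "avg ... by (job, service)" does to the label sets, consider this small Python sketch. It groups per-instance values by the listed labels, drops every other label (here, instance), and averages each group. The input rates and the service names are made up for illustration.

```python
from collections import defaultdict


def avg_by(series, keys):
    """Average values per group, keeping only the labels listed in keys."""
    groups = defaultdict(list)
    for labels, value in series:
        group_key = tuple((k, labels.get(k, "")) for k in keys)
        groups[group_key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# Made-up per-instance rates, as rate(rpc_durations_seconds_count[5m]) might return.
rates = [
    ({"job": "example-random", "service": "exponential", "instance": "localhost:8080"}, 10.0),
    ({"job": "example-random", "service": "exponential", "instance": "localhost:8081"}, 14.0),
    ({"job": "example-random", "service": "uniform", "instance": "localhost:8080"}, 5.0),
    ({"job": "example-random", "service": "uniform", "instance": "localhost:8082"}, 7.0),
]

averaged = avg_by(rates, ("job", "service"))
for key, value in averaged.items():
    print(dict(key), value)
```

Four input series collapse into two output series, one per (job, service) pair.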

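To record the result of this expression, we give it a new metric name, job_service:rpc_durations_seconds_count:avg_rate5m, create a recording rule file that stores the expression under that name, and save the file as prometheus.rules.yml.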
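As a hedged sketch of what such a rule file might contain, the following Python renders a single recording rule as YAML text from the metric name and expression given above. It assumes the grouped rule-file layout used by Prometheus 2.x (groups, name, rules, record, expr); the group name example is an assumption, since the text does not give one.

```python
def render_rule_file(group_name, record, expr):
    """Render a rule file containing a single recording rule as YAML text."""
    return (
        "groups:\n"
        f"  - name: {group_name}\n"
        "    rules:\n"
        f"      - record: {record}\n"
        f"        expr: {expr}\n"
    )


rule_yaml = render_rule_file(
    group_name="example",   # assumed group name, not given in the text
    record="job_service:rpc_durations_seconds_count:avg_rate5m",
    expr="avg(rate(rpc_durations_seconds_count[5m])) by (job, service)",
)
print(rule_yaml)
```

Writing the printed text to prometheus.rules.yml gives a file that the rule_files setting can point at.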

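Then add a rule_files statement to the Prometheus configuration file (as a top-level entry, alongside global and scrape_configs) and an evaluation_interval to the global section. The final configuration file should look like this: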

global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # Evaluate rules every 15 seconds.

  # Attach these extra labels to all timeseries collected by this Prometheus instance.
  external_labels:
    monitor: 'codelab-monitor'

rule_files:
  - 'prometheus.rules.yml'

scrape_configs:
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'example-random'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'

      - targets: ['localhost:8082']
        labels:
          group: 'test'

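Restart Prometheus with the new configuration file, then query job_service:rpc_durations_seconds_count:avg_rate5m to verify that the new metric is available.

Visualization

Grafana

Grafana support for Prometheus visualization

Grafana supports querying Prometheus. Prometheus has been available as a Grafana data source since Grafana 2.5.0 (2015-10-28).

The following example shows Prometheus queries graphed on a Grafana dashboard:

[Image: 图片1.png]

Installing Grafana

To install Grafana on Linux: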

# Download and unpack Grafana from binary tar (adjust version as appropriate).
curl -L -O https://grafanarel.s3.amazonaws.com/builds/grafana-2.5.0.linux-x64.tar.gz
tar zxf grafana-2.5.0.linux-x64.tar.gz

# Start Grafana.
cd grafana-2.5.0/
./bin/grafana-server web

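Usage

By default, Grafana listens on http://localhost:3000. The default login is "admin" / "admin".

Creating a Prometheus data source

To create a Prometheus data source:

Click the Grafana logo to open the sidebar menu.
Click "Data Source" in the sidebar.
Click "Add New".
Select "Prometheus" as the data source type.
Set the Prometheus server URL (for example, http://localhost:9090).
Adjust other data source settings as desired (for example, turning proxy access off).
Click "Add" to save the new data source.

The following shows an example Prometheus data source configuration:

[Image: 图片2.png]

Creating a Prometheus graph

The standard way to add a new Grafana graph is as follows: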

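Click the graph title, at the top center of the graph, then click "Edit".
Under the "Metrics" tab, select your Prometheus data source (bottom right).
Enter any Prometheus expression into the "Query" field, while using the "Metrics" field to look up metrics via autocompletion.
To format the legend names of time series, use the "Legend format" input. For example, to show only the method and status labels of a returned query result, use the legend format {{method}} - {{status}}.
Tune other graph settings until you have a working graph.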
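The legend format is a small template in which each {{label}} placeholder is filled with that label's value for the series being drawn. A minimal Python sketch of the substitution, assuming a hypothetical label set and treating unknown labels as empty strings (an assumption of this sketch, not a statement about Grafana's exact behaviour):

```python
import re


def format_legend(template, labels):
    """Replace each {{label}} placeholder with that label's value."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: labels.get(m.group(1), ""),   # unknown labels become empty
        template,
    )


# Hypothetical label set of one series returned by a query.
series_labels = {"method": "GET", "status": "200", "instance": "localhost:8080"}
legend = format_legend("{{method}} - {{status}}", series_labels)
print(legend)   # GET - 200
```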

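The following shows an example Prometheus graph configuration:

[Image: 图片3.png]

Importing pre-built dashboards from Grafana.net

Grafana.net maintains a collection of shared dashboards that can be downloaded and used with your own Grafana server. Use the Grafana.net "Filter" option to browse dashboards for the Prometheus data source only.

Currently you must manually edit the downloaded JSON file and correct its datasource: entries so that they name the Grafana data source you chose for your Prometheus server. Then use the "Dashboard" -> "Home" -> "Import" option to import the edited dashboard file into Grafana.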
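The manual edit can also be scripted. The Python sketch below walks a dashboard's JSON and sets every datasource field to the data source name you chose in Grafana. The tiny dashboard, its placeholder value, and the name Prometheus are stand-ins; real dashboard files contain many more fields, and their exact layout varies between Grafana versions.

```python
import json


def set_datasource(node, name):
    """Recursively point every "datasource" field at the named data source."""
    if isinstance(node, dict):
        return {
            key: name if key == "datasource" else set_datasource(value, name)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [set_datasource(item, name) for item in node]
    return node


# Tiny stand-in for a downloaded dashboard; real files have many more fields.
downloaded = json.loads("""
{
  "title": "Example",
  "rows": [
    {"panels": [{"title": "Queries", "datasource": "${DS_PROMETHEUS}"}]}
  ]
}
""")

edited = set_datasource(downloaded, "Prometheus")   # name chosen in Grafana
print(json.dumps(edited, indent=2))
```

The function returns a new structure and leaves the loaded one untouched, so the original file contents can still be compared against the edited version before importing.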

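Console templates

Console templates let you create arbitrary consoles using the Go templating language. They are served by the Prometheus server.

Getting started

Prometheus ships with an example set of consoles to get you going. They can be found at consoles/index.html.example on a running Prometheus server, and they display Node Exporter consoles if Prometheus is scraping Node Exporters with the label job="node".

The example consoles have 5 parts:

A navigation bar at the top.
A menu on the left.
Time controls at the bottom.
The main content in the center, usually graphs.
A table on the right.

The navigation bar is for links to other systems, such as other Prometheus servers, documentation, and whatever else makes sense to you.

The menu is for navigation inside the same Prometheus server, which makes it quick to open a console in another tab. Both the navigation bar and the menu are configured in console_libraries/menu.lib.

The time controls let you change the duration and range of the graphs. Console URLs can be shared and will show the same graphs to others.

The main content is usually graphs. A configurable JavaScript graphing library is provided that handles requesting data from Prometheus and rendering it via Rickshaw.

Finally, the table on the right can be used to display statistics in a more compact form than graphs.

Example console

This is a basic console. In the right-hand table it shows the number of tasks and how many of them are up, the average CPU usage, and the average memory usage. The main content is a queries-per-second graph: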

{{template "head" .}}

{{template "prom_right_table_head"}}
<tr>
  <th>MyJob</th>
  <th>{{ template "prom_query_drilldown" (args "sum(up{job='myjob'})") }}
      / {{ template "prom_query_drilldown" (args "count(up{job='myjob'})") }}
  </th>
</tr>
<tr>
  <td>CPU</td>
  <td>{{ template "prom_query_drilldown" (args
      "avg by(job)(rate(process_cpu_seconds_total{job='myjob'}[5m]))"
      "s/s" "humanizeNoSmallPrefix") }}
  </td>
</tr>
<tr>
  <td>Memory</td>
  <td>{{ template "prom_query_drilldown" (args
       "avg by(job)(process_resident_memory_bytes{job='myjob'})"
       "B" "humanize1024") }}
  </td>
</tr>
{{template "prom_right_table_tail"}}

{{template "prom_content_head" .}}
<h1>MyJob</h1>

<h3>Queries</h3>
<div id="queryGraph"></div>
<script>
new PromConsole.Graph({
  node: document.querySelector("#queryGraph"),
  expr: "sum(rate(http_query_count{job='myjob'}[5m]))",
  name: "Queries",
  yAxisFormatter: PromConsole.NumberFormatter.humanizeNoSmallPrefix,
  yHoverFormatter: PromConsole.NumberFormatter.humanizeNoSmallPrefix,
  yUnits: "/s",
  yTitle: "Queries"
})
</script>

{{template "prom_content_tail" .}}

{{template "tail"}}

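The rest of the console template documentation is not translated here; Grafana is the recommended choice.

Tools

Clients

Client libraries

Before you can monitor your services, you need to add instrumentation to their code using one of the Prometheus client libraries. These libraries implement the Prometheus metric types.

Choose the client library that matches the language your application is written in. It lets you define internal metrics and expose them via an HTTP endpoint on your application's instance:

Go
Java or Scala
Python
Ruby

When Prometheus scrapes your instance's HTTP endpoint, the client library sends the current state of all tracked metrics to the server.

If no client library is available for your language, or you want to avoid dependencies, you can also implement one of the supported exposition formats yourself to expose metrics.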
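As a rough idea of what implementing the text exposition format yourself involves, here is a minimal Python sketch that renders one metric family as text. The metric name, help text, and values are hypothetical, and the sketch covers only the basics: HELP and TYPE lines, labelled samples, and escaping of label values. It does not escape help text or emit timestamps.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, help_text, metric_type, samples):
    """Render one metric family; samples is a list of (labels, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter tracked by the application itself.
body = render_metric(
    "myapp_requests_total",
    "Total requests handled.",
    "counter",
    [({"method": "GET", "status": "200"}, 1027), ({"method": "POST", "status": "500"}, 3)],
)
print(body, end="")
```

An application would return this text from an HTTP endpoint such as /metrics, typically with a text/plain content type, so that Prometheus can scrape it like any other target.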

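When implementing a new Prometheus client library, please follow the client library guidelines. Note that this document is still a work in progress.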