专栏/k8s loki 容器日志解决方案-4. alertmanager 报警及loki rules

k8s loki 容器日志解决方案-4. alertmanager 报警及loki rules

2022年06月29日 09:01--浏览 · --点赞 · --评论
粉丝:1502文章:8


### alertmanager 部署

下载 https://github.com/prometheus/alertmanager/releases/tag/v0.24.0


上传文件到 /data 目录


```

cd /data

tar xf alertmanager-0.24.0.linux-amd64.tar.gz

mv alertmanager-0.24.0.linux-amd64 alertmanager


cat <<EOF> /etc/supervisord.d/alertmanager.ini 

[program:alertmanager] 

command=/data/alertmanager/alertmanager

autorestart=true

autostart=true

stderr_logfile=/tmp/alertmanager_err.log

stdout_logfile=/tmp/alertmanager_out.log

user=root

stopsignal=INT

startsecs=10

startretries=3

directory=/data/alertmanager

EOF

```


## alertmanager 报警配置

### alertmanager 部署
下载 https://github.com/prometheus/alertmanager/releases/tag/v0.24.0

上传文件到 /data 目录

```
cd /data
tar xf alertmanager-0.24.0.linux-amd64.tar.gz
mv alertmanager-0.24.0.linux-amd64 alertmanager

cat <<EOF> /etc/supervisord.d/alertmanager.ini
[program:alertmanager]
command=/data/alertmanager/alertmanager
autorestart=true
autostart=true
stderr_logfile=/tmp/alertmanager_err.log
stdout_logfile=/tmp/alertmanager_out.log
user=root
stopsignal=INT
startsecs=10
startretries=3
directory=/data/alertmanager
EOF
```


```
cat <<EOF> /data/alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'smtp.qiye.aliyun.com:25'
  smtp_from: 'alert<XXX>'
  smtp_auth_username: 'XXX'
  smtp_auth_password: 'XXX'
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    email_configs:
    - to: 'XXX'
      send_resolved: false
EOF

```

global 中配置了其他配置上下文中需要的参数,以及部分参数默认值。
route 中定义路由树中的节点及其子节点。每个警报进入路由树,根据标签匹配路径,如果不匹配任何子节点,则按照默认配置处理。
receivers 一个或多个通知集成的配置。内置邮箱,微信,webhook等通知。
```
email_configs:
  [ - <email_config>, ... ]
opsgenie_configs:
  [ - <opsgenie_config>, ... ]
pagerduty_configs:
  [ - <pagerduty_config>, ... ]
pushover_configs:
  [ - <pushover_config>, ... ]
slack_configs:
  [ - <slack_config>, ... ]
sns_configs:
  [ - <sns_config>, ... ]
victorops_configs:
  [ - <victorops_config>, ... ]
webhook_configs:
  [ - <webhook_config>, ... ]
wechat_configs:
  [ - <wechat_config>, ... ]
telegram_configs:
  [ - <telegram_config>, ... ]
```


启动

### 配置loki 报警规则

```
cat <<'EOF'> /data/loki/rules/fake/rules.yaml
groups:
    - name: service OutOfMemoryError
      rules:
        # 关键字监控
        - alert: loki check words java.lang.OutOfMemoryError
          expr: sum by (env, hostname, log_type, filename) (count_over_time({env=~"\\w+"} |= "java.lang.OutOfMemoryError" [5m]) > 0)
          labels:
            severity: critical
          annotations:
            description: '{{$labels.env}} {{$labels.hostname}} file {{$labels.filename}} has  {{ $value }} error'
            summary: java.lang.OutOfMemoryError
        # java 程序日志性能报警
        - alert: loki java full gc count check
          expr: sum by (env, hostname, log_type, filename) (count_over_time({env=~"\\w+"} |= "Full GC (Allocation" [5m]) > 5)
          labels:
            severity: warning
          annotations:
            description: '{{$labels.env}} {{$labels.hostname}} {{$labels.filename}} {{ $value }}'
            summary: java full gc count check
        # 使用正则表达式报警匹配示例
        - alert: dbperform slowlog sql 慢查询
          expr: 'sum by (env, hostname, log_type, filename) (count_over_time({env=~"\\w+"} |~ "time: [1-9]\\d{4,}" [5m]) > 5)'
          labels:
            severity: warning
          annotations:
            description: '{{$labels.env}} {{$labels.hostname}} file {{$labels.filename}} has  {{ $value }} error'
            summary: sql slowlog
EOF
```

查看 loki 日志,日志关键字
`msg="updating rule file" file=/data/loki/rules-temp/fake/rules.yaml`
上面日志显示,loki已经更新新的规则文件。
如果文件的格式有问题,将无法加载文件,日志会显示错误原因。
每次更新rule file,需要查看loki日志,确认配置更新。

### 手动触发报警

向 /var/log/messages 写入一行日志,日志中包含关键字

`echo 'The String object java.lang.OutOfMemoryError is used to represent and manipulate a sequence of characters.' >> /var/log/messages`

`echo 'The String object Full GC (Allocation is used to represent and manipulate a sequence of characters.' >> /var/log/messages`

`echo 'The String object time: 21345 ms is used to represent and manipulate a sequence of characters.' >> /var/log/messages`

grafana 执行触发语句,查看执行结果

登录邮箱查看邮件

alertmanager web ui 操作 http://192.168.171.129:9093/#/alerts

以上各个组件的单机部署已经完毕,下面进行容器化部署。


投诉或建议