A Complete Guide to Implementing Health Checks in Spring Boot

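Five minutes a day, one core Spring Boot topic at a time. Hi everyone, I'm Xiaohuai from the Spring Boot Guide. Over the past two days we covered logging and monitoring. Today's topic is more practical: if your application is "sick", how do you find out? And once you know, what do you do about it?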

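1. Your application may be "working while sick"

Here are a few real-world scenarios:

Scenario 1:
User: "My payment went through, so why was no order created?"
You: "The database is a bit slow. A restart will fix it."
The truth: the database connection pool has been leaking for 3 days.

Scenario 2:
Boss: "Why has the site been so sluggish lately?"
You: "The servers are underpowered. Let's add a few machines."
The truth: one SQL query has no index, and its full table scans are slowing down the whole system.

Scenario 3:
Ops: "Memory is about to blow up!"
You: "That's just how Java applications are. A restart will fix it."
The truth: a memory leak of 50 MB per day, which is bound to crash the application within a month.

What these problems have in common: the application is "working while sick", but nobody knows where it hurts, until users complain, the boss loses patience, or the server goes down.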

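2. Spring Boot's "health clinic": Actuator

Spring Boot provides a "health clinic" of its own, called Actuator. Think of it this way:

It is your application's personal doctor
It monitors health around the clock
It can hand you a checkup report at any time

2.1 Open the clinic in 3 steps

Step 1: Add the "checkup equipment" (dependency)

<!-- Add this dependency to pom.xml -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>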

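Step 2: Turn on the "checkup switch" (configuration)

# application.yml
management:
  endpoints:
    web:
      exposure:
        include: "*"  # Expose every endpoint (handy locally; restrict this in production)
  endpoint:
    health:
      show-details: always  # Without this, /actuator/health returns only the overall status

Step 3: Read the "checkup report". Start the application and open http://localhost:8080/actuator/health

You will see something like: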

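{
  "status": "UP",  // overall status: UP is healthy, DOWN is sick
  "components": {
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 500000000000,  // total disk space in bytes (about 500 GB)
        "free": 300000000000,   // free space in bytes (about 300 GB)
        "threshold": 10485760   // minimum free space in bytes (10 MB by default)
      }
    },
    "ping": {
      "status": "UP"  // basic heartbeat
    }
  }
}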
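If a script or smoke test needs to read this report, the sketch below shows one way to do it, assuming Java 11+ and the URL from Step 3. `HealthStatusReader` and its methods are illustrative names, not part of Actuator. The regex is a shortcut that picks up the first `status` field, assuming it comes first as in the sample above; a real client would use a JSON parser.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Spring Boot.
class HealthStatusReader {

    // Matches the first "status": "..." pair in the response body.
    private static final Pattern STATUS =
        Pattern.compile("\"status\"\\s*:\\s*\"([A-Z_]+)\"");

    // Returns the first status found, or UNKNOWN if there is none.
    static String extractStatus(String json) {
        Matcher m = STATUS.matcher(json);
        return m.find() ? m.group(1) : "UNKNOWN";
    }

    // Fetches the endpoint with short timeouts; any failure counts as DOWN.
    static String fetchStatus(String url) {
        try {
            HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            return extractStatus(response.body());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "DOWN";
        } catch (Exception e) {
            return "DOWN";
        }
    }
}
```

A call such as `HealthStatusReader.fetchStatus("http://localhost:8080/actuator/health")` can then gate a deployment script or a smoke test.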

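2.2 The 4 most important checkup items

/actuator/health - overall health
Like taking your blood pressure and pulse. It tells you: is the application still alive?
/actuator/info - basic information
Like an ID card. It tells you: who is this, which version is it, and when was it built?
/actuator/metrics - performance metrics
Like a full physical. It tells you: is CPU usage high, is there enough memory, how many requests are coming in?
/actuator/loggers - log management
Like a medical record. It tells you: which log levels are in effect, and it lets you change them at runtime.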

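3. Custom checkup items: checking third-party services

Checking your own health is not enough. You also need to check whether the "friends" you depend on (other services) are healthy.

3.1 Checking the database connection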

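import java.sql.Connection;
import java.time.LocalDateTime;
import javax.sql.DataSource;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

// Implementing HealthIndicator is what makes this check appear in /actuator/health
// (package names as in Spring Boot 2.x/3.x). The component key is the bean name,
// here "databaseHealthCheck".
@Component
public class DatabaseHealthCheck implements HealthIndicator {

    @Autowired
    private DataSource dataSource;

    @Override
    public Health health() {
        // try-with-resources returns the connection to the pool, even on failure
        try (Connection conn = dataSource.getConnection()) {
            // Check that the connection is valid (2-second timeout)
            if (conn.isValid(2)) {
                return Health.up()
                    .withDetail("message", "Database connection OK")
                    .withDetail("time", LocalDateTime.now().toString())
                    .build();
            } else {
                return Health.down()
                    .withDetail("error", "Database connection is invalid")
                    .build();
            }
        } catch (Exception e) {
            return Health.down()
                .withDetail("error", "Database connection failed")
                .withDetail("reason", e.getMessage())
                .build();
        }
    }
}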

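Open /actuator/health. When the database is unreachable, you will see:

{
  "status": "DOWN",  // the overall status is unhealthy!
  "components": {
    "databaseHealthCheck": {
      "status": "DOWN",
      "details": {
        "error": "Database connection failed",
        "reason": "Connection refused"
      }
    }
  }
}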
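Why does one failing component drag the whole report down? Actuator picks the most severe component status, and by default the severity order is DOWN, OUT_OF_SERVICE, UP, UNKNOWN. The plain-Java sketch below mimics that rule with strings; `StatusAggregationSketch` is an illustrative name, not the Actuator class that does this work.

```java
import java.util.List;

// Illustrative re-implementation of the default aggregation rule.
class StatusAggregationSketch {

    // Most severe first: the first status found in this order wins.
    private static final List<String> SEVERITY_ORDER =
        List.of("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    static String aggregate(List<String> componentStatuses) {
        for (String candidate : SEVERITY_ORDER) {
            if (componentStatuses.contains(candidate)) {
                return candidate;
            }
        }
        // No recognised status at all
        return "UNKNOWN";
    }
}
```

So a report with ten healthy components and one DOWN component is DOWN overall, which is exactly what a load balancer or orchestrator polling the endpoint will act on.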

3.2 Checking that Redis is working

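import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.data.redis.core.RedisCallback;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Component;

@Component
public class RedisHealthCheck implements HealthIndicator {

    @Autowired
    private RedisTemplate<String, String> redisTemplate;

    @Override
    public Health health() {
        try {
            // Send a simple PING command
            String result = redisTemplate.execute(
                (RedisCallback<String>) connection ->
                    connection.ping()
            );

            // Redis answers PING with "PONG"
            if ("PONG".equalsIgnoreCase(result)) {
                return Health.up()
                    .withDetail("message", "Redis is working")
                    .build();
            } else {
                return Health.down()
                    .withDetail("error", "Unexpected reply from Redis")
                    .build();
            }
        } catch (Exception e) {
            return Health.down()
                .withDetail("error", "Redis connection failed")
                .withDetail("reason", e.getMessage())
                .build();
        }
    }
}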

3.3 Checking third-party APIs

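import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.boot.actuate.health.Status;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Component;
import org.springframework.web.client.RestTemplate;

@Component
public class ThirdPartyHealthCheck implements HealthIndicator {

    // Spring Boot does not create a RestTemplate bean for you; define one
    // (ideally with connect and read timeouts) in a @Configuration class.
    @Autowired
    private RestTemplate restTemplate;

    @Override
    public Health health() {
        // Check the payment API
        Health paymentHealth = checkPaymentService();

        // Check the SMS API
        Health smsHealth = checkSmsService();

        // If either one is unhealthy, the overall result is unhealthy
        boolean anyDown = Status.DOWN.equals(paymentHealth.getStatus())
            || Status.DOWN.equals(smsHealth.getStatus());

        return (anyDown ? Health.down() : Health.up())
            .withDetail("payment", paymentHealth)
            .withDetail("sms", smsHealth)
            .build();
    }

    private Health checkPaymentService() {
        return checkEndpoint("https://payment.api.com/health");
    }

    // The original article does not show this method; the URL is a placeholder.
    private Health checkSmsService() {
        return checkEndpoint("https://sms.example.com/health");
    }

    private Health checkEndpoint(String url) {
        try {
            ResponseEntity<String> response =
                restTemplate.getForEntity(url, String.class);

            if (response.getStatusCode().is2xxSuccessful()) {
                return Health.up().build();
            } else {
                return Health.down()
                    .withDetail("status", response.getStatusCode().value())
                    .build();
            }
        } catch (Exception e) {
            // By default RestTemplate throws on 4xx/5xx responses and on connection errors
            return Health.down()
                .withDetail("error", e.getMessage())
                .build();
        }
    }
}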
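The checks above run one after another, so a hanging payment API also delays the SMS result. One possible refinement, sketched here in plain Java (9 or later) with statuses as strings, is to run the checks in parallel and give each a deadline. `ParallelChecks` and `runAll` are hypothetical names, and the checks are passed in as suppliers so the sketch stays independent of any HTTP client.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper for illustration.
class ParallelChecks {

    // Runs every check concurrently. A check that throws or exceeds the
    // timeout is reported as "DOWN" instead of blocking the others.
    static Map<String, String> runAll(Map<String, Supplier<String>> checks,
                                      long timeoutMillis) {
        Map<String, CompletableFuture<String>> futures = new LinkedHashMap<>();
        checks.forEach((name, check) -> futures.put(name,
            CompletableFuture.supplyAsync(check)
                .completeOnTimeout("DOWN", timeoutMillis, TimeUnit.MILLISECONDS)
                .exceptionally(error -> "DOWN")));

        // join() cannot throw here: failures were already mapped to "DOWN"
        Map<String, String> results = new LinkedHashMap<>();
        futures.forEach((name, future) -> results.put(name, future.join()));
        return results;
    }
}
```

Note that a timed-out check keeps running in the background until it finishes by itself, so the underlying HTTP call should still carry its own timeout.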

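4. A visual monitoring dashboard: Grafana

Tired of reading JSON? We need a more intuitive "checkup dashboard".

4.1 What is Grafana?

Put simply, Grafana is:

The big screen in a hospital checkup center: every indicator at a glance
A car's dashboard: speed and fuel consumption in real time
A monitoring console for your application: CPU, memory and request volume all on display

4.2 Build a monitoring dashboard in 5 minutes

Step 1: Add a "data collector"

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>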

Step 2: Expose the data endpoint (Prometheus scrapes /actuator/prometheus)

management:
  endpoints:
    web:
      exposure:
        include: prometheus,health,metrics

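Step 3: Start Prometheus and Grafana (Docker is the easiest way)

# Create a docker-compose.yml file
version: '3'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    # Prometheus also needs a prometheus.yml (mounted into the container) with a
    # scrape job that targets your application's /actuator/prometheus endpoint

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin  # change this outside local testing

# Run
docker-compose up -d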

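Step 4: Open the dashboard

Open http://localhost:3000
Username: admin, password: admin
Add Prometheus as a data source, then import a Spring Boot monitoring dashboard template

You will then see a dashboard along these lines:

┌─────────────────────────────────────┐
│  Spring Boot Application Monitoring │
├─────────────────────────────────────┤
│  ✅ CPU usage: 25%                  │
│  ✅ Memory: 1.2 GB / 2 GB           │
│  ✅ Requests: 1200/min              │
│  ❌ Error rate: 3.2% (high)         │
│  ✅ DB connections: 45/100          │
│  ✅ Response time: avg 125 ms       │
└─────────────────────────────────────┘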

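4.3 The 5 most important monitoring charts

Request volume line chart

Shows: requests per minute
Reveals: peak traffic periods

Error rate pie chart

Shows: the share of errors
Reveals: which endpoint fails most often

Response time trend chart

Shows: how endpoint response times change over time
Reveals: when things started to slow down

JVM heap memory chart

Shows: memory usage
Reveals: whether there is a memory leak

Database connection pool chart

Shows: how the connection count changes
Reveals: whether connections are being exhausted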
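5. Alerting: let the system "cry for help" by itself

Monitoring is in place, but you can't stare at a screen 24 hours a day. You need alerting, so that the system "cries for help" on its own.

5.1 Configuring DingTalk alerts (the most common choice)

Step 1: Create a DingTalk robot

DingTalk group → Group Settings → Smart Group Assistant → Add Robot
Choose "Custom"
Give the robot a name, for example "System Monitoring Robot"
Copy the webhook URL

Step 2: Configure the alert settings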

# application.yml
management:
  endpoint:
    health:
      # Show full details in the health response
      show-details: always

# DingTalk alert settings (custom properties read by your own code, not by Spring Boot)
alert:
  dingtalk:
    webhook: https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN

Step 3: Write the alert code

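import java.time.LocalDateTime;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.actuate.health.HealthEndpoint;
import org.springframework.boot.actuate.health.Status;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class HealthAlert {

    @Autowired
    private HealthEndpoint healthEndpoint;  // Actuator's aggregated health

    @Autowired
    private DingTalkService dingTalk;  // your own class that posts to the webhook

    private Status lastStatus = Status.UP;

    // Spring Boot does not publish an event when health changes, so this version
    // polls the health endpoint instead of using @EventListener.
    // Requires @EnableScheduling on a configuration class.
    @Scheduled(fixedDelay = 30000)
    public void watchHealth() {
        Status status = healthEndpoint.health().getStatus();

        // Alert only on the transition into DOWN, not on every poll
        if (Status.DOWN.equals(status) && !status.equals(lastStatus)) {
            String message = String.format(
                "[System Alert]\n" +
                "Application status: unhealthy\n" +
                "Time: %s\n" +
                "Details: see /actuator/health",
                LocalDateTime.now()
            );

            dingTalk.sendAlert(message);
        }
        lastStatus = status;
    }
}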
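The alert code relies on a `DingTalkService` that the article never shows. Below is one possible shape for its core, as a plain-Java sketch (Java 11+). It assumes the custom robot accepts a JSON text message of the form `{"msgtype":"text","text":{"content":"..."}}`; `DingTalkSketch` and its method names are made up for illustration, and robots configured with keyword or signature protection need extra handling that is not shown.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

// Hypothetical sender; not an official DingTalk SDK class.
class DingTalkSketch {

    // Escapes the characters that would break a JSON string literal.
    static String escapeJson(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Builds the body of a plain text message.
    static String buildTextPayload(String content) {
        return "{\"msgtype\":\"text\",\"text\":{\"content\":\""
            + escapeJson(content) + "\"}}";
    }

    // Posts the message to the webhook and returns the HTTP status code.
    static int sendAlert(String webhookUrl, String content) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(webhookUrl))
            .timeout(Duration.ofSeconds(3))
            .header("Content-Type", "application/json; charset=utf-8")
            .POST(HttpRequest.BodyPublishers.ofString(
                buildTextPayload(content), StandardCharsets.UTF_8))
            .build();
        return HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.discarding())
            .statusCode();
    }
}
```

The request timeout matters here: an alert sender that hangs on a slow webhook would otherwise stall the very code that is trying to report a problem.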

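5.2 Sample alert messages

Regular alert (sent to the DingTalk group):

[System Monitoring] User service response time is high
Service: user-service
Instance: 192.168.1.100:8080
Current response time: 2.1s
Threshold: 1.0s
Time: 2024-01-15 14:30:00
Suggestion: check the database indexes

Urgent alert (phone call):

[Urgent Alert] Order service database connection failed!
Status: DOWN
Problem: database connection pool exhausted
Impact: users cannot place orders
Time: 2024-01-15 14:35:00
Action: restart or scale out immediately!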

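5.3 Alert severity levels

alert:
  levels:
    p0:  # Critical: handle immediately
      conditions:
        - Service completely unavailable
        - Database connection failed
        - Core business failure rate > 20%
      actions:
        - Phone call
        - DingTalk message
        - SMS

    p1:  # High: handle within 1 hour
      conditions:
        - Response time > 3s
        - Error rate > 10%
        - Disk usage > 90%
      actions:
        - DingTalk message
        - Email

    p2:  # Medium: handle today
      conditions:
        - Memory usage > 80%
        - CPU usage > 70%
        - Request volume spikes by 200%
      actions:
        - Email
        - Write to log

    p3:  # Low: optimize this week
      conditions:
        - Number of slow queries increasing
        - Log error rate < 1%
        - Cache hit rate dropping
      actions:
        - Write to log
        - Discuss at the weekly meeting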
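The levels above are a policy, not something Spring Boot reads. To show how the numeric thresholds could be turned into code, here is a minimal sketch covering P0 to P2. `AlertLevelSketch`, `Snapshot` and their field names are all hypothetical; the thresholds are the ones from the list above.

```java
// Illustrative mapping from a metrics snapshot to an alert level.
class AlertLevelSketch {

    enum Level { P0, P1, P2, NONE }

    // One sample of the metrics used by the rules; defaults describe a healthy system.
    static class Snapshot {
        boolean serviceUp = true;
        boolean databaseUp = true;
        double coreFailureRatePercent;
        double responseTimeSeconds;
        double errorRatePercent;
        double diskUsagePercent;
        double memoryUsagePercent;
        double cpuUsagePercent;
    }

    // Checks the most severe rules first, so the highest matching level wins.
    static Level classify(Snapshot s) {
        if (!s.serviceUp || !s.databaseUp || s.coreFailureRatePercent > 20) {
            return Level.P0;
        }
        if (s.responseTimeSeconds > 3 || s.errorRatePercent > 10 || s.diskUsagePercent > 90) {
            return Level.P1;
        }
        if (s.memoryUsagePercent > 80 || s.cpuUsagePercent > 70) {
            return Level.P2;
        }
        return Level.NONE;
    }
}
```

Checking the rules from most to least severe means a system with both a full disk and a dead database pages someone by phone instead of only sending an email.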

6. A practical example: health checks for an e-commerce system

Suppose you run an e-commerce system and need to check the following:

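import java.util.HashMap;
import java.util.Map;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.boot.actuate.health.Status;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Component;

// OrderService, PaymentService, InventoryService and Order are the application's own classes.
@Component
public class EcommerceHealthCheck implements HealthIndicator {

    @Autowired
    private OrderService orderService;

    @Autowired
    private PaymentService paymentService;

    @Autowired
    private InventoryService inventoryService;

    @Autowired
    private RedisTemplate<String, String> redisTemplate;

    @Override
    public Health health() {
        Map<String, Health> details = new HashMap<>();

        // 1. Check the order service
        details.put("orderService", checkOrderService());

        // 2. Check the payment service
        details.put("paymentService", checkPaymentService());

        // 3. Check the inventory service
        details.put("inventoryService", checkInventoryService());

        // 4. Check the cache
        details.put("redis", checkRedis());

        // 5. Check the database
        details.put("database", checkDatabase());

        // Decide the overall health status
        boolean allUp = details.values().stream()
            .allMatch(h -> Status.UP.equals(h.getStatus()));

        return (allUp ? Health.up() : Health.down())
            .withDetails(details)
            .build();
    }

    // checkPaymentService(), checkInventoryService() and checkDatabase() are not shown
    // in the original article; they follow the same pattern as the two methods below.

    private Health checkOrderService() {
        try {
            // Simulate creating an order. This writes data on every health check,
            // so in production a read-only probe is usually the safer choice.
            Order testOrder = orderService.createTestOrder();

            return Health.up()
                .withDetail("message", "Order service OK")
                .withDetail("testOrderId", testOrder.getId())
                .build();
        } catch (Exception e) {
            return Health.down()
                .withDetail("error", "Order service error")
                .withDetail("reason", e.getMessage())
                .build();
        }
    }

    private Health checkRedis() {
        try {
            // Test the Redis connection and round-trip time
            long start = System.currentTimeMillis();
            redisTemplate.opsForValue().set("health_check", "test");
            String value = redisTemplate.opsForValue().get("health_check");
            long cost = System.currentTimeMillis() - start;

            if ("test".equals(value)) {
                return Health.up()
                    .withDetail("message", "Redis OK")
                    .withDetail("responseTime", cost + "ms")
                    .build();
            } else {
                return Health.down()
                    .withDetail("error", "Redis returned inconsistent data")
                    .build();
            }
        } catch (Exception e) {
            return Health.down()
                .withDetail("error", "Redis connection failed")
                .withDetail("reason", e.getMessage())
                .build();
        }
    }
}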

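Open /actuator/health and you will see something like this (inventoryService and database omitted for brevity):

{
  "status": "DOWN",
  "components": {
    "ecommerceHealthCheck": {
      "status": "DOWN",
      "details": {
        "orderService": {
          "status": "UP",
          "details": {
            "message": "Order service OK",
            "testOrderId": "123456"
          }
        },
        "paymentService": {
          "status": "UP",
          "details": {
            "message": "Payment service OK",
            "responseTime": "45ms"
          }
        },
        "redis": {
          "status": "DOWN",
          "details": {
            "error": "Redis connection failed",
            "reason": "Connection refused"
          }
        }
      }
    }
  }
}

Key takeaways:

Overall status: DOWN, because Redis is down
Which component failed: redis
Why it failed: Connection refused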

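7. Pitfalls to avoid

Pitfall 1: The health check itself takes the system down

// ❌ Wrong: the health check is too slow
public Health health() {
    // Runs a 10-second SQL query...
    ResultSet rs = executeLongQuery();
    return Health.up().build();
}

// ✅ Right: enforce a timeout
// "executor" is an ExecutorService field, e.g. Executors.newSingleThreadExecutor()
public Health health() {
    Future<Boolean> future = executor.submit(() -> checkDatabase());
    try {
        // Wait at most 2 seconds
        boolean healthy = future.get(2, TimeUnit.SECONDS);
        return healthy ? Health.up().build() : Health.down().build();
    } catch (TimeoutException e) {
        future.cancel(true);  // interrupt the check that is still running
        return Health.down()
            .withDetail("error", "Health check timed out")
            .build();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return Health.down(e).build();
    } catch (ExecutionException e) {
        return Health.down(e).build();
    }
}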

Pitfall 2: Leaking sensitive information

// ❌ Wrong: exposes the database password
return Health.up()
    .withDetail("database", "Connection OK")
    .withDetail("url", "jdbc:mysql://localhost:3306")
    .withDetail("username", "root")
    .withDetail("password", "123456")  // The password has just been leaked!
    .build();

// ✅ Right: expose only what is necessary
return Health.up()
    .withDetail("database", "Connection OK")
    .withDetail("responseTime", "20ms")
    .build();

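Pitfall 3: Too many alerts, and it turns into "the boy who cried wolf"

# ❌ Wrong: alert on everything
alert:
  rules:
    - CPU usage > 50%  # far too sensitive!
    - Memory usage > 60%
    - Request volume up 10%
    - Response time > 100ms

# ✅ Right: alert only on problems that matter
alert:
  rules:
    - Service unavailable
    - Error rate > 5%
    - Response time > 3s
    - Disk usage > 90%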
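8. Best practices summary

8.1 Health check checklist

# Items that must be checked
health:
  checks:
    # System level
    - Disk space
    - Memory usage
    - CPU load

    # Application level
    - Database connection
    - Redis connection
    - Message queue

    # Business level
    - Core API availability
    - Third-party services
    - Scheduled job status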

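8.2 Monitoring and alerting checklist

[ ] Is the monitoring dashboard reachable?
[ ] Are alerts actually firing?
[ ] Do the key metrics have thresholds?
[ ] Are the alert contacts correct?
[ ] Is there an alert escalation mechanism?
[ ] Can the alert history be traced?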

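8.3 A complete configuration example

# application-prod.yml
management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
      base-path: /monitor  # Custom path instead of /actuator (obscurity only; still secure the endpoints)

  endpoint:
    health:
      show-details: when-authorized  # Only authorized users see the details
      roles: ADMIN                   # Role required to see the details
      probes:
        enabled: true  # Enable the readiness and liveness probes

  # Access to the endpoints themselves is secured with Spring Security;
  # the old management.security.* properties were removed in Spring Boot 2.0

  # Metrics configuration
  metrics:
    export:
      prometheus:
        enabled: true  # Spring Boot 2.x key; in 3.x use management.prometheus.metrics.export.enabled
    tags:
      application: ${spring.application.name}
      environment: prod

# Custom health check settings (your own properties; Spring Boot does not read them)
custom:
  health:
    # Check interval
    check-interval: 30s
    # Timeout
    timeout: 5s
    # Number of retries
    retry-times: 3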
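Since the `custom.health.*` keys are the article's own, your code decides what they mean. As one possible reading of `retry-times`, the sketch below runs a check up to a given number of attempts and reports healthy on the first success. `RetryingCheck` is a hypothetical name, and treating the value as a total attempt count is an assumption of this sketch.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper that retries a flaky check before declaring it failed.
class RetryingCheck {

    // Runs the check up to maxAttempts times and stops at the first success.
    // An exception counts as a failed attempt.
    static boolean passes(BooleanSupplier check, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (check.getAsBoolean()) {
                    return true;
                }
            } catch (RuntimeException e) {
                // treat as a failed attempt and try again
            }
        }
        return false;
    }
}
```

Retries smooth over a single network blip, but they also multiply the worst-case duration of a check, so they need to be budgeted together with the timeout setting.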

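9. Today's question to think about

Scenario: you are the company's technical lead and need to design a health check plan:

What do you show the boss?

Whether the overall system is healthy
How many orders came in today
User growth trends

What do you show ops?

Server CPU and memory
Database connection count
Network latency

What do you show developers?

Which endpoint is slowest
Which errors occur most often
JVM memory status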

December 31, 2025 • Java