利用 MySQL 进行数据清洗的操作方法_Mysql

利用 mysql 进行数据清洗是数据预处理的重要环节，以下是常见的数据清洗操作及对应 sql 示例：

1. 去除重复数据

使用 row_number() 或 group by 识别并删除重复记录。

-- 查找重复记录（以 user_id 和 email 为例）
with duplicates as (
  select 
    user_id, 
    email, 
    row_number() over (partition by email order by user_id) as rn
  from users
)
select * from duplicates where rn > 1;
-- 删除重复记录（保留最新的一条）
delete from users
where user_id not in (
  select max(user_id) 
  from users 
  group by email
);

2. 处理缺失值

填充默认值：使用 coalesce() 或 ifnull()。
删除缺失值：使用 where 过滤。

-- 填充缺失值（将 null 替换为默认值）
update products
set price = coalesce(price, 0),  -- 价格为 null 时填充 0
    category = ifnull(category, '未知')  -- 分类为 null 时填充 '未知'
where price is null or category is null;
-- 删除包含缺失值的记录
delete from orders
where customer_id is null;

3. 数据标准化（大小写、格式统一）

转换大小写：使用 upper() 或 lower()。
去除空格：使用 trim()。
日期格式化：使用 str_to_date() 或 date_format()。

-- 统一邮箱为小写
update users
set email = lower(trim(email));
-- 标准化日期格式（将 '2023-12-31' 转为 '31-12-2023'）
update orders
set order_date = date_format(str_to_date(order_date, '%y-%m-%d'), '%d-%m-%y');

4. 处理无效数据

范围过滤：检查数值是否在合理区间。
正则匹配：验证格式（如邮箱、手机号）。

-- 删除年龄小于 0 或大于 120 的记录
delete from users
where age < 0 or age > 120;
-- 查找不符合邮箱格式的记录
select * from users
where email not regexp '^[a-za-z0-9._%+-]+@[a-za-z0-9.-]+\.[a-za-z]{2,}$';

5. 数据类型转换

使用 cast() 或 convert() 转换字段类型。

-- 将字符串类型的价格转为数值类型
alter table products
modify price decimal(10, 2) after cast(price as decimal(10, 2));

6. 合并 / 拆分字段

合并字段：使用 concat()。
拆分字段：使用 substring() 或 substring_index()。

-- 合并姓名（first_name 和 last_name）
update users
set full_name = concat(first_name, ' ', last_name);
-- 拆分地址（以逗号分隔）
alter table customers
add street varchar(100),
add city varchar(50);
update customers
set 
  street = substring_index(address, ',', 1),
  city = substring_index(address, ',', -1);

7. 异常值处理

通过统计方法（如 z-score）识别并处理异常值。

-- 计算平均价格和标准差
with stats as (
  select 
    avg(price) as avg_price,
    stddev(price) as std_price
  from products
)
-- 删除价格超过 3 个标准差的异常值
delete from products
where abs(price - (select avg_price from stats)) > 3 * (select std_price from stats);

执行建议

备份数据：清洗前先备份，避免误操作。
测试逻辑：先用 select 验证清洗逻辑，再执行 update 或 delete。
分批处理：大数据量时使用 limit 分批更新，避免锁表。

-- 示例：分批删除重复记录
delete from users
where user_id in (
  select user_id from (
    select user_id, 
           row_number() over (partition by email order by user_id) as rn
    from users
  ) t
  where rn > 1
)
limit 1000;  -- 每次处理 1000 条