[Python in Practice] Pandas Data Cleaning: From Dirty Data to a Perfect Dataset in Just 10 Steps (Full Code)

weixin_42904190 · Published 2026-01-09 15:01:56

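Python in Practice: A Complete Pandas Data Cleaning Tutorial

Abstract

This article shows how to clean data with Python's Pandas library, with complete code examples and practical scenarios. Data cleaning is one of the most important steps in data analysis, because it directly affects the accuracy of every later result.

Main Text

1. Background and Requirements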

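Data analysis projects routinely run into data quality problems:

Missing values (NaN, None, etc.)
Duplicate records
Outliers
Inconsistent data types
Inconsistent formatting

Left unhandled, these problems bias the analysis and can lead to wrong conclusions.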
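Before fixing anything, it helps to measure how bad the data is. The following is a minimal sketch of such an audit; the sample DataFrame and its column names are made up for illustration and are not from a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data containing each problem from the list above
raw = pd.DataFrame({
    "name": [" Alice ", "bob", "bob", "CAROL", None],
    "age": ["25", "31", "31", "abc", "40"],           # numbers stored as text
    "salary": [5000.0, 5200.0, 5200.0, 990000.0, np.nan],
})

missing_per_column = raw.isna().sum()                 # missing values per column
duplicate_rows = int(raw.duplicated().sum())          # fully identical rows
age_is_numeric = pd.api.types.is_numeric_dtype(raw["age"])

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("age is numeric:", age_is_numeric)
```

Numbers like these give a baseline to compare against after each cleaning step.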

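2. Technical Approach

We will use Pandas to address these problems systematically:

Handle missing values with dropna() or fillna()
Remove duplicates with drop_duplicates()
Identify outliers with statistical methods (IQR and z-score)
Unify data types with astype()
Normalize formatting with the .str string methods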
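As a rough sketch of how these calls fit together, the pipeline below chains them on a tiny made-up DataFrame (the column names are assumptions for illustration). Note the order: text is normalized before deduplication, otherwise " Beijing" and "beijing " would be treated as different values.

```python
import pandas as pd

# Hypothetical input; column names are illustrative only
df = pd.DataFrame({
    "city": [" Beijing", "beijing ", "Shanghai", None],
    "price": ["10", "10", "25", "30"],
})

cleaned = (
    df.dropna(subset=["city"])                                    # drop rows with no city
      .assign(city=lambda d: d["city"].str.strip().str.lower())   # normalize text
      .drop_duplicates()                                          # rows now compare equal
      .astype({"price": "int64"})                                 # text -> integer
      .reset_index(drop=True)
)
print(cleaned)
```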

3. Implementation

Environment Setup

First, make sure the required libraries are installed:


pip install pandas numpy matplotlib seaborn

Core Implementation


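import pandas as pd
import numpy as np
# matplotlib and seaborn are not used by DataCleaner itself;
# they are imported for plotting the cleaned data afterwards
import matplotlib.pyplot as plt
import seaborn as sns


class DataCleaner:
    """Utility class for cleaning a pandas DataFrame."""

    def __init__(self, df):
        """Work on a copy so the caller's DataFrame is never modified."""
        self.df = df.copy()
        self.original_shape = df.shape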



    def get_data_overview(self):
        """Print shape, dtypes, missing-value counts, and the first rows."""
        print("=== Data overview ===")
        print(f"Shape: {self.df.shape}")
        print("Dtypes:")
        print(self.df.dtypes)
        print("\nMissing values per column:")
        print(self.df.isnull().sum())
        print("\nFirst 5 rows:")
        print(self.df.head())



    def handle_missing_values(self, strategy='drop', columns=None):
        """Handle missing values.

        Args:
            strategy: one of 'drop', 'fill_mean', 'fill_median',
                'fill_mode', 'fill_forward'
            columns: list of column names; None means all columns
        """
        if columns is None:
            columns = list(self.df.columns)

        missing_before = self.df[columns].isnull().sum().sum()

        if strategy == 'drop':
            # Drop every row that has a missing value in any selected column
            self.df = self.df.dropna(subset=columns)
        elif strategy in ('fill_mean', 'fill_median'):
            for col in columns:
                # Mean and median only apply to numeric columns; others are skipped
                if pd.api.types.is_numeric_dtype(self.df[col]):
                    fill = (self.df[col].mean() if strategy == 'fill_mean'
                            else self.df[col].median())
                    self.df[col] = self.df[col].fillna(fill)
        elif strategy == 'fill_mode':
            for col in columns:
                mode_value = self.df[col].mode()
                if len(mode_value) > 0:
                    self.df[col] = self.df[col].fillna(mode_value.iloc[0])
        elif strategy == 'fill_forward':
            # fillna(method='ffill') is deprecated in recent pandas; ffill() is equivalent
            self.df[columns] = self.df[columns].ffill()
        else:
            raise ValueError(f"Unknown strategy: {strategy!r}")

        missing_after = self.df[columns].isnull().sum().sum()
        print(f"Missing values: {missing_before} -> {missing_after}")



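    def remove_duplicates(self, subset=None, keep='first'):
        """Remove duplicate rows, comparing only the `subset` columns if given."""
        # Count with the same subset/keep as the drop so the report matches
        duplicates_before = self.df.duplicated(subset=subset, keep=keep).sum()
        self.df = self.df.drop_duplicates(subset=subset, keep=keep)
        duplicates_after = self.df.duplicated(subset=subset, keep=keep).sum()
        print(f"Duplicate rows: {duplicates_before} -> {duplicates_after}")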



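    def detect_outliers(self, column, method='iqr', threshold=1.5):
        """Return the rows whose value in `column` is an outlier.

        Args:
            method: 'iqr' or 'zscore'
            threshold: IQR multiplier, used only by method='iqr';
                the z-score cutoff is fixed at 3
        """
        if method == 'iqr':
            q1 = self.df[column].quantile(0.25)
            q3 = self.df[column].quantile(0.75)
            iqr = q3 - q1
            lower_bound = q1 - threshold * iqr
            upper_bound = q3 + threshold * iqr
            outliers = self.df[(self.df[column] < lower_bound) |
                               (self.df[column] > upper_bound)]
        elif method == 'zscore':
            z_scores = np.abs((self.df[column] - self.df[column].mean()) / self.df[column].std())
            outliers = self.df[z_scores > 3]
        else:
            # Without this branch an unknown method left `outliers` undefined
            raise ValueError(f"Unknown method: {method!r}")

        print(f"Found {len(outliers)} outliers in column '{column}'")
        return outliers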



    def remove_outliers(self, column, method='iqr', threshold=1.5):
        """Drop the rows flagged by detect_outliers()."""
        outliers = self.detect_outliers(column, method, threshold)
        if len(outliers) > 0:
            self.df = self.df.drop(outliers.index)
            print(f"Removed {len(outliers)} outliers")



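    def standardize_text_columns(self, columns):
        """Standardize text columns; non-text columns are skipped."""
        for col in columns:
            if self.df[col].dtype == 'object':
                # Strip leading/trailing whitespace and convert to lowercase
                self.df[col] = self.df[col].str.strip().str.lower()
                # Collapse runs of whitespace into a single space
                self.df[col] = self.df[col].str.replace(r'\s+', ' ', regex=True)
        print(f"Standardized text columns: {columns}")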



    def convert_data_types(self, type_mapping):
        """Convert column data types.

        Args:
            type_mapping: dict such as {'column_name': 'int64'};
                use 'datetime' to parse a column with pd.to_datetime
        """
        for col, dtype in type_mapping.items():
            try:
                if dtype == 'datetime':
                    self.df[col] = pd.to_datetime(self.df[col])
                else:
                    self.df[col] = self.df[col].astype(dtype)
                print(f"Column '{col}' converted to {dtype}")
            except Exception as e:
                # On failure the column is left unchanged
                print(f"Column '{col}' conversion failed: {e}")



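    def get_cleaning_summary(self):
        """Print a cleaning summary and return the cleaned DataFrame."""
        rows_before = self.original_shape[0]
        rows_after = self.df.shape[0]
        print("\n=== Cleaning summary ===")
        print(f"Original shape: {self.original_shape}")
        print(f"Cleaned shape: {self.df.shape}")
        print(f"Rows removed: {rows_before - rows_after}")
        # Guard against division by zero when the input was empty
        if rows_before > 0:
            print(f"Rows retained: {rows_after / rows_before:.2%}")
        return self.df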
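One limitation of converting with astype() is that a single bad value makes the whole column fail. A possible alternative, sketched here on made-up data, is to use pd.to_numeric and pd.to_datetime with errors="coerce", which turn unparseable entries into NaN or NaT so they can be handled as ordinary missing values. This is an optional variation, not part of the class above.

```python
import pandas as pd

# Hypothetical columns with one unparseable number and one impossible date
df = pd.DataFrame({
    "age": ["25", "31", "abc", None],
    "signup": ["2024-01-05", "2024-02-30", "2024-03-01", None],
})

# errors="coerce" replaces values that cannot be parsed with NaN / NaT
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["signup"] = pd.to_datetime(df["signup"], errors="coerce")

print(df.dtypes)
print(df)
```

The trade-off is that bad values disappear silently, so it is worth counting the new missing values right after the conversion.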

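4. Testing and Verification

Calling the methods above on a DataFrame gives you the following:

Data overview: basic information about the raw data and its quality problems
Missing-value handling: filling or dropping missing data according to the chosen strategy
Duplicate removal: identifying and deleting duplicate records
Outlier detection: flagging outliers with the IQR and z-score methods
Standardization: unifying text formatting and data types
Visual analysis: plotting the distribution of the cleaned data (the class does not plot by itself; use matplotlib or seaborn on the returned DataFrame)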
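The two outlier rules do not always agree, which is worth checking on small samples. The sketch below applies both to ten made-up values with one obvious outlier. With only 10 points and the sample standard deviation, |z| cannot exceed 9/sqrt(10), about 2.85, so a cutoff of 3 can never flag anything, while the IQR rule still catches the extreme value.

```python
import pandas as pd

# Hypothetical sample: nine ordinary values and one extreme value
s = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 11, 500], name="value")

# IQR rule with the conventional 1.5 multiplier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# z-score rule with the conventional cutoff of 3 (pandas std uses ddof=1)
z = (s - s.mean()).abs() / s.std()
z_outliers = s[z > 3]

print("IQR flags:", iqr_outliers.tolist())
print("z-score flags:", z_outliers.tolist())
print("largest |z|:", round(float(z.max()), 3))
```

For small or heavily skewed samples, the IQR rule is usually the safer default.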

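5. Common Questions and Solutions

Question 1: How do I choose a missing-value strategy?

Solution:

When data is plentiful and few rows are affected, drop the rows with missing values
When an important column has only a few gaps, fill with the mean or median (the median is more robust to outliers)
Fill categorical data with the mode
Forward-fill time series data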
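To make the differences concrete, here is a small sketch that applies each fill strategy to made-up columns (names and values are illustrative only). The income column contains one extreme value, which pulls the mean far away from the typical value while the median stays close to it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [3000.0, 3200.0, np.nan, 3100.0, 50000.0],  # one extreme value
    "segment": ["a", "b", "a", None, "a"],                # categorical
    "reading": [1.0, np.nan, np.nan, 4.0, 5.0],           # ordered like a time series
})

mean_filled = df["income"].fillna(df["income"].mean())
median_filled = df["income"].fillna(df["income"].median())
mode_filled = df["segment"].fillna(df["segment"].mode().iloc[0])
forward_filled = df["reading"].ffill()

print("mean fill:   ", mean_filled.iloc[2])
print("median fill: ", median_filled.iloc[2])
print("mode fill:   ", mode_filled.iloc[3])
print("forward fill:", forward_filled.tolist())
```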

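Question 2: How do I avoid over-cleaning?

Solution:

Keep a backup of the raw data
Log every cleaning step
Validate the cleaned results regularly
Stay aligned with business requirements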
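One lightweight way to follow the first two points is to keep the raw DataFrame untouched and record row counts around every step. The sketch below is only an illustration; `run_step` and `log` are hypothetical helper names, not part of Pandas or of the DataCleaner class.

```python
import pandas as pd

raw = pd.DataFrame({"id": [1, 2, 2, 3, 4], "score": [90.0, 85.0, 85.0, None, 70.0]})
df = raw.copy()   # the raw data stays untouched as a backup
log = []          # one entry per cleaning step

def run_step(frame, name, func):
    """Apply one cleaning step and record the row count before and after."""
    result = func(frame)
    log.append({"step": name, "rows_before": len(frame), "rows_after": len(result)})
    return result

df = run_step(df, "drop_duplicates", lambda d: d.drop_duplicates())
df = run_step(df, "dropna(score)", lambda d: d.dropna(subset=["score"]))

retention = len(df) / len(raw)
print(pd.DataFrame(log))
print(f"retained {retention:.0%} of rows")
```

A step that removes far more rows than expected shows up immediately in the log, which is usually the first sign of over-cleaning.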

Question 3: How do I handle large datasets?

Solution:


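# Process a large dataset in chunks
chunk_size = 10000
cleaned_chunks = []

for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    cleaner = DataCleaner(chunk)
    # Note: this fills with the mean of the current chunk, not the global mean
    cleaner.handle_missing_values(strategy='fill_mean')
    cleaner.remove_duplicates()
    cleaned_chunks.append(cleaner.df)

final_df = pd.concat(cleaned_chunks, ignore_index=True)
# Duplicates that span two chunks only become visible after concatenation
final_df = final_df.drop_duplicates(ignore_index=True)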
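Filling each chunk with its own mean gives different fill values in different parts of the file. If a single global mean is wanted, one option is a two-pass approach: first accumulate the sum and count, then fill. The sketch below uses a tiny in-memory CSV as a stand-in for a large file; the column names are made up.

```python
import io
import pandas as pd

# Stand-in for a large file; in practice pass a file path to read_csv
csv_text = "id,value\n1,10\n3,20\n3,20\n2,\n4,\n5,30\n"

# Pass 1: accumulate sum and count to get the global mean chunk by chunk
total, count = 0.0, 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk["value"].sum()      # sum() skips NaN
    count += chunk["value"].count()    # count() counts non-missing values only
global_mean = total / count

# Pass 2: fill every chunk with the same global mean
parts = [
    chunk.fillna({"value": global_mean})
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]

# Deduplicate once at the end so duplicates split across chunks are caught
final_df = pd.concat(parts, ignore_index=True).drop_duplicates(ignore_index=True)
print(final_df)
```

The cost is reading the file twice, which is often acceptable when the file does not fit in memory anyway.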

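6. Extensions and Optimization

Advanced Cleaning Techniques

Model-based imputation: predict missing values with a regression model
Cluster-based outlier detection: density-based anomaly detection
Natural language processing: clean up semantic problems in text data
Automated cleaning: build a rule engine for cleaning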
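As a minimal illustration of the first idea, the sketch below imputes a missing column from a correlated one using a straight-line fit with np.polyfit. The data is invented and perfectly linear, and a real project would more likely use a dedicated modeling library and validate the fit; treat this as an assumption-laden toy example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'weight' is linear in 'height', with two gaps
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0, 190.0, 165.0, 175.0],
    "weight": [50.0, 60.0, 70.0, 80.0, 90.0, np.nan, np.nan],
})

known = df["weight"].notna()

# Fit weight = slope * height + intercept on the complete rows only
slope, intercept = np.polyfit(df.loc[known, "height"], df.loc[known, "weight"], deg=1)

# Predict only where weight is missing; observed values are left unchanged
df.loc[~known, "weight"] = slope * df.loc[~known, "height"] + intercept
print(df)
```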

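Performance Optimization


# Prefer the vectorized .str methods: they are concise and pass missing values
# through safely. Measure the speed-up on your own data instead of assuming a
# fixed factor.
df['column'] = df['column'].str.strip()

# apply() calls a Python function once per element, so keep it for logic
# that has no vectorized equivalent
def clean_text(text):
    # Leave missing values (NaN/None) untouched instead of raising AttributeError
    return text.strip().lower() if isinstance(text, str) else text

df['text_column'] = df['text_column'].apply(clean_text)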
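How much faster one approach is depends on the dtype, the data size, and the Pandas version, so it is worth timing both on your own data. The sketch below compares the two on made-up strings using the standard timeit module; the function names are hypothetical and no particular ratio is implied.

```python
import timeit
import pandas as pd

# Hypothetical test data: 20,000 strings, a quarter of them missing
s = pd.Series(["  Foo ", "BAR", None, " baz"] * 5_000)

def clean_with_apply(series):
    # One Python function call per element
    return series.apply(lambda v: v.strip().lower() if isinstance(v, str) else v)

def clean_with_str(series):
    # Vectorized accessor; missing values stay missing
    return series.str.strip().str.lower()

via_apply = clean_with_apply(s)
via_str = clean_with_str(s)

t_apply = timeit.timeit(lambda: clean_with_apply(s), number=3)
t_str = timeit.timeit(lambda: clean_with_str(s), number=3)
print(f"apply: {t_apply:.3f}s  .str: {t_str:.3f}s")
```

Both versions produce the same cleaned values, so the choice comes down to readability and the timings you observe.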

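Summary

In this article you have covered:

The core concepts of data cleaning and why it matters
The Pandas features used for data preprocessing
A complete cleaning workflow, from identifying problems to fixing them
Practical code techniques you can apply directly to real projects
Visualization as a way to verify cleaning results

Data cleaning is tedious, but it is the foundation of successful data analysis. Mastering these skills will improve both the efficiency of your analysis and the quality of your results.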

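Join the Discussion

What interesting problems have you run into while cleaning data, and how did you solve them? Share your experience in the comments!

If you found this article helpful, don't forget to like, bookmark, and follow for more Python learning resources!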
