数据可视化数据分析常用图 seaborn

数据分析阶段常用的统计图，验证数据分布，发现数据之间的关系，进行异常值检测。

weixin_37763484

637人浏览 · 2022-12-13 11:12:56

weixin_37763484 · 2022-12-13 11:12:56 发布

本文主要介绍几种数据分析阶段常用的统计图，可以用来验证数据分布，发现数据之间的关系，或进行异常值检测等。

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt	
import seaborn as sns
from scipy import stats
import math

import warnings 
warnings.filterwarnings("ignore")

数据准备

首先准备两份数据，鸢尾花数据集

第一份是iris原始数据，第二份iris_z将所有特征取整

然后观察数据基本信息，如缺失值，平均值等

1.构造数据

from sklearn.datasets import load_iris

data=load_iris().data
target=load_iris().target

data_f=pd.DataFrame(data)
target_f=pd.DataFrame(target)
iris=pd.concat([data_f,target_f],axis=1)

data_f.columns=["w1","w2","l1","l2"]
for column in data_f.columns:
    data_f[column]=data_f[column]=data_f[column].apply(lambda x :math.floor(x))

iris_z=pd.concat([data_f,target_f],axis=1)

iris.columns=["w1","w2","l1","l2","target"]
iris.head()

	w1	w2	l1	l2
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

iris_z.columns=["w1","w2","l1","l2","target"]
iris_z.head()

	w1	w2	l1
0	5	3	1
1	4	3	1
2	4	3	1
3	4	3	1
4	5	3	1

2.观察基本信息

iris.describe()

	w1	w2	l1	l2	target
count	150.000000	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.054000	3.758667	1.198667	1.000000
std	0.828066	0.433594	1.764420	0.763161	0.819232
min	4.300000	2.000000	1.000000	0.100000	0.000000
25%	5.100000	2.800000	1.600000	0.300000	0.000000
50%	5.800000	3.000000	4.350000	1.300000	1.000000
75%	6.400000	3.300000	5.100000	1.800000	2.000000
max	7.900000	4.400000	6.900000	2.500000	2.000000

iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
w1        150 non-null float64
w2        150 non-null float64
l1        150 non-null float64
l2        150 non-null float64
target    150 non-null int32
dtypes: float64(4), int32(1)
memory usage: 5.4 KB

绘图

1.countplot

对w1特征进行分析，看其取值不同时，对应的target分别有哪些

sns.countplot("w1",hue="target",data=iris_z)

在这里插入图片描述

2.透视图

看w1和w2的不同取值组合，分别对应的target的平均值

k=iris_z.groupby(["w1","w2"])["target"].mean()
k

w1  w2
4   2     0.750000
    3     0.000000
5   2     1.208333
    3     0.242424
    4     0.000000
6   2     1.440000
    3     1.689655
7   2     2.000000
    3     1.888889
Name: target, dtype: float64

3. 数据分布图

对两个数据的"w1"特征进行分析，蓝色线条是数据的分布，黑色是期待的分布，

可以看到将数据离散化后得到iris_z,与正态分布差异较大

sns.distplot(iris["w1"],fit=stats.norm)
plt.show()
sns.distplot(iris_z["w1"],fit=stats.norm)
plt.show()

在这里插入图片描述

4.qq图

qq图，蓝色点越接近红线，就越符合正态分布

对上面的iris[“w1”]进行分析，发现其非常符合正态分布

而iris_z[“w1”]则不符合正态分布

x=stats.probplot(iris["w1"],plot=plt)
plt.show()
x=stats.probplot(iris_z["w1"],plot=plt)

在这里插入图片描述

5.分布比较图

把数据分成训练集和测试集，看特征w1在两个数据及上分布是否一致

如果不一致，说明这个特征不应该被使用，应该被删除

下图中，差异比较小，说明w1可以使用

iris_copy=iris.copy()
iris_shuffle=iris_copy.sample(frac=1)
x_train=iris_shuffle.iloc[0:120]
x_test=iris_shuffle.iloc[120:]
ax=sns.kdeplot(x_train["w1"],color="red",shade=True)
ax=sns.kdeplot(x_test["w1"],color="blue",shade=True)

在这里插入图片描述

6.相关性热力图

用于比较特征之间的相关度

绝对值越大，相关性越强，即“-1”比“0”更相关

下图中，可以发现target与l1、l2相关度较高

plt.figure(figsize=[5,5])
sns.heatmap(iris[["l1","l2","w1","w2","target"]].corr(),annot=True)

在这里插入图片描述

7.箱型图

观察上一步中，w2的分散情况，可以用来发现异常值,如果需要还可以删除

fig = plt.figure(figsize=(6,4))
w2=pd.DataFrame(iris["w2"]) 
box=w2.boxplot(
    return_type="both",
            notch=True, # 是否用盒子形状
            sym='r*',    # 用红色矩形展示异常值
            showmeans=True,#展示均值点
            patch_artist=False,#是否要填充色
            meanline=True,#展示均值线
            widths=0.5,#设置箱盒宽度
            vert=True)   #垂直展示图形
 
t = plt.title('Box plot')
# 原文链接：https://blog.csdn.net/opp003/article/details/84959020

在这里插入图片描述

异常值的上下界，可以用

box.lines[“whiskers”][0] 和 box.lines[“whiskers”][0] 来获得

print("line1:",box.lines["whiskers"][0].get_ydata())
print("line2:",box.lines["whiskers"][1].get_ydata())

low_value=box.lines["whiskers"][0].get_ydata()[1]
high_value=box.lines["whiskers"][1].get_ydata()[1]
print("low_value:",low_value)
print("high_value:",high_value)

line1: [2.8 2.2]
line2: [3.3 4. ]
low_value: 2.2
high_value: 4.0

魔乐社区

魔乐社区（Modelers.cn) 是一个中立、公益的人工智能社区，提供人工智能工具、模型、数据的托管、展示与应用协同服务，为人工智能开发及爱好者搭建开放的学习交流平台。社区通过理事会方式运作，由全产业链共同建设、共同运营、共同享有，推动国产AI生态繁荣发展。

更多推荐

大数据毕业设计选题推荐-基于大数据的农作物产量数据分析与可视化系统-Hadoop-Spark-数据可视化-BigData

魔乐社区

大模型推理适配实战：手把手带你完成vLLM Ascend迁移实操

魔乐社区

基于python大数据的汽车数据分析系统设计与实现

魔乐社区

所有评论(0)

查看更多评论

weixin_37763484

@weixin_37763484

已为社区贡献5条内容

	w1	w2	l1	l2
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	w1	w2	l1	l2
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

数据可视化 数据分析 常用图 seaborn

weixin_37763484

数据准备

1.构造数据

2.观察基本信息

绘图

1.countplot

2.透视图

3. 数据分布图

4.qq图

5.分布比较图

6.相关性热力图

7.箱型图

所有评论(0)

weixin_37763484

数据可视化数据分析常用图 seaborn

	w1	w2	l1	l2
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2