Pandas Datasets

Tips and Tricks for Data Science

Pandas is a powerful and easy-to-use software library written in the Python programming language, and is used for data manipulation and analysis.

Installing pandas: https://pypi.org/project/pandas/

pip install pandas

What is a Pandas DataFrame?

A pandas DataFrame is a two-dimensional data structure that stores data in tabular form. Each row and column is labeled, and the columns can hold data of any type.

Here is an example:

[Figure: First 3 rows of the Titanic: Machine Learning from Disaster dataset]

1. Creating a pandas DataFrame

The pandas.DataFrame constructor:

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

data: This parameter is the input used to build the DataFrame. It can be a NumPy ndarray, an iterable, a dict, or another DataFrame. An ndarray is a multidimensional container of items of the same type and size. An iterable is any Python object that can return its members one at a time, which allows it to be iterated over in a for loop. Lists, tuples, and sets are examples of iterables. A dict can contain pandas Series, arrays, constants, or list-like objects as its values.

index: This parameter can be an Index or an array-like object and supplies the row labels of the resulting DataFrame. If the input data carries no index information and no index is provided, the row labels default to a RangeIndex (0, 1, 2, ...).

columns: This parameter can be an Index or an array-like object and supplies the column labels of the resulting DataFrame. If the input data carries no column labels and none are provided, the column labels default to a RangeIndex (0, 1, 2, ...).

dtype: Each column in a DataFrame has a single data type. This parameter forces one data type on the data; only a single dtype can be given. By default, the data type of each column is inferred from the data.

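copy: When this parameter is set to True and the input data is a DataFrame or a 2D ndarray, the data is copied into the resulting DataFrame. By default, the data is not copied for these input types (recent pandas versions document the default as None, which behaves like False for DataFrame and 2D ndarray input and like True for dict input).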
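As a rough illustration of the constructor parameters described above, the following sketch builds a DataFrame from a small NumPy array. The array values and the labels r1, r2, a, and b are made up for this example and do not come from the original article.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 input array; the values are illustrative only.
arr = np.array([[1, 2], [3, 4]])

# With no index or columns given, both default to a RangeIndex.
default_df = pd.DataFrame(arr)

# Explicit row labels, explicit column labels, and a forced data type.
labeled_df = pd.DataFrame(
    data=arr,
    index=["r1", "r2"],
    columns=["a", "b"],
    dtype=float,
)

print(default_df.index)
print(labeled_df)
print(labeled_df.dtypes)
```

Here dtype=float converts the integer input to floating point in every column, while the unlabeled version falls back to integer positions for both rows and columns.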

Creating a Pandas DataFrame from a Python Dictionary

import pandas as pd

d = {'Name': ['John', 'Adam', 'Jane'], 'Age': [25, 18, 30]}
pd.DataFrame(d)

The index parameter can be used to replace the default row index, and the columns parameter can be used to select the dictionary keys and change their order:

d = {'Name': ['John', 'Adam', 'Jane'], 'Age': [25, 18, 30]}
pd.DataFrame(d, index=[10, 20, 30], columns=['Age', 'Name'])
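The behavior of columns with dictionary input is easy to misread, so here is a minimal sketch of it. It assumes the same dictionary as above; the variable names reordered, with_missing, and renamed are hypothetical. With a dict, columns selects and orders existing keys rather than renaming them, and a label that is not a key produces a column of missing values.

```python
import pandas as pd

d = {"Name": ["John", "Adam", "Jane"], "Age": [25, 18, 30]}

# Keys listed in `columns` are selected in the given order.
reordered = pd.DataFrame(d, index=[10, 20, 30], columns=["Age", "Name"])

# A label that is not a key of the dict yields a column of missing values.
with_missing = pd.DataFrame(d, columns=["Name", "Current Age"])

# Renaming is a separate step, for example with DataFrame.rename().
renamed = pd.DataFrame(d).rename(
    columns={"Name": "First Name", "Age": "Current Age"}
)

print(reordered)
print(with_missing)
print(renamed)
```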

Creating a Pandas DataFrame from a List

l = [['John', 25], ['Adam', 18], ['Jane', 30]]
pd.DataFrame(l, columns=['Name', 'Age'])

Creating a Pandas DataFrame from a File

In most data science work, the dataset is stored in files with formats such as CSV (Comma Separated Values). Pandas can load the data from a CSV file, together with its column labels, into a DataFrame using the function pandas.read_csv().
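A minimal sketch of reading CSV data follows. To keep it self-contained, the file is simulated with an in-memory string; the contents are invented for illustration and are not the contents of the article's Example1.csv.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; the contents are invented for illustration.
csv_text = "Name,Age\nJohn,25\nAdam,18\nJane,30\n"

# read_csv accepts a file path or a file-like object. By default the
# first line of the file is used as the column labels.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```

With a real file, the call would take the path instead, for example pd.read_csv('Example1.csv').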

[Figure: Example1.csv]

2. Selecting Rows and Columns from a Pandas DataFrame

Selecting Columns from a Pandas DataFrame

Columns can be selected using their column names. A single name returns a Series, and a list of names returns a DataFrame.

df[[column_1, column_2]]
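As a small sketch of column selection, assuming a DataFrame df with the Name and Age columns used earlier (the variable names name_col and subset are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Adam", "Jane"], "Age": [25, 18, 30]})

# A single label returns a Series.
name_col = df["Name"]

# A list of labels returns a DataFrame, hence the double brackets.
subset = df[["Name", "Age"]]

print(type(name_col).__name__)
print(type(subset).__name__)
```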

[Figure: Selecting column ‘Name’ from DataFrame df]

Selecting Rows from a Pandas DataFrame

Pandas provides two attributes for selecting rows from a DataFrame: loc and iloc.

loc is label-based, which means that the row label has to be specified, and iloc is integer-based, which means that the integer position of the row has to be specified.
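The difference between the two is easiest to see when the row labels are not the default 0, 1, 2. The sketch below assumes a DataFrame with the made-up row labels 10, 20, and 30.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["John", "Adam", "Jane"], "Age": [25, 18, 30]},
    index=[10, 20, 30],
)

# loc looks rows up by label: the row labeled 20.
by_label = df.loc[20]

# iloc looks rows up by integer position: the second row.
by_position = df.iloc[1]

# Label slices include both endpoints; position slices exclude the stop.
label_slice = df.loc[10:20]
position_slice = df.iloc[0:2]

print(by_label)
print(label_slice)
```

In this example both lookups return the row for Adam, and both slices return the first two rows.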

[Figure: Using loc and iloc for selecting rows from DataFrame df]

3. Inserting Rows and Columns to a Pandas DataFrame

Inserting Rows to a Pandas DataFrame

One method of inserting a row into a DataFrame is to create a pandas.Series object and add it at the end of the DataFrame. This was originally done with the pandas.DataFrame.append() method, which was deprecated in pandas 1.4 and removed in pandas 2.0; pandas.concat() is the replacement in current versions. The column labels of the DataFrame serve as the index of the Series object.
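A minimal sketch of the row insertion described above, using pandas.concat() rather than DataFrame.append() because append() is no longer available in pandas 2.0 and later. The new row values (Kelly, 22) are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Adam", "Jane"], "Age": [25, 18, 30]})

# The index of the Series matches the column labels of the DataFrame.
new_row = pd.Series(["Kelly", 22], index=df.columns)

# Turn the Series into a one-row DataFrame and concatenate it at the end.
# ignore_index=True renumbers the rows from 0.
df = pd.concat([df, new_row.to_frame().T], ignore_index=True)

print(df)
```

Because the one-row frame is built from a Series with mixed values, the column data types may be widened to object after concatenation; building the row with pd.DataFrame([{...}]) instead is one way to avoid that.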

[Figure: Inserting new row to DataFrame df]

Inserting Columns to a Pandas DataFrame

An easy way to add a column to a DataFrame is to refer to the new column by name and assign values to it.
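As a sketch of column assignment, the example below adds ID, Score, and Country columns. The column names follow the figure caption; the values themselves are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Adam", "Jane"], "Age": [25, 18, 30]})

# A list must supply exactly one value per row.
df["ID"] = [1, 2, 3]
df["Score"] = [88.5, 92.0, 79.5]

# A scalar is repeated for every row.
df["Country"] = "USA"

print(df)
```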

[Figure: Inserting columns ID, Score and Country to DataFrame df]

4. Deleting Rows and Columns from a Pandas DataFrame

Deleting Rows from a Pandas DataFrame

A row can be deleted by passing its row label to the pandas.DataFrame.drop() method.
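A short sketch of dropping a row by label, assuming the default integer row labels. Note that drop() returns a new DataFrame and leaves the original unchanged unless the result is assigned back.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Adam", "Jane"], "Age": [25, 18, 30]})

# Drop the row whose label is 1; df itself is not modified.
dropped = df.drop(1)

print(dropped)
```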

[Figure: Deleting row with label 1 from DataFrame df]

To delete rows based on a column value, first obtain the labels of the matching rows using the DataFrame.index attribute of the filtered DataFrame, then pass those labels to the pandas.DataFrame.drop() method.
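The two steps can be sketched as follows, assuming a DataFrame that contains a row named Kelly (the data and the variable name kelly_labels are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["John", "Adam", "Jane", "Kelly"], "Age": [25, 18, 30, 22]}
)

# Step 1: filter on the column value and take the labels of matching rows.
kelly_labels = df[df["Name"] == "Kelly"].index

# Step 2: drop the rows with those labels.
df = df.drop(kelly_labels)

print(df)
```

Keeping only the rows that do not match, as in df[df["Name"] != "Kelly"], gives the same result in one step.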

[Figure: Deleting row with Name Kelly from DataFrame df]

Deleting Columns from a Pandas DataFrame

A column can be deleted from a DataFrame with the pandas.DataFrame.drop() method, either by its label or by its position. Because drop() takes labels, a position is first converted to a label through the DataFrame.columns attribute.

[Figure: Deleting column with label ‘Country’ from DataFrame df]
[Figure: Deleting column with position 2 from DataFrame df]

The axis argument is set to 1 when dropping columns and to 0 (the default) when dropping rows.
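A minimal sketch of both ways of dropping a column, together with the axis argument. The Country values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["John", "Adam", "Jane"],
        "Age": [25, 18, 30],
        "Country": ["USA", "UK", "Canada"],
    }
)

# Drop a column by label.
without_country = df.drop("Country", axis=1)

# Drop a column by position: look up the label at position 2 first.
without_third = df.drop(df.columns[2], axis=1)

# axis=0 drops rows instead: here, the row labeled 0.
without_first_row = df.drop(0, axis=0)

print(without_country)
```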

5. Sorting a Pandas DataFrame

A Pandas DataFrame can be sorted using the pandas.DataFrame.sort_values() method. The by parameter takes the label of the column to sort by. The ascending parameter is set to True (the default) for sorting in ascending order and to False for sorting in descending order.
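The two sorts shown in the figures can be sketched like this, assuming the small Name and Age DataFrame used earlier:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Adam", "Jane"], "Age": [25, 18, 30]})

# Sort by Name in ascending order.
by_name = df.sort_values(by="Name", ascending=True)

# Sort by Age in descending order.
by_age_desc = df.sort_values(by="Age", ascending=False)

print(by_name)
print(by_age_desc)
```

sort_values() returns a new DataFrame, and the rows keep their original labels, so the index appears out of order after sorting.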

[Figure: Sorting DataFrame df by Name in ascending order]
[Figure: Sorting DataFrame df by Age in descending order]

Translated from: https://medium.com/ml-course-microsoft-udacity/5-fundamental-operations-on-a-pandas-dataframe-93b4384dff9d
