NBA数据清洗与存储
进行读取,完成清洗。升序排列,存储到 MySQL 中。结果保留两位小数,并加上单位。结果保留两位小数,并加上单位。最终将清洗完成的数据根据字段。中所有的空值补充上字符串。或空值数据替换为:数字。1英寸=2.54厘米。或空值替换为:字符串。
编程要求
任务说明:对本地数据集/data/bigfiles/nba_data_all.csv进行读取,完成清洗。具体要求如下:
-
过滤重复数据。
-
对字段身高
player_ht进行计算,将英尺转为厘米单位,转换公式为:1英尺=30.48厘米,1英寸=2.54厘米,1英尺=12英寸,如:七尺一寸7-1->7 * 30.48 + 1 * 2.54 = 215.90cm;结果保留两位小数,并加上单位cm。 -
对字段体重
player_wt进行计算,将英镑转为公斤单位,转换公式为:1英镑 = 0.4536公斤,如:260lb->260 * 0.4536 = 117.93kg;结果保留两位小数,并加上单位kg。 -
将国籍
country中的括号[]或空值替换为:字符串"Unknown"。 -
将大学
college、高中high_school、球队名称last_team_name中所有的空值补充上字符串"Unknown"。 -
将字段同届排名
rank_year进行优化,如:2010 (71)->2010-71。 -
将字段
PTS | TRB | AST | FG | FG3 | FT| EFG | PER| WS中数据值为省略符号-或空值数据替换为:数字0.0。 -
将首秀时间
firstTime与退役时间lastTime中所有的空值替换为"9999-99-99"。
MySQL数据库连接信息
账号: root 密码:123123 地址:127.0.0.1 端口:3306
最终将清洗完成的数据根据字段id升序排列,存储到 MySQL 中 work 库下的 nba_data 表中(已创建),表结构信息如下:

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
object First_Question {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession
.builder()
.appName("First_Question")
.master("local[*]")
.getOrCreate()
/******************* Begin *******************/
// 1.读取采集的数据创建临时视图表
spark.read
.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
.option("header", value = true)
.load("file:///data/bigfiles/nba_data_all.csv")
.createOrReplaceTempView("data")
// 2.根据数据清洗要求,完成任务。
val dataFrame: DataFrame = spark.sql( """
select
distinct(id) id,info_url,player_name,player_pos,
case when player_ht like "%月%"
then concat(cast(regexp_extract(player_ht,"([0-9]{1})") * 30.48 + regexp_extract
(player_ht,"['月']([0-9]{1})") * 2.54 as decimal(10,2)),"cm")
when player_ht is not null
then concat(cast(split(player_ht,'-')[0] * 30.48 + split(player_ht,'-')[1] * 2.54 as decimal(10,2)),"cm")
else player_ht
end player_ht,
case when player_wt is not null
then concat(cast(regexp_extract(player_wt,"([0-9]{3})") * 0.4526 as decimal(10,2)),"kg")
else player_wt
end player_wt,
player_age,
case when country == "[]" or country is null then "Unknown" else country end country,
case when college is null then "Unknown" else college end college,
case when high_school is null then "Unknown" else high_school end high_school,
case when rank_year is not null
then concat_ws('-',regexp_extract(rank_year,"([0-9]{4})"),regexp_extract(rank_year,"[(]([0-9]+)"))
else rank_year
end rank_year,
draft,draft_date,work_year,team_count,
case when last_team_name is null then "Unknown" else last_team_name end last_team_name,
season,games_count,
case when PTS == "-" or PTS is null then 0.0 else PTS end PTS,
case when TRB == "-" or TRB is null then 0.0 else TRB end TRB,
case when AST == "-" or AST is null then 0.0 else AST end AST,
case when FG == "-" or FG is null then 0.0 else FG end FG,
case when FG3 == "-" or FG3 is null then 0.0 else FG3 end FG3,
case when FT == "-" or FT is null then 0.0 else FT end FT,
case when EFG == "-" or EFG is null then 0.0 else EFG end EFG,
case when PER == "-" or PER is null then 0.0 else PER end PER,
case when WS == "-" or WS is null then 0.0 else WS end WS,
case when firstTime is null then "9999-99-99" else firstTime end firstTime,
case when lastTime is null then "9999-99-99" else lastTime end lastTime
from data
""".stripMargin)
// 3.根据 id 升序排列,存储到 Mysql 指定库表中。
dataFrame.orderBy("id")
.repartition(1)
.write
.format("jdbc")
.option("driver", "com.mysql.jdbc.Driver")
.option("url","jdbc:mysql://127.0.0.1:3306/work")
.option("dbtable", "nba_data")
.option("user", "root")
.option("password", "123123")
.mode(SaveMode.Overwrite)
.save()
/******************* End *******************/
spark.stop()
}
}
//过滤重复数据
distinct(id) id
//对字段身高player_ht进行计算,将英尺转为厘米单位,转换公式为:1英尺=30.48厘米,1英寸=2.54厘米,1英尺=12英寸,如:七尺一寸 7-1 -> 7 * 30.48 + 1 * 2.54 = 215.90cm;结果保留两位小数,并加上单位cm
case when player_ht like "%月%"
then concat(cast(regexp_extract(player_ht,"([0-9]{1})") * 30.48 + regexp_extract
(player_ht,"['月']([0-9]{1})") * 2.54 as decimal(10,2)),"cm")
when player_ht is not null
then concat(cast(split(player_ht,'-')[0] * 30.48 + split(player_ht,'-')[1] * 2.54 as decimal(10,2)),"cm")
else player_ht
end player_ht,
//对字段体重player_wt进行计算,将英镑转为公斤单位,转换公式为:1英镑 = 0.4536公斤,如:260lb -> 260 * 0.4536 = 117.93kg;结果保留两位小数,并加上单位kg
case when player_wt is not null
then concat(cast(regexp_extract(player_wt,"([0-9]{3})") * 0.4526 as decimal(10,2)),"kg")
else player_wt
end player_wt
//将国籍country中的括号[]或空值替换为:字符串"Unknown"
case when country == "[]" or country is null then "Unknown" else country end country
//将大学college、高中high_school、球队名称last_team_name中所有的空值补充上字符串"Unknown"
case when college is null then "Unknown" else college end college,
case when high_school is null then "Unknown" else high_school end high_school,
case when last_team_name is null then "Unknown" else last_team_name end last_team_name
//将字段同届排名rank_year进行优化,如:2010 (71) -> 2010-71
case when rank_year is not null
then concat_ws('-',regexp_extract(rank_year,"([0-9]{4})"),regexp_extract(rank_year,"[(]([0-9]+)"))
else rank_year
end rank_year
//将字段 PTS | TRB | AST | FG | FG3 | FT| EFG | PER| WS中数据值为省略符号-或空值数据替换为:数字0.0
case when PTS == "-" or PTS is null then 0.0 else PTS end PTS,
case when TRB == "-" or TRB is null then 0.0 else TRB end TRB,
case when AST == "-" or AST is null then 0.0 else AST end AST,
case when FG == "-" or FG is null then 0.0 else FG end FG,
case when FG3 == "-" or FG3 is null then 0.0 else FG3 end FG3,
case when FT == "-" or FT is null then 0.0 else FT end FT,
case when EFG == "-" or EFG is null then 0.0 else EFG end EFG,
case when PER == "-" or PER is null then 0.0 else PER end PER,
case when WS == "-" or WS is null then 0.0 else WS end WS
//将首秀时间firstTime与退役时间lastTime中所有的空值替换为"9999-99-99"
case when firstTime is null then "9999-99-99" else firstTime end firstTime,
case when lastTime is null then "9999-99-99" else lastTime end lastTime
魔乐社区(Modelers.cn) 是一个中立、公益的人工智能社区,提供人工智能工具、模型、数据的托管、展示与应用协同服务,为人工智能开发及爱好者搭建开放的学习交流平台。社区通过理事会方式运作,由全产业链共同建设、共同运营、共同享有,推动国产AI生态繁荣发展。
更多推荐

所有评论(0)