编程要求

任务说明:对本地数据集/data/bigfiles/nba_data_all.csv进行读取,完成清洗。具体要求如下:

  1. 过滤重复数据。

  2. 对字段身高player_ht进行计算,将英尺转为厘米单位,转换公式为:1英尺=30.48厘米1英寸=2.54厘米1英尺=12英寸,如:七尺一寸 7-1 -> 7 * 30.48 + 1 * 2.54 = 215.90cm;结果保留两位小数,并加上单位cm

  3. 对字段体重player_wt进行计算,将英镑转为公斤单位,转换公式为:1英镑 = 0.4536公斤,如:260lb -> 260 * 0.4536 = 117.93kg;结果保留两位小数,并加上单位kg

  4. 将国籍country中的括号[]或空值替换为:字符串"Unknown"

  5. 将大学college、高中high_school、球队名称last_team_name中所有的空值补充上字符串"Unknown"

  6. 将字段同届排名rank_year进行优化,如:2010 (71) -> 2010-71

  7. 将字段 PTS | TRB | AST | FG | FG3 | FT| EFG | PER| WS中数据值为省略符号-或空值数据替换为:数字0.0

  8. 将首秀时间firstTime与退役时间lastTime中所有的空值替换为"9999-99-99"

MySQL数据库连接信息

账号: root 密码:123123 地址:127.0.0.1 端口:3306

最终将清洗完成的数据根据字段id升序排列,存储到 MySQL 中 work 库下的 nba_data 表中(已创建),表结构信息如下:

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object First_Question {
    def main(args: Array[String]): Unit = {

        val spark: SparkSession = SparkSession
                .builder()
                .appName("First_Question")
                .master("local[*]")
                .getOrCreate()
        /******************* Begin *******************/
        // 1.读取采集的数据创建临时视图表
         spark.read
                .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
                .option("header", value = true)
                .load("file:///data/bigfiles/nba_data_all.csv")
                .createOrReplaceTempView("data")

        // 2.根据数据清洗要求,完成任务。
        val dataFrame: DataFrame = spark.sql( """
                      select
                      distinct(id) id,info_url,player_name,player_pos,
                      case when player_ht like "%月%"
                            then concat(cast(regexp_extract(player_ht,"([0-9]{1})") * 30.48 + regexp_extract
                            (player_ht,"['月']([0-9]{1})") * 2.54 as decimal(10,2)),"cm")
                            when player_ht is not null
                            then concat(cast(split(player_ht,'-')[0] * 30.48 + split(player_ht,'-')[1] * 2.54 as decimal(10,2)),"cm")
                            else player_ht
                      end player_ht,
                      case when player_wt is not null
                           then concat(cast(regexp_extract(player_wt,"([0-9]{3})") * 0.4526 as decimal(10,2)),"kg")
                           else player_wt
                      end player_wt,
                      player_age,
                      case when country == "[]" or country is null then "Unknown" else country end country,
                      case when college is null then "Unknown" else college end college,
                      case when high_school is null then "Unknown" else high_school end high_school,
                      case when rank_year is not null
                           then concat_ws('-',regexp_extract(rank_year,"([0-9]{4})"),regexp_extract(rank_year,"[(]([0-9]+)"))
                           else rank_year
                      end rank_year,
                      draft,draft_date,work_year,team_count,
                      case when last_team_name is null then "Unknown" else last_team_name end last_team_name,
                      season,games_count,
                      case when PTS == "-" or PTS is null then 0.0 else PTS end PTS,
                      case when TRB == "-" or TRB is null then 0.0 else TRB end TRB,
                      case when AST == "-" or AST is null then 0.0 else AST end AST,
                      case when FG == "-" or FG is null then 0.0 else FG end FG,
                      case when FG3 == "-" or FG3 is null then 0.0 else FG3 end FG3,
                      case when FT == "-" or FT is null then 0.0 else FT end FT,
                      case when EFG == "-" or EFG is null then 0.0 else EFG end EFG,
                      case when PER == "-" or PER is null then 0.0 else PER end PER,
                      case when WS == "-" or WS is null then 0.0 else WS end WS,
                      case when firstTime is null then "9999-99-99" else firstTime end firstTime,
                      case when lastTime is null then "9999-99-99" else lastTime end lastTime
                      from data
                      
                      """.stripMargin)
        // 3.根据 id 升序排列,存储到 Mysql 指定库表中。
        dataFrame.orderBy("id")
                .repartition(1)
                .write
                .format("jdbc")
                .option("driver", "com.mysql.jdbc.Driver")
                .option("url","jdbc:mysql://127.0.0.1:3306/work")
                .option("dbtable", "nba_data")
                .option("user", "root")
                .option("password", "123123")
                .mode(SaveMode.Overwrite)
                .save()
        /******************* End *******************/
        spark.stop()
    }
} 
//过滤重复数据
distinct(id) id
//对字段身高player_ht进行计算,将英尺转为厘米单位,转换公式为:1英尺=30.48厘米,1英寸=2.54厘米,1英尺=12英寸,如:七尺一寸 7-1 -> 7 * 30.48 + 1 * 2.54 = 215.90cm;结果保留两位小数,并加上单位cm
case when player_ht like "%月%"
  then concat(cast(regexp_extract(player_ht,"([0-9]{1})") * 30.48 + regexp_extract
  (player_ht,"['月']([0-9]{1})") * 2.54 as decimal(10,2)),"cm")
  when player_ht is not null
  then concat(cast(split(player_ht,'-')[0] * 30.48 + split(player_ht,'-')[1] * 2.54 as decimal(10,2)),"cm")
  else player_ht
end player_ht,
//对字段体重player_wt进行计算,将英镑转为公斤单位,转换公式为:1英镑 = 0.4536公斤,如:260lb -> 260 * 0.4536 = 117.93kg;结果保留两位小数,并加上单位kg
case when player_wt is not null
  then concat(cast(regexp_extract(player_wt,"([0-9]{3})") * 0.4526 as decimal(10,2)),"kg")
  else player_wt
end player_wt
//将国籍country中的括号[]或空值替换为:字符串"Unknown"
case when country == "[]" or country is null then "Unknown" else country end country
//将大学college、高中high_school、球队名称last_team_name中所有的空值补充上字符串"Unknown"
case when college is null then "Unknown" else college end college,
case when high_school is null then "Unknown" else high_school end high_school,
case when last_team_name is null then "Unknown" else last_team_name end last_team_name
//将字段同届排名rank_year进行优化,如:2010 (71) -> 2010-71
case when rank_year is not null
  then concat_ws('-',regexp_extract(rank_year,"([0-9]{4})"),regexp_extract(rank_year,"[(]([0-9]+)"))
  else rank_year
end rank_year
//将字段 PTS | TRB | AST | FG | FG3 | FT| EFG | PER| WS中数据值为省略符号-或空值数据替换为:数字0.0
case when PTS == "-" or PTS is null then 0.0 else PTS end PTS,
case when TRB == "-" or TRB is null then 0.0 else TRB end TRB,
case when AST == "-" or AST is null then 0.0 else AST end AST,
case when FG == "-" or FG is null then 0.0 else FG end FG,
case when FG3 == "-" or FG3 is null then 0.0 else FG3 end FG3,
case when FT == "-" or FT is null then 0.0 else FT end FT,
case when EFG == "-" or EFG is null then 0.0 else EFG end EFG,
case when PER == "-" or PER is null then 0.0 else PER end PER,
case when WS == "-" or WS is null then 0.0 else WS end WS
//将首秀时间firstTime与退役时间lastTime中所有的空值替换为"9999-99-99"
case when firstTime is null then "9999-99-99" else firstTime end firstTime,
case when lastTime is null then "9999-99-99" else lastTime end lastTime

Logo

魔乐社区(Modelers.cn) 是一个中立、公益的人工智能社区,提供人工智能工具、模型、数据的托管、展示与应用协同服务,为人工智能开发及爱好者搭建开放的学习交流平台。社区通过理事会方式运作,由全产业链共同建设、共同运营、共同享有,推动国产AI生态繁荣发展。

更多推荐