Spark SQL: Writing DataFrames to Files and Reading/Writing Databases (MySQL as the Example)


2401_83973995 · Published 2024-04-14 13:41:19

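# coding:utf8

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, IntegerType, StringType
import pyspark.sql.functions as F

spark = SparkSession.builder.\
    appName('write').\
    master('local[*]').\
    getOrCreate()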

sc = spark.sparkContext

# 1. Read the input file (tab-separated, no header row)
schema = StructType().add('user_id', StringType(), nullable=True).\
    add('movie_id', IntegerType(), nullable=True).\
    add('rank', IntegerType(), nullable=True).\
    add('ts', StringType(), nullable=True)

df = spark.read.format('csv').\
    option('sep', '\t').\
    option('header', False).\
    option('encoding', 'utf-8').\
    schema(schema=schema).\
    load('../input/u.data')

# write text: the text format accepts only a single string column,
# so concatenate all columns into one column first
df.select(F.concat_ws('---', 'user_id', 'movie_id', 'rank', 'ts')).\
    write.\
    mode('overwrite').\
    format('text').\
    save('../output/sql/text')

# write csv
df.write.mode('overwrite').\
    format('csv').\
    option('sep',';').\
    option('header', True).\
    save('../output/sql/csv')

# write json
df.write.mode('overwrite').\
    format('json').\
    save('../output/sql/json')

# write parquet
df.write.mode('overwrite').\
    format('parquet').\
    save('../output/sql/parquet')


![](https://img-blog.csdnimg.cn/e04fbbf14cc0432f8dc7e0095afd2e21.png)
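The text writer in the listing above only accepts a single string column, which is why the columns are first merged with `F.concat_ws`. Below is a minimal sketch of that idea. `to_text_line` is a hypothetical plain-Python stand-in that mirrors how `concat_ws` joins values (nulls are skipped), and `write_as_text` is a hypothetical helper that wraps the same write call for any DataFrame. The sample row is only an illustration in the `u.data` column layout.

```python
def to_text_line(values, sep="---"):
    """Plain-Python stand-in for F.concat_ws: join non-null values as strings."""
    return sep.join(str(v) for v in values if v is not None)


def write_as_text(df, path, sep="---"):
    """Collapse all columns of a Spark DataFrame into one string column and write it as text.

    Hypothetical helper; pyspark is imported lazily so the pure-Python part
    above can be tried without a Spark installation.
    """
    from pyspark.sql import functions as F

    single_col = df.select(F.concat_ws(sep, *df.columns).alias("value"))
    single_col.write.mode("overwrite").format("text").save(path)


# Illustrative row in the (user_id, movie_id, rank, ts) layout
sample_row = ["196", 242, 3, "881250949"]
line = to_text_line(sample_row)
print(line)

# A null field is dropped rather than written as an empty slot
line_with_null = to_text_line(["196", None, 3, "881250949"])
print(line_with_null)
```

Because nulls are skipped, a line with a missing field has fewer separators than the others. If the output has to be parsed back into columns, the CSV, JSON or Parquet writers are probably a safer choice than text.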


#### 2. Writing to a MySQL Database


API usage:


![](https://img-blog.csdnimg.cn/4a3c81f6b0094cd99274f01c6a57f32c.png)


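**Note:**

① In the JDBC connection string, setting useSSL=false is recommended so that the connection succeeds. It connects without the SSL security protocol.

② In the JDBC connection string, setting useUnicode=true is recommended so that text is not garbled in transit.

③ Call save() with no arguments. There is no path, because the target is a database.

④ The dbtable option specifies the name of the table to write to.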
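The four notes above can be sketched as two small helpers. This is an illustration under assumptions, not the article's own code: `build_mysql_jdbc_url` and `write_to_mysql` are hypothetical names, and the host, database and parameter values simply repeat the ones used in this article. The URL builder also shows why the time zone appears as `GMT%2B8`: the `+` in `GMT+8` has to be percent-encoded inside a URL query string.

```python
from urllib.parse import urlencode


def build_mysql_jdbc_url(host, port, database, **params):
    """Assemble a MySQL JDBC URL; urlencode percent-escapes values such as 'GMT+8'."""
    base = f"jdbc:mysql://{host}:{port}/{database}"
    query = urlencode(params)
    return f"{base}?{query}" if query else base


def write_to_mysql(df, url, table, user, password, mode="overwrite"):
    """Write a Spark DataFrame to one MySQL table (hypothetical helper)."""
    (df.write
       .mode(mode)
       .format("jdbc")
       .option("url", url)            # notes 1 and 2 live in the URL parameters
       .option("dbtable", table)      # note 4: target table name
       .option("user", user)
       .option("password", password)
       .save())                       # note 3: no path argument for a database


url = build_mysql_jdbc_url(
    "pyspark01", 3306, "bigdata",
    useSSL="false", useUnicode="true", serverTimezone="GMT+8",
)
print(url)
```

With a DataFrame in hand, the call would look like `write_to_mysql(df, url, 'movie_data', 'root', '123456')`. The MySQL JDBC driver still has to be available to Spark for the write to succeed.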

# coding:utf8

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, IntegerType, StringType
import pyspark.sql.functions as F

spark = SparkSession.builder.\
    appName('write').\
    master('local[*]').\
    getOrCreate()

sc = spark.sparkContext

# 1. Read the input file (tab-separated, no header row)
schema = StructType().add('user_id', StringType(), nullable=True).\
    add('movie_id', IntegerType(), nullable=True).\
    add('rank', IntegerType(), nullable=True).\
    add('ts', StringType(), nullable=True)

df = spark.read.format('csv').\
    option('sep', '\t').\
    option('header', False).\
    option('encoding', 'utf-8').\
    schema(schema=schema).\
    load('../input/u.data')

# 2. Write the DataFrame to the MySQL database
df.write.mode('overwrite').\
    format('jdbc').\
    option('url', 'jdbc:mysql://pyspark01:3306/bigdata?useSSL=false&useUnicode=true&serverTimezone=GMT%2B8').\
    option('dbtable', 'movie_data').\
    option('user', 'root').\
    option('password', '123456').\
    save()

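# Read the table back from MySQL (same connection options as the write above)
df2 = spark.read.format('jdbc'). \
    option('url', 'jdbc:mysql://pyspark01:3306/bigdata?useSSL=false&useUnicode=true&serverTimezone=GMT%2B8'). \
    option('dbtable', 'movie_data'). \
    option('user', 'root'). \
    option('password', '123456'). \
    load()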
