今天在向 ClickHouse 测试表插入数据集时,报了以下错误:

ClickHouse.localdomain :) INSERT INTO hits_NoPrimaryKey SELECT
                             intHash32(c11::UInt64) AS UserID,
                             c15 AS URL,
                             c5 AS EventTime
                          FROM url('https://datasets.clickhouse.com/hits/tsv/hits_v1.tsv.xz')
                          WHERE URL != '';

INSERT INTO hits_NoPrimaryKey SELECT
    intHash32(CAST(c11, 'UInt64')) AS UserID,
    c15 AS URL,
    c5 AS EventTime
FROM url('https://datasets.clickhouse.com/hits/tsv/hits_v1.tsv.xz')
WHERE URL != ''

Query id: 40869efe-bbf8-41eb-82f9-4200cb72948a

→ Progress: 446.34 thousand rows, 843.07 MB (30.34 thousand rows/s., 57.31 MB/s.)                                    (0.0 CPU, 117.94 MB RAM)
0 rows in set. Elapsed: 14.711 sec. Processed 446.34 thousand rows, 843.07 MB (30.34 thousand rows/s., 57.31 MB/s.)

Received exception from server (version 22.9.1):
Code: 49. DB::Exception: Received from localhost:9000. DB::Exception: Failed to read all the data from the reader, missing 10485760 bytes: While executing ParallelParsingBlockInputFormat: While executing URL. (LOGICAL_ERROR)

主要关注错误信息:DB::Exception: Failed to read all the data from the reader, missing 10485760 bytes。预测可能是因为网络不稳定,数据下载波动导致的。因为多次执行发现有 timeout 字眼:

Received exception from server (version 22.9.1):
Code: 636. DB::Exception: Received from localhost:9000. DB::Exception: Cannot extract table structure from TSV format file:
Poco::Exception. Code: 1000, e.code() = 0, Timeout: connect timed out: 172.66.43.7:443 (version 22.9.1.2514 (official build))
You can specify the structure manually. (CANNOT_EXTRACT_TABLE_STRUCTURE)

所以遇到这网络的问题,可以将数据 wget 到本地,然后采用本地文件入库的方式导入,例如:

wget https://datasets.clickhouse.com/hits/tsv/hits_v1.tsv.xz
xz -d https://datasets.clickhouse.com/hits/tsv/hits_v1.tsv.xz
clickhouse-client --query "INSERT INTO hits_NoPrimaryKey FORMAT TSV" --max_insert_block_size=100000 --password < hits_v1.tsv

通过上面的 clickhouse 客户端导入,肯定会报错,因为从上面的 url 函数可以看出,其实它只选择了 3 列,所以我们还需要做一些改变才能在本地导入:
在这里插入图片描述
这里我们对数据进行筛选,我使用 cut 语句选择出这三列的数据:

cut -f5,11,15 < hits_v1.tsv > result.tsv # 先选择出这 3 列数据

重建一下表,修改一下字段顺序:

CREATE TABLE hits_NoPrimaryKey
(
    `EventTime` DateTime,
    `UserID` UInt32,
    `URL` String
)
ENGINE = MergeTree
PRIMARY KEY tuple();

使用客户端导入数据:

clickhouse-client --query "INSERT INTO hits_NoPrimaryKey FORMAT TSV" --max_insert_block_size=100000 --password < result.tsv

查询入库数据量:

ClickHouse.localdomain :) select count(*) from hits_NoPrimaryKey;

SELECT count(*)
FROM hits_NoPrimaryKey

Query id: 104a30dc-d281-463d-b4b2-3a9c9cc406a4

┌─count()─┐
│ 8873898 │
└─────────┘

1 row in set. Elapsed: 0.001 sec.

如果对您有用,麻烦动动小手【关注】、【点赞】、【收藏】,谢谢~

End~

Logo

魔乐社区(Modelers.cn) 是一个中立、公益的人工智能社区,提供人工智能工具、模型、数据的托管、展示与应用协同服务,为人工智能开发及爱好者搭建开放的学习交流平台。社区通过理事会方式运作,由全产业链共同建设、共同运营、共同享有,推动国产AI生态繁荣发展。

更多推荐