NLTK Natural Language Processing in Practice: 2.1 Tokenization and Sentence Segmentation


火马编程


火马编程 · Published 2025-12-25 08:40:31

Publication date: 2025-12-25
Column: NLTK Natural Language Processing in Practice
Audience: beginners
Prerequisites: Python basics, NLTK basics

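1. Introduction

1.1 What Are Tokenization and Sentence Segmentation?

Tokenization splits a continuous text sequence into meaningful lexical units called tokens. Sentence segmentation splits text into individual sentences. Both are basic preprocessing steps in natural language processing (NLP), and almost every downstream NLP task depends on their output.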

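1.2 Why Learn Tokenization and Sentence Segmentation?

Basic preprocessing: tokenization and sentence segmentation are the first steps of an NLP pipeline; later tasks such as part-of-speech tagging, named entity recognition, and sentiment analysis all build on the tokenized result
Language understanding: correct tokenization helps a program capture the structure and meaning of a text
Cross-language differences: tokenization rules differ between languages, so knowing the techniques helps when handling multilingual text
Practical applications: search engines, machine translation, and text summarization all depend on tokenization and sentence segmentation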

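1.3 Learning Objectives

Understand the basic concepts and principles of tokenization and sentence segmentation
Know the commonly used tokenizers and sentence segmenters in NLTK
Be able to tokenize and segment English text with NLTK, and know that Chinese text requires an additional word segmentation library
Understand the strengths, weaknesses, and suitable use cases of the different tokenizers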

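2. Core Concepts

2.1 Basic Concepts of Tokenization

Tokenization breaks text into lexical units, which may be words, punctuation marks, numbers, and so on. The quality of tokenization directly affects the results of downstream NLP tasks.

Challenges of tokenization:

Ambiguity: some text can be segmented in more than one way; the classic Chinese example "下雨天留客天留我不留" can be broken up in several ways with different meanings
New words: newly coined words, such as internet slang, are hard for a tokenizer to recognize
Language differences: tokenization rules vary widely between languages; Chinese, for example, has no explicit word boundaries
Special formats: URLs, email addresses, emoticons, and similar formats must be kept intact or split correctly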
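
To make the special-format challenge concrete, here is a minimal sketch that compares a plain `\w+` pattern with a pattern that tries URLs and email addresses first. The pattern `SPECIAL_PATTERN` and the helper `tokenize_with_special_formats` are hypothetical names made up for this illustration, and the regular expression is deliberately simplified; it is not a complete URL or email grammar.

```python
from nltk.tokenize import RegexpTokenizer

# Hypothetical, simplified pattern. Alternatives are tried left to right, so
# URLs and email addresses are matched before the generic word and
# punctuation rules get a chance to break them apart.
SPECIAL_PATTERN = r"https?://\S+|[\w.+-]+@[\w-]+\.[\w.-]+|\w+|[^\w\s]"


def tokenize_with_special_formats(text):
    """Tokenize text while keeping URLs and email addresses as single tokens."""
    return RegexpTokenizer(SPECIAL_PATTERN).tokenize(text)


sample = "Email bob@example.com or visit https://example.com/docs now!"

# A plain \w+ pattern splits the address and the URL into fragments.
plain_tokens = RegexpTokenizer(r"\w+").tokenize(sample)
special_tokens = tokenize_with_special_formats(sample)

print(plain_tokens)
# ['Email', 'bob', 'example', 'com', 'or', 'visit', 'https', 'example', 'com', 'docs', 'now']
print(special_tokens)
# ['Email', 'bob@example.com', 'or', 'visit', 'https://example.com/docs', 'now', '!']
```

The order of the alternatives matters: if `\w+` came first, it would consume `bob` before the email rule was ever tried.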

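2.2 Tokenizers in NLTK

NLTK provides several tokenizers for different scenarios and requirements.

2.2.1 word_tokenize

word_tokenize is the most commonly used tokenization function in NLTK. It tokenizes English text into words and punctuation marks.

Features:

Splits the text into sentences with Punkt first, then applies a Treebank-style word tokenizer to each sentence (recent NLTK versions use NLTKWordTokenizer, an improved variant of TreebankWordTokenizer)
Separates punctuation marks and special characters from words
Suitable for most English tokenization scenarios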

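2.2.2 TreebankWordTokenizer

TreebankWordTokenizer is a basic rule-based tokenizer in NLTK that follows the Penn Treebank tokenization conventions.

Features:

Tokenizes according to the conventions of the Penn Treebank corpus
Applies special rules to contractions and punctuation; for example, it splits "don't" into "do" and "n't"
Expects one sentence at a time, and serves as the basis for word_tokenize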

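2.2.3 RegexpTokenizer

RegexpTokenizer tokenizes with a user-defined regular expression, which makes it very flexible.

Features:

Accepts a custom regular expression
Lets you tailor the tokenization rules to a specific need
Suitable for text with special formats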

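2.2.4 MWETokenizer

MWETokenizer (Multi-Word Expression Tokenizer) handles multi-word expressions by merging several words into a single token.

Features:

Merges a sequence of consecutive tokens into one token
Suitable for named entities, fixed collocations, and similar expressions
Works on an already tokenized list, so it can be combined with other tokenizers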
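
Since the examples later in this chapter do not cover MWETokenizer, here is a minimal sketch of how it might be used. The expressions and the sample sentence are made up for illustration; the input is split with `str.split()` only to keep the example free of downloaded resources, and any word tokenizer could produce the input list instead.

```python
from nltk.tokenize import MWETokenizer

# Register the multi-word expressions to merge. Matching is case-sensitive.
mwe_tokenizer = MWETokenizer(
    [("New", "York"), ("machine", "learning")],
    separator="_",
)
# More expressions can be added after construction.
mwe_tokenizer.add_mwe(("natural", "language", "processing"))

# MWETokenizer expects a list of tokens, not a raw string.
tokens = "She studies natural language processing and machine learning in New York".split()
merged = mwe_tokenizer.tokenize(tokens)

print(merged)
# ['She', 'studies', 'natural_language_processing', 'and', 'machine_learning', 'in', 'New_York']
```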

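2.3 Basic Concepts of Sentence Segmentation

Sentence segmentation splits text into individual sentences, usually at punctuation marks such as periods, question marks, and exclamation marks.

Challenges of sentence segmentation:

Ambiguous periods: a period can end a sentence, but it also appears in abbreviations (such as Mr. and Dr.) and as a decimal point
Line breaks: a line break in the text does not necessarily mark the end of a sentence
Special formats: lists, quotations, and similar structures are hard to segment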
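
The period ambiguity is easy to reproduce with a naive splitter. The sketch below uses only the standard library; `naive_sentence_split` is a hypothetical helper written for this illustration, not an NLTK function.

```python
import re


def naive_sentence_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


sample = "Mr. Smith paid $3.50. He left."
print(naive_sentence_split(sample))
# ['Mr.', 'Smith paid $3.50.', 'He left.']
```

The decimal point in 3.50 survives only because no whitespace follows it, while the abbreviation Mr. is wrongly treated as a sentence end. Handling such cases is the reason a dedicated sentence segmenter is useful.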

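2.4 Sentence Segmenters in NLTK

NLTK provides several sentence segmenters for different scenarios.

2.4.1 sent_tokenize

sent_tokenize is the most commonly used sentence segmentation function in NLTK. It splits English text into sentences.

Features:

Built on PunktSentenceTokenizer, using a pre-trained model
Handles common abbreviations, decimal points, and similar special cases
Suitable for most English sentence segmentation scenarios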

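2.4.2 PunktSentenceTokenizer

PunktSentenceTokenizer is NLTK's sentence segmenter based on unsupervised learning. It learns sentence boundary patterns, such as abbreviations, from raw text.

Features:

Based on an unsupervised learning algorithm
Detects sentence boundaries automatically
Supports multiple languages
Can be trained on your own text to produce a custom model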
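
As a rough sketch of custom training, passing raw text to the PunktSentenceTokenizer constructor trains a new model on that text. The training and test strings below are invented for illustration and are far too small for a reliable model, so the printed boundaries may be imperfect; a real model needs a large amount of in-domain text.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical in-domain training text. Real training needs much more data.
training_text = (
    "The patient was seen by Dr. Lee on Monday. Dr. Lee ordered a scan. "
    "The scan was reviewed by Dr. Lee and Dr. Kim. Dr. Kim approved the plan. "
    "The patient went home on Friday. A follow-up visit is planned."
)

# Passing text to the constructor trains a model on it.
# No pre-trained resource has to be downloaded for this.
custom_tokenizer = PunktSentenceTokenizer(training_text)

new_text = "Dr. Kim called the patient. The results were normal."
sentences = custom_tokenizer.tokenize(new_text)

for number, sentence in enumerate(sentences, 1):
    print(f"Sentence {number}: {sentence}")
```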

3. Code Examples

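3.1 Tokenizing with word_tokenize

Purpose: tokenize English text with NLTK's word_tokenize

Code:

import nltk
from nltk.tokenize import word_tokenize

# Download the Punkt sentence models that word_tokenize relies on
nltk.download('punkt')
# Recent NLTK releases (3.9 and later) load 'punkt_tab' instead of 'punkt'
nltk.download('punkt_tab')

# Sample text
text = "Hello, NLTK! This is a sample text. It contains multiple sentences."

# Tokenize
words = word_tokenize(text)

print("Original text:")
print(text)
print("\nTokens:")
print(words)
print(f"\nNumber of tokens: {len(words)}")

Explanation:

Import the word_tokenize function
Download the Punkt resource, which word_tokenize uses to split the text into sentences before tokenizing
Define the sample text
Tokenize it with word_tokenize
Print the original text and the tokens

Output:

Original text:
Hello, NLTK! This is a sample text. It contains multiple sentences.

Tokens:
['Hello', ',', 'NLTK', '!', 'This', 'is', 'a', 'sample', 'text', '.', 'It', 'contains', 'multiple', 'sentences', '.']

Number of tokens: 15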

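3.2 Tokenizing with TreebankWordTokenizer

Purpose: tokenize with NLTK's TreebankWordTokenizer and see how its rules work

Code:

from nltk.tokenize import TreebankWordTokenizer

# Create a tokenizer instance
tokenizer = TreebankWordTokenizer()

# Sample text
text = "Don't forget to buy milk, eggs, and bread."

# Tokenize
words = tokenizer.tokenize(text)

print("Original text:")
print(text)
print("\nTokens:")
print(words)

Explanation:

Import TreebankWordTokenizer
Create a tokenizer instance
Define a sample text that contains a contraction (Don't) and a comma-separated list
Tokenize it with the tokenize method
Print the original text and the tokens

Output:

Original text:
Don't forget to buy milk, eggs, and bread.

Tokens:
['Do', "n't", 'forget', 'to', 'buy', 'milk', ',', 'eggs', ',', 'and', 'bread', '.']

Note that the Penn Treebank rules split the contraction Don't into the two tokens Do and n't.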

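3.3 Custom Tokenization with RegexpTokenizer

Purpose: tokenize with NLTK's RegexpTokenizer using custom regular expressions

Code:

from nltk.tokenize import RegexpTokenizer

# Example 1: keep only word characters (letters, digits, and underscores)
tokenizer1 = RegexpTokenizer(r'\w+')
text = "Hello, NLTK! This is v1.0 of the tutorial."
words1 = tokenizer1.tokenize(text)
print("Example 1 - word characters only:")
print(words1)

# Example 2: keep only alphabetic words (no digits)
tokenizer2 = RegexpTokenizer(r'[a-zA-Z]+')
words2 = tokenizer2.tokenize(text)
print("\nExample 2 - letters only:")
print(words2)

# Example 3: split on whitespace
tokenizer3 = RegexpTokenizer(r'\s+', gaps=True)
words3 = tokenizer3.tokenize(text)
print("\nExample 3 - split on whitespace:")
print(words3)

Explanation:

Import RegexpTokenizer
Create three different tokenizers:
- tokenizer1: keeps only word characters (letters, digits, and underscores)
- tokenizer2: keeps only alphabetic words (no digits)
- tokenizer3: splits on whitespace (gaps=True means the regular expression matches the separators rather than the tokens)
Tokenize the same text with each one
Print the result of each tokenizer

Output:

Example 1 - word characters only:
['Hello', 'NLTK', 'This', 'is', 'v1', '0', 'of', 'the', 'tutorial']

Example 2 - letters only:
['Hello', 'NLTK', 'This', 'is', 'v', 'of', 'the', 'tutorial']

Example 3 - split on whitespace:
['Hello,', 'NLTK!', 'This', 'is', 'v1.0', 'of', 'the', 'tutorial.']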

3.4 Sentence Segmentation with sent_tokenize

Purpose: split English text into sentences with NLTK's sent_tokenize

Code:

from nltk.tokenize import sent_tokenize

# Sample text
text = "Hello, NLTK! This is a sample text. It contains multiple sentences. Mr. Smith is a doctor. He works at St. Mary's Hospital."

# Split into sentences
sentences = sent_tokenize(text)

print("Original text:")
print(text)
print("\nSentences:")
for i, sentence in enumerate(sentences, 1):
    print(f"Sentence {i}: {sentence}")
print(f"\nNumber of sentences: {len(sentences)}")

Explanation:

Import sent_tokenize
Define a sample text that contains abbreviations (Mr., St.)
Split it into sentences with sent_tokenize
Print the original text and the sentences

Output:

Original text:
Hello, NLTK! This is a sample text. It contains multiple sentences. Mr. Smith is a doctor. He works at St. Mary's Hospital.

Sentences:
Sentence 1: Hello, NLTK!
Sentence 2: This is a sample text.
Sentence 3: It contains multiple sentences.
Sentence 4: Mr. Smith is a doctor.
Sentence 5: He works at St. Mary's Hospital.

Number of sentences: 5

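4. Hands-On Case Study

4.1 Case Overview

Case name: tokenizing and segmenting a news text
Description: use NLTK to split a news text into sentences and tokens, then analyze its basic structure
Expected results:

Split the news text into sentences
Tokenize each sentence
Count the sentences and tokens in the text
Analyze the basic structure of the text

4.2 Case Analysis

Core question: how to tokenize and segment a realistic news text effectively with NLTK
Approach:

Obtain or prepare the news text
Split it into sentences with sent_tokenize
Tokenize each sentence with word_tokenize
Count and analyze the tokens
Print a summary of the text structure

Tools required:

The NLTK library
sent_tokenize and word_tokenize
Python's built-in data structures and arithmetic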

4.3 Implementation Steps

Step 1: prepare the news text

# Sample news text
news_text = """
Apple Inc. announced on Tuesday that it will release its latest iPhone model next month. The new device, which features a faster processor and improved camera, is expected to boost sales in the upcoming quarter.

According to industry analysts, the iPhone remains Apple's most profitable product, accounting for more than 50% of the company's revenue. However, competition from smartphone makers like Samsung and Huawei continues to intensify.

In related news, Apple also reported strong quarterly earnings, with revenue increasing by 10% compared to the same period last year. CEO Tim Cook attributed the growth to strong demand for the company's services segment, which includes Apple Music and iCloud.
"""

print("News text:")
print(news_text)

Step 2: sentence segmentation

from nltk.tokenize import sent_tokenize, word_tokenize

# Split into sentences; strip() drops the blank lines that the
# triple-quoted string adds at its start and end
sentences = sent_tokenize(news_text.strip())

print("\nSentences:")
for i, sentence in enumerate(sentences, 1):
    print(f"Sentence {i}: {sentence}")
print(f"\nTotal sentences: {len(sentences)}")

Step 3: tokenization

# Tokenize each sentence
all_tokens = []
sentence_lengths = []

for sentence in sentences:
    tokens = word_tokenize(sentence)
    all_tokens.extend(tokens)
    sentence_lengths.append(len(tokens))

# Count distinct tokens (case-sensitive, punctuation included)
unique_tokens = set(all_tokens)

print(f"\nTotal tokens: {len(all_tokens)}")
print(f"Distinct tokens: {len(unique_tokens)}")
print(f"Average tokens per sentence: {sum(sentence_lengths) / len(sentence_lengths):.2f}")

Step 4: text structure analysis

# Analyze the sentence length distribution
print("\nSentence length distribution:")
print(f"Shortest sentence: {min(sentence_lengths)} tokens")
print(f"Longest sentence: {max(sentence_lengths)} tokens")
print(f"Sentence lengths: {sentence_lengths}")

# Show the first 20 tokens
print(f"\nFirst 20 tokens: {all_tokens[:20]}")

4.4 Results and Analysis

Output:

News text:

Apple Inc. announced on Tuesday that it will release its latest iPhone model next month. The new device, which features a faster processor and improved camera, is expected to boost sales in the upcoming quarter.

According to industry analysts, the iPhone remains Apple's most profitable product, accounting for more than 50% of the company's revenue. However, competition from smartphone makers like Samsung and Huawei continues to intensify.

In related news, Apple also reported strong quarterly earnings, with revenue increasing by 10% compared to the same period last year. CEO Tim Cook attributed the growth to strong demand for the company's services segment, which includes Apple Music and iCloud.

Sentences:
Sentence 1: Apple Inc. announced on Tuesday that it will release its latest iPhone model next month.
Sentence 2: The new device, which features a faster processor and improved camera, is expected to boost sales in the upcoming quarter.
Sentence 3: According to industry analysts, the iPhone remains Apple's most profitable product, accounting for more than 50% of the company's revenue.
Sentence 4: However, competition from smartphone makers like Samsung and Huawei continues to intensify.
Sentence 5: In related news, Apple also reported strong quarterly earnings, with revenue increasing by 10% compared to the same period last year.
Sentence 6: CEO Tim Cook attributed the growth to strong demand for the company's services segment, which includes Apple Music and iCloud.

Total sentences: 6

Total tokens: 127
Distinct tokens: 92
Average tokens per sentence: 21.17

Sentence length distribution:
Shortest sentence: 14 tokens
Longest sentence: 26 tokens
Sentence lengths: [16, 23, 26, 14, 25, 23]

First 20 tokens: ['Apple', 'Inc.', 'announced', 'on', 'Tuesday', 'that', 'it', 'will', 'release', 'its', 'latest', 'iPhone', 'model', 'next', 'month', '.', 'The', 'new', 'device', ',']

Analysis:

The news text is split into 6 sentences, and the period in Apple Inc. is not mistaken for a sentence boundary
There are 127 tokens in total and 92 distinct tokens, a type-token ratio of about 72.4% (92/127); both counts include punctuation tokens
Each sentence averages about 21.2 tokens
The shortest sentence has 14 tokens and the longest has 26
The tokens include proper nouns (such as Apple, iPhone, Samsung, and Huawei), the abbreviation Inc. kept as a single token, and numbers (50 and 10) with the percent sign split off as a separate token

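4.5 Optimization and Extensions

Optimization suggestions:

Filter out punctuation and stop words to get more meaningful token statistics
Use a more specialized tokenizer, such as TweetTokenizer, for social media text
Compute the token frequency distribution to find the keywords of the text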
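
The three suggestions above can be sketched together as follows. The tweet and the tiny `STOPWORDS` set are made up for illustration; in practice NLTK's stopwords corpus, which requires a separate download, would be a more complete choice.

```python
from nltk import FreqDist
from nltk.tokenize import TweetTokenizer

# Hypothetical, deliberately tiny stop word list for this illustration.
STOPWORDS = {"the", "a", "is", "to", "and", "of", "in"}

tweet = "@nltk_org The new release is great!!! Loving the tokenizer :) #NLP #nlp"

# Lowercase the text and drop @handles; emoticons and hashtags stay intact.
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True)
tokens = tokenizer.tokenize(tweet)

# Keep tokens that contain at least one letter or digit and are not stop words.
# This removes punctuation and emoticons as well.
filtered = [
    token for token in tokens
    if token not in STOPWORDS and any(ch.isalnum() for ch in token)
]

freq = FreqDist(filtered)
print(freq.most_common(1))
# [('#nlp', 2)]
```

Because the text is lowercased, #NLP and #nlp are counted as the same hashtag. Whether that is desirable depends on the application.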

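Extension ideas:

Try tokenizing Chinese text alongside NLTK (this requires an additional Chinese word segmentation library)
Compare how different tokenizers behave and perform
Train a custom sentence segmentation model
Build a simple text summarization system on top of the sentence segmentation results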
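
As a starting point for comparing tokenizers, the sketch below runs three of NLTK's rule-based tokenizers on the same made-up sentence. None of them needs a downloaded resource. WordPunctTokenizer and WhitespaceTokenizer are not covered earlier in this chapter; they are included here only for contrast.

```python
from nltk.tokenize import (
    TreebankWordTokenizer,
    WhitespaceTokenizer,
    WordPunctTokenizer,
)

sample = "Don't pay $3.50!"

tokenizers = {
    "whitespace": WhitespaceTokenizer(),
    "wordpunct": WordPunctTokenizer(),
    "treebank": TreebankWordTokenizer(),
}

# Run every tokenizer on the same input and collect the results.
results = {name: tok.tokenize(sample) for name, tok in tokenizers.items()}

for name, tokens in results.items():
    print(f"{name:<10} {tokens}")
# whitespace ["Don't", 'pay', '$3.50!']
# wordpunct  ['Don', "'", 't', 'pay', '$', '3', '.', '50', '!']
# treebank   ['Do', "n't", 'pay', '$', '3.50', '!']
```

The whitespace tokenizer leaves punctuation attached, the word/punctuation tokenizer breaks the contraction and the decimal number apart, and the Treebank rules keep 3.50 whole while splitting the contraction into Do and n't.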

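5. Summary and Reflection

5.1 Chapter Summary

Tokenization: splitting continuous text into lexical units; a basic preprocessing step in NLP
Sentence segmentation: splitting text into individual sentences, which requires handling abbreviations, decimal points, and other special cases
NLTK tokenizers:
- word_tokenize: the commonly used English tokenizer, which combines Punkt sentence splitting with a Treebank-style word tokenizer
- TreebankWordTokenizer: a basic tokenizer that follows the Penn Treebank conventions
- RegexpTokenizer: a tokenizer driven by a custom regular expression
- MWETokenizer: a tokenizer that merges multi-word expressions
NLTK sentence segmenters:
- sent_tokenize: the commonly used English sentence segmenter, built on PunktSentenceTokenizer
- PunktSentenceTokenizer: a sentence segmenter based on unsupervised learning that supports multiple languages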