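Guangxi Minzu University Advanced Artificial Intelligence Course - 头歌 (EduCoder) Practice Teaching Platform - Rule-Based Word Segmentation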

gxmzuai

gxmzuai · Published 2023-12-19 12:01:14

1. Forward Maximum Matching

Code file

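def cutA(sentence, dictA):
    # sentence: the sentence to segment; dictA: the collection of dictionary words
    result = []
    sentenceLen = len(sentence)
    n = 0
    # Length of the longest dictionary word; default=0 keeps an empty dictionary from raising
    maxDictA = max((len(word) for word in dictA), default=0)
    
    while n < sentenceLen:
        matched = False
        # Try the longest candidate first, shrinking one character at a time
        for i in range(maxDictA, 0, -1):
            if n + i <= sentenceLen:
                word = sentence[n:n + i]
                if word in dictA:
                    result.append(word)
                    n += i
                    matched = True
                    break
        if not matched:
            # No dictionary word starts here: emit a single character as a word
            result.append(sentence[n])
            n += 1
    
    print(result)  # Print the segmentation result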

# Test cases
if __name__ == "__main__":
    dictA = set(["南京市", "长江", "大桥"])
    sentences = ["南京市", "南京市长", "长江大桥", "大桥", "南京市长江大桥"]
    
    for sentence in sentences:
        cutA(sentence, dictA)

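Problem description

Task description

Task for this level: using the basics of Chinese word segmentation covered in this level, apply rule-based segmentation to write a forward maximum matching program that passes all test cases.

Programming requirements

Following the hints, add Python code between the Begin and End markers in the editor on the right to implement the forward maximum matching algorithm: segment sentence using the supplied dictionary and print the result. Both the dictionary and sentence are read from the backend via input.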

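Test notes

The platform runs your code against its test set; you pass the level once every result is correct.

Test input: 南京市南京市长长江大桥大桥南京市长江大桥

(The input appears here run together. Judging from the expected output, it likely consists of the dictionary words 南京市, 南京市长, 长江大桥 and 大桥, followed by the sentence 南京市长江大桥.)

Expected output: ['南京市长', '江', '大桥']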

Start your task now, and good luck!
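
The expected output in the test notes can be reproduced with a small variant of the listing that returns the list instead of printing it. This is only a sketch: the function name `forward_max_match` is made up for illustration, and the dictionary is an assumption inferred from the test input (the words 南京市, 南京市长, 长江大桥 and 大桥, with 南京市长江大桥 as the sentence).

```python
def forward_max_match(sentence, dictionary):
    """Return the forward maximum matching segmentation as a list."""
    # Longest dictionary word, never less than one character.
    max_len = max([1] + [len(w) for w in dictionary])
    result, n = [], 0
    while n < len(sentence):
        # Try the longest window first, shrinking down to one character.
        for i in range(min(max_len, len(sentence) - n), 0, -1):
            word = sentence[n:n + i]
            # A single character is always accepted as the fallback.
            if word in dictionary or i == 1:
                result.append(word)
                n += i
                break
    return result


# Assumed dictionary, inferred from the test input (not stated explicitly).
words = {"南京市", "南京市长", "长江大桥", "大桥"}
print(forward_max_match("南京市长江大桥", words))
# ['南京市长', '江', '大桥']
```

Because the scan is greedy from the left, 南京市长 wins over 南京市, which leaves 江 stranded as a single character.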

2. Reverse Maximum Matching

Code file

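def cutB(sentence, dictB):
    result = []
    sentenceLen = len(sentence)
    # Length of the longest dictionary word; default=0 keeps an empty dictionary from raising
    maxDictB = max((len(word) for word in dictB), default=0)
    # Task: complete the code for the reverse maximum matching algorithm
    # ********** Begin *********#
    n = sentenceLen
    while n > 0:
        matched = False
        for i in range(maxDictB, 0, -1):
            if n - i < 0:  # Window would start before the sentence: skip
                continue
            word = sentence[n - i:n]
            if word in dictB:
                result.append(word)
                n -= i
                matched = True
                break
        if not matched:  # No dictionary word ends here: emit a single character as a word
            result.append(sentence[n - 1])
            n -= 1
    # ********** End **********#
    # Words were collected right to left, so reverse them before printing
    print(result[::-1], end="")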

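Problem description

Task description

Task for this level: using the basics of Chinese word segmentation covered in this level, apply rule-based segmentation to write a reverse maximum matching program that passes all test cases.

Programming requirements

Following the hints, add Python code between the Begin and End markers in the editor on the right to implement the reverse maximum matching algorithm: segment sentence using the supplied dictionary and print the result. Both the dictionary and sentence are read from the backend via input.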

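Test notes

The platform runs your code against its test set; you pass the level once every result is correct.

Test input: 南京市南京市长长江大桥大桥南京市长江大桥

(The input appears here run together. Judging from the expected output, it likely consists of the dictionary words 南京市, 南京市长, 长江大桥 and 大桥, followed by the sentence 南京市长江大桥.)

Expected output: ['南京市', '长江大桥']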

Start your task now, and good luck!
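
As a cross-check on the expected output, here is a minimal sketch of reverse maximum matching that returns the list rather than printing it. The name `reverse_max_match` is hypothetical, and the dictionary is an assumption inferred from the test input (南京市, 南京市长, 长江大桥, 大桥, with 南京市长江大桥 as the sentence).

```python
def reverse_max_match(sentence, dictionary):
    """Return the reverse maximum matching segmentation as a list."""
    # Longest dictionary word, never less than one character.
    max_len = max([1] + [len(w) for w in dictionary])
    result, n = [], len(sentence)
    while n > 0:
        # Try the longest window ending at position n, shrinking to one character.
        for i in range(min(max_len, n), 0, -1):
            word = sentence[n - i:n]
            # A single character is always accepted as the fallback.
            if word in dictionary or i == 1:
                result.append(word)
                n -= i
                break
    # Words were collected right to left, so restore reading order.
    return result[::-1]


# Assumed dictionary, inferred from the test input (not stated explicitly).
words = {"南京市", "南京市长", "长江大桥", "大桥"}
print(reverse_max_match("南京市长江大桥", words))
# ['南京市', '长江大桥']
```

Scanning from the right picks 长江大桥 first, so the remaining prefix 南京市 matches cleanly and no single character is left over.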

3. Bidirectional Maximum Matching

Code file

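class BiMM:
    def __init__(self):
        # Length of the longest word in the dictionary (hard-coded to 3;
        # longer dictionary entries can never be matched with this value)
        self.window_size = 3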

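    def MMseg(self, text, word_dict):  # Forward maximum matching
        result = []
        index = 0
        text_length = len(text)
        while text_length > index:
            # The window end runs from the longest candidate down to one character,
            # clamped so it never passes the end of the text
            for size in range(min(self.window_size + index, text_length), index, -1):
                piece = text[index:size]
                if piece in word_dict:
                    index = size - 1
                    break
            # On a match index lands on size; otherwise piece is a single character
            index += 1
            result.append(piece)
        return result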

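    def RMMseg(self, text, word_dict):  # Reverse maximum matching
        result = []
        index = len(text)
        while index > 0:
            # The window start runs from the longest candidate up to one character;
            # max(..., 0) prevents a negative start, which would slice from the end
            for size in range(max(index - self.window_size, 0), index):
                piece = text[size:index]
                if piece in word_dict:
                    index = size + 1
                    break
            # On a match index lands on size; otherwise piece is a single character
            index = index - 1
            result.append(piece)
        # Words were collected right to left, so restore reading order
        result.reverse()
        return result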

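    def main(self, text, r1, r2):
        # Compare the results of the two segmentation methods
        if len(r1) != len(r2):
            # Different word counts: prefer the result with fewer words
            result = r1 if len(r1) < len(r2) else r2
        else:
            # Count the single-character words in each result
            r1_single_chars = sum(len(word) == 1 for word in r1)
            r2_single_chars = sum(len(word) == 1 for word in r2)

            # Compare the counts and pick the result with fewer single-character words
            result = r1 if r1_single_chars <= r2_single_chars else r2
        
        # Print the result instead of returning it
        print(result)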
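
The selection rule in main can be tried end to end with the self-contained sketch below. It is an illustration under assumptions rather than the platform's reference solution: `max_match` and `choose_segmentation` are hypothetical helper names, the window is derived from the dictionary instead of being fixed at 3, and the dictionary is the one inferred from the test inputs of the earlier levels.

```python
def max_match(text, dictionary, reverse=False):
    """Greedy longest-match segmentation, scanning forward or backward."""
    # Longest dictionary word, never less than one character.
    max_len = max([1] + [len(w) for w in dictionary])
    result = []
    start, end = 0, len(text)
    while start < end:
        # Try the longest window first; a single character always succeeds.
        for size in range(min(max_len, end - start), 0, -1):
            piece = text[end - size:end] if reverse else text[start:start + size]
            if piece in dictionary or size == 1:
                break
        result.append(piece)
        if reverse:
            end -= size
        else:
            start += size
    # A backward scan collects words right to left, so restore reading order.
    return result[::-1] if reverse else result


def choose_segmentation(forward, backward):
    """Prefer fewer words, then fewer single-character words (ties go to forward)."""
    if len(forward) != len(backward):
        return min(forward, backward, key=len)
    singles_f = sum(len(w) == 1 for w in forward)
    singles_b = sum(len(w) == 1 for w in backward)
    return forward if singles_f <= singles_b else backward


# Assumed dictionary, inferred from the earlier test inputs.
words = {"南京市", "南京市长", "长江大桥", "大桥"}
text = "南京市长江大桥"
fwd = max_match(text, words)                # ['南京市长', '江', '大桥']
bwd = max_match(text, words, reverse=True)  # ['南京市', '长江大桥']
print(choose_segmentation(fwd, bwd))
# ['南京市', '长江大桥']
```

Here the forward scan yields three words and the backward scan two, so the first rule (fewer words) already decides in favor of the backward result; the single-character count only matters when the word counts tie.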