[NLP] Using the NLTK Toolkit

Reader submission 965 2025-04-01

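Study Notes

Contents

Study Notes

[NLP] Using the NLTK Toolkit

1. Natural Language Toolkit

2. Common Corpora and Lexicons

3. Common NLP Tools

3.1 Sentence Segmentation

3.2 Tokenization

3.3 Part-of-Speech Tagging

Reference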

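1. Natural Language Toolkit

NLTK provides a variety of corpora and lexicon resources, such as WordNet, together with common tools for sentence segmentation, tokenization, stemming, part-of-speech (POS) tagging, and syntactic parsing. It is used mainly for processing English text.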

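Downloading the NLTK data has quite a few pitfalls. If import nltk followed by nltk.download() fails, the following may help:

(1) nltk installation failed: the connection attempt failed because the connected party did not properly respond after a period of time, or the connected host failed to respond.

(2) Download the data directly from GitHub: https://github.com/nltk/nltk_data. At first I kept getting the error For more information see: https://www.nltk.org/data.html. Attempted to load tokenizers/punkt/english.pickle, even though nltk_data had already been extracted and placed in the correct path. After trying a few things, the error changed to OSError: No such file or directory: 'D:\\anaconda1\\envs\\tensorflow\\lib\\nltk_data\\tokenizers\\punkt\\PY3\\english.pickle', which showed that the PY3 folder was missing. Adding a PY3 folder by hand did not help either. In the end, downloading the punkt package again from GitHub and extracting it fixed the problem.

(3) If that still fails, load the tokenizer by absolute path: sent_detector = nltk.data.load(r'D:\local\Anaconda3\Lib\site-packages//nltk-data//tokenizers/punkt/english.pickle'). A blunt workaround, admittedly.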

Note:

The directories in which NLTK looks for its data can be listed with the following code:

import nltk
print(nltk.data.path)
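When the data has been extracted but NLTK still reports that it cannot find it, it can help to check which search directory actually contains the expected file. The following is a minimal sketch using only the standard library; find_resource is a hypothetical helper written for this note, not part of NLTK, and the relative path is a parameter because the resource NLTK looks for may differ between releases (for example punkt_tab instead of the punkt pickle files).

```python
from pathlib import Path


def find_resource(search_dirs, relative_path):
    """Return every directory in search_dirs that contains relative_path.

    Hypothetical helper for illustration; it is not an NLTK function.
    """
    hits = []
    for directory in search_dirs:
        candidate = Path(directory) / relative_path
        if candidate.exists():
            hits.append(str(directory))
    return hits


if __name__ == "__main__":
    try:
        import nltk
    except ImportError:
        print("nltk is not installed")
    else:
        # The path from the error message quoted earlier in this section.
        rel = "tokenizers/punkt/english.pickle"
        found = find_resource(nltk.data.path, rel)
        print(found or f"{rel} not found in: {nltk.data.path}")
```

An empty result means none of the directories in nltk.data.path holds the file, which usually points to the archive having been extracted one folder too deep or too shallow.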

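2. Common Corpora and Lexicons

Common corpora (text datasets), such as books, movie reviews, and chat logs, fall into two groups: unannotated corpora and manually annotated corpora.

In NLP tasks, stop words (such as the articles a and the, or the prepositions of and to) can often be removed: they carry little meaning on their own, and dropping them speeds up computation. The common English stop words are: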

from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
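Removing stop words from a token list then comes down to a membership test. The sketch below is one way to do it; remove_stopwords is a hypothetical helper, and the short demo_stop_words list is a stand-in so that the example runs without any NLTK data. In practice the list returned by stopwords.words('english') would be passed instead.

```python
def remove_stopwords(tokens, stop_words):
    """Keep the tokens whose lowercase form is not in stop_words."""
    stop_set = set(stop_words)  # set lookup is O(1) per token
    return [token for token in tokens if token.lower() not in stop_set]


# Tiny hand-written stop list (an assumption for this demo only);
# replace it with stopwords.words('english') when NLTK data is available.
demo_stop_words = ["a", "the", "of", "to", "is", "all", "it"]
tokens = ["It", "is", "all", "a", "joke"]
print(remove_stopwords(tokens, demo_stop_words))  # ['joke']
```

Lowercasing before the lookup matters because the NLTK stop-word list is all lowercase, while tokens at the start of a sentence are capitalized.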

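3. Common NLP Tools

3.1 Sentence Segmentation

Sentence segmentation splits a longer document into individual sentences.

The end of a sentence is usually clearly marked (by a period, question mark, exclamation mark, and so on).

There are special cases, though: in English, the period not only marks the end of a sentence but can also be part of a word, as in Mr.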
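To see why a period inside a word such as Mr. is a problem, consider a naive splitter that breaks after every period, question mark, or exclamation mark. The sketch below uses only the standard library; naive_split, split_with_abbreviations, and the three-entry abbreviation set are illustrative assumptions, and this is not how NLTK itself works (sent_tokenize relies on a trained Punkt model rather than a fixed list).

```python
import re


def naive_split(text):
    """Split after '.', '?' or '!' whenever whitespace follows."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text) if s]


# A deliberately tiny abbreviation list, for illustration only.
ABBREVIATIONS = {"mr.", "mrs.", "dr."}


def split_with_abbreviations(text, abbreviations=ABBREVIATIONS):
    """Glue a piece back onto the next one when it ends in an abbreviation."""
    sentences, current = [], []
    for piece in naive_split(text):
        current.append(piece)
        last_word = piece.split()[-1].lower()
        if last_word not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # text ended on an abbreviation
        sentences.append(" ".join(current))
    return sentences


text = "Mr. Knightley loves to find fault with me. It is all a joke."
print(naive_split(text))
# ['Mr.', 'Knightley loves to find fault with me.', 'It is all a joke.']
print(split_with_abbreviations(text))
# ['Mr. Knightley loves to find fault with me.', 'It is all a joke.']
```

A fixed abbreviation list quickly becomes incomplete on real text, which is the motivation for using a trained sentence tokenizer instead.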

# Sentence segmentation
from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize
text = gutenberg.raw("austen-emma.txt")
sentences = sent_tokenize(text)  # split the full text of the novel Emma into sentences
print(sentences[100])  # show one of the sentences

One of the resulting sentences is:

Mr. Knightley loves to find fault with me, you know-- in a joke--it is all a joke.

You can also try it on sentences of your own:

from nltk.tokenize import sent_tokenize
mytext = "Hello Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(sent_tokenize(mytext))

The result is:

['Hello Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']

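3.2 Tokenization

The most basic input unit in NLP is the token, which can be a word or a punctuation mark.

A typical task is to separate the punctuation mark at the end of a sentence from the word before it.

This can be done with nltk.tokenize.word_tokenize.

Continuing with the sentence sentences[100] from above:

# Tokenization
from nltk.tokenize import word_tokenize
print(word_tokenize(sentences[100]))

The tokens of that sentence are: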

['Mr.', 'Knightley', 'loves', 'to', 'find', 'fault', 'with', 'me', ',', 'you', 'know', '--', 'in', 'a', 'joke', '--', 'it', 'is', 'all', 'a', 'joke', '.']
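For comparison, a rough tokenizer can be written with a single regular expression. This is only a sketch under simple assumptions (runs of word characters, a double hyphen, or any single punctuation mark each form a token); simple_tokenize is a hypothetical name, and the pattern is not the rule set that word_tokenize uses.

```python
import re

# Word characters, a double hyphen, or any single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|--|[^\w\s]")


def simple_tokenize(text):
    """Return the tokens matched by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("in a joke--it is all a joke."))
# ['in', 'a', 'joke', '--', 'it', 'is', 'all', 'a', 'joke', '.']
print(simple_tokenize("Mr. Knightley"))
# ['Mr', '.', 'Knightley']
```

The second call shows where the sketch falls short: it detaches the period from Mr, whereas the NLTK output above keeps 'Mr.' as one token.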

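3.3 Part-of-Speech Tagging

POS tagging determines the part of speech of each word from its context.

For example, fire plays different roles in They sat by the fire and They fire a gun: it is a noun in the first and a verb in the second.

# POS tagging
from nltk import pos_tag
from nltk.tokenize import word_tokenize
# Tokenize each sentence first, then tag the tokens
In [3]: pos_tag(word_tokenize("They sat by the fire."))
Out[3]: [('They', 'PRP'), ('sat', 'VBP'), ('by', 'IN'), ('the', 'DT'), ('fire', 'NN'), ('.', '.')]
In [4]: pos_tag(word_tokenize("They fire a gun."))
Out[4]: [('They', 'PRP'), ('fire', 'VBP'), ('a', 'DT'), ('gun', 'NN'), ('.', '.')]

In the results above, fire is tagged as a noun (NN) in the first sentence and as a verb (VBP) in the second. If you do not know what a tag means, you can look it up with the built-in help: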

nltk.help.upenn_tagset('NN')
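The result of pos_tag is a plain list of (word, tag) tuples, so it can be filtered with ordinary Python. The sketch below reuses the two tagged sentences shown above as literal data, so it runs without NLTK; words_with_tag is a hypothetical helper. It relies on the Penn Treebank convention that noun tags start with NN and verb tags start with VB.

```python
def words_with_tag(tagged_tokens, tag_prefix):
    """Return the words whose tag starts with tag_prefix (e.g. 'NN', 'VB')."""
    return [word for word, tag in tagged_tokens if tag.startswith(tag_prefix)]


# Tagged output copied from the two examples above.
sentence_1 = [('They', 'PRP'), ('sat', 'VBP'), ('by', 'IN'),
              ('the', 'DT'), ('fire', 'NN'), ('.', '.')]
sentence_2 = [('They', 'PRP'), ('fire', 'VBP'), ('a', 'DT'),
              ('gun', 'NN'), ('.', '.')]

print(words_with_tag(sentence_1, "NN"))  # ['fire']
print(words_with_tag(sentence_2, "NN"))  # ['gun']
print(words_with_tag(sentence_2, "VB"))  # ['fire']
```

Matching on a prefix rather than an exact tag also picks up the related tags, such as NNS for plural nouns or VBD for past-tense verbs.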

Reference

(1) NLTK official site: https://www.nltk.org/

(2) https://github.com/nltk/nltk_data
