Python Data Mining - 伙伴云


User submission 755 2025-03-31


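Overview of Data Mining

Definition of Data Mining

Data mining is the process of extracting previously unknown, valuable information and knowledge from large volumes of data, using methods from statistics, artificial intelligence, machine learning, and related fields.

Difference Between Data Mining and Data Analysis

Models and Algorithms

Model: either quantitative (a mathematical formula) or qualitative (a rule, for example age > 30, income > 10,000 yuan).

Algorithm: the concrete steps and methods used to implement a data mining technique or model.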
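To make the two kinds of model concrete, here is a minimal sketch. The rule thresholds come from the example above; combining the two conditions with "and" is an assumption, and the function names and formula coefficients are hypothetical, chosen only for illustration.

```python
def rule_model(age, income):
    """Qualitative model: a rule (age > 30 and income > 10,000 yuan)."""
    # Assumption: both conditions must hold at the same time
    return age > 30 and income > 10_000


def formula_model(age, income, w_age=0.5, w_income=0.001, bias=-10.0):
    """Quantitative model: a linear formula with made-up coefficients."""
    return w_age * age + w_income * income + bias


print(rule_model(35, 12_000))     # a yes/no decision
print(formula_model(35, 12_000))  # a numeric score
```

The algorithm, in this picture, is whatever procedure finds the thresholds or coefficients from data.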

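Common Data Mining Problems

Classification

Categorical target variable (Y): supervised classification

Trained on historical samples whose class is already known

Predicts the class of samples whose class is unknown

Common classification methods: decision trees, Bayesian classifiers, KNN, support vector machines, neural networks, logistic regression, and others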
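As an illustration of the train-then-predict pattern described above, here is a minimal sketch of KNN, one of the listed methods, in plain Python. The toy data, labels, and the function name `knn_predict` are assumptions made for this example.

```python
from collections import Counter
import math


def knn_predict(train_X, train_y, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest samples."""
    # Sort the labeled samples by Euclidean distance to the query
    neighbors = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    # Majority vote among the k closest labels
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Toy historical samples: (age, income in thousands of yuan) with a known class
train_X = [(25, 4), (28, 5), (30, 6), (45, 15), (50, 18), (55, 20)]
train_y = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(train_X, train_y, (48, 16)))
```

In practice the features would usually be scaled first, since Euclidean distance is sensitive to units.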

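Clustering

No categorical target variable: unsupervised learning

Based on the idea that similar things group together

Common clustering algorithms: partitioning, hierarchical, density-based, grid-based, and model-based clustering, among others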
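A minimal sketch of partitioning clustering is the basic k-means loop below. It is only an illustration: the initial centers are simply the first k points (an assumption that keeps the run deterministic but is not robust), and the data are made up.

```python
import math


def kmeans(points, k, iterations=10):
    """Partition `points` into k clusters with a basic k-means loop."""
    # Assumption: take the first k points as the initial centers
    centers = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old center if a cluster is empty
                centers[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centers, clusters


points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centers, clusters = kmeans(points, k=2)
print(centers)
print(clusters)
```

No labels are supplied anywhere; the groups emerge only from the distances between points.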

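Association

No target variable: unsupervised learning

Identifies frequently occurring patterns based on associations between data items

Common association algorithms: the Apriori algorithm, the Carma algorithm, and sequence algorithms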
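The idea of a "frequently occurring pattern" can be sketched by counting how often pairs of items appear together. The code below is not a full Apriori implementation; it only shows brute-force support counting for 2-itemsets, and the transactions and the function name `frequent_pairs` are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs whose support (share of transactions) is >= min_support."""
    counts = Counter()
    for items in transactions:
        # Count each unordered pair at most once per transaction
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]
print(frequent_pairs(transactions, min_support=0.5))
```

Apriori avoids this exhaustive enumeration by only extending itemsets that are already known to be frequent.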

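Prediction

Numeric target variable: supervised learning

Requires historical samples with known target values to train the model

Predicts the target value of samples whose value is unknown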


Common prediction methods: simple linear regression, multiple linear regression, and time series analysis.
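Simple linear regression, the first method listed, can be sketched with the closed-form least-squares solution. The data below are made up so that the true relationship is y = 1 + 2x; the function name is hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Historical samples with known target values
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

intercept, slope = fit_simple_linear_regression(xs, ys)
# Predict the target value for an unseen sample x = 6
print(intercept + slope * 6)
```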

Data Mining Process

Text Analysis

Word Frequency Statistics - Building a Corpus

Corpus: the collection of all the documents we want to analyze.

# -*- coding: utf-8 -*-
import codecs
import os
import os.path

import pandas

# Step 1: walk the sample directory and collect every file path
filePaths = []
for root, dirs, files in os.walk("./Sample"):
    for name in files:
        filePaths.append(os.path.join(root, name))

# Step 2: collect the paths again, this time reading each file's content too
filePaths = []
fileContents = []
for root, dirs, files in os.walk("D:\\PDM\\2.1\\SogouC.mini\\Sample"):
    for name in files:
        filePath = os.path.join(root, name)
        filePaths.append(filePath)
        # Read the document as UTF-8 text; the file is closed automatically
        with codecs.open(filePath, 'r', 'utf-8') as f:
            fileContent = f.read()
        fileContents.append(fileContent)

# The corpus: one row per document
corpos = pandas.DataFrame({
    'filePath': filePaths,
    'fileContent': fileContents
})
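For comparison, the same corpus-building step can be sketched with `pathlib` from the standard library. This is an alternative under stated assumptions, not the article's method: the helper name `build_corpus` is hypothetical, and every file under the root is assumed to be UTF-8 text.

```python
from pathlib import Path


def build_corpus(root):
    """Return one {'filePath', 'fileContent'} record per file under `root`."""
    corpus = []
    # rglob("*") visits the directory tree recursively, like os.walk
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            corpus.append({
                "filePath": str(path),
                # Assumption: every file is UTF-8 encoded text
                "fileContent": path.read_text(encoding="utf-8"),
            })
    return corpus
```

Passing the returned list to `pandas.DataFrame(...)` yields a frame with the same two columns as `corpos`.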

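Word Frequency Statistics - Chinese Word Segmentation

Chinese word segmentation means splitting a sequence of Chinese characters into individual words.

Stop words are words that need to be filtered out during data processing.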

# -*- coding: utf-8 -*-
import codecs
import os
import os.path

import jieba
import pandas

# Basic segmentation
for w in jieba.cut("我爱Python"):
    print(w)

for w in jieba.cut("""
工信处女干事每月经过下属科室都要亲口交代
24口交换机等技术性器件的安装工作
"""):
    print(w)

# http://pinyin.sogou.com/dict/
# Segment a sentence containing domain-specific terms
seg_list = jieba.cut("真武七截阵和天罡北斗阵哪个更厉害呢？")
for w in seg_list:
    print(w)

# Register the terms as custom words, then segment again
jieba.add_word('真武七截阵')
jieba.add_word('天罡北斗阵')
seg_list = jieba.cut("真武七截阵和天罡北斗阵哪个更厉害呢？")
for w in seg_list:
    print(w)

# Or load a whole user dictionary from a file
jieba.load_userdict('./金庸武功招式.txt')

# Build the corpus
filePaths = []
fileContents = []
for root, dirs, files in os.walk("./Sample"):
    for name in files:
        filePath = os.path.join(root, name)
        filePaths.append(filePath)
        with codecs.open(filePath, 'r', 'utf-8') as f:
            fileContent = f.read()
        fileContents.append(fileContent)

corpos = pandas.DataFrame({
    'filePath': filePaths,
    'fileContent': fileContents
})

# Segment every document, recording which file each segment came from
segments = []
filePaths = []
for index, row in corpos.iterrows():
    filePath = row['filePath']
    fileContent = row['fileContent']
    segs = jieba.cut(fileContent)
    for seg in segs:
        segments.append(seg)
        filePaths.append(filePath)

segmentDataFrame = pandas.DataFrame({
    'segment': segments,
    'filePath': filePaths
})

Word Frequency Statistics - Implementation

Term frequency is the number of times a word appears in a given document.

# -*- coding: utf-8 -*-
import codecs
import os
import os.path

import jieba
import pandas

# Build the corpus
filePaths = []
fileContents = []
for root, dirs, files in os.walk("D:\\PDM\\2.3\\SogouC.mini\\Sample"):
    for name in files:
        filePath = os.path.join(root, name)
        filePaths.append(filePath)
        with codecs.open(filePath, 'r', 'utf-8') as f:
            fileContent = f.read()
        fileContents.append(fileContent)

corpos = pandas.DataFrame({
    'filePath': filePaths,
    'fileContent': fileContents
})

# Segment every document
segments = []
filePaths = []
for index, row in corpos.iterrows():
    filePath = row['filePath']
    fileContent = row['fileContent']
    segs = jieba.cut(fileContent)
    for seg in segs:
        segments.append(seg)
        filePaths.append(filePath)

segmentDataFrame = pandas.DataFrame({
    'segment': segments,
    'filePath': filePaths
})

# Count word frequencies, most frequent first ("计数" means "count")
segStat = segmentDataFrame.groupby(
    by="segment"
)["segment"].size().reset_index(
    name="计数"
).sort_values(
    by="计数",
    ascending=False
)

# Remove stop words (the file needs a header row naming the column "stopword")
stopwords = pandas.read_csv(
    "D:\\PDM\\2.3\\StopwordsCN.txt",
    encoding='utf8',
    index_col=False
)
fSegStat = segStat[~segStat.segment.isin(stopwords.stopword)]

# Alternatively, drop stop words and blank segments during segmentation
stopwordSet = set(stopwords.stopword.values)
segments = []
filePaths = []
for index, row in corpos.iterrows():
    filePath = row['filePath']
    fileContent = row['fileContent']
    segs = jieba.cut(fileContent)
    for seg in segs:
        if seg not in stopwordSet and len(seg.strip()) > 0:
            segments.append(seg)
            filePaths.append(filePath)

segmentDataFrame = pandas.DataFrame({
    'segment': segments,
    'filePath': filePaths
})

segStat = segmentDataFrame.groupby(
    by="segment"
)["segment"].size().reset_index(
    name="计数"
).sort_values(
    by="计数",
    ascending=False
)
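The counting and stop-word filtering can also be illustrated on a single document with `collections.Counter`, without pandas. This is a sketch only: the token list is hand-written rather than actual jieba output, and the stop-word set is a hypothetical stand-in for a real stop-word file.

```python
from collections import Counter


def term_frequencies(segments, stopwords):
    """Count how often each non-blank, non-stop-word token appears in a document."""
    kept = [s for s in segments if s.strip() and s not in stopwords]
    return Counter(kept)


# Tokens as a segmenter might return them (hand-written for this example)
segments = ["数据", "挖掘", "是", "从", "数据", "中", "挖掘", "知识", "的", "过程", " "]
stopwords = {"是", "从", "中", "的"}  # hypothetical stop-word list

freq = term_frequencies(segments, stopwords)
print(freq.most_common(2))
```

Using a set for the stop words keeps each membership check constant-time, which matters once the corpus grows.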
