[Wang Zhe - Recommender Systems] Evaluation (task3): Offline Model Evaluation with TensorFlow in Practice

Reader submission, 2022-05-30

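Summary

Step one is to load the training and test sets that Spark has already split.

Step two is to configure the evaluation metrics in TensorFlow, then call model.evaluate on the test set to compute them. Four of the most common metrics are used here: Loss, Accuracy, ROC AUC, and PR AUC.

Step three is to compare the four deep recommendation models based on their evaluation results.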

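Contents

Summary

1. Generating the training and test sets

2. Setting evaluation metrics in TensorFlow

3. Comparing model performance

3.1 Choosing a model

3.2 Why DeepFM performs poorly here

4. Homework

5. Post-class Q&A

Reference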

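1. Generating the training and test sets

The first step is to generate the training and test sets. We use the simplest approach, holdout validation, and split the samples with Spark's randomSplit function. On the TensorFlow side, the get_dataset method loads the training set and the test set separately.

Code reference: the splitAndSaveTrainingTestSamples function in the FeatureEngForRecModel object.

The full sample set is split 8:2 into a training set and a test set, which are saved to SparrowRecSys/src/main/resources/webroot/sampledata/trainingSamples.csv and SparrowRecSys/src/main/resources/webroot/sampledata/testSamples.csv respectively.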
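To make the holdout idea concrete, here is a minimal pure-Python sketch of an 8:2 random split. It is not the Spark code from SparrowRecSys: the function name `holdout_split`, the seed, and the stand-in sample list are assumptions made for illustration. Like a weighted random split, it assigns each sample independently, so the resulting sizes are only approximately 8:2.

```python
import random


def holdout_split(samples, train_fraction=0.8, seed=42):
    """Randomly assign each sample to the training set or the test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for sample in samples:
        # Each sample lands in the training set with probability train_fraction.
        if rng.random() < train_fraction:
            train.append(sample)
        else:
            test.append(sample)
    return train, test


# Hypothetical stand-in for the rows of the full sample set.
samples = list(range(1000))
train_samples, test_samples = holdout_split(samples)
print(len(train_samples), len(test_samples))
```

Every sample ends up in exactly one of the two sets, which is the property that makes the test-set metrics an honest estimate of performance on unseen data.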

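2. Setting evaluation metrics in TensorFlow

Next, we set the evaluation metrics in TensorFlow so that we can watch how the model's performance changes from epoch to epoch. The metrics are specified through the metrics argument when the model is compiled:

(1) In model.compile, set Accuracy, ROC AUC (tf.keras.metrics.AUC(curve='ROC')), and PR AUC (tf.keras.metrics.AUC(curve='PR')). These are the three metrics most commonly used to evaluate recommendation models.

(2) During training and evaluation the model also reports the loss by default. Because we compile with binary_crossentropy as the loss function, the Loss metric here is the binary-classification log loss (Logloss) covered in the previous task.

(3) At the end of each epoch the model prints the current value of every metric. In the final evaluation stage, we call model.evaluate to produce the metrics on the test set.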

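    import tensorflow as tf

    # model, train_dataset and test_dataset are built in the earlier steps

    # compile the model, set loss function, optimizer and evaluation metrics
    model.compile(
        loss='binary_crossentropy',
        optimizer='adam',
        metrics=['accuracy',
                 tf.keras.metrics.AUC(curve='ROC'),
                 tf.keras.metrics.AUC(curve='PR')])

    # train the model
    model.fit(train_dataset, epochs=5)

    # evaluate the model; the values come back in the order loss, then metrics
    test_loss, test_accuracy, test_roc_auc, test_pr_auc = model.evaluate(test_dataset)
    print('Test Loss {}, Test Accuracy {}, Test ROC AUC {}, Test PR AUC {}'.format(
        test_loss, test_accuracy, test_roc_auc, test_pr_auc))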

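Running the code above prints the Loss, Accuracy, ROC AUC, and PR AUC after each training epoch, followed by the same four metrics on the test set (the full log appears below).

As training proceeds, Loss falls while Accuracy, ROC AUC, and PR AUC all rise, which shows that the model improves as the number of epochs increases.

The last lines of the log give the evaluation metrics on the test set.

The test-set results are slightly lower than the training-set results: Accuracy drops from 0.7524 to 0.7427, and ROC AUC from 0.8256 to 0.8138. This is entirely normal, because a model is almost always slightly overfitted to its training set.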

    Epoch 1/5
    8236/8236 [==============================] - 60s 7ms/step - loss: 3.0724 - accuracy: 0.5778 - auc: 0.5844 - auc_1: 0.6301
    Epoch 2/5
    8236/8236 [==============================] - 55s 7ms/step - loss: 0.6291 - accuracy: 0.6687 - auc: 0.7158 - auc_1: 0.7365
    Epoch 3/5
    8236/8236 [==============================] - 56s 7ms/step - loss: 0.5555 - accuracy: 0.7176 - auc: 0.7813 - auc_1: 0.8018
    Epoch 4/5
    8236/8236 [==============================] - 56s 7ms/step - loss: 0.5263 - accuracy: 0.7399 - auc: 0.8090 - auc_1: 0.8305
    Epoch 5/5
    8236/8236 [==============================] - 56s 7ms/step - loss: 0.5071 - accuracy: 0.7524 - auc: 0.8256 - auc_1: 0.8481
    1000/1000 [==============================] - 5s 5ms/step - loss: 0.5198 - accuracy: 0.7427 - auc: 0.8138 - auc_1: 0.8430

    Test Loss 0.5198314250707626, Test Accuracy 0.7426666617393494, Test ROC AUC 0.813848614692688, Test PR AUC 0.8429719805717468
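To see what the numbers in this log measure, three of the four metrics can be computed by hand on a toy example. The sketch below is a plain-Python illustration, not what Keras runs internally: the labels and predicted probabilities are made up, and the ROC AUC here is the exact pairwise value, whereas tf.keras.metrics.AUC approximates the curve with a fixed set of thresholds. PR AUC is left out to keep the sketch short.

```python
import math


def log_loss(labels, probs, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


def accuracy(labels, probs, threshold=0.5):
    """Share of samples whose thresholded prediction matches the label."""
    correct = sum(1 for y, p in zip(labels, probs) if (p > threshold) == (y == 1))
    return correct / len(labels)


def roc_auc(labels, probs):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted click probabilities, for illustration only.
labels = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]

print("loss", log_loss(labels, probs))
print("accuracy", accuracy(labels, probs))
print("roc_auc", roc_auc(labels, probs))
```

Accuracy depends on the 0.5 threshold, while ROC AUC depends only on how the samples are ranked, which is one reason the two can move differently between models.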

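(1) If the test-set results fall sharply relative to the training set, for example by more than 5%, the model is severely overfitted and the model design should be reviewed:

Is the model structure too complex? Are there too many layers, or too many neurons per layer? Should Dropout or a regularization term be added to reduce the risk of overfitting?

(2) Beyond looking at a single model's own results, the evaluation stage should put more weight on comparing different models side by side, because that comparison is what decides which model finally goes online.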
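The overfitting check in item (1) can be sketched as a small helper. The text does not say whether the 5% is a relative or an absolute drop, so this sketch assumes a relative drop. The function name `relative_drop` and the constant `OVERFIT_THRESHOLD` are illustrative, and the metric values are copied from the training log above.

```python
def relative_drop(train_value, test_value):
    """Fraction by which a higher-is-better metric falls from train to test."""
    return (train_value - test_value) / train_value


# Final-epoch training metrics and test metrics, copied from the log above.
train_metrics = {"accuracy": 0.7524, "roc_auc": 0.8256, "pr_auc": 0.8481}
test_metrics = {"accuracy": 0.7427, "roc_auc": 0.8138, "pr_auc": 0.8430}

# The 5% rule of thumb from the text, read here as a relative drop (an assumption).
OVERFIT_THRESHOLD = 0.05

for name, train_value in train_metrics.items():
    drop = relative_drop(train_value, test_metrics[name])
    flag = "possible severe overfitting" if drop > OVERFIT_THRESHOLD else "ok"
    print(f"{name}: drop {drop:.2%} ({flag})")
```

On these numbers every metric drops by well under 5%, which agrees with the reading that this model is only slightly overfitted.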

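3. Comparing model performance

3.1 Choosing a model

Analysis: Embedding MLP and Wide&Deep perform best on our small MovieLens dataset. Their metrics are very close and differ only slightly from one metric to another: Wide&Deep is a little better on Loss, while Embedding MLP is a little better on Accuracy, ROC AUC, and PR AUC.

Choosing a model:

(1) Tune the models further. For the more complex Wide&Deep model in particular, parameter fine-tuning may yield better results.

(2) If the two models remain close after several attempts, pick the final winner through online evaluation.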

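3.2 Why DeepFM performs poorly here

In principle DeepFM has the strongest expressive power of these models, so the likely cause is overfitting. Its training-set results are excellent while its test-set results are poor, which is the signature of overfitting. The underlying reason is that our dataset is a small, sampled version of MovieLens, which makes it hard for the model to converge. Many of the model's parameters never reach a stable state, so its performance on the test set tends to show a large random component.

Takeaway: adapting the model and its parameters to the specific business and data is where an algorithm engineer adds the most value.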

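4. Homework

(1) Besides Loss, Accuracy, ROC AUC, and PR AUC, which other evaluation metrics do you often use in TensorFlow? Can you share these metrics and their characteristics? (See the official TensorFlow Metrics documentation.)

[Answer]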

class BinaryAccuracy: Calculates how often predictions match binary labels.

class BinaryCrossentropy: Computes the crossentropy metric between the labels and predictions.

class CategoricalAccuracy: Calculates how often predictions match one-hot labels.

class CategoricalCrossentropy: Computes the crossentropy metric between the labels and predictions.

class CategoricalHinge: Computes the categorical hinge metric between y_true and y_pred.

class CosineSimilarity: Computes the cosine similarity between the labels and predictions.

class FalseNegatives: Calculates the number of false negatives.

class FalsePositives: Calculates the number of false positives.

class Hinge: Computes the hinge metric between y_true and y_pred.

class KLDivergence: Computes Kullback-Leibler divergence metric between y_true and y_pred.

class LogCoshError: Computes the logarithm of the hyperbolic cosine of the prediction error.

class Mean: Computes the (weighted) mean of the given values.

class MeanAbsoluteError: Computes the mean absolute error between the labels and predictions.

class MeanAbsolutePercentageError: Computes the mean absolute percentage error between y_true and y_pred.

class MeanIoU: Computes the mean Intersection-Over-Union metric.

class MeanMetricWrapper: Wraps a stateless metric function with the Mean metric.

class MeanRelativeError: Computes the mean relative error by normalizing with the given values.

class MeanSquaredError: Computes the mean squared error between y_true and y_pred.

class MeanSquaredLogarithmicError: Computes the mean squared logarithmic error between y_true and y_pred.

class MeanTensor: Computes the element-wise (weighted) mean of the given tensors.

class Metric: Encapsulates metric logic and state.

class Poisson: Computes the Poisson metric between y_true and y_pred.

class Precision: Computes the precision of the predictions with respect to the labels.

class PrecisionAtRecall: Computes best precision where recall is >= specified value.

class RecallAtPrecision: Computes best recall where precision is >= specified value.

class RootMeanSquaredError: Computes root mean squared error metric between y_true and y_pred.
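Several of the metrics listed above (FalseNegatives, FalsePositives, Precision) are built from confusion-matrix counts. The plain-Python sketch below shows how those counts relate to precision and to recall, which Keras also exposes as tf.keras.metrics.Recall. The helper names and the toy data are illustrative assumptions, and the 0.5 threshold is a choice made for this example.

```python
def confusion_counts(labels, probs, threshold=0.5):
    """Return (tp, fp, fn, tn) for binary labels and predicted probabilities."""
    tp = fp = fn = tn = 0
    for y, p in zip(labels, probs):
        predicted = 1 if p > threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision(tp, fp):
    """Of the samples predicted positive, the share that really are positive."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Of the truly positive samples, the share the model found."""
    return tp / (tp + fn) if tp + fn else 0.0


# Made-up labels and predictions, for illustration only.
labels = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1]

tp, fp, fn, tn = confusion_counts(labels, probs)
print("precision", precision(tp, fp), "recall", recall(tp, fn))
```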

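(2) Besides overfitting, what deeper reasons might explain DeepFM's poor evaluation results? Can you offer an explanation based on the model's structure?

[Answer] One possibility is that the data feeding the cross layer is too sparse for the cross layer to converge fully.

In addition, the cross layer relies heavily on ID features. If the ID features in the test set overlap little with those in the training set, the model may be unable to make reasonable predictions. This is the so-called tension between a model's generalization and its memorization.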
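One way to probe the ID-overlap explanation is to measure how many test rows carry an ID that also occurs in the training set. The sketch below uses plain Python and made-up movie IDs. The function name `id_coverage` and the data are assumptions for illustration; in practice the IDs would come from the ID columns of the training and test files.

```python
def id_coverage(train_ids, test_ids):
    """Fraction of test rows whose ID also appears somewhere in the training set."""
    seen = set(train_ids)
    covered = sum(1 for item_id in test_ids if item_id in seen)
    return covered / len(test_ids)


# Made-up movie IDs, for illustration only.
train_movie_ids = [1, 2, 2, 3, 5, 8]
test_movie_ids = [2, 3, 4, 8, 9]

coverage = id_coverage(train_movie_ids, test_movie_ids)
print(f"{coverage:.0%} of test rows use a movie ID seen in training")
```

A low coverage value would support the explanation above, since embeddings and cross terms for IDs never seen in training have had no chance to be learned.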

5. Post-class Q&A

Reference

(1) https://github.com/wzhe06/Reco-papers

(2) Wang Zhe, 《深度学习推荐系统实战》 (Deep Learning Recommender Systems in Practice)
