[MindSpore 6th Two-Day Training Camp] MindElec Homework Notes

Reader submission  840  2022-05-29

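The [MindSpore 6th Two-Day Training Camp] ran on Bilibili on November 6 and 7, 2021. If you missed the live stream at https://live.bilibili.com/22127570, the recordings are available at the links below:

Day 1:

6th Two-Day Training Camp | MindSpore AI Electromagnetic Simulation https://www.bilibili.com/video/BV1Y34y1Z7E8?spm_id_from=333.999.0.0

6th Two-Day Training Camp | Training Large Models with MindSpore Parallelism https://www.bilibili.com/video/BV193411b7on?spm_id_from=333.999.0.0

6th Two-Day Training Camp | MindSpore Boost: Making Your Training Fly https://www.bilibili.com/video/BV1c341187ML?spm_id_from=333.999.0.0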

Day 2:

6th Two-Day Training Camp | Overview of MindSpore Control Flow https://www.bilibili.com/video/BV1A34y1d7G7?spm_id_from=333.999.0.0

6th Two-Day Training Camp | MindSpore Lite 1.5 Release: A New On-Device AI Experience https://www.bilibili.com/video/BV1f34y1o7mR?spm_id_from=333.999.0.0

6th Two-Day Training Camp | Visual Cluster Tuning Released: Tuning Everything from LeNet to the PanGu Large Model https://www.bilibili.com/video/BV1dg411K7Nb?spm_id_from=333.999.0.0

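Let's start with the first lecture of day 1: MindElec, the electromagnetic simulation kit in MindScience.

The homework for the first lecture is as follows:

Zhang Xiaobai has in fact already tried MindSPONGE, the molecular simulation kit in MindScience:

The links are:

Forum: https://bbs.huaweicloud.com/forum/forum.php?mod=viewthread&tid=159269

Blog: https://bbs.huaweicloud.com/blogs/302842

But since assignment 2 requires a MindElec electromagnetic simulation, assignment 1 might as well be done with MindElec too.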

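I. Buy an ECS GPU cloud server

We use an ECS GPU cloud server for the MindElec part of this homework; for the MindSPONGE part, see the links above.

In the Huawei Cloud console, go to ECS, switch the region to CN North-Beijing4, and purchase an instance as shown in the figure below:

Click Buy Now:

Since it costs a bit over 7 yuan per hour, Zhang Xiaobai logged in without delay.

First, a look at the memory and the CUDA version, which is 11.0: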


II. Install the Anaconda environment

MindSpore has traditionally targeted Python 3.7.5 (later releases also support Python 3.9), so install conda first:

...

...

source ~/.bashrc

The installed version turned out to be too old, so download the latest Anaconda instead:

Once downloaded, upload it to the server and run:

bash ./Anaconda3-2021.05-Linux-x86_64.sh

The installer naturally complains that the directory already exists, so remove it:

rm -rf /root/anaconda3

Then run the installer again:

bash ./Anaconda3-2021.05-Linux-x86_64.sh

III. Create a conda environment for MindSpore 1.5:

conda create -n mindspore1.5 python=3.7.5

...

conda activate mindspore1.5

conda install -c conda-forge pythonocc-core=7.5.1 cudatoolkit=11.1

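Enter Y to continue:

The CUDA 11.1 package for the conda environment is fairly large (1.2 GB), so be patient while it downloads.

pythonocc is installed along with it.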

IV. Install the GPU build of MindSpore 1.5

pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/1.5.0/MindSpore/gpu/x86_64/cuda-11.1/mindspore_gpu-1.5.0-cp37-cp37m-linux_x86_64.whl --trusted-host ms-release.obs.cn-north-4.myhuaweicloud.com -i https://pypi.tuna.tsinghua.edu.cn/simple
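The wheel filename itself records which interpreter the package was built for, which is why the Python 3.7.5 environment matters here. Below is a minimal sketch of that check, assuming the standard wheel naming scheme (name-version-pythontag-abitag-platformtag.whl) and CPython only. The helper names parse_wheel_name and matches_interpreter are made up for illustration and are not part of pip or MindSpore.

```python
import sys


def parse_wheel_name(filename):
    """Split a wheel filename into name, version and its three trailing tags."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    parts = stem.split("-")
    # The three tags are always the last three fields, even if a build tag is present.
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def matches_interpreter(python_tag, version_info=None):
    """Return True if a CPython tag such as 'cp37' matches the given version."""
    major, minor = (version_info or sys.version_info)[:2]
    return python_tag == "cp{}{}".format(major, minor)


if __name__ == "__main__":
    info = parse_wheel_name("mindspore_gpu-1.5.0-cp37-cp37m-linux_x86_64.whl")
    print(info)
    print("matches this interpreter:", matches_interpreter(info["python"]))
```

Run inside the mindspore1.5 environment, the last line should report a match; under Python 3.9 it would not, and pip would refuse the cp37 wheel.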

V. Install MindElec:

Let's install the MindElec package provided on the official site. Although its name says ascend, the instructor said it also works on GPU.

wget https://ms-release.obs.cn-north-4.myhuaweicloud.com/1.5.0/MindScience/x86_64/mindscience_mindelec_ascend-0.1.0-cp37-cp37m-linux_x86_64.whl

pip install ./mindscience_mindelec_ascend-0.1.0-cp37-cp37m-linux_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple

Verify the installation:

It failed: the system CUDA is still 11.0, and cuDNN does not appear to be installed.

VI. Install CUDA 11.1 and the matching cuDNN 8.0.5

wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda_11.1.0_455.23.05_linux.run

sh cuda_11.1.0_455.23.05_linux.run

Choose the options as shown below:

Edit ~/.bashrc as the installer's prompt suggests:

Add /usr/local/cuda-11.1/bin to PATH

Add /usr/local/cuda-11.1/lib64 to LD_LIBRARY_PATH
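After editing ~/.bashrc and opening a new shell (or running source ~/.bashrc), it is easy to confirm that both entries took effect. The following is a minimal sketch that assumes the /usr/local/cuda-11.1 prefix used in this post; the function name missing_cuda_paths is hypothetical.

```python
import os

CUDA_HOME = "/usr/local/cuda-11.1"  # install prefix used in this post


def missing_cuda_paths(env=None, cuda_home=CUDA_HOME):
    """Return (variable, directory) pairs that are not yet on the search paths."""
    env = os.environ if env is None else env
    wanted = {
        "PATH": cuda_home + "/bin",
        "LD_LIBRARY_PATH": cuda_home + "/lib64",
    }
    missing = []
    for var, directory in wanted.items():
        # Split the colon-separated list and ignore trailing slashes.
        entries = [e.rstrip("/") for e in env.get(var, "").split(":") if e]
        if directory not in entries:
            missing.append((var, directory))
    return missing


if __name__ == "__main__":
    problems = missing_cuda_paths()
    if not problems:
        print("PATH and LD_LIBRARY_PATH both contain the CUDA 11.1 directories")
    for var, directory in problems:
        print("{} does not contain {}".format(var, directory))
```

The check only inspects the environment variables; it does not verify that the directories exist or that the libraries inside them load.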

Check the CUDA version again:

nvidia-smi

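It now reports 11.1. (Note that nvidia-smi shows the highest CUDA version the installed driver supports; nvcc --version shows the version of the toolkit itself.)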
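The same check can be scripted so that later steps fail early on a mismatch. This is a minimal sketch, assuming the usual nvidia-smi banner that contains "CUDA Version: X.Y". SAMPLE_BANNER is an illustrative line, not output captured from this server, and the function names are hypothetical.

```python
import re
import subprocess

# Illustrative banner line (assumed format), not captured from the author's machine.
SAMPLE_BANNER = (
    "| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |"
)


def cuda_version_from_smi(text):
    """Extract the 'CUDA Version' field from nvidia-smi output, or None."""
    match = re.search(r"CUDA Version:\s*([0-9]+(?:\.[0-9]+)*)", text)
    return match.group(1) if match else None


def read_smi():
    """Run nvidia-smi and return its output, or None if it cannot be run."""
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout


if __name__ == "__main__":
    output = read_smi()
    if output is None:
        print("nvidia-smi is not available; parsing the sample banner instead")
        output = SAMPLE_BANNER
    print("CUDA version reported:", cuda_version_from_smi(output))
```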

Download cuDNN 8.0.5 for CUDA 11.1 (other cuDNN versions also work as long as they are built for CUDA 11.1) and upload it to the server:

Extract it:

tar -zxvf cudnn-11.1-linux-x64-v8.0.5.39.tgz

Copy the files into the corresponding CUDA directories:

VII. Verify the MindSpore 1.5 and MindElec installations:

python -c "import mindspore;mindspore.run_check()"

Alternatively, write a test script (vi test.py) and run it:

python test.py

Verify the MindElec installation:

python -c 'import mindelec'
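For a quick overview of which packages the active environment can see, the standard library is enough. The sketch below only checks that each package can be located, without importing it, so it is a weaker test than the import above: a package can be present and still fail to load its CUDA libraries. The function name installed is hypothetical.

```python
import importlib.util


def installed(names):
    """Map each top-level module name to whether Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in installed(["mindspore", "mindelec"]).items():
        print("{:<10} {}".format(name, "found" if found else "MISSING"))
```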

Everything seems to be in place.

So can the MindElec examples actually run?

VIII. Clone the repository that contains MindElec:

git clone https://gitee.com/mindspore/mindscience.git

IX. Install dependencies

1. Install easydict

2. Install opencv

pip install opencv-python -i  https://pypi.tuna.tsinghua.edu.cn/simple

X. Run the examples

1. Data-driven parameterized electromagnetic simulation:

https://gitee.com/mindspore/mindscience/tree/master/MindElec/examples/data_driven/parameterization

Every experiment below requires changing Ascend to GPU in the relevant code before running; this will not be repeated each time.
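The edit can be done by hand, but it is the same change in every example. Here is a minimal sketch of automating it, assuming the example scripts select the device with a device_target="Ascend" keyword argument (the parameter name comes from MindSpore's context.set_context; how each example actually writes it is an assumption). The function name retarget_to_gpu is hypothetical.

```python
import re

# Matches device_target="Ascend" or device_target='Ascend', keeping the quote style.
_DEVICE_RE = re.compile(r"""(device_target\s*=\s*)(["'])Ascend\2""")


def retarget_to_gpu(source):
    """Return the source text with device_target switched from Ascend to GPU."""
    return _DEVICE_RE.sub(r"\g<1>\g<2>GPU\g<2>", source)


if __name__ == "__main__":
    line = 'context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")'
    print(retarget_to_gpu(line))
```

Only the device_target keyword is touched; other mentions of Ascend (comments, README text, default values of command-line arguments) are left alone and may still need a manual look.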

...

At last it finished:

The results are as follows:

epoch: 9966 step: 55, loss is 1.067301e-06 epoch time: 156.272 ms, per step time: 2.841 ms
epoch: 9967 step: 55, loss is 1.6718128e-06 epoch time: 161.586 ms, per step time: 2.938 ms
epoch: 9968 step: 55, loss is 1.9428162e-06 epoch time: 165.269 ms, per step time: 3.005 ms
epoch: 9969 step: 55, loss is 1.1494253e-06 epoch time: 160.396 ms, per step time: 2.916 ms
epoch: 9970 step: 55, loss is 1.2750754e-06 epoch time: 154.781 ms, per step time: 2.814 ms
epoch: 9971 step: 55, loss is 1.2550026e-06 epoch time: 160.627 ms, per step time: 2.920 ms
epoch: 9972 step: 55, loss is 1.4948789e-06 epoch time: 159.846 ms, per step time: 2.906 ms
epoch: 9973 step: 55, loss is 1.8957531e-06 epoch time: 164.061 ms, per step time: 2.983 ms
epoch: 9974 step: 55, loss is 1.8941449e-06 epoch time: 164.542 ms, per step time: 2.992 ms
epoch: 9975 step: 55, loss is 2.340197e-06 epoch time: 166.823 ms, per step time: 3.033 ms
epoch: 9976 step: 55, loss is 1.5545256e-06 epoch time: 152.811 ms, per step time: 2.778 ms
epoch: 9977 step: 55, loss is 9.994957e-07 epoch time: 171.435 ms, per step time: 3.117 ms
epoch: 9978 step: 55, loss is 2.12672e-06 epoch time: 154.989 ms, per step time: 2.818 ms
epoch: 9979 step: 55, loss is 1.5981371e-06 epoch time: 159.917 ms, per step time: 2.908 ms
epoch: 9980 step: 55, loss is 1.6546201e-06 epoch time: 151.021 ms, per step time: 2.746 ms
epoch: 9981 step: 55, loss is 1.5869264e-06 epoch time: 162.313 ms, per step time: 2.951 ms
epoch: 9982 step: 55, loss is 1.1969032e-06 epoch time: 168.984 ms, per step time: 3.072 ms
epoch: 9983 step: 55, loss is 1.1927513e-06 epoch time: 163.749 ms, per step time: 2.977 ms
epoch: 9984 step: 55, loss is 1.0608298e-06 epoch time: 160.595 ms, per step time: 2.920 ms
epoch: 9985 step: 55, loss is 1.964669e-06 epoch time: 155.398 ms, per step time: 2.825 ms
epoch: 9986 step: 55, loss is 1.5706166e-06 epoch time: 165.935 ms, per step time: 3.017 ms
epoch: 9987 step: 55, loss is 1.3382705e-06 epoch time: 163.523 ms, per step time: 2.973 ms
epoch: 9988 step: 55, loss is 1.2119517e-06 epoch time: 168.339 ms, per step time: 3.061 ms
epoch: 9989 step: 55, loss is 1.7882771e-06 epoch time: 159.096 ms, per step time: 2.893 ms
epoch: 9990 step: 55, loss is 1.1589409e-06 epoch time: 160.459 ms, per step time: 2.917 ms
epoch: 9991 step: 55, loss is 8.78855e-07 epoch time: 156.461 ms, per step time: 2.845 ms
epoch: 9992 step: 55, loss is 1.3546548e-06 epoch time: 157.824 ms, per step time: 2.870 ms
epoch: 9993 step: 55, loss is 3.1089023e-06 epoch time: 158.035 ms, per step time: 2.873 ms
epoch: 9994 step: 55, loss is 1.4939134e-06 epoch time: 160.428 ms, per step time: 2.917 ms
epoch: 9995 step: 55, loss is 2.164372e-06 epoch time: 155.159 ms, per step time: 2.821 ms
epoch: 9996 step: 55, loss is 9.635824e-07 epoch time: 156.919 ms, per step time: 2.853 ms
epoch: 9997 step: 55, loss is 1.0471658e-06 epoch time: 160.262 ms, per step time: 2.914 ms
epoch: 9998 step: 55, loss is 1.4574234e-06 epoch time: 160.660 ms, per step time: 2.921 ms
epoch: 9999 step: 55, loss is 2.0352143e-06 epoch time: 150.130 ms, per step time: 2.730 ms
epoch: 10000 step: 55, loss is 9.816508e-07 epoch time: 156.031 ms, per step time: 2.837 ms
Eval current epoch: 10000 loss: 0.0002412886533234922 l2_s11: 0.0030976369803562306
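A log like this is easier to judge once it is turned into numbers. Here is a minimal sketch that pulls the epoch, loss and timing fields out of lines in the format shown above; the function name parse_epochs is hypothetical, and the two sample lines are copied from the output above.

```python
import re

# One record = the loss line followed by the timing line (any whitespace between).
EPOCH_RE = re.compile(
    r"epoch:\s*(\d+)\s+step:\s*(\d+),\s*loss is\s*([0-9.eE+-]+)\s+"
    r"epoch time:\s*([0-9.]+)\s*ms,\s*per step time:\s*([0-9.]+)\s*ms"
)


def parse_epochs(log_text):
    """Return one dict per epoch record found in the training log."""
    rows = []
    for m in EPOCH_RE.finditer(log_text):
        rows.append({
            "epoch": int(m.group(1)),
            "step": int(m.group(2)),
            "loss": float(m.group(3)),
            "epoch_ms": float(m.group(4)),
            "step_ms": float(m.group(5)),
        })
    return rows


SAMPLE = """\
epoch: 9999 step: 55, loss is 2.0352143e-06 epoch time: 150.130 ms, per step time: 2.730 ms
epoch: 10000 step: 55, loss is 9.816508e-07 epoch time: 156.031 ms, per step time: 2.837 ms
"""

if __name__ == "__main__":
    rows = parse_epochs(SAMPLE)
    mean_ms = sum(r["epoch_ms"] for r in rows) / len(rows)
    print("epochs parsed:", len(rows))
    print("last loss:", rows[-1]["loss"])
    print("mean epoch time (ms):", round(mean_ms, 3))
```

Because the pattern accepts any whitespace between the loss and the timing fields, it works whether the two parts sit on one line or on consecutive lines.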

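The ckpt directory should hold the trained model:

There are 49 images under eval_res:

Downloaded, they look like this:

2. Physics-driven AI solution of the frequency-domain Maxwell's equations: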

https://gitee.com/mindspore/mindscience/tree/master/MindElec/examples/physics_driven/frequency_domain_maxwell

cd ~/mindscience/MindElec/examples/physics_driven/frequency_domain_maxwell

python solve.py

...

The results are as follows:

(mindspore1.5) root@ecs-zhanghui-gpu:~/mindscience/MindElec/examples/physics_driven/frequency_domain_maxwell# python solve.py
pid: 2676
check test dataset shape: (10201, 2), (10201, 1)
[WARNING] OPTIMIZER(2676,7feb1bba3740,python):2021-11-09-00:05:06.369.176 [mindspore/ccsrc/frontend/optimizer/ad/dfunctor.cc:803] GetPrimalUser] J operation has no relevant primal call in the same graph. Func graph: 679_75_construct.92, J user: 679_75_construct.92:construct{[0]: [CNode]93, [1]: x0, [2]: u}
[WARNING] OPTIMIZER(2676,7feb1bba3740,python):2021-11-09-00:05:06.382.175 [mindspore/ccsrc/frontend/optimizer/ad/dfunctor.cc:803] GetPrimalUser] J operation has no relevant primal call in the same graph. Func graph: 622_132_construct.94, J user: 622_132_construct.94:construct{[0]: [CNode]95, [1]: x0, [2]: u}
[WARNING] OPTIMIZER(2676,7feb1bba3740,python):2021-11-09-00:05:06.595.722 [mindspore/ccsrc/frontend/optimizer/ad/dfunctor.cc:803] GetPrimalUser] J operation has no relevant primal call in the same graph. Func graph: 894_465_7_construct.116, J user: 894_465_7_construct.116:construct{[0]: [CNode]117, [1]: [CNode]118, [2]: [CNode]119}
[WARNING] OPTIMIZER(2676,7feb1bba3740,python):2021-11-09-00:05:06.614.336 [mindspore/ccsrc/frontend/optimizer/ad/dfunctor.cc:803] GetPrimalUser] J operation has no relevant primal call in the same graph. Func graph: 894_465_7_construct.116, J user: 894_465_7_construct.116:construct{[0]: [CNode]120, [1]: [CNode]118, [2]: [CNode]121}
[WARNING] CORE(2676,7feb1bba3740,python):2021-11-09-00:05:07.738.476 [mindspore/core/ir/anf_extends.cc:65] fullname_with_scope] Input 0 of cnode is not a value node, its type is CNode.
epoch: 1 step: 78, loss is 600.0
epoch time: 11268.853 ms, per step time: 144.472 ms
epoch: 2 step: 78, loss is 225.4
epoch time: 1389.687 ms, per step time: 17.816 ms
epoch: 3 step: 78, loss is 199.9
================================Start Evaluation================================
Total prediction time: 0.19255661964416504 s
l2_error: 0.20626515080160301
=================================End Evaluation=================================
epoch time: 1610.614 ms, per step time: 20.649 ms
epoch: 4 step: 78, loss is 10.19
epoch time: 1730.271 ms, per step time: 22.183 ms
epoch: 5 step: 78, loss is 2.803
epoch time: 1429.185 ms, per step time: 18.323 ms
epoch: 6 step: 78, loss is 2.316
================================Start Evaluation================================
Total prediction time: 0.0025403499603271484 s
l2_error: 0.019291123630052236
=================================End Evaluation=================================
epoch time: 1420.687 ms, per step time: 18.214 ms
epoch: 7 step: 78, loss is 2.2
epoch time: 1844.602 ms, per step time: 23.649 ms
epoch: 8 step: 78, loss is 1.953
epoch time: 1408.553 ms, per step time: 18.058 ms
epoch: 9 step: 78, loss is 1.856
================================Start Evaluation================================
Total prediction time: 0.0025916099548339844 s
l2_error: 0.015916268073532643
=================================End Evaluation=================================
epoch time: 1404.208 ms, per step time: 18.003 ms
epoch: 10 step: 78, loss is 1.33
epoch time: 1459.013 ms, per step time: 18.705 ms
l2 error: 0.0159162681
per step time: 18.7052916258
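The interesting signal in this run is the l2_error printed by each evaluation block, which drops from about 0.21 to about 0.016 over ten epochs. Here is a minimal sketch that extracts those values; the function name l2_errors is hypothetical, and the sample text reuses the three values from the log above.

```python
import re

# Matches the per-evaluation field "l2_error: <float>". The final summary line
# uses "l2 error" (with a space) and is deliberately not matched.
L2_RE = re.compile(r"l2_error:\s*([0-9.eE+-]+)")


def l2_errors(log_text):
    """Return the l2_error of every evaluation block, in order of appearance."""
    return [float(value) for value in L2_RE.findall(log_text)]


SAMPLE = (
    "epoch: 3 step: 78, loss is 199.9 "
    "Total prediction time: 0.19255661964416504 s l2_error: 0.20626515080160301 "
    "epoch: 6 step: 78, loss is 2.316 "
    "Total prediction time: 0.0025403499603271484 s l2_error: 0.019291123630052236 "
    "epoch: 9 step: 78, loss is 1.856 "
    "Total prediction time: 0.0025916099548339844 s l2_error: 0.015916268073532643 "
    "l2 error: 0.0159162681"
)

if __name__ == "__main__":
    errors = l2_errors(SAMPLE)
    print("evaluations:", len(errors))
    print("first / last l2_error:", errors[0], errors[-1])
    print("improvement factor:", round(errors[0] / errors[-1], 1))
```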

3. Physics-driven AI solution of a family of Maxwell's equations with incremental training

https://gitee.com/mindspore/mindscience/tree/master/MindElec/examples/physics_driven/incremental_learning

cd ~/mindscience/MindElec/examples/physics_driven/incremental_learning

After switching to GPU, run:

python piad.py --mode=pretrain

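...

Wait patiently:

Then came the realization that pretrain runs for 3000 epochs:

Being short of cash, Zhang Xiaobai promptly stopped the training:

The MindSpore team has presumably worked out that it takes the full 3000 epochs to bring the loss below 0.1... The loss was converging, but it was still fairly high.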

4. Physics-driven AI solution of the point-source Maxwell's equations

https://gitee.com/mindspore/mindscience/tree/master/MindElec/examples/physics_driven/time_domain_maxwell

cd ~/mindscience/MindElec/examples/physics_driven/time_domain_maxwell

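Switch to GPU.

Having learned from the previous experiment, promptly edit the configuration to reduce the number of epochs:

Reduce the epochs from 6000 to 100.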
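How much a run will cost is simple arithmetic once the time per epoch is known. The sketch below is a rough estimate only: the function name training_cost is hypothetical, the 7 yuan per hour rate is the lower bound of the "a bit over 7 yuan" mentioned earlier, and the 160 ms per epoch is the approximate figure from the first experiment, used here because this post does not record an epoch time for the time-domain example.

```python
def training_cost(epochs, epoch_ms, yuan_per_hour):
    """Return (hours, cost in yuan) for a run of `epochs` epochs."""
    hours = epochs * epoch_ms / 1000.0 / 3600.0
    return hours, hours * yuan_per_hour


if __name__ == "__main__":
    # First experiment: 10000 epochs at roughly 160 ms each, 7 yuan per hour.
    hours, cost = training_cost(10000, 160.0, 7.0)
    print("hours: {:.2f}, cost: {:.2f} yuan".format(hours, cost))

    # Cutting 6000 epochs to 100 scales time and cost by the same factor,
    # whatever the (unrecorded) epoch time of this example is.
    print("cost ratio 100 vs 6000 epochs: {:.4f}".format(100 / 6000))
```

The estimate covers training time only; the server is also billed while packages download and while evaluation runs.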

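Start training:

100 epochs go by quite quickly.

Again, even with fewer epochs the loss is clearly converging; after the full 6000 epochs the model would no doubt reach its final, fully equipped form.

But Zhang Xiaobai can't burn hard-earned money on finding out, so shutting the server down and walking away is the best relief at this point.

With that, the MindElec homework for MindScience is basically complete.

(End of article. Thanks for reading.)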
