Huawei Geek Week, Ascend (昇腾万里) Model King Challenge: Getting VQVAE to Run, Part 2

Reader submission 609 2022-05-30

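Current status and open problems:

1 After the 依瞳 system is paused and restarted, the patched files have to be patched all over again.

2 Issue 3 is still unresolved; keep watching it: https://gitee.com/ascend/modelzoo/issues/I28YYG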

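How Issue 3 was resolved

Investigation showed that the QueueDequeueMany output shape differed from TF because the upstream operator RandomShuffleQueue did not pass its shapes attribute downstream.

The developers responsible for that operator were contacted to fix it.

The problem has been fixed. Please copy the file in the attachment to the path below, and back up the existing file before replacing it.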

/home/HwHiAiUser/Ascend/ascend-toolkit/20.1.rc1/arm64-linux/opp/op_proto/built-in/libopsproto.so

That resolved Issue 3, at the cost of one more patch file to apply.
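Because the patched files are lost whenever the system is paused and restarted, the backup-then-replace step is worth scripting. Below is a minimal sketch, assuming the patched library has been saved somewhere persistent. The function name `apply_patch` and the `.bak` suffix are my own choices for illustration, not part of any Ascend tooling.

```python
import shutil
from pathlib import Path


def apply_patch(patched_file, target_file, backup_suffix=".bak"):
    """Back up target_file once, then overwrite it with patched_file.

    Returns the path of the backup copy.
    """
    patched = Path(patched_file)
    target = Path(target_file)
    backup = target.with_name(target.name + backup_suffix)

    # Keep only the first backup, so that re-running the script never
    # replaces the pristine original with an already patched copy.
    if target.exists() and not backup.exists():
        shutil.copy2(target, backup)

    shutil.copy2(patched, target)
    return backup
```

It could be called as `apply_patch("patches/libopsproto.so", "/home/HwHiAiUser/Ascend/ascend-toolkit/20.1.rc1/arm64-linux/opp/op_proto/built-in/libopsproto.so")`, where `patches/` is a hypothetical directory holding the attachment from the issue.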

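Issue 4 error (reported by aoe)

Even with the Issue 3 patch applied, the run still failed. Quite some time had passed and the error had no distinctive features, so I could not tell from the log whether my earlier bug had actually been fixed. Forgive me for being lazy now and then: I did not compare the error messages carefully.

I then found that aoe had already filed an issue. After I applied the Issue 3 patch, my error message was the same as his, so I kept following that issue and its resolution. This counts as Issue 4: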

https://gitee.com/ascend/modelzoo/issues/I2A7SC

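The reply: following the section "Enabling mixed computing in sess.run mode" at the link below, set tf.train.shuffle_batch and tf.train.string_input_producer so that they are not offloaded to the NPU, then check whether the network runs.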

https://support.huaweicloud.com/mprtg-A800_9000_9010/atlasprtg_13_0033.html

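They suggested mixed computing. The documentation says iterations_per_loop must be 1 in mixed computing mode, but I could not find that keyword anywhere in my code. Does that mean I do not need to care about its actual value? (I later learned through discussion that mixed mode already sets iterations_per_loop to 1.)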
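One way to check this without asking is to read the adapter's own log output: the tf_adapter lines in the Issue 4 log further down in this post print both settings. A minimal sketch that pulls them out of a log follows; the helper name `adapter_settings` is hypothetical, and the sample lines are copied from that log.

```python
import re

# Matches tf_adapter lines such as
# "... mark_noneed_optimize_pass.cc:103] iterations_per_loop is 1"
_SETTING = re.compile(r"\] (mix_compile_mode|iterations_per_loop) is (\S+)\s*$")


def adapter_settings(log_lines):
    """Collect the mix_compile_mode / iterations_per_loop values found in a log."""
    found = {}
    for line in log_lines:
        match = _SETTING.search(line)
        if match:
            found[match.group(1)] = match.group(2)
    return found


sample = [
    "2020-12-23 22:22:55.166546: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:102] mix_compile_mode is True",
    "2020-12-23 22:22:55.166574: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:103] iterations_per_loop is 1",
]
print(adapter_settings(sample))  # {'mix_compile_mode': 'True', 'iterations_per_loop': '1'}
```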

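Users can also use without_npu_compile_scope to choose for themselves which operators are not offloaded.

So, following the instructions, I changed line 82 of the code:

# changed: keep this op off the NPU (not offloaded)
with npu_scope.without_npu_compile_scope():
    filename_queue = tf.train.string_input_producer(filenames,num_epochs=num_epochs)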

Because aoe's problem was solved, his issue was closed. Mine was not, so I filed my own Issue 4:

Issue 4 error

Issue filed:

https://gitee.com/ascend/modelzoo/issues/I2AMHH

Instructions for updating:

Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.FixedLengthRecordDataset`.

WARNING:tensorflow:From cifar10.py:305: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.

Instructions for updating:

Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).

2020-12-23 22:22:55.165648: I tf_adapter/optimizers/get_attr_optimize_pass.cc:64] NpuAttrs job is localhost

2020-12-23 22:22:55.166302: I tf_adapter/optimizers/get_attr_optimize_pass.cc:128] GetAttrOptimizePass_5 success. [0 ms]

2020-12-23 22:22:55.166352: I tf_adapter/optimizers/mark_start_node_pass.cc:82] job is localhost Skip the optimizer : MarkStartNodePass.

2020-12-23 22:22:55.166546: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:102] mix_compile_mode is True

2020-12-23 22:22:55.166574: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:103] iterations_per_loop is 1

2020-12-23 22:22:55.166804: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1763] OMPartition subgraph_9 begin.

2020-12-23 22:22:55.166829: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1764] mix_compile_mode is True

2020-12-23 22:22:55.166876: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1765] iterations_per_loop is 1

2020-12-23 22:22:55.167661: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:354] FindNpuSupportCandidates enableDP:0, mix_compile_mode: 1, hasMakeIteratorOp:0, hasIteratorOp:0

2020-12-23 22:22:55.167710: I tf_adapter/util/npu_ops_identifier.cc:67] [MIX] Parsing json from /home/HwHiAiUser/Ascend/ascend-toolkit/latest/arm64-linux/opp/framework/built-in/tensorflow/npu_supported_ops.json

2020-12-23 22:22:55.169692: I tf_adapter/util/npu_ops_identifier.cc:69] 690 ops parsed

2020-12-23 22:22:55.170185: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:484] TFadapter find Npu support candidates cost: [2 ms]

2020-12-23 22:22:55.176442: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:863] cluster Num is 1

2020-12-23 22:22:55.176485: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:870] All nodes in graph: 382, max nodes count: 377 in subgraph: GeOp9_0 minGroupSize: 1

2020-12-23 22:22:55.176643: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1851] OMPartition subgraph_9 markForPartition success.

Traceback (most recent call last):

File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call

return fn(*args)

File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn

target_list, run_metadata)

File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun

run_metadata)

tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer/limit_epochs/epochs/Assign is not in white list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "cifar10.py", line 517, in <module>

extract_z(**config)

File "cifar10.py", line 330, in extract_z

sess.run(init_op)

File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run

run_metadata_ptr)

File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run

feed_dict_tensor, options, run_metadata)

File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run

run_metadata)

File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call

raise type(e)(node_def, op, message)

tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer/limit_epochs/epochs/Assign is not in white list

2020-12-23 22:22:56.194723: I tf_adapter/util/ge_plugin.cc:56] [GePlugin] destroy constructor begin

2020-12-23 22:22:56.194890: I tf_adapter/util/ge_plugin.cc:195] [GePlugin] Ge has already finalized.

2020-12-23 22:22:56.194990: I tf_adapter/util/ge_plugin.cc:58] [GePlugin] destroy constructor end

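I later could not reproduce it, so I closed the issue. Today it reproduced again.

It turns out that on the 依瞳 system, python and python3.7 do not resolve to the same interpreter.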

model_user14@f2e974f6-0696-4b25-874d-3053d19ba4e2:~/jk/tf-vqvae$  which python

/usr/bin/python

model_user14@f2e974f6-0696-4b25-874d-3053d19ba4e2:~/jk/tf-vqvae$  which python3.7

/usr/local/bin/python3.7

model_user14@f2e974f6-0696-4b25-874d-3053d19ba4e2:~/jk/tf-vqvae$ which pip

/usr/local/bin/pip

So the interpreter to use is the one under /usr/local/bin, that is, python3.7.
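The same check can be made from inside Python, which also shows which interpreter a script is actually running under. A small sketch using only the standard library follows; the helper name `describe_interpreters` is my own.

```python
import shutil
import sys


def describe_interpreters(names=("python", "python3.7", "pip")):
    """Map each command name to the path the shell would resolve it to."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # The interpreter executing this script, which may differ from `python`.
    print("running under:", sys.executable)
    print("version:", ".".join(map(str, sys.version_info[:3])))
    for name, path in describe_interpreters().items():
        print(f"{name:10s} -> {path}")
```

A command that is not on the PATH maps to None, so a missing or unexpected entry stands out immediately.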

I checked the tf.train.batch settings, especially allow_smaller_final_batch. The call appears in three places, and the only one that actually takes effect had already been set to False:

images,labels = tf.train.batch(
    [image,label],
    batch_size=BATCH_SIZE,
    num_threads=1,
    capacity=BATCH_SIZE,
    allow_smaller_final_batch=False)

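I moved the code away from aoe's approach, because his workaround does not meet the competition requirements.

I set the other two tf.train.batch calls to not be offloaded. In mixed computing, keeping statements off the NPU means wrapping them in this statement: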

with npu_scope.without_npu_compile_scope():

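After changes in so many places the program was barely recognizable, so I retested the CPU code and found that the cifar10 CPU code no longer ran either. (This is what makes the vqvae model hard: change one small thing and the error changes, and sometimes it seems to break on its own without any change at all.)

It took a lot of effort to get the CPU code working again. I kept the CPU code as a separate program, cifar_base.py, and then brought the NPU program into line with it, so that the NPU program also runs when all NPU-related code is commented out.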
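An alternative to commenting NPU code in and out is a single switch that turns the not-offloaded scope into a no-op on CPU. This is only a sketch of that idea, not what the post actually did: `host_only_scope` and `USE_NPU` are hypothetical names, and the `npu_bridge` import path is an assumption that only resolves on a machine with the Ascend TensorFlow adapter installed.

```python
import contextlib


def host_only_scope(use_npu):
    """Return a context manager for ops that should stay off the NPU.

    With use_npu=False this is a no-op, so the same script runs on CPU.
    """
    if use_npu:
        # Assumed import path; only importable where npu_bridge is installed.
        from npu_bridge.estimator.npu import npu_scope
        return npu_scope.without_npu_compile_scope()
    return contextlib.nullcontext()


USE_NPU = False  # hypothetical switch; set to True on the Ascend machine

with host_only_scope(USE_NPU):
    # In the real script the tf.train.batch(...) call would go here.
    pipeline_built = True
```

With this shape, the CPU baseline and the NPU version stay in one file and differ only in the value of the switch.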

Issue 5 error: shapes do not line up

https://gitee.com/ascend/modelzoo/issues/I2AVI5

The cause was most likely that the data batching did not drop the final partial batch:

images,labels = tf.train.batch(
    [image,label],
    batch_size=BATCH_SIZE,
    num_threads=1,
    capacity=BATCH_SIZE,
    allow_smaller_final_batch=False)

Adding the part in bold (allow_smaller_final_batch=False) fixed it.
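The effect of that flag is easy to see in plain Python. The sketch below is only an illustration of the idea, not how tf.train.batch is implemented; `make_batches` is a hypothetical helper. Since the model's input placeholder has a fixed shape of [BATCH_SIZE,32,32,3], a shorter final batch cannot be fed to it, which is why it has to be dropped.

```python
def make_batches(items, batch_size, allow_smaller_final_batch):
    """Split items into consecutive batches of batch_size.

    When allow_smaller_final_batch is False, a trailing batch with fewer
    than batch_size items is dropped, so every batch has the same length.
    """
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    if batches and not allow_smaller_final_batch and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


samples = list(range(10))
print([len(b) for b in make_batches(samples, 4, allow_smaller_final_batch=True)])   # [4, 4, 2]
print([len(b) for b in make_batches(samples, 4, allow_smaller_final_batch=False)])  # [4, 4]
```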

Issue 5.1 filed

https://gitee.com/ascend/modelzoo/issues/I2B2US

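The reply said: next time a DEVMM error appears, please run dmesg to capture the kernel log, which makes the problem easier to locate. Thanks.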

[ERROR] DEVMM(25538,python3.7):2020-12-27-12:52:10.065.876 [hardware/build/../dev_platform/devmm/devmm/devmm_svm.c:268][devmm_copy_ioctl 268] Ioctl(-1060090619) error! ret=-1, dst=0xfffed40af2a0, src=0x1008000bc000, size=112,

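However, this problem is not easy to reproduce: it shows up only occasionally on my system, and R&D also found it hard to reproduce.

Then, on the first working day after New Year's Day (new year, new start), I made a slight change to the code and it ran. I was quite surprised.

The data-reading part uses mixed computing with those ops kept off the NPU, and in principle I did not touch the backbone code. It still failed on New Year's Day, yet after a small code change today it ran.

This issue is solved and closed.

The notes above are short, but this issue took a great deal of time, from the end of 2020 into the start of 2021, so it spanned two calendar years. Along the way the code was changed beyond recognition and the bug looked different every day. The joy of finally succeeding was as great as the lows in between were deep.

Issue 6 error (not yet finished)

This time I did not file the error as an issue.

Error message: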

Instructions for updating:

Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).

WARNING:tensorflow:From cn.py:346: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.

Instructions for updating:

Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).

2021-01-04 16:05:37.573107: I tf_adapter/optimizers/get_attr_optimize_pass.cc:64] NpuAttrs job is localhost

2021-01-04 16:05:37.574181: I tf_adapter/optimizers/get_attr_optimize_pass.cc:128] GetAttrOptimizePass_15 success. [0 ms]

2021-01-04 16:05:37.574252: I tf_adapter/optimizers/mark_start_node_pass.cc:82] job is localhost Skip the optimizer : MarkStartNodePass.

2021-01-04 16:05:37.574439: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:102] mix_compile_mode is True

2021-01-04 16:05:37.574461: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:103] iterations_per_loop is 1

2021-01-04 16:05:37.574658: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1763] OMPartition subgraph_29 begin.

2021-01-04 16:05:37.574679: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1764] mix_compile_mode is True

2021-01-04 16:05:37.574689: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1765] iterations_per_loop is 1

2021-01-04 16:05:37.575437: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:354] FindNpuSupportCandidates enableDP:0, mix_compile_mode: 1, hasMakeIteratorOp:0, hasIteratorOp:0

2021-01-04 16:05:37.575952: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:484] TFadapter find Npu support candidates cost: [0 ms]

2021-01-04 16:05:37.582660: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:863] cluster Num is 1

2021-01-04 16:05:37.582750: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:870] All nodes in graph: 382, max nodes count: 377 in subgraph: GeOp29_0 minGroupSize: 1

2021-01-04 16:05:37.583336: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1851] OMPartition subgraph_29 markForPartition success.

Traceback (most recent call last):

File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call

return fn(*args)

File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn

target_list, run_metadata)

File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun

run_metadata)

tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer_2/limit_epochs/epochs/Assign is not in white list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "cn.py", line 558, in <module>

extract_z(**config)

File "cn.py", line 371, in extract_z

sess.run(init_op)

File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run

run_metadata_ptr)

File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run

feed_dict_tensor, options, run_metadata)

File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run

run_metadata)

File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call

raise type(e)(node_def, op, message)

tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer_2/limit_epochs/epochs/Assign is not in white list

2021-01-04 16:05:38.559795: I tf_adapter/util/ge_plugin.cc:56] [GePlugin] destroy constructor begin

2021-01-04 16:05:38.560022: I tf_adapter/util/ge_plugin.cc:195] [GePlugin] Ge has already finalized.

2021-01-04 16:05:38.560042: I tf_adapter/util/ge_plugin.cc:58] [GePlugin] destroy constructor end

Seeing this warning, should I change the code?

WARNING:tensorflow:From cn.py:347: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.

Instructions for updating:

Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).

WARNING:tensorflow:From cn.py:347: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.

Instructions for updating:

Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).

The error message is now:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "cn.py", line 559, in <module>

extract_z(**config)

File "cn.py", line 372, in extract_z

sess.run(init_op)

File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run

run_metadata_ptr)

File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run

feed_dict_tensor, options, run_metadata)

File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run

run_metadata)

File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call

raise type(e)(node_def, op, message)

tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer/limit_epochs/epochs/Assign is not in white list

Searching suggested that incorrect initialization might be the cause, so I tried adding this line:

sess.graph.finalize()

Still the same error.

Main code:

init_op = tf.group(tf.global_variables_initializer(),
                   tf.local_variables_initializer())

# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Run!
config = tf.ConfigProto()
# config.gpu_options.allow_growth = True
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True  # run training on the Ascend AI processor
custom_op.parameter_map["mix_compile_mode"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # turn off the remap switch

sess = tf.Session(config=config)
# sess.graph.finalize()
sess.run(init_op)
print("="*1000, "run sess.run(init_op) OK!")

summary_writer = tf.summary.FileWriter(LOG_DIR,sess.graph)
# logging.warning("dch summary_writer")
summary_writer.add_summary(config_summary.eval(session=sess))
# logging.warning("dch summary_writer.add")

extract_z code:

with npu_scope.without_npu_compile_scope():
    images,labels = tf.train.batch(
        [image,label],
        batch_size=BATCH_SIZE,
        num_threads=1,
        capacity=BATCH_SIZE,
        allow_smaller_final_batch=False)
# <<<<<<<
# images = images.batch(batch_size, drop_remainder=True)

# >>>>>>> MODEL
with tf.variable_scope('net'):
    with tf.variable_scope('params') as params:
        pass
    x_ph = tf.placeholder(tf.float32,[BATCH_SIZE,32,32,3])
    net= VQVAE(None,None,BETA,x_ph,K,D,_cifar10_arch,params,False)

init_op = tf.group(tf.global_variables_initializer(),
                   tf.local_variables_initializer())

# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Run!
config = tf.ConfigProto()
# config.gpu_options.allow_growth = True
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True  # run training on the Ascend AI processor
custom_op.parameter_map["mix_compile_mode"].b = True  # testing operator offload
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # turn off the remap switch

sess = tf.Session(config=config)
logger.warning('warn sess = tf.Session(config=config)')
# sess = tf.Session()
sess.graph.finalize()
sess.run(init_op)
logger.warning('warn sess.run(init_op)')

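In the end I removed the num_epochs=1 argument from this call, and it finally passed. That fits the error message: the Ref tensor it names, input_producer/limit_epochs/epochs/Assign, belongs to the epoch counter variable that the input producer creates only when num_epochs is set.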

# image,label = get_image(num_epochs=1)

image,label = get_image()

This may not be the final solution, but it will do for now.

Issue 7 error:

This is in the train_prior part:  config['TRAIN_NUM'] = 8 # errors out after 9

Issue filed: https://gitee.com/ascend/modelzoo/issues/I2BUME/

Error message:

2021-01-04 20:30:06.952771: I tf_adapter/kernels/geop_npu.cc:573] [GEOP] RunGraphAsync callback, status:0, kernel_name:GeOp75_0[ 6456228us]

50%|██████████████████████████████████████████████                                              | 9/18 [01:36<01:36, 10.67s/it]

Traceback (most recent call last):

File "cifar10.py", line 531, in <module>

train_prior(config=config,**config)

File "cifar10.py", line 476, in train_prior

sess.run(sample_summary_op,feed_dict={sample_images:sampled_ims}),it)

File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run

run_metadata_ptr)

File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1156, in _run

(np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))

ValueError: Cannot feed value of shape (20, 32, 32, 3) for Tensor 'misc/Placeholder:0', which has shape '(1, 32, 32, 3)'

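This problem is still unsolved, and I have no idea where it comes from. It is probably related to how data is fed, but I cannot solve it for now, so I set config['TRAIN_NUM'] = 8 to make sure the whole program runs through. This issue stays open for the time being.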
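What the traceback does say is narrow: the value being fed has a batch dimension of 20, while the placeholder misc/Placeholder:0 was built with a fixed batch dimension of 1. Why the placeholder ends up with that shape at this step is not established here. As a debugging aid, the shape rule can be written down in plain Python; `shapes_compatible` is a hypothetical helper mirroring that rule, not a TensorFlow API.

```python
def shapes_compatible(value_shape, placeholder_shape):
    """True if value_shape can be fed to a placeholder of placeholder_shape.

    A None entry in placeholder_shape stands for an unspecified dimension.
    """
    if len(value_shape) != len(placeholder_shape):
        return False
    return all(p is None or v == p for v, p in zip(value_shape, placeholder_shape))


print(shapes_compatible((20, 32, 32, 3), (1, 32, 32, 3)))     # False: the case in the traceback
print(shapes_compatible((20, 32, 32, 3), (None, 32, 32, 3)))  # True
```

Printing the shape of the sampled images and of the placeholder just before the failing sess.run call would show which of the two changes between steps.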

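PR submission: the final sprint

After a hard struggle, the model contest finally saw daylight: the whole model runs on the Ascend system and basically meets the contest requirements. What remained was fine-tuning.

Change requests from the reviewers

1 The program explicitly keeps tf.train.string_input_producer off the NPU.

Remove this and enable mixed computing only.

2 A problem in the VQ-VAE network: data preprocessing is driven by the loop below, which throws an exception once the data runs out and relies on that exception to end processing. On Ascend hardware this currently causes a core dump. Developers are asked to switch to a different control flow: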

try:
    while not coord.should_stop():
        x,y = sess.run([images,labels])
        k = sess.run(net.k,feed_dict={x_ph:x})
        ks.append(k)
        ys.append(y)
        print('.', end='', flush=True)
except tf.errors.OutOfRangeError:
    pass  # the queue is exhausted, which is what ends the loop

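Final fixes for the VQVAE PR

1 The whole program enables mixed computing only, and every explicit not-offloaded setting is removed. (This is the intended way to use it: in theory the system offloads everything it supports and leaves unsupported operators on the host by default, with no manual configuration needed.)

2 Change the while loop to a for loop:

for step in tqdm(range(TRAIN_NUM), dynamic_ncols=True):
    x,y = sess.run([images,labels])
    k = sess.run(net.k,feed_dict={x_ph:x})
    ks.append(k)
    ys.append(y)

and set the number of loop steps: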

config['TRAIN_NUM'] = 24
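The difference between the two control flows can be shown without TensorFlow. In this sketch `OutOfRange` stands in for tf.errors.OutOfRangeError and `make_reader` for the queue-backed sess.run call; both are hypothetical stand-ins for illustration.

```python
class OutOfRange(Exception):
    """Stand-in for tf.errors.OutOfRangeError in this sketch."""


def make_reader(num_batches):
    """Return a function that yields batch indices, then raises OutOfRange."""
    state = {"next": 0}

    def read():
        if state["next"] >= num_batches:
            raise OutOfRange()
        state["next"] += 1
        return state["next"] - 1

    return read


def extract_until_exception(read):
    """Original control flow: loop until the reader raises."""
    results = []
    try:
        while True:
            results.append(read())
    except OutOfRange:
        pass
    return results


def extract_fixed_steps(read, train_num):
    """Revised control flow: a fixed number of steps, no exception needed."""
    return [read() for _ in range(train_num)]


print(extract_until_exception(make_reader(5)))       # [0, 1, 2, 3, 4]
print(extract_fixed_steps(make_reader(30), 24)[-1])  # 23
```

The fixed-step version only works if the input never runs dry within those steps. With num_epochs left unset, tf.train.string_input_producer cycles through the files indefinitely, so a bounded for loop is then what decides when extraction stops.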

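I also checked with Shaofang (少芳): the second part passes precisely because the argument to get_image was removed. Since the number of loop steps is now fixed, this should not affect the overall result.

Before: # image,label = get_image(num_epochs=1)

After: image,label = get_image()

Then I submitted the PR, and it finally passed acceptance. Hurrah! I was thrilled. The result is not what matters most; the process of hitting problems and solving them is. But without a result, this write-up would have had no reason to exist, the effort might have gone to waste, and I would probably not have learned as much or remembered it as vividly.

Summary: migrating the VQVAE TensorFlow model to Ascend

The contest went through several stages: registration, model selection, model migration, debugging, and PR submission. The details are in the earlier sections; it is a long story.

This model migration contest was a great opportunity to learn and practice. I knew nothing about TensorFlow beforehand. After the contest, whether or not I truly understand it, I have read the code many times over and picked up a little of how a TF program flows. My only previous contact with Ascend was through ModelArts notebooks and training jobs; being able to freely install software and fully control the system, as on the 依瞳 system, was a first. While debugging I worked directly with Huawei's R&D engineers, and their prompt, accurate troubleshooting deserves credit. I have full confidence in the Ascend system and the MindSpore AI framework!

The silver stage of the model contest is coming soon, so get ready to sign up!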
