A Case Study of an HBase MultiActionResultTooLarge Problem

Reader submission, 2022-05-29

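Overview:

A user reported an HBase query problem. The queries use a get list (a batch of Gets); whenever a single get list contains more than 25 Gets, the query fails and the client returns MultiActionResultTooLarge.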

2020-09-09 16:33:00,607Z+0000|INFO|custom-tomcat-51||||Https| requestId=05a89c1b-cecf-4693-a8d6-a319b6621cff|com.xxx.xxx.xxx.hbase.HBaseOperations.get(HBaseOperations.java:429)|(1078409497)get batch rows, tableName: DETAIL.

2020-09-09 16:33:00,622Z+0000|WARN|hconnection-0x39b61605-shared--pool1-t8272||||||org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.logNoResubmit(AsyncProcess.java:1313)|(1078409512)#1, table=DETAIL, attempt=1/1 failed=13ops, last exception: org.apache.hadoop.hbase.MultiActionResultTooLarge: org.apache.hadoop.hbase.MultiActionResultTooLarge: Max size exceeded CellSize: 132944 BlockSize: 109051904

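Symptoms:

1. On the customer's HBase cluster, 69 rows were written and then queried with htable.get(list). If the list held more than 25 Gets, the call hit a MultiActionResultTooLarge exception. About an hour later, lists larger than 25 no longer triggered the exception.

2. During the initial analysis, seeing the MultiActionResultTooLarge error, we assumed that the BlockSize of the query had exceeded the server-side 100 MB threshold, which is controlled by the parameter hbase.server.scanner.max.result.size. However, the user reported that the whole table occupies only a few hundred KB of storage.

3. The customer's table schema is: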

COLUMN FAMILIES DESCRIPTION

{NAME => 'CF1', BLOOMFILTER => 'ROW', VERSIONS => '1000', IN_MEMORY => 'false',
 KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE',
 TTL => '2147472000 SECONDS (24855 DAYS)', COMPRESSION => 'SNAPPY',
 MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536',
 REPLICATION_SCOPE => '0'}

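Analysis:

1. The visible symptom is a MultiActionResultTooLarge exception. In the code that raises it, the exception is created when context.getResponseCellSize(), or the response block size plus the exception size, exceeds the quota. The quota here is the 100 MB set by hbase.server.scanner.max.result.size. In the log above, CellSize is only 132944 bytes while BlockSize is 109051904 bytes, so it is the block-size term that crossed the limit: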

    if (context != null
        && context.isRetryImmediatelySupported()
        && (context.getResponseCellSize() > maxQuotaResultSize
          || context.getResponseBlockSize() + context.getResponseExceptionSize()
          > maxQuotaResultSize)) {

      // We're storing the exception since the exception and reason string won't
      // change after the response size limit is reached.
      if (sizeIOE == null ) {
        // We don't need the stack un-winding do don't throw the exception.
        // Throwing will kill the JVM's JIT.
        // Instead just create the exception and then store it.
        sizeIOE = new MultiActionResultTooLarge("Max size exceeded"
            + " CellSize: " + context.getResponseCellSize()
            + " BlockSize: " + context.getResponseBlockSize());

        // Only report the exception once since there's only one request that
        // caused the exception. Otherwise this number will dominate the exceptions count.
        rpcServer.getMetrics().exception(sizeIOE);
      }
      // ...
    }
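To make the numbers in the log concrete, here is a minimal stand-alone sketch of that condition. It assumes the 100 MB limit means 100 * 1024 * 1024 bytes and that the exception-size term is 0, since the log does not report it. The class and method names are made up for illustration, and nothing here calls HBase.

```java
// Stand-alone model of the size check; it does not use any HBase class.
final class SizeLimitCheck {
    // Assumption: "100 MB" is 100 * 1024 * 1024 bytes.
    static final long MAX_QUOTA_RESULT_SIZE = 100L * 1024 * 1024;

    /** Mirrors the shape of the server-side condition quoted above. */
    static boolean exceedsLimit(long cellSize, long blockSize, long exceptionSize) {
        return cellSize > MAX_QUOTA_RESULT_SIZE
            || blockSize + exceptionSize > MAX_QUOTA_RESULT_SIZE;
    }

    public static void main(String[] args) {
        long cellSize = 132_944L;       // "CellSize" from the client log
        long blockSize = 109_051_904L;  // "BlockSize" from the client log
        long exceptionSize = 0L;        // assumption: not reported in the log

        System.out.println("limit bytes      = " + MAX_QUOTA_RESULT_SIZE);
        System.out.println("cell term trips  = " + (cellSize > MAX_QUOTA_RESULT_SIZE));
        System.out.println("block term trips = "
            + (blockSize + exceptionSize > MAX_QUOTA_RESULT_SIZE));
        System.out.println("rejected         = "
            + exceedsLimit(cellSize, blockSize, exceptionSize));
    }
}
```

Under these assumptions only the block-size term exceeds the limit; the cell-size term is far below it, which fits a table of a few hundred KB.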

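2. Next we looked at why the response size tracked in the context could exceed 100 MB. As the code below shows, for every cell in the queried Result the length of the cell's backing array is added to the block size. If the previous cell referenced the same block, it is not added again.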

    /**
     * Method to account for the size of retained cells and retained data blocks.
     * @return an object that represents the last referenced block from this response.
     */
    Object addSize(RpcCallContext context, Result r, Object lastBlock) {
      if (context != null && r != null && !r.isEmpty()) {
        for (Cell c : r.rawCells()) {
          context.incrementResponseCellSize(CellUtil.estimatedHeapSizeOf(c));
          // We're using the last block being the same as the current block as
          // a proxy for pointing to a new block. This won't be exact.
          // If there are multiple gets that bounce back and forth
          // Then it's possible that this will over count the size of
          // referenced blocks. However it's better to over count and
          // use two RPC's than to OOME the RegionServer.
          byte[] valueArray = c.getValueArray();
          if (valueArray != lastBlock) {
            context.incrementResponseBlockSize(valueArray.length);
            lastBlock = valueArray;
          }
        }
      }
      return lastBlock;
    }
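The accounting in addSize can be modelled without HBase. The sketch below uses a hypothetical FakeCell class (not an HBase type) whose small value sits inside a larger backing array. It sums backing-array lengths the way the excerpt does, comparing by object identity against the previous array only.

```java
import java.util.Arrays;
import java.util.List;

final class BlockSizeAccounting {
    /** Hypothetical stand-in for a Cell: a small value stored inside a larger backing array. */
    static final class FakeCell {
        final byte[] backingArray;
        final int valueLength;

        FakeCell(byte[] backingArray, int valueLength) {
            this.backingArray = backingArray;
            this.valueLength = valueLength;
        }
    }

    /** Sums full backing-array lengths, skipping a cell whose array is the previous cell's array. */
    static long retainedBlockSize(List<FakeCell> cells) {
        long total = 0;
        byte[] lastBlock = null;
        for (FakeCell c : cells) {
            if (c.backingArray != lastBlock) { // identity comparison, as in addSize
                total += c.backingArray.length;
                lastBlock = c.backingArray;
            }
        }
        return total;
    }

    /** Sums only the bytes of the values themselves, for comparison. */
    static long payloadSize(List<FakeCell> cells) {
        long total = 0;
        for (FakeCell c : cells) {
            total += c.valueLength;
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] a = new byte[1024];
        byte[] b = new byte[1024];
        List<FakeCell> sameBlock =
            Arrays.asList(new FakeCell(a, 10), new FakeCell(a, 10), new FakeCell(a, 10));
        List<FakeCell> alternating =
            Arrays.asList(new FakeCell(a, 10), new FakeCell(b, 10), new FakeCell(a, 10));

        System.out.println("payload bytes       = " + payloadSize(sameBlock));
        System.out.println("same array charged  = " + retainedBlockSize(sameBlock));
        System.out.println("alternating charged = " + retainedBlockSize(alternating));
    }
}
```

Three 10-byte values sharing one 1 KB array are charged 1024 bytes, while the same values alternating between two arrays are charged 3072 bytes. This is the over-counting that the source comment warns about, and it shows that the charged size depends on the backing arrays, not on how much data the table holds.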

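3. At this point we suspected that too many versions in the user's table might be the cause. The user confirmed that their workload does update the same row repeatedly and that the table's VERSIONS is 1000. However, after VERSIONS was changed from 1000 to 1, the problem persisted. Testing showed that after a manual flush the query worked again, so we suspected that the data behind the failing queries had not yet been persisted to HDFS and was still in the WAL or the MemStore.

4. Searching the community for the keyword "MultiActionResultTooLarge" then turned up https://issues.apache.org/jira/browse/HBASE-23158, whose symptoms match ours exactly. The issue is in Unresolved status. The check is a safeguard that HBase builds into the code to protect against large scans, and the issue notes that when a Cell is still in the MemStore, the array that the code measures can become very large.

5. Since get list is a fairly common operation, this safeguard alone should not make the problem inevitable. Looking at the test patch attached to the issue, we found that the reproduction lowers the client's retry count. Going back to the customer's error log, we saw that only one attempt was allowed (attempt=1/1). Once we raised the retry count slightly, the problem no longer occurred.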
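One way to read the logged BlockSize, offered as an assumption rather than something the article or the log confirms: if MemStore cells are backed by shared allocation chunks of 2 MB each (which I believe is the default MemStore chunk size), then every distinct chunk touched by the batch is charged at its full 2 MB. The sketch below only does the arithmetic; the class name and constants are illustrative.

```java
final class ChunkArithmetic {
    static final long LIMIT = 100L * 1024 * 1024;        // 100 MB limit from the article
    static final long ASSUMED_CHUNK = 2L * 1024 * 1024;  // assumption: 2 MB MemStore chunk
    static final long HFILE_BLOCK = 65_536L;             // BLOCKSIZE from the table descriptor

    /** Smallest number of distinct backing arrays of the given size whose total exceeds the limit. */
    static long arraysToExceed(long arraySize) {
        return LIMIT / arraySize + 1;
    }

    public static void main(String[] args) {
        long loggedBlockSize = 109_051_904L; // "BlockSize" from the client log

        System.out.println("logged BlockSize / assumed chunk    = "
            + loggedBlockSize / ASSUMED_CHUNK
            + " (remainder " + loggedBlockSize % ASSUMED_CHUNK + ")");
        System.out.println("2 MB arrays needed to exceed limit  = "
            + arraysToExceed(ASSUMED_CHUNK));
        System.out.println("64 KB arrays needed to exceed limit = "
            + arraysToExceed(HFILE_BLOCK));
    }
}
```

Under that assumption the logged 109051904 bytes is exactly 52 chunks, and 51 chunks are already enough to pass the limit. With arrays of 65536 bytes (the table's BLOCKSIZE) it would take 1601 distinct blocks. That would be consistent with the query recovering after a flush, though the article does not verify it.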

org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.logNoResubmit(AsyncProcess.java:1313)|(1078409512)#1, table=DETAIL, attempt=1/1 failed=13ops, last exception: org.apache.hadoop.hbase.MultiActionResultTooLarge

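The workaround is to raise the client retry count slightly. When the retry count is 1, the client does not go back to the server after hitting certain exceptions, which easily leads to sporadic failures. As for the problem still occurring when the retry count is 1, a proper fix needs to be worked out together with the HBase community.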
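Why a larger retry count helps can be sketched with a toy model. Everything below is hypothetical: the server stand-in answers gets until the retained size passes the limit and rejects the rest, and the client resubmits only the rejected gets. The figures (39 gets, 4 MB charged per get) are illustrative, chosen so that the first attempt leaves 13 gets unanswered as in the log line; they are not measurements. In a real client the attempt limit presumably corresponds to hbase.client.retries.number, which the article does not name.

```java
import java.util.ArrayList;
import java.util.List;

final class RetrySimulation {
    /** Hypothetical server: answers gets until the retained size passes the limit, rejects the rest. */
    static List<Integer> serveBatch(List<Integer> ops, long bytesPerGet, long limit) {
        List<Integer> rejected = new ArrayList<>();
        long retained = 0;
        for (int op : ops) {
            if (retained > limit) {
                rejected.add(op);         // stands for a MultiActionResultTooLarge result
            } else {
                retained += bytesPerGet;  // get answered, its size is accounted afterwards
            }
        }
        return rejected;
    }

    /** Returns how many gets are still unanswered after at most maxAttempts round trips. */
    static int unansweredAfter(int totalGets, int maxAttempts, long bytesPerGet, long limit) {
        List<Integer> pending = new ArrayList<>();
        for (int i = 0; i < totalGets; i++) {
            pending.add(i);
        }
        for (int attempt = 1; attempt <= maxAttempts && !pending.isEmpty(); attempt++) {
            pending = serveBatch(pending, bytesPerGet, limit); // resubmit only the rejected gets
        }
        return pending.size();
    }

    public static void main(String[] args) {
        long limit = 100L * 1024 * 1024;
        long bytesPerGet = 4L * 1024 * 1024; // illustrative figure, not measured

        System.out.println("1 attempt:  " + unansweredAfter(39, 1, bytesPerGet, limit));
        System.out.println("3 attempts: " + unansweredAfter(39, 3, bytesPerGet, limit));
    }
}
```

In this model a single attempt leaves 13 gets unanswered, while a second round trip carries only those 13 gets, stays under the limit, and completes the batch.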

