[SparkAPI] JavaPairRDD: countByKey, countByKeyApprox

Reader submission 730 2025-04-02

/**
 * Count the number of elements for each key, collecting the results to a local Map.
 *
 * @note This method should only be used if the resulting map is expected to be small, as
 * the whole thing is loaded into the driver's memory.
 * To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which
 * returns an RDD[T, Long] instead of a map.
 */

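Counts the number of elements for each key and collects the results into a local Map.

Note:

Use this method only when the resulting map is expected to be small, because the whole map is loaded into the driver's memory.

To handle very large results, consider rdd.mapValues(_ => 1L).reduceByKey(_ + _) instead, which returns an RDD (written RDD[T, Long] in the Scaladoc) rather than a Map.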

// java
public java.util.Map<K, Long> countByKey()
// scala
def countByKey(): Map[K, Long]

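import com.google.common.collect.Lists;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Map;

public class CountByKey {
    public static void main(String[] args) {
        // Local Hadoop install directory; note the escaped backslash in the Windows path.
        System.setProperty("hadoop.home.dir", "E:\\hadoop-2.7.1");
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("Spark_DEMO");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        // Six (key, value) pairs spread over 3 partitions.
        JavaPairRDD<String, String> javaPairRDD1 = sc.parallelizePairs(Lists.newArrayList(
                new Tuple2<String, String>("cat", "11"),
                new Tuple2<String, String>("dog", "22"),
                new Tuple2<String, String>("cat", "33"),
                new Tuple2<String, String>("pig", "44"),
                new Tuple2<String, String>("duck", "55"),
                new Tuple2<String, String>("cat", "66")), 3);
        Map<String, Long> key = javaPairRDD1.countByKey(); // per-key counts, collected on the driver
        for (Map.Entry<String, Long> entry : key.entrySet()) {
            System.out.println(entry.getKey() + ":" + entry.getValue());
        }
    }
}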

19/03/20 16:36:11 INFO DAGScheduler: ResultStage 1 (countByKey at CountByKey.java:23) finished in 0.093 s
19/03/20 16:36:11 INFO DAGScheduler: Job 0 finished: countByKey at CountByKey.java:23, took 1.229949 s
duck:1
cat:3
dog:1
pig:1
19/03/20 16:36:11 INFO SparkContext: Invoking stop() from shutdown hook
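The per-key counting in the example above can also be traced without a Spark installation. What follows is a minimal sketch, assuming plain Java collections stand in for the RDD. The names `CountByKeySketch`, `countByKey`, and `samplePairs` are hypothetical and are not Spark APIs. It mirrors the `mapValues(_ => 1L).reduceByKey(_ + _)` idea from the note: every pair contributes a 1, and the ones are summed per key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Spark-free model of what countByKey computes; this is NOT Spark's implementation.
class CountByKeySketch {

    // Hypothetical helper: counts how many pairs share each key.
    static Map<String, Long> countByKey(List<Map.Entry<String, String>> pairs) {
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            // "mapValues" step: the value is ignored and replaced by 1L.
            // "reduceByKey" step: the 1L is added to the running sum for this key.
            counts.merge(pair.getKey(), 1L, Long::sum);
        }
        return counts;
    }

    // The same six pairs used in the Spark example.
    static List<Map.Entry<String, String>> samplePairs() {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("cat", "11"));
        pairs.add(new SimpleEntry<>("dog", "22"));
        pairs.add(new SimpleEntry<>("cat", "33"));
        pairs.add(new SimpleEntry<>("pig", "44"));
        pairs.add(new SimpleEntry<>("duck", "55"));
        pairs.add(new SimpleEntry<>("cat", "66"));
        return pairs;
    }

    public static void main(String[] args) {
        countByKey(samplePairs()).forEach((k, v) -> System.out.println(k + ":" + v));
    }
}
```

Because the sketch collects into a `TreeMap`, the keys print in sorted order (cat:3, dog:1, duck:1, pig:1). The counts match the Spark run above, where the order of the entries differs.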

/**
 * Approximate version of countByKey that can return a partial result if it does
 * not finish within a timeout.
 *
 * The confidence is the probability that the error bounds of the result will
 * contain the true value. That is, if countApprox were called repeatedly
 * with confidence 0.9, we would expect 90% of the results to contain the
 * true count. The confidence must be in the range [0,1] or an exception will
 * be thrown.
 *
 * @param timeout maximum time to wait for the job, in milliseconds
 * @param confidence the desired statistical confidence in the result
 * @return a potentially incomplete result, with error bounds
 */

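An approximate version of countByKey that returns a partial result if the job does not finish within the given timeout.

@param timeout: the maximum time to wait for the job, in milliseconds

@param confidence: the desired statistical confidence in the result

@return a potentially incomplete result, with error bounds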

// java
public PartialResult<java.util.Map<K, BoundedDouble>> countByKeyApprox(long timeout)
public PartialResult<java.util.Map<K, BoundedDouble>> countByKeyApprox(long timeout, double confidence)
// scala
def countByKeyApprox(timeout: Long): PartialResult[Map[K, BoundedDouble]]
def countByKeyApprox(timeout: Long, confidence: Double = 0.95): PartialResult[Map[K, BoundedDouble]]
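Each value in the returned map is a BoundedDouble, that is, an estimate with error bounds rather than an exact count. The sketch below shows one way to read such a value. It assumes a hypothetical class `BoundedEstimate` that only mirrors the four fields of Spark's BoundedDouble (mean, confidence, low, high). It is not the Spark class, and the numbers are made up for illustration.

```java
// Hypothetical stand-in mirroring the fields of Spark's BoundedDouble
// (mean, confidence, low, high). It is NOT the Spark class itself.
final class BoundedEstimate {
    final double mean;        // estimated count
    final double confidence;  // probability that [low, high] contains the true count
    final double low;         // lower error bound
    final double high;        // upper error bound

    BoundedEstimate(double mean, double confidence, double low, double high) {
        // Same rule the Scaladoc states for the confidence argument: it must lie in [0, 1].
        // Written as a negated range check so that NaN is rejected as well.
        if (!(confidence >= 0.0 && confidence <= 1.0)) {
            throw new IllegalArgumentException("confidence must be in [0,1], got " + confidence);
        }
        this.mean = mean;
        this.confidence = confidence;
        this.low = low;
        this.high = high;
    }

    // True when the interval [low, high] covers the given exact count.
    boolean contains(double exactCount) {
        return low <= exactCount && exactCount <= high;
    }

    public static void main(String[] args) {
        // Illustrative numbers only, not output from a real Spark job.
        BoundedEstimate cat = new BoundedEstimate(3.0, 0.95, 2.0, 4.0);
        System.out.println("cat: about " + cat.mean + ", between " + cat.low + " and " + cat.high
                + " with confidence " + cat.confidence);
        System.out.println("covers the exact count 3? " + cat.contains(3.0));
    }
}
```

With confidence 0.95, roughly 95 out of 100 such intervals would be expected to contain the true count, which is the interpretation the Scaladoc gives for the confidence parameter.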
