2022-05-30
Journeys in big data statistics
School of Mathematical Sciences, University of Nottingham, UK
Abstract
The realm of big data is a very wide and varied one. We discuss old, new, small and big data, with some of the important challenges including dealing with highly-structured and object-oriented data. In many applications the objective is to discern patterns and learn from large datasets of historical data. We shall discuss such issues in some transportation network applications in non-academic settings, which are naturally applicable to other situations. Vital aspects include dealing with logistics, coding and choosing appropriate statistical methodology, and we provide a summary and checklist for wider implementation.
Keywords: Big data; Object-oriented data; Transport; Networks
1. A new natural resource
We will be the first to admit that it is difficult to keep up. How can you expect someone who is trained in dealing with datasets of n = 30 observations with p = 3 variables to suddenly cope with a 100 K-fold increase of n = 3 000 000 observations and p = 300 000 for example, or even worse? Everything has to change. Summarizing a dataset becomes a major computational challenge and p-values take on a ludicrous role where everything is significant. Yet dealing with a wide range of sizes of datasets has become vital for the modern statistician.
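To illustrate why p-values lose their usual meaning at this scale, the following is a minimal sketch, assuming a one-sample two-sided z-test with known standard deviation. The effect size of 0.01 standard deviations and the function name are illustrative assumptions, not values from the text. The same negligible effect is unremarkable at n = 30 and overwhelmingly "significant" at n = 3 000 000.

```python
import math


def two_sided_z_pvalue(effect, sd, n):
    """Two-sided p-value for a one-sample z-test of H0: mean = 0.

    `effect` is the observed sample mean, `sd` the (assumed known)
    standard deviation and `n` the sample size.
    """
    z = effect / (sd / math.sqrt(n))
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical, practically negligible effect: 0.01 standard deviations.
effect, sd = 0.01, 1.0
p_small = two_sided_z_pvalue(effect, sd, 30)
p_large = two_sided_z_pvalue(effect, sd, 3_000_000)

print(f"n = 30:        p = {p_small:.3f}")
print(f"n = 3 000 000: p = {p_large:.3e}")
```

The test statistic grows with the square root of n, so with millions of observations an effect size or interval estimate is usually more informative than the p-value itself.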
Virginia Rometty, chairman, president and chief executive officer of IBM, said the following at Northwestern University’s 157th commencement ceremony in 2015: “What steam was to the 18th century, electricity to the 19th and hydrocarbons to the 20th, data will be to the 21st century. That’s why I call data a new natural resource.”
The need to make sense of the huge rich seams of data being produced underlines the great importance of Statistics, Mathematical and Computational Sciences in today’s society. But what is ‘new’ about data? Data has been used for centuries: for example, data collected on the first bloom of cherry blossoms in Kyoto, Japan, starting in 800 AD and now highlighting climate change (Aono, 2017); Gauss’ meridian arc measurements in 1799 used to define the metre (Stigler, 1981); and Florence Nightingale’s 1859 mortality data and graphical rose diagram presentation on causes of death in the Crimean War, leading to modern nursing practice (Nightingale, 1859). All of these old, small datasets are at the core of important issues for mankind, so it is not the data or its importance but the size, structure and ubiquity of data that is new.
Many of the challenges in the new world of Statistics in the Age of Big Data are of a different nature from traditional scenarios. Statisticians are used to dealing with bias and uncertainty, but how can this be handled when datasets are so large and collected in the wild without traditional sampling protocols? What to do with all the data is an important question. The last 20 years have seen an explosion of statistical methodology to handle large p, often with sparsity assumptions (Hastie et al., 2015). Large n used to be the realm of careful asymptotic theory or thought experiments, but in reality one often does encounter large n now in practice.
Two possible routes to practical inference are conditioning and sampling. Conditioning on a small window of values of a subset of covariates will very quickly reduce the size of data available as the number of covariates increases, due to the curse of dimensionality. Such small subsets of the dataset can be used to estimate predictive distributions conditional on the values of the covariates, leading to useful predictions. We give some further detail below in a case study from the transport industry. Sampling sensibly on the other hand is a more difficult task. Although it is straightforward to sample at random of course, given the inherent biases in most big data one needs to carry out sampling to counteract the bias in the data collection.
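The effect of conditioning on a window of covariate values can be sketched with a small simulation. This is an illustration only, assuming independent covariates uniform on [0, 1] and a window of width 0.2 around 0.5 in each conditioned covariate. The sample size, window and helper name are assumptions made for the example. Under these assumptions the retained fraction is about 0.2^k when conditioning on k covariates.

```python
import random


def count_in_window(data, centre, half_width, k):
    """Count rows whose first k covariates all lie within
    half_width of centre (k = 0 keeps every row)."""
    return sum(
        all(abs(row[j] - centre) <= half_width for j in range(k))
        for row in data
    )


random.seed(0)  # fixed seed so the sketch is reproducible
n, p = 50_000, 6
# Simulated dataset: n rows of p independent Uniform(0, 1) covariates.
data = [[random.random() for _ in range(p)] for _ in range(n)]

# Condition on a window of width 0.2 in the first k covariates.
counts = [count_in_window(data, 0.5, 0.1, k) for k in range(p + 1)]
for k, c in enumerate(counts):
    print(f"k={k}: {c} rows retained (expected about {n * 0.2 ** k:.0f})")
```

Each additional conditioned covariate keeps roughly a fifth of the remaining rows, so even tens of thousands of observations shrink to a handful after six covariates.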
A further aspect of the avalanche of new data being available is that it is often highly-structured. For example, large quantities of medical images are routinely collected each day in hospitals around the world, each containing highly complicated structured information. The emerging area of Object Oriented Data Analysis (Marron and Alonso, 2014) provides a new way of thinking of statistical analysis for such data. Examples of object data include functions, images, shapes, manifolds, dynamical systems, and trees. The main aims of multivariate analysis extend more generally to object data, e.g. defining a distance between objects, estimation of a mean, summarizing variability, reducing dimension to important components, specifying distributions of objects, carrying out hypothesis tests, prediction, classification and clustering.
Following Marron and Alonso (2014), an important consideration in any study is to decide what the atoms (most basic parts) of the data are. A key question is ‘what should be the data objects?’, and the answer will then lead to appropriate methodology for statistical analysis. The subject is fast developing following initial definitions in Wang and Marron (2007), and a recent summary with discussion is given by Marron and Alonso (2014) with applications to Spanish human mortality functional data, shapes, trees and medical images. One of the key aspects of object data analysis is that registration of the objects must be considered as part of the analysis. In addition, the identifiability of models, choice of regularization and whether to marginalize or optimize as part of the inference are important aspects of object data analysis, as they are in statistical shape analysis (Dryden and Mardia, 2016).
It is obvious that the realm of big data is a very wide and varied one. In some realms the difficulties lie with truly astronomical quantities of data which are not even feasibly stored for future retrieval, for which online algorithm development is a key area of research; whereas in other realms the challenge is in discerning patterns and learning from large datasets of historical data. We shall discuss the latter, in generality, below for what can loosely be thought of as transportation network applications in non-academic settings. Many of the approaches and recommendations discussed below are naturally applicable to other applications, such as a general practice of data retention, while others related to origin–destination filtering are clearly more specific to transportation problems.
2. Case study: transportation big data
The classification of problems into different areas of interest can be greatly beneficial in allowing techniques of particular relevance to all problems in a particular area to be discussed as one. The contemporary challenge we shall now discuss surrounds the use of statistics in real-world infrastructure problems that can arise for public or mass transportation, such as train travel, bus travel, or similar networked transportation methods. Collaborations between universities and businesses up and down the country already exist, and will continue to grow in the coming years for trying to share best practices and perform statistical analysis on datasets harvested by businesses about their customers, to either improve customer experience or to improve business efficiency. We concern ourselves here with the challenges one will meet in embedding good practice and developing useful models for exploitation of data in businesses where perhaps even the initial data handling task has so far seemed daunting.
Studying transportation systems as networked queues has been one of the most natural approaches, born out of the queueing theory literature of previous decades. Courtesy of advances in computing, attempts are now made to ‘solve’, or at least approximately solve, larger and larger network problems. Much of the focus in recent years lies with proposing online algorithms for live traffic management. With big data, opportunities arise to try to optimize these local dynamic decision problems: re-routing a vehicle; skipping stops (if permitted); or allocating platforms, all in light of a wealth of additional statistical information. Approaches to dynamic resource allocation laid out in Glazebrook et al. (2014) would often benefit from a serious statistical analysis to first properly understand the dynamics of a network-based model, so that when formulating the problem in a queueing framework an appropriate level of confidence can be placed on the stochastic quantities. In particular, if you were to consider traffic management decisions on a railway surrounding the choice of platforms or use of signals outside a busy station, an effective algorithm for allocating the resource that is the station platform at a particular time can only function with a well-calibrated cost function which accounts for knock-on effects of such a decision. Possessing years of historical data, during which a wealth of such decisions have been made and their consequences mapped, leads us very naturally to first want to perform some robust statistical analyses.
For statistical analyses, the natural starting point for a statistician is gaining access to the appropriate historical data. There are already two large data types of interest: customer-centric journey counts and vehicle journeys. In the world of buses, for example, it is estimated that over 5 billion passenger journeys occur each year in the UK (Department of Transport, UK, 2016); the alternative approach lies with vehicle logging data. For our discussion we shall concern ourselves more with logged vehicle movements, such as the approximately three million individual train movements which are logged on a given day. For both buses and trains there is far more information available than just logged movements of stops, departures, and waypoint visits. These can include signal changes, platform assignments, detours, engine types, vehicle capacities or raw passenger numbers. A few million daily datapoints is certainly not as large as some datasets; however, the ability to store and then later quickly access and filter a database over a considerable time period can still become a non-trivial challenge. The typical format of vehicle data is broken down across the regions of a country, but sometimes even individual journeys may span more than one region. Considerable data cleaning may also be required to remove duplication, to collate, and at times even to resolve issues of contradictory data logged by different systems or network operators. Each logged message can also typically contain tens of covariates indicating information ranging from the current time and present location of a vehicle, to its previous location, intended destination, top speed, personal capacity, and even properties like the engine type.
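The cleaning step described above (removing duplication and surfacing contradictory records from different systems) can be sketched as follows. This is a minimal illustration, not a description of any operator's real pipeline: the message fields (`vehicle_id`, `date`, `location`, `event`, `time`, `source`) and the sample records are hypothetical.

```python
from collections import defaultdict


def clean_messages(messages):
    """Drop repeated reports of the same event and separate out events
    for which different reports disagree on the logged time.

    Returns (clean, conflicts): `clean` holds one message per consistent
    event, `conflicts` holds groups of disagreeing messages for review.
    """
    seen = set()
    by_event = defaultdict(list)
    for msg in messages:
        event_key = (msg["vehicle_id"], msg["date"], msg["location"], msg["event"])
        record = event_key + (msg["time"],)
        if record in seen:
            continue  # same event and same time already logged elsewhere
        seen.add(record)
        by_event[event_key].append(msg)

    clean, conflicts = [], []
    for group in by_event.values():
        if len(group) == 1:
            clean.append(group[0])
        else:
            conflicts.append(group)  # differing times for one event
    return clean, conflicts


# Hypothetical log messages from two logging systems, A and B.
messages = [
    {"vehicle_id": "T1", "date": "2017-03-01", "location": "Leicester",
     "event": "arrive", "time": "10:02", "source": "A"},
    {"vehicle_id": "T1", "date": "2017-03-01", "location": "Leicester",
     "event": "arrive", "time": "10:02", "source": "B"},  # duplicate
    {"vehicle_id": "T1", "date": "2017-03-01", "location": "Bedford",
     "event": "arrive", "time": "10:40", "source": "A"},
    {"vehicle_id": "T1", "date": "2017-03-01", "location": "Bedford",
     "event": "arrive", "time": "10:47", "source": "B"},  # contradiction
]

clean, conflicts = clean_messages(messages)
print(len(clean), "clean messages;", len(conflicts), "conflicting events")
```

Contradictory records are returned rather than silently resolved, since the right resolution rule (trusting one system, averaging, or discarding) depends on the application.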
Thus the first challenge is to place the data into a file structure robust to future data collection, but readily accessible for planned statistical analyses. When this data comfortably runs into the hundreds of gigabytes this is not a small issue. We imagine that in many real-world scenarios, just putting in place the procedures to store large quantities of log data, which are readily available contemporaneously but not intended for immediate use, is already a positive step towards future-proofing a company against technological change across a range of sectors. In some domains such as social media, there exists a range of companies offering archiving and filtered searching facilities as a service, generally for marketing or research purposes. However, reliance on outsourcing is likely not the preferred route in many industries, especially given the rate of attrition among some of these third-party service providers. Taking these database creation exercises in-house, and embedding the requisite expertise for later retrieval, is already a positive benefit of big data thinking.
When it comes to handling large datasets some of the go-to tools of the research statistician such as the R software (R Core Team, 2017) need to be handled with care. Whilst packages exist to manage hardware issues such as large memory footprints, at every stage consideration needs to be given to repeated database access and careful filtering to ensure that analysis is only performed on sets of data which are not excessively large. In transport applications, this often means filtering journeys by their origin–destination pair, or for spatial analysis, aggregating datapoints over small geographical cells, for a large number of different cells. In removing all messages not directly related to vehicles travelling between a specific Origin–Destination (OD) pair one can concentrate on identifying recurrent behaviours and patterns. Unfortunately, in transport networks there may be other vehicles which interact with vehicles on this particular route but which do not share the same OD pair. Thus a combination of filtering approaches may be required. Many natural approaches to transport, therefore, try to model the characteristics of each individual piece of infrastructure, as a function of a range of up to fif or so covariates of vehicles that pass through it. This can then be used either for infrastructure assessment, or as part of live vehicle prediction modelling. Of more prevalent use in the business world are data visualization tools like Tableau for creating slick graphics for presentations. With big data, however, there is very often the need to have already performed some careful database filtering and covariate selection to ensure appropriately-sized and relevant datasets are used to create the visualizations.
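The filtering described above can be sketched with an indexed database query, so that only a small, relevant subset ever reaches the analysis software. The example uses Python's standard `sqlite3` module with an in-memory database. The table name, column names and station codes are hypothetical, chosen only for illustration.

```python
import sqlite3

# In-memory database standing in for a large on-disk archive.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE movements (
           vehicle_id TEXT, origin TEXT, destination TEXT,
           location TEXT, delay_min REAL)"""
)
# Indexes let the database answer filters without scanning every row.
conn.execute("CREATE INDEX idx_od ON movements (origin, destination)")
conn.execute("CREATE INDEX idx_location ON movements (location)")

rows = [  # hypothetical logged messages
    ("T1", "NOT", "LDN", "Leicester", 2.0),
    ("T1", "NOT", "LDN", "Bedford", 4.0),
    ("T2", "DBY", "LDN", "Leicester", 0.0),
    ("T3", "NOT", "LDN", "Leicester", 6.0),
    ("T4", "LDN", "NOT", "Bedford", 1.0),
]
conn.executemany("INSERT INTO movements VALUES (?, ?, ?, ?, ?)", rows)


def od_subset(conn, origin, destination):
    """Messages from vehicles travelling between one OD pair."""
    cur = conn.execute(
        "SELECT vehicle_id, location, delay_min FROM movements "
        "WHERE origin = ? AND destination = ? ORDER BY vehicle_id, rowid",
        (origin, destination),
    )
    return cur.fetchall()


def location_subset(conn, location):
    """All messages at one piece of infrastructure, whatever the OD pair,
    to capture vehicles that interact with the route of interest."""
    cur = conn.execute(
        "SELECT vehicle_id, origin, destination, delay_min FROM movements "
        "WHERE location = ? ORDER BY vehicle_id, rowid",
        (location,),
    )
    return cur.fetchall()


od_rows = od_subset(conn, "NOT", "LDN")
leicester_rows = location_subset(conn, "Leicester")
print(od_rows)
print(leicester_rows)
```

The second query illustrates why a combination of filters may be needed: vehicle T2 does not share the OD pair of interest, yet it passes through the same location and would be missed by OD filtering alone.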
A number of academic challenges arise when trying to use such enormous quantities of data to make single predictions, or evaluations, for example in trying to predict the future lateness of a particular vehicle’s on-going journey. Identification of explanatory variables, along with appreciation of the physical mechanisms will often drive an appropriate model choice to distil the enormous datasets into appropriate predictions or summary data. Deciding whether to chain the use of linear models across a network of (assumed) independent locations, or to attempt kernel regression methods to weight a much larger number of proximate locations for the same purpose is not always easily determined. Indeed the most flexible approach obviously lies in maintaining the ability to try as many plausible avenues as possible, without having to make compromises to accommodate the potentially overwhelmingly large numbers of datapoints that may arise without appropriate data filtering.
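As a sketch of the kernel regression option mentioned above, the following uses a Nadaraya-Watson estimator with a Gaussian kernel to weight observations from nearby locations when predicting lateness at a target location. It is an assumption-laden illustration rather than the authors' model: the single covariate (distance along a route), the bandwidth and the data are all hypothetical.

```python
import math


def kernel_regression(x0, xs, ys, bandwidth):
    """Nadaraya-Watson estimate at x0: a weighted average of ys, with
    Gaussian weights that decay with the distance of each x from x0."""
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("bandwidth too small: all weights underflowed")
    return sum(w * y for w, y in zip(weights, ys)) / total


# Hypothetical data: distance along a route (km) and lateness (minutes).
distance_km = [0.0, 10.0, 20.0, 30.0, 40.0]
lateness_min = [1.0, 2.0, 3.0, 4.0, 5.0]

# Predicted lateness at the 20 km point, weighting proximate locations.
estimate = kernel_regression(20.0, distance_km, lateness_min, bandwidth=10.0)
print(f"estimated lateness at 20 km: {estimate:.2f} min")
```

The bandwidth controls how many proximate locations contribute: a small value approaches a purely local estimate, while a large value approaches the overall mean. Choosing it is part of the modelling decision discussed above.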