STEM Jottings: Classical Mechanics: The Art of Simulation【Gadgets】VIII — "Big Data" VI

How should one understand a nation's wealth? If every industry flourishes, markets prosper, and people's incomes rise year after year, is that enough to guarantee such wealth? Or is the country not merely wealthy, but striding down the broad road toward wealth and power? People need reasons and data to explain what is happening now, and they want a crystal ball to foresee future trends even more! Since things are always changing in the stream of time, people are constantly driven to study the

Time Series

A time series (time series) is a statistical method used in empirical economics.

Meaning

A time series is a set of random variables ordered in time. Gross domestic product (GDP), the consumer price index (CPI), the Taiwan Capitalization Weighted Stock Index (TAIEX), interest rates, and exchange rates are all time series.

The time interval of a time series can be minutes or seconds (as in high-frequency financial data), or days, weeks, months, quarters, years, or even larger units of time.
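As a small illustration (not from the original text) of the same series viewed at different time intervals, here is a sketch using pandas resampling; the dates and values are made up for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical daily series covering the first quarter of 2023.
idx = pd.date_range("2023-01-01", "2023-03-31", freq="D")
daily = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# The same data re-expressed at coarser time intervals.
monthly = daily.resample("MS").mean()    # month-start buckets
quarterly = daily.resample("QS").mean()  # quarter-start buckets

print(len(daily), len(monthly), len(quarterly))  # 90 days, 3 months, 1 quarter
```

The choice of interval changes what the series can show: daily data reveal short-run noise, while monthly or quarterly aggregates emphasize the trend.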

Time series are one of the three major forms of data studied in econometrics (the other two being cross-sectional data and panel data), and are widely applied in macroeconomics, international economics, finance, financial engineering, and related disciplines.

Characteristics of time-series variables

  • Nonstationarity (also rendered non-stationarity or instability): the variance of the time series does not settle into a long-run trend that eventually converges to a constant or to a linear function
  • Time-varying volatility: the variance of a time-series variable changes as time passes

These two characteristics make effective analysis of time-series variables quite difficult.

A stationary time series is one whose statistical properties do not change over time.

Assumptions of traditional econometrics

  1. Time-series variables are assumed to be drawn at random from some stochastic process and arranged in time order, so that a (narrowly defined) stationary trend (stationarity) must exist, i.e. the mean is fixed.
  2. The volatility of a time-series variable is assumed not to change over time, i.e. the variance is fixed. This plainly contradicts reality: it has long been observed that the volatility of stock returns varies over time rather than staying constant.

These two assumptions leave traditional econometric methods unable to analyze real-world time-series variables effectively. The contributions of Clive Granger and Robert Engle solved this problem.
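The two failures named above, a nonstationary variance and time-varying volatility, are easy to see in simulation. A minimal sketch (illustrative, not from the original text): white noise is stationary, while its cumulative sum, a random walk, has a variance that grows linearly with time:

```python
import numpy as np

rng = np.random.default_rng(0)
reps, n = 2000, 200

# White noise: Var(e_t) = 1 at every t (stationary).
noise = rng.standard_normal((reps, n))
# Random walk y_t = e_1 + ... + e_t: Var(y_t) = t (nonstationary).
walk = noise.cumsum(axis=1)

# Cross-sectional variance at t = 10 and t = 200, over 2000 replications.
print(noise[:, 9].var(), noise[:, -1].var())  # both close to 1
print(walk[:, 9].var(), walk[:, -1].var())    # close to 10 and to 200
```

Fitting a constant-mean, constant-variance model to the second series would be exactly the mistake the traditional assumptions invite.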

Especially today, in the era of big data, people yearn to master methods that can turn raw information into gold!

Time series: random data plus trend, with best-fit line and different applied filters

─── 《時間序列︰從微觀到巨觀》 (Time Series: From Micro to Macro)

 

Since the author has already written a whole series of 《時間序列︰ □ □ 》 (Time Series: …) posts, there is no need to repeat them here. Instead, this post quotes from 《時間序列︰安斯庫姆四重奏》 (Time Series: Anscombe's Quartet) to stress the importance of 'visualization':

'Seeing is believing': understanding both the local and the global relationships within statistical data. Is this not the main theme of

Anscombe's Quartet

Anscombe's quartet comprises four data sets with essentially identical basic statistical properties that nevertheless look completely different when graphed. Each data set consists of eleven (x, y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate the importance of graphing data before analyzing it, and the dramatic effect outliers can have on statistical results.

[Figure] The four data sets of Anscombe's quartet, plotted

The four data sets share the following statistical properties:

Property                       Value
Mean of x                      9
Sample variance of x           11
Mean of y                      7.50 (to 2 decimal places)
Sample variance of y           4.122 or 4.127 (to 3 decimal places)
Correlation between x and y    0.816 (to 3 decimal places)
Linear regression line         y = 3.00 + 0.500x (to 2 and 3 decimal places respectively)

Of the four plots, the one drawn from the first data set (top left) looks the most "normal", showing a correlation between two random variables. The plot of the second data set (top right) clearly shows that the relationship between the variables is nonlinear. In the third (bottom left), a linear relationship does exist, but a single outlier shifts the regression line and lowers the correlation coefficient from 1 to 0.816. Finally, in the fourth example (bottom right), there is no linear relationship between the two variables, yet a single outlier alone is enough to produce a high correlation coefficient.

Edward Tufte uses Anscombe's quartet on the first page of his book The Visual Display of Quantitative Information to demonstrate the importance of graphing data.

The exact values of the four data sets are listed below. The x values of the first three data sets are identical.

Anscombe's quartet
  I: x    y     II: x    y     III: x    y     IV: x    y
10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84
11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47
14.0 9.96 14.0 8.10 14.0 8.84 8.0 7.04
6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25
4.0 4.26 4.0 3.10 4.0 5.39 19.0 12.50
12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56
7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91
5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89
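The shared statistics claimed above can be checked directly from this table. A quick numpy verification (the variable names are my own):

```python
import numpy as np

# The four data sets exactly as tabulated above; the first three share x.
x123 = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
x4 = np.array([8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0])
ys = [
    np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]
xs = [x123, x123, x123, x4]

for x, y in zip(xs, ys):
    r = np.corrcoef(x, y)[0, 1]
    slope, intercept = np.polyfit(x, y, 1)  # least-squares line y = a + b*x
    print(f"mean(x)={x.mean():.0f}  var(x)={x.var(ddof=1):.0f}  "
          f"mean(y)={y.mean():.2f}  r={r:.3f}  "
          f"y = {intercept:.2f} + {slope:.3f}x")
```

All four rows print the same summary, which is exactly Anscombe's point: the numbers agree, the pictures do not.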

 

quoted above? Those who take it to heart will come to appreciate the importance of the relationships among concepts, and of their proper ordering!

 

Come to think of it, the idea of 'data graphics' is an old one:

Exploratory data analysis

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA),[1] which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, and handling missing values and making transformations of variables as needed. EDA encompasses IDA.

Overview

Tukey defined data analysis in 1961 as: “[P]rocedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.”[2]

Tukey's championing of EDA encouraged the development of statistical computing packages, especially S at Bell Labs. The S programming language inspired the systems S-PLUS and R. This family of statistical-computing environments featured vastly improved dynamic visualization capabilities, which allowed statisticians to identify outliers, trends and patterns in data that merited further study.

Tukey's EDA was related to two other developments in statistical theory: robust statistics and nonparametric statistics, both of which tried to reduce the sensitivity of statistical inferences to errors in formulating statistical models. Tukey promoted the use of the five-number summary of numerical data—the two extremes (maximum and minimum), the median, and the quartiles—because the median and quartiles, being functions of the empirical distribution, are defined for all distributions, unlike the mean and standard deviation; moreover, the quartiles and median are more robust to skewed or heavy-tailed distributions than traditional summaries (the mean and standard deviation). The packages S, S-PLUS, and R included routines using resampling statistics, such as Quenouille and Tukey's jackknife and Efron's bootstrap, which are nonparametric and robust (for many problems).
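The five-number summary and the bootstrap mentioned here both take only a few lines of numpy. A sketch (note that `np.percentile` interpolates linearly between order statistics, which can differ slightly from Tukey's original "hinges"):

```python
import numpy as np

def five_number_summary(data):
    """Min, lower quartile, median, upper quartile, max."""
    a = np.sort(np.asarray(data, dtype=float))
    q1, med, q3 = np.percentile(a, [25, 50, 75])
    return (float(a[0]), float(q1), float(med), float(q3), float(a[-1]))

print(five_number_summary([1, 3, 5, 7, 9, 11, 13]))
# (1.0, 4.0, 7.0, 10.0, 13.0)

# A tiny bootstrap: resample with replacement to estimate the sampling
# variability of the median, with no distributional model assumed.
rng = np.random.default_rng(0)
sample = rng.exponential(size=100)
boot = np.median(rng.choice(sample, size=(1000, sample.size)), axis=1)
print(round(float(boot.std()), 3))  # bootstrap standard error of the median
```

Both are in the robust, nonparametric spirit described above: they rely only on the empirical distribution of the data.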

Exploratory data analysis, robust statistics, nonparametric statistics, and the development of statistical programming languages facilitated statisticians’ work on scientific and engineering problems. Such problems included the fabrication of semiconductors and the understanding of communications networks, which concerned Bell Labs. These statistical developments, all championed by Tukey, were designed to complement the analytic theory of testing statistical hypotheses, particularly the Laplacian tradition’s emphasis on exponential families.[3]

Development

John W. Tukey wrote the book Exploratory Data Analysis in 1977.[4] Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test. In particular, he held that confusing the two types of analyses and employing them on the same set of data can lead to systematic bias owing to the issues inherent in testing hypotheses suggested by the data.

The objectives of EDA are to:

  • Suggest hypotheses about the causes of observed phenomena
  • Assess assumptions on which statistical inference will be based
  • Support the selection of appropriate statistical tools and techniques
  • Provide a basis for further data collection through surveys or experiments

Many EDA techniques have been adopted into data mining, as well as into big data analytics.[6] They are also being taught to young students as a way to introduce them to statistical thinking.[7]

Data science process flowchart

 

A whole basketful of techniques:

Techniques

There are a number of tools that are useful for EDA, but EDA is characterized more by the attitude taken than by particular techniques.[8]

Typical graphical techniques used in EDA are:

  • Box plot
  • Histogram
  • Multi-vari chart
  • Run chart
  • Pareto chart
  • Scatter plot
  • Stem-and-leaf plot
  • Parallel coordinates
  • Odds ratio
  • Targeted projection pursuit
  • Glyph-based visualization
  • Interactive versions of these plots

Typical quantitative techniques are:

  • Median polish
  • Trimean
  • Ordination
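For a concrete taste of the quantitative side, here is a sketch of Tukey's median polish, which decomposes a two-way table into an overall level, row effects, column effects, and residuals (the function name and iteration count are my own choices):

```python
import numpy as np

def median_polish(table, n_iter=10):
    """Tukey's median polish: table ~= overall + row[i] + col[j] + resid[i, j].
    Alternately sweeps medians out of the rows and columns of the residuals."""
    resid = np.array(table, dtype=float)
    overall = 0.0
    row = np.zeros(resid.shape[0])
    col = np.zeros(resid.shape[1])
    for _ in range(n_iter):
        rm = np.median(resid, axis=1)   # sweep row medians into row effects
        resid -= rm[:, None]
        row += rm
        d = np.median(col)              # recentre column effects
        col -= d
        overall += d
        cm = np.median(resid, axis=0)   # sweep column medians into col effects
        resid -= cm
        col += cm
        d = np.median(row)              # recentre row effects
        row -= d
        overall += d
    return overall, row, col, resid

t = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
overall, row, col, resid = median_polish(t)
print(overall, row, col)  # overall 5.0, rows [-3, 0, 3], cols [-1, 0, 1]
```

Because medians rather than means are swept out, the decomposition is resistant to outliers, which is exactly the robust flavor of Tukey's EDA.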

 

This, too, is the design intent of bqplot!