不能为分析而分析,不能单纯为介绍模型而介绍,要把模型建立的依据写上,就是要添加探索性分析,线性模型是广义线性模型的特殊情况

1. 线性模型

分类数据回归,方差分析,假设检验问题其实都是特殊的线性模型。如果你是统计科班出身,对于这一点应该是觉得理所当然!故而后续不再刻意区分,统一划分在线性模型这个大框架下。

PlantGrowth 数据集最初来自书籍 Annette J. Dobson. 1983. An Introduction to Generalized Linear Models. London: Chapman and Hall. https://doi.org/10.1007/978-1-4899-3174-0 第9页,后来打包在 R 包 datasets 中。在两种不同的实验条件下,比较产量(用植物的干重来衡量)。首先观察一下收集到的数据集

head(PlantGrowth)
  weight group
1   4.17  ctrl
2   5.58  ctrl
3   5.18  ctrl
4   6.11  ctrl
5   4.50  ctrl
6   4.61  ctrl
str(PlantGrowth)
'data.frame':   30 obs. of  2 variables:
 $ weight: num  4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
 $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...

weight 是数值型的向量,group 是因子型的分类变量

library(ggplot2)
png(filename = "PlantGrowth-boxplot.png", width = 1200, height = 1000, res = 200, type = "cairo")
ggplot(data = PlantGrowth, aes(x = group, y = weight, color = group)) + 
  geom_boxplot(show.legend = FALSE) +
  geom_jitter(show.legend = FALSE) +
  theme_minimal() +
  labs(y = "Dried weight of plants")
dev.off()
# 或者
boxplot(weight ~ group,
  data = PlantGrowth,
  ylab = "Dried weight of plants", col = "lightgray",
  notch = FALSE, varwidth = TRUE
)

在不同实验条件下,植物产量的比较

1.1 lm 回归

lm.D9 <- lm(weight ~ group, data = PlantGrowth)
summary(lm.D9)
Call:
lm(formula = weight ~ group, data = PlantGrowth)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0710 -0.4180 -0.0060  0.2627  1.3690 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.0320     0.1971  25.527   <2e-16 ***
grouptrt1    -0.3710     0.2788  -1.331   0.1944    
grouptrt2     0.4940     0.2788   1.772   0.0877 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6234 on 27 degrees of freedom
Multiple R-squared:  0.2641,    Adjusted R-squared:  0.2096 
F-statistic: 4.846 on 2 and 27 DF,  p-value: 0.01591

不包含截距项

lm.D90 <- lm(weight ~ group - 1, data = PlantGrowth) # omitting intercept
summary(lm.D90)
Call:
lm(formula = weight ~ group - 1, data = PlantGrowth)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0710 -0.4180 -0.0060  0.2627  1.3690 

Coefficients:
          Estimate Std. Error t value Pr(>|t|)    
groupctrl   5.0320     0.1971   25.53   <2e-16 ***
grouptrt1   4.6610     0.1971   23.64   <2e-16 ***
grouptrt2   5.5260     0.1971   28.03   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6234 on 27 degrees of freedom
Multiple R-squared:  0.9867,    Adjusted R-squared:  0.9852 
F-statistic: 665.5 on 3 and 27 DF,  p-value: < 2.2e-16

1.1.1 模型输出

有时候,我们需要把模型结果转化成 LaTeX 表格,借助 xtable 包,我们可以很容易实现这一点。

library(xtable)
xtable(anova(lm.D9))
% latex table generated in R 3.6.1 by xtable 1.8-4 package
% Sun Aug 11 21:52:26 2019
\begin{table}[ht]
\centering
\begin{tabular}{lrrrrr}
  \hline
 & Df & Sum Sq & Mean Sq & F value & Pr($>$F) \\ 
  \hline
group & 2 & 3.77 & 1.88 & 4.85 & 0.0159 \\ 
  Residuals & 27 & 10.49 & 0.39 &  &  \\ 
   \hline
\end{tabular}
xtable(summary(lm.D9))
% latex table generated in R 3.6.1 by xtable 1.8-4 package
% Sun Aug 11 21:52:39 2019
\begin{table}[ht]
\centering
\begin{tabular}{rrrrr}
  \hline
 & Estimate & Std. Error & t value & Pr($>$$|$t$|$) \\ 
  \hline
(Intercept) & 5.0320 & 0.1971 & 25.53 & 0.0000 \\ 
  grouptrt1 & -0.3710 & 0.2788 & -1.33 & 0.1944 \\ 
  grouptrt2 & 0.4940 & 0.2788 & 1.77 & 0.0877 \\ 
   \hline
\end{tabular}
\end{table}

有时候面向 Markdown 格式输出,借助 broom 包可以将 lm 对象转化为 tibble 类型

library(broom)
tidy(lm.D9)
# A tibble: 3 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    5.03      0.197     25.5  1.94e-20
2 grouptrt1     -0.371     0.279     -1.33 1.94e- 1
3 grouptrt2      0.494     0.279      1.77 8.77e- 2

进一步转化为表格

knitr::kable(tidy(lm.D9))
term estimate std.error statistic p.value
(Intercept) 5.032 0.1971284 25.526514 0.0000000
grouptrt1 -0.371 0.2787816 -1.330791 0.1943879
grouptrt2 0.494 0.2787816 1.771996 0.0876817

1.1.2 方差分析

library(ggplot2)
ggplot(data = PlantGrowth, aes(x = group, y = weight, color = group)) +
  geom_point(show.legend = FALSE)

模型结构,将方差分析纳入线性模型的范畴来介绍

1.2 brm 回归

贝叶斯框架下,方差分析结构,计算问题 BANOVA

惩罚极大似然和贝叶斯回归的关系

brmsPaul-Christian Bürkner 开发,提供了 Stan 的 R 接口,求解统计模型的接口调用方式和 R 的公式语法基本一致,相比于 rstanrstanarm 使用更加方便,因为它内置了大量的统计模型的实现,将公式语法翻译成 Stan 编码的模型,然后调用 rstan 翻译成 C++ 最后编译成动态链接库,下面就以此为基础介绍各个模型的实现

library(brms)
lm.brm.D9 <- brm(weight ~ group, data = PlantGrowth)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: weight ~ group 
   Data: PlantGrowth (Number of observations: 30) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept     5.03      0.20     4.62     5.42       3431 1.00
grouptrt1    -0.37      0.29    -0.95     0.21       3412 1.00
grouptrt2     0.49      0.29    -0.08     1.06       3719 1.00

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     0.65      0.10     0.50     0.88       3322 1.00

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1).
# 同样不包含截距项
lm.brm.D90 <- brm(weight ~ group - 1, data = PlantGrowth)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: weight ~ group - 1 
   Data: PlantGrowth (Number of observations: 30) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
groupctrl     5.03      0.21     4.61     5.45       2992 1.00
grouptrt1     4.66      0.21     4.25     5.06       3870 1.00
grouptrt2     5.53      0.21     5.10     5.95       3538 1.00

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     0.66      0.10     0.50     0.88       3395 1.00

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1).

1.2.1 模型输出

类似地,同样使用 broom 包清理模型拟合结果

library(broom)
tidy(lm.brm.D9)
         term    estimate  std.error        lower       upper
1 b_Intercept   5.0341992 0.20320253   4.70126340   5.3635147
2 b_grouptrt1  -0.3693260 0.29481164  -0.86360136   0.1050397
3 b_grouptrt2   0.4907606 0.28962231   0.01886856   0.9725647
4       sigma   0.6546244 0.09641419   0.51699801   0.8290725
5        lp__ -35.3832740 1.49849200 -38.30753264 -33.6299429
knitr::kable(tidy(lm.brm.D9))
term estimate std.error lower upper
b_Intercept 5.0341992 0.2032025 4.7012634 5.3635147
b_grouptrt1 -0.3693260 0.2948116 -0.8636014 0.1050397
b_grouptrt2 0.4907606 0.2896223 0.0188686 0.9725647
sigma 0.6546244 0.0964142 0.5169980 0.8290725
lp__ -35.3832740 1.4984920 -38.3075326 -33.6299429
plot(lm.brm.D9)

参数后验概率密度图

plot(marginal_effects(lm.brm.D9, effects = "group"))

边际效应

2. 广义线性模型

补充背景介绍、数据集描述、模型结果解释(回归问题)、模型输出结果如何看(统计符号含义)

2.1 酒精数据

存在有序分类数据

酒精的作用 effects of alcohol, tobacco and interaction, age-adjusted 数据集描述见 help(esoph)

head(esoph)
  agegp     alcgp    tobgp ncases ncontrols
1 25-34 0-39g/day 0-9g/day      0        40
2 25-34 0-39g/day    10-19      0        10
3 25-34 0-39g/day    20-29      0         6
4 25-34 0-39g/day      30+      0         5
5 25-34     40-79 0-9g/day      0        27
6 25-34     40-79    10-19      0         7
str(esoph)
'data.frame':   88 obs. of  5 variables:
 $ agegp    : Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 1 1 1 1 1..
 $ alcgp    : Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 1 2 2 2 2..
 $ tobgp    : Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4 1 2 3 4 ..
 $ ncases   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ncontrols: num  40 10 6 5 27 7 4 7 2 1 ...
glm.esoph <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp * alcgp,
               data = esoph, family = binomial())
Call:
glm(formula = cbind(ncases, ncontrols) ~ agegp + tobgp * alcgp, 
    family = binomial(), data = esoph)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8895  -0.5317  -0.2304   0.2704   2.0724  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -1.75985    0.19822  -8.878  < 2e-16 ***
agegp.L          2.99646    0.65386   4.583 4.59e-06 ***
agegp.Q         -1.35008    0.59197  -2.281   0.0226 *  
agegp.C          0.13436    0.45056   0.298   0.7655    
agegp^4          0.07098    0.30974   0.229   0.8187    
agegp^5         -0.21347    0.19627  -1.088   0.2768    
tobgp.L          0.63846    0.19710   3.239   0.0012 ** 
tobgp.Q          0.02922    0.19617   0.149   0.8816    
tobgp.C          0.15607    0.19796   0.788   0.4304    
alcgp.L          1.37077    0.21136   6.485 8.85e-11 ***
alcgp.Q         -0.14913    0.19645  -0.759   0.4478    
alcgp.C          0.22823    0.18203   1.254   0.2099    
tobgp.L:alcgp.L -0.70426    0.41128  -1.712   0.0868 .  
tobgp.Q:alcgp.L  0.12225    0.42044   0.291   0.7712    
tobgp.C:alcgp.L -0.29187    0.42939  -0.680   0.4967    
tobgp.L:alcgp.Q  0.12948    0.38889   0.333   0.7392    
tobgp.Q:alcgp.Q -0.44527    0.39224  -1.135   0.2563    
tobgp.C:alcgp.Q -0.05205    0.39538  -0.132   0.8953    
tobgp.L:alcgp.C -0.16118    0.36697  -0.439   0.6605    
tobgp.Q:alcgp.C  0.04843    0.36211   0.134   0.8936    
tobgp.C:alcgp.C -0.13905    0.35754  -0.389   0.6973    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 227.241  on 87  degrees of freedom
Residual deviance:  47.484  on 67  degrees of freedom
AIC: 236.96

Number of Fisher Scoring iterations: 6
glm.brm.esoph <- brm(ncases | trials(ncases + ncontrols) ~ agegp + tobgp * alcgp, 
                     data = esoph, family = binomial())
 Family: binomial 
  Links: mu = logit 
Formula: ncases | trials(ncases + ncontrols) ~ agegp + tobgp * alcgp 
   Data: esoph (Number of observations: 88) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Population-Level Effects: 
                Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept          -1.91      0.25    -2.49    -1.51        735 1.01
agegp.L             3.39      0.86     2.13     5.45        674 1.01
agegp.Q            -1.68      0.78    -3.58    -0.50        658 1.01
agegp.C             0.31      0.57    -0.59     1.63        709 1.00
agegpE4            -0.01      0.36    -0.80     0.65        907 1.01
agegpE5            -0.20      0.21    -0.59     0.22       1970 1.00
tobgp.L             0.63      0.20     0.24     1.03       4654 1.00
tobgp.Q             0.03      0.20    -0.38     0.42       3469 1.00
tobgp.C             0.17      0.20    -0.21     0.57       3892 1.00
alcgp.L             1.41      0.22     0.99     1.84       4067 1.00
alcgp.Q            -0.16      0.20    -0.56     0.24       3335 1.00
alcgp.C             0.25      0.19    -0.12     0.62       3870 1.00
tobgp.L:alcgp.L    -0.69      0.42    -1.51     0.16       3878 1.00
tobgp.Q:alcgp.L     0.13      0.43    -0.75     0.97       4249 1.00
tobgp.C:alcgp.L    -0.30      0.44    -1.15     0.58       5149 1.00
tobgp.L:alcgp.Q     0.13      0.41    -0.67     0.94       3127 1.00
tobgp.Q:alcgp.Q    -0.46      0.41    -1.24     0.34       4037 1.00
tobgp.C:alcgp.Q    -0.05      0.40    -0.82     0.74       4490 1.00
tobgp.L:alcgp.C    -0.15      0.38    -0.89     0.58       3507 1.00
tobgp.Q:alcgp.C     0.04      0.37    -0.69     0.75       3274 1.00
tobgp.C:alcgp.C    -0.17      0.36    -0.88     0.54       3773 1.00

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1).

2.2 流产数据

infert

存在无序分类变量

require(stats)
model1 <- glm(case ~ spontaneous+induced, data = infert, family = binomial())
summary(model1)
## adjusted for other potential confounders:
summary(model2 <- glm(case ~ age + parity + education + spontaneous + induced,
                     data = infert, family = binomial()))
## Really should be analysed by conditional logistic regression
## which is in the survival package
if(require(survival)){
  model3 <- clogit(case ~ spontaneous+induced+strata(stratum), data = infert)
  print(summary(model3))
  detach()  # survival (conflicts)
}

多项回归

存在有序分类数据

data(nes96, package = "faraway")

party <- nes96$PID
levels(party) <- c("Democrat", "Democrat", "Independent", 
                   "Independent", "Independent", "Republican", "Republican")
inca <- c(
  1.5, 4, 6, 8, 9.5, 10.5, 11.5, 12.5, 13.5, 14.5, 16, 18.5, 21, 23.5,
  27.5, 32.5, 37.5, 42.5, 47.5, 55, 67.5, 82.5, 97.5, 115
)
income <- inca[unclass(nes96$income)]
rnes96 <- data.frame(party, income, education = nes96$educ, age = nes96$age)
summary(rnes96)

library(faraway)
library(MASS)
pomod <- polr(party ~ age + education + income, rnes96)

library(nnet)
mmod <- multinom(party ~ age + education + income, rnes96)

运行环境

必须安装的R包有 ggplot2, nlme, MASS, lme4 和 brms,Stan 编译环境的设置如下

writeLines(readLines("~/.R/Makevars.win"))
CXX14FLAGS = -O3 -march=native -mtune=native -fPIC
CXX14 = $(BINPREF)g++ -O2 -march=native -mtune=native
Sys.getenv("BINPREF")
[1] "C:/Rtools/mingw_$(WIN)/bin/"

R 控制台的运行环境如下

xfun::session_info(c('ggplot2', 'nlme', 'MASS', 'lme4', 'brms'))
Registered S3 method overwritten by 'xts':
  method     from
  as.zoo.xts zoo 
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8.1 x64 (build 9600)

Locale:
  LC_COLLATE=Chinese (Simplified)_China.936 
  LC_CTYPE=Chinese (Simplified)_China.936   
  LC_MONETARY=Chinese (Simplified)_China.936
  LC_NUMERIC=C                              
  LC_TIME=Chinese (Simplified)_China.936    

Package version:
  abind_1.4.7               askpass_1.1              
  assertthat_0.2.1          backports_1.1.4          
  base64enc_0.1.3           bayesplot_1.7.0          
  BH_1.69.0.1               boot_1.3.23              
  bridgesampling_0.7.2      brms_2.9.0               
  Brobdingnag_1.2.6         callr_3.3.1              
  checkmate_1.9.4           cli_1.1.0                
  coda_0.19.3               codetools_0.2.16         
  colorspace_1.4.1          colourpicker_1.0         
  compiler_3.6.1            crayon_1.3.4             
  crosstalk_1.0.0           curl_4.0                 
  desc_1.2.0                digest_0.6.20            
  dplyr_0.8.3               DT_0.7                   
  dygraphs_1.1.1.6          ellipsis_0.2.0.1         
  fansi_0.4.0               future_1.14.0            
  ggplot2_3.2.0             ggridges_0.5.1           
  globals_0.12.4            glue_1.3.1               
  graphics_3.6.1            grDevices_3.6.1          
  grid_3.6.1                gridExtra_2.3            
  gtable_0.3.0              gtools_3.8.1             
  htmltools_0.3.6           htmlwidgets_1.3          
  httpuv_1.5.1              igraph_1.2.4.1           
  inline_0.3.15             jsonlite_1.6             
  labeling_0.3              later_0.8.0              
  lattice_0.20.38           lazyeval_0.2.2           
  listenv_0.7.0             lme4_1.1.21              
  loo_2.1.0                 magrittr_1.5             
  markdown_1.0              MASS_7.3.51.4            
  Matrix_1.2.17             matrixStats_0.54.0       
  methods_3.6.1             mgcv_1.8.28              
  mime_0.7                  miniUI_0.1.1.1           
  minqa_1.2.4               munsell_0.5.0            
  mvnfast_0.2.5             nleqslv_3.3.2            
  nlme_3.1-141              nloptr_1.2.1             
  openssl_1.4.1             packrat_0.5.0            
  parallel_3.6.0            pillar_1.4.2             
  pkgbuild_1.0.3            pkgconfig_2.0.2          
  plogr_0.2.0               plyr_1.8.4               
  prettyunits_1.0.2         processx_3.4.1           
  promises_1.0.1            ps_1.3.0                 
  purrr_0.3.2               R6_2.4.0                 
  RColorBrewer_1.1.2        Rcpp_1.0.2               
  RcppArmadillo_0.9.600.4.0 RcppEigen_0.3.3.5.0      
  reshape2_1.4.3            rlang_0.4.0              
  rprojroot_1.3.2           rsconnect_0.8.15         
  rstan_2.19.2              rstantools_1.5.1         
  rstudioapi_0.10           scales_1.0.0             
  shiny_1.3.2               shinyjs_1.0              
  shinystan_2.5.0           shinythemes_1.1.2        
  sourcetools_0.1.7         splines_3.6.0            
  StanHeaders_2.18.1.10     stats_3.6.1              
  stats4_3.6.0              stringi_1.4.3            
  stringr_1.4.0             sys_3.2                  
  threejs_0.3.1             tibble_2.1.3             
  tidyselect_0.2.5          tools_3.6.1              
  utf8_1.1.4                utils_3.6.1              
  vctrs_0.2.0               viridisLite_0.3.0        
  withr_2.1.2               xtable_1.8.4             
  xts_0.11.2                yaml_2.2.0               
  zeallot_0.1.0             zoo_1.8.6