使用R语言分析白榆生长数据
老王 / 2017-09-23
1. 数据初步整理
使用白榆生长数据。指标分别有:品系号、生长地点、环境盐碱度(低中高三种)、树高、胸径、东西冠幅、南北冠幅,为2017年测定。
在excel里整理数据,导出为csv格式,存在R工作目录下,进行操作。如果在windows下,使用read.csv 即可,但苹果中需要另外一种形式。
后来发现是自己忘了加引号了。。。
all.data = read.table("data for analysis/all.data.demo.txt", header = T, sep="\t", fileEncoding="UTF-16")
head(all.data)
## X spe high dbh ew sn loca salt
## 1 1 58 5.04 6.2 2.12 2.90 河口 低
## 2 2 58 4.85 6.0 1.86 2.05 河口 低
## 3 3 58 4.83 6.3 2.18 2.42 河口 低
## 4 4 58 5.25 8.0 3.20 3.11 河口 低
## 5 5 58 5.35 8.0 2.98 2.92 河口 低
## 6 6 58 6.10 7.7 2.52 2.38 河口 低
# 然后把第一列去掉
all.data = all.data[,-1]
需要注意的是数据的格式,要以数据库能够认识的格式保存数据,即标题占一行,每一个个体指标都是独立,如上面所示。在excel里操作要注意千万不要合并单元格,最好把数据整理成数据透视表能够识别的格式。
对于excel中已经和并的数据,整理时有一个小技巧,假设列名是合并的:1、取消合并。2、选中列,F5定位。3、定位条件选空值。4、在第一个单元格里填写“=上一个单元格(地址)”。5、ctrl+enter。
数据涉及沾化、河口、垦利三地的1790棵树。包括变量 spe, high, dbh, ew, sn, loca, salt。
接下来分别整理并探索几个地方的数据的结构,按地方,按品系,按盐分梯度,不同角度都试试。
kenli = subset(all.data, loca == "垦利")
hekou = subset(all.data, loca == "河口")
zhanhua = subset(all.data, loca == "沾化")
# 把几个地方的数据分解出来
str(kenli)
## 'data.frame': 622 obs. of 7 variables:
## $ spe : int 58 58 58 58 58 58 58 58 58 58 ...
## $ high: num 7.6 7.5 7.3 7.5 7.4 7.5 7.1 7.1 7 7 ...
## $ dbh : num 10.2 8.7 8.7 9.7 9.2 9.4 8.9 8.65 7.21 8.38 ...
## $ ew : num 3.98 3.19 3.29 3.8 3.5 3.47 3.87 4.13 3.12 4 ...
## $ sn : num 4.5 3.71 3.49 3.63 3.8 3.56 3.96 3.99 3.66 3.55 ...
## $ loca: Factor w/ 3 levels "垦利","河口",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ salt: Factor w/ 3 levels "中","低","高": 2 2 2 2 2 2 2 2 2 2 ...
str(hekou)
## 'data.frame': 1027 obs. of 7 variables:
## $ spe : int 58 58 58 58 58 58 58 58 58 58 ...
## $ high: num 5.04 4.85 4.83 5.25 5.35 6.1 5.7 5.03 4.85 4.95 ...
## $ dbh : num 6.2 6 6.3 8 8 7.7 8.3 7 5.4 5.4 ...
## $ ew : num 2.12 1.86 2.18 3.2 2.98 2.52 2.93 2.3 2.25 2.74 ...
## $ sn : num 2.9 2.05 2.42 3.11 2.92 2.38 3.18 2.17 1.28 2.03 ...
## $ loca: Factor w/ 3 levels "垦利","河口",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ salt: Factor w/ 3 levels "中","低","高": 2 2 2 2 2 2 2 2 1 1 ...
str(zhanhua)
## 'data.frame': 141 obs. of 7 variables:
## $ spe : int 58 58 58 58 58 58 58 58 58 58 ...
## $ high: num 4.5 4.7 4.8 4.8 4.8 3.9 4.1 4.05 4.1 4.1 ...
## $ dbh : num 6.4 7.18 5.35 7.54 5.93 4.35 5.55 4.43 3.68 4.2 ...
## $ ew : num 2.9 2.7 2.65 2.62 2.43 2.37 2.42 2.67 3.35 2.71 ...
## $ sn : num 3 3.4 3 3.19 2.93 2.55 2.77 2.35 3.3 3.23 ...
## $ loca: Factor w/ 3 levels "垦利","河口",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ salt: Factor w/ 3 levels "中","低","高": 3 3 3 3 3 3 3 3 3 3 ...
3 排序(主成分分析)
3.1 第一次尝试
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-5
library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
plot(rda(kenli[,c(2,3,4,5)]))
可以发现,点太多,区分不开。所以使用平均值进行分析。
这个过程可以在excel里完成,比较直观,也可以在R语言里完成。下面使用R语言。
3.2 把数据按品种算出平均值
library(dplyr) # 一个牛逼的包
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
hekou.spe = group_by(hekou, spe)
hekou.mean = summarise(hekou.spe,
high = mean(high),
dbh = mean(dbh),
ew = mean(ew),
sn = mean(sn)
)
head(hekou.mean)
## # A tibble: 6 x 5
## spe high dbh ew sn
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 5 5.33 5.92 2.16 2.12
## 2 7 5.25 6.19 2.21 2.14
## 3 20 5.27 6.01 2.19 2.19
## 4 30 5.37 6.54 2.44 2.30
## 5 34 5.16 5.88 2.17 2.03
## 6 41 5.23 5.94 2.09 2.09
# 以河口为例,先分组,再按分组计算,和excel里的数据透视表有点像。
kenli.spe = group_by(kenli, spe)
kenli.mean = summarise(kenli.spe,
high = mean(high),
dbh = mean(dbh),
ew = mean(ew),
sn = mean(sn)
)
zhanhua.spe = group_by(zhanhua, spe)
zhanhua.mean = summarise(zhanhua.spe,
high = mean(high),
dbh = mean(dbh),
ew = mean(ew),
sn = mean(sn)
)
有了这些分组的平均值,可以再次尝试做主成分分析。
3.3 再次排序
library(tidyverse) #这个包是个组合,里面包括dplyr,ggplot2等一系列包。
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble 2.1.3 ✔ purrr 0.3.2
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(vegan)
hekou.mean = as.data.frame(hekou.mean)
hekou.mean = hekou.mean %>% remove_rownames %>% column_to_rownames(var="spe")
# %>% 是管道函数的符号,意思就是这么干了再这么干,比如上面就是先去掉行名,再把spe这一列变成行名。
# 如果直接做下面的函数,会有Warning message:
# Setting row names on a tibble is deprecated. 所以先把几个数据变成数据框。以垦利为例,原本差异如下:
kenli.mean
## # A tibble: 12 x 5
## spe high dbh ew sn
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 5 5.87 7.48 3.09 3.33
## 2 7 5.81 7.28 3.00 3.15
## 3 20 5.67 7.12 3.01 3.14
## 4 30 5.89 7.66 3.11 3.08
## 5 34 6.00 7.37 3.04 2.95
## 6 41 6.01 7.41 3.07 3.06
## 7 46 6.35 7.51 3.15 3.09
## 8 50 5.90 7.50 3.16 3.29
## 9 51 5.83 7.32 2.91 2.99
## 10 58 6.05 7.48 3.10 3.07
## 11 286 5.97 7.68 3.02 3.16
## 12 65225 6.15 7.42 3.06 2.96
as.data.frame(kenli.mean)
## spe high dbh ew sn
## 1 5 5.874000 7.484400 3.094400 3.327600
## 2 7 5.810000 7.280200 3.004600 3.150800
## 3 20 5.666667 7.123333 3.011000 3.144000
## 4 30 5.894444 7.659815 3.106111 3.076481
## 5 34 6.004167 7.369375 3.039375 2.947917
## 6 41 6.014118 7.411176 3.074314 3.057451
## 7 46 6.350000 7.513196 3.145652 3.092609
## 8 50 5.897959 7.499184 3.155306 3.294694
## 9 51 5.830508 7.315763 2.906271 2.990847
## 10 58 6.054000 7.484600 3.096000 3.069600
## 11 286 5.971667 7.677667 3.017333 3.161417
## 12 65225 6.153333 7.424222 3.060667 2.955778
# 接着把垦利和沾化的数据也弄好。
kenli.mean = as.data.frame(kenli.mean)
kenli.mean = kenli.mean %>% remove_rownames %>% column_to_rownames(var="spe")
zhanhua.mean = as.data.frame(zhanhua.mean)
zhanhua.mean = zhanhua.mean %>% remove_rownames %>% column_to_rownames(var="spe")
再做主成分分析。在vegan里,用rda函数。
hekou.rda = rda(hekou.mean)
str(hekou.rda)
## List of 10
## $ colsum : Named num [1:4] 0.155 0.353 0.138 0.113
## ..- attr(*, "names")= chr [1:4] "high" "dbh" "ew" "sn"
## $ tot.chi : num 0.18
## $ Ybar : num [1:12, 1:4] 0.01233 -0.01169 -0.00681 0.02357 -0.04023 ...
## ..- attr(*, "scaled:center")= Named num [1:4] 5.29 6.12 2.21 2.15
## .. ..- attr(*, "names")= chr [1:4] "high" "dbh" "ew" "sn"
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:12] "5" "7" "20" "30" ...
## .. ..$ : chr [1:4] "high" "dbh" "ew" "sn"
## ..- attr(*, "METHOD")= chr "PCA"
## $ method : chr "rda"
## $ call : language rda(X = hekou.mean)
## $ pCCA : NULL
## $ CCA : NULL
## $ CA :List of 7
## ..$ eig : Named num [1:4] 0.166144 0.009104 0.004307 0.000869
## .. ..- attr(*, "names")= chr [1:4] "PC1" "PC2" "PC3" "PC4"
## ..$ poseig : NULL
## ..$ u : num [1:12, 1:4] 0.1387 -0.0307 0.0708 -0.3683 0.2186 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:12] "5" "7" "20" "30" ...
## .. .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
## ..$ v : num [1:4, 1:4] -0.318 -0.861 -0.312 -0.245 -0.888 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:4] "high" "dbh" "ew" "sn"
## .. .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
## ..$ rank : int 4
## ..$ tot.chi: num 0.18
## ..$ Xbar : num [1:12, 1:4] 0.01233 -0.01169 -0.00681 0.02357 -0.04023 ...
## .. ..- attr(*, "scaled:center")= Named num [1:4] 5.29 6.12 2.21 2.15
## .. .. ..- attr(*, "names")= chr [1:4] "high" "dbh" "ew" "sn"
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:12] "5" "7" "20" "30" ...
## .. .. ..$ : chr [1:4] "high" "dbh" "ew" "sn"
## .. ..- attr(*, "METHOD")= chr "PCA"
## $ inertia : chr "variance"
## $ regularization: chr "this is a vegan::rda result object"
## - attr(*, "class")= chr [1:2] "rda" "cca"
plot(hekou.rda)
# 把需要作图的数据提取出来,取个好名字出现在坐标轴上
PC1 = hekou.rda$CA$u[,1]
PC2 = hekou.rda$CA$u[,2]
hekou.gg = as.data.frame(cbind(PC1, PC2))
ggplot(hekou.gg, aes(PC1, PC2, label = row.names(hekou.gg))) +
geom_point() +
geom_text( vjust = "inward", hjust = "inward")
# 这个好看点,但不如vegan自带的作图容易说明问题。所以不用这个,仅作方法保存一下。
# 注意ggplot的用法,若要不同颜色,则需要把颜色对应变量转化为character或factor,默认为数值型,为渐变颜色。
接下来,从不同角度作图,开始。
library(dplyr)
kenli.low = filter(all.data, loca == "垦利", salt == "低")
kenli.low.spe = group_by(kenli.low, spe)
kenli.low.mean = summarise(kenli.low.spe,
high = mean(high),
dbh = mean(dbh),
ew = mean(ew),
sn = mean(sn)
)
kenli.med = filter(all.data, loca == "垦利", salt == "中")
kenli.med.spe = group_by(kenli.med, spe)
kenli.med.mean = summarise(kenli.med.spe,
high = mean(high),
dbh = mean(dbh),
ew = mean(ew),
sn = mean(sn)
)
kenli.high = filter(all.data, loca == "垦利", salt == "高")
kenli.high.spe = group_by(kenli.high, spe)
kenli.high.mean = summarise(kenli.high.spe,
high = mean(high),
dbh = mean(dbh),
ew = mean(ew),
sn = mean(sn)
)
hekou.low = filter(all.data, loca == "河口", salt == "低")
hekou.low.spe = group_by(hekou.low, spe)
hekou.low.mean = summarise(hekou.low.spe,
high = mean(high),
dbh = mean(dbh),
ew = mean(ew),
sn = mean(sn)
)
hekou.med = filter(all.data, loca == "河口", salt == "中")
hekou.med.spe = group_by(hekou.med, spe)
hekou.med.mean = summarise(hekou.med.spe,
high = mean(high),
dbh = mean(dbh),
ew = mean(ew),
sn = mean(sn)
)
hekou.high = filter(all.data, loca == "河口", salt == "高")
hekou.high.spe = group_by(hekou.high, spe)
hekou.high.mean = summarise(hekou.high.spe,
high = mean(high),
dbh = mean(dbh),
ew = mean(ew),
sn = mean(sn)
)
zhanhua.low = filter(all.data, loca == "沾化", salt == "低")
zhanhua.low.spe = group_by(zhanhua.low, spe)
zhanhua.low.mean = summarise(zhanhua.low.spe,
high = mean(high),
dbh = mean(dbh),
ew = mean(ew),
sn = mean(sn)
)
zhanhua.med = filter(all.data, loca == "沾化", salt == "中")
zhanhua.med.spe = group_by(zhanhua.med, spe)
zhanhua.med.mean = summarise(zhanhua.med.spe,
high = mean(high),
dbh = mean(dbh),
ew = mean(ew),
sn = mean(sn)
)
zhanhua.high = filter(all.data, loca == "沾化", salt == "高")
zhanhua.high.spe = group_by(zhanhua.high, spe)
zhanhua.high.mean = summarise(zhanhua.high.spe,
high = mean(high),
dbh = mean(dbh),
ew = mean(ew),
sn = mean(sn)
)
kenli.low.mean = as.data.frame(kenli.low.mean)
kenli.low.mean = kenli.low.mean %>% remove_rownames %>% column_to_rownames(var="spe")
kenli.med.mean = as.data.frame(kenli.med.mean)
kenli.med.mean = kenli.med.mean %>% remove_rownames %>% column_to_rownames(var="spe")
kenli.high.mean = as.data.frame(kenli.high.mean)
kenli.high.mean = kenli.high.mean %>% remove_rownames %>% column_to_rownames(var="spe")
hekou.low.mean = as.data.frame(hekou.low.mean)
hekou.low.mean = hekou.low.mean %>% remove_rownames %>% column_to_rownames(var="spe")
hekou.med.mean = as.data.frame(hekou.med.mean)
hekou.med.mean = hekou.med.mean %>% remove_rownames %>% column_to_rownames(var="spe")
hekou.high.mean = as.data.frame(hekou.high.mean)
hekou.high.mean = hekou.high.mean %>% remove_rownames %>% column_to_rownames(var="spe")
zhanhua.low.mean = as.data.frame(zhanhua.low.mean)
zhanhua.low.mean = zhanhua.low.mean %>% remove_rownames %>% column_to_rownames(var="spe")
zhanhua.med.mean = as.data.frame(zhanhua.med.mean)
zhanhua.med.mean = zhanhua.med.mean %>% remove_rownames %>% column_to_rownames(var="spe")
zhanhua.high.mean = as.data.frame(zhanhua.high.mean)
zhanhua.high.mean = zhanhua.high.mean %>% remove_rownames %>% column_to_rownames(var="spe")
作图了。
library(vegan)
plot(rda(kenli.low.mean))
# 垦利低盐分
plot(rda(kenli.med.mean))
# 垦利中盐分
plot(rda(kenli.high.mean))
# 垦利高盐分
plot(rda(hekou.low.mean))
# 河口低盐分
plot(rda(hekou.med.mean))
# 河口中盐分
plot(rda(hekou.high.mean))
# 河口高盐分
plot(rda(zhanhua.high.mean))
# 沾化高盐分
4. 聚类分析
4.1 聚类分析方法的选择
聚类分析,要先算出样本间距离来,这里直接用平均值。如何体现重复数据需要进一步研究,存疑。
用的函数hclust,方法如下:
1, 数据标准化:decostand(data, “normalize”)
2, 计算欧式距离:vegdist(data, “euc”),或者dist()
3, 不同聚类方法
3.1 Hierarchical Clustering Based on Links Single Linkage Agglomerative Clustering: hclust(distance, method = “single”) Complet Linkage Agglomerative Clustering: hclust(distance, method = “complete”)
3.2 Average Agglomerative Clustering UPGMA best-known method: hclust(distance, method = “average”) 还有其他几种average clustering: UPGMC“centroid”, WPGMA“mcquitty”, WPGMC“median”
3.3 Ward’s Minimus Variance Clustering Based on the linear model criterion of least squares. method = “ward”
- 4,解释聚类结果
Cophenetic Correlation 表型相关,the distance where the two objects become
members of the same group.
聚类的case是每一行的名称,如果是直接在excel里读取的话,存成csv格式,用read.csv(“testdata.csv”, row.names=1)读取,里面数据均参与计算相似系数矩阵。
目前觉得single方法是最合适的,是最近相邻法聚类,能够较好的把不同类别的品种分为多类。
4.2 聚类作图
dis.kenli.high = vegdist(decostand(kenli.high.mean, "normalize"), "euc")
dis.kenli.med = vegdist(decostand(kenli.med.mean, "normalize"), "euc")
dis.kenli.low = vegdist(decostand(kenli.low.mean, "normalize"), "euc")
kenli.high.single = hclust(dis.kenli.high, method = "single")
kenli.med.single = hclust(dis.kenli.med, method = "single")
kenli.low.single = hclust(dis.kenli.low, method = "single")
plot(kenli.high.single)
plot(kenli.med.single)
plot(kenli.low.single)
5 小结
- 树高、胸径、冠幅有时候并不完全一致,使用PCA能从整体的角度找到优良品种;聚类可以直观把这些品种分成不同的团伙。
- 数据整理过程中,对r语言不熟悉的话,不如用数据透视表快。
- 以上分析为了试试用blogdown写博客并用hugo挂网上,检验图能出现,格式正确。
- 真是太方便了,R分析,网站表达,直接打通了。