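In the previous section we downloaded the GEO dataset and extracted the gene expression matrix. The row names of that matrix are microarray probe IDs, which need to be converted to gene names.
Downloading the platform file
1. Set the AnnotGPL parameter to TRUE to download the platform's SOFT file over the network. (Connections from mainland China are very slow and often drop.)
Note: the annotation can only be extracted once the SOFT file has downloaded completely. If the download failed, the annotation is entirely empty; the later code still runs, but it does not produce a correctly annotated matrix.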
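library(GEOquery)
# Specify the GEO dataset ID
gse_id <- "GSE1297"
# Fetch the dataset with getGEO; AnnotGPL = TRUE also requests the platform annotation
gse_info <- getGEO(gse_id, destdir = ".", AnnotGPL = TRUE)
# Extract the annotation as a data.frame with fData().
# If the SOFT file did not download properly, the annotation is entirely empty.
annotation <- fData(gse_info[[1]])
# Inspect the column names of the platform annotation
colnames(annotation)
# Keep only two columns: column 1 (probe ID) and column 11 (gene name).
# Confirm the positions against the colnames() output before relying on them.
platform_file_set <- annotation[, c(1, 11)]
# Alternatively, try downloading the GPL96 platform file on its own
gse_gp <- getGEO("GPL96", destdir = ".")  # on a poor connection this fails with: failed to download ./GPL96.soft.gz!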
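Because an incomplete download yields an empty annotation without raising an error, it can help to check the table before going further. The following is a minimal sketch, not part of GEOquery: `check_annotation` is a hypothetical helper, and `toy_annotation` is a small made-up stand-in for the result of `fData(gse_info[[1]])`.

```r
# Hypothetical helper: stop early if the annotation table looks empty.
# symbol_col may be a column name or a column position.
check_annotation <- function(annotation, symbol_col) {
  if (nrow(annotation) == 0 || ncol(annotation) == 0) {
    stop("Annotation is empty; the SOFT file was probably not downloaded.")
  }
  symbols <- as.character(annotation[[symbol_col]])
  filled <- !is.na(symbols) & nzchar(symbols)
  if (!any(filled)) {
    stop("The gene name column contains no values.")
  }
  # Fraction of probes that carry a gene name
  mean(filled)
}

# Toy stand-in for the real annotation table (illustrative values only)
toy_annotation <- data.frame(
  ID = c("1007_s_at", "1053_at", "117_at", "AFFX-BioB-5_at"),
  Gene.symbol = c("DDR1", "RFC2", "HSPA6", ""),
  stringsAsFactors = FALSE
)

coverage <- check_annotation(toy_annotation, "Gene.symbol")
print(coverage)
```

On real data, the call would be something like `check_annotation(annotation, 11)`, assuming column 11 holds the gene names as in the code above.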
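2. Download the platform file manually from the GEO website
dir()  # list the files in the project directory
# Read the platform file (tab-delimited text); lines starting with "#" are skipped
platform_file <- read.delim("GPL96-57554.txt", header = TRUE, sep = "\t", comment.char = "#")
# Inspect the column names of the platform file
colnames(platform_file)
# Keep only two columns: column 1 (probe ID) and column 11 (gene name)
platform_file_set <- platform_file[, c(1, 11)]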
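Selecting columns by position breaks silently if the platform file layout differs, so selecting by name may be safer. The sketch below reads a few made-up lines that imitate the layout of a GPL table (header comments starting with `#`, then a tab-delimited table). The column names `ID` and `Gene Symbol` are assumptions about the real file; check them against the `colnames()` output.

```r
# Toy lines imitating a GPL platform table (illustrative values only)
toy_lines <- c(
  "#ID = probe set identifier",
  "#Gene Symbol = gene symbol",
  "ID\tGB_ACC\tGene Symbol",
  "1007_s_at\tU48705\tDDR1",
  "1053_at\tM87338\tRFC2"
)

# check.names = FALSE keeps "Gene Symbol" as-is instead of "Gene.Symbol"
toy_platform <- read.delim(
  text = paste(toy_lines, collapse = "\n"),
  header = TRUE, sep = "\t", comment.char = "#",
  check.names = FALSE, stringsAsFactors = FALSE
)

# Select the probe ID and gene name columns by name rather than by position
toy_set <- toy_platform[, c("ID", "Gene Symbol")]
print(toy_set)
```

With the default `check.names = TRUE`, as in the code above, the same column would be called `Gene.Symbol`.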
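Converting probe IDs to gene names
First, convert the expression matrix extracted in the previous section to a different format.
The expression matrix is a matrix object whose probe IDs exist only as row names, so it has no ID column for the merge function to join on. Convert it to a data.frame and add the ID column first. Otherwise merge fails with: Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column.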
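# Convert the expression matrix from matrix to data.frame
exprset <- data.frame(expression_data)
# Add an ID column holding the probe IDs, which so far exist only as row names
exprset$ID <- rownames(exprset)
# The expression table and the platform table now share the column "ID"; merge on it
express <- merge(x = exprset, y = platform_file_set, by = "ID")
# Drop the probe ID column
express$ID <- NULL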
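After the probe ID column is removed, 32 columns remain, which gives an expression matrix annotated with gene names.
Look at the last column: a single probe can match multiple genes. The next section covers how to handle that.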
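The conversion steps above can be tried on toy data without downloading anything. This is a minimal sketch: the sample names, expression values, and the column name `Gene.Symbol` are made up for illustration.

```r
# Toy expression matrix: probes as row names, samples as columns
expression_toy <- matrix(
  c(7.1, 5.2, 6.3,   # values for sample GSM_A
    7.4, 5.0, 6.8),  # values for sample GSM_B
  nrow = 3,
  dimnames = list(c("1007_s_at", "1053_at", "117_at"), c("GSM_A", "GSM_B"))
)

# Toy platform table: probe ID and gene name
platform_toy <- data.frame(
  ID = c("1007_s_at", "1053_at", "117_at"),
  Gene.Symbol = c("DDR1", "RFC2", "HSPA6"),
  stringsAsFactors = FALSE
)

# Same steps as above: data.frame, ID column, merge, drop ID
expr_df <- data.frame(expression_toy)
expr_df$ID <- rownames(expr_df)
merged <- merge(x = expr_df, y = platform_toy, by = "ID")
merged$ID <- NULL
print(merged)
```

Note that merge performs an inner join by default (`all = FALSE`), so probes missing from the platform table are dropped from the result.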
Extension: using Bioconductor annotation packages
GPL96 | hgu133a |
# Find the annotation package that corresponds to GPL96: hgu133a.db
GPL | bioc_package | title
GPL32 | mgu74a | [MG_U74A] Affymetrix Murine Genome U74A Array
GPL33 | mgu74b | [MG_U74B] Affymetrix Murine Genome U74B Array
GPL34 | mgu74c | [MG_U74C] Affymetrix Murine Genome U74C Array
GPL71 | ag | [AG] Affymetrix Arabidopsis Genome Array
GPL72 | drosgenome1 | [DrosGenome1] Affymetrix Drosophila Genome Array
GPL74 | hcg110 | [HC_G110] Affymetrix Human Cancer Array
GPL75 | mu11ksuba | [Mu11KsubA] Affymetrix Murine 11K SubA Array
GPL76 | mu11ksubb | [Mu11KsubB] Affymetrix Murine 11K SubB Array
GPL77 | mu19ksuba | [Mu19KsubA] Affymetrix Murine 19K SubA Array
GPL78 | mu19ksubb | [Mu19KsubB] Affymetrix Murine 19K SubB Array
GPL79 | mu19ksubc | [Mu19KsubC] Affymetrix Murine 19K SubC Array
GPL80 | hu6800 | [Hu6800] Affymetrix Human Full Length HuGeneFL Array
GPL81 | mgu74av2 | [MG_U74Av2] Affymetrix Murine Genome U74A Version 2 Array
GPL82 | mgu74bv2 | [MG_U74Bv2] Affymetrix Murine Genome U74B Version 2 Array
GPL83 | mgu74cv2 | [MG_U74Cv2] Affymetrix Murine Genome U74 Version 2 Array
GPL85 | rgu34a | [RG_U34A] Affymetrix Rat Genome U34 Array
GPL86 | rgu34b | [RG_U34B] Affymetrix Rat Genome U34 Array
GPL87 | rgu34c | [RG_U34C] Affymetrix Rat Genome U34 Array
GPL88 | rnu34 | [RN_U34] Affymetrix Rat Neurobiology U34 Array
GPL89 | rtu34 | [RT_U34] Affymetrix Rat Toxicology U34 Array
GPL90 | ygs98 | [YG_S98] Affymetrix Yeast Genome S98 Array
GPL91 | hgu95av2 | [HG_U95A] Affymetrix Human Genome U95A Array
GPL92 | hgu95b | [HG_U95B] Affymetrix Human Genome U95B Array
GPL93 | hgu95c | [HG_U95C] Affymetrix Human Genome U95C Array
GPL94 | hgu95d | [HG_U95D] Affymetrix Human Genome U95D Array
GPL95 | hgu95e | [HG_U95E] Affymetrix Human Genome U95E Array
GPL96 | hgu133a | [HG-U133A] Affymetrix Human Genome U133A Array
GPL97 | hgu133b | [HG-U133B] Affymetrix Human Genome U133B Array
GPL98 | hu35ksuba | [Hu35KsubA] Affymetrix Human 35K SubA Array
GPL99 | hu35ksubb | [Hu35KsubB] Affymetrix Human 35K SubB Array
GPL100 | hu35ksubc | [Hu35KsubC] Affymetrix Human 35K SubC Array
GPL101 | hu35ksubd | [Hu35KsubD] Affymetrix Human 35K SubD Array
GPL198 | ath1121501 | [ATH1-121501] Affymetrix Arabidopsis ATH1 Genome Array
GPL199 | ecoli2 | [Ecoli_ASv2] Affymetrix E. coli Antisense Genome Array
GPL200 | celegans | [Celegans] Affymetrix C. elegans Genome Array
GPL201 | hgfocus | [HG-Focus] Affymetrix Human HG-Focus Target Array
GPL339 | moe430a | [MOE430A] Affymetrix Mouse Expression 430A Array
GPL340 | mouse4302 | [MOE430B] Affymetrix Mouse Expression 430B Array
GPL341 | rae230a | [RAE230A] Affymetrix Rat Expression 230A Array
GPL342 | rae230b | [RAE230B] Affymetrix Rat Expression 230B Array
GPL570 | hgu133plus2 | [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array
GPL571 | hgu133a2 | [HG-U133A_2] Affymetrix Human Genome U133A 2.0 Array
GPL886 | hgug4111a | Agilent-011871 Human 1B Microarray G4111A (Feature Number version)
GPL887 | hgug4110b | Agilent-012097 Human 1A Microarray (V2) G4110B (Feature Number version)
GPL1261 | mouse430a2 | [Mouse430_2] Affymetrix Mouse Genome 430 2.0 Array
GPL1318 | xenopuslaevis | [Xenopus_laevis] Affymetrix Xenopus laevis Genome Array
GPL1319 | zebrafish | [Zebrafish] Affymetrix Zebrafish Genome Array
GPL1322 | drosophila2 | [Drosophila_2] Affymetrix Drosophila Genome 2.0 Array
GPL1352 | u133x3p | [U133_X3P] Affymetrix Human X3P Array
GPL1355 | rat2302 | [Rat230_2] Affymetrix Rat Genome 230 2.0 Array
GPL1708 | hgug4112a | Agilent-012391 Whole Human Genome Oligo Microarray G4112A (Feature Number version)
GPL2112 | bovine | [Bovine] Affymetrix Bovine Genome Array
GPL2529 | yeast2 | [Yeast_2] Affymetrix Yeast Genome 2.0 Array
GPL2891 | h20kcod | GE Healthcare/Amersham Biosciences CodeLink™ UniSet Human 20K I Bioarray
GPL2898 | adme16cod | GE Healthcare/Amersham Biosciences CodeLink™ ADME Rat 16-Assay Bioarray
GPL3154 | ecoli2 | [E_coli_2] Affymetrix E. coli Genome 2.0 Array
GPL3213 | chicken | [Chicken] Affymetrix Chicken Genome Array
GPL3533 | porcine | [Porcine] Affymetrix Porcine Genome Array
GPL3738 | canine2 | [Canine_2] Affymetrix Canine Genome 2.0 Array
GPL3921 | hthgu133a | [HT_HG-U133A] Affymetrix HT Human Genome U133A Array
GPL3979 | canine | [Canine] Affymetrix Canine Genome 1.0 Array
GPL4032 | | [Maize] Affymetrix Maize Genome Array
GPL4191 | h10kcod | CodeLink UniSet Human I Bioarray
GPL5188 | huex10sttranscriptcluster | [HuEx-1_0-st] Affymetrix Human Exon 1.0 ST Array [probe set (exon) version]
GPL5689 | hgug4100a | Agilent Human 1 cDNA Microarray (G4100A) [layout C]
GPL6097 | illuminaHumanv1 | Illumina human-6 v1.0 expression beadchip
GPL6102 | illuminaHumanv2 | Illumina human-6 v2.0 expression beadchip
GPL6244 | hugene10sttranscriptcluster | [HuGene-1_0-st] Affymetrix Human Gene 1.0 ST Array [transcript (gene) version]
GPL6246 | mogene10sttranscriptcluster | [MoGene-1_0-st] Affymetrix Mouse Gene 1.0 ST Array [transcript (gene) version]
GPL6885 | illuminaMousev2 | Illumina MouseRef-8 v2.0 expression beadchip
GPL6947 | illuminaHumanv3 | Illumina HumanHT-12 V3.0 expression beadchip
GPL8300 | hgu95av2 | [HG_U95Av2] Affymetrix Human Genome U95 Version 2 Array
GPL8321 | mouse430a2 | [Mouse430A_2] Affymetrix Mouse Genome 430A 2.0 Array
GPL8490 | IlluminaHumanMethylation27k | Illumina HumanMethylation27 BeadChip (HumanMethylation27_270596_v.1.2)
GPL10558 | illuminaHumanv4 | Illumina HumanHT-12 V4.0 expression beadchip
GPL11532 | hugene11sttranscriptcluster | [HuGene-1_1-st] Affymetrix Human Gene 1.1 ST Array [transcript (gene) version]
GPL13497 | HsAgilentDesign026652 | Agilent-026652 Whole Human Genome Microarray 4x44K v2 (Probe Name version)
GPL13534 | IlluminaHumanMethylation450k | Illumina HumanMethylation450 BeadChip (HumanMethylation450_15017482)
GPL13667 | hgu219 | [HG-U219] Affymetrix Human Genome U219 Array
GPL14877 | hgu133plus2 | Affymetrix Human Genome U133 Plus 2.0 Array [Brainarray Version 13, HGU133Plus2_Hs_ENTREZG]
GPL15380 | GGHumanMethCancerPanelv1 | Illumina Sentrix Array Matrix (SAM) - GoldenGate Methylation Cancer Panel I
GPL15396 | hthgu133b | [HT_HG-U133B] Affymetrix HT Human Genome U133B Array [custom CDF: ENTREZ brainarray v. 14]
GPL17556 | hugene10sttranscriptcluster | [HuGene-1_0-st] Affymetrix Human Gene 1.0 ST Array [HuGene10stv1_Hs_ENTREZG_17.0.0]
GPL17897 | hthgu133a | [HT_HG-U133A] Affymetrix Human Genome U133A Array (custom CDF: HTHGU133A_Hs_ENTREZG.cdf version 17.0.0)
GPL18190 | hugene11sttranscriptcluster | [HuGene-1_1-st] Affymetrix Human Gene 1.1 ST Array [CDF: Brainarray HuGene11stv1_Hs_ENTREZG_15.1.0]
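The table above can be turned into a small lookup that returns the annotation package name for a platform. The sketch below is an illustration under stated assumptions: `gpl_to_bioc` copies only four rows of the table, `annotation_package_for` is a hypothetical helper, and the `.db` suffix follows the hgu133a.db example. The final query runs only if the annotation package is already installed (for example via BiocManager), and it assumes the usual `PROBEID` and `SYMBOL` fields of Bioconductor chip packages.

```r
# A few rows copied from the table above (extend as needed)
gpl_to_bioc <- c(
  GPL96   = "hgu133a",
  GPL97   = "hgu133b",
  GPL570  = "hgu133plus2",
  GPL6244 = "hugene10sttranscriptcluster"
)

# Hypothetical helper: GPL accession -> annotation package name
annotation_package_for <- function(gpl_id) {
  key <- toupper(gpl_id)
  if (!key %in% names(gpl_to_bioc)) {
    stop("No annotation package listed for ", key)
  }
  paste0(gpl_to_bioc[[key]], ".db")
}

pkg_name <- annotation_package_for("gpl96")
print(pkg_name)

# Optional: query the package if it is installed (skipped otherwise)
if (requireNamespace(pkg_name, quietly = TRUE)) {
  db <- getExportedValue(pkg_name, pkg_name)
  probe2symbol <- AnnotationDbi::select(
    db,
    keys = c("1007_s_at", "1053_at"),
    columns = "SYMBOL",
    keytype = "PROBEID"
  )
  print(probe2symbol)
}
```

The resulting probe-to-symbol table could then take the place of `platform_file_set` in the merge step, after renaming its columns to match.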