Rescuing gene names mangled by Excel

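When you download a published gene expression matrix from GEO, you will often find that gene names like the ones below have been silently altered by Excel. This post introduces an R package, HGNChelper, that can automatically recognize these genes and correct them.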

Excel error    Corrected gene symbol
1-Sep          SEPT1
10-Sep         SEPT10
11-Sep         SEPT11
12-Sep         SEPT12

A simple example

library(HGNChelper)
human = c("FN1", "tp53", "UNKNOWNGENE","7-Sep", "9/7", "1-Mar", "Oct4", "4-Oct",
"OCT4-PG4", "C19ORF71", "C19orf71")
checkGeneSymbols(human)

Human gene symbols should be all upper-case except for the 'orf' in open reading frames. The case of some letters was corrected.
x contains non-approved gene symbols
             x Approved Suggested.Symbol
 1         FN1     TRUE              FN1
 2        tp53    FALSE             TP53
 3 UNKNOWNGENE    FALSE             <NA>
 4       7-Sep    FALSE            SEPT7
 5         9/7    FALSE            SEPT7
 6       1-Mar    FALSE MARC1 /// MARCH1
 7        Oct4    FALSE           POU5F1
 8       4-Oct    FALSE           POU5F1
 9    OCT4-PG4    FALSE         POU5F1P4
10    C19ORF71    FALSE         C19orf71
11    C19orf71     TRUE         C19orf71

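checkGeneSymbols does more than correct the changes made by Excel: it also converts aliases to the standard gene symbol. Some errors cannot be resolved, though. An entry such as 1-Mar may correspond to more than one gene symbol (here MARC1 or MARCH1), and data like that can only be discarded.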
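
To make the three outcomes concrete, here is a minimal sketch that sorts a result table into usable, ambiguous, and unmapped entries. It assumes only the three columns shown in the output above (x, Approved, Suggested.Symbol) and that ambiguous suggestions are joined by "///". The data frame `res` is hand-built from a few of those rows rather than produced by calling HGNChelper, and the names `unmapped`, `ambiguous`, and `usable` are made up for illustration.

```r
# Hand-built stand-in for a checkGeneSymbols() result;
# the values are copied from the example output, not computed here.
res <- data.frame(
  x = c("FN1", "tp53", "UNKNOWNGENE", "7-Sep", "1-Mar"),
  Approved = c(TRUE, FALSE, FALSE, FALSE, FALSE),
  Suggested.Symbol = c("FN1", "TP53", NA, "SEPT7", "MARC1 /// MARCH1"),
  stringsAsFactors = FALSE
)

# No suggestion at all: nothing to map to.
unmapped <- is.na(res$Suggested.Symbol)

# Several candidate symbols joined by "///": cannot be resolved automatically.
ambiguous <- !unmapped & grepl("///", res$Suggested.Symbol, fixed = TRUE)

# Exactly one suggested symbol: safe to rename.
usable <- !unmapped & !ambiguous

res$x[ambiguous]                          # inputs with more than one candidate
res$x[unmapped]                           # inputs with no candidate
res[usable, c("x", "Suggested.Symbol")]   # inputs that can be renamed
```

Listing the ambiguous and unmapped inputs before dropping them makes it easier to see how much of the matrix is being lost.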


A typical use case

library(HGNChelper)
# expression_matrix_file: path to a tab-separated expression matrix
# (rows are genes, columns are samples)
mtx <- read.table(expression_matrix_file, sep = "\t", header = TRUE, row.names = 1)
dim(mtx)
# check gene symbols (named `checked` so that base::t is not shadowed)
checked <- checkGeneSymbols(rownames(mtx))
table(checked$Approved)
table(is.na(checked$Suggested.Symbol))
# drop rows with no suggestion (<NA>), then rows whose Suggested.Symbol
# repeats an earlier row (the first occurrence is kept)
mtx$Suggested.Symbol <- checked$Suggested.Symbol
mtx <- mtx[!is.na(mtx$Suggested.Symbol), ]
mtx <- mtx[!duplicated(mtx$Suggested.Symbol), ]
# drop rows whose suggestion lists several candidate symbols
mtx <- mtx[!grepl("///", mtx$Suggested.Symbol, fixed = TRUE), ]
# reset rownames to the corrected symbols
rownames(mtx) <- mtx$Suggested.Symbol
# remove the helper column
mtx$Suggested.Symbol <- NULL
dim(mtx)
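
The script above keeps only the first row when several input names map to the same symbol. A possible alternative, not part of the original workflow, is to merge such rows with base R's rowsum(). The sketch below assumes that summing is appropriate for your data (for example raw counts; it would not be for log-transformed values). The toy matrix and the `suggested` vector are made up, standing in for a real matrix and for the Suggested.Symbol column.

```r
# Toy expression matrix: rows are genes, columns are samples (made-up numbers).
mtx <- matrix(
  c(5, 1, 2, 7,    # sample s1
    3, 4, 6, 8),   # sample s2
  nrow = 4,
  dimnames = list(c("tp53", "TP53", "7-Sep", "UNKNOWNGENE"), c("s1", "s2"))
)

# Hand-written stand-in for checkGeneSymbols(rownames(mtx))$Suggested.Symbol.
suggested <- c("TP53", "TP53", "SEPT7", NA)

# Keep rows that have exactly one suggested symbol.
keep <- !is.na(suggested) & !grepl("///", suggested, fixed = TRUE)

# Sum rows that share a suggested symbol; row names become the symbols.
merged <- rowsum(mtx[keep, , drop = FALSE], group = suggested[keep])
merged
```

Here the tp53 and TP53 rows are added together instead of one of them being discarded. Whether to sum, average, or keep a single row is a judgment call that depends on how the matrix was produced.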