Character strings can be matched and manipulated in base R by using regular expressions in functions such as grep, grepl, sub, gsub, and regexpr combined with regmatches.
The tidyverse package ‘stringr’ provides analogous verbs with a more consistent syntax.
A regular expression is a pattern that describes a set of strings.
Most characters, including all letters and digits, are regular expressions that match themselves.
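In contrast, the metacharacter . matches any single character.
You can also refer to a character class, which is a list of characters enclosed between [ and ]; e.g. for ASCII text [[:alnum:]] is the same as [A-Za-z0-9]. (Note that the range [A-z] is not equivalent to [A-Za-z]: it also covers the punctuation characters that sit between Z and a in ASCII.)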
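A minimal sketch of character classes in action, using base R only. The example vector `x` is made up for illustration.

```r
x <- c("abc", "123", "a_1", "?!")

# TRUE where the string contains at least one letter or digit
has_alnum <- grepl("[[:alnum:]]", x)

# TRUE where the string contains at least one digit
has_digit <- grepl("[[:digit:]]", x)

# remove every punctuation character (the underscore counts as punctuation)
no_punct <- gsub("[[:punct:]]", "", x)

has_alnum
has_digit
no_punct
```

Note the double brackets: the outer pair opens the bracket expression, the inner `[:alnum:]` names the class.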
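Most common character classes are [:alnum:], [:alpha:], [:digit:], [:lower:], [:upper:], [:space:] and [:punct:]; see ?regex for the full list.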
The metacharacters in regular expressions are . \ | ( ) [ { ^ $ * + ?; whether these have a special meaning depends on the context.
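To match a metacharacter as a regular character, precede it with a double backslash \\ in the R string; e.g. "\\." matches a literal dot. (The regex engine needs a single backslash, and R string literals write a single backslash as \\.)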
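A small sketch of what escaping changes in practice. The file names are invented for illustration.

```r
files <- c("counts.txt", "counts_txt", "countsXtxt")

# unescaped: the dot matches any single character, so all three match
unescaped <- grepl("counts.txt", files)

# escaped: only a literal dot matches
escaped <- grepl("counts\\.txt", files)

# fixed = TRUE switches regex interpretation off altogether
literal <- grepl("counts.txt", files, fixed = TRUE)

unescaped
escaped
literal
```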
Repetition quantifiers placed after a regex specify how many times that regex is matched: ?, optional, at most once; *, zero or more times; +, one or more times; {n}, exactly n times; {n,}, n or more times; {n,m}, n to m times.
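The quantifiers can be compared side by side on a toy vector. This is only a sketch: the strings are invented, and the patterns are wrapped in ^ and $ so that the whole string has to match.

```r
x <- c("ac", "abc", "abbc", "abbbc")

q_optional <- grepl("^ab?c$", x)      # b at most once
q_star     <- grepl("^ab*c$", x)      # b zero or more times
q_plus     <- grepl("^ab+c$", x)      # b one or more times
q_exact    <- grepl("^ab{2}c$", x)    # b exactly twice
q_atleast  <- grepl("^ab{2,}c$", x)   # b two or more times
q_range    <- grepl("^ab{1,2}c$", x)  # b once or twice

# one row per pattern, one column per string
rbind(q_optional, q_star, q_plus, q_exact, q_atleast, q_range)
```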
The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line.
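A short sketch of anchoring, on file names of the kind used later in this document:

```r
x <- c("GSE98414_RAW.tar", "RAW.tar.gz", "filelist.txt")

starts_gse <- grepl("^GSE", x)     # "GSE" at the very beginning
ends_tar   <- grepl("\\.tar$", x)  # ".tar" at the very end
has_tar    <- grepl("tar", x)      # "tar" anywhere in the string

starts_gse
ends_tar
has_tar
```

Without the anchors, "RAW.tar.gz" also counts as a match for "tar"; with $ it does not.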
Locate a pattern match (positions)
Extract a matched pattern
Identify a match to a pattern
Replace a matched pattern
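The four tasks above can be sketched with a base R function and its stringr counterpart next to each other. The pairing shown here is one reasonable mapping, not the only one, and the example vector is made up.

```r
library(stringr)

x <- c("GSE98414_RAW.tar", "filelist.txt")
pattern <- "[0-9]+"  # one or more digits

# Locate: regexpr() returns -1 for no match, str_locate() returns NA
pos_base <- regexpr(pattern, x)
pos_str  <- str_locate(x, pattern)

# Extract: regmatches() drops non-matching elements, str_extract() keeps NA
ext_base <- regmatches(x, pos_base)
ext_str  <- str_extract(x, pattern)

# Identify: logical vector of matches
hit_base <- grepl(pattern, x)
hit_str  <- str_detect(x, pattern)

# Replace: first match only (gsub() / str_replace_all() replace all matches)
rep_base <- sub(pattern, "", x)
rep_str  <- str_replace(x, pattern, "")
```

Note the argument order: base R functions take the pattern first, while stringr functions take the string first, which makes them easier to use in pipes.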
Download the test dataset.
The test dataset contains supplementary file names and some metadata of gene expression profiling experiments using high-throughput sequencing:
if (!dir.exists("data")) {
  dir.create("data")
}
## manually download suppfilenames_2017-06-19.RData from rstats-tartu/datasets
## alternatively clone the repo 'rstats-tartu/regex-demo'
load("data/suppfilenames_2017-06-19.RData")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.3 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(stringr)
## Filter out rows with missing file names
suppfilenames <- suppfilenames %>%
filter(!map_lgl(SuppFileNames, ~ inherits(., "try-error")))
suppfilenames %>% select(Accession, PDAT, SuppFileNames)
## # A tibble: 8,882 × 3
## Accession PDAT SuppFileNames
## <chr> <chr> <list>
## 1 GSE98414 2017/06/17 <chr [2]>
## 2 GSE98413 2017/06/17 <chr [2]>
## 3 GSE83480 2017/06/17 <chr [2]>
## 4 GSE78140 2017/06/16 <chr [2]>
## 5 GSE98273 2017/06/16 <chr [1]>
## 6 GSE89134 2017/06/16 <chr [4]>
## 7 GSE89206 2017/06/16 <chr [2]>
## 8 GSE89205 2017/06/16 <chr [1]>
## 9 GSE100106 2017/06/16 <chr [2]>
## 10 GSE100077 2017/06/16 <chr [2]>
## # … with 8,872 more rows
## unnest supplementary file names
supfn <- suppfilenames %>% unnest(SuppFileNames)
supfn %>% select(Accession, PDAT, SuppFileNames)
## # A tibble: 23,118 × 3
## Accession PDAT SuppFileNames
## <chr> <chr> <chr>
## 1 GSE98414 2017/06/17 filelist.txt
## 2 GSE98414 2017/06/17 GSE98414_RAW.tar
## 3 GSE98413 2017/06/17 GSE98413_Junb_counts_normalized.tab.gz
## 4 GSE98413 2017/06/17 GSE98413_Junb_counts_raw.tab.gz
## 5 GSE83480 2017/06/17 filelist.txt
## 6 GSE83480 2017/06/17 GSE83480_RAW.tar
## 7 GSE78140 2017/06/16 filelist.txt
## 8 GSE78140 2017/06/16 GSE78140_RAW.tar
## 9 GSE98273 2017/06/16 GSE98273_genes.fpkm_tracking.txt.gz
## 10 GSE89134 2017/06/16 GSE89134_CentralMemoryVsControlNPDay8.txt.gz
## # … with 23,108 more rows
Use str_length() to get the length of a text string (i.e. the number of characters in the string).
str_length("banana")
## [1] 6
str_length("")
## [1] 0
Length of supplementary file names.
supfn <- supfn %>%
select(Accession, PDAT, SuppFileNames) %>%
mutate(strlen = str_length(SuppFileNames))
supfn
## # A tibble: 23,118 × 4
## Accession PDAT SuppFileNames strlen
## <chr> <chr> <chr> <int>
## 1 GSE98414 2017/06/17 filelist.txt 12
## 2 GSE98414 2017/06/17 GSE98414_RAW.tar 16
## 3 GSE98413 2017/06/17 GSE98413_Junb_counts_normalized.tab.gz 38
## 4 GSE98413 2017/06/17 GSE98413_Junb_counts_raw.tab.gz 31
## 5 GSE83480 2017/06/17 filelist.txt 12
## 6 GSE83480 2017/06/17 GSE83480_RAW.tar 16
## 7 GSE78140 2017/06/16 filelist.txt 12
## 8 GSE78140 2017/06/16 GSE78140_RAW.tar 16
## 9 GSE98273 2017/06/16 GSE98273_genes.fpkm_tracking.txt.gz 35
## 10 GSE89134 2017/06/16 GSE89134_CentralMemoryVsControlNPDay8.txt.gz 44
## # … with 23,108 more rows
Plot the size distribution of supplementary file names:
ggplot(supfn, aes(strlen)) + geom_histogram(bins = 40)
The distribution seems skewed. What if we plot log-transformed strlen values?
ggplot(supfn, aes(log2(strlen))) + geom_histogram(bins = 40)
# Single most common filename: filelist.txt
most_common_filename <- supfn %>%
group_by(SuppFileNames) %>%
summarise(N = n()) %>%
arrange(desc(N))
most_common_filename
## # A tibble: 17,697 × 2
## SuppFileNames N
## <chr> <int>
## 1 filelist.txt 5422
## 2 GSE100067_Expression_table.txt.gz 1
## 3 GSE100077_Counts.csv.gz 1
## 4 GSE100077_DifferentialExpression.csv.gz 1
## 5 GSE100106_RAW.tar 1
## 6 GSE11724_RAW.tar 1
## 7 GSE11892_RAW.tar 1
## 8 GSE12075_RAW.tar 1
## 9 GSE12946_RAW.tar 1
## 10 GSE13652_RAW.tar 1
## # … with 17,687 more rows
Filenames are prepended with the GSE id.
# Supplemental file names with more than N = 10 occurrences
cf <- supfn %>%
mutate(common_filenames = str_replace(SuppFileNames, "GSE[0-9]+_", ""),
common_filenames = str_replace(common_filenames, "\\.gz$", ""),
common_filenames = str_to_lower(common_filenames))
cf
## # A tibble: 23,118 × 5
## Accession PDAT SuppFileNames strlen common_filenames
## <chr> <chr> <chr> <int> <chr>
## 1 GSE98414 2017/06/17 filelist.txt 12 filelist.txt
## 2 GSE98414 2017/06/17 GSE98414_RAW.tar 16 raw.tar
## 3 GSE98413 2017/06/17 GSE98413_Junb_counts_norm… 38 junb_counts_normalize…
## 4 GSE98413 2017/06/17 GSE98413_Junb_counts_raw.… 31 junb_counts_raw.tab
## 5 GSE83480 2017/06/17 filelist.txt 12 filelist.txt
## 6 GSE83480 2017/06/17 GSE83480_RAW.tar 16 raw.tar
## 7 GSE78140 2017/06/16 filelist.txt 12 filelist.txt
## 8 GSE78140 2017/06/16 GSE78140_RAW.tar 16 raw.tar
## 9 GSE98273 2017/06/16 GSE98273_genes.fpkm_track… 35 genes.fpkm_tracking.t…
## 10 GSE89134 2017/06/16 GSE89134_CentralMemoryVsC… 44 centralmemoryvscontro…
## # … with 23,108 more rows
cfn <- group_by(cf, common_filenames) %>%
summarise(N = n()) %>%
arrange(desc(N)) %>%
filter(N > 10)
cfn
## # A tibble: 14 × 2
## common_filenames N
## <chr> <int>
## 1 filelist.txt 5422
## 2 raw.tar 5422
## 3 gene_exp.diff 66
## 4 readme.txt 65
## 5 genes.fpkm_tracking 39
## 6 counts.txt 31
## 7 processed_data.txt 28
## 8 rpkm.txt 28
## 9 fpkm.txt 26
## 10 raw_counts.txt 20
## 11 normalized_counts.txt 16
## 12 gene_exp.diff.txt 13
## 13 isoform_exp.diff 13
## 14 genes.fpkm_tracking.txt 11
cfp <- ggplot(cfn, aes(common_filenames, N)) +
geom_point() +
scale_x_discrete(limits = rev(cfn$common_filenames)) +
scale_y_log10() +
coord_flip() +
xlab("Common stubs of SuppFileNames\n(>10 occurrences) ") +
ylab("Number of files")
# plot commonfilenames ggplot
cfp
Now we can filter out the “filelist.txt” and “RAW.tar” files and replot the distribution of file name lengths.
filter(supfn, !str_detect(SuppFileNames, "filelist|RAW\\.tar")) %>%
ggplot(aes(log2(strlen))) + geom_histogram(bins = 40)
str_detect()
## str_detect generates logical vector of matches and nonmatches
## match letter b against alphabet and get index of TRUE values
str_detect(letters, "b") %>% which
## [1] 2
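stringr also has helpers built on the same idea. A minimal sketch with an invented vector; the `negate` argument is assumed to be available, which is the case from stringr 1.4.0 on.

```r
library(stringr)

fruits <- c("apple", "banana", "cherry")

detected <- str_detect(fruits, "an")                 # logical vector
indices  <- str_which(fruits, "an")                  # same as which(str_detect(...))
matching <- str_subset(fruits, "an")                 # the matching strings themselves
inverted <- str_detect(fruits, "an", negate = TRUE)  # non-matches instead of matches

detected
indices
matching
inverted
```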
We want to filter out some file types.
# we are looking only for tabular data.
out_string1 <- c("filelist|annotation|readme|error|raw.tar|csfasta|bam|sam|bed|[:punct:]hic|hdf5|bismark|map|barcode|peaks")
out_string2 <- c("tar","gtf","(big)?bed(\\.txt|12|graph|pk)?","bw",
"wig","hic","gct(x)?","tdf","gff(3)?","pdf","png","zip",
"sif","narrowpeak","fa", "r$", "rda(ta)?$")
paste0(out_string2, "(\\.gz|\\.bz2)?$", collapse = "|")
## [1] "tar(\\.gz|\\.bz2)?$|gtf(\\.gz|\\.bz2)?$|(big)?bed(\\.txt|12|graph|pk)?(\\.gz|\\.bz2)?$|bw(\\.gz|\\.bz2)?$|wig(\\.gz|\\.bz2)?$|hic(\\.gz|\\.bz2)?$|gct(x)?(\\.gz|\\.bz2)?$|tdf(\\.gz|\\.bz2)?$|gff(3)?(\\.gz|\\.bz2)?$|pdf(\\.gz|\\.bz2)?$|png(\\.gz|\\.bz2)?$|zip(\\.gz|\\.bz2)?$|sif(\\.gz|\\.bz2)?$|narrowpeak(\\.gz|\\.bz2)?$|fa(\\.gz|\\.bz2)?$|r$(\\.gz|\\.bz2)?$|rda(ta)?$(\\.gz|\\.bz2)?$"
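How paste0() with collapse = "|" turns a vector of extensions into one alternation regex is easier to see on a shorter list. A sketch with three made-up extensions and five made-up file names:

```r
exts <- c("tar", "pdf", "zip")

# each extension gets an optional compression suffix and an end-of-string anchor,
# then the pieces are glued together with "|" (alternation)
pattern <- paste0(exts, "(\\.gz|\\.bz2)?$", collapse = "|")
pattern

files <- c("a.tar.gz", "b.pdf", "c.txt.gz", "d.zip.bz2", "tarball.txt")
unwanted <- grepl(pattern, files)
unwanted
```

"tarball.txt" is not flagged because "tar" is not at the end of the name; this is what the $ anchor buys us.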
suppfiles_of_interest <- supfn %>%
filter(!str_detect(tolower(SuppFileNames), out_string1),
!str_detect(tolower(SuppFileNames), paste0(out_string2, "(\\.gz|\\.bz2)?$", collapse = "|"))) %>%
mutate(filext = str_extract(str_to_lower(SuppFileNames), "\\.[:alpha:]+([:punct:][bgz2]+)?$"))
suppfiles_of_interest
## # A tibble: 7,928 × 5
## Accession PDAT SuppFileNames strlen filext
## <chr> <chr> <chr> <int> <chr>
## 1 GSE98413 2017/06/17 GSE98413_Junb_counts_normalized.tab.gz 38 .tab.…
## 2 GSE98413 2017/06/17 GSE98413_Junb_counts_raw.tab.gz 31 .tab.…
## 3 GSE98273 2017/06/16 GSE98273_genes.fpkm_tracking.txt.gz 35 .txt.…
## 4 GSE89134 2017/06/16 GSE89134_CentralMemoryVsControlNPDay8.txt… 44 .txt.…
## 5 GSE89134 2017/06/16 GSE89134_Counts.txt.gz 22 .txt.…
## 6 GSE89134 2017/06/16 GSE89134_Foxo1NPDay8VsControlNPDay8.txt.gz 42 .txt.…
## 7 GSE89134 2017/06/16 GSE89134_log2NormalizedCPM.txt.gz 33 .txt.…
## 8 GSE89205 2017/06/16 GSE89205_genes.fpkm_tracking.txt.gz 35 .txt.…
## 9 GSE100077 2017/06/16 GSE100077_Counts.csv.gz 23 .csv.…
## 10 GSE100077 2017/06/16 GSE100077_DifferentialExpression.csv.gz 39 .csv.…
## # … with 7,918 more rows
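The tibble print above truncates the filext column, so here is a small sketch of extension extraction on invented file names. The pattern is a simplified, stricter variant (an assumption for illustration, not the regex used above): a dot and letters, optionally followed by .gz or .bz2, at the end of the string.

```r
library(stringr)

files <- c("GSE1_counts.txt.gz", "GSE2_table.xlsx", "GSE3_RAW.tar", "README")

filext <- str_extract(str_to_lower(files), "\\.[a-z]+(\\.(gz|bz2))?$")
filext
```

Names without any extension yield NA.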
Most popular file extensions of potentially interesting files.
fext <- group_by(suppfiles_of_interest, filext) %>%
summarise(N = n()) %>%
arrange(desc(N)) %>%
filter(N > 10)
fext
## # A tibble: 9 × 2
## filext N
## <chr> <int>
## 1 .txt.gz 5377
## 2 .csv.gz 675
## 3 .xlsx 592
## 4 .xls.gz 309
## 5 .diff.gz 289
## 6 .tsv.gz 257
## 7 .xlsx.gz 180
## 8 .gz 131
## 9 .tab.gz 61
ggplot(fext, aes(filext, N)) +
geom_point() +
scale_x_discrete(limits = rev(fext$filext)) +
scale_y_log10() +
coord_flip() +
xlab("Common file extensions\n(>10 occurrences) ") +
ylab("Number of files")
Let’s find summaries containing the word “CRISPR”.
crispr <- suppfilenames %>%
filter(str_detect(str_to_lower(summary), "crispr"))
crispr
## # A tibble: 95 × 29
## Id Accession GDS title summary GPL GSE taxon entryType gdsType
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 200095383 GSE95383 "" Expre… The Pol… 21273 95383 Mus … GSE Expres…
## 2 200099819 GSE99819 "" Genom… The mam… 18573 99819 Homo… GSE Expres…
## 3 200078519 GSE78519 "" Capic… We perf… 16791 78519 Homo… GSE Expres…
## 4 200095455 GSE95455 "" Asses… HeLa ce… 16791 95455 Homo… GSE Expres…
## 5 200095452 GSE95452 "" Asses… By comp… 16791 95452 Homo… GSE Expres…
## 6 200093681 GSE93681 "" CRISP… Epstein… 18573 93681 Homo… GSE Expres…
## 7 200098063 GSE98063 "" Mll3 … Monomet… 19057 98063 Mus … GSE Expres…
## 8 200093395 GSE93395 "" Gene … Recent … 11154 93395 Homo… GSE Expres…
## 9 200098177 GSE98177 "" Trans… We prev… 20301 98177 Homo… GSE Expres…
## 10 200083296 GSE83296 "" Genom… Gene ex… 18573 83296 Homo… GSE Expres…
## # … with 85 more rows, and 19 more variables: ptechType <chr>, valType <chr>,
## # SSInfo <chr>, subsetInfo <chr>, PDAT <chr>, suppFile <chr>, Samples <chr>,
## # Relations <chr>, ExtRelations <chr>, n_samples <chr>, SeriesTitle <chr>,
## # PlatformTitle <chr>, PlatformTaxa <chr>, SamplesTaxa <chr>,
## # PubMedIds <chr>, Projects <chr>, FTPLink <chr>, GEO2R <chr>,
## # SuppFileNames <list>
We have 95 GEO series containing the word “crispr” in their summary.
When did people start publishing experiments using CRISPR?
crispr %>%
ggplot(aes(lubridate::ymd(PDAT))) +
geom_histogram(aes(y = cumsum(..count..)), bins = 30) +
labs(title = "Number of Entrez GEO series mentioning CRISPR in summary",
caption = "Data: Entrez GEO",
y = "Cumulative number of studies",
x = "Publication date")
str_replace()
Strings are commonly removed by replacing them with an empty string; str_remove() is a shortcut for exactly that.
Suppose we want to fix these FTP links, which have a string prepended to the URL. We want to remove this “SRASRP..” part to get the bare URL. We do this by replacing “SRASRP..” with an empty string "". (Alternatively, you can extract the URL.)
set.seed(2)
ftplinks <- suppfilenames %>%
select(ExtRelations) %>%
sample_n(5)
ftplinks
## # A tibble: 5 × 1
## ExtRelations
## <chr>
## 1 ""
## 2 "SRASRP015711ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/s…
## 3 "SRASRP064758ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/s…
## 4 "SRASRP081655ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/s…
## 5 "SRASRP076951ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/s…
ftplinks$ExtRelations[2] %>% str_replace("SRASRP[0-9]+", "")
## [1] "ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP015/SRP015711/"
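The two alternatives mentioned above, removing the prefix with str_remove() or extracting the URL with str_extract(), can be sketched on the same link (typed in here so that the example runs on its own):

```r
library(stringr)

link <- "SRASRP015711ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP015/SRP015711/"

# remove the prefix; ^ makes sure we only touch the start of the string
removed <- str_remove(link, "^SRASRP[0-9]+")

# or keep everything from "ftp://" onwards
extracted <- str_extract(link, "ftp://.+")

identical(removed, extracted)
```

Extraction returns NA for strings without a URL (such as the empty ExtRelations entries), whereas removal leaves them unchanged.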
Let’s split strings with str_split().
str_split("A\nB", "\n")
## [[1]]
## [1] "A" "B"
Split summaries by word boundaries/whitespace.
sums <- suppfilenames %>%
# sample_n(10) %>%
select(summary, Accession)
sums
## # A tibble: 8,882 × 2
## summary Accession
## <chr> <chr>
## 1 This SuperSeries is composed of the SubSeries listed below. GSE98414
## 2 Here we identify the activator protein-1 (AP-1) factor JunB as an … GSE98413
## 3 Analysis of whole gene expression during differentiation from hiPS… GSE83480
## 4 Single-cell epigenome sequencing techniques have recently been dev… GSE78140
## 5 We searched for roles of ZEB1and/or ZEB2 during EMT by RNA-seq in … GSE98273
## 6 RNAseq of ex vivo CD8 T cell lineages and in vitro differentiated … GSE89134
## 7 This SuperSeries is composed of the SubSeries listed below. Zinc f… GSE89206
## 8 We searched for roles of ZEB1 during EMT by RNA-seq in breast canc… GSE89205
## 9 Outputs from scRNA seq reads from isolated Mouse E16 and P4 lacrim… GSE100106
## 10 Immunodeficient mouse models have been valuable for studies of hum… GSE100077
## # … with 8,872 more rows
On a smaller scale: open ?regex and compare different regexes for splitting.
The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_]). Symbols \d, \s, \D and \S denote the digit and space classes and their negations (these are all extensions).
summary100 <- sums$summary[100] %>% str_split("\\s+")
summary100 <- summary100 %>% unlist
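To see how the choice of splitting regex matters, here is a sketch on a short invented sentence with a double space and some punctuation:

```r
library(stringr)

txt <- "RNA-seq of (in vitro) cells,  10,000 reads"

by_space  <- str_split(txt, "\\s")[[1]]   # every single whitespace character
by_spaces <- str_split(txt, "\\s+")[[1]]  # runs of whitespace
by_nonword <- str_split(txt, "\\W+")[[1]] # runs of non-word characters

by_space
by_spaces
by_nonword
```

Splitting on "\\s" leaves an empty string where two spaces meet, "\\s+" keeps punctuation attached to words, and "\\W+" strips punctuation but also breaks "RNA-seq" and "10,000" apart.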
Use of str_split within dplyr
sums %>%
sample_n(10) %>%
mutate(words = str_split(summary, "\\s")) %>%
select(-summary) %>%
unnest(words) %>%
count(words) %>%
arrange(desc(n))
## # A tibble: 547 × 2
## words n
## <chr> <int>
## 1 the 44
## 2 and 41
## 3 of 39
## 4 in 27
## 5 to 25
## 6 a 11
## 7 cells 9
## 8 that 9
## 9 by 8
## 10 cell 8
## # … with 537 more rows
str_to_lower()
str_to_lower(summary100)
## [1] "the" "mammalian" "genome"
## [4] "contains" "thousands" "of"
## [7] "loci" "that" "transcribe"
## [10] "long" "noncoding" "rnas"
## [13] "(lncrnas)," "some" "of"
## [16] "which" "are" "known"
## [19] "to" "play" "critical"
## [22] "roles" "in" "diverse"
## [25] "cellular" "processes" "through"
## [28] "a" "variety" "of"
## [31] "mechanisms." "while" "some"
## [34] "lncrna" "loci" "encode"
## [37] "rnas" "that" "act"
## [40] "non-locally" "(in" "trans),"
## [43] "emerging" "evidence" "indicates"
## [46] "that" "many" "lncrna"
## [49] "loci" "act" "locally"
## [52] "(in" "cis)" "to"
## [55] "regulate" "expression" "of"
## [58] "nearby" "genes—for" "example,"
## [61] "through" "functions" "of"
## [64] "the" "lncrna" "promoter,"
## [67] "transcription," "or" "transcript"
## [70] "itself." "despite" "their"
## [73] "potentially" "important" "roles,"
## [76] "it" "remains" "challenging"
## [79] "to" "identify" "functional"
## [82] "lncrna" "loci" "and"
## [85] "distinguish" "among" "these"
## [88] "and" "other" "mechanisms."
## [91] "to" "address" "these"
## [94] "challenges," "we" "developed"
## [97] "a" "genome-scale" "crispr-cas9"
## [100] "activation" "screen" "targeting"
## [103] "more" "than" "10,000"
## [106] "lncrna" "transcriptional" "start"
## [109] "sites" "(tsss)" "to"
## [112] "identify" "noncoding" "loci"
## [115] "that" "influence" "a"
## [118] "phenotype" "of" "interest."
## [121] "we" "found" "11"
## [124] "novel" "lncrna" "loci"
## [127] "that," "upon" "recruitment"
## [130] "of" "an" "activator,"
## [133] "each" "mediate" "braf"
## [136] "inhibitor" "resistance" "in"
## [139] "melanoma." "most" "candidate"
## [142] "loci" "appear" "to"
## [145] "regulate" "nearby" "genes."
## [148] "detailed" "analysis" "of"
## [151] "one" "candidate," "termed"
## [154] "emiceri," "revealed" "that"
## [157] "its" "transcriptional" "activation"
## [160] "results" "in" "dosage-dependent"
## [163] "activation" "of" "four"
## [166] "neighboring" "protein-coding" "genes,"
## [169] "one" "of" "which"
## [172] "confers" "the" "resistance"
## [175] "phenotype." "our" "screening"
## [178] "and" "characterization" "approach"
## [181] "provides" "a" "crispr"
## [184] "toolkit" "to" "systematically"
## [187] "discover" "functions" "of"
## [190] "noncoding" "loci" "and"
## [193] "elucidate" "their" "diverse"
## [196] "roles" "in" "gene"
## [199] "regulation" "and" "cellular"
## [202] "function."
str_to_title()
suppfilenames$title[1:10]
## [1] "Role of JunB in Th17 cell effector stability"
## [2] "Role of JunB in Th17 cell effector stability [RNA-seq]"
## [3] "Genome-wide analysis of human iPS cell-derived hepatocyte-like cells induced by methoxamine treatment."
## [4] "Single-cell multi-omics sequencing of mouse early embryos and embryonic stem cells"
## [5] "RNA-sequencing in MDA-231-D cells transfected with ZEB1 or ZEB2 siRNAs"
## [6] "Hit-and-run' programing of CAR-T cells using mRNA nanocarriers"
## [7] "ZEB1-regulated inflammatory phenotype in breast cancer cells"
## [8] "RNA-sequencing in TGF-beta treated MDA-231-D cells transfected with ZEB1/ZEB2 siRNAs [RNA-seq]"
## [9] "10X Genomics scRNA sequence Lacrimal Data set"
## [10] "Developmentally-Faithful and Effective Human Erythropoiesis in Immunodeficient and Kit Mutant Mice"
str_to_title(suppfilenames$title[1:10])
## [1] "Role Of Junb In Th17 Cell Effector Stability"
## [2] "Role Of Junb In Th17 Cell Effector Stability [Rna-Seq]"
## [3] "Genome-Wide Analysis Of Human Ips Cell-Derived Hepatocyte-Like Cells Induced By Methoxamine Treatment."
## [4] "Single-Cell Multi-Omics Sequencing Of Mouse Early Embryos And Embryonic Stem Cells"
## [5] "Rna-Sequencing In Mda-231-D Cells Transfected With Zeb1 Or Zeb2 Sirnas"
## [6] "Hit-And-Run' Programing Of Car-T Cells Using Mrna Nanocarriers"
## [7] "Zeb1-Regulated Inflammatory Phenotype In Breast Cancer Cells"
## [8] "Rna-Sequencing In Tgf-Beta Treated Mda-231-D Cells Transfected With Zeb1/Zeb2 Sirnas [Rna-Seq]"
## [9] "10x Genomics Scrna Sequence Lacrimal Data Set"
## [10] "Developmentally-Faithful And Effective Human Erythropoiesis In Immunodeficient And Kit Mutant Mice"
str_to_upper()
suppfilenames$title[1:10]
## [1] "Role of JunB in Th17 cell effector stability"
## [2] "Role of JunB in Th17 cell effector stability [RNA-seq]"
## [3] "Genome-wide analysis of human iPS cell-derived hepatocyte-like cells induced by methoxamine treatment."
## [4] "Single-cell multi-omics sequencing of mouse early embryos and embryonic stem cells"
## [5] "RNA-sequencing in MDA-231-D cells transfected with ZEB1 or ZEB2 siRNAs"
## [6] "Hit-and-run' programing of CAR-T cells using mRNA nanocarriers"
## [7] "ZEB1-regulated inflammatory phenotype in breast cancer cells"
## [8] "RNA-sequencing in TGF-beta treated MDA-231-D cells transfected with ZEB1/ZEB2 siRNAs [RNA-seq]"
## [9] "10X Genomics scRNA sequence Lacrimal Data set"
## [10] "Developmentally-Faithful and Effective Human Erythropoiesis in Immunodeficient and Kit Mutant Mice"
str_to_upper(suppfilenames$title[1:10])
## [1] "ROLE OF JUNB IN TH17 CELL EFFECTOR STABILITY"
## [2] "ROLE OF JUNB IN TH17 CELL EFFECTOR STABILITY [RNA-SEQ]"
## [3] "GENOME-WIDE ANALYSIS OF HUMAN IPS CELL-DERIVED HEPATOCYTE-LIKE CELLS INDUCED BY METHOXAMINE TREATMENT."
## [4] "SINGLE-CELL MULTI-OMICS SEQUENCING OF MOUSE EARLY EMBRYOS AND EMBRYONIC STEM CELLS"
## [5] "RNA-SEQUENCING IN MDA-231-D CELLS TRANSFECTED WITH ZEB1 OR ZEB2 SIRNAS"
## [6] "HIT-AND-RUN' PROGRAMING OF CAR-T CELLS USING MRNA NANOCARRIERS"
## [7] "ZEB1-REGULATED INFLAMMATORY PHENOTYPE IN BREAST CANCER CELLS"
## [8] "RNA-SEQUENCING IN TGF-BETA TREATED MDA-231-D CELLS TRANSFECTED WITH ZEB1/ZEB2 SIRNAS [RNA-SEQ]"
## [9] "10X GENOMICS SCRNA SEQUENCE LACRIMAL DATA SET"
## [10] "DEVELOPMENTALLY-FAITHFUL AND EFFECTIVE HUMAN ERYTHROPOIESIS IN IMMUNODEFICIENT AND KIT MUTANT MICE"
str_trunc()
str_trunc(suppfilenames$title[1:10], width = 30, side = "right")
## [1] "Role of JunB in Th17 cell e..." "Role of JunB in Th17 cell e..."
## [3] "Genome-wide analysis of hum..." "Single-cell multi-omics seq..."
## [5] "RNA-sequencing in MDA-231-D..." "Hit-and-run' programing of ..."
## [7] "ZEB1-regulated inflammatory..." "RNA-sequencing in TGF-beta ..."
## [9] "10X Genomics scRNA sequence..." "Developmentally-Faithful an..."
str_trunc(suppfilenames$title[1:10], width = 30, side = "center")
## [1] "Role of JunB i...tor stability" "Role of JunB i...ity [RNA-seq]"
## [3] "Genome-wide an...ne treatment." "Single-cell mu...ic stem cells"
## [5] "RNA-sequencing...r ZEB2 siRNAs" "Hit-and-run' p... nanocarriers"
## [7] "ZEB1-regulated... cancer cells" "RNA-sequencing...NAs [RNA-seq]"
## [9] "10X Genomics s...imal Data set" "Developmentall...t Mutant Mice"
str_trunc(suppfilenames$title[1:10], width = 30, side = "left")
## [1] "...h17 cell effector stability" "...ffector stability [RNA-seq]"
## [3] "...d by methoxamine treatment." "...os and embryonic stem cells"
## [5] "...ed with ZEB1 or ZEB2 siRNAs" "...lls using mRNA nanocarriers"
## [7] "...type in breast cancer cells" "... ZEB1/ZEB2 siRNAs [RNA-seq]"
## [9] "... sequence Lacrimal Data set" "...ficient and Kit Mutant Mice"
str_wrap()
str_wrap(suppfilenames$title[1], width = 30)
## [1] "Role of JunB in Th17 cell\neffector stability"
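The \n in the result only becomes a visible line break when the string is written out. A sketch with writeLines(), using the title shown above:

```r
library(stringr)

title <- "Role of JunB in Th17 cell effector stability"
wrapped <- str_wrap(title, width = 30)

# print with real line breaks instead of the escaped \n
writeLines(wrapped)

# check the length of each resulting line
wrapped_lines <- str_split(wrapped, "\n")[[1]]
str_length(wrapped_lines)
```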
paste()
# letters
paste("one", letters)
## [1] "one a" "one b" "one c" "one d" "one e" "one f" "one g" "one h" "one i"
## [10] "one j" "one k" "one l" "one m" "one n" "one o" "one p" "one q" "one r"
## [19] "one s" "one t" "one u" "one v" "one w" "one x" "one y" "one z"
paste("one", letters, collapse = " + ")
## [1] "one a + one b + one c + one d + one e + one f + one g + one h + one i + one j + one k + one l + one m + one n + one o + one p + one q + one r + one s + one t + one u + one v + one w + one x + one y + one z"
paste("one", letters, sep = "+")
## [1] "one+a" "one+b" "one+c" "one+d" "one+e" "one+f" "one+g" "one+h" "one+i"
## [10] "one+j" "one+k" "one+l" "one+m" "one+n" "one+o" "one+p" "one+q" "one+r"
## [19] "one+s" "one+t" "one+u" "one+v" "one+w" "one+x" "one+y" "one+z"
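sep and collapse can be combined: sep joins the arguments element by element, collapse then joins the resulting vector into a single string. A sketch with made-up ids:

```r
# sep: glue "GSE" to each number
ids <- paste("GSE", 1:3, sep = "")

# sep + collapse: one string, e.g. usable as an alternation regex
ids_regex <- paste("GSE", 1:3, sep = "", collapse = "|")

ids
ids_regex
```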
paste0()
paste0("one", letters)
## [1] "onea" "oneb" "onec" "oned" "onee" "onef" "oneg" "oneh" "onei" "onej"
## [11] "onek" "onel" "onem" "onen" "oneo" "onep" "oneq" "oner" "ones" "onet"
## [21] "oneu" "onev" "onew" "onex" "oney" "onez"
str_c()
(analogue of paste0())
str_c("XXX", summary100)
## [1] "XXXThe" "XXXmammalian" "XXXgenome"
## [4] "XXXcontains" "XXXthousands" "XXXof"
## [7] "XXXloci" "XXXthat" "XXXtranscribe"
## [10] "XXXlong" "XXXnoncoding" "XXXRNAs"
## [13] "XXX(lncRNAs)," "XXXsome" "XXXof"
## [16] "XXXwhich" "XXXare" "XXXknown"
## [19] "XXXto" "XXXplay" "XXXcritical"
## [22] "XXXroles" "XXXin" "XXXdiverse"
## [25] "XXXcellular" "XXXprocesses" "XXXthrough"
## [28] "XXXa" "XXXvariety" "XXXof"
## [31] "XXXmechanisms." "XXXWhile" "XXXsome"
## [34] "XXXlncRNA" "XXXloci" "XXXencode"
## [37] "XXXRNAs" "XXXthat" "XXXact"
## [40] "XXXnon-locally" "XXX(in" "XXXtrans),"
## [43] "XXXemerging" "XXXevidence" "XXXindicates"
## [46] "XXXthat" "XXXmany" "XXXlncRNA"
## [49] "XXXloci" "XXXact" "XXXlocally"
## [52] "XXX(in" "XXXcis)" "XXXto"
## [55] "XXXregulate" "XXXexpression" "XXXof"
## [58] "XXXnearby" "XXXgenes—for" "XXXexample,"
## [61] "XXXthrough" "XXXfunctions" "XXXof"
## [64] "XXXthe" "XXXlncRNA" "XXXpromoter,"
## [67] "XXXtranscription," "XXXor" "XXXtranscript"
## [70] "XXXitself." "XXXDespite" "XXXtheir"
## [73] "XXXpotentially" "XXXimportant" "XXXroles,"
## [76] "XXXit" "XXXremains" "XXXchallenging"
## [79] "XXXto" "XXXidentify" "XXXfunctional"
## [82] "XXXlncRNA" "XXXloci" "XXXand"
## [85] "XXXdistinguish" "XXXamong" "XXXthese"
## [88] "XXXand" "XXXother" "XXXmechanisms."
## [91] "XXXTo" "XXXaddress" "XXXthese"
## [94] "XXXchallenges," "XXXwe" "XXXdeveloped"
## [97] "XXXa" "XXXgenome-scale" "XXXCRISPR-Cas9"
## [100] "XXXactivation" "XXXscreen" "XXXtargeting"
## [103] "XXXmore" "XXXthan" "XXX10,000"
## [106] "XXXlncRNA" "XXXtranscriptional" "XXXstart"
## [109] "XXXsites" "XXX(TSSs)" "XXXto"
## [112] "XXXidentify" "XXXnoncoding" "XXXloci"
## [115] "XXXthat" "XXXinfluence" "XXXa"
## [118] "XXXphenotype" "XXXof" "XXXinterest."
## [121] "XXXWe" "XXXfound" "XXX11"
## [124] "XXXnovel" "XXXlncRNA" "XXXloci"
## [127] "XXXthat," "XXXupon" "XXXrecruitment"
## [130] "XXXof" "XXXan" "XXXactivator,"
## [133] "XXXeach" "XXXmediate" "XXXBRAF"
## [136] "XXXinhibitor" "XXXresistance" "XXXin"
## [139] "XXXmelanoma." "XXXMost" "XXXcandidate"
## [142] "XXXloci" "XXXappear" "XXXto"
## [145] "XXXregulate" "XXXnearby" "XXXgenes."
## [148] "XXXDetailed" "XXXanalysis" "XXXof"
## [151] "XXXone" "XXXcandidate," "XXXtermed"
## [154] "XXXEMICERI," "XXXrevealed" "XXXthat"
## [157] "XXXits" "XXXtranscriptional" "XXXactivation"
## [160] "XXXresults" "XXXin" "XXXdosage-dependent"
## [163] "XXXactivation" "XXXof" "XXXfour"
## [166] "XXXneighboring" "XXXprotein-coding" "XXXgenes,"
## [169] "XXXone" "XXXof" "XXXwhich"
## [172] "XXXconfers" "XXXthe" "XXXresistance"
## [175] "XXXphenotype." "XXXOur" "XXXscreening"
## [178] "XXXand" "XXXcharacterization" "XXXapproach"
## [181] "XXXprovides" "XXXa" "XXXCRISPR"
## [184] "XXXtoolkit" "XXXto" "XXXsystematically"
## [187] "XXXdiscover" "XXXfunctions" "XXXof"
## [190] "XXXnoncoding" "XXXloci" "XXXand"
## [193] "XXXelucidate" "XXXtheir" "XXXdiverse"
## [196] "XXXroles" "XXXin" "XXXgene"
## [199] "XXXregulation" "XXXand" "XXXcellular"
## [202] "XXXfunction."
str_c(summary100, collapse = " ")
## [1] "The mammalian genome contains thousands of loci that transcribe long noncoding RNAs (lncRNAs), some of which are known to play critical roles in diverse cellular processes through a variety of mechanisms. While some lncRNA loci encode RNAs that act non-locally (in trans), emerging evidence indicates that many lncRNA loci act locally (in cis) to regulate expression of nearby genes—for example, through functions of the lncRNA promoter, transcription, or transcript itself. Despite their potentially important roles, it remains challenging to identify functional lncRNA loci and distinguish among these and other mechanisms. To address these challenges, we developed a genome-scale CRISPR-Cas9 activation screen targeting more than 10,000 lncRNA transcriptional start sites (TSSs) to identify noncoding loci that influence a phenotype of interest. We found 11 novel lncRNA loci that, upon recruitment of an activator, each mediate BRAF inhibitor resistance in melanoma. Most candidate loci appear to regulate nearby genes. Detailed analysis of one candidate, termed EMICERI, revealed that its transcriptional activation results in dosage-dependent activation of four neighboring protein-coding genes, one of which confers the resistance phenotype. Our screening and characterization approach provides a CRISPR toolkit to systematically discover functions of noncoding loci and elucidate their diverse roles in gene regulation and cellular function."
sprintf()
todays_date <- Sys.Date()
todays_date
## [1] "2021-10-07"
todays_temp <- 7
sprintf("Today is %s and temperature is %s", todays_date, todays_temp)
## [1] "Today is 2021-10-07 and temperature is 7"
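Besides %s, sprintf() has format specifiers for numbers. A brief sketch with invented values:

```r
temp <- 7.256

one_decimal <- sprintf("Temperature is %.1f degrees", temp)  # fixed number of decimals
n_files     <- sprintf("%d files", 5L)                       # %d expects an integer
padded_id   <- sprintf("GSE%05d", 42L)                       # zero-padded to width 5

# sprintf() is vectorised over its arguments
w <- c("a", "bb")
described <- sprintf("%s has %d characters", w, nchar(w))

one_decimal
n_files
padded_id
described
```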
glue()
library(glue)
##
## Attaching package: 'glue'
## The following object is masked from 'package:dplyr':
##
## collapse
?glue
glue("Answer to 1 + 1 is {1 + 1}.")
## Answer to 1 + 1 is 2.
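glue() evaluates anything inside the braces, so variables can be interpolated directly. A sketch with two made-up variables:

```r
library(glue)

accession <- "GSE98414"
n_files <- 2

msg <- glue("Series {accession} has {n_files} supplementary files.")
msg
```

Compared with sprintf(), the values appear in the template where they are used, which tends to be easier to read for longer messages.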
Book: http://tidytextmining.com
Unnest tokens (words)
# install.packages("tidytext")
library(tidytext)
tidy_sums <- sums %>%
unnest_tokens(word, summary)
tidy_sums
## # A tibble: 899,380 × 2
## Accession word
## <chr> <chr>
## 1 GSE98414 this
## 2 GSE98414 superseries
## 3 GSE98414 is
## 4 GSE98414 composed
## 5 GSE98414 of
## 6 GSE98414 the
## 7 GSE98414 subseries
## 8 GSE98414 listed
## 9 GSE98414 below
## 10 GSE98413 here
## # … with 899,370 more rows
Now the data is in one-word-per-row format.
We will want to remove stop words; stop words are words that are not useful for an analysis, typically extremely common words such as “the”, “of”, “to”, and so forth in English. We can remove stop words (kept in the tidytext dataset stop_words) with an anti_join().
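What anti_join() does can be sketched on a toy example. The tibble `my_stop_words` is a hypothetical mini stop-word list standing in for tidytext's stop_words.

```r
library(dplyr)

words <- tibble(word = c("the", "genome", "of", "mouse"))

# hypothetical stop-word list, for illustration only
my_stop_words <- tibble(word = c("the", "of", "to"))

# keep only the rows of `words` that have no match in `my_stop_words`
kept <- anti_join(words, my_stop_words, by = "word")
kept
```

Passing by = "word" explicitly also silences the "Joining, by" message.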
data(stop_words)
tidy_sums <- tidy_sums %>%
anti_join(stop_words)
## Joining, by = "word"
tidy_sums
## # A tibble: 560,867 × 2
## Accession word
## <chr> <chr>
## 1 GSE98414 superseries
## 2 GSE98414 composed
## 3 GSE98414 subseries
## 4 GSE98414 listed
## 5 GSE98413 identify
## 6 GSE98413 activator
## 7 GSE98413 protein
## 8 GSE98413 1
## 9 GSE98413 ap
## 10 GSE98413 1
## # … with 560,857 more rows
Now we can count words.
tidy_sums %>%
count(word, sort = TRUE)
## # A tibble: 28,086 × 2
## word n
## <chr> <int>
## 1 cells 8695
## 2 rna 7192
## 3 cell 6485
## 4 expression 6259
## 5 genes 5200
## 6 gene 4623
## 7 seq 4488
## 8 data 3088
## 9 human 2966
## 10 http 2917
## # … with 28,076 more rows
Words with more than 2000 occurrences:
tidy_sums %>%
count(word, sort = TRUE) %>%
filter(n > 2000) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
What is the tone of research abstracts?
library(textdata)
get_sentiments()
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
# Different sentiment databases
get_sentiments("nrc")$sentiment %>% unique
## [1] "trust" "fear" "negative" "sadness" "anger"
## [6] "surprise" "positive" "disgust" "joy" "anticipation"
# AFINN stores its numeric scores (integers from -5 to 5) in the `value` column
get_sentiments("afinn")$value %>% unique
get_sentiments("bing")$sentiment %>% unique()
## [1] "negative" "positive"
Abstract scores 1
tidy_sums %>%
inner_join(get_sentiments("afinn")) %>%
group_by(Accession) %>%
summarise(sentiment_value = sum(value)) %>%
ggplot(aes(sentiment_value)) +
geom_histogram(bins = 40)
## Joining, by = "word"
Let’s see the most negative and positive abstracts:
get_sentiments(lexicon = "bing")$sentiment %>% unique()
## [1] "negative" "positive"
summary_sentiments <- tidy_sums %>%
inner_join(get_sentiments("bing")) %>%
count(Accession, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
summary_sentiments
## # A tibble: 5,593 × 4
## Accession negative positive sentiment
## <chr> <int> <int> <int>
## 1 GSE100067 3 5 2
## 2 GSE100077 0 5 5
## 3 GSE100106 1 0 -1
## 4 GSE11724 0 2 2
## 5 GSE11892 0 1 1
## 6 GSE13652 0 2 2
## 7 GSE14092 11 4 -7
## 8 GSE14605 0 4 4
## 9 GSE15780 2 0 -2
## 10 GSE16190 2 5 3
## # … with 5,583 more rows
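The join, count and pivot steps are easier to follow on a toy example. Both tibbles below are invented; `lexicon` is a hypothetical four-word stand-in for the bing lexicon.

```r
library(dplyr)
library(tidyr)

tokens <- tibble(
  Accession = c("A", "A", "A", "B", "B"),
  word      = c("death", "novel", "stress", "effective", "cells")
)

# hypothetical mini lexicon, for illustration only
lexicon <- tibble(
  word      = c("death", "stress", "novel", "effective"),
  sentiment = c("negative", "negative", "positive", "positive")
)

scores <- tokens %>%
  inner_join(lexicon, by = "word") %>%   # drops words missing from the lexicon
  count(Accession, sentiment) %>%        # words per abstract and sentiment
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
scores
```

Abstracts without any lexicon word disappear at the inner_join() step, which is why the result above has fewer rows than there are GEO series.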
Slightly more negative sentiment in abstracts.
summary_sentiments %>%
ggplot(aes(sentiment)) +
geom_histogram(bins = 40)
Most negative summary:
summary_sentiments %>%
arrange(sentiment) %>%
left_join(select(suppfilenames, Accession, summary))
## Joining, by = "Accession"
## # A tibble: 5,593 × 5
## Accession negative positive sentiment summary
## <chr> <int> <int> <int> <chr>
## 1 GSE85013 28 0 -28 "Background: Despite of extensive rese…
## 2 GSE79423 27 3 -24 "Systematic analyses of the temporal d…
## 3 GSE85541 23 2 -21 "Anti-androgen therapies including the…
## 4 GSE86922 23 3 -20 "Caspases regulate cell death programs…
## 5 GSE63756 20 2 -18 "Endoplasmic reticulum (ER) stress occ…
## 6 GSE68140 18 1 -17 "We report positional cloning and char…
## 7 GSE80160 19 2 -17 "Diet-induced obesity is characterized…
## 8 GSE90478 21 4 -17 "Toxoplasma gondii is a ubiquitous api…
## 9 GSE93999 20 3 -17 "The influenza A virus is an acute con…
## 10 GSE68713 17 1 -16 "Maternal stress, anxiety, and depress…
## # … with 5,583 more rows
Let’s see what this GSE85013 is about.