Regexes

Find character strings

Character strings can be matched and manipulated in base R by using regular expressions in functions such as grep, grepl, sub, gsub, and regexpr combined with regmatches.
The tidyverse package ‘stringr’ provides analogous verbs with a more consistent syntax.
A regular expression is a pattern that describes a set of strings.
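
As a minimal sketch of the base R functions listed above, the following applies each of them to a small vector of file names. The vector x is a hypothetical illustration and is not part of the test dataset.

```r
# Hypothetical file names, used for illustration only
x <- c("GSE1_counts.txt.gz", "filelist.txt", "GSE2_RAW.tar")

idx     <- grep("^GSE", x)     # positions of the elements that match
is_gse  <- grepl("^GSE", x)    # logical vector, one value per element
first_t <- sub("t", "T", x)    # replace the first match in each element
all_t   <- gsub("t", "T", x)   # replace every match in each element

# regexpr() locates the first match and regmatches() extracts it;
# elements without a match are dropped from the result
ids <- regmatches(x, regexpr("GSE[0-9]+", x))

print(list(idx = idx, is_gse = is_gse, first_t = first_t,
           all_t = all_t, ids = ids))
```

Note that regmatches() returns a shorter vector than its input when some elements do not match, which matters when the result is stored next to the original data.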

Regular Expressions as used in R

Most characters, including all letters and digits, are regular expressions that match themselves.
Some characters are special, e.g. . matches any single character.
You can also refer to a character class, which is a list of characters enclosed between [ and ], e.g. [[:alnum:]] is the same as [A-Za-z0-9] for ASCII letters and digits.
The most common character classes:
- [:alnum:] includes alphanumerics ([:alpha:] and [:digit:]);
- [:alpha:] includes alphabetic characters ([:upper:] and [:lower:] case);
- [:punct:] includes punctuation characters ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~;
- [:blank:] includes space and tab; etc.
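
The classes above can be tried out on a few single characters. This is a small sketch with hypothetical test characters. The last line uses perl = TRUE as an assumption to make the range follow code-point order, because the interpretation of ranges can otherwise depend on the locale.

```r
# Hypothetical test characters, used for illustration only
x <- c("a", "Z", "7", "_", " ", "\t", "!")

is_alnum <- grepl("[[:alnum:]]", x)   # letters and digits
is_blank <- grepl("[[:blank:]]", x)   # space and tab
is_punct <- grepl("[[:punct:]]", x)   # punctuation characters

# A range such as [A-z] is wider than it looks: in code-point order it also
# covers the characters that sit between "Z" and "a", including "_"
in_range <- grepl("[A-z]", x, perl = TRUE)

print(data.frame(x, is_alnum, is_blank, is_punct, in_range))
```

Named classes such as [[:alpha:]] are therefore a safer choice than hand-written ranges.
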
The metacharacters in regular expressions are . \ | ( ) [ { ^ $ * + ?, but whether they have a special meaning depends on the context.
To match a metacharacter as a regular character, precede it with a backslash; inside an R string the backslash itself must be escaped, so you type a double backslash, e.g. "\\." matches a literal dot.
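
The effect of escaping can be seen on a literal dot. The sketch below uses two hypothetical file names, where only the first one contains the string "RAW.tar".

```r
# Hypothetical file names, used for illustration only
files <- c("GSE1_RAW.tar", "GSE2_RAWxtar.txt")

unescaped <- grepl("RAW.tar", files)                # . matches any character
escaped   <- grepl("RAW\\.tar", files)              # \\. matches a literal dot
literal   <- grepl("RAW.tar", files, fixed = TRUE)  # whole pattern taken literally

print(rbind(unescaped, escaped, literal))
```

When the whole pattern is meant literally, fixed = TRUE avoids escaping altogether.
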
Repetition quantifiers placed after a regex specify how many times it is matched: ?, optional, at most once; *, zero or more times; +, one or more times; {n}, exactly n times; {n,}, n or more times; {n,m}, n to m times.
The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line.
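
Quantifiers and anchors are easiest to compare side by side. The sketch below tests each quantifier on hypothetical identifiers with zero to four digits. The names in patterns are labels chosen for this illustration.

```r
# Hypothetical identifiers with 0 to 4 trailing digits
x <- c("GSE", "GSE1", "GSE12", "GSE123", "GSE1234")

patterns <- c(
  optional     = "^GSE[0-9]?$",
  zero_or_more = "^GSE[0-9]*$",
  one_or_more  = "^GSE[0-9]+$",
  exactly_2    = "^GSE[0-9]{2}$",
  two_or_more  = "^GSE[0-9]{2,}$",
  two_to_three = "^GSE[0-9]{2,3}$"
)

# One column per pattern, one row per string
res <- sapply(patterns, grepl, x = x)
rownames(res) <- x
print(res)

# Without the anchors ^ and $ the pattern may match anywhere in the string,
# so longer identifiers match as well
unanchored <- grepl("GSE[0-9]{2}", x)
print(unanchored)
```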

Common operations with regular expressions

Locate a pattern match (positions)
Extract a matched pattern
Identify a match to a pattern
Replace a matched pattern
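
Each of these four operations has a stringr verb. A minimal sketch on a single file name of the kind found in the test dataset:

```r
library(stringr)

x <- "GSE98414_RAW.tar"

loc     <- str_locate(x, "[0-9]+")          # locate: start and end positions
id      <- str_extract(x, "[0-9]+")         # extract: the matched text
has_raw <- str_detect(x, "RAW")             # identify: TRUE or FALSE
stub    <- str_replace(x, "^GSE[0-9]+_", "") # replace: here with an empty string

print(loc)
print(c(id = id, has_raw = has_raw, stub = stub))
```

All stringr verbs take the string as the first argument and the pattern as the second, which makes them easy to use in pipes.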

Let’s try it out

Download the test dataset.

The test dataset contains supplementary file names and some metadata of gene expression profiling experiments using high-throughput sequencing:

https://www.ncbi.nlm.nih.gov/gds?term=%22expression+profiling+by+high+throughput+sequencing%22[DataSet+Type]

if(!dir.exists("data")){
  dir.create("data")
}
## manually download suppfilenames_2017-06-19.RData from rstats-tartu/datasets
## alternatively clone the repo 'rstats-tartu/regex-demo'

Load data

load("data/suppfilenames_2017-06-19.RData")

Unnest dataset

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.3     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(stringr)
## Filter out rows with missing file names
suppfilenames <- suppfilenames %>% 
  filter(!map_lgl(SuppFileNames, ~ inherits(., "try-error")))
suppfilenames %>% select(Accession, PDAT, SuppFileNames)

## # A tibble: 8,882 × 3
##    Accession PDAT       SuppFileNames
##    <chr>     <chr>      <list>       
##  1 GSE98414  2017/06/17 <chr [2]>    
##  2 GSE98413  2017/06/17 <chr [2]>    
##  3 GSE83480  2017/06/17 <chr [2]>    
##  4 GSE78140  2017/06/16 <chr [2]>    
##  5 GSE98273  2017/06/16 <chr [1]>    
##  6 GSE89134  2017/06/16 <chr [4]>    
##  7 GSE89206  2017/06/16 <chr [2]>    
##  8 GSE89205  2017/06/16 <chr [1]>    
##  9 GSE100106 2017/06/16 <chr [2]>    
## 10 GSE100077 2017/06/16 <chr [2]>    
## # … with 8,872 more rows

## unnest supplementary file names
supfn <-  suppfilenames %>% unnest(SuppFileNames)
supfn %>% select(Accession, PDAT, SuppFileNames)

## # A tibble: 23,118 × 3
##    Accession PDAT       SuppFileNames                               
##    <chr>     <chr>      <chr>                                       
##  1 GSE98414  2017/06/17 filelist.txt                                
##  2 GSE98414  2017/06/17 GSE98414_RAW.tar                            
##  3 GSE98413  2017/06/17 GSE98413_Junb_counts_normalized.tab.gz      
##  4 GSE98413  2017/06/17 GSE98413_Junb_counts_raw.tab.gz             
##  5 GSE83480  2017/06/17 filelist.txt                                
##  6 GSE83480  2017/06/17 GSE83480_RAW.tar                            
##  7 GSE78140  2017/06/16 filelist.txt                                
##  8 GSE78140  2017/06/16 GSE78140_RAW.tar                            
##  9 GSE98273  2017/06/16 GSE98273_genes.fpkm_tracking.txt.gz         
## 10 GSE89134  2017/06/16 GSE89134_CentralMemoryVsControlNPDay8.txt.gz
## # … with 23,108 more rows

Get string length

Use str_length() to get the length of a text string, i.e. the number of characters in the string.

str_length("banana")

## [1] 6

str_length("")

## [1] 0

Length of supplementary file names.

supfn <- supfn %>% 
  select(Accession, PDAT, SuppFileNames) %>% 
  mutate(strlen = str_length(SuppFileNames))
supfn

## # A tibble: 23,118 × 4
##    Accession PDAT       SuppFileNames                                strlen
##    <chr>     <chr>      <chr>                                         <int>
##  1 GSE98414  2017/06/17 filelist.txt                                     12
##  2 GSE98414  2017/06/17 GSE98414_RAW.tar                                 16
##  3 GSE98413  2017/06/17 GSE98413_Junb_counts_normalized.tab.gz           38
##  4 GSE98413  2017/06/17 GSE98413_Junb_counts_raw.tab.gz                  31
##  5 GSE83480  2017/06/17 filelist.txt                                     12
##  6 GSE83480  2017/06/17 GSE83480_RAW.tar                                 16
##  7 GSE78140  2017/06/16 filelist.txt                                     12
##  8 GSE78140  2017/06/16 GSE78140_RAW.tar                                 16
##  9 GSE98273  2017/06/16 GSE98273_genes.fpkm_tracking.txt.gz              35
## 10 GSE89134  2017/06/16 GSE89134_CentralMemoryVsControlNPDay8.txt.gz     44
## # … with 23,108 more rows

Plot the length distribution of supplementary file names:

ggplot(supfn, aes(strlen)) + geom_histogram(bins = 40)

The distribution seems skewed. What if we plot log-transformed strlen values?

ggplot(supfn, aes(log2(strlen))) + geom_histogram(bins = 40)

Let’s look at the filenames

# Single most common filename: filelist.txt
most_common_filename <- supfn %>% 
  group_by(SuppFileNames) %>% 
  summarise(N = n()) %>% 
  arrange(desc(N))
most_common_filename

## # A tibble: 17,697 × 2
##    SuppFileNames                               N
##    <chr>                                   <int>
##  1 filelist.txt                             5422
##  2 GSE100067_Expression_table.txt.gz           1
##  3 GSE100077_Counts.csv.gz                     1
##  4 GSE100077_DifferentialExpression.csv.gz     1
##  5 GSE100106_RAW.tar                           1
##  6 GSE11724_RAW.tar                            1
##  7 GSE11892_RAW.tar                            1
##  8 GSE12075_RAW.tar                            1
##  9 GSE12946_RAW.tar                            1
## 10 GSE13652_RAW.tar                            1
## # … with 17,687 more rows

String manipulation

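File names are prefixed with the GSE ID. Stripping this prefix and the .gz suffix reveals the common file name stubs.

# Supplementary file names with more than N = 10 occurrences:
# first strip the GSE prefix and the .gz suffix, then convert to lower case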
cf <- supfn %>%
  mutate(common_filenames = str_replace(SuppFileNames, "GSE[0-9]+_", ""),
         common_filenames = str_replace(common_filenames, "\\.gz$", ""),
         common_filenames = str_to_lower(common_filenames))
cf

## # A tibble: 23,118 × 5
##    Accession PDAT       SuppFileNames              strlen common_filenames      
##    <chr>     <chr>      <chr>                       <int> <chr>                 
##  1 GSE98414  2017/06/17 filelist.txt                   12 filelist.txt          
##  2 GSE98414  2017/06/17 GSE98414_RAW.tar               16 raw.tar               
##  3 GSE98413  2017/06/17 GSE98413_Junb_counts_norm…     38 junb_counts_normalize…
##  4 GSE98413  2017/06/17 GSE98413_Junb_counts_raw.…     31 junb_counts_raw.tab   
##  5 GSE83480  2017/06/17 filelist.txt                   12 filelist.txt          
##  6 GSE83480  2017/06/17 GSE83480_RAW.tar               16 raw.tar               
##  7 GSE78140  2017/06/16 filelist.txt                   12 filelist.txt          
##  8 GSE78140  2017/06/16 GSE78140_RAW.tar               16 raw.tar               
##  9 GSE98273  2017/06/16 GSE98273_genes.fpkm_track…     35 genes.fpkm_tracking.t…
## 10 GSE89134  2017/06/16 GSE89134_CentralMemoryVsC…     44 centralmemoryvscontro…
## # … with 23,108 more rows

cfn <- group_by(cf, common_filenames) %>% 
  summarise(N = n()) %>% 
  arrange(desc(N)) %>% 
  filter(N > 10)
cfn

## # A tibble: 14 × 2
##    common_filenames            N
##    <chr>                   <int>
##  1 filelist.txt             5422
##  2 raw.tar                  5422
##  3 gene_exp.diff              66
##  4 readme.txt                 65
##  5 genes.fpkm_tracking        39
##  6 counts.txt                 31
##  7 processed_data.txt         28
##  8 rpkm.txt                   28
##  9 fpkm.txt                   26
## 10 raw_counts.txt             20
## 11 normalized_counts.txt      16
## 12 gene_exp.diff.txt          13
## 13 isoform_exp.diff           13
## 14 genes.fpkm_tracking.txt    11

cfp <- ggplot(cfn, aes(common_filenames, N)) +
  geom_point() +
  scale_x_discrete(limits = rev(cfn$common_filenames)) +
  scale_y_log10() +
  coord_flip() + 
  xlab("Common stubs of SuppFileNames\n(>10 occurrences)") +
  ylab("Number of files")

# plot commonfilenames ggplot
cfp

File name length distribution 2

Now we can filter out the “filelist.txt” and “RAW.tar” files and replot the file name length distribution.

filter(supfn, !str_detect(SuppFileNames, "filelist|RAW\\.tar")) %>% 
  ggplot(aes(log2(strlen))) + geom_histogram(bins = 40)

`str_detect()`

## str_detect generates logical vector of matches and nonmatches
## match letter b against alphabet and get index of TRUE values 
str_detect(letters, "b") %>% which

## [1] 2
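
Because str_detect() returns a logical vector, its result can be counted, negated, or used for subsetting. A small sketch on hypothetical file names:

```r
library(stringr)

# Hypothetical file names, used for illustration only
x <- c("filelist.txt", "GSE1_RAW.tar", "GSE1_counts.txt.gz")

hit <- str_detect(x, "\\.txt")

n_hits  <- sum(hit)                 # number of matching elements
kept    <- x[hit]                   # keep the matches
dropped <- x[!hit]                  # keep the non-matches
same    <- str_subset(x, "\\.txt")  # shortcut for x[str_detect(x, pattern)]

print(list(n_hits = n_hits, kept = kept, dropped = dropped, same = same))
```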

Regular expressions can be ugly

We want to filter out some file types.

# we are looking only for tabular data. 
out_string1 <- c("filelist|annotation|readme|error|raw.tar|csfasta|bam|sam|bed|[:punct:]hic|hdf5|bismark|map|barcode|peaks")
out_string2 <- c("tar","gtf","(big)?bed(\\.txt|12|graph|pk)?","bw",
                 "wig","hic","gct(x)?","tdf","gff(3)?","pdf","png","zip",
                 "sif","narrowpeak","fa", "r$", "rda(ta)?$")
paste0(out_string2, "(\\.gz|\\.bz2)?$", collapse = "|")

## [1] "tar(\\.gz|\\.bz2)?$|gtf(\\.gz|\\.bz2)?$|(big)?bed(\\.txt|12|graph|pk)?(\\.gz|\\.bz2)?$|bw(\\.gz|\\.bz2)?$|wig(\\.gz|\\.bz2)?$|hic(\\.gz|\\.bz2)?$|gct(x)?(\\.gz|\\.bz2)?$|tdf(\\.gz|\\.bz2)?$|gff(3)?(\\.gz|\\.bz2)?$|pdf(\\.gz|\\.bz2)?$|png(\\.gz|\\.bz2)?$|zip(\\.gz|\\.bz2)?$|sif(\\.gz|\\.bz2)?$|narrowpeak(\\.gz|\\.bz2)?$|fa(\\.gz|\\.bz2)?$|r$(\\.gz|\\.bz2)?$|rda(ta)?$(\\.gz|\\.bz2)?$"
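
The long pattern above is an alternation built by paste0() with collapse = "|". The sketch below repeats the construction on a hypothetical subset of three extensions and compares it with a more compact form that groups the alternatives once. The names p1, p2 and files are illustrations, not part of the original analysis.

```r
library(stringr)

exts   <- c("tar", "pdf", "png")   # hypothetical subset of extensions
suffix <- "(\\.gz|\\.bz2)?$"       # optional compression suffix, then end of string

# Same construction as above: the suffix is repeated for every alternative
p1 <- paste0(exts, suffix, collapse = "|")

# Compact alternative: group the extensions, then append the suffix once
p2 <- paste0("(", paste(exts, collapse = "|"), ")", suffix)

# Hypothetical file names, used for illustration only
files <- c("GSE1_RAW.tar", "figure.pdf.gz", "counts.txt.gz")

print(p1)
print(p2)
print(str_detect(files, p1))
print(str_detect(files, p2))
```

Both patterns select the same files here. The grouped form is shorter and easier to read when the list of extensions grows.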

suppfiles_of_interest <- supfn %>%
  filter(!str_detect(tolower(SuppFileNames), out_string1),
         !str_detect(tolower(SuppFileNames), paste0(out_string2, "(\\.gz|\\.bz2)?$", collapse = "|"))) %>%
  mutate(filext = str_extract(str_to_lower(SuppFileNames), "\\.[:alpha:]+([:punct:][bgz2]+)?$"))
suppfiles_of_interest

## # A tibble: 7,928 × 5
##    Accession PDAT       SuppFileNames                              strlen filext
##    <chr>     <chr>      <chr>                                       <int> <chr> 
##  1 GSE98413  2017/06/17 GSE98413_Junb_counts_normalized.tab.gz         38 .tab.…
##  2 GSE98413  2017/06/17 GSE98413_Junb_counts_raw.tab.gz                31 .tab.…
##  3 GSE98273  2017/06/16 GSE98273_genes.fpkm_tracking.txt.gz            35 .txt.…
##  4 GSE89134  2017/06/16 GSE89134_CentralMemoryVsControlNPDay8.txt…     44 .txt.…
##  5 GSE89134  2017/06/16 GSE89134_Counts.txt.gz                         22 .txt.…
##  6 GSE89134  2017/06/16 GSE89134_Foxo1NPDay8VsControlNPDay8.txt.gz     42 .txt.…
##  7 GSE89134  2017/06/16 GSE89134_log2NormalizedCPM.txt.gz              33 .txt.…
##  8 GSE89205  2017/06/16 GSE89205_genes.fpkm_tracking.txt.gz            35 .txt.…
##  9 GSE100077 2017/06/16 GSE100077_Counts.csv.gz                        23 .csv.…
## 10 GSE100077 2017/06/16 GSE100077_DifferentialExpression.csv.gz        39 .csv.…
## # … with 7,918 more rows

Most popular file extensions of potentially interesting files.

fext <- group_by(suppfiles_of_interest, filext) %>% 
  summarise(N = n()) %>% 
  arrange(desc(N)) %>% 
  filter(N > 10)
fext

## # A tibble: 9 × 2
##   filext       N
##   <chr>    <int>
## 1 .txt.gz   5377
## 2 .csv.gz    675
## 3 .xlsx      592
## 4 .xls.gz    309
## 5 .diff.gz   289
## 6 .tsv.gz    257
## 7 .xlsx.gz   180
## 8 .gz        131
## 9 .tab.gz     61

ggplot(fext, aes(filext, N)) +
  geom_point() +
  scale_x_discrete(limits = rev(fext$filext)) +
  scale_y_log10() +
  coord_flip() + 
  xlab("Common file extensions\n(>10 occurrences)") +
  ylab("Number of files")

Look for a word

Let’s find summaries containing the word “CRISPR”.

crispr <- suppfilenames %>% 
  filter(str_detect(str_to_lower(summary), "crispr"))
crispr

## # A tibble: 95 × 29
##    Id        Accession GDS   title  summary  GPL   GSE   taxon entryType gdsType
##    <chr>     <chr>     <chr> <chr>  <chr>    <chr> <chr> <chr> <chr>     <chr>  
##  1 200095383 GSE95383  ""    Expre… The Pol… 21273 95383 Mus … GSE       Expres…
##  2 200099819 GSE99819  ""    Genom… The mam… 18573 99819 Homo… GSE       Expres…
##  3 200078519 GSE78519  ""    Capic… We perf… 16791 78519 Homo… GSE       Expres…
##  4 200095455 GSE95455  ""    Asses… HeLa ce… 16791 95455 Homo… GSE       Expres…
##  5 200095452 GSE95452  ""    Asses… By comp… 16791 95452 Homo… GSE       Expres…
##  6 200093681 GSE93681  ""    CRISP… Epstein… 18573 93681 Homo… GSE       Expres…
##  7 200098063 GSE98063  ""    Mll3 … Monomet… 19057 98063 Mus … GSE       Expres…
##  8 200093395 GSE93395  ""    Gene … Recent … 11154 93395 Homo… GSE       Expres…
##  9 200098177 GSE98177  ""    Trans… We prev… 20301 98177 Homo… GSE       Expres…
## 10 200083296 GSE83296  ""    Genom… Gene ex… 18573 83296 Homo… GSE       Expres…
## # … with 85 more rows, and 19 more variables: ptechType <chr>, valType <chr>,
## #   SSInfo <chr>, subsetInfo <chr>, PDAT <chr>, suppFile <chr>, Samples <chr>,
## #   Relations <chr>, ExtRelations <chr>, n_samples <chr>, SeriesTitle <chr>,
## #   PlatformTitle <chr>, PlatformTaxa <chr>, SamplesTaxa <chr>,
## #   PubMedIds <chr>, Projects <chr>, FTPLink <chr>, GEO2R <chr>,
## #   SuppFileNames <list>

We have 95 GEO series containing the word “crispr” in their summary.

When did people start publishing experiments using CRISPR?

crispr %>% 
  ggplot(aes(lubridate::ymd(PDAT))) +
  geom_histogram(aes(y = cumsum(after_stat(count))), bins = 30) +
  labs(title = "Number of Entrez GEO series mentioning CRISPR in summary",
       caption = "Data: Entrez GEO",
       y = "Cumulative number of studies",
       x = "Publication date")

Replace parts of a string

str_replace()

Strings are commonly removed by replacing them with an empty string; str_remove() is a shortcut for this.
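
A short sketch of this equivalence, using two file names of the kind found in the test dataset:

```r
library(stringr)

x <- c("GSE98414_RAW.tar", "GSE98413_Junb_counts_raw.tab.gz")

replaced <- str_replace(x, "^GSE[0-9]+_", "")  # replace the match with ""
removed  <- str_remove(x, "^GSE[0-9]+_")       # same result, shorter call

first_only <- str_remove(x, "_")       # removes only the first underscore
every_one  <- str_remove_all(x, "_")   # removes every underscore

print(identical(replaced, removed))
print(first_only)
print(every_one)
```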

Suppose we want to fix these FTP links, which have a string prepended to the URL. We want to remove this “SRASRP..” part to get the bare URL. We do this by replacing “SRASRP..” with an empty string "" (alternatively, you can extract the URL).

set.seed(2)
ftplinks <- suppfilenames %>% 
  select(ExtRelations) %>% 
  sample_n(5)
ftplinks

## # A tibble: 5 × 1
##   ExtRelations                                                                  
##   <chr>                                                                         
## 1 ""                                                                            
## 2 "SRASRP015711ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/s…
## 3 "SRASRP064758ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/s…
## 4 "SRASRP081655ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/s…
## 5 "SRASRP076951ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/s…

ftplinks$ExtRelations[2] %>% str_replace("SRASRP[0-9]+", "")

## [1] "ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP015/SRP015711/"
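
The alternative mentioned above, extracting the URL instead of removing the prefix, could look like the following sketch. The string is copied from the output above, and the pattern "ftp://.+" assumes that the URL runs to the end of the string.

```r
library(stringr)

link <- "SRASRP015711ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP015/SRP015711/"

# Keep everything from "ftp://" to the end of the string
url <- str_extract(link, "ftp://.+")
print(url)

# An element without a match, such as an empty string, yields NA
print(str_extract("", "ftp://.+"))
```

Extraction makes missing links explicit as NA, whereas str_replace() returns an empty string unchanged.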

String split

Let’s split a string at the newline character.

str_split("A\nB", "\n")

## [[1]]
## [1] "A" "B"

Split summaries by word boundaries/whitespace.

sums <- suppfilenames %>%
  # sample_n(10) %>% 
  select(summary, Accession)
sums

## # A tibble: 8,882 × 2
##    summary                                                             Accession
##    <chr>                                                               <chr>    
##  1 This SuperSeries is composed of the SubSeries listed below.         GSE98414 
##  2 Here we identify the activator protein-1 (AP-1) factor JunB as an … GSE98413 
##  3 Analysis of whole gene expression during differentiation from hiPS… GSE83480 
##  4 Single-cell epigenome sequencing techniques have recently been dev… GSE78140 
##  5 We searched for roles of ZEB1and/or ZEB2 during EMT by RNA-seq in … GSE98273 
##  6 RNAseq of ex vivo CD8 T cell lineages and in vitro differentiated … GSE89134 
##  7 This SuperSeries is composed of the SubSeries listed below. Zinc f… GSE89206 
##  8 We searched for roles of ZEB1 during EMT by RNA-seq in breast canc… GSE89205 
##  9 Outputs from scRNA seq reads from isolated Mouse E16 and P4 lacrim… GSE100106
## 10 Immunodeficient mouse models have been valuable for studies of hum… GSE100077
## # … with 8,872 more rows

Let’s work on a smaller scale. See ?regex and compare different regexes for splitting.

The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_]). Symbols \d, \s, \D and \S denote the digit and space classes and their negations (these are all extensions).
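
To compare these classes for splitting, the sketch below applies several patterns to one hypothetical phrase that contains a double space.

```r
library(stringr)

# Hypothetical phrase; note the two spaces after the comma
s <- "CRISPR-Cas9 screen,  10,000 loci"

by_space     <- str_split(s, "\\s")[[1]]   # split at every single whitespace
by_spaces    <- str_split(s, "\\s+")[[1]]  # split at runs of whitespace
by_nonword   <- str_split(s, "\\W+")[[1]]  # split at runs of non-word characters
digits_found <- str_extract_all(s, "\\d+")[[1]]  # runs of digits

print(by_space)
print(by_spaces)
print(by_nonword)
print(digits_found)
```

Splitting at "\\s" leaves an empty string where two spaces meet, while "\\W+" also breaks hyphenated words and numbers with thousands separators apart.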

summary100 <- sums$summary[100] %>% str_split("\\s+")
summary100 <- summary100 %>% unlist

Use of str_split within dplyr

sums %>%
  sample_n(10) %>% 
  mutate(words = str_split(summary, "\\s")) %>% 
  select(-summary) %>% 
  unnest(cols = words) %>% 
  count(words) %>% 
  arrange(desc(n))

## # A tibble: 547 × 2
##    words     n
##    <chr> <int>
##  1 the      44
##  2 and      41
##  3 of       39
##  4 in       27
##  5 to       25
##  6 a        11
##  7 cells     9
##  8 that      9
##  9 by        8
## 10 cell      8
## # … with 537 more rows

str_to_lower()

str_to_lower(summary100)

##   [1] "the"              "mammalian"        "genome"          
##   [4] "contains"         "thousands"        "of"              
##   [7] "loci"             "that"             "transcribe"      
##  [10] "long"             "noncoding"        "rnas"            
##  [13] "(lncrnas),"       "some"             "of"              
##  [16] "which"            "are"              "known"           
##  [19] "to"               "play"             "critical"        
##  [22] "roles"            "in"               "diverse"         
##  [25] "cellular"         "processes"        "through"         
##  [28] "a"                "variety"          "of"              
##  [31] "mechanisms."      "while"            "some"            
##  [34] "lncrna"           "loci"             "encode"          
##  [37] "rnas"             "that"             "act"             
##  [40] "non-locally"      "(in"              "trans),"         
##  [43] "emerging"         "evidence"         "indicates"       
##  [46] "that"             "many"             "lncrna"          
##  [49] "loci"             "act"              "locally"         
##  [52] "(in"              "cis)"             "to"              
##  [55] "regulate"         "expression"       "of"              
##  [58] "nearby"           "genes—for"        "example,"        
##  [61] "through"          "functions"        "of"              
##  [64] "the"              "lncrna"           "promoter,"       
##  [67] "transcription,"   "or"               "transcript"      
##  [70] "itself."          "despite"          "their"           
##  [73] "potentially"      "important"        "roles,"          
##  [76] "it"               "remains"          "challenging"     
##  [79] "to"               "identify"         "functional"      
##  [82] "lncrna"           "loci"             "and"             
##  [85] "distinguish"      "among"            "these"           
##  [88] "and"              "other"            "mechanisms."     
##  [91] "to"               "address"          "these"           
##  [94] "challenges,"      "we"               "developed"       
##  [97] "a"                "genome-scale"     "crispr-cas9"     
## [100] "activation"       "screen"           "targeting"       
## [103] "more"             "than"             "10,000"          
## [106] "lncrna"           "transcriptional"  "start"           
## [109] "sites"            "(tsss)"           "to"              
## [112] "identify"         "noncoding"        "loci"            
## [115] "that"             "influence"        "a"               
## [118] "phenotype"        "of"               "interest."       
## [121] "we"               "found"            "11"              
## [124] "novel"            "lncrna"           "loci"            
## [127] "that,"            "upon"             "recruitment"     
## [130] "of"               "an"               "activator,"      
## [133] "each"             "mediate"          "braf"            
## [136] "inhibitor"        "resistance"       "in"              
## [139] "melanoma."        "most"             "candidate"       
## [142] "loci"             "appear"           "to"              
## [145] "regulate"         "nearby"           "genes."          
## [148] "detailed"         "analysis"         "of"              
## [151] "one"              "candidate,"       "termed"          
## [154] "emiceri,"         "revealed"         "that"            
## [157] "its"              "transcriptional"  "activation"      
## [160] "results"          "in"               "dosage-dependent"
## [163] "activation"       "of"               "four"            
## [166] "neighboring"      "protein-coding"   "genes,"          
## [169] "one"              "of"               "which"           
## [172] "confers"          "the"              "resistance"      
## [175] "phenotype."       "our"              "screening"       
## [178] "and"              "characterization" "approach"        
## [181] "provides"         "a"                "crispr"          
## [184] "toolkit"          "to"               "systematically"  
## [187] "discover"         "functions"        "of"              
## [190] "noncoding"        "loci"             "and"             
## [193] "elucidate"        "their"            "diverse"         
## [196] "roles"            "in"               "gene"            
## [199] "regulation"       "and"              "cellular"        
## [202] "function."

str_to_title()

suppfilenames$title[1:10]

##  [1] "Role of JunB in Th17 cell effector stability"                                                          
##  [2] "Role of JunB in Th17 cell effector stability [RNA-seq]"                                                
##  [3] "Genome-wide analysis of human iPS cell-derived hepatocyte-like cells induced by methoxamine treatment."
##  [4] "Single-cell multi-omics sequencing of mouse early embryos and embryonic stem cells"                    
##  [5] "RNA-sequencing in MDA-231-D cells transfected with ZEB1 or ZEB2 siRNAs"                                
##  [6] "Hit-and-run' programing of CAR-T cells using mRNA nanocarriers"                                        
##  [7] "ZEB1-regulated inflammatory phenotype in breast cancer cells"                                          
##  [8] "RNA-sequencing in TGF-beta treated MDA-231-D cells transfected with ZEB1/ZEB2 siRNAs [RNA-seq]"        
##  [9] "10X Genomics scRNA sequence Lacrimal Data set"                                                         
## [10] "Developmentally-Faithful and Effective Human Erythropoiesis in Immunodeficient and Kit Mutant Mice"

str_to_title(suppfilenames$title[1:10])

##  [1] "Role Of Junb In Th17 Cell Effector Stability"                                                          
##  [2] "Role Of Junb In Th17 Cell Effector Stability [Rna-Seq]"                                                
##  [3] "Genome-Wide Analysis Of Human Ips Cell-Derived Hepatocyte-Like Cells Induced By Methoxamine Treatment."
##  [4] "Single-Cell Multi-Omics Sequencing Of Mouse Early Embryos And Embryonic Stem Cells"                    
##  [5] "Rna-Sequencing In Mda-231-D Cells Transfected With Zeb1 Or Zeb2 Sirnas"                                
##  [6] "Hit-And-Run' Programing Of Car-T Cells Using Mrna Nanocarriers"                                        
##  [7] "Zeb1-Regulated Inflammatory Phenotype In Breast Cancer Cells"                                          
##  [8] "Rna-Sequencing In Tgf-Beta Treated Mda-231-D Cells Transfected With Zeb1/Zeb2 Sirnas [Rna-Seq]"        
##  [9] "10x Genomics Scrna Sequence Lacrimal Data Set"                                                         
## [10] "Developmentally-Faithful And Effective Human Erythropoiesis In Immunodeficient And Kit Mutant Mice"

str_to_upper()

suppfilenames$title[1:10]

##  [1] "Role of JunB in Th17 cell effector stability"                                                          
##  [2] "Role of JunB in Th17 cell effector stability [RNA-seq]"                                                
##  [3] "Genome-wide analysis of human iPS cell-derived hepatocyte-like cells induced by methoxamine treatment."
##  [4] "Single-cell multi-omics sequencing of mouse early embryos and embryonic stem cells"                    
##  [5] "RNA-sequencing in MDA-231-D cells transfected with ZEB1 or ZEB2 siRNAs"                                
##  [6] "Hit-and-run' programing of CAR-T cells using mRNA nanocarriers"                                        
##  [7] "ZEB1-regulated inflammatory phenotype in breast cancer cells"                                          
##  [8] "RNA-sequencing in TGF-beta treated MDA-231-D cells transfected with ZEB1/ZEB2 siRNAs [RNA-seq]"        
##  [9] "10X Genomics scRNA sequence Lacrimal Data set"                                                         
## [10] "Developmentally-Faithful and Effective Human Erythropoiesis in Immunodeficient and Kit Mutant Mice"

str_to_upper(suppfilenames$title[1:10])

##  [1] "ROLE OF JUNB IN TH17 CELL EFFECTOR STABILITY"                                                          
##  [2] "ROLE OF JUNB IN TH17 CELL EFFECTOR STABILITY [RNA-SEQ]"                                                
##  [3] "GENOME-WIDE ANALYSIS OF HUMAN IPS CELL-DERIVED HEPATOCYTE-LIKE CELLS INDUCED BY METHOXAMINE TREATMENT."
##  [4] "SINGLE-CELL MULTI-OMICS SEQUENCING OF MOUSE EARLY EMBRYOS AND EMBRYONIC STEM CELLS"                    
##  [5] "RNA-SEQUENCING IN MDA-231-D CELLS TRANSFECTED WITH ZEB1 OR ZEB2 SIRNAS"                                
##  [6] "HIT-AND-RUN' PROGRAMING OF CAR-T CELLS USING MRNA NANOCARRIERS"                                        
##  [7] "ZEB1-REGULATED INFLAMMATORY PHENOTYPE IN BREAST CANCER CELLS"                                          
##  [8] "RNA-SEQUENCING IN TGF-BETA TREATED MDA-231-D CELLS TRANSFECTED WITH ZEB1/ZEB2 SIRNAS [RNA-SEQ]"        
##  [9] "10X GENOMICS SCRNA SEQUENCE LACRIMAL DATA SET"                                                         
## [10] "DEVELOPMENTALLY-FAITHFUL AND EFFECTIVE HUMAN ERYTHROPOIESIS IN IMMUNODEFICIENT AND KIT MUTANT MICE"

str_trunc()

str_trunc(suppfilenames$title[1:10], width = 30, side = "right")

##  [1] "Role of JunB in Th17 cell e..." "Role of JunB in Th17 cell e..."
##  [3] "Genome-wide analysis of hum..." "Single-cell multi-omics seq..."
##  [5] "RNA-sequencing in MDA-231-D..." "Hit-and-run' programing of ..."
##  [7] "ZEB1-regulated inflammatory..." "RNA-sequencing in TGF-beta ..."
##  [9] "10X Genomics scRNA sequence..." "Developmentally-Faithful an..."

str_trunc(suppfilenames$title[1:10], width = 30, side = "center")

##  [1] "Role of JunB i...tor stability" "Role of JunB i...ity [RNA-seq]"
##  [3] "Genome-wide an...ne treatment." "Single-cell mu...ic stem cells"
##  [5] "RNA-sequencing...r ZEB2 siRNAs" "Hit-and-run' p... nanocarriers"
##  [7] "ZEB1-regulated... cancer cells" "RNA-sequencing...NAs [RNA-seq]"
##  [9] "10X Genomics s...imal Data set" "Developmentall...t Mutant Mice"

str_trunc(suppfilenames$title[1:10], width = 30, side = "left")

##  [1] "...h17 cell effector stability" "...ffector stability [RNA-seq]"
##  [3] "...d by methoxamine treatment." "...os and embryonic stem cells"
##  [5] "...ed with ZEB1 or ZEB2 siRNAs" "...lls using mRNA nanocarriers"
##  [7] "...type in breast cancer cells" "... ZEB1/ZEB2 siRNAs [RNA-seq]"
##  [9] "... sequence Lacrimal Data set" "...ficient and Kit Mutant Mice"

str_wrap()

str_wrap(suppfilenames$title[1], width = 30)

## [1] "Role of JunB in Th17 cell\neffector stability"
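
The result of str_wrap() is still a single string with embedded newline characters. A small sketch of how to display the wrapped lines, using the title from the output above:

```r
library(stringr)

title   <- "Role of JunB in Th17 cell effector stability"
wrapped <- str_wrap(title, width = 30)

# writeLines() interprets "\n" and prints one wrapped line per row
writeLines(wrapped)

# Splitting at the newline gives the individual lines
wrapped_lines <- str_split(wrapped, "\n")[[1]]
print(str_length(wrapped_lines))
```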

Generate strings from data

paste()

# letters
paste("one", letters)

##  [1] "one a" "one b" "one c" "one d" "one e" "one f" "one g" "one h" "one i"
## [10] "one j" "one k" "one l" "one m" "one n" "one o" "one p" "one q" "one r"
## [19] "one s" "one t" "one u" "one v" "one w" "one x" "one y" "one z"

paste("one", letters, collapse = " + ")

## [1] "one a + one b + one c + one d + one e + one f + one g + one h + one i + one j + one k + one l + one m + one n + one o + one p + one q + one r + one s + one t + one u + one v + one w + one x + one y + one z"

paste("one", letters, sep = "+")

##  [1] "one+a" "one+b" "one+c" "one+d" "one+e" "one+f" "one+g" "one+h" "one+i"
## [10] "one+j" "one+k" "one+l" "one+m" "one+n" "one+o" "one+p" "one+q" "one+r"
## [19] "one+s" "one+t" "one+u" "one+v" "one+w" "one+x" "one+y" "one+z"

paste0()

paste0("one", letters)

##  [1] "onea" "oneb" "onec" "oned" "onee" "onef" "oneg" "oneh" "onei" "onej"
## [11] "onek" "onel" "onem" "onen" "oneo" "onep" "oneq" "oner" "ones" "onet"
## [21] "oneu" "onev" "onew" "onex" "oney" "onez"

str_c() (analogue of paste0())
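
One difference between the two functions is worth knowing: they treat missing values differently. A minimal sketch with a hypothetical vector:

```r
library(stringr)

# Hypothetical vector with a missing value
x <- c("a", NA)

pasted <- paste0("id_", x)   # NA is converted to the text "NA"
joined <- str_c("id_", x)    # NA is propagated as a missing value

print(pasted)
print(joined)
```

With str_c() a missing input stays visibly missing in the result, which is usually the safer behaviour in data cleaning.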

str_c("XXX", summary100)

##   [1] "XXXThe"              "XXXmammalian"        "XXXgenome"          
##   [4] "XXXcontains"         "XXXthousands"        "XXXof"              
##   [7] "XXXloci"             "XXXthat"             "XXXtranscribe"      
##  [10] "XXXlong"             "XXXnoncoding"        "XXXRNAs"            
##  [13] "XXX(lncRNAs),"       "XXXsome"             "XXXof"              
##  [16] "XXXwhich"            "XXXare"              "XXXknown"           
##  [19] "XXXto"               "XXXplay"             "XXXcritical"        
##  [22] "XXXroles"            "XXXin"               "XXXdiverse"         
##  [25] "XXXcellular"         "XXXprocesses"        "XXXthrough"         
##  [28] "XXXa"                "XXXvariety"          "XXXof"              
##  [31] "XXXmechanisms."      "XXXWhile"            "XXXsome"            
##  [34] "XXXlncRNA"           "XXXloci"             "XXXencode"          
##  [37] "XXXRNAs"             "XXXthat"             "XXXact"             
##  [40] "XXXnon-locally"      "XXX(in"              "XXXtrans),"         
##  [43] "XXXemerging"         "XXXevidence"         "XXXindicates"       
##  [46] "XXXthat"             "XXXmany"             "XXXlncRNA"          
##  [49] "XXXloci"             "XXXact"              "XXXlocally"         
##  [52] "XXX(in"              "XXXcis)"             "XXXto"              
##  [55] "XXXregulate"         "XXXexpression"       "XXXof"              
##  [58] "XXXnearby"           "XXXgenes—for"        "XXXexample,"        
##  [61] "XXXthrough"          "XXXfunctions"        "XXXof"              
##  [64] "XXXthe"              "XXXlncRNA"           "XXXpromoter,"       
##  [67] "XXXtranscription,"   "XXXor"               "XXXtranscript"      
##  [70] "XXXitself."          "XXXDespite"          "XXXtheir"           
##  [73] "XXXpotentially"      "XXXimportant"        "XXXroles,"          
##  [76] "XXXit"               "XXXremains"          "XXXchallenging"     
##  [79] "XXXto"               "XXXidentify"         "XXXfunctional"      
##  [82] "XXXlncRNA"           "XXXloci"             "XXXand"             
##  [85] "XXXdistinguish"      "XXXamong"            "XXXthese"           
##  [88] "XXXand"              "XXXother"            "XXXmechanisms."     
##  [91] "XXXTo"               "XXXaddress"          "XXXthese"           
##  [94] "XXXchallenges,"      "XXXwe"               "XXXdeveloped"       
##  [97] "XXXa"                "XXXgenome-scale"     "XXXCRISPR-Cas9"     
## [100] "XXXactivation"       "XXXscreen"           "XXXtargeting"       
## [103] "XXXmore"             "XXXthan"             "XXX10,000"          
## [106] "XXXlncRNA"           "XXXtranscriptional"  "XXXstart"           
## [109] "XXXsites"            "XXX(TSSs)"           "XXXto"              
## [112] "XXXidentify"         "XXXnoncoding"        "XXXloci"            
## [115] "XXXthat"             "XXXinfluence"        "XXXa"               
## [118] "XXXphenotype"        "XXXof"               "XXXinterest."       
## [121] "XXXWe"               "XXXfound"            "XXX11"              
## [124] "XXXnovel"            "XXXlncRNA"           "XXXloci"            
## [127] "XXXthat,"            "XXXupon"             "XXXrecruitment"     
## [130] "XXXof"               "XXXan"               "XXXactivator,"      
## [133] "XXXeach"             "XXXmediate"          "XXXBRAF"            
## [136] "XXXinhibitor"        "XXXresistance"       "XXXin"              
## [139] "XXXmelanoma."        "XXXMost"             "XXXcandidate"       
## [142] "XXXloci"             "XXXappear"           "XXXto"              
## [145] "XXXregulate"         "XXXnearby"           "XXXgenes."          
## [148] "XXXDetailed"         "XXXanalysis"         "XXXof"              
## [151] "XXXone"              "XXXcandidate,"       "XXXtermed"          
## [154] "XXXEMICERI,"         "XXXrevealed"         "XXXthat"            
## [157] "XXXits"              "XXXtranscriptional"  "XXXactivation"      
## [160] "XXXresults"          "XXXin"               "XXXdosage-dependent"
## [163] "XXXactivation"       "XXXof"               "XXXfour"            
## [166] "XXXneighboring"      "XXXprotein-coding"   "XXXgenes,"          
## [169] "XXXone"              "XXXof"               "XXXwhich"           
## [172] "XXXconfers"          "XXXthe"              "XXXresistance"      
## [175] "XXXphenotype."       "XXXOur"              "XXXscreening"       
## [178] "XXXand"              "XXXcharacterization" "XXXapproach"        
## [181] "XXXprovides"         "XXXa"                "XXXCRISPR"          
## [184] "XXXtoolkit"          "XXXto"               "XXXsystematically"  
## [187] "XXXdiscover"         "XXXfunctions"        "XXXof"              
## [190] "XXXnoncoding"        "XXXloci"             "XXXand"             
## [193] "XXXelucidate"        "XXXtheir"            "XXXdiverse"         
## [196] "XXXroles"            "XXXin"               "XXXgene"            
## [199] "XXXregulation"       "XXXand"              "XXXcellular"        
## [202] "XXXfunction."

str_c(summary100, collapse = " ")

## [1] "The mammalian genome contains thousands of loci that transcribe long noncoding RNAs (lncRNAs), some of which are known to play critical roles in diverse cellular processes through a variety of mechanisms. While some lncRNA loci encode RNAs that act non-locally (in trans), emerging evidence indicates that many lncRNA loci act locally (in cis) to regulate expression of nearby genes—for example, through functions of the lncRNA promoter, transcription, or transcript itself. Despite their potentially important roles, it remains challenging to identify functional lncRNA loci and distinguish among these and other mechanisms. To address these challenges, we developed a genome-scale CRISPR-Cas9 activation screen targeting more than 10,000 lncRNA transcriptional start sites (TSSs) to identify noncoding loci that influence a phenotype of interest. We found 11 novel lncRNA loci that, upon recruitment of an activator, each mediate BRAF inhibitor resistance in melanoma. Most candidate loci appear to regulate nearby genes. Detailed analysis of one candidate, termed EMICERI, revealed that its transcriptional activation results in dosage-dependent activation of four neighboring protein-coding genes, one of which confers the resistance phenotype. Our screening and characterization approach provides a CRISPR toolkit to systematically discover functions of noncoding loci and elucidate their diverse roles in gene regulation and cellular function."
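A smaller sketch may make the two behaviours of str_c() easier to see: without collapse it works element-wise, with collapse it joins everything into one string. The vector words below is made up for illustration and is not part of the dataset.

```r
library(stringr)

# Made-up example vector (assumption, not from the dataset)
words <- c("long", "noncoding", "RNAs")

# Element-wise: the single prefix is recycled against every element
prefixed <- str_c("XXX", words)

# collapse: all elements are joined into one string, separated by a space
sentence <- str_c(words, collapse = " ")

prefixed
sentence
```

Here prefixed is still a vector of length three, while sentence has length one.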

sprintf()

todays_date <- Sys.Date()
todays_date

## [1] "2021-10-07"

todays_temp <- 7
sprintf("Today is %s and temperature is %s", todays_date, todays_temp)

## [1] "Today is 2021-10-07 and temperature is 7"
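Besides %s, sprintf() understands numeric format specifiers and is vectorised over its arguments. A brief sketch, with all values made up for illustration:

```r
# All inputs below are illustrative assumptions

# %02d: integer, zero-padded to width 2; the format is recycled over 1:3
sample_ids <- sprintf("sample_%02d", 1:3)

# %.1f: floating point number with one digit after the decimal point
rounded <- sprintf("%.1f", 3.14159)

# Several specifiers are filled in order from the remaining arguments
msg <- sprintf("%s: %d files", "batch A", 12L)

sample_ids
rounded
msg
```

Note that %d expects a whole number, so it is safest to pass integers such as 12L.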

glue()

library(glue)

## 
## Attaching package: 'glue'

## The following object is masked from 'package:dplyr':
## 
##     collapse

?glue
glue("Answer to 1 + 1 is {1 + 1}.")

## Answer to 1 + 1 is 2.
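Expressions inside the braces are evaluated in the calling environment, so glue() can also pick up existing variables. A minimal sketch; the variable names accession and n_words are introduced here only for illustration:

```r
library(glue)

# Illustrative variables (names are assumptions, not from the original code)
accession <- "GSE98414"
n_words <- 9

# Anything inside {} is evaluated as R code and pasted into the string
msg <- glue("Summary {accession} contains {n_words} words.")
msg
```

The result has class glue; as.character(msg) turns it into a plain character string if needed.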

Tidy text analysis

Tidy data has a specific structure:

Each variable is a column
Each observation is a row
Each type of observational unit is a table

Tidy text format

In the tidy text format, text is stored as a table with one token per row.
A token is a meaningful unit of text, such as a word.
Tokenizing is the process of splitting text into tokens.
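To make the idea concrete, here is a rough base R sketch of word tokenization. It is only an approximation, assuming that lowercasing and splitting on non-alphanumeric characters is good enough; it is not how tidytext tokenizes internally.

```r
# One sentence taken from the example summary shown earlier
text <- "The mammalian genome contains thousands of loci."

# Lowercase, then split wherever one or more non-alphanumeric characters occur
tokens <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]

# Drop empty strings that appear when the text starts with a separator
tokens <- tokens[tokens != ""]

# One token per row
tidy_tokens <- data.frame(word = tokens)
tidy_tokens
```

Each row of tidy_tokens now holds a single word, which is the one-token-per-row shape described above.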

Book: http://tidytextmining.com

Unnest tokens (words)

# install.packages("tidytext")
library(tidytext)
tidy_sums <- sums %>% 
  unnest_tokens(word, summary)
tidy_sums

## # A tibble: 899,380 × 2
##    Accession word       
##    <chr>     <chr>      
##  1 GSE98414  this       
##  2 GSE98414  superseries
##  3 GSE98414  is         
##  4 GSE98414  composed   
##  5 GSE98414  of         
##  6 GSE98414  the        
##  7 GSE98414  subseries  
##  8 GSE98414  listed     
##  9 GSE98414  below      
## 10 GSE98413  here       
## # … with 899,370 more rows

Now that the data is in one-word-per-row format, we can remove stop words. Stop words are words that are not useful for an analysis, typically extremely common words such as “the”, “of”, and “to” in English. The tidytext dataset stop_words holds such words, and anti_join() drops every row of tidy_sums whose word appears in it.

data(stop_words)
tidy_sums <- tidy_sums %>%
  anti_join(stop_words)

## Joining, by = "word"

tidy_sums

## # A tibble: 560,867 × 2
##    Accession word       
##    <chr>     <chr>      
##  1 GSE98414  superseries
##  2 GSE98414  composed   
##  3 GSE98414  subseries  
##  4 GSE98414  listed     
##  5 GSE98413  identify   
##  6 GSE98413  activator  
##  7 GSE98413  protein    
##  8 GSE98413  1          
##  9 GSE98413  ap         
## 10 GSE98413  1          
## # … with 560,857 more rows
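How anti_join() filters rows is easiest to see on a toy example. Both tables below are made up for illustration; toy_stops is a hypothetical stand-in for stop_words.

```r
library(dplyr)

# Toy data (assumptions for illustration only)
toy_words <- tibble(word = c("the", "genome", "of", "loci"))
toy_stops <- tibble(word = c("the", "of", "to"))

# Keep only rows of toy_words that have NO match in toy_stops
kept <- anti_join(toy_words, toy_stops, by = "word")
kept
```

Supplying by = "word" explicitly also silences the "Joining, by" message.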

Now we can count words.

tidy_sums %>%
  count(word, sort = TRUE)

## # A tibble: 28,086 × 2
##    word           n
##    <chr>      <int>
##  1 cells       8695
##  2 rna         7192
##  3 cell        6485
##  4 expression  6259
##  5 genes       5200
##  6 gene        4623
##  7 seq         4488
##  8 data        3088
##  9 human       2966
## 10 http        2917
## # … with 28,076 more rows
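count() is a shortcut for grouping, summarising and sorting. The sketch below shows the longer equivalent on a made-up vector of words:

```r
library(dplyr)

# Toy data (assumption for illustration only)
toy <- tibble(word = c("cells", "rna", "cells", "gene", "cells", "rna"))

# Short form
counted <- toy %>%
  count(word, sort = TRUE)

# Long form: group, count rows per group, sort by descending count
by_hand <- toy %>%
  group_by(word) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

counted
by_hand
```

Both tables list the words with their frequencies, most frequent first.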

Words with more than 2000 occurrences:

tidy_sums %>%
  count(word, sort = TRUE) %>%
  filter(n > 2000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

Sentiment analysis

What is the tone of research abstracts?

library(textdata)
get_sentiments()

## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # … with 6,776 more rows

# Different sentiment lexicons
get_sentiments("nrc")$sentiment %>% unique

##  [1] "trust"        "fear"         "negative"     "sadness"      "anger"       
##  [6] "surprise"     "positive"     "disgust"      "joy"          "anticipation"

# AFINN stores its numeric scores in the `value` column (there is no `score` column)
get_sentiments("afinn")$value %>% unique()

get_sentiments("bing")$sentiment %>% unique()

## [1] "negative" "positive"

Abstract scores 1

tidy_sums %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(Accession) %>% 
  summarise(sentiment_value = sum(value)) %>% 
  ggplot(aes(sentiment_value)) +
  geom_histogram(bins = 40)

## Joining, by = "word"
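The scoring logic can be followed step by step on a toy example. The lexicon toy_lexicon and its scores are made up for illustration and only mimic the word/value layout of AFINN:

```r
library(dplyr)

# Toy tokens: two hypothetical abstracts, A and B
toy_words <- tibble(
  Accession = c("A", "A", "A", "B", "B", "B"),
  word      = c("good", "death", "cells", "bad", "bad", "genes")
)

# Hypothetical mini lexicon with a numeric score per word
toy_lexicon <- tibble(
  word  = c("good", "bad", "death"),
  value = c(3, -3, -2)
)

toy_scores <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%   # words without a score are dropped
  group_by(Accession) %>%
  summarise(sentiment_value = sum(value))    # one total score per abstract
toy_scores
```

Words such as "cells" and "genes" carry no score, so they do not contribute to the totals.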

Let’s see the most negative and positive abstracts:

get_sentiments(lexicon = "bing")$sentiment %>% unique()

## [1] "negative" "positive"

summary_sentiments <- tidy_sums %>% 
  inner_join(get_sentiments("bing")) %>% 
  count(Accession, sentiment) %>% 
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

## Joining, by = "word"

summary_sentiments

## # A tibble: 5,593 × 4
##    Accession negative positive sentiment
##    <chr>        <int>    <int>     <int>
##  1 GSE100067        3        5         2
##  2 GSE100077        0        5         5
##  3 GSE100106        1        0        -1
##  4 GSE11724         0        2         2
##  5 GSE11892         0        1         1
##  6 GSE13652         0        2         2
##  7 GSE14092        11        4        -7
##  8 GSE14605         0        4         4
##  9 GSE15780         2        0        -2
## 10 GSE16190         2        5         3
## # … with 5,583 more rows
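The role of values_fill = 0 is worth a closer look: an abstract with no words of one polarity would otherwise get NA, and the difference would be NA as well. A sketch on made-up data:

```r
library(dplyr)
library(tidyr)

# Toy data: abstract B has no positive words (assumption for illustration)
toy <- tibble(
  Accession = c("A", "A", "A", "B", "B"),
  sentiment = c("positive", "positive", "negative", "negative", "negative")
)

toy_wide <- toy %>%
  count(Accession, sentiment) %>%             # one row per abstract and polarity
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%            # missing combinations become 0
  mutate(sentiment = positive - negative)     # net sentiment per abstract
toy_wide
```

Without values_fill, the positive count for B would be NA and so would its net sentiment.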

Slightly more negative sentiment in abstracts:

summary_sentiments %>% 
  ggplot(aes(sentiment)) +
  geom_histogram(bins = 40)

Most negative summary:

summary_sentiments %>% 
  arrange(sentiment) %>%
  left_join(select(suppfilenames, Accession, summary))

## Joining, by = "Accession"

## # A tibble: 5,593 × 5
##    Accession negative positive sentiment summary                                
##    <chr>        <int>    <int>     <int> <chr>                                  
##  1 GSE85013        28        0       -28 "Background: Despite of extensive rese…
##  2 GSE79423        27        3       -24 "Systematic analyses of the temporal d…
##  3 GSE85541        23        2       -21 "Anti-androgen therapies including the…
##  4 GSE86922        23        3       -20 "Caspases regulate cell death programs…
##  5 GSE63756        20        2       -18 "Endoplasmic reticulum (ER) stress occ…
##  6 GSE68140        18        1       -17 "We report positional cloning and char…
##  7 GSE80160        19        2       -17 "Diet-induced obesity is characterized…
##  8 GSE90478        21        4       -17 "Toxoplasma gondii is a ubiquitous api…
##  9 GSE93999        20        3       -17 "The influenza A virus is an acute con…
## 10 GSE68713        17        1       -16 "Maternal stress, anxiety, and depress…
## # … with 5,583 more rows

Let’s see what this GSE85013 is about.