Regular Expressions In R
A regular expression is a pattern that describes a specific set of strings with a common structure. It is very useful for string matching and replacement when cleaning or extracting information from strings.
We now look at four R functions:
- grep
- grepl
- sub
- gsub
Regular expressions typically specify the characters to seek, possibly with additional information such as how often they repeat or where they occur in the string. This is accomplished with the help of metacharacters that have a specific meaning:
$ * + . ? [ ] ^ { } | ( )
An escape character removes the special meaning of the character that follows it. For example, assigning Name <- 'Cote d'Ivoire' in the R console returns an error, because the second quote ends the string early. To assign that name we need to escape the special meaning of the quote, so Name <- 'Cote d\'Ivoire' assigns the required name.
The grep function searches for matches to the argument pattern within each element of a character vector.
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
fixed = FALSE, invert = FALSE)
- pattern - character string containing a regular expression (or a literal character string when fixed = TRUE) to be matched in the given character vector
- x - character vector in which matches are sought
- value - if TRUE, the function returns the matching elements themselves instead of their indices
- fixed - if TRUE, pattern is a string to be matched as is
The grep function returns the indices of the elements of x that match the pattern. If value is set to TRUE, it returns the matching elements instead. If invert is set to TRUE, the function returns the indices (or values) of the elements that do not match.
grepl returns a logical vector of the same length as the input vector, with TRUE for each element that matches and FALSE otherwise.
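The arguments described above can be tried on a small vector. This is a minimal sketch; the fruits vector and the variable names are made up for illustration and are not part of the original text.

```r
# Hypothetical sample data
fruits <- c("apple", "Banana", "cherry", "pineapple")

# Default: indices of the elements that contain "apple"
idx <- grep("apple", fruits)

# value = TRUE: the matching elements themselves
hits <- grep("apple", fruits, value = TRUE)

# invert = TRUE: the elements that do NOT match
misses <- grep("apple", fruits, value = TRUE, invert = TRUE)

# ignore.case = TRUE: "banana" also matches "Banana"
b_idx <- grep("banana", fruits, ignore.case = TRUE)

# fixed = TRUE: "." is a literal dot, not "any character"
dots <- grep(".", c("a.b", "ab"), fixed = TRUE)

# grepl: one logical value per input element
flags <- grepl("apple", fruits)

print(idx)     # 1 4
print(hits)    # "apple" "pineapple"
print(misses)  # "Banana" "cherry"
print(flags)   # TRUE FALSE FALSE TRUE
```

grep is convenient for subsetting by index, while grepl fits naturally into logical filters such as fruits[grepl("apple", fruits)].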
Other characters that require escaping in R strings include:
\' – single quote (you can also write "'", a single quote inside double quotes)
\" – double quote (you can also write '"', a double quote inside single quotes)
\n – newline
\r – carriage return
\t – tab
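A short sketch of how these escape sequences behave in R strings. The variable names are illustrative only.

```r
# A single quote inside a single-quoted string must be escaped ...
name1 <- 'Cote d\'Ivoire'
# ... or the string can be wrapped in double quotes instead
name2 <- "Cote d'Ivoire"
same <- identical(name1, name2)

# "\t" is ONE character (a tab), even though it is typed as two
tab_len <- nchar("a\tb")

print(same)     # TRUE
print(tab_len)  # 3

# print() shows the escaped form, cat() writes the actual characters
print("a\tb")
cat("line 1\nline 2\n")
```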
Quantifiers specify how many times the preceding element of a pattern may repeat.
* matches 0 or more times
strings <- c("a","ab","acb","accb","acccb","accccb")
grep("ac*b",strings,value = T)
[1] "ab" "acb" "accb" "acccb" "accccb"
+ matches 1 or more times
grep("ac+b",strings,value = T)
[1] "acb" "accb" "acccb" "accccb"
? matches at most 1 time (0 or 1 times)
grep("ac?b",strings,value = T)
[1] "ab" "acb"
{n} matches exactly n times
{n,} matches at least n times
{n,m} matches between n and m times, with n and m included
grep("ac{2}b",strings,value = T)
[1] "accb"
grep("ac{2,}b",strings,value = T)
[1] "accb" "acccb" "accccb"
grep("ac{2,3}b",strings,value = T)
[1] "accb" "acccb"
Position of the pattern within the string
- ^ matches the start of the string
- $ matches the end of the string
- \b matches the empty string at either edge of a word
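The anchors above can be compared side by side. This is a minimal sketch with a made-up words vector; note that \b has to be typed as "\\b" inside an R string.

```r
# Hypothetical sample data
words <- c("cat", "concat", "catalog", "the cat sat")

# ^ anchors the match at the start of the string
starts <- grep("^cat", words, value = TRUE)

# $ anchors the match at the end of the string
ends <- grep("cat$", words, value = TRUE)

# \b requires a word edge on both sides of "cat"
whole <- grep("\\bcat\\b", words, value = TRUE)

# ^ and $ together force the whole string to match
exact <- grep("^cat$", words, value = TRUE)

print(starts)  # "cat" "catalog"
print(ends)    # "cat" "concat"
print(whole)   # "cat" "the cat sat"
print(exact)   # "cat"
```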
Character classes
A character class is a list of characters enclosed between [ and ] which matches any single character in that list; unless the first character of the list is the caret ^, in which case it matches any character not in the list. For example, the regular expression [0123456789] matches any single digit, and [^abc] matches anything except the characters a, b or c. A range of characters may be specified by giving the first and last characters, separated by a hyphen. Predefined character classes let you specify entire classes of characters, such as numbers, letters, etc. There are two flavors: one uses [: and :] around a predefined name inside square brackets, and the other uses \ followed by a special character. They are sometimes interchangeable.
[:digit:] or \d: digits, 0 1 2 3 4 5 6 7 8 9, equivalent to [0-9].
\D: non-digits, equivalent to [^0-9].
[:lower:]: lower-case letters, equivalent to [a-z].
[:upper:]: upper-case letters, equivalent to [A-Z].
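[:alpha:]: alphabetic characters, equivalent to [[:lower:][:upper:]] or [A-Za-z]. (The range [A-z] is not equivalent: in ASCII it also covers the characters [ \ ] ^ _ and `, which sit between Z and a.)
[:alnum:]: alphanumeric characters, equivalent to [[:alpha:][:digit:]] or [A-Za-z0-9].
\w: word characters, equivalent to [[:alnum:]_] or [A-Za-z0-9_].
\W: not word, equivalent to [^A-Za-z0-9_].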
[:xdigit:]: hexadecimal digits (base 16), 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f, equivalent to [0-9A-Fa-f].
[:blank:]: blank characters, i.e. space and tab.
[:space:]: space characters: tab, newline, vertical tab, form feed, carriage return, space.
\s: whitespace character (space, tab, newline and similar).
\S: not whitespace.
[:punct:]: punctuation characters, ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
[:graph:]: graphical (human readable) characters: equivalent to [[:alnum:][:punct:]].
[:print:]: printable characters: the graphical characters plus the space character, equivalent to [[:alnum:][:punct:] ].
[:cntrl:]: control characters, like \n or \r, [\x00-\x1F\x7F].
[:...:] has to be used inside square brackets, e.g. [[:digit:]]
\ itself is a special character in R strings and needs to be escaped, so \d is written as "\\d" in R code. Do not confuse these regular expression classes with R escape sequences such as \t.
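A few of the classes above in use. This is a minimal sketch; the vector x and the sample sentence are invented for illustration.

```r
# Hypothetical sample data
x <- c("abc", "123", "a1b2", "hello world", "tab\there", "")

# POSIX class: note the double brackets
has_digit <- grepl("[[:digit:]]", x)

# Backslash shorthand: \d is typed as "\\d" in an R string
same_digit <- grepl("\\d", x)

# Strings made up of letters only
only_alpha <- grepl("^[[:alpha:]]+$", x)

# Any whitespace character, including the tab
has_space <- grepl("[[:space:]]", x)

# Negated class: delete everything that is not a digit
digits_only <- gsub("[^0-9]", "", "Order 66, aisle 7")

print(has_digit)    # FALSE TRUE TRUE FALSE FALSE FALSE
print(only_alpha)   # TRUE FALSE FALSE FALSE FALSE FALSE
print(has_space)    # FALSE FALSE FALSE TRUE TRUE FALSE
print(digits_only)  # "667"
```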
Operators
- . matches any single character.
- […] a character list, matches any one of the characters inside the square brackets. We can also use - inside the brackets to specify a range of characters.
- [^…] an inverted character list, similar to […], but matches any character except those inside the square brackets.
- \ suppresses the special meaning of metacharacters in a regular expression, i.e. $ * + . ? [ ] ^ { } | ( ) \, similar to its usage in escape sequences. Since \ itself needs to be escaped in R, we need to escape these metacharacters with a double backslash, like \\$.
- | an "or" operator, matches the pattern on either side of the |.
- (…) grouping in regular expressions. This allows you to retrieve the bits that matched various parts of your regular expression so you can alter them or use them for building up a new string. Each group can then be referred to using \N, where N is the number of the group, counted by opening parenthesis from left to right. This is called a backreference.
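The operators above can be illustrated on a few file names. This is a minimal sketch; the files vector is made up for illustration.

```r
# Hypothetical sample data
files <- c("data.csv", "data_csv", "notes.txt", "plot.png")

# An unescaped . matches ANY character, so "data_csv" matches too
any_char <- grep("data.csv", files, value = TRUE)

# "\\." matches a literal dot only
literal <- grep("data\\.csv", files, value = TRUE)

# Grouping plus | : names ending in .csv or .txt
alt <- grep("\\.(csv|txt)$", files, value = TRUE)

# Inverted character list: names that do not start with d or n
neg <- grep("^[^dn]", files, value = TRUE)

# Escaping another metacharacter: a literal dollar sign
has_dollar <- grepl("\\$", c("cost: $5", "cost: 5"))

print(any_char)    # "data.csv" "data_csv"
print(literal)     # "data.csv"
print(alt)         # "data.csv" "notes.txt"
print(neg)         # "plot.png"
print(has_dollar)  # TRUE FALSE
```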
Backreference
You can use the backreferences \1 through \9 in the replacement text to reinsert text matched by a capturing group. You cannot use backreferences to groups 10 and beyond. If your regex has named groups, you can use numbered backreferences to the first 9 groups. There is no replacement text token for the overall match. Place the entire regex in a capturing group and then use \1 to insert the whole regex match.
sub("(a+)", "z\\1z", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1] "zazbc" "def" "cbzaz a" "zaaz"
gsub("(a+)", "z\\1z", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1] "zazbc" "def" "cbzaz zaz" "zaaz"
With perl=TRUE, you can use \U and \L to change the text inserted by all following backreferences to uppercase or lowercase. You can use \E to insert the following backreferences without any change of case. These escapes do not affect literal text.
sub("(a+)", "z\\U\\1z", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1] "zAzbc" "def" "cbzAz a" "zAAz"
gsub("(a+)", "z\\U\\1z", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1] "zAzbc" "def" "cbzAz zAz" "zAAz"
The difference between the sub and gsub functions is that sub replaces only the first match in each element of the input vector, while gsub replaces every match. Both return a new character vector and leave the input unchanged.
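The contrast between sub and gsub, together with numbered backreferences, can be sketched as follows. The date string is invented for illustration.

```r
# Hypothetical sample data
x <- "2021-03-15 and 2022-11-02"

# sub: only the first "-" is replaced
first_only <- sub("-", "/", x)

# gsub: every "-" is replaced
all_of_them <- gsub("-", "/", x)

# Backreferences: rearrange each date from YYYY-MM-DD to DD.MM.YYYY
swapped <- gsub("([0-9]{4})-([0-9]{2})-([0-9]{2})", "\\3.\\2.\\1", x)

print(first_only)   # "2021/03-15 and 2022-11-02"
print(all_of_them)  # "2021/03/15 and 2022/11/02"
print(swapped)      # "15.03.2021 and 02.11.2022"
print(x)            # the input itself is unchanged
```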
The regexpr function takes the same arguments as grepl. regexpr returns an integer vector with the same length as the input vector. Each element in the returned vector indicates the character position in each corresponding string element in the input vector at which the (first) regex match was found. A match at the start of the string is indicated with character position 1. If the regex could not find a match in a certain string, its corresponding element in the result vector is -1. The returned vector also has a match.length attribute. This is another integer vector with the number of characters in the (first) regex match in each string, or -1 for strings that didn’t match.
gregexpr is the same as regexpr, except that it finds all matches in each string. It returns a list with the same length as the input vector. Each element is an integer vector, with one element for each match found in the string indicating the character position at which that match was found. Each element of the returned list also has a match.length attribute with the lengths of all matches. If no matches could be found in a particular string, the element in the returned list is still a vector, but with just one element -1.
regexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1] 1 -1 3 1
attr(,"match.length")
[1] 1 -1 1 2
gregexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
[[1]]
[1] 1
attr(,"match.length")
[1] 1

[[2]]
[1] -1
attr(,"match.length")
[1] -1

[[3]]
[1] 3 5
attr(,"match.length")
[1] 1 1

[[4]]
[1] 1
attr(,"match.length")
[1] 2
Use regmatches to get the actual substrings matched by the regular expression. As the first argument, pass the same input that you passed to regexpr or gregexpr. As the second argument, pass the result returned by regexpr or gregexpr. If you pass the vector from regexpr, then regmatches returns a character vector with all the strings that were matched. This vector may be shorter than the input vector if no match was found in some of the elements. If you pass the list from gregexpr, then regmatches returns a list with the same number of elements as the input vector. Each element is a character vector with all the matches of the corresponding element in the input vector, or character(0) if an element had no matches.
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl=TRUE)
regmatches(x, m)
[1] "a" "a" "aa"
m <- gregexpr("a+", x, perl=TRUE)
regmatches(x, m)
[[1]]
[1] "a"

[[2]]
character(0)

[[3]]
[1] "a" "a"

[[4]]
[1] "aa"
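As a slightly more practical sketch of the same functions, the following pulls numbers out of free text. The notes vector and the variable names are made up for illustration.

```r
# Hypothetical sample data
notes <- c("ordered 12 apples and 7 pears", "no numbers here", "room 404")

# Start position of the first number in each string (-1 = no match)
first_pos <- as.integer(regexpr("[0-9]+", notes))

# First number per string; strings without a match are dropped
first_num <- regmatches(notes, regexpr("[0-9]+", notes))

# All numbers per string; the result is a list aligned with the input
all_nums <- regmatches(notes, gregexpr("[0-9]+", notes))

# Number of matches found in each string
n_found <- lengths(all_nums)

print(first_pos)  # 9 -1 6
print(first_num)  # "12" "404"
print(n_found)    # 2 0 1
```

Because the regexpr result drops non-matching strings, the gregexpr form is the safer choice whenever the output has to stay aligned with the input.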