Difference between revisions of "Regular expression"
Line 43: | Line 43: | ||
Speaks the same language of the following DFA: | Speaks the same language of the following DFA: | ||
[[File:RegularExpressions_Endlicher_automat.jpg|600px]] | |||
''''' Insert image ''''' | ''''' Insert image ''''' |
Revision as of 15:59, 28 March 2020
Summary
Description what this documentation is about
History of Regular Expressions
Regular Expressions, or regexes in short, is a defined sequence of characters that specify a search pattern. These expressions are used by search machines like a search engine for optimized string searching. The concept of regular expressions arose in the 1950 by the mathematician Stephen Cole Kleene. Kleene described regular languages wit the usage of his mathematical notation called regular events. His notation got used for automata theory and the classification and description of formal languages. With the Rise of Computers many more syntaxes for writing regular expressions got developed. The most popular today are the POSIX standard and the Perl Compatible Regular Expressions (PCRE). Today regexes are implemented into many programming languages like PHP, Python and Java. Furthermore, companies started to produce hardware, FPGA and GPU implementation of PCRE to overcome the performance which is possible by CPUs.
Definition of Regular Expressions
Regular expression gets used to create a pattern that defines the regular language. A finite automata can inspect if the inputted word belongs to the regular language which gets defined by the created pattern.
Syntax of Regular Expressions
The regular expression is using the character of the alphabet Σ. Every alphabet consists at least of the Symbols ε and ∅ and of the operators +, · and *.
The amount of regular expressions over the alphabet Σ gets defined recursively as follows:
- ε and ∅ are regular expressions
- Every character a ∈ ∑ is a regular expression
- If α and β are regular expressions, (α β) and (α + β) are also regular expressions
- If α is a regular expression, α* is ale a regular expression
Operator priorities:
- * ties more than ·, which ties stronger than +
Semantics of Regular Expressions
a) The Language L(α) established by the regular expression α is defined as follows:
- L(∅) = ∅ empty set
- L(ε) = {ε} empty string
- L(a) = {a} for a ∈ ∑ literal character
- L(αβ) = L(α)L(β) concatenation
- L(α+β) = L(α) ∪ L(β) alternation
- L(α)* = L(α)* (Kleene star) arbitrary repetition
b) The two regular expressions α and β are equivalent in character α ≡ β if L(α) = L(β)
Regular expressions and deterministic finite automaton (DFA)
By using a regex processor, the regular expression translated into an internal representation. One possible approach is to use the Thompson's construction algorithm to develop nondeterministic finite automaton (NFA) that gets translated into a deterministic finite automaton (DFA), which only accepts words based on the regular expression.
Example:
(1 + 01 + 001)∗(ε+ 0 + 00)
Speaks the same language of the following DFA:
Insert image
POSIX Standard
There atre three sets of syntaxes defined IEEE POSIX standard. The Basic Regular Expressions (BRE), Extended Regular Expressions (ERE) and the Simple Regular Expressions (SRE). The Simple Regular Expressions syntax is deprecated but the other two provide backwards capabilities.
POSIX basic and extended Semantics
Metacharacter | Description |
---|---|
^ | Matches the starting position within the string. |
. | Acts as a wildcard for any existing character |
[ ] | Matches a singe character that in the brackets |
[^ ] | Matches everything except the character inside the brackets. |
$ | Matches the ending position of the string |
( ) | Defines a marked subexpression |
\n | Matches what the nth marked subexpression matched, where n is a digit from 1 to 9. |
* | Arbitrary repetition of the preceding element |
{m,n} | From m to n repetitions of the preceding element |
Examples:
- ^d.. matches any string with a length of three that starts with d
- .og matches any string with a length of three that ends with og, like dog and fog
- [df]og matches “dog” and “fog”
- [^d]og matches any string which gets accepts by .og exept “dog”
- dog$ matches any string that ends with dog
- b.* matches any word that starts with b like “bee”, “bathroom”
POSIX extended Semantics
Metacharacter | Description |
---|---|
? | Repetition zero or one times of the preceding element |
+ | Repetition one or more times of the preceding element |
❘ | The choice also known as OR |
Examples:
- [e]?l matches “l” and “el”
- [e]+l matches “el”, “eel” and so on
- eel | fish accept one of the two strings
Perl Compatible Regular Expressions (PCRE)
PCRE is library that implements a regular expression engine. The library is inspired of the Pearl programming language and is written in c. The library is Opertings system independent and gets distributed with an POSIX C wrapper.
Perl Compatible Regular Expressions Semantics
Metacharacter | Description |
---|---|
\ | escape character which can be used multiple times |
^ | Matches the starting position within the string. |
$ | Matches the ending position within the string. |
. | Acts as a wildcard for any existing character except newline |
❘ | The choice also known as OR |
? | Repetition zero or one times of the preceding element |
* | Repetition zero to infinite times of the preceding element |
+ | Repetition one or more times of the preceding element |
- | indicates character range |
{ } | min/max quantifier |
Regular Expressions in Python
Python comes with the built in RegEx Module which can be used by the line import re. The module can process Unicode as well as 8-bit strings.
RegEx Modul Metacharacters
Metacharacter | Description |
---|---|
[ ] | Defines a set of characters |
. | Acts as a wildcard for any existing character |
^ | Matches the starting position within the string. |
$ | Matches the ending position within the string. |
* | Repetition zero to infinite times of the preceding element |
+ | Repetition one or more times of the preceding element |
{ } | min/max quantifier |
❘ | The choice also known as OR |
Character Sets
Character set | Description |
---|---|
[abc] | Match either ‘a’ or ‘b’ or ‘c’ |
[a-n] | Match with a character in the range |
[^abc] | Match any character except the in the brackets |
[0-9] | Matches the Digits |
[a-zA-Z] | Match any lower or uppercase character |
[a-zA-Z0-9] | Match any lower or uppercase character and digits |
RegEx Modul functions
Function | Description |
---|---|
findall() | Returns a list containing all matches |
search() | Searches a sting and returns the match object |
split() | Returns a list where the string has been split at each match |
sub() | Replaces the matches with a defined string |
Match object
Function | Description |
---|---|
.span() | Returns the start and the end position of the match |
.string() | Returns the checked string |
.group() | Returns the matched part of the string |
References
- Konzepte der Informatik von Peter Lory gelehrt an der FH-Campus Wien
- https://en.wikipedia.org/wiki/Regular_expression
- https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions
- https://www.w3schools.com/python/python_regex.asp
- https://docs.python.org/3/library/re.html