language-icon Old Web
English
Sign In

Regular expression

A regular expression, regex or regexp (sometimes called a rational expression) is a sequence of characters that define a search pattern. Usually such patterns are used by string searching algorithms for 'find' or 'find and replace' operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.The - character is treated as a literal character if it is the last or the first (after the ^, if present) character within the brackets: , . Note that backslash escapes are not allowed. The ] character can be included in a bracket expression if it is the first (after the ^) character: .'Regular expressions' are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I'm not going to try to fight linguistic necessity here. I will, however, generally call them 'regexes' (or 'regexen', when I'm in an Anglo-Saxon mood).Output:Output:Output:Output:Output:Output:Output:Output:Output:(^w|w$|Ww|wW).Output:in Unicode, where the Alphabetic property contains more than Latin letters, and the Decimal_Number property contains more than Arab digits.Output:in Unicode.Output:Output:Output:Output:Output:Output:Output:Output:Output:Output: A regular expression, regex or regexp (sometimes called a rational expression) is a sequence of characters that define a search pattern. Usually such patterns are used by string searching algorithms for 'find' or 'find and replace' operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory. The concept arose in the 1950s when the American mathematician Stephen Cole Kleene formalized the description of a regular language. The concept came into common use with Unix text-processing utilities. Different syntaxes for writing regular expressions have existed since the 1980s, one being the POSIX standard and another, widely used, being the Perl syntax. Regular expressions are used in search engines, search and replace dialogs of word processors and text editors, in text processing utilities such as sed and AWK and in lexical analysis. Many programming languages provide regex capabilities either built-in or via libraries. The phrase regular expressions, and consequently, regexes, is often used to mean the specific, standard textual syntax (distinct from the mathematical notation described below) for representing patterns for matching text. Each character in a regular expression (that is, each character in the string describing its pattern) is either a metacharacter, having a special meaning, or a regular character that has a literal meaning. For example, in the regex a., a is a literal character which matches just 'a', while '.' is a meta character that matches every character except a newline. Therefore, this regex matches, for example, 'a ', or 'ax', or 'a0'. Together, metacharacters and literal characters can be used to identify text of a given pattern, or process a number of instances of it. Pattern matches may vary from a precise equality to a very general similarity, as controlled by the metacharacters. For example, . is a very general pattern, (match all lower case letters from 'a' to 'z') is less general and a is a precise pattern (matches just 'a'). The metacharacter syntax is designed specifically to represent prescribed targets in a concise and flexible way to direct the automation of text processing of a variety of input data, in a form easy to type using a standard ASCII keyboard. A very simple case of a regular expression in this syntax is to locate a word spelled two different ways in a text editor, the regular expression serialie matches both 'serialise' and 'serialize'. Wildcards also achieve this, but are more limited in what they can pattern, as they have fewer metacharacters and a simple language-base. The usual context of wildcard characters is in globbing similar names in a list of files, whereas regexes are usually employed in applications that pattern-match text strings in general. For example, the regex ^+|+$ matches excess whitespace at the beginning or end of a line. An advanced regular expression that matches any numeral is ?(d+(.d+)?|.d+)(?d+)?. A regex processor translates a regular expression in the above syntax into an internal representation which can be executed and matched against a string representing the text being searched in. One possible approach is the Thompson's construction algorithm to construct a nondeterministic finite automaton (NFA), which is then made deterministic and the resulting deterministic finite automaton (DFA) is run on the target text string to recognize substrings that match the regular expression.The picture shows the NFA scheme N(s*) obtained from the regular expression s*, where s denotes a simpler regular expression in turn, which has already been recursively translated to the NFA N(s). Regular expressions originated in 1951, when mathematician Stephen Cole Kleene described regular languages using his mathematical notation called regular sets. These arose in theoretical computer science, in the subfields of automata theory (models of computation) and the description and classification of formal languages. Other early implementations of pattern matching include the SNOBOL language, which did not use regular expressions, but instead its own pattern matching constructs. Regular expressions entered popular use from 1968 in two uses: pattern matching in a text editor and lexical analysis in a compiler. Among the first appearances of regular expressions in program form was when Ken Thompson built Kleene's notation into the editor QED as a means to match patterns in text files. For speed, Thompson implemented regular expression matching by just-in-time compilation (JIT) to IBM 7094 code on the Compatible Time-Sharing System, an important early example of JIT compilation. He later added this capability to the Unix editor ed, which eventually led to the popular search tool grep's use of regular expressions ('grep' is a word derived from the command for regular expression searching in the ed editor: g/re/p meaning 'Global search for Regular Expression and Print matching lines'). Around the same time when Thompson developed QED, a group of researchers including Douglas T. Ross implemented a tool based on regular expressions that is used for lexical analysis in compiler design.

[ "Algorithm", "Theoretical computer science", "Discrete mathematics", "Programming language", "Operating system", "ReDoS", "Automata construction", "Metacharacter", "schema inference" ]
Parent Topic
Child Topic
    No Parent Topic