SilverAge Software

Search and Replace. Edit. Transform.

Use Cases

A Quick and Tiny Regular Expression Tutorial

This is not actually a tech-support case. This page is an excerpt from the Text Workbench help system, provided here as a tutorial and reference.

  • What are Regular Expressions?
  • Why use Regular Expressions
  • Matching Operators
  • Repetition Qualifiers
  • Patterns and Alternatives
  • Expressions

    What are Regular Expressions?

    Regular expressions are a way to search for substrings ("matches") in strings. The search is driven by a "pattern" that describes the text to find.

    Example

    You probably know the '*' and '?' characters used with the dir command on the DOS command line. The '*' character means "zero or more arbitrary characters" and the '?' means "one arbitrary character".

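    A pattern like "text?.*" finds files whose name is text followed by exactly one more character, with any extension.

    It does not find files whose name has no character, or more than one character, after text.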

    This is exactly the way regular expressions work. While '*' and '?' form a very limited set of patterns, regular expressions offer a much broader range of ways to describe patterns.
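
    As a rough illustration of the wildcard rules described above, here is a minimal sketch using Python's fnmatch module. Python is an assumption made for these examples (the tutorial itself targets Text Workbench), the file names are hypothetical, and fnmatch follows Unix shell rules, which may differ in detail from the DOS dir command.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, chosen only to illustrate the wildcard rules.
names = ["text1.txt", "textA.doc", "text.txt", "text12.txt"]

# "?" stands for exactly one character, "*" for zero or more characters.
matched = [name for name in names if fnmatchcase(name, "text?.*")]
print(matched)
```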

    The best way to learn regular expressions is to use Regular Expression Laboratory. You can test all of the examples listed here using the Laboratory front-end.

    Why use Regular Expressions

    Example usages could be:

    Matching Operators

    Any operator or set of operators represents a pattern.

    Any Character

    You will often need to match text in which some symbols vary. For example, you want to find four-character words starting with tom. The operator that matches any character is the dot (.). Thus, the following pattern, tom followed by the dot operator, matches all such words: tom.

    This example will also find text like tom., tom>, tom!, etc.
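
    The behaviour of the dot operator can be tried out with a short sketch in Python's re module. Python is an assumption here, not the Text Workbench engine, and the sample text is made up; the dot happens to mean the same thing in both.

```python
import re

# Made-up sample text containing words and non-words that start with "tom".
sample = "tomb tome tom! tom."

# The dot matches any single character, including punctuation.
dot_matches = re.findall(r"tom.", sample)
print(dot_matches)
```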

    Sets of Characters

    To prevent the pattern tom. from matching meaningless text, we should narrow the search criteria to alphabetic symbols only. This can be done using character sets. A set is specified with square brackets. Sets may include individual symbols and ranges. For example, the following set will match any one of the symbols a, t, z and 8: [atz8]. And this set will match any one lowercase letter: [a-z].

    Thus, to limit the previous example to meaningful phrases, we could write a pattern: tom[a-z].
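
    A minimal sketch of character sets, again assuming Python's re module rather than Text Workbench itself (the bracket syntax is the same in both). The sample strings are invented for illustration.

```python
import re

# Made-up sample text: two real words and two non-words.
sample = "tomb tome tom! tom8"

# [a-z] accepts only a lowercase letter after "tom".
set_matches = re.findall(r"tom[a-z]", sample)
print(set_matches)

# [atz8] accepts exactly one of the listed symbols.
listed_matches = re.findall(r"[atz8]", "jazz 2018")
print(listed_matches)
```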

    Negative Sets of Characters

    Sometimes you need to match every symbol except a few. Writing a large set that lists all the allowed symbols is impractical, so it is better to use the negation operator ^ inside a set. For example, the following set will match any one symbol except @: [^\@]. Please note that the symbol @ is escaped as it is not alphanumeric.
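
    A negative set can be sketched the same way. This assumes Python's re module, where @ has no special meaning and therefore does not need the backslash that the Text Workbench pattern uses; the input string is made up.

```python
import re

# [^@] matches any single character that is not "@".
# In Python, "@" is an ordinary character, so it is written unescaped.
not_at = re.findall(r"[^@]", "a@b")
print(not_at)
```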

    Repetition Qualifiers

    Regular expressions would be of little use if they could match only text of a fixed length. Repetition qualifiers solve this by allowing a pattern to match text of varying length.

    Match 0 or 1 times

    In the previous example, the pattern tom[a-z] would successfully find any four-symbol word starting with tom, but not tom itself. To make the pattern match tom as well, we should instruct it to do so. The qualifier ? tells the engine to match the preceding pattern 0 or 1 times. The following pattern will match tom as well: tom[a-z]?
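
    The effect of the ? qualifier can be checked with a small sketch, assuming Python's re module (where ? after a set has the same 0-or-1 meaning) and a made-up input.

```python
import re

# "[a-z]?" makes the fourth letter optional, so plain "tom" matches too.
optional_matches = re.findall(r"tom[a-z]?", "tom tomb")
print(optional_matches)
```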

    Greedy and Non-greedy Matches

    Before we proceed with the other repetition qualifiers, we should understand one important thing about repetition modes.

    Imagine a text that contains several occurrences of a character. For example, one, two, three, four. This text contains three commas. Now we want to instruct the regular expression engine to "match all characters but stop before a comma".

    A greedy mode will match all characters and stop before the last comma, so the match is "one, two, three":

    one, two, three, four.

    A non-greedy mode will match all characters and stop before the first comma, so the match is "one":

    one, two, three, four.
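
    The two modes can be compared in a short sketch. It assumes Python's re module, where * is the greedy "0 or more" qualifier and *? is its non-greedy form; these spellings belong to Python and are not Text Workbench operators. A group captures the text that precedes the comma.

```python
import re

text = "one, two, three, four."

# Greedy: ".*" takes as much as possible, so it stops before the LAST comma.
greedy = re.match(r"(.*),", text).group(1)

# Non-greedy: ".*?" takes as little as possible, so it stops before the FIRST comma.
non_greedy = re.match(r"(.*?),", text).group(1)

print(greedy)
print(non_greedy)
```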

    Match previous pattern 0 or more times

    Let us extend the previous example by introducing a new condition: match all text starting with tom and ending with a full stop. So we need to:

    1. match tom;
    2. match any character;
    3. repeat the preceding condition 0 or more times until the first occurrence of the next match (4) is found;
    4. match a full-stop (a dot).

    The following table shows the corresponding operators:

    Part | Operator | Comment
    Match tom | tom | A simple text
    Match any character | . | A dot operator
    Repeat the preceding condition until the first occurrence of the next match is found | @ | Repeat qualifier: match previous pattern 0 or more times (non-greedy)
    Match a full stop (a dot) | \. | A dot. The escape tells the engine to treat the dot as an ordinary symbol, not as an operator.


    Thus, the pattern would look like:

    tom.@\.
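
    For readers who want to try this outside Text Workbench, here is a hedged equivalent in Python's re module. Python has no @ qualifier; its non-greedy "0 or more" is written *?, so the pattern is transcribed as tom.*?\. and the sample sentence is invented.

```python
import re

# Made-up sample text with two full stops.
sample = "tomorrow is fine. Later."

# "tom", then any characters (as few as possible), then a literal full stop.
first_sentence = re.search(r"tom.*?\.", sample).group()
print(first_sentence)
```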

    Patterns and Alternatives

    Say you need to find one of the words macrocoding and macrocode. There are several ways to do that. For example, we can split each word into macrocod+ing and macrocod+e. Now we need a pattern that matches macrocod followed by either ing or e.

    When we say "or", we say "or". When a regular expression says "or", it says "|". Armed with this knowledge, we write: macrocoding|e.

    Looks rather ambiguous, doesn't it? Would this expression match macrocoding or e, or would it match macrocodin followed by g or e? This is why the pattern operator was introduced.

    A pattern operator concatenates several stand-alone symbols or patterns to form one pattern. For example, a single symbol e is a pattern. The first symbol (i) in "ing" is a stand-alone pattern. To form a single pattern from "ing", we should enclose it in braces:

    {ing}

    Now, ing is a single pattern.

    This allows us to write the following pattern:

    {macrocod{ing}|{e}}

    This is a correct well-formed single pattern.
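
    The same grouping idea can be sketched in Python's re module, with the caveat that the syntax differs: Python has no brace pattern operator, and a non-capturing group (?:...) plays the closest role. The sample text is made up, and the first line shows how Python, specifically, resolves the ungrouped alternative.

```python
import re

# Without grouping, Python reads this as "macrocoding" OR "e".
ungrouped = re.findall(r"macrocoding|e", "macrocode")

# With a non-capturing group, the alternative applies to the ending only.
grouped = re.findall(r"macrocod(?:ing|e)", "macrocode and macrocoding")

print(ungrouped)
print(grouped)
```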

    Expressions

    In terms of semantics, expression and pattern operators are the same. The difference is that the text that matches an expression is stored and can be referenced later, for example, when replacing.

    For example, we could alter the previous example to make an expression out of the ending {ing}|{e} by enclosing it in parentheses:

    {macrocod({ing}|{e})}

    Now we can reference the ending with the operator \1, where 1 is the number of the expression. We can write a replace pattern that inserts a plus sign between macrocod and the ending:

    macrocod+\1
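
    A minimal sketch of the same replacement, assuming Python's re module. There, parentheses also create a numbered group and \1 refers to it in the replacement string; the braces of the Text Workbench pattern are dropped because Python does not use them for grouping. The input text is invented.

```python
import re

# Made-up input containing both word forms.
sample = "macrocoding macrocode"

# (ing|e) is stored as expression number 1 and reused via \1 in the replacement.
replaced = re.sub(r"macrocod(ing|e)", r"macrocod+\1", sample)
print(replaced)
```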