Write Performant Regular Expressions

Here we discuss some of the important aspects that you must consider while crafting performant regular expressions.

For the regular-expression constructs that can be used with parsers, labels, and datafilters, see Java Platform Standard Ed. 8 Documentation.

In case of lookups and query-time regex, refer to the RE2J syntax at Java Implementation of RE2.

Character Classes

The character classes specify the characters that you're trying or not trying to match. Ensure to replace the . in your .*s with a more specific character. The .* will invariably shift to the end of your input and will then backtrack, that is return to a previously saved state to continue the search for a match. When using a specific character class, you have control over how many characters the * will cause the regex engine to consume, giving you the power to stop the rampant backtracking.

Consider the following example regular expression:

(\d{4})(\d{2})(\d{2}),(\S+),([\S\s]*),([\S\s]*),([\S\s]*),([\S\s]*),([\S\s]*),([\S\s]*),([\S\s]*),([\S\s]*),([\S\s]*),([\S\s]*),([0-9.]+),([0-9.]+),([0-9.]+),([0-9.]+),([0-9.]+),([0-9.]+)

For the following input:

20150220,201502,16798186260,tvN,Entertainment/Music,Female,67,2,Individual,ollehtv Economic,Commercial Housing,5587,0,2,0,1,1

As the result of the specified regular expression, the match can run into backtracking. This situation is detected by Oracle Logging Analytics and the match operation is aborted.

By changing the regular expression to the example below, you can ensure that the match completes faster. Notice that [\S\s]* is changed to [^,] which avoids unnecessary backtracking.

(\d{4})(\d{2})(\d{2}),(\S+),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([0-9.]+),([0-9.]+),([0-9.]+),([0-9.]+),([0-9.]+),([0-9.]+)

Lazy Quantifiers

In many regexes, greedy quantifiers (.*s) can be safely replaced by lazy quantifiers (.*?s), thus giving the regex a performance boost without changing the result.

Consider the input:

Trace file /u01/app/oracle/diag/rdbms/navisdb/NAVISDB/trace/NAVISDB_arc0_3941.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/11.2.0/db_1
System name: Linux
Node name: NAVISDB
Release: 2.6.18-308.el5
Version: #1 SMP Fri Jan 27 17:17:51 EST 2012
Machine: x86_64
Instance name: NAVISDB
Redo thread mounted by this instance: 1
Oracle process number: 21
Unix process pid: 3941, image: oracle@NAVISDB (ARC0)

Consider the following greedy regex for the given input:

Trace\sfile\s(\S*).*ORACLE_HOME\s*[:=]\s*(\S*).*System\sname:\s*(\S*).*Node\sname:\s*(\S*).*Release:\s*(\S*).*Machine:\s*(\S*).*Instance\sname:\s*(\S*).*Redo\sthread\smounted\sby\sthis\sinstance:\s(\d*).*Oracle\sprocess\snumber:\s*(\d*).*Unix\sprocess\spid:\s(\d*).*image:\s+([^\n\r]*)

The regex engine shoots to the end of the input every time it encounters .*.. The first time that the .* appears, it consumes all the input and then backtracks until it gets to ORACLE_HOME. This is an inefficient way of matching. The alternative lazy regex is as shown below:

Trace\sfileRelease:\s*(\S*).*?Machine:\s*(\S*).*?Instance\sname:\s*(\S*).*?Redo\sthread\smounted\sby\sthis\sinstance:\s(\d*).*?Oracle\sprocess\snumber:\s*(\d*).*?Unix\sprocess\spid:\s(\d*).*?image:\s+([^\n\r]*)

The above lazy regex consumes starting from the beginning of the string until it reaches ORACLE_HOME, at which point it can proceed to match the rest of the string.

Note: If the ORACLE_HOME field appears toward the beginning of the input, the lazy quantifier should be used. If the ORACLE_HOME field appears toward the end, it might be appropriate to use the greedy quantifier.

Anchors

Anchors tell the regex engine that you intend the cursor to be in a particular place in the input. The most common anchors are ^ and $, indicating the beginning and end of the input.

Consider the following regexes to find an IPv4 address:

\d{1,3}\.d{1,3}.\d{1,3}.\d{1,3}
^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

Notice that the second regex begins with ^ and is specific about the IP address appearing at the beginning of the input.

We're searching for the regex in input that looks like the following example:

107.21.20.1 - - [07/Dec/2012:18:55:53 -0500] "GET
      /extension/bsupport/design/cl/images/btn_letschat.png HTTP/1.1" 200
    2144

A non-matching input would look similar to the following example:

[07/Dec/2012:23:57:13 +0000] 1354924633 GET "/favicon.ico" "" HTTP/1.1 200 82726 "-"
      "ELB-HealthChecker/1.0"

The second regex (which starts with ^) runs faster on the non-matching input because it discards the non-matching input immediately.

The Importance of Alternation

The order of alternation counts, so place the more common options in the front so they can be matched faster. If the rarer options are placed first, then the regex engine will waste time in checking those before checking the more common options which are likelier to succeed. Also, try to extract common patterns. For example, instead of (abcd|abef) use ab(cd|ef).

Observe the following regexes:

{TIMEDATE}\s{0,1}:\s*(?:|\[)(\w+)(?:\:|\])(?:\[|)(\d+)(?:\:|\])(.{1,1000}).*
{TIMEDATE}\s{0,1}:\s*(?:\[|)(\w+)(?:\:|\])(?:\[|)(\d+)(?:\:|\])(.{1,1000}).*

On the following input:

2014-06-16 12:13:46.743: [UiServer][1166092608] {0:7:2} Done for
ctx=0x2aaab45d8330

The second regex matches faster as the alternation looks for character [ first, followed by null. As the input has [, the match runs faster.

Sample Parse Expressions

You can refer to the following sample parse expressions to create a suitable parse expression for extracting values from your log file.

A log file comprises entries that are generated by concatenating multiple field values. You may not need to view all the field values for analyzing a log file of a particular format. Using a parser, you can extract the values from only those fields that you want to view.

A parser extracts fields from a log file based on the parse expression that you’ve defined. A parse expression is written in the form of a regular expression that defines a search pattern. In a parse expression, you enclose search patterns with parentheses (), for each matching field that you want to extract from a log entry. Any value that matches a search pattern that’s outside the parentheses isn’t extracted.

Example 1

If you want to parse the following sample log entries:

Jun 20 15:19:29 hostabc rpc.gssd[2239]: ERROR: can't open clnt5aa9: No such file or directory
Jul 29 11:26:28 hostabc kernel: FS-Cache: Loaded
Jul 29 11:26:28 hostxyz kernel: FS-Cache: Netfs 'nfs' registered for caching

Following should be your parse expression:

(\S+)\s+(\d+)\s(\d+):(\d+):(\d+)\s(\S+)\s(?:([^:\[]+)(?:\[(\d+)\])?:\s+)?(.+)

In the preceding example, some of the values that the parse expression captures are:

  • (\S+): Multiple non-whitespace characters for the month

  • (\d+): Multiple non-whitespace characters for the day

  • (?:([^:\[]+): (Optional) All the characters except ^, :, \, []; this is for the service name

  • (.+): (Optional) Primary message content

Example 2

If you want to parse the following sample log entries:

####<Apr 27, 2014 4:01:42 AM PDT> <Info> <EJB> <host> <AdminServer> <[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'> <OracleSystemUser> <BEA1-13E2AD6CAC583057A4BD> <b3c34d62475d5b0b:6e1e6d7b:143df86ae85:-8000-000000000000cac6> <1398596502577> <BEA-010227> <EJB Exception occurred during invocation from home or business: weblogic.ejb.container.internal.StatelessEJBHomeImpl@2f9ea244 threw exception: javax.ejb.EJBException: what do i do: seems an odd quirk of the EJB spec. The exception is:java.lang.StackOverflowError>
####<Jul 30, 2014 8:43:48 AM PDT> <Info> <RJVM> <example.com> <> <Thread-9> <> <> <> <1406735028770> <BEA-000570> <Network Configuration for Channel "AdminServer" Listen Address example.com:7002 (SSL) Public Address N/A Http Enabled true Tunneling Enabled false Outbound Enabled false Admin Traffic Enabled true ResolveDNSName Enabled false> 

Following should be your parse expression::

####<(\p{Upper}\p{Lower}{2})\s+([\d]{1,2}),\s+([\d]{4})\s+([\d]{1,2}):([\d]{2}):([\d]{2})\s+(\p{Upper}{2})(?:\s+(\w+))?>\s+<(.*?)>\s+<(.*?)>\s+<(.*?)>\s+<(.*?)>\s+<(.*?)>\s+<(.*?)>\s+<(.*?)>\s+<(.*?)>\s+<\d{10}\d{3}>\s+<(.*?)>\s+<(.*?)(?:\n(.*))?>\s*

In the preceding example, some of the values that the parse expression captures are:

  • (\p{Upper}\p{Lower}{2}): 3-letter short name for the month; with the first letter in uppercase followed by two lowercase letters

  • ([\d]{1,2}): 1-or-2-digit day

  • ([\d]{4}): 4-digit year

  • ([\d]{1,2}): 1-or-2-digit hour

  • ([\d]{2}): 2-digit minute

  • ([\d]{2}): 2-digit second

  • (\p{Upper}{2}): 2-letter AM/PM in uppercase

  • (?:\s+(\w+)): (Optional, some entries may not return any value for this) Multiple alpha-numeric characters for the time zone

  • (.*?): (Optional, some entries may not return any value for this) One or multiple characters for the severity level; in this case <INFO>

  • (.*): Any additional details along with the message

Search Patterns

Some of the commonly used patterns are explained in the following table:

Pattern Description Example
. Any character except line break d.f matches def, daf, dbf, and so on
* Zero or more times D*E*F* matches DDEEFF, DEF, DDFF, EEFF, and so on
? Once or none; optional colou?r matches both colour and color
+ One or more Stage \w-\w+ matches Stage A-b1_1, Stage B-a2, and so on
{2} Exactly two times [\d]{2} matches 01, 11, 21, and so on
{1,2} Two to four times [\d]{1,2} matches 1, 12, and so on
{3,} Three or more times [\w]{3,} matches ten, hello, h2134, and so on
[ … ] One of the characters in the brackets [AEIOU] matches one uppercase vowel
[x-y] One of the characters in the range from x to y [A-Z]+ matches ACT, ACTION, BAT, and so on
[^x] One character that is not x [^/d]{2} matches AA, BB, AC, and so on
[^x-y] One of the characters not in the range from x to y [^a-z]{2} matches A1, BB, B2, and so on
[\d\D] One character that is a digit or a non-digit [\d\D]+ matches any character, including new lines, which the regular dot doesn't match
\s A whitespace (\S+)\s+(\d+) matches AA 123, a_ 221, and so on
\S One character that is not a whitespace (\S+) matches abcd, ABC, A1B2C3, and so on
\n A new line (\d)\n(\w) matches:

1

A

\w An alphanumeric character [\w-\w\w\w] matches a-123, 1–aaa, and so on
\p{Lower} Lowercase letters \p{Lower}{2} matches aa, ab, ac, bb, and so on
\p{Upper} Uppercase letters \p{Upper} matches A, B, C, and so on
\ followed by ?, [], *, . Escape character; to use the characters after \ as literals \? returns ?