

What's a Regular Expression?


I. Introduction

Today Farsight Security announced DNSDB 2.0 Flexible Search for DNSDB API. Flexible Search offers powerful new search capabilities that enhance DNSDB API and make it easy to run the DNSDB searches you've always wished you could make.

Early Adopter Access will be available on August 19th, 2020 and General Availability is scheduled for October 20th, 2020. If you're interested in applying for Early Adopter access, please contact support@farsightsecurity.com.

Flexible Search will be bundled at no charge for paid DNSDB API customers (and customers given access to DNSDB API under a grant from Farsight), but will NOT be included as part of DNSDB Community Edition, the free, entry-level version of our flagship solution.

II. Search Syntax Options

Flexible Search is a "finding aid" that supplements and enhances (but does not replace) Standard DNSDB API.

Flexible Search offers users three search syntax modes in DNSDB Scout, and two otherwise.

  • Keyword Search: easily search for a brand name or domain name – just type in a word or string of characters to match.

    Keyword Search is meant to provide an easy starting point for novice searchers. [Available In DNSDB Scout Only]

  • Regular Expressions: Regular expressions are the industry-standard way of expressing search patterns. Regular expressions support simple keyword searches, but also give you the most power when you want or need to begin making more-complex pattern searches.

    This article is meant to give you the chance to learn a little about regular expressions in general now, before Flexible Search is actually made available for your use.

  • Globbing: Globbing is the other pattern matching option that will be available in DNSDB 2.0. We offer globbing as an option for those who may prefer it, but please note that it is simultaneously:

    o More syntactically complex when it comes to doing basic keyword searches, and
    o Much more limited when it comes to supporting non-trivial pattern matching.

    If you are nonetheless interested in it, see our new blogpost, "What is Globbing?" Most users who aren't familiar with either regular expressions or globbing should focus on learning to work with regular expressions, as described in this article, instead.

The goal of this article is to introduce regular expressions ("regexes") to readers who find themselves wondering "what the heck ARE these 'regex' things that some 'techies' keep talking about?"

At its most basic, a regular expression (or "regex") is just a string that describes a pattern to be matched.

For example, imagine a program scanning lines in one or more files, looking for lines that contain the regular expression pattern of interest. When it finds a line with that pattern, it prints that line out. Simple as that sounds, regexes can be extremely powerful and useful. Regexes are routinely used in the cybersecurity world by:

  • Analysts searching logs and other large data files
  • Data scientists massaging input data files so they can be ingested into machine learning models
  • Developers validating input fields and so on.

In a comparatively short article like this one, we can only "scratch the surface" when it comes to all the features and capabilities of regular expressions, but we hope that even this short introduction will still serve to pique your interest in regular expressions and motivate you to learn more about them. If that happens, there are a number of regular expression books you should check out, including several published by O'Reilly.

III. Sample Data Set

To help illustrate how regular expressions work, we've created a small sample data file called domains.txt that contains fifty-two assorted domain names (see Appendix 1). We're going to use that file as data for our examples.

IV. The Tool We'll Use To Demonstrate Regular Expression Matching: GNU egrep

While some of you may participate in our Early Adopter Program, most of you won't have access to DNSDB Flexible Search until October 20th, 2020. Therefore, we've couched the following discussion in terms of the commonly available GNU egrep command. GNU egrep treats regular expressions the same way that DNSDB 2.0 Flexible Search will, so eager readers can start learning and practicing before Flexible Search reaches General Availability.

Once you get access to Flexible Search, using regular expressions will be as simple as plugging them into the Find field in DNSDB Scout, or using the --regex qualifier to the dnsdbflex command-line client.

The Unix "grep" command name is a "portmanteau" word. It was built from parts of the words in the phrase "globally search for a regular expression and print."

It is a staple command-line utility on Unix systems (and on Unix-like operating systems such as Linux and Mac OS X) and should exist (in some form or another) on virtually every Unix or Unix-like system. egrep is an enhanced version of grep. It's what we're going to use for the examples shown in this article.

On the latest version of macOS (Catalina), the system-provided egrep appears to (still) be:

$ /usr/bin/egrep --version
egrep (BSD grep) 2.5.1-FreeBSD

The 2.5.1 version of egrep is known to have a bug that has been fixed in later versions. Unfortunately, because the later versions use a different open source license, macOS has not moved to one of the versions where that bug has been corrected. The bug is serious enough that it visibly affects the results you may get for even relatively simple queries.

Thus, we normally prefer to use, and recommend that you use, the GNU version of egrep. If GNU egrep is installed on your system (and used by default), you should see something like:

$ egrep --version
grep (GNU grep) 3.3
[etc]

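If GNU egrep is not installed, you may be able to install it via your operating system's package manager. For example, on a Mac using Homebrew you can install GNU grep by saying:

$ brew install grep

By default Homebrew installs the GNU commands with a "g" prefix (for example, ggrep and gegrep), so you may need to call them by those names or adjust your PATH.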

You can also download and install GNU egrep from source.

V. Regular Expression "Building Block" Characters

Regular expressions are just strings (sometimes quite cryptic-looking strings, but still, just strings). We'll normally put regular expressions inside single quote marks. Each regex gets built using a combination of:

  • "Literal" characters (e.g., uppercase and lowercase letters, numbers, and some symbols)

  • "Meta" characters – there are many symbols that serve as a shorthand for special things such as "match any one character here." The meta characters that normally do special things in at least some circumstances are as follows:

Character	Name			Special Thing This Symbol Means
\		backslash		"Escapes" the character after this one
.		dot			Match any one character here
*		star			Repeat the previous zero or more times
^		caret			Match start of line
$		dollar sign		Match end of line
?		question mark		Optional (match zero or one time only)
+		plus sign		Repeat the previous one or more times
|		vertical bar		Logical or (match either)
{ and }		curly braces		Repetition count: {n}, {min,}, {min,max}, or {,max}
( and )		parentheses		Define logical subexpression ("create grouping")
[		left square bracket	Define character class

If you want to literally match those metacharacters, prefix them with a backslash. [Note: Some versions of egrep may attempt to guess if a metacharacter "should" be treated as a metacharacter or as the literal character. That is risky, however, so we generally urge you to explicitly indicate if you want a metacharacter to be treated as a literal.]

  • "Character classes:" These can be of two types: shorthand character classes, and bracketed character classes.

Shorthand character classes (such as \w, \d, \s) are used in some regular expression implementations, but will not be available in DNSDB's Flexible Search regex implementation.

Bracketed character classes are either predefined character classes that look like the following (this is not an exhaustive list of these):

[[:alpha:]]		Any upper or lower case alphabetic character
[[:digit:]]		Any digit from 0 to 9
[[:alnum:]]		Any alphanumeric character
[[:xdigit:]]		Any hexadecimal digit (i.e., 0-9 plus A-F or a-f)

or classes that the user defines, such as

[aeiouy]		Matches any vowel (or pseudo-vowel, in the case of "y")
[^aeiouy]		Matches any NON-vowel (including other letters, numbers, symbols, etc.)

Note that MOST metacharacters lose their special meanings within square brackets (a notable exception is the caret symbol, as just shown in the [^aeiouy] example).
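As a small illustration of the two kinds of bracketed classes, here is a sketch that runs `grep -E` (the same thing as egrep) against four names borrowed from the sample file. The variable names are ours and are only there for illustration.

```shell
# Four names borrowed from the article's sample data.
sample='4j.lane.edu
bethel.k12.or.us
office365.com
reed.edu'

# Predefined class: lines that contain at least one digit.
with_digit=$(printf '%s\n' "$sample" | grep -E '[[:digit:]]')

# User-defined negated class: lines where the character right before
# ".edu" is anything other than a vowel (or "y").
before_edu=$(printf '%s\n' "$sample" | grep -E '[^aeiouy]\.edu')

printf 'contains a digit:\n%s\n' "$with_digit"
printf 'non-vowel before .edu:\n%s\n' "$before_edu"
```

Note that 4j.lane.edu is not in the second result, because the character in front of its ".edu" is the vowel "e".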

VI. The Simplest Regular Expression: Matching A Literal Substring

Let's use a regex to find lines from our sample data file that contain the literal string "off". We'll run the egrep command from a Terminal window on our Mac:

$ egrep 'off' domains.txt
coffee.com
office.com
office365.com

This is a pretty straightforward command: it takes a regular expression (in this case the literal string off, in single quote marks) and looks for matching lines in the specified file (domains.txt). Three "hits" are found: coffee.com, office.com and office365.com. Those get printed out when we run that command.

While in this case we just looked for a short three-character string, we could have looked for a single character, many characters, or even multiple words. (Just be sure to enclose the literal string to be matched in single quote marks if the string includes spaces!)

If we want to find lines that DON'T contain the string 'off', we can use the egrep -v option, which prints only the lines that do NOT match the specified pattern:

$ egrep -v 'off' domains.txt

all lines EXCEPT coffee.com, office.com and office365.com get output here

VII. Case (In)sensitivity

Regular expressions are case sensitive by default (so if we'd looked for 'OFF' instead of 'off', we wouldn't have found any matches).

If we want egrep to do case-insensitive matches, we can add the -i option to our egrep command:

$ egrep -i 'OFF' domains.txt
coffee.com
office.com
office365.com
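The effect of -i is easier to see when the data itself contains mixed case. The sketch below assumes a hypothetical entry, Office.com, that is NOT in domains.txt, and compares the default behavior with -i.

```shell
# Two names from the sample file plus one hypothetical mixed-case entry.
sample='coffee.com
Office.com
google.com'

# Default (case-sensitive): "Off" does not match the pattern 'off'.
sensitive=$(printf '%s\n' "$sample" | grep -E 'off')

# With -i, the pattern 'off' also matches "Off".
insensitive=$(printf '%s\n' "$sample" | grep -E -i 'off')

printf 'default:\n%s\n' "$sensitive"
printf 'with -i:\n%s\n' "$insensitive"
```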

VIII. Figuring Out WHAT'S Matching

Another handy option to egrep is the --color option. It highlights the text that matches the regular expression we supplied:

$ egrep --color 'off' domains.txt
coffee.com
office.com
office365.com

We don't urgently need this option to understand such a simple match, but when regexes get more complex – or we make a mistake constructing our regex – highlighting the text that matched a regex can really come in handy as a debugging tool.

IX. Matching EITHER of Two Literal Substrings

Let's do another literal substring regex.

What if we want to find lines that have EITHER the literal substring 'go' OR the literal substring 'off'?

GNU egrep can help us do that with the vertical bar (or "pipe") metacharacter.

$ egrep --color 'go|off' domains.txt
coffee.com
duckduckgo.com
eugene-or.gov
google.com
house.gov
office.com
office365.com
oregonstate.edu
senate.gov
supremecourt.gov
uoregon.edu
whitehouse.gov

Note that the vertical bar ("pipe") character is a metacharacter – it does NOT need to be physically part of the string text we're matching.

If helpful or necessary, you can also use parentheses to set off the limits of an alternating match. For example:

$ egrep --color 'e(go|ug)' domains.txt
eugene-or.gov
oregonstate.edu
uoregon.edu

That pattern matches all records that have ego or eug in them.
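To see why the parentheses matter, the following sketch counts matching lines for a grouped and an ungrouped version of a similar pattern, using seven names from the sample file. The pattern 'r(eg|in)' is our own example, not one from the article's output above.

```shell
# Seven names borrowed from the article's sample data.
sample='bing.com
instagram.com
linkedin.com
marines.mil
oregonstate.edu
springfield.k12.or.us
uoregon.edu'

# Grouped: an "r" followed by either "eg" or "in" (that is, "reg" or "rin").
grouped=$(printf '%s\n' "$sample" | grep -E -c 'r(eg|in)')

# Ungrouped: either "reg" OR a bare "in" anywhere in the line.
ungrouped=$(printf '%s\n' "$sample" | grep -E -c 'reg|in')

printf 'r(eg|in) matched %s lines\n' "$grouped"
printf 'reg|in   matched %s lines\n' "$ungrouped"
```

Without the parentheses, the alternation splits the whole pattern in two, so names such as bing.com and linkedin.com match simply because they contain "in".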

X. The Dot

Up until this part of the article, we've been matching literal strings. That's cool and useful, but the real power of regular expressions comes when we begin to work with wildcards – in this case literally the dot (".") character. Dot is a metacharacter that matches any single character.

$ egrep --color 'g.p' domains.txt
blogspot.com

If we have two dots in a row, that matches any two characters:

$ egrep --color 'r..e' domains.txt
lclark.edu
marines.mil
supremecourt.gov

and we could also search for any three characters in a row, any four characters in a row, etc.

Note that if we want to match an ACTUAL dot (and dots are obviously VERY common in domain names), we need to ask to match an escaped ("backslashed") dot:

$ egrep --color '\.k12\.' domains.txt
bethel.k12.or.us
cal.k12.or.us
springfield.k12.or.us

If we didn't remember to escape those "real dots," the unescaped dots would still match real dots, but they'd also match any OTHER single character in that spot, too.
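A short sketch of that difference, using four names from the sample file and a pattern of our own choosing ('e.c' versus 'e\.c'):

```shell
# Four names borrowed from the article's sample data.
sample='apple.com
coffee.com
lanecc.edu
office.com'

# Unescaped dot: "e", then ANY one character, then "c".
unescaped=$(printf '%s\n' "$sample" | grep -E 'e.c')

# Escaped dot: "e", then a literal dot, then "c".
escaped=$(printf '%s\n' "$sample" | grep -E 'e\.c')

printf 'e.c matches:\n%s\n' "$unescaped"
printf 'e\\.c matches:\n%s\n' "$escaped"
```

The unescaped version also picks up lanecc.edu, because the "ecc" in that name satisfies "e, any character, c".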

XI. The Power of "dot star"

If you think dot was cool, wait until you learn about dot star ('.*') – it's VERY cool!

  • dot stands for "match any character"
  • star stands for "repeat the previous match zero or more times"

If we had a regular expression that was simply '.*' it would match all lines.

Therefore, most matches that contain '.*' also include other specific patterns to match. For example, let's find lines that have a b, then zero or more other characters, then a c:

$ egrep --color 'b.*c' domains.txt
bing.com
blogspot.com
crabcake.com
ebay.com
facebook.com
github.com
youtube.com

If we didn't have the star metacharacter to give us flexibility here, we'd have to write a much "clunkier" regex with all possible patterns of zero or more dots in between the two letters of interest:

$ egrep '(bc|b.c|b..c|b...c|b....c|b.....c|b......c|b.......c|b........c)' domains.txt
same output as the previous example omitted here

Yuck! And just imagine how ugly that expression would get if one of the domain names in the file happened to be a long name with a b near the start and a c twenty or thirty characters later! Truly, the "magic of dot star" is a huge convenience when it comes to writing some regular expressions.
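As a quick sanity check, the sketch below compares 'b.*c' with a counted form, 'b.{0,8}c' (curly-brace repetition counts are covered in the section on repetition factors), against a subset of the sample file. The counted form is only equivalent here because no name in this sample has more than eight characters between the b and the c.

```shell
# A subset of the article's sample data.
sample='amazon.com
apple.com
bing.com
blogspot.com
crabcake.com
ebay.com
github.com'

# "b", then zero or more of any character, then "c".
with_star=$(printf '%s\n' "$sample" | grep -E 'b.*c')

# "b", then zero to eight of any character, then "c".
with_count=$(printf '%s\n' "$sample" | grep -E 'b.{0,8}c')

printf '%s\n' "$with_star"
if [ "$with_star" = "$with_count" ]; then
    printf 'Both patterns matched the same lines.\n'
fi
```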

XII. A Note About "dot star" Matches: "Greed Is Good"

When GNU egrep finds matches, sometimes more than one match is possible. For example, if you asked to match '^st.*o' there are three ways it could match one line from our sample data, stackoverflow.com:

stacko OR stackoverflo OR stackoverflow.co

All three of those matches start with "st" and end with "o", right? But which one of those will GNU egrep return by default?

The answer is that GNU egrep agrees with the fictional character Gordon Gekko, played by Michael Douglas in the 1987 movie "Wall Street," who became (in)famous for saying "Greed is good."

By default, repetition operators such as star will always try to match as much as possible while still satisfying the requested pattern. So in this example, egrep will pick the last of the possible results, stackoverflow.co (everything up to the final "o" in the line).
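One way to see the greedy behavior for yourself is grep's -o option, which prints only the part of the line that matched. The sketch below also shows one possible workaround when you want the shortest match instead: replacing the dot with a negated character class. That second pattern is our own illustration.

```shell
name='stackoverflow.com'

# Greedy: ".*" runs as far as it can, up to the LAST "o" in the line.
greedy=$(printf '%s\n' "$name" | grep -E -o '^st.*o')

# "[^o]*" cannot cross an "o", so the match stops at the FIRST "o".
shortest=$(printf '%s\n' "$name" | grep -E -o '^st[^o]*o')

printf 'greedy:   %s\n' "$greedy"
printf 'shortest: %s\n' "$shortest"
```

The first line of output shows stackoverflow.co and the second shows stacko.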

XIII. Matching A Single Character That's Part of an Enumerated Set of Characters

We've seen how dot matches any SINGLE character, and dot star matches any ZERO OR MORE characters. But what if we want to match a single character from just an enumerated set of characters? For example, what if we want to match:

  • the character b
  • zero or more other characters
  • at least one vowel (a, e, i, o, u, or y)
  • zero or more other characters
  • the character c

It turns out that regular expressions can help us do this as well, using square brackets (as introduced in Section V, above) to define a character set:

$ egrep --color 'b.*[aeiouy].*c' domains.txt
bing.com
blogspot.com
crabcake.com
ebay.com
facebook.com
youtube.com

If you're referring to a contiguous range of characters, rather than a short, enumerated list of characters, you can take advantage of the dash character to avoid having to type a long list:

  • [a-z] (lowercase letters)
  • [a-zA-Z] (uppercase and lowercase letters)
  • [a-zA-Z0-9] (uppercase and lowercase letters plus digits)

If you want to put a literal caret (^) in a list of characters, you can; just don't put it first (if you do, it will be interpreted as meaning "take the complement of the characters that follow").

If you want to include a literal right square bracket (]) in a list of characters, you can, but it must be the FIRST character in the list.

If you want to put a literal dash (-) in a list of characters in square brackets, put it LAST.
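Those three placement rules can be combined in a single bracket expression. The sketch below uses four made-up test strings (they are not from domains.txt) and the class '[]^-]', which puts the right bracket first, the caret in the middle, and the dash last.

```shell
# Hypothetical test strings, one per line.
sample='a^b
a]b
a-b
abc'

# "]" first, "^" not first, "-" last: all three are treated as literals.
matched=$(printf '%s\n' "$sample" | grep -E '[]^-]')

printf '%s\n' "$matched"
```

Only abc is left out, since it contains none of the three literal characters.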

XIV. Repetition Factors ("Counts")

You can also use "repetition factors" or "counts" to ask for multiples of patterns. For example, if you wanted to find names from our sample file that had two successive vowels, you could write:

$ egrep --color '[aeiouy]{2}' domains.txt
coffee.com
ebay.com
eou.edu
eugene-or.gov
facebook.com
freedom.com
geoduck.com
google.com
house.gov
oit.edu
paypal.com
reed.edu
sou.edu
springfield.k12.or.us
supremecourt.gov
uoregon.edu
whitehouse.gov
wikipedia.org
wou.edu
yahoo.com
youtube.com

In addition to asking for exactly a specific value, you can also specify a repetition range, such as:

  • {2,5} (meaning match between two and five times)
  • {3,} (meaning match if present at least 3 times)
  • ? (can appear zero or one time only – same as saying {0,1})
  • * (can appear zero or more times – same as saying {0,})
  • + (can appear one or more times – same as saying {1,})

For example:

$ egrep --color 'ube?\.com$' domains.txt 
github.com
youtube.com

In this case, the "e" was optional, which is why youtube.com AND github.com successfully matched.
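Here is a further sketch of an open-ended range and the plus sign, run against five names from the sample file. The two patterns are our own examples.

```shell
# Five names borrowed from the article's sample data.
sample='eou.edu
reed.edu
sou.edu
yahoo.com
youtube.com'

# {3,}: three or more successive vowels (counting "y").
three_vowels=$(printf '%s\n' "$sample" | grep -E '[aeiouy]{3,}')

# +: an "h" followed by one or more "o" characters, then ".com".
h_then_os=$(printf '%s\n' "$sample" | grep -E 'ho+\.com')

printf 'three or more vowels in a row:\n%s\n' "$three_vowels"
printf 'h, one or more o, then .com:\n%s\n' "$h_then_os"
```

reed.edu, sou.edu and yahoo.com each have runs of only two vowels, so they fall short of {3,}.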

XV. Anchors

The patterns that we've been matching have all been "floating" patterns. Those patterns can potentially match suitable text seen anywhere in lines they're scanning. But what if we only want to match a particular pattern at the start of a line, or at the end of a line? Those types of searches are called "anchored searches," and we can use special metacharacters to limit our results:

^ (the caret symbol)		"At the start of the line"
$ (the dollar sign)		"At the end of the line"

For example, let's find the domains in the file that are '.edu' domains:

$ egrep --color '\.edu$' domains.txt
4j.lane.edu
eou.edu
lanecc.edu
lclark.edu
oit.edu
oregonstate.edu
pdx.edu
reed.edu
sou.edu
uoregon.edu
willamette.edu
wou.edu

Important note: In DNSDB 2.0 Flexible Search, domain names in RRnames (and some Rdata) are written with a "formal ending dot." Literal dots are also escaped with a backslash. That means that the domain name

wou.edu

would be written in regular expression format as:

wou\.edu\.$

If the data you're matching against is written that way, the anchored search would need to be written:

$ egrep --color '\.edu\.$' domains.txt

rather than just

$ egrep --color '\.edu$' domains.txt

Or as another example, let's find the domains that begin with an s:

$ egrep --color '^s' domains.txt
senate.gov
sou.edu
springfield.k12.or.us
stackoverflow.com
supremecourt.gov
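To tie the anchors back to the "formal ending dot" note above, the sketch below uses a few names written with a trailing dot. The entry edu.example.com. is hypothetical and is included only to show that the anchor keeps ".edu" from matching in the middle of a name.

```shell
# Names written with a formal ending dot; the last one is hypothetical.
sample='reed.edu.
senate.gov.
wou.edu.
edu.example.com.'

# End anchor that accounts for the trailing dot.
edu_names=$(printf '%s\n' "$sample" | grep -E '\.edu\.$')

# The same search WITHOUT the trailing dot finds nothing in this data.
# (grep exits non-zero when nothing matches, hence the "|| true".)
missed=$(printf '%s\n' "$sample" | grep -E -c '\.edu$' || true)

# Both anchors at once: starts with "s" and ends with ".gov."
s_gov=$(printf '%s\n' "$sample" | grep -E '^s.*\.gov\.$')

printf '.edu names:\n%s\n' "$edu_names"
printf 'lines matched without the trailing dot: %s\n' "$missed"
printf 'starts with s, ends with .gov.:\n%s\n' "$s_gov"
```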

XVI. Specialty Versions of grep

You may also want to know that there are some "specialty" versions of grep, such as:

  • agrep: "approximate GREP for fast fuzzy string searching."

  • cidrgrep: "A grep-like tool used to filter IP addresses against one or more CIDR network patterns." (see also grepcidr)

  • ripgrep: an extremely fast grep implementation that also supports searching non-UTF8 files, searching compressed files, and much more.

XVII. Conclusion

You've now had a bit of a whirlwind introduction to regular expressions. If you want to learn more, check out the books mentioned in the introduction, or consider trying one of the online interactive regular expression tutorials.

Regular expressions may feel a bit like they're "brain teasers" or puzzles from The New York Times puzzle page, but if you tackle them with the right attitude, you may find they're exceptionally powerful and sort of fun, too!

Acknowledgements

The author would like to acknowledge valuable reviews and commentary from colleagues, including (in alphabetical order) Chris Mikkelson, Jeremy Reed, Chuq Von Rospach, Stephen Watt, and Eric Ziegast. Any remaining issues or errors are solely the responsibility of the author. 

APPENDIX 1. DOMAINS USED IN THIS ARTICLE'S EXAMPLES
$ cat domains.txt 
4j.lane.edu
af.mil
amazon.com
apple.com
army.mil
bethel.k12.or.us
bing.com
blogspot.com
cal.k12.or.us
coffee.com
crabcake.com
duckduckgo.com
ebay.com
eou.edu
eugene-or.gov
facebook.com
freedom.com
geoduck.com
github.com
google.com
house.gov
instagram.com
lanecc.edu
lclark.edu
linkedin.com
live.com
marines.mil
microsoft.com
msn.com
navy.mil
netflix.com
office.com
office365.com
oit.edu
oregonstate.edu
paypal.com
pdx.edu
reddit.com
reed.edu
senate.gov
sou.edu
springfield.k12.or.us
stackoverflow.com
supremecourt.gov
twitter.com
uoregon.edu
whitehouse.gov
wikipedia.org
willamette.edu
wou.edu
yahoo.com
youtube.com

ENDNOTE

More on validating input fields: An example of input validation might be a rule that says "the employee salary field can only contain numbers, commas, a decimal point and/or a dollar sign." "$83,412.15" would pass that validation definition but "$7K/month" would not. More carefully-defined validation rules might be used to screen out typos/data entry errors such as "8$3,412.15" or "$83,,412.15" or "$83,412.155". Validation rules might also be used to identify likely-out-of-range values such as "$8341215".
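As a rough sketch of what such rules could look like as regular expressions, the two shell functions below test a value against a loose rule (only digits, commas, periods and dollar signs) and a stricter one (a leading dollar sign, comma-separated groups of three digits, and exactly two decimal places). Both function names and the stricter pattern are our own assumptions, not rules taken from any particular product.

```shell
# Loose rule: only digits, commas, periods, and dollar signs are allowed.
is_loose_salary() {
    printf '%s\n' "$1" | grep -E -q '^[0-9,.$]+$'
}

# Stricter (hypothetical) rule: "$", 1-3 digits, zero or more groups of
# ",ddd", then "." and exactly two digits.
is_strict_salary() {
    printf '%s\n' "$1" | grep -E -q '^\$[0-9]{1,3}(,[0-9]{3})*\.[0-9]{2}$'
}

for value in '$83,412.15' '$7K/month' '8$3,412.15' '$83,,412.15' '$83,412.155'; do
    if is_loose_salary "$value"; then loose=pass; else loose=fail; fi
    if is_strict_salary "$value"; then strict=pass; else strict=fail; fi
    printf '%-12s loose=%s strict=%s\n' "$value" "$loose" "$strict"
done
```

Only "$83,412.15" passes both rules. The three typo examples get through the loose rule, since they use only permitted characters, but are caught by the stricter one.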

Joe St Sauver Ph.D. is a Distinguished Scientist and Director of Research with Farsight Security®, Inc.