How to do overlapping matches with regular expressions

Published on in JavaScript and Regular expressions

Last updated on

E.g. how to capture overlapping pairs of letters from the string 'abcde', i.e. 'ab', 'bc', 'cd' and 'de'. Spoiler: with lookahead and lookbehind assertions.

Table of contents

Consuming characters vs looking ahead/behind

We could try matching two letters at a time, like this:

'abcde'.match(/[a-z][a-z]/g)
//=> ['ab', 'cd']

But it doesn't work. Why not?

Turns out that when the regular expression matches the first two letters (a and b), it consumes them, meaning that the next match will be done against the rest of the input string, i.e. 'cde'. That's why the above regular expression doesn't capture overlapping pairs of letters.

This is what lookahead and lookbehind assertions are for: they look ahead or behind to assert (check) that the input contains some characters, but they don't consume those characters.

We have two options:

const lookahead = /(?=[a-z][a-z])/g
const lookbehind = /(?<=[a-z][a-z])/g

The first regular expression above has a lookahead assertion. When it's used to loop through the input string, it asserts that the current position is followed by two letters. It doesn't consume them because they are inside the lookahead assertion. In fact, the regular expression doesn't consume any characters because all matchable characters are inside the lookahead assertion.

The second regular expression above has a lookbehind assertion. It works similarly, but it asserts that the current character is preceded by two letters.

Here's a table of the lookahead/lookbehind assertion results using the two regular expressions above for the string 'abcde':

Index Position (|) Match Look­ahead Look­behind
0 |abcde nothing ab nothing
1 a|bcde nothing bc nothing
2 ab|cde nothing cd ab
3 abc|de nothing de bc
4 abcd|e nothing nothing cd
5 abcde| nothing nothing de

How does this help us? It does if we use capturing groups.

Capturing groups inside lookahead/lookbehind assertions

We now know how to loop through the input string and look ahead/behind to find all (overlapping) pairs of letters. We can capture the pairs by using capturing groups inside the lookahead/lookbehind assertions:

// Before
const lookahead = /(?=[a-z][a-z])/g
const lookbehind = /(?<=[a-z][a-z])/g

// After
const lookahead = /(?=([a-z][a-z]))/g
//                    ^          ^
const lookbehind = /(?<=([a-z][a-z]))/g
//                      ^          ^

One caveat, though: String.prototype.match() ignores capturing groups when using the global flag (g). So these don't work:

'abcde'.match(/(?=([a-z][a-z]))/g)
//=> ['', '', '', '']

'abcde'.match(/(?<=([a-z][a-z]))/g)
//=> ['', '', '', '']

They don't work because:

  • all matchable characters are inside the lookahead/lookbehind assertions, so the regular expressions capture nothing
  • match() ignores capturing groups when using the global flag (g).

Luckily for us, String.prototype.matchAll() does not ignore capturing groups:

'abcde'.matchAll(/(?=([a-z][a-z]))/g)
//=> RegExpStringIterator {}

'abcde'.matchAll(/(?<=([a-z][a-z]))/g)
//=> RegExpStringIterator {}

But oh, matchAll() returns an iterator. Let's convert the iterators to arrays:

Array.from('abcde'.matchAll(/(?=([a-z][a-z]))/g))
//=> [
//     ["", "ab", index: 0, input: "abcde", groups: undefined],
//     ["", "bc", index: 1, input: "abcde", groups: undefined],
//     ["", "cd", index: 2, input: "abcde", groups: undefined],
//     ["", "de", index: 3, input: "abcde", groups: undefined],
//   ]

Array.from('abcde'.matchAll(/(?<=([a-z][a-z]))/g))
//=> [
//     ["", "ab", index: 2, input: "abcde", groups: undefined],
//     ["", "bc", index: 3, input: "abcde", groups: undefined],
//     ["", "cd", index: 4, input: "abcde", groups: undefined],
//     ["", "de", index: 5, input: "abcde", groups: undefined],
//   ]

We can see that each array item is an array with additional properties. The first items are the matches – they are empty strings because all matchable characters are inside the lookahead/lookbehind assertions. The second items are the matches from the capturing groups.

Because we are using Array.from(), we can easily get only the second array items:

Array.from('abcde'.matchAll(/(?=([a-z][a-z]))/g), (match) => match[1])
//=> ['ab', 'bc', 'cd', 'de']

Array.from('abcde'.matchAll(/(?<=([a-z][a-z]))/g), (match) => match[1])
//=> ['ab', 'bc', 'cd', 'de']

And that's how we can do overlapping matches with regular expressions. Or that's the gist of it; depending on your needs, you can also:

  • use negative lookahead/lookbehind assertions
  • put stuff outside the lookahead/lookbehind assertion.

Practical applications

Using overlapping regex matches is one way to convert a directory path or URL pathname into cumulative segments; e.g. converting '/foo/bar/baz/' into an array of '/', '/foo/', '/foo/bar/' and '/foo/bar/baz/'.

Anthony Labarre emailed me about another interesting use case:

I used this as an exercise in bash scripting for Chinese students, where I had them extract Chinese "proverbs" from books. A faster solution than searching the book for each proverb in a list consists in generating all possibly overlapping subsequences of 4, 5, … characters from the book, and then using comm to identify which subsequences match known proverbs from our list. On my machine, this brought the running time from about 15 seconds to about 1 second.

For completeness, here's how I did it in bash [...]. (Except \w does not work for Chinese characters, you need to substitute it with \p{Han} and add the -u flag to pcregrep.)

Have you found overlapping regex matches useful in other scenarios? Let me know! (Email on the front page.)

Browser support

As of March 2022:

Further resources

Wiktor Stribiżew's answer to How can I match overlapping strings with regex? on Stack Overflow shows how to do overlapping matches by using RegExp.prototype.exec() and modifying the regular expression instance's lastIndex property. This is useful if you can't use matchAll(), which isn't supported by Internet Explorer.

Wiktor's answer also shows the solution that I have presented here; I actually learned it from Wiktor's answer (thanks!). This blog post is mainly me teaching myself the topic by teaching it to you.