How to do overlapping matches with regular expressions

Published on in JavaScript

E.g. how to capture overlapping pairs of letters from the string 'abcde', i.e. 'ab', 'bc', 'cd' and 'de'. Spoiler: with lookahead and lookbehind assertions.

Table of contents

Consuming characters vs looking ahead/behind

We could try matching two letters at a time, like this:

'abcde'.match(/[a-z][a-z]/g)
//=> ['ab', 'cd']

But it doesn't work. Why not?

Turns out that when the regular expression matches the first two letters (a and b), it consumes them, meaning that the next match will be done against the rest of the input string, i.e. 'cde'. That's why the above regular expression doesn't capture overlapping pairs of letters.

This is what lookahead and lookbehind assertions are for: they look ahead or behind to assert (check) that the input contains some characters, but they don't consume those characters.

We have two options:

const lookahead = /(?=[a-z][a-z])/g
const lookbehind = /(?<=[a-z][a-z])/g

The first regular expression has a lookahead assertion. When it's used to loop through the input string, it asserts that the current position is followed by two letters. It doesn't consume them because they are inside the lookahead assertion. In fact, the regular expression doesn't consume any characters because all matchable characters are inside the lookahead assertion.

The second regular expression has a lookbehind assertion. It works similarly, but it asserts that the current character is preceded by two letters.

Here's a table of the lookahead/lookbehind assertion results:

Index Position (|) Match Look­ahead Look­behind
0 |abcde nothing ab nothing
1 a|bcde nothing bc nothing
2 ab|cde nothing cd ab
3 abc|de nothing de bc
4 abcd|e nothing nothing cd
5 abcde| nothing nothing de

How does this help us? It does if we use:

Capturing groups inside lookahead/lookbehind assertions

We now know how to loop through the input string and look ahead/behind to find all (overlapping) pairs of letters. We can capture the pairs by using capturing groups inside the lookahead/lookbehind assertions:

// Before
const lookahead = /(?=[a-z][a-z])/g
const lookbehind = /(?<=[a-z][a-z])/g

// After
const lookahead = /(?=([a-z][a-z]))/g
const lookbehind = /(?<=([a-z][a-z]))/g

One caveat, though: String.prototype.match() ignores capturing groups when using the global flag (g). So these don't work:

'abcde'.match(/(?=([a-z][a-z]))/g)
//=> ['', '', '', '']

'abcde'.match(/(?<=([a-z][a-z]))/g)
//=> ['', '', '', '']

They don't work because:

  • all matchable characters are inside the lookahead/lookbehind assertions, so the regular expressions capture nothing
  • match() ignores capturing groups when using the global flag (g).

Luckily for us, String.prototype.matchAll() does not ignore capturing groups:

'abcde'.matchAll(/(?=([a-z][a-z]))/g)
//=> RegExpStringIterator {}

'abcde'.matchAll(/(?<=([a-z][a-z]))/g)
//=> RegExpStringIterator {}

But oh, matchAll() returns an iterator. Let's convert the iterators to arrays:

Array.from('abcde'.matchAll(/(?=([a-z][a-z]))/g))
//=> [
// ["", "ab", index: 0, input: "abcde", groups: undefined],
// ["", "bc", index: 1, input: "abcde", groups: undefined],
// ["", "cd", index: 2, input: "abcde", groups: undefined],
// ["", "de", index: 3, input: "abcde", groups: undefined],
// ]

Array.from('abcde'.matchAll(/(?<=([a-z][a-z]))/g))
//=> [
// ["", "ab", index: 2, input: "abcde", groups: undefined],
// ["", "bc", index: 3, input: "abcde", groups: undefined],
// ["", "cd", index: 4, input: "abcde", groups: undefined],
// ["", "de", index: 5, input: "abcde", groups: undefined],
// ]

We can see that each array item is an array with additional properties. The first items are the matches – they are empty strings because all matchable characters are inside the lookahead/lookbehind assertions. The second items are the matches from the capturing groups.

Because we are using Array.from(), we can easily get only the second array items:

Array.from('abcde'.matchAll(/(?=([a-z][a-z]))/g), (match) => match[1])
//=> ['ab', 'bc', 'cd', 'de']

Array.from('abcde'.matchAll(/(?<=([a-z][a-z]))/g), (match) => match[1])
//=> ['ab', 'bc', 'cd', 'de']

And that's how we can do overlapping matches with regular expressions. Or that's the gist of it; depending on your needs, you can also:

  • use negative lookahead/lookbehind assertions
  • put stuff outside of the lookahead/lookbehind assertion.

Practical applications

There are probably many. For now, I can think of one:

Further resources

Wiktor Stribiżew's answer to How can I match overlapping strings with regex? on Stack Overflow shows how to do overlapping matches by using RegExp.prototype.exec() and modifying the regular expression instance's lastIndex property. This is useful if you can't use matchAll(), which isn't supported by Internet Explorer.

Wiktor's answer also shows the solution that I have presented here; I actually learned it from Wiktor's answer (thanks!). This blog post is mainly me teaching myself the topic better by teaching it to you.