How to do overlapping matches with regular expressions
Published on in JavaScript and Regular expressions
Last updated on
E.g. how to capture overlapping pairs of letters from the string 'abcde'
,
i.e. 'ab'
, 'bc'
, 'cd'
and 'de'
.
Spoiler: with lookahead and lookbehind assertions.
Table of contents
Consuming characters vs looking ahead/behind
We could try matching two letters at a time, like this:
'abcde'.match(/[a-z][a-z]/g)
//=> ['ab', 'cd']
But it doesn't work. Why not?
Turns out that when the regular expression matches the first two letters
(a
and b
),
it consumes them,
meaning that
the next match will be done against the rest of the input string,
i.e. 'cde'
.
That's why the above regular expression doesn't capture overlapping pairs of letters.
This is what lookahead and lookbehind assertions are for: they look ahead or behind to assert (check) that the input contains some characters, but they don't consume those characters.
We have two options:
const lookahead = /(?=[a-z][a-z])/g
const lookbehind = /(?<=[a-z][a-z])/g
The first regular expression above has a lookahead assertion. When it's used to loop through the input string, it asserts that the current position is followed by two letters. It doesn't consume them because they are inside the lookahead assertion. In fact, the regular expression doesn't consume any characters because all matchable characters are inside the lookahead assertion.
The second regular expression above has a lookbehind assertion. It works similarly, but it asserts that the current character is preceded by two letters.
Here's a table of the lookahead/lookbehind assertion results
using the two regular expressions above for the string 'abcde'
:
Index | Position (| ) |
Match | Lookahead | Lookbehind |
---|---|---|---|---|
0 | |abcde |
nothing | ab |
nothing |
1 | a|bcde |
nothing | bc |
nothing |
2 | ab|cde |
nothing | cd |
ab |
3 | abc|de |
nothing | de |
bc |
4 | abcd|e |
nothing | nothing | cd |
5 | abcde| |
nothing | nothing | de |
How does this help us? It does if we use capturing groups.
Capturing groups inside lookahead/lookbehind assertions
We now know how to loop through the input string and look ahead/behind to find all (overlapping) pairs of letters. We can capture the pairs by using capturing groups inside the lookahead/lookbehind assertions:
// Before
const lookahead = /(?=[a-z][a-z])/g
const lookbehind = /(?<=[a-z][a-z])/g
// After
const lookahead = /(?=([a-z][a-z]))/g
// ^ ^
const lookbehind = /(?<=([a-z][a-z]))/g
// ^ ^
One caveat, though:
String.prototype.match()
ignores capturing groups when using the global flag (g
).
So these don't work:
'abcde'.match(/(?=([a-z][a-z]))/g)
//=> ['', '', '', '']
'abcde'.match(/(?<=([a-z][a-z]))/g)
//=> ['', '', '', '']
They don't work because:
- all matchable characters are inside the lookahead/lookbehind assertions, so the regular expressions capture nothing
match()
ignores capturing groups when using the global flag (g
).
Luckily for us,
String.prototype.matchAll()
does not ignore capturing groups:
'abcde'.matchAll(/(?=([a-z][a-z]))/g)
//=> RegExpStringIterator {}
'abcde'.matchAll(/(?<=([a-z][a-z]))/g)
//=> RegExpStringIterator {}
But oh, matchAll()
returns an iterator.
Let's convert the iterators to arrays:
Array.from('abcde'.matchAll(/(?=([a-z][a-z]))/g))
//=> [
// ["", "ab", index: 0, input: "abcde", groups: undefined],
// ["", "bc", index: 1, input: "abcde", groups: undefined],
// ["", "cd", index: 2, input: "abcde", groups: undefined],
// ["", "de", index: 3, input: "abcde", groups: undefined],
// ]
Array.from('abcde'.matchAll(/(?<=([a-z][a-z]))/g))
//=> [
// ["", "ab", index: 2, input: "abcde", groups: undefined],
// ["", "bc", index: 3, input: "abcde", groups: undefined],
// ["", "cd", index: 4, input: "abcde", groups: undefined],
// ["", "de", index: 5, input: "abcde", groups: undefined],
// ]
We can see that each array item is an array with additional properties. The first items are the matches – they are empty strings because all matchable characters are inside the lookahead/lookbehind assertions. The second items are the matches from the capturing groups.
Because we are using
Array.from()
,
we can easily get only the second array items:
Array.from('abcde'.matchAll(/(?=([a-z][a-z]))/g), (match) => match[1])
//=> ['ab', 'bc', 'cd', 'de']
Array.from('abcde'.matchAll(/(?<=([a-z][a-z]))/g), (match) => match[1])
//=> ['ab', 'bc', 'cd', 'de']
And that's how we can do overlapping matches with regular expressions. Or that's the gist of it; depending on your needs, you can also:
- use negative lookahead/lookbehind assertions
- put stuff outside the lookahead/lookbehind assertion.
Practical applications
Using overlapping regex matches is one way to
convert a directory path or URL pathname into cumulative segments;
e.g. converting
'/foo/bar/baz/'
into an array of
'/'
, '/foo/'
, '/foo/bar/'
and '/foo/bar/baz/'
.
Anthony Labarre emailed me about another interesting use case:
I used this as an exercise in bash scripting for Chinese students, where I had them extract Chinese "proverbs" from books. A faster solution than searching the book for each proverb in a list consists in generating all possibly overlapping subsequences of 4, 5, … characters from the book, and then using
comm
to identify which subsequences match known proverbs from our list. On my machine, this brought the running time from about 15 seconds to about 1 second.For completeness, here's how I did it in bash [...]. (Except
\w
does not work for Chinese characters, you need to substitute it with\p{Han}
and add the-u
flag topcregrep
.)
Have you found overlapping regex matches useful in other scenarios? Let me know! (Email on the front page.)
Browser support
As of March 2022:
- Lookahead assertions are supported by all browsers because they "are part of JavaScript's original regular expression support."
- Lookbehind assertions are supported by all modern browsers except Safari and iOS Safari.
String.prototype.matchAll()
is supported by all modern browsers.
Further resources
Wiktor Stribiżew's answer to How can I match overlapping strings with regex? on Stack Overflow
shows how to do overlapping matches by using
RegExp.prototype.exec()
and modifying the
regular expression instance's lastIndex
property.
This is useful if you can't use matchAll()
,
which isn't supported by Internet Explorer.
Wiktor's answer also shows the solution that I have presented here; I actually learned it from Wiktor's answer (thanks!). This blog post is mainly me teaching myself the topic by teaching it to you.