\d in a regex isn't always equal to [0-9]
Published in JavaScript and Regular expressions
In JavaScript, \d and [0-9] are equivalent and match only Arabic numerals (the digits 0–9).
But in some other languages, \d also matches non-Arabic numerals.
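The JavaScript half of that claim is easy to brute-force. Here is a minimal sketch (the helper name `matchedCharacters` is made up for this example) that tries every Unicode code point against \d, with and without the `u` flag, and against [0-9]:

```javascript
// Collect every Unicode code point that the given regex matches
// when the whole input is that single character.
function matchedCharacters(regex) {
  const matched = [];
  for (let codePoint = 0; codePoint <= 0x10ffff; codePoint++) {
    const ch = String.fromCodePoint(codePoint);
    if (regex.test(ch)) matched.push(ch);
  }
  return matched.join("");
}

const withoutFlag = matchedCharacters(/^\d$/);
const withUnicodeFlag = matchedCharacters(/^\d$/u);
const explicitRange = matchedCharacters(/^[0-9]$/u);

console.log(withoutFlag);     // 0123456789
console.log(withUnicodeFlag); // 0123456789
console.log(explicitRange);   // 0123456789
```

All three patterns end up with the same ten characters, so in JavaScript the `u` flag doesn't widen \d.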
There are 680 Unicode characters in the "Number, Decimal Digit" category. For example:
Name | Characters |
---|---|
Digits (Arabic numerals) | 0123456789 |
Arabic-Indic digits | ٠١٢٣٤٥٦٧٨٩ |
Extended Arabic-Indic digits | ۰۱۲۳۴۵۶۷۸۹ |
NKo digits | ߀߁߂߃߄߅߆߇߈߉ |
Devanagari digits | ०१२३४५६७८९ |
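JavaScript can match this category too, but only when asked explicitly: with the `u` flag, \p{Nd} stands for "Number, Decimal Digit". The sketch below rebuilds the five rows of the table from their code points and tests them against \p{Nd} and \d. The helper `digitRun` and the variable names are assumptions of this example. The last part counts the category members known to the JavaScript engine running it; that number depends on the engine's Unicode version, so it may not be exactly 680.

```javascript
// Build ten consecutive characters starting at the code point of a zero digit.
function digitRun(zeroCodePoint) {
  return Array.from({ length: 10 }, (_, i) =>
    String.fromCodePoint(zeroCodePoint + i)
  ).join("");
}

const rows = {
  "Digits (Arabic numerals)": digitRun(0x0030),
  "Arabic-Indic digits": digitRun(0x0660),
  "Extended Arabic-Indic digits": digitRun(0x06f0),
  "NKo digits": digitRun(0x07c0),
  "Devanagari digits": digitRun(0x0966),
};

const decimalDigits = /^\p{Nd}+$/u; // any "Number, Decimal Digit" characters
const asciiDigits = /^\d+$/;        // in JavaScript: 0-9 only

for (const [name, digits] of Object.entries(rows)) {
  // \p{Nd} accepts every row; \d accepts only the first one.
  console.log(name, digits, decimalDigits.test(digits), asciiDigits.test(digits));
}

// Count the characters in the category according to this engine.
const singleDecimalDigit = /^\p{Nd}$/u;
let ndCount = 0;
for (let codePoint = 0; codePoint <= 0x10ffff; codePoint++) {
  if (singleDecimalDigit.test(String.fromCodePoint(codePoint))) ndCount++;
}
console.log(ndCount);
```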
Testing on regex101.com:
- \d in JavaScript matches 10 of them (Arabic numerals only)
- \d in C# matches 370 of them
- \d in Python matches 540 of them
- \d in Rust matches all 680
(I haven't tested whether \d in those three languages also matches characters other than those 680.)
[0-9] matches only Arabic numerals in those four languages, but even [0-9] can't always be trusted (a few typos fixed in the quotation):
> It is generally believed that [0-9] matches only the ASCII digits 0123456789. That is painfully false in some instances: Linux in some locale that is not "C" (June 2020) systems, for example:
>
> Assume:
>
>     str='0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९'
>
> Try grep to discover that it allows most of them:
>
>     $ echo "$str" | grep -o '[0-9]\+'
>     0123456789
>     ٠١٢٣٤٥٦٧٨
>     ۰۱۲۳۴۵۶۷۸
>     ߀߁߂߃߄߅߆߇߈
>     ०१२३४५६७८
>
> sed has some troubles. Should remove only 0123456789 but removes almost all digits. That means that it accepts most digits but not some nines (???):
>
>     $ echo "$str" | sed 's/[0-9]\{1,\}//g'
>     ٩ ۹ ߉ ९
>
> Even expr suffers from the same issues as sed:
>
>     expr "$str" : '\([0-9 ]*\)' # also matching spaces
>     0123456789 ٠١٢٣٤٥٦٧٨
>
> And also ed:
>
>     printf '%s\n' 's/[0-9]/x/g' '1,p' Q | ed -v <(echo "$str")
>     105
>     xxxxxxxxxx xxxxxxxxx٩ xxxxxxxxx۹ xxxxxxxxx߉ xxxxxxxxx९
Huh. Curious.
I guess I'd better remember to avoid \d and prefer [0-9] or even [0123456789] when using languages other than JavaScript.
When writing JavaScript, I'll continue using \d, as it's just as clear (d = digit) as [0-9] and shorter.