\d in a regex isn't always equal to [0-9]

Published in JavaScript and Regular expressions

In JavaScript, \d and [0-9] are equivalent: both match only the Arabic numerals (the ASCII digits 0–9). In some other languages, however, \d also matches non-Arabic numerals.
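A quick way to see the JavaScript behavior is to compare \d with the Unicode property escape \p{Nd} ("Number, Decimal Digit"), which needs the u flag. This is only a minimal sketch: the helper names isAsciiDigits and isUnicodeDigits and the sample strings are mine, made up for illustration.

```javascript
// Sample inputs. The non-ASCII digits are written as escapes so the source stays unambiguous.
const samples = {
  ascii: '2024',
  arabicIndic: '\u0662\u0660\u0662\u0664', // 2024 in Arabic-Indic digits
  devanagari: '\u0968\u0966\u0968\u096A', // 2024 in Devanagari digits
};

// In JavaScript, \d means [0-9], with or without the u flag.
function isAsciiDigits(text) {
  return /^\d+$/u.test(text);
}

// \p{Nd} matches any character in the "Number, Decimal Digit" category.
function isUnicodeDigits(text) {
  return /^\p{Nd}+$/u.test(text);
}

for (const [name, text] of Object.entries(samples)) {
  console.log(name, isAsciiDigits(text), isUnicodeDigits(text));
}
// ascii true true
// arabicIndic false true
// devanagari false true
```

So in JavaScript you have to ask for Unicode digits explicitly with \p{Nd}; \d never gives them to you by accident.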

There are 680 Unicode characters in the "Number, Decimal Digit" category. For example:

Name Characters
Digits (Arabic numerals) 0123456789
Arabic-Indic digits ٠١٢٣٤٥٦٧٨٩
Extended Arabic-Indic digits ۰۱۲۳۴۵۶۷۸۹
NKo digits ߀߁߂߃߄߅߆߇߈߉
Devanagari digits ०१२३४५६७८९
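Each of the digit sets in the table above occupies ten consecutive code points, starting from that script's zero. The sketch below rebuilds the table rows from those starting code points; the names zeroCodePoints and digitsFrom are made up for this example.

```javascript
// Code point of the zero digit for each row of the table.
const zeroCodePoints = {
  'Digits (Arabic numerals)': 0x0030,
  'Arabic-Indic digits': 0x0660,
  'Extended Arabic-Indic digits': 0x06f0,
  'NKo digits': 0x07c0,
  'Devanagari digits': 0x0966,
};

// Builds the ten digits of a script from the code point of its zero.
function digitsFrom(zeroCodePoint) {
  return Array.from({ length: 10 }, (_, value) =>
    String.fromCodePoint(zeroCodePoint + value)
  ).join('');
}

for (const [name, zero] of Object.entries(zeroCodePoints)) {
  const digits = digitsFrom(zero);
  // Every generated character should be in the "Number, Decimal Digit" category.
  console.log(name, digits, /^\p{Nd}{10}$/u.test(digits));
}
```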

Testing on regex101.com:

(I haven't tested whether \d in those three languages also matches characters other than those 680.)
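One way to run that kind of check is to try a pattern against every Unicode code point and count the matches. Below is a rough sketch of the idea in JavaScript, where I'd expect \d to match exactly ten characters. The function name countMatchingCodePoints is made up, and the \p{Nd} count depends on the Unicode version your JavaScript engine ships with, so it may not be 680.

```javascript
// Counts how many single code points the given pattern matches.
// The pattern should be anchored (^...$) and must not use the g or y flag,
// because those make test() stateful.
function countMatchingCodePoints(pattern) {
  let count = 0;
  for (let codePoint = 0; codePoint <= 0x10ffff; codePoint++) {
    // Skip surrogates: they are not characters on their own.
    if (codePoint >= 0xd800 && codePoint <= 0xdfff) continue;
    if (pattern.test(String.fromCodePoint(codePoint))) count++;
  }
  return count;
}

console.log(countMatchingCodePoints(/^\d$/u)); // 10
console.log(countMatchingCodePoints(/^\p{Nd}$/u)); // depends on the engine's Unicode version
```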

[0-9] matches only Arabic numerals in those four languages, but even [0-9] can't always be trusted (a few typos fixed in the quotation):

It is generally believed that [0-9] matches only the ASCII digits 0123456789. That is painfully false in some instances, for example on Linux systems in some locale that is not "C" (June 2020):

Assume:

str='0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९'

Try grep to discover that it allows most of them:

$ echo "$str" | grep -o '[0-9]\+'
0123456789
٠١٢٣٤٥٦٧٨
۰۱۲۳۴۵۶۷۸
߀߁߂߃߄߅߆߇߈
०१२३४५६७८

sed has some troubles. Should remove only 0123456789 but removes almost all digits. That means that it accepts most digits but not some nines (???):

$ echo "$str" | sed 's/[0-9]\{1,\}//g'
 ٩ ۹ ߉ ९

Even expr suffers from the same issues as sed:

$ expr "$str" : '\([0-9 ]*\)' # also matching spaces
0123456789 ٠١٢٣٤٥٦٧٨

And also ed:

$ printf '%s\n' 's/[0-9]/x/g' '1,p' Q | ed -v <(echo "$str")
105
xxxxxxxxxx xxxxxxxxx٩ xxxxxxxxx۹ xxxxxxxxx߉ xxxxxxxxx९

Huh. Curious.

I guess I'd better remember to avoid \d and prefer [0-9], or even [0123456789], when using languages other than JavaScript.

When writing JavaScript, I'll continue using \d, as it's just as clear as [0-9] (d = digit) and shorter.