\d in a regex isn't always equal to [0-9]
Published in JavaScript and Regular expressions
In JavaScript, \d and [0-9] are equivalent and match only Arabic numerals (the digits 0–9).
But in some other languages, \d also matches non-Arabic numerals.
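A quick way to check the JavaScript half of that claim is to run both patterns against digits from a few scripts. This is only a sketch: the sample characters and the helper name `matchesBoth` are my own choices for illustration.

```javascript
// Sample "seven" digits from different scripts (illustrative choice).
const samples = {
  ascii: "7",
  arabicIndic: "\u0667", // ARABIC-INDIC DIGIT SEVEN
  devanagari: "\u096d",  // DEVANAGARI DIGIT SEVEN
};

// Returns [result of \d, result of [0-9]] for a one-character string.
function matchesBoth(ch) {
  return [/^\d$/.test(ch), /^[0-9]$/.test(ch)];
}

for (const [name, ch] of Object.entries(samples)) {
  console.log(name, matchesBoth(ch));
}

// The u (unicode) flag does not change what \d means in JavaScript.
console.log(/^\d$/u.test(samples.arabicIndic));
```

Only the ASCII sample should be accepted, and both patterns should always agree with each other.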
There are 680 Unicode characters in the "Number, Decimal Digit" category. For example:
| Name | Characters |
|---|---|
| Digits (Arabic numerals) | 0123456789 |
| Arabic-Indic digits | ٠١٢٣٤٥٦٧٨٩ |
| Extended Arabic-Indic digits | ۰۱۲۳۴۵۶۷۸۹ |
| NKo digits | ߀߁߂߃߄߅߆߇߈߉ |
| Devanagari digits | ०१२३४५६७८९ |
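Each row of the table is a run of ten consecutive code points, so a digit's numeric value is its offset from that script's zero. Here is a small sketch of that idea. The `zeros` list covers only the five rows above, and `digitValue` is a hypothetical helper name, not a built-in.

```javascript
// Code point of the zero digit for each row of the table above.
const zeros = [
  0x0030, // Digits (Arabic numerals)
  0x0660, // Arabic-Indic digits
  0x06f0, // Extended Arabic-Indic digits
  0x07c0, // NKo digits
  0x0966, // Devanagari digits
];

// Returns 0-9 for a digit from one of the scripts above, otherwise null.
function digitValue(ch) {
  const cp = ch.codePointAt(0);
  for (const zero of zeros) {
    if (cp >= zero && cp <= zero + 9) return cp - zero;
  }
  return null;
}

console.log(digitValue("5"));      // ASCII five
console.log(digitValue("\u0665")); // Arabic-Indic five
console.log(digitValue("\u096b")); // Devanagari five
console.log(digitValue("x"));      // not a digit from the table
```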
Testing on regex101.com:
- \d in JavaScript matches 10 of them (Arabic numerals only)
- \d in C# matches 370 of them
- \d in Python matches 540 of them
- \d in Rust matches all 680
(I haven't tested whether \d in those three languages also matches characters other than those 680.)
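The JavaScript count can also be reproduced locally instead of on regex101.com. The sketch below walks every Unicode code point and counts matches. `countMatches` is a hypothetical helper name, and \p{Nd} (the Unicode property escape for "Number, Decimal Digit", which needs the u flag) is used here only as a stand-in for the broader \d of other languages. The \p{Nd} total depends on the Unicode version bundled with your JavaScript engine, so it may not be exactly 680.

```javascript
// Counts how many Unicode code points (U+0000 to U+10FFFF) a regex matches.
// The regex should be anchored to a single character and have no g flag.
function countMatches(regex) {
  let count = 0;
  for (let cp = 0; cp <= 0x10ffff; cp++) {
    if (regex.test(String.fromCodePoint(cp))) count++;
  }
  return count;
}

console.log("\\d     :", countMatches(/^\d$/u));
console.log("\\p{Nd} :", countMatches(/^\p{Nd}$/u));
```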
[0-9] matches only Arabic numerals in those four languages,
but even [0-9] can't always be trusted
(a few typos fixed in the quotation):
> It is generally believed that [0-9] matches only the ASCII digits 0123456789. That is painfully false in some instances: Linux in some locale that is not "C" (June 2020) systems, for example:
>
> Assume:
>
>     str='0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९'
>
> Try grep to discover that it allows most of them:
>
>     $ echo "$str" | grep -o '[0-9]\+'
>     0123456789
>     ٠١٢٣٤٥٦٧٨
>     ۰۱۲۳۴۵۶۷۸
>     ߀߁߂߃߄߅߆߇߈
>     ०१२३४५६७८
>
> sed has some troubles. Should remove only 0123456789 but removes almost all digits. That means that it accepts most digits but not some nines (???):
>
>     $ echo "$str" | sed 's/[0-9]\{1,\}//g'
>      ٩ ۹ ߉ ९
>
> Even expr suffers from the same issues as sed:
>
>     expr "$str" : '\([0-9 ]*\)' # also matching spaces
>     0123456789 ٠١٢٣٤٥٦٧٨
>
> And also ed:
>
>     printf '%s\n' 's/[0-9]/x/g' '1,p' Q | ed -v <(echo "$str")
>     105
>     xxxxxxxxxx xxxxxxxxx٩ xxxxxxxxx۹ xxxxxxxxx߉ xxxxxxxxx९
Huh. Curious.
I guess I'd better remember to avoid \d and prefer [0-9] or even [0123456789]
when using languages other than JavaScript.
When writing JavaScript,
I'll continue using \d as it's just as clear as [0-9] (d = digit) and shorter.
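To get a feel for why a broad digit class can bite, here is a rough sketch in JavaScript. It assumes \p{Nd} as a stand-in for what \d means in some other languages, and the variable names and sample input are made up for illustration.

```javascript
// \p{Nd} stands in for the broader \d of some other languages (assumption).
const broadDigits = /^\p{Nd}+$/u;
// The explicit class accepts ASCII digits only.
const asciiDigits = /^[0-9]+$/;

const input = "\u0664\u0662"; // Arabic-Indic digits four and two

console.log(broadDigits.test(input)); // the broad pattern accepts the input
console.log(asciiDigits.test(input)); // the explicit class rejects it
console.log(Number(input));           // Number() parses ASCII digits only
```

Input that passes the broad validation still can't be parsed by Number(), which is the kind of mismatch the explicit class avoids.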