JavaScript Sets and ranges […]


Let’s get deeper into the details of regular expressions. In this chapter, we will show you how to use sets and ranges in JavaScript.

Putting several characters or character classes inside square brackets […] tells the regexp to search for any one of them.

To be precise, let’s consider an example. Here, [lam] means any one of the three characters ‘l’, ‘a’, or ‘m’. This is known as a “set”. Sets can be combined with regular characters in a regexp.
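As a minimal sketch of a set used together with regular characters, the snippet below searches for “op” preceded by any one character from [lam]. The sample string and the variable name are assumptions chosen for illustration.

```javascript
// [lam] matches exactly one character: "l", "a", or "m".
// The rest of the pattern, "op", is matched literally.
const setMatches = "mop top lop".match(/[lam]op/g);

console.log(setMatches); // ["mop", "lop"]; "top" is skipped because "t" is not in the set
```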

Although a set contains multiple characters, it corresponds to exactly one character in the match.

So a pattern such as W[3D]ocs finds no match in a string like “W3Docs”.

The pattern looks for W, then one of the characters in [3D], and, finally, ocs.

So it could match W3ocs or WDocs, but not W3Docs.
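The behavior described above can be checked with a short sketch. The test strings and variable names are assumptions made for illustration.

```javascript
// W, then exactly one of "3" or "D", then "ocs".
const oneCharPattern = /W[3D]ocs/;

// "W3Docs" has two characters between "W" and "ocs", so the set cannot cover both.
const noMatch = "W3Docs".match(oneCharPattern);
console.log(noMatch); // null

// With a single character from the set in that position, the pattern matches.
const withDigit = "W3ocs".match(oneCharPattern)[0];
const withLetter = "WDocs".match(oneCharPattern)[0];
console.log(withDigit, withLetter); // W3ocs WDocs
```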

Ranges

Square brackets can also include the so-called character ranges.

For example, [a-m] is a character in the range from “a” to “m”, and [0-7] is a digit from “0” to “7”.

For instance, the pattern x[0-9A-F][0-9A-F] looks for “x” followed by two characters, each a digit or a letter from A to F.

Here, [0-9A-F] includes two ranges: it looks for a character that is either a digit from “0” to “9” or a letter from “A” to “F”.
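A small sketch of two ranges combined in one set follows. The sample string and the variable name are assumptions for illustration.

```javascript
// "x" followed by two characters, each either a digit 0-9 or a letter A-F.
const hexMatches = "Exception 0xAF".match(/x[0-9A-F][0-9A-F]/g);

// The "x" inside "Exception" is followed by lowercase "ce", which is outside both ranges.
console.log(hexMatches); // ["xAF"]
```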

If you also want to match lowercase letters, you can either add the a-f range, giving [0-9A-Fa-f], or add the i flag.
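Both ways of handling lowercase letters can be compared in a short sketch. The sample string and the variable names are assumptions for illustration.

```javascript
const lowerHex = "Color 0xaf";

// Uppercase-only ranges do not match the lowercase "af".
const strictResult = lowerHex.match(/x[0-9A-F][0-9A-F]/g);

// Option 1: add the a-f range to the set.
const withExtraRange = lowerHex.match(/x[0-9A-Fa-f][0-9A-Fa-f]/g);

// Option 2: keep the set as is and make the whole pattern case-insensitive.
const withIFlag = lowerHex.match(/x[0-9A-F][0-9A-F]/gi);

console.log(strictResult);   // null
console.log(withExtraRange); // ["xaf"]
console.log(withIFlag);      // ["xaf"]
```

Note that the i flag applies to the whole pattern, so it would also let “x” match an uppercase “X”, while the extra range only affects the set.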

Inside […], you can also use character classes.

For example, to search for a word character \w or a hyphen -, the set is [\w-]. You can also combine several classes, such as [\s\d], which means a whitespace character or a digit.
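Here is a minimal sketch of character classes used inside a set. The sample strings and variable names are assumptions for illustration.

```javascript
// [\w-]+ : one or more characters, each a word character or a hyphen.
const wordOrHyphen = "well-known fact".match(/[\w-]+/g);
console.log(wordOrHyphen); // ["well-known", "fact"]

// [\s\d] : a single character that is either whitespace or a digit.
const spaceOrDigit = "a1 b2".match(/[\s\d]/g);
console.log(spaceOrDigit); // ["1", " ", "2"]
```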

Multilanguage \w

As \w is a shorthand for [a-zA-Z0-9_], it can’t find Cyrillic letters, Chinese characters, and so on.

A more universal pattern can be written to search for word characters in any language. With Unicode properties, it’s quite easy:

[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}].
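As a rough sketch of how this pattern can be used: Unicode property escapes \p{…} only work when the regexp has the u flag, so the flag is added here. The sample string and variable names are assumptions for illustration.

```javascript
// The u flag is required for \p{...} to be recognized.
const universalWordChar = /[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]/gu;

const mixedText = "Hi 你好 12";

// The universal set finds Latin letters, Chinese characters, and digits.
const universalMatches = mixedText.match(universalWordChar);
console.log(universalMatches); // ["H", "i", "你", "好", "1", "2"]

// Plain \w misses the Chinese characters.
const asciiMatches = mixedText.match(/\w/g);
console.log(asciiMatches); // ["H", "i", "1", "2"]
```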

Let’s interpret it. Like \w, it covers several kinds of characters, here selected by the following Unicode properties:

  • for letters – Alphabetic (Alpha).
  • for accents – Mark (M).
  • for digits – Decimal_Number (Nd).
  • for underscore and similar characters – Connector_Punctuation (Pc).
  • for the two special codes 200c and 200d, used in ligatures, for example in Arabic – Join_Control (Join_C).
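
Inside square brackets, most special characters can be used without escaping:

  • the . + ( ) symbols never need escaping.
  • a hyphen - is not escaped at the start or the end, where it does not define a range.
  • a caret ^ is only escaped at the start, where it would otherwise mean exclusion.
  • the closing square bracket ] is always escaped if you need to search for it.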
  • the left half of Ģ(1).
  • the right half of Ģ(2).
  • the left half of Ç(3).
  • the right half of Ç(4).
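The escaping rules for characters inside square brackets, listed earlier, can be illustrated with a short sketch. The sample strings and variable names are assumptions for illustration.

```javascript
// Inside [...], the characters - ( ) . ^ + are treated literally here:
// the hyphen comes first, and the caret is not at the start.
const unescapedSet = /[-().^+]/g;
const unescapedResult = "1 + 2 - 3".match(unescapedSet);
console.log(unescapedResult); // ["+", "-"]

// Escaping them anyway is allowed and gives the same result.
const escapedSet = /[\-\(\)\.\^\+]/g;
const escapedResult = "1 + 2 - 3".match(escapedSet);
console.log(escapedResult); // ["+", "-"]

// A closing square bracket has to be escaped to be searched for.
const bracketResult = "a]b".match(/[\]]/g);
console.log(bracketResult); // ["]"]
```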

