SQL REGEXP


In SQL, pattern matching is a vital feature when you need to search for complex string patterns. While the LIKE operator allows basic pattern matching, it is limited to simple wildcard characters like % and _. For more complex and flexible pattern matching, SQL provides the REGEXP (regular expression) operator.

The REGEXP operator allows you to perform advanced searches on text fields by using regular expressions (regex), which are a powerful tool for matching complex string patterns.


1. What is REGEXP in SQL?

The REGEXP operator allows you to match strings based on regular expressions, which are a set of patterns that can describe text. Regular expressions offer much more flexibility than simple wildcard matching and can be used for complex search patterns such as:

  • Matching specific patterns (e.g., phone numbers, email addresses).
  • Searching for multiple possible characters.
  • Matching strings with a certain number of occurrences or conditions.

The syntax for using REGEXP is as follows:

SELECT column_name
FROM table_name
WHERE column_name REGEXP 'pattern';
  • column_name: The column where you want to perform the pattern match.
  • pattern: The regular expression pattern to match.

2. Basic Syntax of REGEXP

The REGEXP operator in SQL works by evaluating whether the column value matches a given regular expression. Here’s the basic syntax:

SELECT column_name
FROM table_name
WHERE column_name REGEXP 'regular_expression';

You can also use NOT REGEXP to find rows that do not match the regular expression.

SELECT column_name
FROM table_name
WHERE column_name NOT REGEXP 'regular_expression';

3. Commonly Used Regular Expression Patterns

Regular expressions are a powerful tool for text matching. Here are some common patterns that you can use with the REGEXP operator:

3.1 Basic Character Matching

  • .: Matches any single character (except newline).
  • []: Matches any one of the enclosed characters.
    • Example: REGEXP '[aeiou]' matches any vowel.
  • ^: Matches the beginning of the string.
    • Example: REGEXP '^A' matches strings that start with 'A'.
  • $: Matches the end of the string.
    • Example: REGEXP 'ing$' matches strings that end with 'ing'.

3.2 Quantifiers

  • *: Matches 0 or more occurrences of the preceding element.
    • Example: REGEXP 'a*' matches zero or more occurrences of 'a'.
  • +: Matches 1 or more occurrences of the preceding element.
    • Example: REGEXP 'a+' matches one or more occurrences of 'a'.
  • {n}: Matches exactly n occurrences of the preceding element.
    • Example: REGEXP 'a{2}' matches exactly two consecutive 'a' characters.
  • {n,}: Matches n or more occurrences of the preceding element.
    • Example: REGEXP 'a{2,}' matches two or more consecutive 'a' characters.

3.3 Special Characters

  • \d: Matches any digit (equivalent to [0-9]).
    • Example: REGEXP '\\d' matches any single digit.
  • \w: Matches any alphanumeric character (equivalent to [A-Za-z0-9_]).
    • Example: REGEXP '\\w' matches any alphanumeric character.
  • \s: Matches any whitespace character (spaces, tabs, line breaks).
    • Example: REGEXP '\\s' matches any whitespace character.

3.4 Grouping and Alternation

  • |: Acts as a logical OR in regex, meaning it matches either the expression before or after it.
    • Example: REGEXP 'cat|dog' matches either 'cat' or 'dog'.
  • () : Used for grouping patterns.
    • Example: REGEXP '(cat|dog)s' matches 'cats' or 'dogs'.

4. Practical Examples of Using REGEXP in SQL

Let’s look at some practical use cases for the REGEXP operator.

4.1 Finding Rows That Match a Pattern

Let’s say we have a table users with a column email, and we want to find all the rows where the email addresses contain "gmail.com".

Example:
SELECT email
FROM users
WHERE email REGEXP 'gmail.com';

This query will return all email addresses that contain the text 'gmail.com'.

4.2 Matching Multiple Patterns

You can use REGEXP to match multiple patterns. For instance, if we want to find email addresses that end with either "gmail.com" or "yahoo.com":

Example:
SELECT email
FROM users
WHERE email REGEXP 'gmail.com|yahoo.com';

This query will return email addresses that either end in 'gmail.com' or 'yahoo.com'.

4.3 Extracting Data with a Regular Expression

Another common use of REGEXP is extracting data that matches a pattern. For example, if we have a phone_numbers table, and we want to find all phone numbers that match a specific pattern (e.g., phone numbers with a 3-digit area code):

Example:
SELECT phone_number
FROM phone_numbers
WHERE phone_number REGEXP '^[0-9]{3}-[0-9]{3}-[0-9]{4}$';

This query searches for phone numbers that match the pattern of ###-###-#### (e.g., 123-456-7890).

4.4 Case-Insensitive Matching

In many SQL implementations, REGEXP is case-sensitive. To make it case-insensitive, you can use the REGEXP BINARY modifier or convert both the column value and the pattern to the same case.

Example:
SELECT name
FROM employees
WHERE name REGEXP BINARY '^j';

This query will match names starting with 'j', but only in lowercase, because of the BINARY keyword.


5. Using REGEXP in SQL to Validate Data

One of the most common use cases for regular expressions is validating user input. For example, to check if a string is a valid email address, you can use REGEXP to define a regular expression pattern that matches the general structure of an email.

Example:
SELECT email
FROM users
WHERE email REGEXP '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$';

This query checks for valid email formats by matching a typical email pattern.


6. Limitations of REGEXP in SQL

While REGEXP is powerful, it is important to note that it might not be supported in every SQL database system (e.g., older versions of MySQL or SQLite might not support all regex features). Furthermore, complex regular expressions can sometimes lead to performance issues, especially when querying large datasets. Always optimize queries and use indexes where possible.