Dowemo
0 0 0 0


Question:

I have a table with ~500k rows; varchar(255) UTF8 column filename contains a file name;

I'm trying to strip out various strange characters out of the filename - thought I'd use a character class: [^a-zA-Z0-9()_ .-]

Now, is there a function in MySQL that lets you replace through a regular expression? I'm looking for a similar functionality to REPLACE() function - simplified example follows:

SELECT REPLACE('stackowerflow', 'ower', 'over');
Output: "stackoverflow"
/* does something like this exist? */
SELECT X_REG_REPLACE('Stackoverflow','/[A-Zf]/','-'); 
Output: "-tackover-low"

I know about REGEXP/RLIKE, but those only check if there is a match, not what the match is.

(I could do a "SELECT pkey_id,filename FROM foo WHERE filename RLIKE '[^a-zA-Z0-9()_ .-]'" from a PHP script, do a preg_replace and then "UPDATE foo ... WHERE pkey_id=...", but that looks like a last-resort slow & ugly hack)


Best Answer:


The following expands upon the function provided by Rasika Godawatte but trawls through all necessary substrings rather than just testing single characters:

CREATE FUNCTION reg_replace(subject VARCHAR(21845), pattern VARCHAR(21845),
  replacement VARCHAR(21845), greedy BOOLEAN, minMatchLen INT, maxMatchLen INT)
RETURNS VARCHAR(21845) DETERMINISTIC BEGIN 
  DECLARE result, subStr, usePattern VARCHAR(21845); 
  DECLARE startPos, prevStartPos, startInc, len, lenInc INT;
  IF subject REGEXP pattern THEN
    SET result = '';
    -- Sanitize input parameter values
    SET minMatchLen = IF(minMatchLen < 1, 1, minMatchLen);
    SET maxMatchLen = IF(maxMatchLen < 1 OR maxMatchLen > CHAR_LENGTH(subject),
                         CHAR_LENGTH(subject), maxMatchLen);
    -- Set the pattern to use to match an entire string rather than part of a string
    SET usePattern = IF (LEFT(pattern, 1) = '^', pattern, CONCAT('^', pattern));
    SET usePattern = IF (RIGHT(pattern, 1) = '$', usePattern, CONCAT(usePattern, '$'));
    -- Set start position to 1 if pattern starts with ^ or doesn't end with $.
    IF LEFT(pattern, 1) = '^' OR RIGHT(pattern, 1) <> '$' THEN
      SET startPos = 1, startInc = 1;
    -- Otherwise (i.e. pattern ends with $ but doesn't start with ^): Set start position
    -- to the min or max match length from the end (depending on "greedy" flag).
    ELSEIF greedy THEN
      SET startPos = CHAR_LENGTH(subject) - maxMatchLen + 1, startInc = 1;
    ELSE
      SET startPos = CHAR_LENGTH(subject) - minMatchLen + 1, startInc = -1;
    END IF;
    WHILE startPos >= 1 AND startPos <= CHAR_LENGTH(subject)
      AND startPos + minMatchLen - 1 <= CHAR_LENGTH(subject)
      AND !(LEFT(pattern, 1) = '^' AND startPos <> 1)
      AND !(RIGHT(pattern, 1) = '$'
            AND startPos + maxMatchLen - 1 < CHAR_LENGTH(subject)) DO
      -- Set start length to maximum if matching greedily or pattern ends with $.
      -- Otherwise set starting length to the minimum match length.
      IF greedy OR RIGHT(pattern, 1) = '$' THEN
        SET len = LEAST(CHAR_LENGTH(subject) - startPos + 1, maxMatchLen), lenInc = -1;
      ELSE
        SET len = minMatchLen, lenInc = 1;
      END IF;
      SET prevStartPos = startPos;
      lenLoop: WHILE len >= 1 AND len <= maxMatchLen
                 AND startPos + len - 1 <= CHAR_LENGTH(subject)
                 AND !(RIGHT(pattern, 1) = '$' 
                       AND startPos + len - 1 <> CHAR_LENGTH(subject)) DO
        SET subStr = SUBSTRING(subject, startPos, len);
        IF subStr REGEXP usePattern THEN
          SET result = IF(startInc = 1,
                          CONCAT(result, replacement), CONCAT(replacement, result));
          SET startPos = startPos + startInc * len;
          LEAVE lenLoop;
        END IF;
        SET len = len + lenInc;
      END WHILE;
      IF (startPos = prevStartPos) THEN
        SET result = IF(startInc = 1, CONCAT(result, SUBSTRING(subject, startPos, 1)),
                        CONCAT(SUBSTRING(subject, startPos, 1), result));
        SET startPos = startPos + startInc;
      END IF;
    END WHILE;
    IF startInc = 1 AND startPos <= CHAR_LENGTH(subject) THEN
      SET result = CONCAT(result, RIGHT(subject, CHAR_LENGTH(subject) + 1 - startPos));
    ELSEIF startInc = -1 AND startPos >= 1 THEN
      SET result = CONCAT(LEFT(subject, startPos), result);
    END IF;
  ELSE
    SET result = subject;
  END IF;
  RETURN result;
END//

Demo

SQL Fiddle Demoor Rextester Demo

Limitations

  • This method is of course going to take a while when the subject string is large. Update: Have now added minimum and maximum match length parameters for improved efficiency when these are known (zero = unknown/unlimited).
  • It won't allow substitution of backreferences (e.g. 1, 2etc.) to replace capturing groups. If this functionality is needed, please see this answer which attempts to provide a workaround by updating the function to allow a secondary find and replace within each found match (at the expense of increased complexity).
  • If ^and/or $ is used in the pattern, they must be at the very start and very end respectively - e.g. patterns such as (^start|end$) are not supported.
  • There is a "greedy" flag to specify whether the overall matching should be greedy or non-greedy. Combining greedy and lazy matching within a single regular expression (e.g. a.*?b.*) is not supported.



  • Copyright © 2011 Dowemo All rights reserved.    Creative Commons   AboutUs