To redact PII (Personally Identifiable Information) such as names, numbers, company names or email addresses from text strings such as comments, we use one generic regex pattern. It can be used in a SQL statement to remove most PII.
By our estimate it currently removes around 95% of PII in structured sentences. It would need some AI input to remove every first name and surname typed in lower case, because our pattern presumes that users will write their name in Capitalised Words nine times out of ten. We are not sure whether there is a cultural slant towards typing your name in lowercase or Capitalised Case, but we presume the majority capitalise.
Our regex pattern will also throw away entire sentences WHEN A USER TYPES IN ALL CAPITAL LETTERS, which we think is not such a bad thing!
It also needs some improvement for when a user signs off with just their Firstname at the end of a comment, because we have tried to retain single capitalised words at the beginning of sentences. Further tweaking and testing may be required to catch single capitalised words at the end of an entire comment string, as well as any other edge cases that slip through the net. Again, this could be a task better suited to a machine learning model or some other AI function.
This is the solution we used for the PII Redact Regex in a SQL regexp_replace statement.
\d+|(?:.)(?<!\.\s|\p{Sc})[[A-Z]{1}&&[^.]]\w*|\S+@\S+\.\S+
Here is the regular expression broken down into its parts, which are separated by the | delimiter.
The first part, \d+, matches any run of digits, such as the digits in a currency amount ($1000) or a phone number (+44 1234 567 891).
\d+
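As a rough illustration of this first part on its own, here is a minimal Python sketch. The function name strip_digits and the sample sentence are made up for the example, and Python's re module stands in for the SQL regexp_replace call.

```python
import re

# First alternative of the pattern: one or more consecutive digits.
DIGITS = re.compile(r"\d+")

def strip_digits(text: str) -> str:
    """Remove every run of digits; surrounding symbols and spaces stay."""
    return DIGITS.sub("", text)

# Hypothetical sample comment, not taken from real data.
print(strip_digits("Call +44 1234 567 891 about the $1000 refund"))
```

Only the digits are removed here: the +, the $ and the spaces between the number groups are left behind.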
The second part matches any name or company name that is capitalised or in capitals, such as “Firstname Surname” or “JOHN JONES”.
[[A-Z]{1}&&[^.]]\w*
Before the second part we insert (?:.), which consumes the single character in front of the capital letter (typically a space), followed by the negative lookbehind ?<!\.\s which rejects a full stop followed by white space, so the pattern will not match capitalised words at the beginning of sentences.
And \p{Sc} rejects a preceding currency symbol such as $ £ € ¥ ₨ ฿
(?<!\.\s|\p{Sc})
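The lookbehind behaviour can be sketched in Python, with some assumptions. Python's re module has no class intersection (&&) and no \p{Sc}, and it requires fixed-width lookbehinds, so this sketch assumes the capital-letter class is meant as [A-Z], swaps \p{Sc} for an explicit list of the currency symbols mentioned above, and splits the lookbehind into two. The name strip_capitalised and the sample sentences are hypothetical.

```python
import re

# Assumed Python rendering of the second alternative. The lookbehind is
# split in two because Python's re needs fixed-width lookbehinds, and the
# explicit currency list stands in for \p{Sc}.
CAPITALISED = re.compile(r"(?:.)(?<!\.\s)(?<![$£€¥₨฿])[A-Z]\w*")

def strip_capitalised(text: str) -> str:
    """Remove capitalised words unless they follow '. ' or a currency symbol."""
    return CAPITALISED.sub("", text)

print(strip_capitalised("The agent was great. Please thank Jane Doe for me"))
# The agent was great. Please thank for me
print(strip_capitalised("It cost $Fifty"))
# It cost $Fifty
```

Note that the match includes the character in front of the name, so the space before each removed word goes with it.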
The final part is a simple form of an email pattern match
\S+@\S+\.\S+
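Putting the three parts together, a Python approximation of the regexp_replace call might look like the sketch below. It is not the SQL we run: it assumes [A-Z] for the capital-letter class, an explicit currency list in place of \p{Sc}, and a lookbehind split into two fixed-width ones, because Python's re module does not support the original syntax. The names PII and redact and the sample comments are invented for the example.

```python
import re

# Assumed Python equivalent of the three-part pattern.
PII = re.compile(
    r"\d+"                                    # any run of digits
    r"|(?:.)(?<!\.\s)(?<![$£€¥₨฿])[A-Z]\w*"  # capitalised word not opening a sentence
    r"|\S+@\S+\.\S+"                          # simple email shape
)

def redact(comment: str) -> str:
    """Replace every match with an empty string, like regexp_replace(col, pattern, '')."""
    return PII.sub("", comment)

sample = ("Great service. Please email Jane Doe at "
          "jane.doe@example.com or ring 01234 567891")
print(repr(redact(sample)))

# An all-capitals comment is removed completely.
print(repr(redact("THIS IS TERRIBLE")))
```

The redacted text keeps the spaces that surrounded the email and the phone number, so a follow-up step to collapse repeated white space may be worth considering.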
UPDATE 13/07/2023
We made the update below a while back so that the pattern also removes single capitalised name sign-offs at the end of a structured sentence, as discussed above.
\d+|(?:.)(?<!\.\s|\p{Sc})[[A-Z]{1}&&[^.]]\w*|\S+@\S+\.\S+|[[A-Z]{1}&&[^.]]\w*$
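To see what the extra fourth part changes, here is a hedged Python sketch of the updated pattern. As before, it is an approximation: [A-Z] is assumed for the capital-letter class, an explicit currency list replaces \p{Sc}, and the lookbehind is split in two to suit Python's re module. PII_V2, redact_v2 and the sample comments are hypothetical.

```python
import re

# Assumed Python equivalent of the updated four-part pattern.
PII_V2 = re.compile(
    r"\d+"                                    # any run of digits
    r"|(?:.)(?<!\.\s)(?<![$£€¥₨฿])[A-Z]\w*"  # capitalised word not opening a sentence
    r"|\S+@\S+\.\S+"                          # simple email shape
    r"|[A-Z]\w*$"                             # capitalised word ending the comment
)

def redact_v2(comment: str) -> str:
    """Replace every match with an empty string."""
    return PII_V2.sub("", comment)

# The sign-off follows a full stop, so only the fourth part catches it.
print(repr(redact_v2("Thanks for sorting this out. John")))
# 'Thanks for sorting this out. '

# A sign-off followed by punctuation is not at the very end, so it stays.
print(repr(redact_v2("Thanks for sorting this out. John.")))
# 'Thanks for sorting this out. John.'
```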
Try the example here: https://regex101.com/r/nG9SMM/1 and let us know if this works.
Try to spot the one thing that we found would break it. In our opinion, though, it should perform better than the prior regex.
Please leave a comment, post or share.