Tidy URLs for aggregation in Presto SQL

The longer tittle includes the text – “with regexp_replace and regexp_extract”. This is the important bit and since we love Regular Expressions we had to include it in the first paragraph instead.

This was our solution for a recent requirement on tidying URLs in a dataset for pageviews with the aim to make them more aggregated for easier, better analysis and user journey charting.

Say you have URLs that contain numbers in them at the end or have alphanumeric sequences inside them like.

mydomain.com/directory/1
mydomain.com/er3998-0klh9-982njkns/directory/insert/1

BUT we want to keep year ranges in the URLs at the end such as

mydomain.com/page-level1/2010-2011

So we want to look for numbers of up to 1-3 chars length at the end of URL, without a query string at end (as we strip this out in a separate sequence to start off with) and where alphanumeric sequences are inside the URL.

Regex used

\/[0-9]+[A-Za-z].+[\-].+|\/[A-Za-z]+[0-9].+[\-].+|\/(?![0-9]{4}\-[0-9]{4})(?:[A-Z]|[0-9])+[\-].+\/|\/[0-9]{1,3}(?:$|\/)

In our SQL we will then regexp_replace these numbers and alphanumerics with [AN] as an acronyn for alphanumeric to aggregate them.

We have used some regex_replace and regex_extract in superset SQL editor in combination with a coalese function to achieve our objective = Tidied URLs ready for aggregated analysis in Apache Superset data visualiaton charts.

You can read about Presto or Trino SQL regular expression functions

For Presto regular expression functions >>

https://prestodb.io/docs/current/functions/regexp.html

For Trino regular expression functions >>

https://trino.io/docs/current/functions/regexp.html

coalesce(regexp_replace(regexp_extract(page, '(.*)\?.*$', 1),'\/\d{1,3}$', '/[AN]/' ), -- # alphanumeric could be /123/
regexp_replace(page,'\/[0-9]+[A-Za-z].+[\-].+|\/[A-Za-z]+[0-9].+[\-].+|\/(?![0-9]{4}\-[0-9]{4})(?:[A-Z]|[0-9])+[\-].+\/|\/[0-9]{1,3}(?:$|\/)', '/[AN]/' ),-- this [AN]or alphanumeric could be /123/ or /I48-TR18-T7S/ or /YPB-EMZG-IPLA/ or /dd6c6129-9956-42c3-ab19-8184669c04ec but should retain urls that have say /2021-2022/ in them such as https://www.mydomain/pagelevel/2021-2022/regex-example
regexp_extract(page, '(.*)\?.*$', 1),
regexp_extract(page, '(.*)', 1)) AS "short_page",

-- short_page first extracts query string and replaces any 1-3 digits at end that remain at end of url with /[AN]/
-- then looks for numbers without a query string at end to replace with /AN/ at end or in the string as well as replacing alphanumeric strings with /[AN]/
-- then extracts where there is just a query string left i.e. urls with text+query string
-- then extracts anything else that doesn't meet these conditions

We are sure a shortened solution exists so let us know in the comments if you like or would use a different regex solution.

Presto SQL uses JAVA 8 flavour of regex when using regex101 if you want to test this there.

One response to “Tidy URLs for aggregation in Presto SQL”

Leave a Reply Cancel reply

Written By :

Category :

Posted On :

Share This :

Quick Links

Contact