Analytics Engineering Â· SQL Â· Regex

Taming URL Chaos: Regex Strategies for Normalising GOV.UK Page Paths

How to group analytics events by meaningful page structure when your URLs contain user IDs, tax reference numbers, and random alphanumeric tokens

The Complete Query

Here’s the full SQL in context. The sections below break down each regex component in detail.

SQL

select DATE, generated_at, referrer,
regexp_extract(concat('/', try(array_join(slice(short_referrer_array, 4, cardinality(short_referrer_array) - 1), '/'))),
'(?>/check-income-tax|^)(personal-account|.*)', 1) AS "short_referrer",
short_page, user_agent_string, event_id, request_id,
session_id, transaction_name, detail_method, page_path, Subject
from (
SELECT
DATE, generated_at,
detail.referrer AS "referrer",
split(coalesce(
regexp_replace(regexp_extract(detail.referrer, '(.*)\?.*$', 1),'\/\d{1,3}$', '/[AN]/'),
regexp_replace(detail.referrer,'\/[0-9]+[A-Za-z].+[\-].+|\/[A-Za-z]+[0-9].+[\-].+|\/(?![0-9]{4}\-[0-9]{4})(?:[A-Z]|[0-9])+[\-].+\/|\/[0-9]{1,3}(?:$|\/)', '/[AN]/'),
regexp_extract(detail.referrer, '(.*)\?.*$', 1),
regexp_extract(detail.referrer, '(.*)', 1)),
'/') AS "short_referrer_array",
detail.user_agent_string AS "user_agent_string",
event_id, tags.x_request_id AS "request_id",
tags.x_session_id AS "session_id",
tags.transaction_name AS "transaction_name",
detail.method AS "detail_method",
tags.path AS "page_path",
coalesce(
regexp_replace(regexp_extract(tags.path, '/check-income-tax(.*)\?.*$', 1),'\/\d{1,3}$', '/[AN]/'),
regexp_replace(regexp_extract(tags.path, '/check-income-tax(.*)', 1),'\/[0-9]+[A-Za-z].+[\-].+|\/[A-Za-z]+[0-9].+[\-].+|\/(?![0-9]{4}\-[0-9]{4})(?:[A-Z]|[0-9])+[\-].+\/|\/[0-9]{1,3}(?:$|\/)', '/[AN]/'),
regexp_extract(tags.path, '/check-income-tax(.*)\?.*$', 1),
regexp_extract(tags.path, '/check-income-tax(.*)', 1)) AS "short_page",
CASE
WHEN regexp_like(tags.path, '.*print-your.*') THEN 'Printing'
WHEN regexp_like(tags.path, '.*current-year.*') THEN 'Current Year'
WHEN regexp_like(tags.path, '.*tax-code.*') THEN 'Tax Code'
WHEN regexp_like(tags.path, '.*employment.*') THEN 'Employment'
WHEN regexp_like(tags.path, '.*pension.*') THEN 'Pension'
WHEN regexp_like(tags.path, '.*underpayment.*') THEN 'Underpayment'
WHEN regexp_like(tags.path, '.*state-benefits.*') THEN 'State Benefits'
WHEN regexp_like(tags.path, '.*income-tax-history.*') THEN 'Income Tax History'
WHEN regexp_like(tags.path, '.*signout.*') THEN 'Signout page'
WHEN regexp_like(tags.path, '.*what-do-you-want-to-do.*') THEN 'Home Dashboard'
ELSE 'Other'
END AS "Subject"
FROM {{time_aware_table("[YOUR TABLE NAME]", timerange = "7 days")}}
WHERE tags.path NOT LIKE '%/assets/%'
AND tags.path NOT LIKE '%/images/%'
AND tags.path NOT LIKE '%/fonts/%'
AND detail.status_code = '200'
)

The Problem

When you’re doing analytics on a government tax service, your page paths look deceptively clean at first glance â until you realise that URLs like these all represent the same page:

/check-income-tax/tax-code/1250L/employment
/check-income-tax/tax-code/K497/employment
/check-income-tax/tax-code/BR/employment

And paths like these each represent different users visiting the same underpayment summary page:

/check-income-tax/underpayment/I48-TR18-T7S
/check-income-tax/underpayment/YPB-EMZG-IPLA
/check-income-tax/underpayment/dd6c6129-9956-42c3-ab19-8184669c04ec

To produce meaningful page-level analytics â counting sessions per page, tracking journeys, building funnels â you need to strip those dynamic segments out and replace them with a stable placeholder like [AN], while not stripping segments that are actually meaningful, such as the tax year range /2021-2022/.

That’s what the regex chain in this SQL does.

The Overall Strategy: COALESCE as a Waterfall

Rather than writing one enormous regex that tries to handle every case, the query uses Presto/Athena’s COALESCE function as a waterfall of increasingly general patterns. The first expression that returns a non-null value wins.

regexp_replace(regexp_extract(path, '/check-income-tax(.*)\?.*$', 1), '\/\d{1,3}$', '/[AN]/')

URLs with a query string â strip the query string first, then replace any trailing 1â3 digit number.

regexp_replace(regexp_extract(path, '/check-income-tax(.*)', 1), '\/[0-9]+[A-Za-z]â¦', '/[AN]/')

URLs without a query string â apply the full alphanumeric normalisation regex to handle short numeric IDs, tax-code-like values, hyphenated reference tokens, and UUID-like identifiers.

regexp_extract(path, '/check-income-tax(.*)\?.*$', 1)

Fallback for paths where the query string was present but step 1 returned null â return the clean path.

regexp_extract(path, '/check-income-tax(.*)', 1)

Last resort â return whatever follows /check-income-tax as-is.

The same four-step COALESCE pattern is applied to both the page path tags.path and the referrer URL detail.referrer. The referrer version skips the /check-income-tax anchor since referrers may originate from other services.

Step 1 â Handling URLs with Query Strings

regexp_extract(path, ‘/check-income-tax(.)\?.$’, 1)

This strips the query string from a URL, capturing only the path segment after /check-income-tax and before the ?.

/check-income-tax

Literal anchor â only match paths that belong to this service

(.*)

Capture group â grab everything after the anchor

\?.*$

Match the ? and everything after it, effectively stripping the query string

regexp_replace(â¦, ‘\/\d{1,3}$’, ‘/[AN]/’)

A literal forward slash

\d{1,3}

One to three digits

Anchored to the end of the string

Input path	After step 1
/check-income-tax/employment/123?taxYear=2023	/employment/[AN]
/check-income-tax/tax-code/details?ref=abc	/tax-code/details unchanged

Step 2 â The Main Normalisation Regex

This handles several distinct ID formats in a single regexp_replace pass using the | OR operator.

Regex

\/[0-9]+[A-Za-z].+[\-].+          -- Pattern A
|\/[A-Za-z]+[0-9].+[\-].+         -- Pattern B
|\/(?![0-9]{4}\-[0-9]{4})(?:[A-Z]|[0-9])+[\-].+\/  -- Pattern C
|\/[0-9]{1,3}(?:$|\/)             -- Pattern D

Pattern A â Numbers first, then letters, then a hyphen

Targets identifiers that start with digits, mix in letters, and contain a hyphen.

Matches	Doesn’t match
/1A-TR7	/1250L â no hyphen
/42X-CORP-REF	/2021-2022 â protected elsewhere

Pattern B â Letters first, then numbers, then a hyphen

The mirror of pattern A. It targets identifiers such as /I48-TR18-T7S.

Matches	Doesn’t match
/I48-TR18-T7S	/employment â pure letters, no digit
/YPB3-EMZG-IPLA	/tax-code â pure letters with a hyphen, no digit

Pattern C â Uppercase/digit hyphenated tokens, excluding tax years

This catches uppercase/digit hyphenated tokens such as /YPB-EMZG-IPLA/, while deliberately not matching tax year ranges like /2021-2022/. Lowercase UUID-like values may be caught elsewhere in the wider regex chain, especially by the mixed letter/number patterns.

Path segment boundary

(?![0-9]{4}\-[0-9]{4})

Negative lookahead â protects tax years such as 2021-2022

(?:[A-Z]|[0-9])+

One or more uppercase letters or digits

[\-].+

A hyphen followed by more content

Trailing slash

The trailing \/ is important. The trade-off is that a token at the very end of a path without a trailing slash may not be caught by this specific pattern.

Input	Outcome
/YPB-EMZG-IPLA/	matched â [AN]
/dd6c6129-9956-42c3-ab19…/	usually matched by the wider mixed alphanumeric rules, not specifically Pattern C
/2021-2022/	protected by negative lookahead â kept as-is
/right-amount	no digit/uppercase token â kept as-is

Pattern D â Short numeric segments

This catches a 1â3 digit number at the end of a path or as an interior segment.

Input	Outcome
/employment/7	/employment/[AN]
/page/123/details	/page/[AN]/details
/tax-paid/2021-2022	not matched â 4 digits

The Referrer Version

The same logic is applied to the referrer URL, but without the /check-income-tax anchor since referrers can come from other GOV.UK services.

Regex extract

regexp_extract(
  concat('/', try(array_join(slice(short_referrer_array, 4, cardinality(short_referrer_array) - 1), '/'))),
  '(?>/check-income-tax|^)(personal-account|.*)',
  1
)

This slices the referrer URL array from the fourth segment onwards, joins it back into a path, then extracts either a personal-account path or anything else following the service prefix.

Putting It Together: The Subject Classification

Once the path is normalised into short_page, a CASE statement maps it to a human-readable subject area.

SQL CASE

WHEN regexp_like(tags.path, '.*tax-code.*') THEN 'Tax Code'
WHEN regexp_like(tags.path, '.*employment.*') THEN 'Employment'
WHEN regexp_like(tags.path, '.*underpayment.*') THEN 'Underpayment'
WHEN regexp_like(tags.path, '.*pension.*') THEN 'Pension'

Key Design Decisions Worth Noting

Why not just one big regex? A single pattern trying to handle query strings, short numerics, mixed alphanumerics, and UUID-like identifiers simultaneously becomes difficult to reason about and maintain. The COALESCE waterfall makes each case explicit and independently testable.

Why [AN] as the placeholder? It is short, visually distinctive, and documents that a dynamic alphanumeric segment was present.

Why the 1â3 digit limit? Tax year ranges such as /2021-2022/ contain four-digit numbers. Setting the ceiling at three digits helps avoid accidentally tokenising tax years.

Why a negative lookahead? The lookahead (?![0-9]{4}\-[0-9]{4}) protects /2021-2022/ while still allowing other hyphenated token patterns to be normalised.

Taming URL Chaos: Regex Strategies for Normalising GOV.UK Page Paths

Taming URL Chaos: Regex Strategies for Normalising GOV.UK Page Paths

The Complete Query

The Problem

The Overall Strategy: COALESCE as a Waterfall

Step 1 â Handling URLs with Query Strings

regexp_extract(path, ‘/check-income-tax(.)\?.$’, 1)

regexp_replace(â¦, ‘\/\d{1,3}$’, ‘/[AN]/’)

Step 2 â The Main Normalisation Regex

Pattern A â Numbers first, then letters, then a hyphen

Pattern B â Letters first, then numbers, then a hyphen

Pattern C â Uppercase/digit hyphenated tokens, excluding tax years

Pattern D â Short numeric segments

The Referrer Version

Putting It Together: The Subject Classification

Key Design Decisions Worth Noting

1 thought on “Taming URL Chaos: Regex Strategies for Normalising GOV.UK Page Paths”

Leave a Comment Cancel Reply

Taming URL Chaos: Regex Strategies for Normalising GOV.UK Page Paths

The Complete Query

The Problem

The Overall Strategy: COALESCE as a Waterfall

Step 1 â Handling URLs with Query Strings

regexp_extract(path, ‘/check-income-tax(.*)\?.*$’, 1)

regexp_replace(â¦, ‘\/\d{1,3}$’, ‘/[AN]/’)

Step 2 â The Main Normalisation Regex

Pattern A â Numbers first, then letters, then a hyphen

Pattern B â Letters first, then numbers, then a hyphen

Pattern C â Uppercase/digit hyphenated tokens, excluding tax years

Pattern D â Short numeric segments

The Referrer Version

Putting It Together: The Subject Classification

Key Design Decisions Worth Noting

1 thought on “Taming URL Chaos: Regex Strategies for Normalising GOV.UK Page Paths”

Leave a Comment Cancel Reply

Step 1 â Handling URLs with Query Strings

regexp_extract(path, ‘/check-income-tax(.)\?.$’, 1)

regexp_replace(â¦, ‘\/\d{1,3}$’, ‘/[AN]/’)

Step 2 â The Main Normalisation Regex

Pattern A â Numbers first, then letters, then a hyphen

Pattern B â Letters first, then numbers, then a hyphen

Pattern C â Uppercase/digit hyphenated tokens, excluding tax years

Pattern D â Short numeric segments