Taming URL Chaos: Regex Strategies for Normalising GOV.UK Page Paths

Analytics Engineering · SQL · Regex

Taming URL Chaos: Regex Strategies for Normalising GOV.UK Page Paths

How to group analytics events by meaningful page structure when your URLs contain user IDs, tax reference numbers, and random alphanumeric tokens

The Complete Query

Here’s the full SQL in context. The sections below break down each regex component in detail.

SQL

select DATE, generated_at, referrer,
regexp_extract(concat(‘/’, try(array_join(slice(short_referrer_array, 4, cardinality(short_referrer_array) – 1), ‘/’))),
‘(?>/check-income-tax|^)(personal-account|.*)’, 1) AS “short_referrer”,
short_page, user_agent_string, event_id, request_id,
session_id, transaction_name, detail_method, page_path, Subject
from (
SELECT
DATE, generated_at,
detail.referrer AS “referrer”,
split(coalesce(
regexp_replace(regexp_extract(detail.referrer, ‘(.*)\?.*$’, 1),‘\/\d{1,3}$’, ‘/[AN]/’),
regexp_replace(detail.referrer,‘\/[0-9]+[A-Za-z].+[\-].+|\/[A-Za-z]+[0-9].+[\-].+|\/(?![0-9]{4}\-[0-9]{4})(?:[A-Z]|[0-9])+[\-].+\/|\/[0-9]{1,3}(?:$|\/)’, ‘/[AN]/’),
regexp_extract(detail.referrer, ‘(.*)\?.*$’, 1),
regexp_extract(detail.referrer, ‘(.*)’, 1)),
‘/’) AS “short_referrer_array”,
detail.user_agent_string AS “user_agent_string”,
event_id, tags.x_request_id AS “request_id”,
tags.x_session_id AS “session_id”,
tags.transaction_name AS “transaction_name”,
detail.method AS “detail_method”,
tags.path AS “page_path”,
coalesce(
regexp_replace(regexp_extract(tags.path, ‘/check-income-tax(.*)\?.*$’, 1),‘\/\d{1,3}$’, ‘/[AN]/’),
regexp_replace(regexp_extract(tags.path, ‘/check-income-tax(.*)’, 1),‘\/[0-9]+[A-Za-z].+[\-].+|\/[A-Za-z]+[0-9].+[\-].+|\/(?![0-9]{4}\-[0-9]{4})(?:[A-Z]|[0-9])+[\-].+\/|\/[0-9]{1,3}(?:$|\/)’, ‘/[AN]/’),
regexp_extract(tags.path, ‘/check-income-tax(.*)\?.*$’, 1),
regexp_extract(tags.path, ‘/check-income-tax(.*)’, 1)) AS “short_page”,
CASE
WHEN regexp_like(tags.path, ‘.*print-your.*’) THEN ‘Printing’
WHEN regexp_like(tags.path, ‘.*current-year.*’) THEN ‘Current Year’
WHEN regexp_like(tags.path, ‘.*tax-code.*’) THEN ‘Tax Code’
WHEN regexp_like(tags.path, ‘.*employment.*’) THEN ‘Employment’
WHEN regexp_like(tags.path, ‘.*pension.*’) THEN ‘Pension’
WHEN regexp_like(tags.path, ‘.*underpayment.*’) THEN ‘Underpayment’
WHEN regexp_like(tags.path, ‘.*state-benefits.*’) THEN ‘State Benefits’
WHEN regexp_like(tags.path, ‘.*income-tax-history.*’) THEN ‘Income Tax History’
WHEN regexp_like(tags.path, ‘.*signout.*’) THEN ‘Signout page’
WHEN regexp_like(tags.path, ‘.*what-do-you-want-to-do.*’) THEN ‘Home Dashboard’
ELSE ‘Other’
END AS “Subject”
FROM {{time_aware_table(“[YOUR TABLE NAME]”, timerange = “7 days”)}}
WHERE tags.path NOT LIKE ‘%/assets/%’
AND tags.path NOT LIKE ‘%/images/%’
AND tags.path NOT LIKE ‘%/fonts/%’
AND detail.status_code = ‘200’
)


The Problem

When you’re doing analytics on a government tax service, your page paths look deceptively clean at first glance — until you realise that URLs like these all represent the same page:

/check-income-tax/tax-code/1250L/employment
/check-income-tax/tax-code/K497/employment
/check-income-tax/tax-code/BR/employment

And paths like these each represent different users visiting the same underpayment summary page:

/check-income-tax/underpayment/I48-TR18-T7S
/check-income-tax/underpayment/YPB-EMZG-IPLA
/check-income-tax/underpayment/dd6c6129-9956-42c3-ab19-8184669c04ec

To produce meaningful page-level analytics — counting sessions per page, tracking journeys, building funnels — you need to strip those dynamic segments out and replace them with a stable placeholder like [AN] (alphanumeric), while not stripping segments that are actually meaningful, such as the tax year range /2021-2022/.

That’s what the regex chain in this SQL does. Let’s break it apart.


The Overall Strategy: COALESCE as a Waterfall

Rather than writing one enormous regex that tries to handle every case, the query uses Presto/Athena’s COALESCE function as a waterfall of increasingly general patterns. The first expression that returns a non-null value wins.

1
regexp_replace(regexp_extract(path, '/check-income-tax(.*)\?.*$', 1), '\/\d{1,3}$', '/[AN]/')

URLs with a query string — strip the query string first, then replace any trailing 1–3 digit number.

2
regexp_replace(regexp_extract(path, '/check-income-tax(.*)', 1), '\/[0-9]+[A-Za-z]…', '/[AN]/')

URLs without a query string — apply the full alphanumeric normalisation regex to handle IDs, tax codes, and UUIDs.

3
regexp_extract(path, '/check-income-tax(.*)\?.*$', 1)

Fallback for paths where the query string was present but step 1 returned null — return the clean path.

4
regexp_extract(path, '/check-income-tax(.*)', 1)

Last resort — return whatever follows /check-income-tax as-is.

The same four-step COALESCE pattern is applied to both the page path (tags.path) and the referrer URL (detail.referrer). The referrer version skips the /check-income-tax anchor since referrers may originate from other services.

Step 1 — Handling URLs with Query Strings

regexp_extract(path, ‘/check-income-tax(.*)\?.*$’, 1)

This strips the query string from a URL, capturing only the path segment after /check-income-tax and before the ?.

/check-income-tax
Literal anchor — only match paths that belong to this service
(.*)
Capture group — grab everything after the anchor (returned as group 1)
\?.*$
Match the ? and everything after it — consumed but not captured, effectively stripping the query string

Then regexp_replace is applied to that result:

regexp_replace(…, ‘\/\d{1,3}$’, ‘/[AN]/’)

\/
A literal forward slash — the segment separator before the number
\d{1,3}
One to three digits — catches short numeric IDs at the end of a path
$
Anchored to end of string — only replaces if the number is the final segment, not mid-path
Input path After step 1
/check-income-tax/employment/123?taxYear=2023 /employment/[AN]
/check-income-tax/tax-code/details?ref=abc /tax-code/details (no numeric tail — unchanged)

Step 2 — The Main Normalisation Regex

This is the most complex part. It handles several distinct ID formats in a single regexp_replace pass using the | OR operator.

The full pattern

regex

\/[0-9]+[A-Za-z].+[\-].+ — Pattern A
|\/[A-Za-z]+[0-9].+[\-].+ — Pattern B
|\/(?![0-9]{4}\-[0-9]{4})(?:[A-Z]|[0-9])+[\-].+\/ — Pattern C
|\/[0-9]{1,3}(?:$|\/) — Pattern D

Pattern A — Numbers first, then letters, then a hyphen

Targets tax codes like /1250L or identifiers that start with digits, mix in letters, and contain a hyphen.

\/
Leading slash — path segment boundary
[0-9]+
One or more digits at the start of the segment
[A-Za-z]
At least one letter following the digits — the mix requirement
.+
Any further characters
[\-].+
A hyphen must appear — distinguishes these from simple numeric-suffix codes
Matches Doesn’t match
/1A-TR7 /1250L (no hyphen)
/42X-CORP-REF /2021-2022 (handled by the negative lookahead in C)

Pattern B — Letters first, then numbers, then a hyphen

The mirror of pattern A. Targets things like /I48-TR18-T7S — identifiers that start with a letter, blend in numbers, and contain a hyphen.

\/
Path segment boundary
[A-Za-z]+
One or more letters at the start
[0-9]
At least one digit following — the mix requirement
.+[\-].+
More content, then a hyphen, then more content
Matches Doesn’t match
/I48-TR18-T7S /employment (pure letters, no digit)
/YPB3-EMZG-IPLA /tax-code (pure letters with a hyphen — no digit)

Pattern C — Pure uppercase/digit tokens with a hyphen, excluding tax years

The most carefully tuned pattern. Catches things like /YPB-EMZG-IPLA and UUIDs, while deliberately not matching tax year ranges like /2021-2022.

\/
Path segment boundary
(?![0-9]{4}\-[0-9]{4})
Negative lookahead — if what follows looks like 2021-2022 (four digits, hyphen, four digits), the pattern fails immediately. This is the tax-year guard.
(?:[A-Z]|[0-9])+
One or more uppercase letters or digits — non-capturing group
[\-].+
A hyphen then more content — confirms this is a hyphenated identifier
\/
Trailing slash — anchors the match to a complete path component
The trailing \/ is important. Without it, this pattern could inadvertently match content at the end of a path. The trade-off is that UUIDs at the very end of a path (no trailing slash) won’t be caught here — they fall through to the COALESCE fallback.
Input Outcome
/YPB-EMZG-IPLA/ matched → [AN]
/dd6c6129-9956-42c3-ab19…/ matched → [AN]
/2021-2022/ protected by negative lookahead → kept as-is
/right-amount no digit/uppercase start → kept as-is

Pattern D — Short numeric segments

The simplest pattern: catches a 1–3 digit number at the end of a path or as an interior segment.

\/
Path segment boundary
[0-9]{1,3}
Exactly 1–3 digits — catches pagination numbers, small record IDs
(?:$|\/)
End of string OR another slash — confirms this is a complete segment, not the start of something like /2021 in /2021-2022/
Input Outcome
/employment/7 /employment/[AN]
/page/123/details /page/[AN]/details
/tax-paid/2021-2022 not matched — 4 digits, doesn’t meet 1–3 limit

The Referrer Version

The exact same logic is applied to the referrer URL, but without the /check-income-tax anchor since referrers can come from other GOV.UK services. After normalisation, the referrer is trimmed to a shorter form using:

regex extract

regexp_extract(
concat(‘/’, try(array_join(slice(short_referrer_array, 4, cardinality(short_referrer_array) – 1), ‘/’))),
‘(?>/check-income-tax|^)(personal-account|.*)’, 1
)

This slices the referrer URL array from the 4th segment onwards (skipping protocol, empty string, and domain), joins it back into a path, then uses a regex with an atomic group (?>) to extract either a personal-account path or anything else following the service prefix.


Putting It Together: The Subject Classification

Once the path is normalised into short_page, a CASE statement maps it to a human-readable subject area. This works precisely because the normalisation removed all user-specific tokens — the pattern .*tax-code.* now reliably matches all tax code pages regardless of the specific tax code in the original URL.

SQL CASE

WHEN regexp_like(tags.path, ‘.*tax-code.*’) THEN ‘Tax Code’
WHEN regexp_like(tags.path, ‘.*employment.*’) THEN ‘Employment’
WHEN regexp_like(tags.path, ‘.*underpayment.*’) THEN ‘Underpayment’
WHEN regexp_like(tags.path, ‘.*pension.*’) THEN ‘Pension’


Key Design Decisions Worth Noting

Why not just one big regex? A single pattern trying to handle query strings, short numerics, mixed alphanumerics, and UUIDs simultaneously becomes very difficult to reason about and maintain. The COALESCE waterfall makes each case explicit and independently testable.

Why [AN] as the placeholder? It’s short, visually distinctive, and survives additional regex passes without being mistaken for a real URL segment. It also documents intent — anyone reading the normalised path immediately knows a dynamic segment was here.

Why the 1–3 digit limit? Tax year ranges like /2021-2022/ contain four-digit numbers. Setting the ceiling at three digits means they’re never accidentally tokenised, preserving the ability to segment analytics by tax year.

Why a negative lookahead instead of a simpler pattern? The lookahead (?![0-9]{4}\-[0-9]{4}) in pattern C is the most surgical tool for protecting /2021-2022/. A character class approach would require excluding hyphens entirely, which would then miss UUIDs. The lookahead keeps the pattern broad while carving out exactly the shape we want to preserve.

1 thought on “Taming URL Chaos: Regex Strategies for Normalising GOV.UK Page Paths”

Leave a Comment

Your email address will not be published. Required fields are marked *