Analytics Engineering · SQL · Regex
Taming URL Chaos: Regex Strategies for Normalising GOV.UK Page Paths
How to group analytics events by meaningful page structure when your URLs contain user IDs, tax reference numbers, and random alphanumeric tokens
The Complete Query
Here’s the full SQL in context. The sections below break down each regex component in detail.
regexp_extract(concat(‘/’, try(array_join(slice(short_referrer_array, 4, cardinality(short_referrer_array) – 1), ‘/’))),
‘(?>/check-income-tax|^)(personal-account|.*)’, 1) AS “short_referrer”,
short_page, user_agent_string, event_id, request_id,
session_id, transaction_name, detail_method, page_path, Subject
from (
SELECT
DATE, generated_at,
detail.referrer AS “referrer”,
split(coalesce(
regexp_replace(regexp_extract(detail.referrer, ‘(.*)\?.*$’, 1),‘\/\d{1,3}$’, ‘/[AN]/’),
regexp_replace(detail.referrer,‘\/[0-9]+[A-Za-z].+[\-].+|\/[A-Za-z]+[0-9].+[\-].+|\/(?![0-9]{4}\-[0-9]{4})(?:[A-Z]|[0-9])+[\-].+\/|\/[0-9]{1,3}(?:$|\/)’, ‘/[AN]/’),
regexp_extract(detail.referrer, ‘(.*)\?.*$’, 1),
regexp_extract(detail.referrer, ‘(.*)’, 1)),
‘/’) AS “short_referrer_array”,
detail.user_agent_string AS “user_agent_string”,
event_id, tags.x_request_id AS “request_id”,
tags.x_session_id AS “session_id”,
tags.transaction_name AS “transaction_name”,
detail.method AS “detail_method”,
tags.path AS “page_path”,
coalesce(
regexp_replace(regexp_extract(tags.path, ‘/check-income-tax(.*)\?.*$’, 1),‘\/\d{1,3}$’, ‘/[AN]/’),
regexp_replace(regexp_extract(tags.path, ‘/check-income-tax(.*)’, 1),‘\/[0-9]+[A-Za-z].+[\-].+|\/[A-Za-z]+[0-9].+[\-].+|\/(?![0-9]{4}\-[0-9]{4})(?:[A-Z]|[0-9])+[\-].+\/|\/[0-9]{1,3}(?:$|\/)’, ‘/[AN]/’),
regexp_extract(tags.path, ‘/check-income-tax(.*)\?.*$’, 1),
regexp_extract(tags.path, ‘/check-income-tax(.*)’, 1)) AS “short_page”,
CASE
WHEN regexp_like(tags.path, ‘.*print-your.*’) THEN ‘Printing’
WHEN regexp_like(tags.path, ‘.*current-year.*’) THEN ‘Current Year’
WHEN regexp_like(tags.path, ‘.*tax-code.*’) THEN ‘Tax Code’
WHEN regexp_like(tags.path, ‘.*employment.*’) THEN ‘Employment’
WHEN regexp_like(tags.path, ‘.*pension.*’) THEN ‘Pension’
WHEN regexp_like(tags.path, ‘.*underpayment.*’) THEN ‘Underpayment’
WHEN regexp_like(tags.path, ‘.*state-benefits.*’) THEN ‘State Benefits’
WHEN regexp_like(tags.path, ‘.*income-tax-history.*’) THEN ‘Income Tax History’
WHEN regexp_like(tags.path, ‘.*signout.*’) THEN ‘Signout page’
WHEN regexp_like(tags.path, ‘.*what-do-you-want-to-do.*’) THEN ‘Home Dashboard’
ELSE ‘Other’
END AS “Subject”
FROM {{time_aware_table(“[YOUR TABLE NAME]”, timerange = “7 days”)}}
WHERE tags.path NOT LIKE ‘%/assets/%’
AND tags.path NOT LIKE ‘%/images/%’
AND tags.path NOT LIKE ‘%/fonts/%’
AND detail.status_code = ‘200’
)
The Problem
When you’re doing analytics on a government tax service, your page paths look deceptively clean at first glance — until you realise that URLs like these all represent the same page:
/check-income-tax/tax-code/K497/employment
/check-income-tax/tax-code/BR/employment
And paths like these each represent different users visiting the same underpayment summary page:
/check-income-tax/underpayment/YPB-EMZG-IPLA
/check-income-tax/underpayment/dd6c6129-9956-42c3-ab19-8184669c04ec
To produce meaningful page-level analytics — counting sessions per page, tracking journeys, building funnels — you need to strip those dynamic segments out and replace them with a stable placeholder like [AN] (alphanumeric), while not stripping segments that are actually meaningful, such as the tax year range /2021-2022/.
That’s what the regex chain in this SQL does. Let’s break it apart.
The Overall Strategy: COALESCE as a Waterfall
Rather than writing one enormous regex that tries to handle every case, the query uses Presto/Athena’s COALESCE function as a waterfall of increasingly general patterns. The first expression that returns a non-null value wins.
regexp_replace(regexp_extract(path, '/check-income-tax(.*)\?.*$', 1), '\/\d{1,3}$', '/[AN]/')
URLs with a query string — strip the query string first, then replace any trailing 1–3 digit number.
regexp_replace(regexp_extract(path, '/check-income-tax(.*)', 1), '\/[0-9]+[A-Za-z]…', '/[AN]/')
URLs without a query string — apply the full alphanumeric normalisation regex to handle IDs, tax codes, and UUIDs.
regexp_extract(path, '/check-income-tax(.*)\?.*$', 1)
Fallback for paths where the query string was present but step 1 returned null — return the clean path.
regexp_extract(path, '/check-income-tax(.*)', 1)
Last resort — return whatever follows /check-income-tax as-is.
tags.path) and the referrer URL (detail.referrer). The referrer version skips the /check-income-tax anchor since referrers may originate from other services.Step 1 — Handling URLs with Query Strings
regexp_extract(path, ‘/check-income-tax(.*)\?.*$’, 1)
This strips the query string from a URL, capturing only the path segment after /check-income-tax and before the ?.
? and everything after it — consumed but not captured, effectively stripping the query stringThen regexp_replace is applied to that result:
regexp_replace(…, ‘\/\d{1,3}$’, ‘/[AN]/’)
| Input path | After step 1 |
|---|---|
| /check-income-tax/employment/123?taxYear=2023 | /employment/[AN] |
| /check-income-tax/tax-code/details?ref=abc | /tax-code/details (no numeric tail — unchanged) |
Step 2 — The Main Normalisation Regex
This is the most complex part. It handles several distinct ID formats in a single regexp_replace pass using the | OR operator.
The full pattern
|\/[A-Za-z]+[0-9].+[\-].+ — Pattern B
|\/(?![0-9]{4}\-[0-9]{4})(?:[A-Z]|[0-9])+[\-].+\/ — Pattern C
|\/[0-9]{1,3}(?:$|\/) — Pattern D
Pattern A — Numbers first, then letters, then a hyphen
Targets tax codes like /1250L or identifiers that start with digits, mix in letters, and contain a hyphen.
| Matches | Doesn’t match |
|---|---|
| /1A-TR7 | /1250L (no hyphen) |
| /42X-CORP-REF | /2021-2022 (handled by the negative lookahead in C) |
Pattern B — Letters first, then numbers, then a hyphen
The mirror of pattern A. Targets things like /I48-TR18-T7S — identifiers that start with a letter, blend in numbers, and contain a hyphen.
| Matches | Doesn’t match |
|---|---|
| /I48-TR18-T7S | /employment (pure letters, no digit) |
| /YPB3-EMZG-IPLA | /tax-code (pure letters with a hyphen — no digit) |
Pattern C — Pure uppercase/digit tokens with a hyphen, excluding tax years
The most carefully tuned pattern. Catches things like /YPB-EMZG-IPLA and UUIDs, while deliberately not matching tax year ranges like /2021-2022.
2021-2022 (four digits, hyphen, four digits), the pattern fails immediately. This is the tax-year guard.\/ is important. Without it, this pattern could inadvertently match content at the end of a path. The trade-off is that UUIDs at the very end of a path (no trailing slash) won’t be caught here — they fall through to the COALESCE fallback.| Input | Outcome |
|---|---|
| /YPB-EMZG-IPLA/ | matched → [AN] |
| /dd6c6129-9956-42c3-ab19…/ | matched → [AN] |
| /2021-2022/ | protected by negative lookahead → kept as-is |
| /right-amount | no digit/uppercase start → kept as-is |
Pattern D — Short numeric segments
The simplest pattern: catches a 1–3 digit number at the end of a path or as an interior segment.
/2021 in /2021-2022/| Input | Outcome |
|---|---|
| /employment/7 | /employment/[AN] |
| /page/123/details | /page/[AN]/details |
| /tax-paid/2021-2022 | not matched — 4 digits, doesn’t meet 1–3 limit |
The Referrer Version
The exact same logic is applied to the referrer URL, but without the /check-income-tax anchor since referrers can come from other GOV.UK services. After normalisation, the referrer is trimmed to a shorter form using:
concat(‘/’, try(array_join(slice(short_referrer_array, 4, cardinality(short_referrer_array) – 1), ‘/’))),
‘(?>/check-income-tax|^)(personal-account|.*)’, 1
)
This slices the referrer URL array from the 4th segment onwards (skipping protocol, empty string, and domain), joins it back into a path, then uses a regex with an atomic group (?>) to extract either a personal-account path or anything else following the service prefix.
Putting It Together: The Subject Classification
Once the path is normalised into short_page, a CASE statement maps it to a human-readable subject area. This works precisely because the normalisation removed all user-specific tokens — the pattern .*tax-code.* now reliably matches all tax code pages regardless of the specific tax code in the original URL.
WHEN regexp_like(tags.path, ‘.*employment.*’) THEN ‘Employment’
WHEN regexp_like(tags.path, ‘.*underpayment.*’) THEN ‘Underpayment’
WHEN regexp_like(tags.path, ‘.*pension.*’) THEN ‘Pension’
…
Key Design Decisions Worth Noting
Why not just one big regex? A single pattern trying to handle query strings, short numerics, mixed alphanumerics, and UUIDs simultaneously becomes very difficult to reason about and maintain. The COALESCE waterfall makes each case explicit and independently testable.
Why [AN] as the placeholder? It’s short, visually distinctive, and survives additional regex passes without being mistaken for a real URL segment. It also documents intent — anyone reading the normalised path immediately knows a dynamic segment was here.
Why the 1–3 digit limit? Tax year ranges like /2021-2022/ contain four-digit numbers. Setting the ceiling at three digits means they’re never accidentally tokenised, preserving the ability to segment analytics by tax year.
Why a negative lookahead instead of a simpler pattern? The lookahead (?![0-9]{4}\-[0-9]{4}) in pattern C is the most surgical tool for protecting /2021-2022/. A character class approach would require excluding hyphens entirely, which would then miss UUIDs. The lookahead keeps the pattern broad while carving out exactly the shape we want to preserve.
Hi, this is a comment.
To get started with moderating, editing, and deleting comments, please visit the Comments screen in the dashboard.
Commenter avatars come from Gravatar.