Backfill user_id in GA4 events exported to Bigquery
Jul 14, 2022 | 4 minute read
Jul 14, 2022 | 4 minute read
Let’s say your SaaS analytics setup looks like this:
user_id
to GA4 in appNow, your data contains events where user is recognized (user_id
is present) and events where you only have anonymous id (user_pseudo_id
in GA4) - and these might be generated by the same person.
What we want to do here is to take the combination of user_id
and user_pseudo_id
, find all occurences of user_pseudo_id
and add user_id
to it. That way we can assign user_id
to traffic which happend before the user was recognized for the first time.
Super simple solution would look like this:
with ga4_user_ids as (
select distinct
user_id,
user_pseudo_id
from analytics_12345678.events_*
where user_id is not null
)
select
*,
ga4_user_ids.user_id
from analytics_12345678.events_*
left join ga4_user_ids on ga4_user_ids.user_pseudo_id = events.user_pseudo_id
This is nice, but you need to be aware of two things:
user_id
on the fly, which is not perfect for performance reasonsuser_ids
can be joined to single user_pseudo_id
(ex. shared computer), which might mess up your dataNow, instead of joining user_ids on the fly, we can use the same principle with combination of user_id
and user_pseudo_id
to update old events with MERGE
:
MERGE analytics_12345678.events_* events
USING
(
select distinct
user_id,
user_pseudo_id
from analytics_12345678.events_*
where user_id is not null
and user_pseudo_id is not null
qualify row_number() over (partition by user_pseudo_id) = 1
order by user_pseudo_id
) user_ids
ON events.user_pseudo_id = user_ids.user_pseudo_id
WHEN MATCHED THEN
UPDATE SET events.user_id = user_ids.user_id
Notice the usage of qualify
which we can use to remove duplicates of user_pseudo_ids
joined to the same user_id
. This will not ensure 100% perfect user recognition, but at least we won’t end up with duplicated events.
Disclaimer: This code will traverse all collected events, be wary of cost.