Backfill user_id in GA4 events exported to Bigquery

Jul 14, 2022 | 4 minute read

Let’s say your SaaS analytics setup looks like this:

  1. You have Google Analytics 4 implemented on both marketing website and app
  2. GA4 gathers data on anonymous users on marketing website
  3. You send user_id to GA4 in app
  4. GA4 data is exported to BigQuery

Now, your data contains events where user is recognized (user_id is present) and events where you only have anonymous id (user_pseudo_id in GA4) - and these might be generated by the same person.

What we want to do here is to take the combination of user_id and user_pseudo_id, find all occurences of user_pseudo_id and add user_id to it. That way we can assign user_id to traffic which happend before the user was recognized for the first time.

Super simple solution would look like this:

with ga4_user_ids as (
  select distinct
    user_id,
    user_pseudo_id
    from analytics_12345678.events_*
    where user_id is not null
)

select
  *,
  ga4_user_ids.user_id
from analytics_12345678.events_*
left join ga4_user_ids on ga4_user_ids.user_pseudo_id = events.user_pseudo_id

This is nice, but you need to be aware of two things:

  1. You join user_id on the fly, which is not perfect for performance reasons
  2. Multiple user_ids can be joined to single user_pseudo_id (ex. shared computer), which might mess up your data

Now, instead of joining user_ids on the fly, we can use the same principle with combination of user_id and user_pseudo_id to update old events with MERGE:

MERGE analytics_12345678.events_* events
USING 
  (
    select distinct
      user_id,
      user_pseudo_id
    from analytics_12345678.events_*
    where user_id is not null 
      and user_pseudo_id is not null
    qualify row_number() over (partition by user_pseudo_id) = 1
    order by user_pseudo_id
  ) user_ids
ON events.user_pseudo_id = user_ids.user_pseudo_id

WHEN MATCHED THEN
  UPDATE SET events.user_id = user_ids.user_id

Notice the usage of qualify which we can use to remove duplicates of user_pseudo_ids joined to the same user_id. This will not ensure 100% perfect user recognition, but at least we won’t end up with duplicated events.

Disclaimer: This code will traverse all collected events, be wary of cost.


Analytics BigQuery Google Analytics 4