Analyze the real usage cost of your BigQuery data warehouse

Jul 27, 2023 | 6 minute read

BigQuery is an amazing database. It will handle everything you throw at it, no matter the size… or cost.

And things are easy to overlook: a BI tool doing full table scans every hour, a new hire running SELECT * on a huge dataset, or an ETL script stuck in a loop…

You should be able to discover these before you get smacked with a huge bill at the end of the month.

Let’s figure out how to build a report that gives us proper insight into what exactly we’re paying for.

First of all, we need to understand pricing for on-demand queries:

  • $6.25 per TB
  • the first 1 TB per month is free
  • minimum 10MB billed per query ($0.0000625)
  • minimum 10MB billed per referenced table

No matter how small the data, you will always be billed for at least 10MB. If you query three tables (e.g. using joins), you will be billed for at least 30MB ($0.0001875).
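
To sanity-check the arithmetic, here are those minimums expressed as a standalone calculation (assuming the $6.25/TB rate and 10MB = 10^7 bytes):

select
  10 * pow(10, 6) / pow(10, 12) * 6.25 as min_cost_per_query_usd,       -- $0.0000625
  3 * 10 * pow(10, 6) / pow(10, 12) * 6.25 as min_cost_three_tables_usd -- $0.0001875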

It seems difficult to calculate, but BigQuery does most of the heavy lifting for us in special INFORMATION_SCHEMA tables. These hold data on the jobs (queries) run in BigQuery, including cost. Looking at the schema, we can find two columns which could indicate data usage:

  • total_bytes_processed: Total bytes processed by the job
  • total_bytes_billed: Total bytes billed for the job
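
To eyeball both columns for your own recent jobs, a quick look like this works (a sketch assuming your jobs run in the US region):

select
  creation_time,
  total_bytes_processed,
  total_bytes_billed
from `region-us`.INFORMATION_SCHEMA.JOBS
order by creation_time desc
limit 10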

total_bytes_processed is what you can see in the BigQuery console before you run a SQL query:

[image: big_query]

Often, the processed and billed numbers will be pretty similar:

[image: medium_query]

But if you incorrectly assume they are always the same, you’re in for a surprise:

[image: small_query]
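
If you want to hunt for such cases in your own project, a quick sketch like this lists jobs billed for more than they actually processed:

select
  creation_time,
  user_email,
  total_bytes_processed,
  total_bytes_billed
from `region-us`.INFORMATION_SCHEMA.JOBS
where total_bytes_billed > total_bytes_processed
order by creation_time desc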

OK, seems pretty clear. total_bytes_billed will show us exactly how much we had to pay for a given query. Let’s go straight to the code:

select
  creation_time,
  user_email,
  query,
  -- if job creation date is before 2023-07-05, it was priced at $5/TB. Otherwise, $6.25/TB. 
  case
    when timestamp_diff(creation_time, timestamp("2023-07-05"), day) < 0
      then total_bytes_billed / pow(10,12) * 5.0
    else total_bytes_billed / pow(10,12) * 6.25
  end as cost_usd
from `region-us`.INFORMATION_SCHEMA.JOBS

A lot has happened above; allow me to unpack:

  • we account for the BigQuery pricing change on July 5th, 2023
  • we convert bytes to terabytes and multiply by the price per TB
  • we display who ran the query, when, and what the SQL code was
  • we query jobs run in the US region (every INFORMATION_SCHEMA query must specify a region)
  • we ignore the first free terabyte - it makes sense to lower the total by 1 TB when summing across all queries (see the sketch below), but it won’t work for more detailed reporting
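
For the total-sum case, applying the free terabyte could look like this (a sketch that, for simplicity, assumes the current $6.25/TB rate for all months):

select
  date_trunc(date(creation_time), month) as month,
  -- subtract the free 1 TB per month, but never go below zero
  greatest(sum(total_bytes_billed) / pow(10, 12) - 1, 0) * 6.25 as cost_usd
from `region-us`.INFORMATION_SCHEMA.JOBS
group by 1
order by 1 desc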

We’ve ended up with raw cost data, but we can make it a little more usable - we’ll wrap the previous query in a CTE and do some basic aggregation:

with raw_data as (
  select
    creation_time,
    user_email,
    query,
    -- if job creation date is before 2023-07-05, it was priced at $5/TB. After that, $6.25/TB.
    case
      when timestamp_diff(creation_time, timestamp("2023-07-05"), day) < 0
        then total_bytes_billed / pow(10,12) * 5.0
      else total_bytes_billed / pow(10,12) * 6.25
    end as cost_usd
  from `region-us`.INFORMATION_SCHEMA.JOBS
)

select
  date_trunc(date(creation_time), month) as month,
  user_email,
  query,
  sum(cost_usd) as cost_usd
from raw_data
group by 1, 2, 3
order by 1 desc, 4 desc

Aaand, we just went from data to knowledge:

[image: query_cost]

The results above show the most expensive queries per month. If you’d like to know which user or service account uses the most resources in total, remove the query column from the aggregation:

select
  date_trunc(date(creation_time), month) as month,
  user_email,
  sum(cost_usd) as cost_usd
from raw_data
group by 1, 2
order by 1 desc, 3 desc

This will also greatly reduce the cost of the analysis itself, as the query column tends to be pretty big.

We don’t have to leave it at cost calculation - how about query performance metrics?

(...)
select
  avg(timestamp_diff(end_time, start_time, millisecond)) as avg_query_time_ms,
  max(timestamp_diff(end_time, start_time, millisecond)) as max_query_time_ms
(...)
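
Folding these into the earlier per-user report could look like this (again a sketch, applying the current $6.25/TB rate to all months for brevity):

select
  date_trunc(date(creation_time), month) as month,
  user_email,
  sum(total_bytes_billed) / pow(10, 12) * 6.25 as cost_usd,
  avg(timestamp_diff(end_time, start_time, millisecond)) as avg_query_time_ms,
  max(timestamp_diff(end_time, start_time, millisecond)) as max_query_time_ms
from `region-us`.INFORMATION_SCHEMA.JOBS
group by 1, 2
order by 1 desc, 3 desc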

BigQuery Analytics