BigQuery and holidays

Anders Elton
6 min readNov 8, 2023

--

Time for vacation!

TL; DR — BigQuery doesnt have a list of holidays natively, just follow this link https://github.com/ael-computas/bigquery-holidays or use the public dataset “ael-cx.holidays.holidays”

While writing code to analyse working days in the company where I work, it is important to factor in holidays to get the complete picture. For example, May in Norway is known to have many holidays (3), and is usually the joke of “not getting anything done that month”. In fact, easter have more vacation days(4), and is also a moving holiday making it oh so more complicated to deal with.

I wanted to do something simple, like creating a time dimension with holidays included. So I figured there had to be a public dataset with holidays available in the BigQuery public data or the Analytics Hub, that I could use to connect with he working days to learn if the day was a vacation day or not.

Surprised, I realised this was not the case. Obviously this has to be a solved issue for a lot of companies out there, but I could not really find something suitable. (It could be my lack of google skills). I just wanted a public table with vacation that I could use.

What I could find, was javascript and python code to do this. But not reallt anything in SQL.. So I decided to use existing code to make my own table / UDF.

My initial thought was to create a UDF and just include a javascript library and generate the rows on the fly, and use dataform or dbt to create a stable incremental table in pure SQL.

BigQuery retaliates with a file-limit of the javasript library

I could have searched for a smaller javascript, or custom built one with less data and countries.. but.. too much job.

I dropped the UDF idea and went over to plan B: Create a python script to generate holidays, store it in cloud storage and load it to BigQuery.

Generating holidays in python

This was actually quite straight forward. Python has an easy to use library for this:

pip install holidays

The library is compatible with most countries and is well documented, so writing the code to generate the vacation days didnt take long.

The code to generate holidays from 1900 to 2050 for all available countries in that package looks like this:

import holidays

for year in range(1900, 2050):
for code in holidays.list_localized_countries():
print("{}".format(code))
try:
h = holidays.country_holidays(country=code, years=[year], expand=True, language="en_US")
for date, name in sorted(h.items()):
print(" {} {}".format(date, name))
except Exception as e:
print(f"{e}")

With that out of the way, instead of printing the results to stdout, its time to write it to files! I figured one file for each year would be nice. In BigQuery you can load files by using “gs://foo/bar/*”, so having multiple files is a good idea i think.

Storing it in a fileformat

Usually I have been writing this kind of stuff into a csv or jsonl format and then doing schema creating elsewhere (terraform or as a schema.json). Anyway, this time I wanted to do it in new ways, and i was looking at AVRO and PARQUET, that both have schema embedded in the payload, which i find really nice. By the toss of the dice i decided on AVRO.

AVRO implementation

pip install avro

So i went to apache avro documentation and started coding.

I tried cut and pasting the example from that page, but for some reason it has python2 code

I really should have taken the hint here..

After correting that piece of code, I spent a bit of time figuring out how to do the schema programatically.. I mean i could have written it in some json file. Probably would have saved some time on that.

Not really knowing the AVRO standard, I realised that it doesnt really have dates as a standard concept. You have to do all kinds of fiddling and compute days since epoch. Oh joy!

really??!

When loading this into BigQuery the resulting “date” column is not actually of type date, but rather INT. I tried setting the use_avro_logical_types flag, but i could not get “bq” tool to load it to a DATE type.

I folded and tossed away the code and replaced the storage with ….

PARQUET implementation

I followed the guide from this blog post.

I immediately liked the fact that when you were defining your schema it is quite literal, so you cant misspell “dtae/date” for example. In addition, it DOES support the types you would expect. Great!

schema = pa.schema([
('country', pa.string()),
('name', pa.string()),
('date', pa.date32()),
])

What is a bit strange (well, actually quite common if you are used to use python for analytics) is that you have to think column oriented.

Modifying my existing basecode to generate holidays and adding the code to store into batches. Each batch is a country for a year. Dont know if that is a sane size, but it works.

h = holidays.country_holidays(country=code, years=[year], expand=True, language="en_US")
country = []
names = []
dates = []

for date, name in sorted(h.items()):
country.append(code)
names.append(name)
dates.append(date)

batches.append(pa.RecordBatch.from_arrays([country, names, dates], schema=schema))

Storing it in cloud storage

One option is to load it to cloud storage by using gsutil and just generate the files locally. Since I am already a fan of the great smart_open library in python I would just stick to it and leave it to the user where the files are stored in the end!

from smart_open import open

BASE_PATH = "gs://the-bucket/path"

# the code
# ---
# SNIP

table = pa.Table.from_batches(batches)
with open(f'{BASE_PATH}/holidays-{year}.parquet', 'wb') as f:
pq.write_table(table, f)

smart open just works all the time.

BigQuery

To load the data into BigQuery I used the bq tool

bq --location=EU load --source_format=PARQUET --replace holidays.holidays gs://the-bucket/path/*

A simple validation test making sure Norways national day is a holiday, and todays date is not.

with is_this_a_holday as (
select DATE("2022-05-17") as test_date, TRUE as correct
union all
select DATE("2022-11-08") as test_date, FALSE as correct
)
select test_date, correct, h.name from is_this_a_holday
LEFT JOIN `ael-cx.holidays.holidays` h on test_date=h.date AND country="NO"

And the result..

Conclusion

In this post I have explored several ways of how you can figure out if a day is a vacation day or not. I am a bit surprised this is not natively in BigQuery or a public dataset, but in the end it is quite simple to do it yourself using your favourite programming language.

Feel free to play around with the code here: https://github.com/ael-computas/bigquery-holidays.

If there is a simpler solution out there please let me know in the comments!

--

--