Sometimes you just want data from your source into your analytical tool and start doing experiments. I have created a tool that can help you in this kind of prototyping.
A common way to integrate SQL server and BigQuery the lazy way is to:
- Export table to disk
- upload CSV file to cloud storage
- Load to BigQuery and autodetect schema
Currently there is no tool that is doing this in a nice way for you.
It can be tedious to do this process, and usually you have some border cases where the simple strategy will not work, and you end up spending time to mitigate (especially if you have tables with types that are incompatible with BQ). Since I tend to do this every time I have a customer with SQL server, a utility script has evolved that is able to do this and handles a lot of the border cases that costs time. I have packaged this on github with a ready dockerfile. You can run it as a python script, a docker job or package it into kubernetes and cloud run.
- It automatically generates a BigQuery schema for you, so you dont have to rely on autodetect
- It works well with serverless where you have no disk, since it does not use local disk as temp storage.
- It has a basic understanding of state, which is stored in GCS. Meaning it is quite fast to access compared to BQ.
- It tries to be smart on large tables and partition them into chunks, and only reload the changed chunks from source (since this is usually the bottleneck)
With this tool you should be able to get a pipeline up within hours, that is scalable and will probably run well in most cases. And as a side effect, if you have large tables that rarely changes, this pipeline will only load from source when something has changed, meaning that it will exit as quick as it can when it understands that BigQuery and source are the same. (usually within seconds)
Some example code
The full repo with all source can be found here: https://github.com/ael-computas/sqlserver-to-bigquery
Just fork and play with it as you like. If you do find bugs i am happy to be notified, so it can be fixed.
If you want to look into Change detect capture, take a look at Debezium: https://debezium.io/
If you need help with analytics or pipelines, feel free to reach out :)