How do you load a flat, unstructured text file into BigQuery? BigQuery supports loading files in Avro, CSV, JSON, ORC, or Parquet format (see the documentation). But when you need to load an unstructured txt or log file, a simple hack is required.
Because we can't specify the data format, we need to apply the ELT (Extract, Load, Transform) approach:
- Upload the file to Google Cloud Storage
- Load your data into a single-column BigQuery staging table
- Parse the data using BigQuery functions and save the result in the target table
Let’s try to load an example Apache log file:
```
126.96.36.199 - - [17/May/2015:14:05:05 +0000] "GET /projects/solaudio HTTP/1.1" 301 332 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
188.8.131.52 - - [17/May/2015:14:05:13 +0000] "GET /favicon.ico HTTP/1.1" 200 3638 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0"
184.108.40.206 - - [17/May/2015:14:05:25 +0000] "GET /projects/solaudio/ HTTP/1.1" 200 8301 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
220.127.116.11 - - [17/May/2015:14:05:21 +0000] "GET /about/ HTTP/1.1" 200 11474 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
18.104.22.168 - - [17/May/2015:14:05:05 +0000] "GET /articles/arp-security/ HTTP/1.1" 200 19735 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
```
If your file is smaller than 10 MB, you can load it directly into BigQuery from your browser. Bigger files should be uploaded to Google Cloud Storage first.
Loading the file into BigQuery
- Create a dataset.
- Create a table.
- Create table from: Google Cloud Storage, and choose the path to your file.
- File format: CSV.
- Table type: External table - this way you don’t need to load temporary data into BigQuery.
- In the schema, add only one field, “raw_data”, with type STRING.
- Field delimiter: Custom - fill it with a character that does not appear anywhere in the file, so each whole line lands in the single raw_data column.
- Click Create table.
- Now you can query your data stored in Cloud Storage using BigQuery SQL.
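Picking a delimiter that never occurs in the file is the step that is easiest to get wrong. Here is a small hypothetical Python helper (not part of BigQuery; the function name and sample lines are illustrative) that checks a candidate delimiter against sample lines before you commit to it:

```python
def is_safe_delimiter(delimiter: str, lines) -> bool:
    """Return True if the candidate field delimiter appears in none of the lines."""
    return all(delimiter not in line for line in lines)

# Two lines taken from the sample Apache log above
sample = [
    '126.96.36.199 - - [17/May/2015:14:05:05 +0000] "GET /projects/solaudio HTTP/1.1" 301 332 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
    '188.8.131.52 - - [17/May/2015:14:05:13 +0000] "GET /favicon.ico HTTP/1.1" 200 3638 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0"',
]

print(is_safe_delimiter("/", sample))  # False: slashes appear in every request path
print(is_safe_delimiter("þ", sample))  # True: safe to use as the custom delimiter
```

Running this over a sample of your real log (not just two lines) gives much better confidence before you create the external table.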
Now we are ready to parse the data and load it into the target structure. BigQuery string functions will be very helpful here, especially REGEXP_EXTRACT. Use LIMIT while testing your query to save cost and time when working with large tables.
For our example file, we can write a query extracting the IP address and date:
```sql
SELECT
  REGEXP_EXTRACT(raw_data, r"\d+\.\d+\.\d+\.\d+") AS ip,
  PARSE_DATE("%e/%b/%Y", REGEXP_EXTRACT(raw_data, r"\[(\d+\/\w+\/\d+)")) AS date
FROM our_temp_table
LIMIT 100
```
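As a quick local sanity check (a sketch, not a required part of the BigQuery flow), the same two regular expressions can be tried in Python against one of the sample log lines before spending query quota:

```python
import re
from datetime import datetime

# One raw line from the sample Apache access log above
line = ('126.96.36.199 - - [17/May/2015:14:05:05 +0000] '
        '"GET /projects/solaudio HTTP/1.1" 301 332 "-" '
        '"msnbot/2.0b (+http://search.msn.com/msnbot.htm)"')

# Same patterns as in the BigQuery query
ip = re.search(r"\d+\.\d+\.\d+\.\d+", line).group()
date_raw = re.search(r"\[(\d+/\w+/\d+)", line).group(1)  # "17/May/2015"

# Python's strptime uses %d/%b/%Y for this layout
date = datetime.strptime(date_raw, "%d/%b/%Y").date()

print(ip)    # 126.96.36.199
print(date)  # 2015-05-17
```

If a pattern misbehaves, it is much cheaper to notice it here than after a full-table scan.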
- When the query is ready, run it on the whole table.
- Click Save results.
- Choose BigQuery table and give it a table name - the final destination for our structured data.
In this short tutorial we loaded a single file. If you need to parse real-time data, you should consider using Cloud Logging. Check how I used it in my article about dealing with the pandemic with Google Cloud.