
Loading a text file into BigQuery

How do you load a flat, unstructured text file into BigQuery? BigQuery supports loading files in Avro, CSV, JSON, ORC, or Parquet formats (see the documentation). But when you need to load an unstructured .txt or log file, you need a few simple hacks.

Because we can’t specify the data format, we need to apply the ELT (Extract, Load, Transform) approach:

  1. Upload the file to Google Cloud Storage.
  2. Load your data into a single-column BigQuery staging table.
  3. Parse the data using BigQuery functions and save the result in a target table.

Let’s try to load an example Apache log file:

65.55.213.74 - - [17/May/2015:14:05:05 +0000] "GET /projects/solaudio HTTP/1.1" 301 332 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
81.169.149.220 - - [17/May/2015:14:05:13 +0000] "GET /favicon.ico HTTP/1.1" 200 3638 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0"
65.55.213.74 - - [17/May/2015:14:05:25 +0000] "GET /projects/solaudio/ HTTP/1.1" 200 8301 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.213.79 - - [17/May/2015:14:05:21 +0000] "GET /about/ HTTP/1.1" 200 11474 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.213.79 - - [17/May/2015:14:05:05 +0000] "GET /articles/arp-security/ HTTP/1.1" 200 19735 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"

Preparing the file

If your file is smaller than 10 MB, you can load it directly into BigQuery using your browser. Bigger files should be uploaded to Google Cloud Storage first.

  1. Create a new bucket or use an existing one.
  2. Upload the file. I recommend using gsutil for large files. If you prefer to script the upload, see the sketch below.
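
A minimal upload sketch using the google-cloud-storage Python client; the bucket and file names are placeholders, and application default credentials are assumed to be configured:

from google.cloud import storage

# Assumes application default credentials are already configured.
client = storage.Client()

# Placeholder bucket name; create it first or reuse an existing one.
bucket = client.bucket("my-logs-bucket")

# Upload the local log file to Cloud Storage.
blob = bucket.blob("access.log")
blob.upload_from_filename("access.log")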

Loading the file into BigQuery

  1. Create a dataset.
  2. Create a table:
    1. Create table from: Google Cloud Storage; choose the path to your file.
    2. File format: CSV.
    3. Table type: External table - this way you don’t need to load temporary data into BigQuery.
    4. In the schema, add only one field, “raw_data”, with type STRING.
    5. Field delimiter: Custom - fill it with a character that is not present in the file. For example, for Apache server logs it could be |.
    6. Click Create table.
  3. Now you can query your data stored in Cloud Storage using BigQuery SQL. If you prefer to define the external table from code, see the sketch below.
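
The same external table can be defined programmatically. Here is a minimal sketch using the google-cloud-bigquery Python client that mirrors the steps above; the project, dataset, and bucket names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Describe the file in Cloud Storage as a one-column CSV
# with a field delimiter that never appears in the data.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-logs-bucket/access.log"]
external_config.schema = [bigquery.SchemaField("raw_data", "STRING")]
external_config.options.field_delimiter = "|"

# Create the external table; no data is loaded into BigQuery.
table = bigquery.Table("my-project.my_dataset.our_temp_table")
table.external_data_configuration = external_config
client.create_table(table)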

Parsing data

Now we are ready to parse our data and load it into the target structure. BigQuery string functions, especially REGEXP_EXTRACT, will be very helpful.

Use LIMIT when testing your query to save cost and time on large tables.

For our example file, we can write a query extracting the IP address and date:

SELECT
    REGEXP_EXTRACT(raw_data, r"\d+\.\d+\.\d+\.\d+") AS ip,
    PARSE_DATE("%e/%b/%Y", REGEXP_EXTRACT(raw_data, r"\[(\d+/\w+/\d+)")) AS date
FROM our_temp_table
LIMIT 100

with the result:

ip              date
65.55.213.74    2015-05-17
81.169.149.220  2015-05-17
65.55.213.74    2015-05-17
65.55.213.79    2015-05-17
65.55.213.79    2015-05-17
  1. When the query is ready, run it on the whole table (drop the LIMIT clause).
  2. Click Save results.
  3. Choose BigQuery table and give the table a name; this is the final destination for our structured data. The same step can be scripted, as in the sketch below.
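
The save step can also be done in code by running the parsing query with a destination table set; BigQuery then writes the structured rows there directly. A minimal sketch, again with placeholder project, dataset, and table names:

from google.cloud import bigquery

client = bigquery.Client()

# The parsing query from above, without the LIMIT clause.
sql = r"""
SELECT
    REGEXP_EXTRACT(raw_data, r"\d+\.\d+\.\d+\.\d+") AS ip,
    PARSE_DATE("%e/%b/%Y", REGEXP_EXTRACT(raw_data, r"\[(\d+/\w+/\d+)")) AS date
FROM my_dataset.our_temp_table
"""

# Write the query result straight into the target table.
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.parsed_logs",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(sql, job_config=job_config).result()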

Summary

In this short tutorial we loaded a single file. If you need to parse real-time data, you should consider using Cloud Logging. Check how I used it in my article about dealing with the pandemic with Google Cloud.

This post is licensed under CC BY 4.0 by the author.