title: "BigQuery Tutorial"
linktitle: "BigQuery"
date: 2020-08-10
weight: 2
The aim of this tutorial is to get you familiarized with BigQuery web UI to
query/filter/aggregate/export data.
This tutorial is based upon one from the Google Healthcare datathon repository.
It is written with the old BigQuery interface, and focuses on the MIMIC-III/eICU-CRD datasets.
PhysioNet does not cover the cost of queries against physionet-data
(though this cost is mostly trivial). In order to run queries, you will need
to configure a project for your account, which BigQuery can then use to bill
for your usage of the cloud platform. For more information on GCP projects,
see the documentation on creating and managing projects..
All PhysioNet data is hosted on the physionet-data
project. You will only have
read-access privileges to these datasets. As a result, if you would like to save
the results of any queries, you will need to save them to a dataset created
on your own project.
In this section we are going to run a query to briefly showcase BigQuery's
capability. The goal is to aggregate the mimic demo data to find out the
correlation between age and the average length of stay in hours in ICU.
Run the following query from BigQuery web interface (See "Executing Queries"
section above for how to access BigQuery web interface).
#standardSQL
WITH re AS (
SELECT
DATETIME_DIFF(icu.outtime, icu.intime, HOUR) AS icu_length_of_stay,
DATE_DIFF(DATE(icu.intime), DATE(pat.dob), YEAR) AS age
FROM `physionet-data.mimiciii_demo.icustays` AS icu
INNER JOIN `physionet-data.mimiciii_demo.patients` AS pat
ON icu.subject_id = pat.subject_id)
SELECT
age,
AVG(icu_length_of_stay) AS stay
FROM re
WHERE age < 100
GROUP BY age
ORDER BY age
You can download the returned result as a CSV file and genereate a chart with
your preferred tools.
The following is a scatter chart plotted from the result with Google Sheets.
Feel free to skip this section if you are already familiar with BigQuery.
A BigQuery table is uniquely identified by the three-layer hierarchy of project
ID, dataset ID and table name. For example in the following query:
SELECT
subject_id
FROM
`physionet-data.mimiciii_demo.icustays`
LIMIT 10
physionet-data.mimiciii_demo.icustays
specifies the table we are querying,
where physionet-data
is the project that hosts the datasets, mimiciii_demo
is the name of the dataset, and icustays
is the table name. Backticks (`) are
used as there is a non-standard character (-) in the project name. If the
dataset resides in the same project, you can safely omit the project name,
e.g.my-project.my_dataset.my_table
can be written asmy_dataset.my_table
instead.
BigQuery supports 2 SQL dialects, legacy and standard. During this datathon we
highly recommend using standard SQL dialect.
Follow the steps below to make sure the StandardSQL dialect is used:
Alternatively,
"#standardSQL" tag
can be prepended to each query to tell BigQuery the dialect you are using, which
is what we used in the TLDR section above.
As mentioned previously, the datasets are hosted in a different project, which
can be accessed
here.
On the left panel, you will see the mimiciii_demo
dataset, under which you
will see the table names.
To view the details of a table, simply click on it (for example the icustays
table). Then, on the right side of the window, you will have to option to see
the schema, metadata and preview of rows tabs.
Most of the following queries are adapted from the
MIMIC cohort selection tutorial.
Let's take a look at a few queries. To run the queries yourself, copy the SQL
statement to the input area on top of the web interface and click the red "RUN
QUERY" button.
SELECT
subject_id,
hadm_id,
icustay_id,
intime,
outtime,
DATETIME_DIFF(outtime, intime, DAY) AS icu_length_of_stay
FROM `physionet-data.mimiciii_demo.icustays`
Let's save the result of previous query to an intermediate table for later
analysis:
Now let's take a look at a query that requires table joining: include the
patient's age at the time of ICU admittance. This is computed by the date
difference in years between the ICU intime and the patient's date of birth. The
former is available in the icustays table, and the latter resides in the dob
column of the patients table.
SELECT
icu.subject_id,
icu.hadm_id,
icu.icustay_id,
pat.dob,
icu.icu_length_of_stay,
DATE_DIFF(DATE(icu.intime), DATE(pat.dob), YEAR) AS age
FROM `physionet-data.mimiciii_demo.patients` AS pat
INNER JOIN `temp.icustays` AS icu
ON icu.subject_id = pat.subject_id
Again, let's save the table as "pat_icustays" in the "temp" dataset for use
later. Briefly look at the age of patients when they are admitted with the
following query.
Now let's run the following query to produce data to generate a histrogram graph
to show the distribution of patient ages in ten-year buckets (i.e. [0, 10), [10,
20), ..., [90, ∞).
WITH bu AS (
SELECT
CAST(FLOOR(age / 10) AS INT64) AS bucket
FROM `temp.pat_icustays`)
SELECT
IF(bucket >= 9, ">= 90", FORMAT("%d - %d", bucket * 10, (bucket + 1) * 10)) AS age,
COUNT(bucket) AS total
FROM bu
GROUP BY bucket
ORDER BY bucket ASC
Now click "Save to Google Sheets" button and wait 1-2 seconds until a yellow
notification shows up, click "Click to view" which leads you to Google
Spreadsheet in a new browser window. As you can see, the data from our last
query is dumped into a spreadsheet. By clicking "Insert -> Chart" from the menu
bar on top, a nice histrogram graph is automatically created for us!
If you prefer using other tools to process the final result, a CSV file can be
downloaded by clicking the "Downed as CSV" button. If downloading fails because
the file is too large (we highly recommend aggregating the data to a small
enough result before downloading though), you can save it to a temporary table,
click the caret then "Export table" button from the dropdown menu and save it to
Google Cloud Storage, then you can download the file from
GCS.
Now let's see if there is correlation between age and average length of stay in
hours. Since we are using the age of patients when they get admitted, so we
don't need to worry about multiple admissions of patients. Note that we treat
the redacted ages (> 90) as noises and filter them out.
WITH re AS (
SELECT
DATETIME_DIFF(icu.outtime, icu.intime, HOUR) AS icu_length_of_stay,
DATE_DIFF(DATE(icu.intime), DATE(pat.dob), YEAR) AS age
FROM `physionet-data.mimiciii_demo.icustays` AS icu
INNER JOIN `physionet-data.mimiciii_demo.patients` AS pat
ON icu.subject_id = pat.subject_id)
SELECT
icu_length_of_stay AS stay,
age
FROM re
WHERE age < 100
Follow the same steps to save the result to Google Spreadsheet, by default a
linear chart is generate. We will need to change the chart type to scatter chart
through the chart editor on the right.
Datathon organizers might not allow you to create new tables. However, you can
save a view of a query's output to then use in later queries.
Create new dataset
. Give the dataset a temporary name that can beRun
Query
that says Save view
. Select the temporary dataset you created andproject.dataset.view
like the following:SQL
SELECT * FROM `datathon_project.team6temp.our_custom_view`;
The times in the tables are stored as DATETIME objects. This means you cannot
use operators like <, =, or > for comparing them.
Use the
DATETIME functionsin
BigQuery. An example would be if you were trying to find things within 1
hour of another event. In that case, you could use the native DATETIME_SUB()
function. In the example below, we are looking for stays of less than 1 hour
(where the admit time is less than 1 hour away from the discharge time).
[...] WHERE ADMITTIME BETWEEN DATETIME_SUB(DISCHTIME, INTERVAL 1 HOUR) AND
DISCHTIME
If you are more comfortable working with timestamps, you can cast the
DATETIME object to a TIMESTAMP object and then use the
TIMESTAMP functions.
There are a few cases where you may want to work with files outside of BigQuery.
Examples include importing your own custom Python library or saving a dataframe.
This tutorial covers
importing and exporting from local filesystem, Google Drive, Google Sheets, and
Google Cloud Storage.
Congratulations! You've finished the BigQuery web UI tutorial. In this tutorial
we demonstrated how to query, filter, aggregate data, and how to export the
result to different locations through BigQuery web UI. If you would like to
explore the real data, please use mimiciii_clinical
as the dataset name. For
example, the table mimiciii_demo.icustays
becomes mimiciii_clinical.icustays
when you need the actual MIMIC data. Please take a look at more comprehensive
examples here such as creating charts and training
machine learning models in an interactive way (or copy the queries over to web
UI to execute if you prefer) if you are interested.