How to Run SQL in Python: Connecting to PostgreSQL and BigQuery

Kelly Adams
May 22, 2024
5 min read

Updated: May 26, 2024

I’ve been wanting to use SQL in combination with Python for a while. As a data analyst I use SQL all the time and it’s great for getting data from a large database. But it’s not always the best at complicated analysis. I’ve been reviewing Python lately and I’ve used it for things like linear regression. But I wasn’t sure how to actually use SQL within Python.

Why? Before I was (inefficiently) loading my data into my Python scripts. I manually ran the SQL queries, exported the results as a CSV file, then imported the CSV file into a Python script. It took a while and it was a pain to update if my query/CSV file changed.

I recently figured out how to use Python to get data directly from our database, load it into a pandas DataFrame, and work with that DataFrame, all in one script. This is what I’ll be going into in my blog post.

To get the final code check out my Github repository for it: run-sql-in-python.

Notes

Before we start here are a few important things to note:

I am not going over the basics of Python like how to use Anaconda Distribution, etc. (hint, I’m coming out with a course later this month that will go into those details). This article assumes you already have a basic knowledge of SQL & Python.
You need to have a database already established. In my case I use both PostgreSQL and Google BigQuery frequently. So these are the two types of databases we’ll connect to.

If you’re wondering why I used these specific libraries it’s because they are common, widely used, and I’m familiar with most of them.

The Git/Github is optional but it’s good practice to use something to track changes. I won’t go into how Git/Github works. But all of this is in a repo that I’m working with.
All examples in this blog post use simplified, fictional data for educational purposes only. In the real-world, tables typically contain many more rows and columns. But, I have simplified the data here to focus understanding and to protect private information.

Anyway onto what you’ll need for this.

Requirements

Here's what you'll need before starting:

IDE: Visual Studio Code
Database: PostgreSQL or Google BigQuery
Package Manager: Anaconda Distribution
Version Control: Git/Github (Optional)

Code

Below are the two scripts I created:

Connect to a PostgreSQL database
Connect to BigQuery database.

I am connecting to these two databases in two separate Python scripts.

PostgreSQL

Below is the Python script which I use to connect to a PostgreSQL database.

Libraries

Import libraries.

pyscopg2 and sqlalchemy: Connect to our PostgreSQL database
- psycorg2 - handles PostgreSQL interaction
- sqlalchemy - is the actual interface and has more functionality
pandas : To load the query results into a Pandas DataFrame.
google.cloud: Use Secrets Manager
- Manage credentials (e.g. database password) using Secrets Manager
- Essential for security.

Process

First we set up the environment with the necessary Python libraries.

import psycopg2
from google.cloud import secretmanager
import pandas as pd
from sqlalchemy import create_engine

Then we get set up the secretmanager from Google. We initialize the secret manager client and then define get the secrets, in my case the database password that’s stored in the secret_string_key .

project_id = 'project-id-123'

# Secret Manager client
secretmanager_client = secretmanager.SecretManagerServiceClient()

# Secret names
secret_name_key = "SERVICE_ACCOUNT_KEY"

# Retrieve secrets from Secret Manager
request_key = {"name": f"projects/{project_id}/secrets/{secret_name_key}/versions/latest"}
response_key = secretmanager_client.access_secret_version(request_key)
secret_string_key = response_key.payload.data.decode("UTF-8")

Next, connect to the PostgreSQL database using connection parameters and our credentials. This prevents us from hardcoding sensitive information in the script.

# Connect to with credentials 
conn_params = {
    "host": "host-name",
    "database": "database-name",
    "user": "user-name",
    "password": secret_string_key}

We create a URI (using SQLAlchemy) to connect to the database. And we create a SQLALchemy engine to write SQL and get the query results.

# Construct database connection string for SQLAlchemy
db_uri = f"postgresql+psycopg2://{conn_params['user']}:{conn_params['password']}@{conn_params['host']}/{conn_params['database']}"

# Create SQLAlchemy engine
engine = create_engine(db_uri)

In a try block we write the actual query as a multi-line string for readability. In this simple example we are getting the id and timestamp from a fictional customers table where the timestamp is the date of: 2024-05-01. We also parameterize the query to improve security and prevent SQL injection attacks. But you can replace this with any query.

Then we use pandas to read the SQL (read_sql_query) query results which uses the connection created by SQLAlchemy engine to execute the SQL query. Finally it returns the results in a DataFrame called df .

try:
    # Execute SQL query using SQLAlchemy engine with parameterization
    query = """
        SELECT id, timestamp
        FROM customers
        WHERE timestamp::date = %(date)s
        """
    df = pd.read_sql_query(query, engine, params={'date': '2024-05-14'})
    
    # Print query results
    print(df)

In the except block we use it to handle exceptions that might occur during the database connection or when we run the query. If there’s an error it prints the error message. It’s good for debugging the code.

except Exception as e:
    # Handle any exceptions
    print("An error occurred:", e)

Final Code

To view the final code in Github: postgresql.py.

import psycopg2
from google.cloud import secretmanager
import pandas as pd
from sqlalchemy import create_engine

project_id = 'project-id-123'

# Secret Manager client
secretmanager_client = secretmanager.SecretManagerServiceClient()

# Secret names
secret_name_key = "SERVICE_ACCOUNT_KEY"

# Retrieve secrets from Secret Manager
request_key = {"name": f"projects/{project_id}/secrets/{secret_name_key}/versions/latest"}
response_key = secretmanager_client.access_secret_version(request_key)
secret_string_key = response_key.payload.data.decode("UTF-8")  

# Connect to DB
conn_params = {
    "host": "host-name",
    "database": "database-name",
    "user": "user-name",
    "password": secret_string_key}

# Construct database connection string for SQLAlchemy
db_uri = f"postgresql+psycopg2://{conn_params['user']}:{conn_params['password']}@{conn_params['host']}/{conn_params['database']}"

# Create SQLAlchemy engine
engine = create_engine(db_uri)

try:
    # Execute SQL query using SQLAlchemy engine
    query = """
        SELECT id, timestamp
        FROM customers
        WHERE timestamp::date = '2024-05-14'
        """
    df = pd.read_sql_query(query, engine)
    
    # Print query results
    print(df)

except Exception as e:
    # Handle any exceptions
    print("An error occurred:", e)

BigQuery

Below is the Python script which I use to connect to a BigQuery database.

Libraries

Import libraries and modules.

pandas: To load the query results into a Pandas DataFrame.
google.cloud : Use BigQuery to run a query in BigQuery

Process

First we set up the environment with the necessary Python libraries.

import pandas as pd
from google.cloud import bigquery

Then we initialize the BigQuery client.

# BigQuery client
client = bigquery.Client()

We write the actual query as a multi-line string for readability. In this simple example we are getting the id and timestamp from a fictional customers table (located in BigQuery). But you can replace this with any query.

# Perform a query
query = """
    SELECT id, timestamp 
    FROM `project-id-123.database-name.customers`
"""

In a try block we’ll use the BigQuery client to execute the query in our BigQuery database. This doesn’t use any query parameterization because we are not taking any user inputs. Then use pandas to save those results to a DataFrame named df . Finally, we print out the df .

try:
    query_job = client.query(query)
    df = query_job.to_dataframe()

    # Print DataFrame
    print(df)

except Exception as e:
    # Handle any exceptions
    print("An error occurred during query execution:", e)

Final Code

To view the final code in Github: bigquery.py.

import pandas as pd
from google.cloud import bigquery

# BigQuery client
client = bigquery.Client()

# Perform a query
query = """
    SELECT id, timestamp 
    FROM `project-id-123.database-name.customers`
"""

try:
    query_job = client.query(query)
    df = query_job.to_dataframe()

    # Print DataFrame
    print(df)

except Exception as e:
    # Handle any exceptions
    print("An error occurred during query execution:", e)

Conclusion

In conclusion, integrating SQL with Python provides an efficient way to interact with the database directly within Python scripts. This approach not only streamlines the data retrieval process by eliminating the need to manually export and import data but is the first step for more advanced analysis in Python.