top of page
  • Kelly Adams

How I Saved 10 Minutes with a Python Script

Updated: Jun 21

I’ve been using Python more and more. I recently learned how to write SQL queries directly in a Python script (view that blog post here). I knew my next project would be more in-depth. So I took a look and tried to see what I could automate. My first thought was a weekly executive report I

generate. It gives an overview (financially) of the health of the company. But I didn’t dive in. I first: 


  1. Focused on understanding the business: I spent time learning why the report is essential, identifying key KPIs, and what to focus on.

  1. Determined whether automation was worth it: Thought about ROI and considered whether the time and effort would be worth saving 10 minutes a week.


In this case it was worth it. I can now use this knowledge to automate even more and created more advanced scripts. In this blog post, I’ll be going over my process of how I used SQL & Python to automate this process. 


Link to my Github repository here: bigquery-python-weekly-report


Tools Used

Here’s what I used for this project:



Notes


  • I am not going over the basics of Python like how to use Anaconda Distribution. I’m coming out with a course later this month that will go into those details. This article assumes you already have a basic knowledge of SQL & Python. 

  • All examples in this blog post use simplified, fictional data for educational purposes only. In the real-world, tables typically contain many more rows and columns. But, the code here is to focus understanding and to protect private information.


Code Explanation


Libraries 

Import the following libraries:

  • os: To create files and folders on my computer. 

  • pandas: To load the query results into a Pandas DataFrame.

  • google.cloud : Use BigQuery to run a query in BigQuery


Process

First I set up the environment with the necessary Python libraries. 


import os
import pandas as pd
from google.cloud import bigquery

Then I initialize the BigQuery client. 

# Initialize BigQuery client to interact with the database
client = bigquery.Client()

I set some variables that I will use to get not only filter the query (by dates) but also is used to name the CSV file and folder. 


# Define start and end date for the report period
start_date = '2024-05-13'
end_date = '2024-05-19'
# Calculate week number for folder naming
week_num = 20

I use a try  block to format the start_date  and end_date  into the the format I want for the CSV file and folder. Which turns 2024-05-13 to 5-13-24. 


# Format dates to more readable form for folder creation
try:
    start_date_components = start_date.split('-')
    start_date_formatted = f"{int(start_date_components[1]):g}-{int(start_date_components[2]):g}-{start_date_components[0][-2:]}"

    end_date_components = end_date.split('-')
    end_date_formatted = f"{int(end_date_components[1]):g}-{int(end_date_components[2]):g}-{end_date_components[0][-2:]}"

Then I use these variables to create a new folder in my computer. I also define the folder path. I also have an if  statement to check if this main folder exists, if it doesn’t then create it. 


The main folder would be formatted as: 20 5-13-24 to 5-19-24. The script would print out:

Folder created at: computer\file\path.
    folder_name = f"{week_num} {start_date_formatted} to {end_date_formatted}"

    # Define path for main report folder and ensure it's created
    main_folder_path = os.path.join(r'computer\file\path', folder_name)
    if not os.path.exists(main_folder_path):
        os.makedirs(main_folder_path)
    print(f"Folder created at: {main_folder_path}")

Finally, for this try block I also have except  to handle possible errors and return the issues.  

except ValueError as ve:
    print(f"Error processing date formats: {ve}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Then we move onto the actual query. This is getting customer transactions and company expenses between the dates I provided above. This is in start_date  using @start  and end_date  using @end  (more on why I’m using this  @  later). As a reminder the actual query I use for this report is more complicated and involved. But for demonstration purposes this is a simpler query.

 

# Define SQL query using parameters for flexibility and security
query = """
WITH transactions AS (
    SELECT 
        DATE(timestamp, 'America/New_York') as purchase_date, 
        SUM(amount_usd) as total_purchases
    FROM 
        `project-id.database.transactions` t
    WHERE 
        DATE(timestamp, 'America/New_York') BETWEEN @start AND @end
    GROUP BY 
        purchase_date
),
expenses as(
    SELECT 
        DATE(timestamp, 'America/New_York') as expense_date, 
        SUM(expenses) as total_expenses
    FROM 
        `project-id.database.expenses` t
    WHERE 
        DATE(timestamp, 'America/New_York') BETWEEN @start AND @end
    GROUP BY expense_date
)
SELECT 
    day, 
    FORMAT_DATE('%a', day) as day_of_the_week, 
    t.total_purchases, 
    e.total_expenses
FROM 
    UNNEST(GENERATE_DATE_ARRAY(DATE(@end), DATE(@start), INTERVAL -1 DAY)) AS day
    LEFT JOIN transactions t ON t.purchase_date = day
    LEFT JOIN expenses e ON e.expense_date = day
WHERE 
    day BETWEEN @start AND @end
ORDER BY 
    day
"""

Then I set up the query parameters to actually use the start_date  and end_date  but as strings and label them as start  and end  respectively. 


# Configure query parameters to safeguard against SQL injection and provide flexibility
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("start", "STRING", start_date),
        bigquery.ScalarQueryParameter("end", "STRING", end_date)
    ]
)

If I ran this query the results (showing only the first 5 rows) would be:

day

day_of_the_week

total_purchases

expenses

2024-05-13

mon

10000

5000

2024-05-14

tue

15000

2000

2024-05-15

wed

10000

3000

2024-05-16

thu

15000

4000

2024-05-17

fri

20000

5000

In a try block I execute the query and load the query results into a pandas DataFrame called df . 

try:
    # Execute the query and load results into a pandas DataFrame
    query_job = client.query(query, job_config=job_config)
    df = query_job.to_dataframe()

I create and apply a function to create a new column with a calculation. In this example it’s to calculate the revenue which is total_purchases  (income) - total_expenses . It sets total_purchases  to the total_purchases row and total_expenses  to the total_expenses  row. Note: In theory you can do this in your SQL query. But the reason why I don’t in my actual report is because the final calculation is dependent on other calculations I do in the query. But for demonstration purposes I’ve kept it simple. 


Anyways I’ve also added an if  statement to check if there’s no data for total_purchases  or total_expenses , it returns None. The function returns total_purchases  - total_expenses. And I apply that function to my DataFrame by creating a new column called revenue . 


    # Function to calculate net revenue from purchases and expenses
    def calculate_revenue(row):
        total_purchases = row['total_purchases']
        total_expenses = row['total_expenses']
        if total_purchases is None or total_expenses is None:
            return None
        return total_purchases - total_expenses

    # Apply the custom function to create a new column
    df['revenue'] = df.apply(calculate_revenue, axis=1)

Then I print this DataFrame. 

  # Print DataFrame
    print("Query results:")
    print(df)

This would print out:

Query Results:  

day

day_of_the_week

total_purchases

expenses

revenue

2024-05-13

mon

10000

5000

5000

2024-05-14

tue

15000

2000

13000

2024-05-15

wed

10000

3000

7000

2024-05-16

thu

15000

4000

11000

2024-05-17

fri

20000

5000

15000

2024-05-18

sat

15000

4000

11000

2024-05-19

sun

10000

3000

7000


I also create a new folder to have this CSV file within the main folder I created above. I have an if  statement to check if the CSV subfolder exists or not. Then I give the file name for the CSV file and I set the file path for the CSV file using the csv_folder_path  and csv_file_name. 


    # Setup paths and export results to CSV
    csv_folder_path = os.path.join(main_folder_path, 'CSV')

    # Create the CSV subfolder if it doesn't exist
    if not os.path.exists(csv_folder_path):
        os.makedirs(csv_folder_path)

    # Define the filename for the CSV file
    csv_file_name = f'Report Tables {start_date_formatted} to {end_date_formatted}.csv'

    # Combine the CSV folder path and filename for the CSV file
    csv_file_path = os.path.join(csv_folder_path, csv_file_name)

I export the DataFrame as a CSV file into that CSV subfolder. I also print information about the CSV file and where it was saved. 

This would print out: 


CSV file: Report Tables 5-13-24 to 5-19-24.csv exported successfully to: computer\file\path
    # Export DataFrame to CSV file
    df.to_csv(csv_file_path, index=False)

    print(f"CSV file: {csv_file_name} exported successfully to: {csv_file_path}")

Finally, I have an except  block to handle any potential errors. 

eexcept Exception as e:
    print("An error occurred during query execution:", e)

Final Code

Below is the final code. View the final code in my Github file: weekly_sales_report.py. Note: the code snippets above have more comments (for explanation purposes) than the final code in my Github repo. 

import os
import pandas as pd
from google.cloud import bigquery

# Initialize BigQuery client to interact with the database
client = bigquery.Client()

# Define start and end date for the report period
start_date = '2024-05-13'
end_date = '2024-05-19'
# Calculate week number for folder naming
week_num = 20

# Format dates to more readable form for folder creation
try:
    start_date_components = start_date.split('-')
    start_date_formatted = f"{int(start_date_components[1]):g}-{int(start_date_components[2]):g}-{start_date_components[0][-2:]}"

    end_date_components = end_date.split('-')
    end_date_formatted = f"{int(end_date_components[1]):g}-{int(end_date_components[2]):g}-{end_date_components[0][-2:]}"

    folder_name = f"{week_num} {start_date_formatted} to {end_date_formatted}"

   # Define path for main report folder and ensure it's created
    main_folder_path = os.path.join(r'computer\file\path', folder_name)
    if not os.path.exists(main_folder_path):
        os.makedirs(main_folder_path)
    print(f"Folder created at: {main_folder_path}")
except ValueError as ve:
    print(f"Error processing date formats: {ve}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

# Define SQL query using parameters for flexibility and security
query = """
WITH transactions AS (
    SELECT 
        DATE(timestamp, 'America/New_York') as purchase_date, 
        SUM(amount_usd) as total_purchases
    FROM 
        `project-id.database.transactions` t
    WHERE 
        DATE(timestamp, 'America/New_York') BETWEEN @start AND @end
    GROUP BY 
        purchase_date
),
expenses as(
    SELECT 
        DATE(timestamp, 'America/New_York') as expense_date, 
        SUM(expenses) as total_expenses
    FROM 
        `project-id.database.expenses` t
    WHERE 
        DATE(timestamp, 'America/New_York') BETWEEN @start AND @end
    GROUP BY expense_date
)
SELECT 
    day, 
    FORMAT_DATE('%a', day) as day_of_the_week, 
    t.total_purchases, 
    e.total_expenses
FROM 
    UNNEST(GENERATE_DATE_ARRAY(DATE(@end), DATE(@start), INTERVAL -1 DAY)) AS day
    LEFT JOIN transactions t ON t.purchase_date = day
    LEFT JOIN expenses e ON e.expense_date = day
WHERE 
    day BETWEEN @start AND @end
ORDER BY 
    day
"""
# Configure query parameters to safeguard against SQL injection and provide flexibility
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("start", "STRING", start_date),
        bigquery.ScalarQueryParameter("end", "STRING", end_date)
    ]
)

try:
    # Execute the query and load results into a pandas DataFrame
    query_job = client.query(query, job_config=job_config)
    df = query_job.to_dataframe()

    # Function to calculate net revenue from purchases and expenses
    def calculate_revenue(row):
        total_purchases = row['total_purchases']
        total_expenses = row['total_expenses']
        if total_purchases is None or total_expenses is None:
            return None
        return total_purchases - total_expenses

    # Apply the custom function to create a new column
    df['revenue'] = df.apply(calculate_revenue, axis=1)

    print("Query results:")
    print(df)

    # Setup paths and export results to CSV for further analysis and reporting
    csv_folder_path = os.path.join(main_folder_path, 'CSV')
    if not os.path.exists(csv_folder_path):
        os.makedirs(csv_folder_path)

    csv_file_name = f'Report Tables {start_date_formatted} to {end_date_formatted}.csv'
    csv_file_path = os.path.join(csv_folder_path, csv_file_name)
    df.to_csv(csv_file_path, index=False)

    print(f"CSV file: {csv_file_name} exported successfully to: {csv_file_path}")
except Exception as e:
    print("An error occurred during query execution:", e)

Reasoning

I wanted to make a few notes with my reasoning: 


  • I used these specific libraries because I’m familiar with them, they’re easy to use, and widely supported.  

  • I saved the file in a CSV sub-folder because I save this as an Excel file in the main folder. The the reason why I don't save the DataFrame as an Excel file is the date formatting doesn’t turn out correct this way. I have to also create some charts in Excel for an executive report every week. 

  • Again the actual query and calculations in my script are more complicated. But the purpose of this blog post is to go over the concepts so it’s been changed to focus on learning (and for data privacy reasons). 


Future Features

A few things I would want for the future:


  • Figure out a way to export this to an Excel sheet (with the dates fixed). 

  • Use this Excel sheet to update another Excel sheet automatically. 

  • Save the charts and graphs that’s created to the main folder.


As of 6/21/24 I actually did these two things see post: How I Automated My Weekly Python Script.

  • Schedule this script to run this automatically every week.

  • Update the dates and week number automatically. 


What I Learned

This was a great experience and I honestly learned so much. Not only the technical aspects of using SQL & Python together. But understanding the importance of having a clear goal, starting small, and writing the code in a way that it can be scaled. 


Conclusion 

This CSV file used to write a weekly report. Before it was a very manual process and while I still have to do other things manually, this saves some time. While this only saves about 10 minutes a week (8 hours a year), it was a great learning experience. And a good way for me to begin using Python to automate more processes. Hopefully this helps give you ideas on what you can automate using Python. This is only the start of of my journey using SQL & Python together, can’t wait to see what more I can do! 

Comentarios


bottom of page