I’ve been using Python more and more. I recently learned how to write SQL queries directly in a Python script (view that blog post here). I knew my next project would be more in-depth. So I took a look and tried to see what I could automate. My first thought was a weekly executive report I
generate. It gives an overview (financially) of the health of the company. But I didn’t dive in. I first:
Focused on understanding the business: I spent time learning why the report is essential, identifying key KPIs, and what to focus on.
Determined whether automation was worth it: Thought about ROI and considered whether the time and effort would be worth saving 10 minutes a week.
In this case it was worth it. I can now use this knowledge to automate even more and created more advanced scripts. In this blog post, I’ll be going over my process of how I used SQL & Python to automate this process.
Link to my Github repository here: bigquery-python-weekly-report.
Tools Used
Here’s what I used for this project:
IDE: Visual Studio Code
Database: Google BigQuery
Package Manager: Anaconda Distribution
Version Control: Git/Github
Notes
I am not going over the basics of Python like how to use Anaconda Distribution. I’m coming out with a course later this month that will go into those details. This article assumes you already have a basic knowledge of SQL & Python.
All examples in this blog post use simplified, fictional data for educational purposes only. In the real-world, tables typically contain many more rows and columns. But, the code here is to focus understanding and to protect private information.
Code Explanation
Libraries
Import the following libraries:
os: To create files and folders on my computer.
pandas: To load the query results into a Pandas DataFrame.
google.cloud : Use BigQuery to run a query in BigQuery
Process
First I set up the environment with the necessary Python libraries.
import os
import pandas as pd
from google.cloud import bigquery
Then I initialize the BigQuery client.
# Initialize BigQuery client to interact with the database
client = bigquery.Client()
I set some variables that I will use to get not only filter the query (by dates) but also is used to name the CSV file and folder.
# Define start and end date for the report period
start_date = '2024-05-13'
end_date = '2024-05-19'
# Calculate week number for folder naming
week_num = 20
I use a try block to format the start_date and end_date into the the format I want for the CSV file and folder. Which turns 2024-05-13 to 5-13-24.
# Format dates to more readable form for folder creation
try:
start_date_components = start_date.split('-')
start_date_formatted = f"{int(start_date_components[1]):g}-{int(start_date_components[2]):g}-{start_date_components[0][-2:]}"
end_date_components = end_date.split('-')
end_date_formatted = f"{int(end_date_components[1]):g}-{int(end_date_components[2]):g}-{end_date_components[0][-2:]}"
Then I use these variables to create a new folder in my computer. I also define the folder path. I also have an if statement to check if this main folder exists, if it doesn’t then create it.
The main folder would be formatted as: 20 5-13-24 to 5-19-24. The script would print out:
Folder created at: computer\file\path.
folder_name = f"{week_num} {start_date_formatted} to {end_date_formatted}"
# Define path for main report folder and ensure it's created
main_folder_path = os.path.join(r'computer\file\path', folder_name)
if not os.path.exists(main_folder_path):
os.makedirs(main_folder_path)
print(f"Folder created at: {main_folder_path}")
Finally, for this try block I also have except to handle possible errors and return the issues.
except ValueError as ve:
print(f"Error processing date formats: {ve}")
except Exception as e:
print(f"An unexpected error occurred: {e}")
Then we move onto the actual query. This is getting customer transactions and company expenses between the dates I provided above. This is in start_date using @start and end_date using @end (more on why I’m using this @ later). As a reminder the actual query I use for this report is more complicated and involved. But for demonstration purposes this is a simpler query.
# Define SQL query using parameters for flexibility and security
query = """
WITH transactions AS (
SELECT
DATE(timestamp, 'America/New_York') as purchase_date,
SUM(amount_usd) as total_purchases
FROM
`project-id.database.transactions` t
WHERE
DATE(timestamp, 'America/New_York') BETWEEN @start AND @end
GROUP BY
purchase_date
),
expenses as(
SELECT
DATE(timestamp, 'America/New_York') as expense_date,
SUM(expenses) as total_expenses
FROM
`project-id.database.expenses` t
WHERE
DATE(timestamp, 'America/New_York') BETWEEN @start AND @end
GROUP BY expense_date
)
SELECT
day,
FORMAT_DATE('%a', day) as day_of_the_week,
t.total_purchases,
e.total_expenses
FROM
UNNEST(GENERATE_DATE_ARRAY(DATE(@end), DATE(@start), INTERVAL -1 DAY)) AS day
LEFT JOIN transactions t ON t.purchase_date = day
LEFT JOIN expenses e ON e.expense_date = day
WHERE
day BETWEEN @start AND @end
ORDER BY
day
"""
Then I set up the query parameters to actually use the start_date and end_date but as strings and label them as start and end respectively.
# Configure query parameters to safeguard against SQL injection and provide flexibility
job_config = bigquery.QueryJobConfig(
query_parameters=[
bigquery.ScalarQueryParameter("start", "STRING", start_date),
bigquery.ScalarQueryParameter("end", "STRING", end_date)
]
)
If I ran this query the results (showing only the first 5 rows) would be:
day | day_of_the_week | total_purchases | expenses |
2024-05-13 | mon | 10000 | 5000 |
2024-05-14 | tue | 15000 | 2000 |
2024-05-15 | wed | 10000 | 3000 |
2024-05-16 | thu | 15000 | 4000 |
2024-05-17 | fri | 20000 | 5000 |
In a try block I execute the query and load the query results into a pandas DataFrame called df .
try:
# Execute the query and load results into a pandas DataFrame
query_job = client.query(query, job_config=job_config)
df = query_job.to_dataframe()
I create and apply a function to create a new column with a calculation. In this example it’s to calculate the revenue which is total_purchases (income) - total_expenses . It sets total_purchases to the total_purchases row and total_expenses to the total_expenses row. Note: In theory you can do this in your SQL query. But the reason why I don’t in my actual report is because the final calculation is dependent on other calculations I do in the query. But for demonstration purposes I’ve kept it simple.
Anyways I’ve also added an if statement to check if there’s no data for total_purchases or total_expenses , it returns None. The function returns total_purchases - total_expenses. And I apply that function to my DataFrame by creating a new column called revenue .
# Function to calculate net revenue from purchases and expenses
def calculate_revenue(row):
total_purchases = row['total_purchases']
total_expenses = row['total_expenses']
if total_purchases is None or total_expenses is None:
return None
return total_purchases - total_expenses
# Apply the custom function to create a new column
df['revenue'] = df.apply(calculate_revenue, axis=1)
Then I print this DataFrame.
# Print DataFrame
print("Query results:")
print(df)
This would print out:
Query Results:
day | day_of_the_week | total_purchases | expenses | revenue |
2024-05-13 | mon | 10000 | 5000 | 5000 |
2024-05-14 | tue | 15000 | 2000 | 13000 |
2024-05-15 | wed | 10000 | 3000 | 7000 |
2024-05-16 | thu | 15000 | 4000 | 11000 |
2024-05-17 | fri | 20000 | 5000 | 15000 |
2024-05-18 | sat | 15000 | 4000 | 11000 |
2024-05-19 | sun | 10000 | 3000 | 7000 |
I also create a new folder to have this CSV file within the main folder I created above. I have an if statement to check if the CSV subfolder exists or not. Then I give the file name for the CSV file and I set the file path for the CSV file using the csv_folder_path and csv_file_name.
# Setup paths and export results to CSV
csv_folder_path = os.path.join(main_folder_path, 'CSV')
# Create the CSV subfolder if it doesn't exist
if not os.path.exists(csv_folder_path):
os.makedirs(csv_folder_path)
# Define the filename for the CSV file
csv_file_name = f'Report Tables {start_date_formatted} to {end_date_formatted}.csv'
# Combine the CSV folder path and filename for the CSV file
csv_file_path = os.path.join(csv_folder_path, csv_file_name)
I export the DataFrame as a CSV file into that CSV subfolder. I also print information about the CSV file and where it was saved.
This would print out:
CSV file: Report Tables 5-13-24 to 5-19-24.csv exported successfully to: computer\file\path
# Export DataFrame to CSV file
df.to_csv(csv_file_path, index=False)
print(f"CSV file: {csv_file_name} exported successfully to: {csv_file_path}")
Finally, I have an except block to handle any potential errors.
eexcept Exception as e:
print("An error occurred during query execution:", e)
Final Code
Below is the final code. View the final code in my Github file: weekly_sales_report.py. Note: the code snippets above have more comments (for explanation purposes) than the final code in my Github repo.
import os
import pandas as pd
from google.cloud import bigquery
# Initialize BigQuery client to interact with the database
client = bigquery.Client()
# Define start and end date for the report period
start_date = '2024-05-13'
end_date = '2024-05-19'
# Calculate week number for folder naming
week_num = 20
# Format dates to more readable form for folder creation
try:
start_date_components = start_date.split('-')
start_date_formatted = f"{int(start_date_components[1]):g}-{int(start_date_components[2]):g}-{start_date_components[0][-2:]}"
end_date_components = end_date.split('-')
end_date_formatted = f"{int(end_date_components[1]):g}-{int(end_date_components[2]):g}-{end_date_components[0][-2:]}"
folder_name = f"{week_num} {start_date_formatted} to {end_date_formatted}"
# Define path for main report folder and ensure it's created
main_folder_path = os.path.join(r'computer\file\path', folder_name)
if not os.path.exists(main_folder_path):
os.makedirs(main_folder_path)
print(f"Folder created at: {main_folder_path}")
except ValueError as ve:
print(f"Error processing date formats: {ve}")
except Exception as e:
print(f"An unexpected error occurred: {e}")
# Define SQL query using parameters for flexibility and security
query = """
WITH transactions AS (
SELECT
DATE(timestamp, 'America/New_York') as purchase_date,
SUM(amount_usd) as total_purchases
FROM
`project-id.database.transactions` t
WHERE
DATE(timestamp, 'America/New_York') BETWEEN @start AND @end
GROUP BY
purchase_date
),
expenses as(
SELECT
DATE(timestamp, 'America/New_York') as expense_date,
SUM(expenses) as total_expenses
FROM
`project-id.database.expenses` t
WHERE
DATE(timestamp, 'America/New_York') BETWEEN @start AND @end
GROUP BY expense_date
)
SELECT
day,
FORMAT_DATE('%a', day) as day_of_the_week,
t.total_purchases,
e.total_expenses
FROM
UNNEST(GENERATE_DATE_ARRAY(DATE(@end), DATE(@start), INTERVAL -1 DAY)) AS day
LEFT JOIN transactions t ON t.purchase_date = day
LEFT JOIN expenses e ON e.expense_date = day
WHERE
day BETWEEN @start AND @end
ORDER BY
day
"""
# Configure query parameters to safeguard against SQL injection and provide flexibility
job_config = bigquery.QueryJobConfig(
query_parameters=[
bigquery.ScalarQueryParameter("start", "STRING", start_date),
bigquery.ScalarQueryParameter("end", "STRING", end_date)
]
)
try:
# Execute the query and load results into a pandas DataFrame
query_job = client.query(query, job_config=job_config)
df = query_job.to_dataframe()
# Function to calculate net revenue from purchases and expenses
def calculate_revenue(row):
total_purchases = row['total_purchases']
total_expenses = row['total_expenses']
if total_purchases is None or total_expenses is None:
return None
return total_purchases - total_expenses
# Apply the custom function to create a new column
df['revenue'] = df.apply(calculate_revenue, axis=1)
print("Query results:")
print(df)
# Setup paths and export results to CSV for further analysis and reporting
csv_folder_path = os.path.join(main_folder_path, 'CSV')
if not os.path.exists(csv_folder_path):
os.makedirs(csv_folder_path)
csv_file_name = f'Report Tables {start_date_formatted} to {end_date_formatted}.csv'
csv_file_path = os.path.join(csv_folder_path, csv_file_name)
df.to_csv(csv_file_path, index=False)
print(f"CSV file: {csv_file_name} exported successfully to: {csv_file_path}")
except Exception as e:
print("An error occurred during query execution:", e)
Reasoning
I wanted to make a few notes with my reasoning:
I used these specific libraries because I’m familiar with them, they’re easy to use, and widely supported.
I saved the file in a CSV sub-folder because I save this as an Excel file in the main folder. The the reason why I don't save the DataFrame as an Excel file is the date formatting doesn’t turn out correct this way. I have to also create some charts in Excel for an executive report every week.
Again the actual query and calculations in my script are more complicated. But the purpose of this blog post is to go over the concepts so it’s been changed to focus on learning (and for data privacy reasons).
Future Features
A few things I would want for the future:
Figure out a way to export this to an Excel sheet (with the dates fixed).
Use this Excel sheet to update another Excel sheet automatically.
Save the charts and graphs that’s created to the main folder.
As of 6/21/24 I actually did these two things see post: How I Automated My Weekly Python Script.
Schedule this script to run this automatically every week.
Update the dates and week number automatically.
What I Learned
This was a great experience and I honestly learned so much. Not only the technical aspects of using SQL & Python together. But understanding the importance of having a clear goal, starting small, and writing the code in a way that it can be scaled.
Conclusion
This CSV file used to write a weekly report. Before it was a very manual process and while I still have to do other things manually, this saves some time. While this only saves about 10 minutes a week (8 hours a year), it was a great learning experience. And a good way for me to begin using Python to automate more processes. Hopefully this helps give you ideas on what you can automate using Python. This is only the start of of my journey using SQL & Python together, can’t wait to see what more I can do!
Comentarios