Introduction¶
Coffee Bliss is a popular, independently owned coffee shop, known for its high-quality coffee and cozy atmosphere. As the business grows, the owners recognize the need for data-driven decision-making to optimize sales, improve inventory management, and enhance the customer experience. By analyzing sales data, Coffee Bliss aims to uncover actionable insights that will drive growth and increase profitability.
Key stakeholders¶
The main stakeholders in this project are:
- Shop Owners: Looking to optimize sales strategies and increase profitability.
- Marketing Team: Interested in understanding customer behavior and tailoring marketing efforts accordingly.
- Operations Team: Aiming to improve inventory management and streamline operations.
Objectives¶
- Analyze Sales Trends: Examine sales data to identify significant trends, patterns, and anomalies in transaction volumes, product popularity, and seasonal variations.
- Understand Customer Behavior: Explore customer purchasing habits and preferences to better align offerings with customer needs.
- Set KPIs: Define and monitor essential metrics like daily sales and average transaction value to track business performance.
- Provide Actionable Insights: Offer insights and recommendations based on the EDA to help improve sales strategies and operational efficiency.
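The two KPIs named above (daily sales and average transaction value) can be computed directly from a transactions table. A minimal sketch on a toy frame with hypothetical values, mirroring the columns used later in the analysis:

```python
import pandas as pd

# Toy transactions frame with the columns the KPIs need (hypothetical values)
tx = pd.DataFrame({
    'transaction_date': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-02']),
    'Revenue': [6.0, 4.0, 5.0],
})

# Daily sales: total revenue per calendar day
daily_sales = tx.groupby('transaction_date')['Revenue'].sum()

# Average transaction value: mean revenue per transaction row
avg_transaction_value = tx['Revenue'].mean()
```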
Dataset¶
The dataset is publicly available on Kaggle.
Data Cleaning and Preparation¶
Importing necessary libraries and the dataset¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Importing data from the excel file
df = pd.read_excel(r'Coffee Shop Sales.xlsx', sheet_name = 'Transactions')
df.head()
transaction_id | transaction_date | transaction_time | transaction_qty | store_id | store_location | product_id | unit_price | product_category | product_type | product_detail | Revenue | Month | Month.1 | Weekday | Weekday.1 | Hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2023-01-01 | 07:06:11 | 2 | 5 | Lower Manhattan | 32 | 3.0 | Coffee | Gourmet brewed coffee | Ethiopia Rg | 6.0 | 1 | Jan | 7 | Sun | 7 |
1 | 2 | 2023-01-01 | 07:08:56 | 2 | 5 | Lower Manhattan | 57 | 3.1 | Tea | Brewed Chai tea | Spicy Eye Opener Chai Lg | 6.2 | 1 | Jan | 7 | Sun | 7 |
2 | 3 | 2023-01-01 | 07:14:04 | 2 | 5 | Lower Manhattan | 59 | 4.5 | Drinking Chocolate | Hot chocolate | Dark chocolate Lg | 9.0 | 1 | Jan | 7 | Sun | 7 |
3 | 4 | 2023-01-01 | 07:20:24 | 1 | 5 | Lower Manhattan | 22 | 2.0 | Coffee | Drip coffee | Our Old Time Diner Blend Sm | 2.0 | 1 | Jan | 7 | Sun | 7 |
4 | 5 | 2023-01-01 | 07:22:41 | 2 | 5 | Lower Manhattan | 57 | 3.1 | Tea | Brewed Chai tea | Spicy Eye Opener Chai Lg | 6.2 | 1 | Jan | 7 | Sun | 7 |
Let's look at the structure of the data and make sense of the columns.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149116 entries, 0 to 149115
Data columns (total 17 columns):
 #   Column            Non-Null Count   Dtype
---  ------            --------------   -----
 0   transaction_id    149116 non-null  int64
 1   transaction_date  149116 non-null  datetime64[ns]
 2   transaction_time  149116 non-null  object
 3   transaction_qty   149116 non-null  int64
 4   store_id          149116 non-null  int64
 5   store_location    149116 non-null  object
 6   product_id        149116 non-null  int64
 7   unit_price        149116 non-null  float64
 8   product_category  149116 non-null  object
 9   product_type      149116 non-null  object
 10  product_detail    149116 non-null  object
 11  Revenue           149116 non-null  float64
 12  Month             149116 non-null  int64
 13  Month.1           149116 non-null  object
 14  Weekday           149116 non-null  int64
 15  Weekday.1         149116 non-null  object
 16  Hour              149116 non-null  int64
dtypes: datetime64[ns](1), float64(2), int64(7), object(7)
memory usage: 19.3+ MB
- The dataset appears well-structured with no missing values, which is excellent for analysis.
Actions:
- We will remove the transaction_time, Month and Weekday columns from the dataset, since Hour, Month.1 and Weekday.1 provide more descriptive representations of the time of day, month, and day of the week. We can also remove product_id and store_id, as product_detail and store_location are better representations.
- Rename Month.1 and Weekday.1 for better readability.
df.drop(columns=['transaction_time','store_id','Month','Weekday','product_id'], inplace = True )
df.rename(columns={'Month.1':'Month','Weekday.1':'Weekday'}, inplace = True)
- To gain an overview of the dataset’s numerical features and understand their statistical properties, we'll use df.describe(). This will provide insights into measures such as the mean, standard deviation, and range for each numerical column.
df.describe()
transaction_id | transaction_date | transaction_qty | unit_price | Revenue | Hour | |
---|---|---|---|---|---|---|
count | 149116.000000 | 149116 | 149116.000000 | 149116.000000 | 149116.000000 | 149116.000000 |
mean | 74737.371872 | 2023-04-15 11:50:32.173609984 | 1.438276 | 3.382219 | 4.686367 | 11.735790 |
min | 1.000000 | 2023-01-01 00:00:00 | 1.000000 | 0.800000 | 0.800000 | 6.000000 |
25% | 37335.750000 | 2023-03-06 00:00:00 | 1.000000 | 2.500000 | 3.000000 | 9.000000 |
50% | 74727.500000 | 2023-04-24 00:00:00 | 1.000000 | 3.000000 | 3.750000 | 11.000000 |
75% | 112094.250000 | 2023-05-30 00:00:00 | 2.000000 | 3.750000 | 6.000000 | 15.000000 |
max | 149456.000000 | 2023-06-30 00:00:00 | 8.000000 | 45.000000 | 360.000000 | 20.000000 |
std | 43153.600016 | NaN | 0.542509 | 2.658723 | 4.227099 | 3.764662 |
- The transaction_date column shows that all records fall within January to June of 2023.
- Transaction Quantity: Varies from 1 to 8 units, with an average of 1.44 units.
- Unit Price(in dollars): Ranges from 0.80 to 45.00, averaging 3.38, indicating price variability.
- Revenue(in dollars): From 0.80 to 360.00, with an average of 4.69, reflecting diverse transaction values.
Checking for duplicate values¶
df[df.duplicated()]
transaction_id | transaction_date | transaction_qty | store_location | unit_price | product_category | product_type | product_detail | Revenue | Month | Weekday | Hour |
---|---|---|---|---|---|---|---|---|---|---|---
- The empty dataframe indicates that there are no duplicate values.
Checking for outliers¶
fig, ax = plt.subplots( 1,3 ,figsize = (11,4))
sns.boxplot(data = df[[ 'transaction_qty']], ax = ax[0])
sns.boxplot(data = df[[ 'unit_price' ]], ax = ax[1])
sns.boxplot(data = df[[ 'Revenue' ]], ax = ax[2])
plt.show()
- Outliers are present in the transaction quantity, unit price, and revenue columns. Domain knowledge is crucial for addressing these outliers effectively.
- However, these outliers are justifiable given that high transaction quantities can lead to increased revenue, and premium items may have higher unit prices. Therefore, we will leave the outliers as they are, as they reflect realistic variations in the data.
Exploratory Data Analysis¶
Univariate Analysis¶
Let's look at the revenue of each store over the first half of 2023¶
# grouping store locations with the total revenue earned by each store
df_revenue = df.groupby('store_location')['Revenue'].sum().reset_index()
# Dividing the Revenue by 1000 for easy representation in k dollars
df_revenue['Revenue'] = round(df_revenue['Revenue']/1000,2)
plt.figure(figsize=(5,3))
sns.barplot(x = 'store_location', y= 'Revenue', data = df_revenue)
plt.ylim(0,250)
plt.title('Total Revenue based on store location (k$)')
plt.xlabel('Store Location')
plt.ylabel('Revenue (k$)')
plt.show()
- This shows that the total revenue generated is quite similar across all stores, indicating that there isn't much variation in revenue performance between different locations.
Let's check how many units are commonly bought per transaction¶
units_per_transaction = df.groupby('transaction_qty').size()
plt.figure(figsize=(5,3))
sns.barplot(x= units_per_transaction.index, y= units_per_transaction.values, color ='dodgerblue')
plt.xlabel('Number of Units')
plt.ylabel('Transactions')
plt.show()
- Most transactions involve only one or two items.
- We have an opportunity to increase the number of items purchased per transaction by implementing strategies such as bundle discounts, volume-based pricing, or loyalty rewards for larger purchases.
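The claim that most baskets hold one or two items can be quantified with `value_counts(normalize=True)`. A sketch on hypothetical per-transaction quantities:

```python
import pandas as pd

# Hypothetical per-transaction quantities
qty = pd.Series([1, 1, 2, 1, 2, 3, 1, 2])

share = qty.value_counts(normalize=True)
small_basket_share = share.loc[[1, 2]].sum()  # fraction of 1- or 2-item baskets
```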
Let's look at which category has the highest number of units sold¶
transaction_by_category = df.groupby('product_category')['transaction_qty'].sum().reset_index()
transaction_by_category = transaction_by_category[['product_category', 'transaction_qty']].sort_values(by='transaction_qty', ascending=False)
total = transaction_by_category['transaction_qty'].sum()
def categorize(row, total):
    perct = (row['transaction_qty'] / total) * 100
    return row['product_category'] if perct > 6 else 'Others'
transaction_by_category['category'] = transaction_by_category.apply(categorize,total = total,axis=1)
pie_cat = transaction_by_category.groupby('category')['transaction_qty'].sum().sort_values()
fig , ax = plt.subplots(1,2, figsize = (12,5))
sns.barplot(y='product_category', x= 'transaction_qty', data=transaction_by_category, orient='h', ax=ax[0], palette='Blues_d')
ax[0].set_title('Number of units sold by each category')
ax[0].set_ylabel('Product category')
ax[0].set_xlabel('Units sold')
colors = ['#ADD8E6', '#87CEEB', '#1E90FF', '#4169E1', '#0202F5']
ax[1].pie(pie_cat.values, labels = pie_cat.index,autopct='%1.1f%%', colors=colors)
ax[1].set_title('Category distribution Pie chart')
fig.subplots_adjust(left = 0.05, right = 0.95)
plt.show()
- Coffee is the most popular category, with the highest units sold and revenue generated, indicating a strong customer preference.
- Tea ranks second, showing significant sales and revenue, reflecting solid demand and interest.
- Bakery items come in third, with notable sales figures, though not as dominant as coffee or tea, suggesting a potential area for growth.
Unit price distribution among categories¶
unique_prices = df.drop_duplicates(subset=['product_category','unit_price'])
fig , ax = plt.subplots(figsize=(8,4))
sns.boxplot(x = 'product_category', y='unit_price', data = unique_prices, flierprops = {'marker':'o','markerfacecolor':'none'})
plt.ylabel('Unit price ($)', fontsize= 12)
plt.xlabel('Product category', fontsize = 12)
plt.xticks(rotation = 45)
plt.title('Unit price distribution for Product categories', fontsize = 13)
plt.show()
The analysis of unit prices reveals that products with the highest unit prices—such as loose tea, coffee beans, packaged chocolate, and branded items—tend to have lower sales volumes. This indicates that higher-priced items are purchased less frequently compared to lower-priced products, which show more consistent sales.
Coffee and tea, being lower-priced items, exhibit high sales volumes. To capitalize on this trend, focus on promoting these popular, high-turnover products while exploring opportunities to upsell or cross-sell higher-priced items to boost overall revenue.
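The inverse relation between price and volume described above can be checked by summarising average unit price against total units per category. A sketch on made-up rows mirroring the relevant columns:

```python
import pandas as pd

# Made-up rows mirroring the relevant columns (values are illustrative)
toy = pd.DataFrame({
    'product_category': ['Coffee', 'Coffee', 'Coffee beans', 'Loose Tea'],
    'unit_price': [3.0, 2.0, 20.0, 9.0],
    'transaction_qty': [2, 1, 1, 1],
})

# Average price and total units per category, most expensive first
summary = (toy.groupby('product_category')
              .agg(avg_price=('unit_price', 'mean'),
                   units=('transaction_qty', 'sum'))
              .sort_values('avg_price', ascending=False))
```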
Bivariate Analysis¶
Looking at revenue and sales performance over months and days¶
# highest revenue generated by month
month_rev = df.groupby(['Month'])['Revenue'].sum().reset_index()
#sorting by month
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
month_rev['month_order'] = month_rev['Month'].map({Month: i for i,Month in enumerate(month_order)})
month_rev = month_rev.sort_values(by = 'month_order').drop('month_order',axis=1)
# daily revenue and quantity
daily_revenue = df.groupby('transaction_date')['Revenue'].sum().reset_index()
daily_qty = df.groupby('transaction_date')['transaction_qty'].sum().reset_index()
fig, (ax,ax0) = plt.subplots(1,2, figsize = (20,8))
sns.barplot(x = 'Month', y ='Revenue', data = month_rev, ax=ax, palette = 'Blues')
ax.set_title('Total Monthly revenue for 2023 ')
ax1 = ax0.twinx()
ax0.plot('transaction_date', 'Revenue', data=daily_revenue , color='red')
ax1.plot('transaction_date', 'transaction_qty', data=daily_qty, color='blue')
ax0.set_ylabel("Revenue", color='red')
ax0.set_xlabel('Transaction Date')
ax1.set_ylabel('Sale Quantity', color='blue')
plt.title('Daily sales and Revenue trend over 6 months')
plt.show()
- Monthly Sales Trend: Sales and transaction volumes have been rising since February. More extended data could shed light on annual seasonality and potential sales limits.
- Daily Sales Trend: Sales vary with weekdays, holidays, and promotions, with noticeable dips at the start and end of each month. The overall trend is positive, indicating business growth.
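One way to separate the day-to-day dips from the underlying growth trend is a 7-day rolling mean over the daily revenue series. A minimal sketch with hypothetical daily totals:

```python
import pandas as pd

# Hypothetical daily revenue totals
daily = pd.Series(
    [100.0, 120.0, 90.0, 130.0, 110.0, 95.0, 105.0, 140.0],
    index=pd.date_range('2023-01-01', periods=8, freq='D'),
)

# min_periods=1 keeps the first few days instead of producing NaN
smoothed = daily.rolling(window=7, min_periods=1).mean()
```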
Looking for Weekday trends in revenue¶
weekday_revenue = df.groupby('Weekday')['Revenue'].sum().reset_index()
# sorting based on the week
weekday_order = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
weekday_revenue['order'] = weekday_revenue['Weekday'].map({Weekday: i for i,Weekday in enumerate(weekday_order) })
weekday_revenue = weekday_revenue.sort_values(by='order').drop(columns='order')
plt.figure(figsize=(5,3))
sns.barplot(x=weekday_revenue['Weekday'], y = weekday_revenue['Revenue'], color = 'dodgerblue')
plt.title('Revenue over different days of week')
plt.show()
- Revenue varies only slightly across the days of the week.
Let's look at the busiest time of the day¶
# calculating average sales over the day
total_sales_per_hour_of_day = df.groupby('Hour')['Revenue'].sum()
unique_dates = df.drop_duplicates(['transaction_date', 'Hour'])
# Counting how many times specific hours are recorded
hours_of_day_count = unique_dates['Hour'].value_counts(sort=True).sort_index(ascending=True)
avg_hourly_sales_per_day = total_sales_per_hour_of_day / hours_of_day_count
plt.figure(figsize=( 5,4))
plt.plot(avg_hourly_sales_per_day.index, avg_hourly_sales_per_day.values)
plt.title('Average hourly sales in a day')
plt.xlabel('Hour')
plt.ylabel('Average sales')
plt.show()
Sales peak between 8 and 10 AM as people seek stimulating drinks to start their day. Sales become more evenly distributed from 11 AM to 7 PM.
Sales during peak hours are twice as high as during regular hours. This insight can guide employee task allocation and marketing promotions to balance sales throughout the day.
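The "twice as high" observation corresponds to a simple peak/off-peak ratio over the hourly averages. A sketch with hypothetical hourly figures:

```python
import pandas as pd

# Hypothetical average revenue per hour of day
hourly = pd.Series({8: 1200.0, 9: 1300.0, 10: 1250.0, 13: 600.0, 14: 650.0})

peak = hourly.loc[[8, 9, 10]].mean()       # 8-10 AM window
off_peak = hourly.drop([8, 9, 10]).mean()  # remaining hours
ratio = peak / off_peak
```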
Multivariate Analysis¶
Let's see if there is a preferred product category across the three stores¶
pivoted = df.pivot_table(index = 'product_category', columns='store_location', values='transaction_qty', aggfunc = 'sum')
pivoted['Total'] = pivoted.sum(axis = 1)
pivoted.sort_values(by='Total', ascending = False, inplace = True)
pivoted.drop(columns = ['Total'], inplace = True)
pivoted.plot(kind='bar', figsize = (10,5))
plt.xticks(rotation = 45)
plt.xlabel('Product category')
plt.ylabel('Total transaction quantity')
plt.title('Distribution of each product category purchased per store')
plt.show()
- Across all three stores, most product categories exhibit similar purchase volumes, indicating that there are no clear favorites for specific product categories.
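To make that comparison scale-free, each store's category mix can be normalised to shares so that differences in store size drop out. A sketch on a toy frame (store labels A/B and all figures are hypothetical):

```python
import pandas as pd

# Toy frame with hypothetical stores and quantities
toy = pd.DataFrame({
    'store_location': ['A', 'A', 'B', 'B'],
    'product_category': ['Coffee', 'Tea', 'Coffee', 'Tea'],
    'transaction_qty': [30, 20, 60, 40],
})

shares = toy.pivot_table(index='product_category',
                         columns='store_location',
                         values='transaction_qty',
                         aggfunc='sum')
shares = shares.div(shares.sum(axis=0), axis=1)  # each store's column sums to 1
```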
Final Analysis:¶
Sales Consistency: All stores exhibit similar sales and transaction patterns, indicating uniform performance across locations.
Transaction Patterns: Customers predominantly purchase 1–2 items per transaction. Coffee and tea make up 70% of total purchases, while the five most expensive items account for less than 6% of sales.
Temporal Trends: Sales and transactions show consistent trends over time, with a slight cyclical increase towards the end of each month. There is no significant variation in daily sales by week day, but sales peak between 8 a.m. and 10 a.m., with a more moderate level of activity from 11 a.m. to 7 p.m.
Product Preferences: No specific product category preferences are observed across different stores, suggesting a uniform demand for product categories.
Suggestions:¶
Encourage Larger Purchases: Since transactions generally involve 1–2 items, introduce group promotions or bundle offers to increase the average transaction size.
Optimize High-Price Items: Revise the product lineup or initiate specific marketing promotions for items with higher prices, to enhance their sales performance.
Maximize Busy Hours: Leverage the peak hours from 8 a.m. to 10 a.m. by introducing additional promotions. Adjust staffing schedules to balance the workload during these busy times and consider afternoon promotions to boost sales during the quieter hours from 11 a.m. to 7 p.m.
Conclusion¶
The EDA reveals a strong customer preference for coffee and tea, which are the most popular categories, driving the majority of sales and demonstrating consistent demand. Sales trends indicate positive growth, with the potential to further increase revenue by focusing on these key items and implementing promotional strategies during non-peak hours. Addressing the performance of high-priced items through targeted marketing and product adjustments presents additional opportunities for growth. Overall, the data suggests a promising outlook for the coffee shop, with ample room for optimization and increased customer engagement.