This project presents a typical data analysis task. The dataset is an MPESA statement from a cashless restaurant, containing transaction data for a three-month period.
This project had 3 objectives:
Any other relevant analyses: here I attempted to identify sales by time period (breakfast, lunch, happy hour, dinner) within the day. An attempt to segment customers was also made.
Jupyter Notebook and Python were the primary tools used to execute this task.
The data is presented in PDF format. To load this type of file, I use the `read_pdf()` function from the `tabula` library. Other libraries like `pypdf` can be explored for handling PDF files. For this task, the usual data analysis workhorses were used: `pandas` for wrangling and transformation, `datetime` to handle time series data, `matplotlib` and `seaborn` for visualization and, finally, `sklearn` for clustering.
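As a sketch, the loading step could look like the following. The statement filename and the `read_pdf()` options are assumptions; `tabula-py` also requires a Java runtime to be installed.

```python
import pandas as pd


def load_statement(path: str) -> pd.DataFrame:
    """Read every page of the PDF statement and stack the page tables."""
    # tabula-py wraps tabula-java, so a Java runtime must be available.
    from tabula import read_pdf

    pages = read_pdf(path, pages="all", multiple_tables=True)
    return pd.concat(pages, ignore_index=True)
```

Each page of the statement comes back as its own DataFrame, so the tables are concatenated into one frame before any cleaning begins.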
After successfully loading the data into a pandas dataframe, the information about the data is as shown below:
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7226 entries, 0 to 7225
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Receipt No.       7175 non-null   object
 1   Completion        50 non-null     object
 2   Initiation Time   7175 non-null   object
 3   Details           7220 non-null   object
 4   Transaction       50 non-null     object
 5   Paid In           2431 non-null   object
 6   Withdrawn         4744 non-null   object
 7   Balance           7175 non-null   object
 8   Reason Type       7194 non-null   object
 9   Other Party Info  4195 non-null   object
 10  Time              7126 non-null   object
 11  Status            7126 non-null   object
dtypes: object(12)
memory usage: 677.6+ KB
```
A brief description of the columns after investigation would be:
A function `process_customer_data()` is written to quickly process the data into a simple format since most of it is already clean.
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2431 entries, 0 to 2430
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Details          2431 non-null   object
 1   Paid In          2431 non-null   float64
 2   Initiation Time  2431 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 57.1+ KB
```
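A minimal sketch of such a processing step, assuming customer payments are the rows where `Paid In` is populated. The column names come from the statement above; the sample rows are invented for illustration.

```python
import pandas as pd


def process_customer_data(df: pd.DataFrame) -> pd.DataFrame:
    """Keep customer payments and coerce the columns to usable dtypes."""
    out = df.loc[df["Paid In"].notna(),
                 ["Details", "Paid In", "Initiation Time"]].copy()
    # Amounts arrive as strings such as "1,500.00"; drop the thousands separator.
    out["Paid In"] = out["Paid In"].astype(str).str.replace(",", "").astype(float)
    out["Initiation Time"] = pd.to_datetime(out["Initiation Time"])
    return out.reset_index(drop=True)


raw = pd.DataFrame({
    "Details": ["Funds received from 254725****71 - NW", None],
    "Paid In": ["1,500.00", None],
    "Initiation Time": ["2022-09-01 12:45:00", "2022-09-01 13:00:00"],
})
clean = process_customer_data(raw)
```

Filtering on `Paid In` keeps only inbound customer payments, dropping withdrawals and the statement's summary rows in one step.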
After cleaning the data, customer contacts can be found in the `df['Details']` column. A function `process_customer_details()` is created to extract the customer details, where the phone number is masked with asterisks and the name is reduced to initials to avoid sharing personal data.
This data can be added to an existing customer database using joins on SQL databases, or the `VLOOKUP()` function in MS Excel or Google Sheets.
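A sketch of the masking logic, assuming each `Details` string contains a 254-prefixed phone number followed by the customer's names; the exact wording of the statement entries is an assumption.

```python
import re


def process_customer_details(details: str) -> tuple[str, str]:
    """Mask the middle digits of the phone number; reduce the name to initials."""
    match = re.search(r"(254\d{9})\s*-\s*(.+)", details)
    phone, name = match.group(1), match.group(2)
    # Keep the prefix and the last two digits, hide the four digits in between.
    masked = phone[:6] + "****" + phone[-2:]
    initials = "".join(part[0] for part in name.split())
    return masked, initials


masked, initials = process_customer_details(
    "Funds received from 254712345678 - Jane W Doe"
)
# masked → "254712****78", initials → "JWD"
```

The masked id and initials together are still distinctive enough to join against an internal customer database, without exposing the full contact in a published analysis.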
|  | ID | Name |
|---|---|---|
| 0 | 254725****71 | NW |
| 1 | 254721****36 | MMM |
| 2 | 254716****49 | JA |
| 3 | 254714****72 | RO |
| 4 | 254723****53 | CPM |
| ... | ... | ... |
| 2346 | 254711****95 | INK |
| 2348 | 254706****88 | kic |
| 2350 | 254723****54 | IT |
| 2355 | 254720****71 | TW |
| 2360 | 254720****84 | CMG |

1357 rows × 2 columns
To get the top customers by amount spent, number of transactions, and average spend per transaction, I define a custom function `get_top_customers()` that applies `count()` and `sum()` to the `df['Paid In']` column, then derives the average with `df['trx.average'] = df['sum'] / df['count']`.
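The aggregation inside `get_top_customers()` can be sketched with a `groupby` over the masked customer id. The column names match the tables below; the sample data is invented.

```python
import pandas as pd


def get_top_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate spend per customer and derive the average per transaction."""
    top = (df.groupby(["id_mask", "name_mask"])["Paid In"]
             .agg(["count", "sum"])
             .reset_index())
    top["trx.average"] = top["sum"] / top["count"]
    # Rank by total spend; sorting by "count" instead ranks by transactions.
    return top.sort_values("sum", ascending=False)


sales = pd.DataFrame({
    "id_mask": ["254723****81", "254723****81", "254722****06"],
    "name_mask": ["KM", "KM", "AN"],
    "Paid In": [2000.0, 1000.0, 5000.0],
})
top = get_top_customers(sales)
```

Changing the column passed to `sort_values` produces each of the three rankings shown below from the same aggregated frame.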
The table below shows the top 10 customers ranked by the number of transactions made. The top customer, `254723****81 KM`, made 20 transactions and spent Kshs 35,980.00 within the three-month period.
|  | id_mask | name_mask | count | sum | trx.average |
|---|---|---|---|---|---|
| 865 | 254723****81 | KM | 20 | 35980.0 | 1799.000000 |
| 852 | 254723****66 | IMM | 13 | 13790.0 | 1060.769231 |
| 859 | 254723****78 | NC | 12 | 16405.0 | 1367.083333 |
| 309 | 254714****45 | DWM | 12 | 15415.0 | 1284.583333 |
| 335 | 254715****39 | ymo | 11 | 9200.0 | 836.363636 |
| 661 | 254722****05 | NK | 10 | 11325.0 | 1132.500000 |
| 1202 | 254742****87 | CMK | 10 | 12030.0 | 1203.000000 |
| 288 | 254713****95 | MNA | 10 | 12595.0 | 1259.500000 |
| 1350 | 254799****25 | SON | 10 | 14500.0 | 1450.000000 |
The table below shows the top 10 customers ranked by amount spent.

|  | id_mask | name_mask | count | sum | trx.average |
|---|---|---|---|---|---|
| 865 | 254723****81 | KM | 20 | 35980.0 | 1799.000000 |
| 664 | 254722****06 | AN | 8 | 21290.0 | 2661.250000 |
| 859 | 254723****78 | NC | 12 | 16405.0 | 1367.083333 |
| 442 | 254720****03 | AW | 3 | 15700.0 | 5233.333333 |
| 309 | 254714****45 | DWM | 12 | 15415.0 | 1284.583333 |
| 729 | 254722****50 | DMK | 8 | 14895.0 | 1861.875000 |
| 1350 | 254799****25 | SON | 10 | 14500.0 | 1450.000000 |
| 506 | 254720****68 | RMM | 7 | 14030.0 | 2004.285714 |
| 852 | 254723****66 | IMM | 13 | 13790.0 | 1060.769231 |
| 268 | 254712****98 | sl | 5 | 13070.0 | 2614.000000 |
The table below shows the top 10 customers ranked by average spend per transaction, using `df.sort_values(['trx.average', 'count'], ascending=False)` and keeping only customers with three or more transactions via `df[df['count'] >= 3].head(10)`.
|  | id_mask | name_mask | count | sum | trx.average |
|---|---|---|---|---|---|
| 442 | 254720****03 | AW | 3 | 15700.0 | 5233.333333 |
| 653 | 254721****97 | TN | 3 | 11515.0 | 3838.333333 |
| 742 | 254722****59 | FWM | 3 | 10105.0 | 3368.333333 |
| 678 | 254722****13 | DSKW | 3 | 10020.0 | 3340.000000 |
| 117 | 254705****24 | JMT | 3 | 9555.0 | 3185.000000 |
| 453 | 254720****18 | EOW | 3 | 9500.0 | 3166.666667 |
| 137 | 254706****55 | JH | 3 | 9315.0 | 3105.000000 |
| 372 | 254717****05 | TMO | 3 | 8475.0 | 2825.000000 |
| 633 | 254721****78 | LS | 4 | 10905.0 | 2726.250000 |
| 297 | 254714****14 | MMM | 4 | 10650.0 | 2662.500000 |
The top-customer data can be enhanced by creating a reward and loyalty program. This can be used to build customer knowledge databases and to develop marketing strategies while protecting the privacy of the customers, as their loyalty relies on trust. Loyalty programs are also profitable because it is less expensive to retain existing customers than to attract new ones$^{[1]}$.
The graph below shows the total monthly sales in Kenyan Shillings (Kshs). The data is resampled by month using the `resample()` method on the `df['Completion']` column. Based on the graph, we can see that the business experienced a decline in sales: September sales were the highest, followed by a decrease in October and a further decrease in November.
Since there was a gradual decrease in sales from September to November, it would be worth analyzing the possible reasons for this decline. Some potential factors to consider could include changes in market conditions, changing consumer preferences or increased competition.
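The monthly aggregation can be sketched as follows, assuming the completion timestamp has been set as the index before resampling; the sample figures are invented.

```python
import pandas as pd

sales = pd.DataFrame(
    {"Paid In": [1000.0, 2000.0, 1500.0]},
    index=pd.to_datetime(["2022-09-10", "2022-09-20", "2022-10-05"]),
)
# "MS" buckets the series by calendar month, labelled at the month start.
monthly = sales["Paid In"].resample("MS").sum()
```

The same pattern with `resample("D")` produces the daily series plotted further below.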
The line graph below shows the daily sales trend after resampling the data by day. Based on the graph, we can see that the business experiences fluctuating daily sales.
The graph below shows total sales by day of the week and is used to test the validity of promotion days (like buy one get one free or discounts). This is evidenced by Friday (Buy 2 burgers for the price of 1) and Tuesday (Buy one pizza and get another free) having higher sales than the other weekdays - Monday, Wednesday and Thursday. Saturday and Sunday are the top grossing days in terms of sales.
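Grouping by weekday can be sketched like this, again assuming the completion timestamp is the index; the sample rows are invented (two Fridays and one Sunday).

```python
import pandas as pd

sales = pd.DataFrame(
    {"Paid In": [500.0, 700.0, 900.0]},
    # 2022-09-02 and 2022-09-09 are Fridays; 2022-09-04 is a Sunday.
    index=pd.to_datetime(["2022-09-02", "2022-09-09", "2022-09-04"]),
)
by_day = sales["Paid In"].groupby(sales.index.day_name()).sum()
```

Grouping on `day_name()` rather than the raw date is what lets every Friday across the three months contribute to a single bar.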
The graph below supplements the previous one by breaking down the days of the week by month. While it reiterates Sunday as the highest-grossing day, Fridays and Saturdays in the month of September performed exceptionally well; this can be attributed to a marketing campaign celebrating the restaurant's 2nd anniversary with deals on pizzas and ice cream.
Here, I bucketed the data into four broad periods based on the menu (Breakfast, Lunch, Happy Hour and Dinner) using the `df['Completion']` column.
The graph below shows sales by period of the day. Most transactions happen within the happy hour period (1600hrs - 1900hrs), closely followed by the lunch period (1230hrs - 1600hrs) and then dinner (1900hrs - 2200hrs).
Given this information, strategies around pricing and new product development, like theme nights or breakfast menus, can be used to drive sales in the dinner and breakfast periods of the day.
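One way to bucket each transaction into a menu period, using the cut-off times quoted above; the mapping function itself is my assumption of how the notebook implements it.

```python
import pandas as pd


def period_of_day(ts: pd.Timestamp) -> str:
    """Map a timestamp to the menu period it falls in."""
    minutes = ts.hour * 60 + ts.minute
    if minutes < 12 * 60 + 30:
        return "Breakfast"   # before 1230hrs
    if minutes < 16 * 60:
        return "Lunch"       # 1230hrs - 1600hrs
    if minutes < 19 * 60:
        return "HappyHour"   # 1600hrs - 1900hrs
    return "Dinner"          # 1900hrs onwards
```

Applying this with `df['Completion'].apply(period_of_day)` yields a categorical column that can be grouped and plotted like the weekday breakdown.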
The graph below shows sales by day of the week broken down into period of the day.
Here, I attempted to segment customers based on the number of transactions (Frequency), amount spent (Monetary Value) and Recency, measured in days from the last transaction on the statement.
After applying the K-Means algorithm to the data, we get the 3D plot below of the clustered data points. Segmenting customers based on similar characteristics allows businesses to gain a deeper understanding of their customers' needs, preferences, and behaviors. Several researchers$^{[2, 3, 4]}$ note that this type of segmentation provides advantages like:
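A sketch of the RFM feature construction and clustering step, assuming three clusters as in the plot; standardising the features with `StandardScaler` is my addition, and the sales data is invented.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

trx = pd.DataFrame({
    "id_mask": ["a", "a", "b", "c", "c", "c"],
    "Paid In": [500.0, 700.0, 3000.0, 200.0, 250.0, 300.0],
    "Completion": pd.to_datetime([
        "2022-09-01", "2022-11-20", "2022-10-05",
        "2022-09-10", "2022-09-12", "2022-09-15",
    ]),
})

# Recency is measured against the last transaction on the statement.
last_day = trx["Completion"].max()
rfm = trx.groupby("id_mask").agg(
    frequency=("Paid In", "count"),
    monetary=("Paid In", "sum"),
    recency=("Completion", lambda s: (last_day - s.max()).days),
)

# Scale the features so no single axis dominates the distance metric.
features = StandardScaler().fit_transform(rfm)
rfm["segment"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(features)
```

The three RFM columns are also the axes of the 3D scatter plot, with each point coloured by its assigned segment.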
Despite the relatively small size of the dataset, meaningful insights were uncovered that can help inform decision-making. This highlights the importance of using appropriate statistical methods and visualizations to explore data, even when working with limited information. Furthermore, this dataset has the potential to be expanded by combining it with other sources of data, such as point-of-sale (POS) data, and social media marketing campaign data to gain a more comprehensive understanding of customer behavior and preferences. Such integration of data can provide a more complete picture of the underlying patterns and relationships, leading to better insights and more effective decision-making.