This project presents a typical data analysis task. The dataset is an MPESA statement from a cashless restaurant, containing transaction data for a three-month period.
This project had 3 objectives:
Any other relevant analyses: here I attempted to identify sales by time period (breakfast, lunch, happy hour, dinner) within the day. An attempt to segment customers was also made.
Jupyter Notebook and Python were the primary tools used to execute this task.
The data is presented in PDF format. To load this type of file, I use the `read_pdf()` function from the `tabula` library. Other libraries like `pypdf` can be explored for handling PDF files. For this task, the usual data analysis workhorses were used: `pandas` for wrangling and transformation, `datetime` to handle time series data, `matplotlib` and `seaborn` for visualization and, finally, `sklearn` for clustering.
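As a sketch, the loading step could look like the following. The statement filename and the `read_pdf()` options are assumptions; `tabula-py` also requires a Java runtime to be installed.

```python
import pandas as pd


def load_statement(path: str) -> pd.DataFrame:
    """Read every page of the PDF statement and stack the page tables."""
    # tabula-py wraps tabula-java, so a Java runtime must be available.
    from tabula import read_pdf

    pages = read_pdf(path, pages="all", multiple_tables=True)
    return pd.concat(pages, ignore_index=True)
```

Each page of the statement comes back as its own DataFrame, so the tables are concatenated into one frame before any cleaning begins.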
After successfully loading the data into a pandas dataframe, the information about the data is as shown below:
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7226 entries, 0 to 7225
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Receipt No.       7175 non-null   object
 1   Completion        50 non-null     object
 2   Initiation Time   7175 non-null   object
 3   Details           7220 non-null   object
 4   Transaction       50 non-null     object
 5   Paid In           2431 non-null   object
 6   Withdrawn         4744 non-null   object
 7   Balance           7175 non-null   object
 8   Reason Type       7194 non-null   object
 9   Other Party Info  4195 non-null   object
 10  Time              7126 non-null   object
 11  Status            7126 non-null   object
dtypes: object(12)
memory usage: 677.6+ KB
```
A brief description of the columns after investigation would be:
A function `process_customer_data()` is written to quickly process the data into a simple format since most of it is already clean.
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2431 entries, 0 to 2430
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Details          2431 non-null   object
 1   Paid In          2431 non-null   float64
 2   Initiation Time  2431 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 57.1+ KB
```
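A minimal sketch of such a processing step, assuming customer payments are the rows where `Paid In` is populated. The column names come from the statement above; the sample rows are invented for illustration.

```python
import pandas as pd


def process_customer_data(df: pd.DataFrame) -> pd.DataFrame:
    """Keep customer payments and coerce the columns to usable dtypes."""
    out = df.loc[df["Paid In"].notna(),
                 ["Details", "Paid In", "Initiation Time"]].copy()
    # Amounts arrive as strings such as "1,500.00"; drop the thousands separator.
    out["Paid In"] = out["Paid In"].astype(str).str.replace(",", "").astype(float)
    out["Initiation Time"] = pd.to_datetime(out["Initiation Time"])
    return out.reset_index(drop=True)


raw = pd.DataFrame({
    "Details": ["Funds received from 254725****71 - NW", None],
    "Paid In": ["1,500.00", None],
    "Initiation Time": ["2022-09-01 12:45:00", "2022-09-01 13:00:00"],
})
clean = process_customer_data(raw)
```

Filtering on `Paid In` keeps only inbound customer payments, dropping withdrawals and the statement's summary rows in one step.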
After cleaning the data, customer contacts can be found in the `df['Details']` column. A function `process_customer_details()` is created to extract the customer details, where the phone number is masked with asterisks and the name is reduced to initials to avoid sharing personal data.
This data can be added to an existing customer database using joins on SQL databases, or the `VLOOKUP()` function in MS Excel or Google Sheets.
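A sketch of the masking logic, assuming each `Details` string contains a 254-prefixed phone number followed by the customer's names; the exact wording of the statement entries is an assumption.

```python
import re


def process_customer_details(details: str) -> tuple[str, str]:
    """Mask the middle digits of the phone number; reduce the name to initials."""
    match = re.search(r"(254\d{9})\s*-\s*(.+)", details)
    phone, name = match.group(1), match.group(2)
    # Keep the prefix and the last two digits, hide the four digits in between.
    masked = phone[:6] + "****" + phone[-2:]
    initials = "".join(part[0] for part in name.split())
    return masked, initials


masked, initials = process_customer_details(
    "Funds received from 254712345678 - Jane W Doe"
)
# masked → "254712****78", initials → "JWD"
```

The masked id and initials together are still distinctive enough to join against an internal customer database, without exposing the full contact in a published analysis.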
|  | ID | Name |
|---|---|---|
| 0 | 254725****71 | NW |
| 1 | 254721****36 | MMM |
| 2 | 254716****49 | JA |
| 3 | 254714****72 | RO |
| 4 | 254723****53 | CPM |
| ... | ... | ... |
| 2346 | 254711****95 | INK |
| 2348 | 254706****88 | kic |
| 2350 | 254723****54 | IT |
| 2355 | 254720****71 | TW |
| 2360 | 254720****84 | CMG |

1357 rows × 2 columns
To get the top customers by amount spent, number of transactions, and average spend per transaction, I define a custom function `get_top_customers()` that applies `count()` and `sum()` to the `df['Paid In']` column, then derives the average with `df['trx.average'] = df['sum'] / df['count']`.
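The aggregation inside `get_top_customers()` can be sketched with a `groupby` over the masked customer id. The column names match the tables below; the sample data is invented.

```python
import pandas as pd


def get_top_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate spend per customer and derive the average per transaction."""
    top = (df.groupby(["id_mask", "name_mask"])["Paid In"]
             .agg(["count", "sum"])
             .reset_index())
    top["trx.average"] = top["sum"] / top["count"]
    # Rank by total spend; sorting by "count" instead ranks by transactions.
    return top.sort_values("sum", ascending=False)


sales = pd.DataFrame({
    "id_mask": ["254723****81", "254723****81", "254722****06"],
    "name_mask": ["KM", "KM", "AN"],
    "Paid In": [2000.0, 1000.0, 5000.0],
})
top = get_top_customers(sales)
```

Changing the column passed to `sort_values` produces each of the three rankings shown below from the same aggregated frame.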
The table below shows the top 10 customers ranked by the number of transactions made. The top customer, `254723****81 KM`, made 20 transactions and spent Kshs 35,980.00 within the three-month period.
|  | id_mask | name_mask | count | sum | trx.average |
|---|---|---|---|---|---|
| 865 | 254723****81 | KM | 20 | 35980.0 | 1799.000000 |
| 852 | 254723****66 | IMM | 13 | 13790.0 | 1060.769231 |
| 859 | 254723****78 | NC | 12 | 16405.0 | 1367.083333 |
| 309 | 254714****45 | DWM | 12 | 15415.0 | 1284.583333 |
| 335 | 254715****39 | ymo | 11 | 9200.0 | 836.363636 |
| 661 | 254722****05 | NK | 10 | 11325.0 | 1132.500000 |
| 1202 | 254742****87 | CMK | 10 | 12030.0 | 1203.000000 |
| 288 | 254713****95 | MNA | 10 | 12595.0 | 1259.500000 |
| 1350 | 254799****25 | SON | 10 | 14500.0 | 1450.000000 |
The table below shows the top 10 customers ranked by amount spent.

|  | id_mask | name_mask | count | sum | trx.average |
|---|---|---|---|---|---|
| 865 | 254723****81 | KM | 20 | 35980.0 | 1799.000000 |
| 664 | 254722****06 | AN | 8 | 21290.0 | 2661.250000 |
| 859 | 254723****78 | NC | 12 | 16405.0 | 1367.083333 |
| 442 | 254720****03 | AW | 3 | 15700.0 | 5233.333333 |
| 309 | 254714****45 | DWM | 12 | 15415.0 | 1284.583333 |
| 729 | 254722****50 | DMK | 8 | 14895.0 | 1861.875000 |
| 1350 | 254799****25 | SON | 10 | 14500.0 | 1450.000000 |
| 506 | 254720****68 | RMM | 7 | 14030.0 | 2004.285714 |
| 852 | 254723****66 | IMM | 13 | 13790.0 | 1060.769231 |
| 268 | 254712****98 | sl | 5 | 13070.0 | 2614.000000 |
The table below shows the top 10 customers ranked by average spend per transaction, using `df.sort_values(['trx.average', 'count'], ascending=False)` and keeping only customers with three or more transactions via `df[df['count'] >= 3].head(10)`.
|  | id_mask | name_mask | count | sum | trx.average |
|---|---|---|---|---|---|
| 442 | 254720****03 | AW | 3 | 15700.0 | 5233.333333 |
| 653 | 254721****97 | TN | 3 | 11515.0 | 3838.333333 |
| 742 | 254722****59 | FWM | 3 | 10105.0 | 3368.333333 |
| 678 | 254722****13 | DSKW | 3 | 10020.0 | 3340.000000 |
| 117 | 254705****24 | JMT | 3 | 9555.0 | 3185.000000 |
| 453 | 254720****18 | EOW | 3 | 9500.0 | 3166.666667 |
| 137 | 254706****55 | JH | 3 | 9315.0 | 3105.000000 |
| 372 | 254717****05 | TMO | 3 | 8475.0 | 2825.000000 |
| 633 | 254721****78 | LS | 4 | 10905.0 | 2726.250000 |
| 297 | 254714****14 | MMM | 4 | 10650.0 | 2662.500000 |
The top-customer data can be enhanced by creating a reward and loyalty program. This can be used to build customer knowledge databases and to develop marketing strategies while protecting the privacy of the customers, as their loyalty relies on trust. Loyalty programs are also profitable because it is less expensive to retain existing customers than to attract new ones$^{[1]}$.
The graph below shows the total monthly sales in Kenyan Shillings (Kshs). The data is resampled by month using the `resample()` method on the `df['Completion']` column. Based on the graph, we can see that the business experienced a decline in sales: September sales were the highest, followed by a decrease in October and a further decrease in November.
Since there was a gradual decrease in sales from September to November, it would be worth analyzing the possible reasons for this decline. Some potential factors to consider could include changes in market conditions, changing consumer preferences or increased competition.
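The monthly aggregation can be sketched as follows, assuming the completion timestamp has been set as the index before resampling; the sample figures are invented.

```python
import pandas as pd

sales = pd.DataFrame(
    {"Paid In": [1000.0, 2000.0, 1500.0]},
    index=pd.to_datetime(["2022-09-10", "2022-09-20", "2022-10-05"]),
)
# "MS" buckets the series by calendar month, labelled at the month start.
monthly = sales["Paid In"].resample("MS").sum()
```

The same pattern with `resample("D")` produces the daily series plotted further below.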
The line graph below shows the daily sales trend after resampling the data by day. Based on the graph, we can see that the business experiences fluctuating daily sales.
The graph below shows total sales by day of the week and is used to test the validity of promotion days (like buy one get one free or discounts). This is evidenced by Friday (Buy 2 burgers for the price of 1) and Tuesday (Buy one pizza and get another free) having higher sales than the other weekdays - Monday, Wednesday and Thursday. Saturday and Sunday are the top grossing days in terms of sales.
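Grouping by weekday can be sketched like this, again assuming the completion timestamp is the index; the sample rows are invented (two Fridays and one Sunday).

```python
import pandas as pd

sales = pd.DataFrame(
    {"Paid In": [500.0, 700.0, 900.0]},
    # 2022-09-02 and 2022-09-09 are Fridays; 2022-09-04 is a Sunday.
    index=pd.to_datetime(["2022-09-02", "2022-09-09", "2022-09-04"]),
)
by_day = sales["Paid In"].groupby(sales.index.day_name()).sum()
```

Grouping on `day_name()` rather than the raw date is what lets every Friday across the three months contribute to a single bar.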
The graph below supplements the previous one by breaking down the days of the week by month. While it reiterates Sunday as the highest-grossing day, Fridays and Saturdays in the month of September performed exceptionally well; this can be attributed to a marketing campaign celebrating the restaurant's 2nd anniversary with deals on pizzas and ice cream.
Here, I bucketed the data into four broad periods based on the menu (Breakfast, Lunch, Happy Hour and Dinner) using the `df['Completion']` column.
The graph below shows sales by period of the day. Most transactions happen within the happy hour period (1600hrs - 1900hrs), closely followed by the lunch period (1230hrs - 1600hrs) and then dinner (1900hrs - 2200hrs).
Given this information, strategies around pricing and new product development, like theme nights or breakfast menus, can be used to drive sales in the dinner and breakfast periods of the day.
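One way to bucket each transaction into a menu period, using the cut-off times quoted above; the mapping function itself is my assumption of how the notebook implements it.

```python
import pandas as pd


def period_of_day(ts: pd.Timestamp) -> str:
    """Map a timestamp to the menu period it falls in."""
    minutes = ts.hour * 60 + ts.minute
    if minutes < 12 * 60 + 30:
        return "Breakfast"   # before 1230hrs
    if minutes < 16 * 60:
        return "Lunch"       # 1230hrs - 1600hrs
    if minutes < 19 * 60:
        return "HappyHour"   # 1600hrs - 1900hrs
    return "Dinner"          # 1900hrs onwards
```

Applying this with `df['Completion'].apply(period_of_day)` yields a categorical column that can be grouped and plotted like the weekday breakdown.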
The graph below shows sales by day of the week broken down into period of the day.
Here, I attempted to segment customers based on the number of transactions (Frequency), amount spent (Monetary Value) and Recency, measured in days from the last transaction on the statement.
After applying the K-Means algorithm to the data, we get the 3D plot below of the clustered data points. Segmenting customers based on similar characteristics allows businesses to gain a deeper understanding of their customers' needs, preferences, and behaviors. Several researchers$^{[2, 3, 4]}$ note that this type of segmentation provides advantages like:
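A sketch of the RFM feature construction and clustering step, assuming three clusters as in the plot; standardising the features with `StandardScaler` is my addition, and the sales data is invented.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

trx = pd.DataFrame({
    "id_mask": ["a", "a", "b", "c", "c", "c"],
    "Paid In": [500.0, 700.0, 3000.0, 200.0, 250.0, 300.0],
    "Completion": pd.to_datetime([
        "2022-09-01", "2022-11-20", "2022-10-05",
        "2022-09-10", "2022-09-12", "2022-09-15",
    ]),
})

# Recency is measured against the last transaction on the statement.
last_day = trx["Completion"].max()
rfm = trx.groupby("id_mask").agg(
    frequency=("Paid In", "count"),
    monetary=("Paid In", "sum"),
    recency=("Completion", lambda s: (last_day - s.max()).days),
)

# Scale the features so no single axis dominates the distance metric.
features = StandardScaler().fit_transform(rfm)
rfm["segment"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(features)
```

The three RFM columns are also the axes of the 3D scatter plot, with each point coloured by its assigned segment.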
Despite the relatively small size of the dataset, meaningful insights were uncovered that can help inform decision-making. This highlights the importance of using appropriate statistical methods and visualizations to explore data, even when working with limited information. Furthermore, this dataset has the potential to be expanded by combining it with other sources of data, such as point-of-sale (POS) data, and social media marketing campaign data to gain a more comprehensive understanding of customer behavior and preferences. Such integration of data can provide a more complete picture of the underlying patterns and relationships, leading to better insights and more effective decision-making.