Week 4 — Correlation Analysis (Professional Final Version)

Course: Applied Data Science with AI • Instructor: Dr. Muhammad Mohsin Nazir

Status: Completed

Report generated: Week 4

📘 Week 4 — Detailed Overview / Introduction

Week 4 of the Data Analytics project marks a transition from basic data preparation to understanding the statistical relationships among variables. The main goal of this phase was to explore how the features in the dataset relate to one another and, more importantly, how they relate to the target variable, which here is the customer satisfaction score (CSAT Score).

This week's analytical task focused on correlation analysis, a fundamental step before predictive modeling. Computing the correlation coefficients between numerical features reveals patterns of association that help in selecting relevant predictors and removing redundant or non-informative variables. Understanding these relationships supports model accuracy and interpretability in later stages such as regression or classification.

The workflow began with loading and cleaning the dataset (Customer_support_data.csv), ensuring no missing or inconsistent values remained. Categorical features were encoded using label encoding to make them suitable for numerical computations. Next, the correlation matrix was generated to visually and numerically compare inter-variable relationships. Using Seaborn’s heatmap visualization, the strongest and weakest correlations were identified with a color-coded representation for easier interpretation.

To deepen conceptual understanding, a manual correlation computation was performed for one of the most significant feature pairs. This step-by-step calculation demonstrated how correlation values are mathematically derived using mean-centered values and standard deviations, thereby reinforcing both statistical intuition and technical accuracy.
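For reference, the standard Pearson formula that such a manual computation follows is written out below. The symbols x_i, y_i, and n (the number of rows) are notation introduced here for illustration and do not appear in the original report.

```latex
r_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{X})^2 \; \sum_{i=1}^{n} (y_i - \bar{Y})^2}}
       = \frac{\operatorname{cov}(X, Y)}{s_X \, s_Y}
```

The second form follows from dividing the numerator and the denominator by n - 1, which turns the sums into the sample covariance and the two sample standard deviations.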

The week concluded by highlighting the Top 3 predictive features most correlated with the target variable. These insights serve as a foundation for Week 5, where these features will be utilized to build predictive models that forecast customer satisfaction or business outcomes. In essence, Week 4 established a data-driven pathway for selecting variables that matter — paving the way for robust model development and validation in subsequent stages.

In summary: Week 4 successfully bridged exploratory data analysis with predictive modeling preparation — transforming raw data into actionable insights through systematic correlation assessment and statistical reasoning.

Tasks Performed in VS Code

✅ Loaded dataset (Customer_support_data.csv).
✅ Cleaned missing values and encoded categorical variables.
✅ Calculated correlation matrix between all numeric features.
✅ Created visual correlation heatmap using Seaborn.
✅ Identified top 3 correlated features with the target variable.
✅ Performed manual step-by-step correlation calculation for deeper understanding.

📊 Outputs Produced

📈 Correlation Matrix (numeric + color-coded heatmap)
🧠 Top 3 correlated features identified
🧮 Manual correlation computation table
🎨 Pairplot showing feature relationships

✅ Dataset Loaded Successfully

Rows: 85,907

Columns: 20
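A minimal sketch of the loading step, assuming the file is read with pandas. The helper name load_dataset and the three-row in-memory sample are hypothetical. On the real project the argument would be the path to Customer_support_data.csv.

```python
import io

import pandas as pd


def load_dataset(source):
    """Read a CSV (path or file-like object) and report its shape."""
    df = pd.read_csv(source)
    rows, cols = df.shape
    print(f"Rows: {rows:,}  Columns: {cols}")
    return df


# Tiny in-memory stand-in for Customer_support_data.csv (illustrative values only).
sample_csv = io.StringIO(
    "Unique id,channel_name,CSAT Score\n"
    "a1,Inbound,5\n"
    "b2,Outcall,4\n"
)
sample_df = load_dataset(sample_csv)
```

Calling load_dataset("Customer_support_data.csv") on the real file should report the row and column counts listed above.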

📍 Target Column Detected

Automatically identified target: CSAT Score
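The report does not say how the target column was detected. One plausible approach, shown here purely as an assumption, is to match column names against a few keywords and fall back to the last column. The function detect_target and its keyword list are hypothetical.

```python
import pandas as pd


def detect_target(df, keywords=("csat", "score", "target")):
    """Return the first column whose name contains a keyword, else the last column.

    Hypothetical heuristic; the original script may use a different rule.
    """
    for col in df.columns:
        if any(key in str(col).lower() for key in keywords):
            return col
    return df.columns[-1]


demo = pd.DataFrame(
    {"channel_name": ["Inbound"], "Item_price": [100.0], "CSAT Score": [5]}
)
target = detect_target(demo)
print(f"Automatically identified target: {target}")
```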

📊 Dataset Overview

Column Index	Column Name
0	Unique id
1	channel_name
2	category
3	Sub-category
4	Customer Remarks
5	Order_id
6	order_date_time
7	Issue_reported at
8	issue_responded
9	Survey_response_Date
10	Customer_City
11	Product_category
12	Item_price
13	connected_handling_time
14	Agent_name
15	Supervisor
16	Manager
17	Tenure Bucket
18	Agent Shift
19	CSAT Score

🔎 First 5 Rows (Preview)

Unique id	channel_name	category	Sub-category	Customer Remarks	Order_id	order_date_time	Issue_reported at	issue_responded	Survey_response_Date	Customer_City	Product_category	Item_price	connected_handling_time	Agent_name	Supervisor	Manager	Tenure Bucket	Agent Shift	CSAT Score
7e9ae164-6a8b-4521-a2d4-58f7c9fff13f	Outcall	Product Queries	Life Insurance	—	c27c9bb4-fa36-4140-9f1f-21009254ffdb	—	01/08/2023 11:13	01/08/2023 11:47	01-Aug-23	—	—	—	—	Richard Buchanan	Mason Gupta	Jennifer Nguyen	On Job Training	Morning	5
b07ec1b0-f376-43b6-86df-ec03da3b2e16	Outcall	Product Queries	Product Specific Information	—	d406b0c7-ce17-4654-b9de-f08d421254bd	—	01/08/2023 12:52	01/08/2023 12:54	01-Aug-23	—	—	—	—	Vicki Collins	Dylan Kim	Michael Lee	>90	Morning	5
200814dd-27c7-4149-ba2b-bd3af3092880	Inbound	Order Related	Installation/demo	—	c273368d-b961-44cb-beaf-62d6fd6c00d5	—	01/08/2023 20:16	01/08/2023 20:38	01-Aug-23	—	—	—	—	Duane Norman	Jackson Park	William Kim	On Job Training	Evening	5
eb0d3e53-c1ca-42d3-8486-e42c8d622135	Inbound	Returns	Reverse Pickup Enquiry	—	5aed0059-55a4-4ec6-bb54-97942092020a	—	01/08/2023 20:56	01/08/2023 21:16	01-Aug-23	—	—	—	—	Patrick Flores	Olivia Wang	John Smith	>90	Evening	5
ba903143-1e54-406c-b969-46c52f92e5df	Inbound	Cancellation	Not Needed	—	e8bed5a9-6933-4aff-9dc6-ccefd7dcde59	—	01/08/2023 10:30	01/08/2023 10:32	01-Aug-23	—	—	—	—	Christopher Sanchez	Austin Johnson	Michael Lee	0-30	Morning	5

🧹 Missing Value Check

Total missing values after cleaning: 0
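The cleaning rule itself is not stated in the report. The sketch below assumes a common strategy (median for numeric columns, a placeholder label for text columns) and then reproduces the zero-missing check. The function fill_missing, the "Unknown" label, and the sample values are assumptions.

```python
import numpy as np
import pandas as pd


def fill_missing(df):
    """Fill numeric gaps with the column median and text gaps with 'Unknown'.

    Assumed strategy for illustration; the original cleaning step may differ.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna("Unknown")
    return out


# Illustrative rows with gaps, using two column names from the dataset.
demo = pd.DataFrame(
    {
        "Item_price": [100.0, np.nan, 300.0],
        "Customer_City": ["CityA", None, "CityB"],
    }
)
cleaned = fill_missing(demo)
total_missing = int(cleaned.isna().sum().sum())
print(f"Total missing values after cleaning: {total_missing}")
```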

🔠 Encoding Summary

Encoded 17 categorical columns using LabelEncoder.

🗂️ Encoded Columns

Index	Categorical Column
0	Unique id
1	channel_name
2	category
3	Sub-category
4	Customer Remarks
5	Order_id
6	order_date_time
7	Issue_reported at
8	issue_responded
9	Survey_response_Date
10	Customer_City
11	Product_category
12	Agent_name
13	Supervisor
14	Manager
15	Tenure Bucket
16	Agent Shift
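The encoding step can be sketched as follows, assuming scikit-learn's LabelEncoder is applied to each non-numeric column separately. The helper encode_categoricals and the three sample rows are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_categoricals(df):
    """Label-encode every non-numeric column; return the frame and the encoded names."""
    out = df.copy()
    encoded_cols = []
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # Cast to str so mixed or missing values do not break the encoder.
            out[col] = LabelEncoder().fit_transform(out[col].astype(str))
            encoded_cols.append(col)
    return out, encoded_cols


demo = pd.DataFrame(
    {
        "channel_name": ["Outcall", "Inbound", "Inbound"],
        "Agent Shift": ["Morning", "Evening", "Morning"],
        "CSAT Score": [5, 4, 5],
    }
)
encoded, encoded_cols = encode_categoricals(demo)
print(f"Encoded {len(encoded_cols)} categorical columns using LabelEncoder.")
```

LabelEncoder assigns integer codes in sorted order of the labels, so for nominal columns such as Agent_name or Unique id the numeric order carries no inherent meaning. Pearson coefficients computed on such columns should be read with that in mind.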

📊 Correlation Matrix (20 Columns; first 10 Rows shown)

	Unique id	channel_name	category	Sub-category	Customer Remarks	Order_id	order_date_time	Issue_reported at	issue_responded	Survey_response_Date	Customer_City	Product_category	Item_price	connected_handling_time	Agent_name	Supervisor	Manager	Tenure Bucket	Agent Shift	CSAT Score
Unique id	1.00	0.00	-0.01	0.00	-0.01	0.00	-0.00	0.00	0.00	0.00	0.00	0.00	-0.00	-0.00	-0.00	0.00	0.00	-0.00	-0.01	0.00
channel_name	0.00	1.00	0.02	0.03	0.00	-0.01	-0.02	0.06	0.06	0.06	-0.02	-0.06	-0.04	0.02	0.00	-0.01	0.03	0.03	-0.03	0.03
category	-0.01	0.02	1.00	0.39	-0.00	-0.02	-0.04	-0.01	-0.01	-0.01	-0.04	-0.07	-0.09	-0.00	-0.01	0.04	-0.02	-0.01	0.01	0.08
Sub-category	0.00	0.03	0.39	1.00	0.01	-0.02	-0.02	0.01	0.02	0.01	-0.03	-0.04	-0.07	0.01	0.00	0.02	-0.00	-0.01	-0.00	0.02
Customer Remarks	-0.01	0.00	-0.00	0.01	1.00	0.01	0.01	-0.00	-0.00	-0.00	-0.00	0.01	0.01	-0.00	0.00	-0.00	0.00	-0.00	0.01	-0.09
Order_id	0.00	-0.01	-0.02	-0.02	0.01	1.00	0.04	0.11	0.11	0.11	0.05	0.10	0.05	-0.00	-0.00	-0.00	0.01	0.03	-0.01	-0.01
order_date_time	-0.00	-0.02	-0.04	-0.02	0.01	0.04	1.00	0.01	-0.01	-0.01	0.08	0.20	0.08	-0.00	0.00	-0.00	-0.00	-0.01	0.00	-0.04
Issue_reported at	0.00	0.06	-0.01	0.01	-0.00	0.11	0.01	1.00	0.98	0.98	-0.02	-0.03	-0.04	-0.00	-0.00	-0.04	0.10	0.18	-0.01	0.03
issue_responded	0.00	0.06	-0.01	0.02	-0.00	0.11	-0.01	0.98	1.00	1.00	-0.02	-0.04	-0.04	-0.00	-0.00	-0.04	0.10	0.18	-0.00	0.03
Survey_response_Date	0.00	0.06	-0.01	0.01	-0.00	0.11	-0.01	0.98	1.00	1.00	-0.02	-0.04	-0.04	-0.00	-0.00	-0.04	0.10	0.18	0.00	0.03

Heatmap legend (color swatches not reproduced here): 1.00 = perfect correlation; positive values below 1.00; negative values; 0.00; -0.00.
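A matrix and heatmap like the one above could be produced roughly as follows, assuming a fully numeric (already encoded) DataFrame. The function names, the coolwarm palette, the output file name, and the sample values are assumptions. The plotting libraries are imported lazily so the correlation step runs on its own.

```python
import pandas as pd


def correlation_matrix(df):
    """Pearson correlation between all columns, rounded to two decimals."""
    return df.corr(method="pearson").round(2)


def plot_heatmap(corr, path="correlation_heatmap.png"):
    """Save a color-coded heatmap of the matrix (hypothetical output path)."""
    import matplotlib

    matplotlib.use("Agg")  # render to file without a display
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(12, 8))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)


# Illustrative numeric frame; the real input is the encoded 20-column dataset.
demo = pd.DataFrame(
    {
        "category": [8, 5, 10, 1, 11],
        "Item_price": [10.0, 20.0, 30.0, 40.0, 50.0],
        "CSAT Score": [5, 4, 5, 3, 5],
    }
)
corr = correlation_matrix(demo)
print(corr)
```

Fixing vmin and vmax at -1 and 1 keeps the color scale symmetric, so near-zero coefficients appear neutral rather than being stretched across the palette.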

🏆 Top 3 Features Most Related to Target

Feature	Correlation Value
category	0.077
Issue_reported at	0.033
Survey_response_Date	0.032

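These features have the strongest positive Pearson correlations with the CSAT Score target in this dataset, although all three are weak (below 0.1). Note that Customer Remarks has a larger correlation in absolute value (-0.09 in the matrix above). Use these features as starting candidates when building models.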
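The ranking step can be sketched as below. The function top_correlated and its use_abs option are hypothetical. The sample values are the CSAT Score entries of the ten matrix rows displayed in this report, so the result covers only those rows and is rounded to two decimals.

```python
import pandas as pd


def top_correlated(corr, target, k=3, use_abs=False):
    """Return the k features most correlated with the target column."""
    scores = corr[target].drop(labels=[target])
    order = scores.abs() if use_abs else scores
    return scores.loc[order.sort_values(ascending=False).index[:k]]


# CSAT Score column as shown in the ten displayed matrix rows (rounded values).
csat_corr = pd.Series(
    {
        "Unique id": 0.00, "channel_name": 0.03, "category": 0.08,
        "Sub-category": 0.02, "Customer Remarks": -0.09, "Order_id": -0.01,
        "order_date_time": -0.04, "Issue_reported at": 0.03,
        "issue_responded": 0.03, "Survey_response_Date": 0.03,
        "CSAT Score": 1.00,
    }
)
corr = pd.DataFrame({"CSAT Score": csat_corr})

top_positive = top_correlated(corr, "CSAT Score")
top_absolute = top_correlated(corr, "CSAT Score", use_abs=True)
print(top_positive)
print(top_absolute)
```

Ranking by signed value and ranking by absolute value can give different lists. Which one is appropriate depends on whether negatively correlated features should count as candidate predictors.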

📘 Manual Step-by-Step Correlation Calculation (Sample Rows)

Feature pair: category (X) vs. target CSAT Score (Y)

category	CSAT Score	(X - X̄)	(Y - Ȳ)	(X - X̄)*(Y - Ȳ)	(X - X̄)²	(Y - Ȳ)²
8	5	0.042255	0.757843	0.032023	0.001785	0.574326
8	5	0.042255	0.757843	0.032023	0.001785	0.574326
5	5	-2.957745	0.757843	-2.241506	8.748256	0.574326
10	5	2.042255	0.757843	1.547708	4.170805	0.574326
1	5	-6.957745	0.757843	-5.272877	48.410216	0.574326
10	5	2.042255	0.757843	1.547708	4.170805	0.574326
8	5	0.042255	0.757843	0.032023	0.001785	0.574326
10	5	2.042255	0.757843	1.547708	4.170805	0.574326
10	5	2.042255	0.757843	1.547708	4.170805	0.574326
11	5	3.042255	0.757843	2.305551	9.255315	0.574326

Σ((X−X̄)(Y−Ȳ)) = 24,743.0307
Σ(X−X̄)² = 626,958.6144
Σ(Y−Ȳ)² = 163,339.4034
Denominator = √(Σ(X−X̄)² × Σ(Y−Ȳ)²) = 320,011.0093
Calculated Correlation (Manual): 0.0773
Correlation from Pandas: 0.0773

The manual computation confirms the Pearson correlation reported by pandas. This validates the calculation and helps build statistical intuition.
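The same steps can be written in plain Python as a rough check. The function manual_pearson is a hypothetical helper, and the last lines simply recompute the coefficient from the three totals reported above.

```python
import math


def manual_pearson(xs, ys):
    """Pearson correlation from mean-centered sums, mirroring the table columns."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # Σ(X−X̄)(Y−Ȳ)
    sum_xx = sum((x - x_bar) ** 2 for x in xs)                       # Σ(X−X̄)²
    sum_yy = sum((y - y_bar) ** 2 for y in ys)                       # Σ(Y−Ȳ)²
    denominator = math.sqrt(sum_xx * sum_yy)
    return sum_xy / denominator


# Recompute r from the totals reported in this section.
reported_r = 24_743.0307 / math.sqrt(626_958.6144 * 163_339.4034)
print(round(reported_r, 4))  # 0.0773
```

On the full dataset, manual_pearson(df["category"], df["CSAT Score"]) should agree with df["category"].corr(df["CSAT Score"]) up to floating-point error, assuming df is the encoded DataFrame.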

Project Progress Milestone

This Week’s Milestone (as per course outline):
“Identify key predictive variables.”

What You Have Achieved

✅ Cleaned and encoded the dataset (from earlier weeks)
✅ Calculated and visualized correlations across features
✅ Found the Top 3 predictive features most related to the target
✅ Validated the pandas correlation value against a manual step-by-step calculation
✅ Ready to move toward regression or classification modeling in Week 5

🏆 Final Week 4 Project Milestone:
Key predictive features identified for the modeling phase.
(You now know which independent variables are most correlated with your target variable, keeping in mind that these correlations are weak.)