Week 4: Correlation Analysis (Professional Final Version)

Course: Applied Data Science with AI • Instructor: Dr. Muhammad Mohsin Nazir
Status: Completed
Report generated: Week 4

📘 Week 4: Detailed Overview / Introduction

Week 4 of the Data Analytics project marks the transition from basic data preparation to understanding the statistical relationships among variables. The main goal of this phase was to explore how the features in the dataset relate to one another and, more importantly, how they relate to the target variable, here the CSAT Score (customer satisfaction).

This week's analytical task focused on correlation analysis, a fundamental step before predictive modeling. Computing the correlation coefficients between numerical features reveals patterns of association that help in selecting relevant predictors and removing redundant or non-informative variables. Understanding these relationships supports model accuracy and interpretability in later stages such as regression or classification.

The workflow began with loading and cleaning the dataset (Customer_support_data.csv), ensuring no missing or inconsistent values remained. Categorical features were label-encoded to make them suitable for numerical computation. Next, the correlation matrix was generated to compare inter-variable relationships numerically, and a Seaborn heatmap gave a color-coded view in which the strongest and weakest correlations are easy to spot.

To deepen conceptual understanding, a manual correlation computation was performed for the feature most correlated with the target (category vs. CSAT Score). This step-by-step calculation demonstrated how correlation values are derived from mean-centered values and standard deviations, reinforcing both statistical intuition and technical accuracy.
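
For reference, the Pearson coefficient behind that manual computation can be written in terms of mean-centered values. The symbols are chosen here for illustration: $x_i$ and $y_i$ are the paired observations, $\bar{x}$ and $\bar{y}$ their means, and $n$ the number of rows.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Dividing the numerator and the denominator by $n - 1$ gives the equivalent form $r = \operatorname{cov}(x, y) / (s_x s_y)$, which is where the standard deviations enter.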

The week concluded by highlighting the Top 3 features most correlated with the target variable. These insights serve as a foundation for Week 5, where these features will be used to build predictive models that forecast customer satisfaction or business outcomes. In essence, Week 4 established a data-driven way to select the variables that matter, paving the way for robust model development and validation in subsequent stages.

In summary: Week 4 bridged exploratory data analysis and predictive modeling preparation, turning raw data into actionable insights through systematic correlation assessment and statistical reasoning.

Tasks Performed in Your VS Code

📊 Outputs Produced

✅ Dataset Loaded Successfully

Rows: 85,907
Columns: 20
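
The load step behind these counts can be sketched as follows. This is a minimal sketch that assumes the file is read with pandas; the helper name `load_dataset` is hypothetical and does not come from the report.

```python
import pandas as pd


def load_dataset(path_or_buffer):
    """Read a CSV file (or file-like object) and report its shape."""
    df = pd.read_csv(path_or_buffer)
    n_rows, n_cols = df.shape
    print(f"Loaded {n_rows:,} rows and {n_cols} columns")
    return df
```

Calling `load_dataset("Customer_support_data.csv")` on the file named in the overview would be expected to report the row and column counts shown above.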

Target Column Detected

Automatically identified target: CSAT Score
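
The report does not say how the target was detected. One plausible approach, sketched below purely as an assumption, is to scan the column names for keywords such as "csat" or "score". The function name `detect_target` and the keyword list are hypothetical.

```python
def detect_target(columns, keywords=("csat", "target", "label", "score")):
    """Return the first column whose lowercased name contains a keyword.

    Keywords are tried in priority order. If none matches, fall back to
    the last column, a common position for the target in tabular data.
    """
    columns = list(columns)
    if not columns:
        raise ValueError("no columns to search")
    for keyword in keywords:
        for name in columns:
            if keyword in name.lower():
                return name
    return columns[-1]


columns = ["channel_name", "category", "Item_price", "CSAT Score"]
print(detect_target(columns))  # CSAT Score
```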

📊 Dataset Overview

Column Index | Column Name
0 | Unique id
1 | channel_name
2 | category
3 | Sub-category
4 | Customer Remarks
5 | Order_id
6 | order_date_time
7 | Issue_reported at
8 | issue_responded
9 | Survey_response_Date
10 | Customer_City
11 | Product_category
12 | Item_price
13 | connected_handling_time
14 | Agent_name
15 | Supervisor
16 | Manager
17 | Tenure Bucket
18 | Agent Shift
19 | CSAT Score

🔎 First 5 Rows (Preview)

Unique id | channel_name | category | Sub-category | Customer Remarks | Order_id | order_date_time | Issue_reported at | issue_responded | Survey_response_Date | Customer_City | Product_category | Item_price | connected_handling_time | Agent_name | Supervisor | Manager | Tenure Bucket | Agent Shift | CSAT Score
7e9ae164-6a8b-4521-a2d4-58f7c9fff13f | Outcall | Product Queries | Life Insurance | - | c27c9bb4-fa36-4140-9f1f-21009254ffdb | - | 01/08/2023 11:13 | 01/08/2023 11:47 | 01-Aug-23 | - | - | - | - | Richard Buchanan | Mason Gupta | Jennifer Nguyen | On Job Training | Morning | 5
b07ec1b0-f376-43b6-86df-ec03da3b2e16 | Outcall | Product Queries | Product Specific Information | - | d406b0c7-ce17-4654-b9de-f08d421254bd | - | 01/08/2023 12:52 | 01/08/2023 12:54 | 01-Aug-23 | - | - | - | - | Vicki Collins | Dylan Kim | Michael Lee | >90 | Morning | 5
200814dd-27c7-4149-ba2b-bd3af3092880 | Inbound | Order Related | Installation/demo | - | c273368d-b961-44cb-beaf-62d6fd6c00d5 | - | 01/08/2023 20:16 | 01/08/2023 20:38 | 01-Aug-23 | - | - | - | - | Duane Norman | Jackson Park | William Kim | On Job Training | Evening | 5
eb0d3e53-c1ca-42d3-8486-e42c8d622135 | Inbound | Returns | Reverse Pickup Enquiry | - | 5aed0059-55a4-4ec6-bb54-97942092020a | - | 01/08/2023 20:56 | 01/08/2023 21:16 | 01-Aug-23 | - | - | - | - | Patrick Flores | Olivia Wang | John Smith | >90 | Evening | 5
ba903143-1e54-406c-b969-46c52f92e5df | Inbound | Cancellation | Not Needed | - | e8bed5a9-6933-4aff-9dc6-ccefd7dcde59 | - | 01/08/2023 10:30 | 01/08/2023 10:32 | 01-Aug-23 | - | - | - | - | Christopher Sanchez | Austin Johnson | Michael Lee | 0-30 | Morning | 5

🧹 Missing Value Check

Total missing values after cleaning: 0
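
The cleaning rule is not stated in the report. Because the preview shows several columns empty in every sampled row (for example Customer Remarks and Item_price), dropping incomplete rows would likely discard much of the data. The sketch below therefore assumes imputation: the median for numeric columns and a placeholder label for text columns. The function name `fill_missing` and the "Unknown" placeholder are assumptions, not the report's method.

```python
import pandas as pd


def fill_missing(df, placeholder="Unknown"):
    """Return a copy with numeric NaNs set to the column median
    and all other NaNs set to a placeholder label."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(placeholder)
    return out


# Tiny made-up frame, used only to exercise the function.
raw = pd.DataFrame({
    "Item_price": [100.0, None, 300.0],
    "Customer Remarks": ["good", None, None],
})
clean = fill_missing(raw)
print("Total missing values after cleaning:", int(clean.isna().sum().sum()))
```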

🔠 Encoding Summary

Encoded 17 categorical columns using LabelEncoder.

🗂️ Encoded Columns

Index | Categorical Column
0 | Unique id
1 | channel_name
2 | category
3 | Sub-category
4 | Customer Remarks
5 | Order_id
6 | order_date_time
7 | Issue_reported at
8 | issue_responded
9 | Survey_response_Date
10 | Customer_City
11 | Product_category
12 | Agent_name
13 | Supervisor
14 | Manager
15 | Tenure Bucket
16 | Agent Shift
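
A minimal sketch of the encoding step follows. The report used scikit-learn's LabelEncoder. To keep the example dependent on pandas alone, this sketch swaps in pandas category codes, which likewise assign integers in sorted order of the distinct labels. The function name `encode_categoricals` and the sample frame are hypothetical.

```python
import pandas as pd


def encode_categoricals(df):
    """Replace every non-numeric column with integer codes.

    Returns the encoded copy and the list of encoded column names.
    """
    out = df.copy()
    encoded = []
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # Codes follow the sorted order of the distinct labels.
            out[col] = out[col].astype("category").cat.codes
            encoded.append(col)
    return out, encoded


# Made-up sample using column names and labels from the preview.
sample = pd.DataFrame({
    "channel_name": ["Outcall", "Inbound", "Inbound"],
    "Agent Shift": ["Morning", "Evening", "Morning"],
    "CSAT Score": [5, 5, 4],
})
encoded_df, encoded_cols = encode_categoricals(sample)
print(f"Encoded {len(encoded_cols)} categorical columns: {encoded_cols}")
```

Integer codes put nominal labels in an arbitrary (alphabetical) order, so a Pearson correlation computed on them reflects that ordering as well as the data.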

📊 Correlation Matrix (20 Columns × 10 Rows)

Feature | Unique id | channel_name | category | Sub-category | Customer Remarks | Order_id | order_date_time | Issue_reported at | issue_responded | Survey_response_Date | Customer_City | Product_category | Item_price | connected_handling_time | Agent_name | Supervisor | Manager | Tenure Bucket | Agent Shift | CSAT Score
Unique id | 1.00 | 0.00 | -0.01 | 0.00 | -0.01 | 0.00 | -0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | -0.00 | -0.00 | -0.00 | 0.00 | 0.00 | -0.00 | -0.01 | 0.00
channel_name | 0.00 | 1.00 | 0.02 | 0.03 | 0.00 | -0.01 | -0.02 | 0.06 | 0.06 | 0.06 | -0.02 | -0.06 | -0.04 | 0.02 | 0.00 | -0.01 | 0.03 | 0.03 | -0.03 | 0.03
category | -0.01 | 0.02 | 1.00 | 0.39 | -0.00 | -0.02 | -0.04 | -0.01 | -0.01 | -0.01 | -0.04 | -0.07 | -0.09 | -0.00 | -0.01 | 0.04 | -0.02 | -0.01 | 0.01 | 0.08
Sub-category | 0.00 | 0.03 | 0.39 | 1.00 | 0.01 | -0.02 | -0.02 | 0.01 | 0.02 | 0.01 | -0.03 | -0.04 | -0.07 | 0.01 | 0.00 | 0.02 | -0.00 | -0.01 | -0.00 | 0.02
Customer Remarks | -0.01 | 0.00 | -0.00 | 0.01 | 1.00 | 0.01 | 0.01 | -0.00 | -0.00 | -0.00 | -0.00 | 0.01 | 0.01 | -0.00 | 0.00 | -0.00 | 0.00 | -0.00 | 0.01 | -0.09
Order_id | 0.00 | -0.01 | -0.02 | -0.02 | 0.01 | 1.00 | 0.04 | 0.11 | 0.11 | 0.11 | 0.05 | 0.10 | 0.05 | -0.00 | -0.00 | -0.00 | 0.01 | 0.03 | -0.01 | -0.01
order_date_time | -0.00 | -0.02 | -0.04 | -0.02 | 0.01 | 0.04 | 1.00 | 0.01 | -0.01 | -0.01 | 0.08 | 0.20 | 0.08 | -0.00 | 0.00 | -0.00 | -0.00 | -0.01 | 0.00 | -0.04
Issue_reported at | 0.00 | 0.06 | -0.01 | 0.01 | -0.00 | 0.11 | 0.01 | 1.00 | 0.98 | 0.98 | -0.02 | -0.03 | -0.04 | -0.00 | -0.00 | -0.04 | 0.10 | 0.18 | -0.01 | 0.03
issue_responded | 0.00 | 0.06 | -0.01 | 0.02 | -0.00 | 0.11 | -0.01 | 0.98 | 1.00 | 1.00 | -0.02 | -0.04 | -0.04 | -0.00 | -0.00 | -0.04 | 0.10 | 0.18 | -0.00 | 0.03
Survey_response_Date | 0.00 | 0.06 | -0.01 | 0.01 | -0.00 | 0.11 | -0.01 | 0.98 | 1.00 | 1.00 | -0.02 | -0.04 | -0.04 | -0.00 | -0.00 | -0.04 | 0.10 | 0.18 | 0.00 | 0.03
Cell color legend (five classes): 1.00 (perfect correlation); positive below 1.00; negative; 0.00; -0.00.
Correlation Analysis Visualization
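
The matrix and heatmap can be produced along the following lines. This is a sketch under stated assumptions: the function names, the "coolwarm" colormap, the figure size, and the output file name are illustrative and do not come from the report. The plotting helper imports seaborn and matplotlib only when it is called.

```python
import pandas as pd


def correlation_matrix(encoded_df):
    """Pairwise Pearson correlations, rounded to two decimals."""
    return encoded_df.corr(method="pearson").round(2)


def plot_heatmap(corr, path="correlation_heatmap.png"):
    """Draw an annotated heatmap; needs seaborn and matplotlib installed."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(14, 10))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                vmin=-1, vmax=1, ax=ax)
    ax.set_title("Correlation Analysis Visualization")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)


# Toy frame: y rises with x, z falls with x.
toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]})
corr = correlation_matrix(toy)
print(corr)
```

Rounding to two decimals explains the cells that read -0.00 in the matrix: they are very small negative values, not exact zeros.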

๐Ÿ† Top 3 Features Most Related to Target

FeatureCorrelation Value
category0.077
Issue_reported at0.033
Survey_response_Date0.032

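These three features have the largest positive Pearson correlations with the CSAT Score target in this dataset, although all are weak (below 0.08). Customer Remarks has a larger correlation in magnitude (-0.09 in the matrix above) but it is negative, so it does not appear in this positive-only ranking. Use these features as candidate predictors when building models.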
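
The ranking can be reproduced with a short helper. This is a sketch: the function name `top_positive_correlations` is hypothetical, and the small frame holds only a handful of the target-column correlations quoted in this report.

```python
import pandas as pd


def top_positive_correlations(corr, target, k=3):
    """Return the k features with the largest positive correlation to target."""
    scores = corr[target].drop(labels=[target])
    return scores.sort_values(ascending=False).head(k)


# Target-column correlations quoted in this report (subset of features).
corr = pd.DataFrame({"CSAT Score": {
    "category": 0.077,
    "Sub-category": 0.02,
    "Customer Remarks": -0.09,
    "Issue_reported at": 0.033,
    "Survey_response_Date": 0.032,
    "CSAT Score": 1.00,
}})
print(top_positive_correlations(corr, "CSAT Score"))
```

Sorting `scores.abs()` instead would rank by strength regardless of sign and would place Customer Remarks first.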


📘 Manual Step-by-Step Correlation Calculation (Sample Rows)

Feature explained: category vs Target: CSAT Score

category | CSAT Score | (X - X̄) | (Y - Ȳ) | (X - X̄)*(Y - Ȳ) | (X - X̄)² | (Y - Ȳ)²
8 | 5 | 0.042255 | 0.757843 | 0.032023 | 0.001785 | 0.574326
8 | 5 | 0.042255 | 0.757843 | 0.032023 | 0.001785 | 0.574326
5 | 5 | -2.957745 | 0.757843 | -2.241506 | 8.748256 | 0.574326
10 | 5 | 2.042255 | 0.757843 | 1.547708 | 4.170805 | 0.574326
1 | 5 | -6.957745 | 0.757843 | -5.272877 | 48.410216 | 0.574326
10 | 5 | 2.042255 | 0.757843 | 1.547708 | 4.170805 | 0.574326
8 | 5 | 0.042255 | 0.757843 | 0.032023 | 0.001785 | 0.574326
10 | 5 | 2.042255 | 0.757843 | 1.547708 | 4.170805 | 0.574326
10 | 5 | 2.042255 | 0.757843 | 1.547708 | 4.170805 | 0.574326
11 | 5 | 3.042255 | 0.757843 | 2.305551 | 9.255315 | 0.574326
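Totals over all rows of the dataset (not only the sample rows above):
Σ((X − X̄)(Y − Ȳ)) = 24,743.0307
Σ(X − X̄)² = 626,958.6144
Σ(Y − Ȳ)² = 163,339.4034
Denominator = √(Σ(X − X̄)² × Σ(Y − Ȳ)²) = 320,011.0093
Calculated Correlation (Manual): 24,743.0307 / 320,011.0093 = 0.0773
Correlation from Pandas: 0.0773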

The manual computation confirms the Pearson correlation reported by pandas. This validates the calculation and helps build statistical intuition.
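
The same steps can be written as a small function in plain Python. This is a sketch with a hypothetical function name (`pearson_manual`), and each intermediate variable mirrors one column of the table above. The last lines plug in the full-dataset sums reported above.

```python
import math


def pearson_manual(xs, ys):
    """Pearson r from mean-centered sums, mirroring the table's columns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    dx = [x - x_bar for x in xs]                 # X minus its mean
    dy = [y - y_bar for y in ys]                 # Y minus its mean
    s_xy = sum(a * b for a, b in zip(dx, dy))    # sum of cross products
    s_xx = sum(a * a for a in dx)                # sum of squared X deviations
    s_yy = sum(b * b for b in dy)                # sum of squared Y deviations
    denominator = math.sqrt(s_xx * s_yy)
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant column")
    return s_xy / denominator


# Plugging in the report's full-dataset sums reproduces its result.
r = 24743.0307 / math.sqrt(626958.6144 * 163339.4034)
print(round(r, 4))  # 0.0773
```

On real data, the result of `pearson_manual` can be checked against `Series.corr` in pandas, which is the comparison made in this section.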

Project Progress Milestone

This Week's Milestone (as per course outline):
"Identify key predictive variables."

What You Have Achieved

๐Ÿ† Final Week 4 Project Milestone:
Key predictive features identified for the modeling phase.
(You now know which independent variables strongly influence your target variable.)