Week 3 Assignment – EDA Detailed Explanation

Overview – Exploratory Data Analysis (EDA)

Week 3 of the assignment focuses on Exploratory Data Analysis (EDA) of the provided E-commerce Customer Support Dataset. The objective is to explore, clean, and understand the dataset using data visualization and summarization techniques: examining the structure, quality, and relationships of the data through tables, descriptive statistics, and plots.

The task starts with a Dataset Information Table that lists missing values and data types, followed by statistical summaries of the numeric features Item_price, connected_handling_time, and CSAT Score. Several chart types (distribution plots, count plots, bar charts, boxplots, and a correlation heatmap) are then used to interpret patterns and trends visually. These visualizations help show how factors such as customer service channel, product category, handling time, and agent shift relate to customer satisfaction.

Through this process, students are expected to demonstrate their ability to interpret data trends, detect anomalies, understand correlations, and extract actionable insights. By the end of this week’s assignment, the learner should be able to perform a comprehensive EDA, transforming raw data into a well-structured and meaningful analysis that can guide decision-making in improving customer support operations.

1. Dataset Information Table

Column Name | Non-Null Count | Missing Values | Data Type
Unique id | 85907 | 0 | object
channel_name | 85907 | 0 | object
category | 85907 | 0 | object
Sub-category | 85907 | 0 | object
Customer Remarks | 28742 | 57165 | object
Order_id | 67675 | 18232 | object
order_date_time | 17214 | 68693 | object
Issue_reported at | 85907 | 0 | object
issue_responded | 85907 | 0 | object
Survey_response_Date | 85907 | 0 | object
Customer_City | 17079 | 68828 | object
Product_category | 17196 | 68711 | object
Item_price | 17206 | 68701 | float64
connected_handling_time | 242 | 85665 | float64
Agent_name | 85907 | 0 | object
Supervisor | 85907 | 0 | object
Manager | 85907 | 0 | object
Tenure Bucket | 85907 | 0 | object
Agent Shift | 85907 | 0 | object
CSAT Score | 85907 | 0 | int64

Purpose:

To get a high-level view of the dataset: how many rows are complete, what type of data is present, and which columns contain missing values.
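A table of this shape can be produced with a few pandas calls. The sketch below is one possible way to do it, not necessarily how the assignment's table was generated; the sample rows are made up, and `dataset_info` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; the column names follow the table above,
# the values are purely illustrative.
df = pd.DataFrame({
    "channel_name": ["Phone", "Email", "Chat", "Phone"],
    "Customer Remarks": ["Good", None, None, "Slow refund"],
    "Item_price": [979.0, np.nan, np.nan, 1299.0],
    "CSAT Score": [5, 4, 1, 5],
})


def dataset_info(frame: pd.DataFrame) -> pd.DataFrame:
    """One row per column: non-null count, missing count, and dtype."""
    return pd.DataFrame({
        "Non-Null Count": frame.notna().sum(),
        "Missing Values": frame.isna().sum(),
        "Data Type": frame.dtypes.astype(str),
    })


info = dataset_info(df)
print(info)
```

For every column, Non-Null Count plus Missing Values equals the number of rows, which is a quick sanity check on the table.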

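Explanation of Columns:

Column Name: name of the field in the dataset.
Non-Null Count: number of rows (out of 85907) that contain a value.
Missing Values: number of rows where the value is absent, i.e. 85907 minus the Non-Null Count.
Data Type: how the column is stored (object for text, float64 for decimal numbers, int64 for integers).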

Insights:

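Customer Remarks (57165 missing, about 67%), Customer_City (68828 missing, about 80%), and Order_id (18232 missing, about 21%) have many missing values. order_date_time and Product_category are also missing in about 80% of rows.
Missing Customer Remarks means many customers didn’t leave comments.
Missing Order_id could indicate data collection errors or anonymization.
Among the numeric columns, only CSAT Score is complete. Item_price is present in 17206 rows (about 20%) and connected_handling_time in only 242 rows (about 0.3%), so quantitative analysis of these two features covers a small subset of the data.
Categorical columns like channel_name and category are fully complete, which makes them useful for group-wise analysis.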

Why it matters:

Missing data affects calculations (mean, median) and visualizations, because statistics are computed only from the rows that have a value. The table helps decide whether to drop, impute, or ignore missing values in the analysis.
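To make the drop/impute/ignore decision concrete, the missing counts from the table can be turned into shares of the 85907 rows. This is a small sketch; the 50% cut-off (`HIGH_MISSING`) and the two labels are hypothetical choices for illustration, not part of the assignment.

```python
TOTAL_ROWS = 85907  # non-null + missing for every column in the table

# Missing-value counts copied from the Dataset Information Table.
missing_counts = {
    "Customer Remarks": 57165,
    "Order_id": 18232,
    "order_date_time": 68693,
    "Customer_City": 68828,
    "Product_category": 68711,
    "Item_price": 68701,
    "connected_handling_time": 85665,
}

# Hypothetical cut-off used only to label columns in this sketch.
HIGH_MISSING = 0.5


def missing_share(count: int, total: int = TOTAL_ROWS) -> float:
    """Fraction of rows in which the column has no value."""
    return count / total


report = {}
for column, count in missing_counts.items():
    share = missing_share(count)
    flag = "review before use" if share > HIGH_MISSING else "usable with care"
    report[column] = (round(100 * share, 1), flag)
    print(f"{column:<26}{100 * share:6.1f}%  {flag}")
```

Columns above the cut-off are candidates for dropping or for analysis on the non-missing subset only; the right threshold depends on the question being asked.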

2. Numeric Feature Statistics Table

Statistic | Item_price | connected_handling_time | CSAT Score
Count (Non-Null) | 17206 | 242 | 85907
Missing | 68701 | 85665 | 0
Mean | 5660.77 | 462.40 | 4.24
Median | 979.0 | 427.0 | 5.0
Mode | 999.0 | 282.0, 299.0, 301.0, 418.0 | 5
Std Dev | 12825.73 | 246.30 | 1.38
Min | 0.0 | 0.0 | 1.0
Max | 164999.0 | 1986.0 | 5.0

Purpose:

To summarize key statistics of numeric features (Item_price, connected_handling_time, CSAT Score).
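The same set of statistics can be assembled with pandas. The following is a minimal sketch on made-up values; `numeric_summary` is a hypothetical helper, and pandas' default sample standard deviation (ddof=1) is assumed, which may or may not match how the table's Std Dev was computed.

```python
import numpy as np
import pandas as pd

# Made-up sample; only the column names come from the dataset.
df = pd.DataFrame({
    "Item_price": [999.0, 999.0, 450.0, np.nan, 15000.0],
    "connected_handling_time": [np.nan, 282.0, np.nan, 418.0, np.nan],
    "CSAT Score": [5, 5, 4, 1, 5],
})


def numeric_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Statistics as rows, features as columns; NaN values are skipped."""
    rows = {
        "Count (Non-Null)": frame.count(),
        "Missing": frame.isna().sum(),
        "Mean": frame.mean(),
        "Median": frame.median(),
        "Std Dev": frame.std(),  # sample standard deviation (ddof=1)
        "Min": frame.min(),
        "Max": frame.max(),
    }
    return pd.DataFrame(rows).T


summary = numeric_summary(df)

# A column can have several modes, so they are kept as lists.
modes = {col: df[col].mode().tolist() for col in df.columns}

print(summary.round(2))
print(modes)
```

Keeping the modes separate avoids mixing lists into an otherwise numeric table, which is why connected_handling_time can show several mode values at once.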

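Explanation of Metrics:

Count (Non-Null): number of rows with a value for the feature.
Missing: number of rows without a value.
Mean: arithmetic average of the available values.
Median: middle value when the available values are sorted.
Mode: most frequent value; a feature can have more than one mode.
Std Dev: standard deviation, a measure of how widely values spread around the mean.
Min / Max: smallest and largest observed values.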

Insights for Each Column:

Item_price: The mean (5660.77) is far above the median (979.0) → right-skewed distribution (a few expensive products pull the mean up). The maximum (164999.0) is much higher than the mean → outliers exist.
connected_handling_time: The mean (462.40) is above the median (427.0), so a few long cases (up to 1986.0) pull it up. The standard deviation (246.30) shows how much handling times vary between interactions. Only 242 rows have a value, so these figures describe a small subset.
CSAT Score: Ranges from 1 to 5. A mean of 4.24 and a median of 5 indicate overall high customer satisfaction. The standard deviation (1.38) shows the variation in customers’ feedback.
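The mean-versus-median comparison can be turned into a rough number with Pearson's second skewness coefficient, 3 × (mean − median) / std. The sketch below applies it to the values reported in the statistics table; it is a quick indicator only and is not the moment-based skewness that a library routine would return.

```python
# Mean, median, and standard deviation copied from the
# Numeric Feature Statistics Table.
reported = {
    "Item_price": (5660.77, 979.0, 12825.73),
    "connected_handling_time": (462.40, 427.0, 246.30),
    "CSAT Score": (4.24, 5.0, 1.38),
}


def median_skewness(mean: float, median: float, std: float) -> float:
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    return 3 * (mean - median) / std


skew = {}
for name, (mean, median, std) in reported.items():
    skew[name] = median_skewness(mean, median, std)
    # Positive: mean above median (right tail). Negative: left tail.
    direction = "right" if skew[name] > 0 else "left"
    print(f"{name}: {skew[name]:+.2f} ({direction}-skewed)")
```

Item_price and connected_handling_time come out positive (right-skewed), while CSAT Score comes out negative because most ratings sit at the top of the 1–5 scale.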

Why it matters:

Helps identify central tendency, spread, and anomalies in numeric features. Important for later visualizations (histograms, boxplots, correlation).

3. Distribution Plots (Histograms + KDE)

Item Price Distribution

Insights:

Item_price: Peak at low prices → majority of products are inexpensive. Long tail at high prices → few expensive products exist (skewed right).

Interpretation:

Item_price Distribution: Most products are low-cost → peak at lower prices. Long right tail → few expensive products exist (outliers). Insight: Customer satisfaction may vary slightly with price, but most interactions are for low-value products.

Connected Handling Time Distribution

Insights:

connected_handling_time: Peak at short times → most requests handled quickly. Long tail → rare cases that take a long time → may decrease CSAT.

Interpretation:

Connected_handling_time Distribution: Peak at short times → most requests resolved quickly. Long tail → rare cases taking much longer. Insight: Outliers (long handling times) may negatively affect CSAT. Process improvement could reduce these extreme cases.

CSAT Score Distribution

Insights:

CSAT Score: Peaks at 4–5 → majority satisfied. Small peaks at 1–2 → unhappy customers, focus for improvement.

Interpretation:

CSAT Score Distribution: Peak at 4–5 → majority of customers are satisfied. Small peaks at 1–2 → indicate dissatisfied customers. Insight: Service quality is generally good, but low CSAT cases need investigation.

Purpose:

To visualize how often numeric values occur and the shape of their distribution. The KDE (Kernel Density Estimate) curve is a smoothed estimate of the probability density, drawn over the histogram bars.
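To show what the two layers of such a plot contain, the sketch below computes a density-normalised histogram and a Gaussian KDE curve by hand with NumPy. The handling times are made up, `gaussian_kde_curve` is a hypothetical helper, and the bandwidth uses a common rule of thumb (sample standard deviation times n^(-1/5)); plotting libraries such as seaborn (`histplot` with `kde=True`) do the drawing and may choose the bandwidth differently.

```python
import numpy as np


def gaussian_kde_curve(values, grid, bandwidth):
    """Average of Gaussian bumps centred on each observation."""
    values = np.asarray(values, dtype=float)
    z = (grid[:, None] - values[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return bumps.mean(axis=1) / bandwidth


# Made-up handling times in seconds (illustrative only).
times = np.array([120.0, 150.0, 180.0, 200.0, 240.0, 300.0, 420.0, 900.0])

# Histogram layer: bar heights scaled so the bar areas sum to 1.
density, edges = np.histogram(times, bins=5, density=True)

# KDE layer: rule-of-thumb bandwidth, evaluated on an even grid.
bandwidth = times.std(ddof=1) * len(times) ** (-1 / 5)
grid = np.linspace(times.min() - 3 * bandwidth,
                   times.max() + 3 * bandwidth, 500)
kde = gaussian_kde_curve(times, grid, bandwidth)

# Both layers describe a density, so each integrates to about 1.
hist_area = float(np.sum(density * np.diff(edges)))
kde_area = float(np.sum(kde) * (grid[1] - grid[0]))
print(round(hist_area, 3), round(kde_area, 3))
```

A smaller bandwidth makes the curve follow individual points more closely, while a larger one smooths the long tail away, so the choice affects how visible outliers are.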

Why it matters:

Helps identify data distribution, skewness, and outliers visually. Guides decisions for scaling, transformations, or outlier treatment.

Key Takeaway: Histograms + KDE show data concentration, spread, skewness, and potential outliers.

4. Count Plots for Categorical Features

Channel Name Count Plot

Interpretation:

Channel_name: Shows which communication channel (email, chat, phone) is used most. High-count channels → heavier workload → monitor performance. Low-count channels → fewer interactions → may need fewer resources.

Category Count Plot

Interpretation:

Category: Reveals top complaint/product categories. Categories with high counts → frequently reported issues → prioritize improvements.

Sub-category Count Plot

Interpretation:

Sub-category (Top 10 only): Only top 10 plotted → readable chart. Some sub-categories dominate → indicates specific problem areas within a category.

Agent Shift Count Plot

Interpretation:

Agent Shift: Visualizes distribution of tickets per shift. Unequal distribution → may explain differences in CSAT by shift.

Tenure Bucket Count Plot

Interpretation:

Tenure Bucket: Shows experience levels of agents. Majority in mid-range → may influence handling efficiency and satisfaction.

Purpose:

To see frequency of each category.
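The numbers behind a count plot are plain value counts. The sketch below uses made-up rows and a hypothetical helper `top_counts`; limiting the result to the n most frequent values mirrors the "top 10 only" treatment of Sub-category.

```python
import pandas as pd

# Made-up sample; the column names come from the dataset,
# the values are illustrative.
df = pd.DataFrame({
    "channel_name": ["Phone", "Chat", "Phone", "Email", "Phone", "Chat"],
    "Sub-category": ["Delayed", "Refund", "Delayed",
                     "Wrong item", "Delayed", "Refund"],
})


def top_counts(frame: pd.DataFrame, column: str, n: int = 10) -> pd.Series:
    """Frequency of each value in `column`, largest first, limited to n."""
    return frame[column].value_counts().head(n)


channel_counts = top_counts(df, "channel_name")
sub_counts = top_counts(df, "Sub-category", n=2)

print(channel_counts)
print(sub_counts)
# channel_counts.plot(kind="bar") would draw the chart (needs matplotlib).
```

Because `value_counts` sorts in descending order, the bars appear from the busiest value to the quietest, which makes hotspots easy to spot.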

Why it matters:

Identifies hotspots, workload distribution, and potential improvement areas.

Key Takeaway: Count plots identify volume hotspots, workload, and potential problem areas.

5. Bar Plots (Average CSAT Score by Category)

CSAT by Channel

Insights:

Channels with low average CSAT → need attention. High average CSAT → best practices can be replicated.

Interpretation:

Example: Phone support average CSAT = 4.8 (excellent), Chat = 4.8 (great), Email = 4.2 (moderate).

CSAT by Category

Insights:

Top 10 categories only → readable chart. Low CSAT categories → identify product/service issues.

Interpretation:

Helps prioritize improvement areas based on average CSAT.

Purpose:

Shows average satisfaction (CSAT) per category/channel.
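The bar heights are group means of CSAT Score. Below is a minimal sketch on made-up rows; `mean_csat_by` is a hypothetical helper that keeps the most frequent groups (an assumption about what "top 10" means here) and sorts them so the lowest-rated group comes first.

```python
import pandas as pd

# Made-up sample (illustrative values only).
df = pd.DataFrame({
    "channel_name": ["Phone", "Phone", "Chat", "Chat", "Email", "Email"],
    "CSAT Score": [5, 4, 5, 5, 3, 4],
})


def mean_csat_by(frame: pd.DataFrame, column: str, top_n: int = 10) -> pd.DataFrame:
    """Mean CSAT and ticket count per group, lowest mean first."""
    stats = frame.groupby(column)["CSAT Score"].agg(["mean", "count"])
    most_frequent = stats.nlargest(top_n, "count")
    return most_frequent.sort_values("mean")


by_channel = mean_csat_by(df, "channel_name")
print(by_channel)
```

Reporting the count next to the mean is a deliberate choice: an average based on a handful of tickets is much less reliable than one based on thousands.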

Why it matters:

Supports data-driven decision-making for improving service quality.

Key Takeaway: Bar plots highlight which channels/categories have higher or lower customer satisfaction and help prioritize improvements.

6. Boxplots (Numeric Features vs CSAT Score)

Item Price vs CSAT

Insights:

Median item price fairly consistent across CSAT scores → price does not strongly affect satisfaction. Outliers may indicate very high-value purchases with low satisfaction.

Interpretation:

Boxplots reveal relationships, spread, medians, and outliers. Helps identify process improvements.

Connected Handling Time vs CSAT

Insights:

Median handling time higher for lower CSAT scores → longer handling → unhappy customers. Outliers at high handling times → significant dissatisfaction risk.

Purpose:

Visualizes the spread of numeric features across CSAT ratings.
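One way to draw handling time against CSAT rating is a grouped boxplot in matplotlib, sketched below on made-up rows. The plotting library and figure settings are assumptions; the assignment may have used a different tool. Rows without a handling time are dropped first, since they cannot be drawn.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample (illustrative values only).
df = pd.DataFrame({
    "CSAT Score": [1, 1, 1, 3, 3, 5, 5, 5, 5],
    "connected_handling_time": [900.0, 700.0, None, 450.0, 500.0,
                                200.0, 250.0, 300.0, 1500.0],
})

valid = df.dropna(subset=["connected_handling_time"])
scores = sorted(valid["CSAT Score"].unique())
groups = [
    valid.loc[valid["CSAT Score"] == s, "connected_handling_time"].to_numpy()
    for s in scores
]

fig, ax = plt.subplots(figsize=(6, 4))
ax.boxplot(groups)  # one box per CSAT score, at positions 1..n
ax.set_xticks(range(1, len(scores) + 1))
ax.set_xticklabels([str(s) for s in scores])
ax.set_xlabel("CSAT Score")
ax.set_ylabel("connected_handling_time")
ax.set_title("Connected Handling Time vs CSAT")

# The line inside each box is the group median.
medians = valid.groupby("CSAT Score")["connected_handling_time"].median()
print(medians)
```

In this sample the median falls as the score rises, which is the pattern the plot is meant to reveal; the 1500.0 value in the top-rated group would appear as an outlier point.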

Why it matters:

Helps identify patterns, outliers, and relationships visually. Guides business decisions (e.g., reduce handling time for high-value items).

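Boxplot components:

Box: spans the first quartile (Q1) to the third quartile (Q3), i.e. the interquartile range (IQR) that holds the middle 50% of values.
Line inside the box: the median.
Whiskers: extend to the most extreme values that are not outliers (commonly within 1.5 × IQR of the box).
Points beyond the whiskers: outliers.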
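The pieces of a boxplot can also be computed directly. The sketch below assumes the common 1.5 × IQR convention for whiskers and outliers; `boxplot_components` is a hypothetical helper, and the sample values are made up except for 1986.0, the maximum handling time from the statistics table.

```python
import numpy as np


def boxplot_components(values, whisker=1.5):
    """Quartiles, whisker ends, and outliers under the 1.5 * IQR rule."""
    data = np.asarray(values, dtype=float)
    data = data[~np.isnan(data)]  # missing values are not drawn
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    inside = data[(data >= lower_fence) & (data <= upper_fence)]
    outside = data[(data < lower_fence) | (data > upper_fence)]
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "iqr": float(iqr),
        "whisker_low": float(inside.min()),
        "whisker_high": float(inside.max()),
        "outliers": outside.tolist(),
    }


times = [200.0, 250.0, 300.0, 350.0, 400.0, 450.0, 1986.0]
parts = boxplot_components(times)
print(parts)
```

Here the box runs from 275.0 to 425.0 around a median of 350.0, and 1986.0 lies beyond the upper fence of 650.0, so it is reported as an outlier rather than stretching the whisker.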

Key Takeaway: Boxplots reveal relationships, spread, medians, and outliers. Helps identify process improvements.

7. Correlation Heatmap

Correlation Heatmap

Insights:

connected_handling_time vs CSAT Score: Negative correlation → longer handling is associated with lower satisfaction.
Item_price vs CSAT Score: Weak correlation → price is not strongly linked to satisfaction.

Interpretation:

Correlation heatmap helps identify numeric features influencing satisfaction, guiding business decisions and ML models.

Purpose:

Shows linear relationships between numeric features.
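The matrix behind a heatmap can be computed with pandas, as in the sketch below on made-up rows. Pearson correlation is assumed, and `min_periods=3` is an arbitrary illustrative setting. pandas uses only the rows where both features are present, which matters here because connected_handling_time has just 242 non-null values in the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up sample with gaps (illustrative values only).
df = pd.DataFrame({
    "Item_price": [500.0, 1200.0, np.nan, 800.0, 15000.0, 950.0],
    "connected_handling_time": [200.0, 900.0, 450.0, np.nan, 300.0, 700.0],
    "CSAT Score": [5, 1, 4, 5, 4, 2],
})

# Pairwise Pearson correlation; each pair needs at least 3 complete rows.
corr = df.corr(method="pearson", min_periods=3)

# How many rows actually back the handling-time vs CSAT coefficient.
pair_rows = len(df[["connected_handling_time", "CSAT Score"]].dropna())

print(corr.round(2))
print("rows used for handling time vs CSAT:", pair_rows)
# seaborn.heatmap(corr, annot=True) is one way to draw the matrix.
```

Checking the number of rows behind each coefficient is worthwhile: a strong value computed from very few complete pairs deserves less weight than a moderate one computed from the full dataset.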

Why it matters:

Helps identify features that influence satisfaction. Useful for predictive modeling (e.g., machine learning).

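Correlation range:

+1: perfect positive linear relationship (both values rise together).
0: no linear relationship.
-1: perfect negative linear relationship (one value rises as the other falls).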

Key Takeaway: Correlation heatmap helps identify numeric features influencing satisfaction, guiding business decisions and ML models.