Week 3 Assignment – EDA Detailed Explanation
Overview – Exploratory Data Analysis (EDA)
In Week 3, the assignment focuses on Exploratory Data Analysis (EDA) of the provided E-commerce Customer Support Dataset. The objective is to explore, clean, and understand the dataset using data visualization and summarization techniques: examining the structure, quality, and relationships in the data through tables, descriptive statistics, and several kinds of plots.
The task starts with a Dataset Information Table that lists missing values and data types, followed by statistical summaries of the numeric features Item_price, connected_handling_time, and CSAT Score. Distribution plots, count plots, bar charts, boxplots, and a correlation heatmap are then used to interpret patterns and trends visually. These visualizations help show how factors such as customer service channel, product category, handling time, and agent shift relate to customer satisfaction.
Through this process, students are expected to demonstrate that they can interpret data trends, detect anomalies, understand correlations, and extract actionable insights. By the end of this week’s assignment, the learner should be able to perform a comprehensive EDA that turns raw data into a well-structured analysis that can guide decisions on improving customer support operations.
1. Dataset Information Table
| Column Name | Non-Null Count | Missing Values | Data Type |
|---|---|---|---|
| Unique id | 85907 | 0 | object |
| channel_name | 85907 | 0 | object |
| category | 85907 | 0 | object |
| Sub-category | 85907 | 0 | object |
| Customer Remarks | 28742 | 57165 | object |
| Order_id | 67675 | 18232 | object |
| order_date_time | 17214 | 68693 | object |
| Issue_reported at | 85907 | 0 | object |
| issue_responded | 85907 | 0 | object |
| Survey_response_Date | 85907 | 0 | object |
| Customer_City | 17079 | 68828 | object |
| Product_category | 17196 | 68711 | object |
| Item_price | 17206 | 68701 | float64 |
| connected_handling_time | 242 | 85665 | float64 |
| Agent_name | 85907 | 0 | object |
| Supervisor | 85907 | 0 | object |
| Manager | 85907 | 0 | object |
| Tenure Bucket | 85907 | 0 | object |
| Agent Shift | 85907 | 0 | object |
| CSAT Score | 85907 | 0 | int64 |
Purpose:
To get a high-level view of the dataset: how many rows are complete, what type of data is present, and which columns contain missing values.
Explanation of Columns:
- Column Name → Names of all features in your dataset.
- Non-Null Count → Number of rows where data is available.
- Missing Values → Number of rows where data is absent.
- Data Type → Type of data: int64 (integer), float64 (decimal), object (text).
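A table with these four columns can be assembled in pandas. The sketch below is a minimal illustration on a made-up four-row DataFrame (the values and the channel labels are hypothetical stand-ins, not rows from the dataset); the report does not state how its own table was produced.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the support dataset (values are made up).
df = pd.DataFrame({
    "channel_name": ["phone", "email", "chat", "phone"],
    "Customer Remarks": ["Good", None, None, "Slow reply"],
    "Item_price": [979.0, np.nan, np.nan, 1299.0],
    "CSAT Score": [5, 4, 1, 5],
})

# One row per column of df: how many values are present, how many are
# missing, and which dtype pandas assigned.
info_table = pd.DataFrame({
    "Non-Null Count": df.notna().sum(),
    "Missing Values": df.isna().sum(),
    "Data Type": df.dtypes.astype(str),
})
info_table.index.name = "Column Name"

print(info_table)
```

For every column, Non-Null Count plus Missing Values equals the number of rows, which is a quick sanity check on the table.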
Insights:
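Six columns are mostly empty: connected_handling_time (85,665 of 85,907 rows missing, about 99.7%), Customer_City, Product_category, Item_price, and order_date_time (each about 80% missing), and Customer Remarks (57,165 missing, about 67%). Order_id is missing in 18,232 rows (about 21%).
Missing Customer Remarks means many customers didn’t leave comments.
Missing Order_id could indicate data collection errors or anonymization.
Among the numeric columns, only CSAT Score is complete. Item_price has 17,206 valid values and connected_handling_time only 242, so any analysis of these two columns describes a subset of tickets rather than the whole dataset.
Categorical columns such as channel_name, category, Sub-category, Agent Shift, and Tenure Bucket are complete, which makes them useful for group-wise analysis.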
Why it matters:
Missing values change the result of calculations such as the mean and median and reduce the number of points available for visualization. Knowing where they occur helps decide whether to drop, impute, or ignore them in each analysis.
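To make the effect on calculations concrete, the following sketch compares a few ways of handling missing values on a small made-up price series. The numbers are hypothetical; which strategy suits the real dataset is a judgment the report leaves open.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with two missing entries.
prices = pd.Series([100.0, 200.0, np.nan, np.nan, 300.0], name="Item_price")

missing_share = prices.isna().mean()        # fraction of rows that are missing

# Ignore: pandas skips NaN by default, so this is the mean of the 3 known values.
mean_ignoring_nan = prices.mean()

# Drop: removing the rows first gives the same mean.
mean_after_dropna = prices.dropna().mean()

# Impute with the median of the known values.
median_imputed = prices.fillna(prices.median())

# Impute with 0: treats "unknown" as "free" and drags the mean down.
mean_zero_filled = prices.fillna(0).mean()

print(missing_share, mean_ignoring_nan, mean_zero_filled)
```

Filling with zero is shown only as a cautionary case: it changes the mean from 200 to 120 even though no new information was added.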
2. Numeric Feature Statistics Table
| Statistic | Item_price | connected_handling_time | CSAT Score |
|---|---|---|---|
| Count (Non-Null) | 17206 | 242 | 85907 |
| Missing | 68701 | 85665 | 0 |
| Mean | 5660.77 | 462.40 | 4.24 |
| Median | 979.0 | 427.0 | 5.0 |
| Mode | 999.0 | 282.0, 299.0, 301.0, 418.0 | 5 |
| Std Dev | 12825.73 | 246.30 | 1.38 |
| Min | 0.0 | 0.0 | 1.0 |
| Max | 164999.0 | 1986.0 | 5.0 |
Purpose:
To summarize key statistics of numeric features (Item_price, connected_handling_time, CSAT Score).
Explanation of Metrics:
- Count (Non-Null): Number of valid entries.
- Missing: Count of missing values.
- Mean: Average value.
- Median: Middle value (less sensitive to outliers than mean).
- Mode: Most frequent value (helps identify repeated patterns).
- Std Dev: Measures spread or variability of data.
- Min / Max: Smallest and largest values; a maximum far from the mean and median hints at outliers.
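The metrics listed above map directly onto pandas Series methods. The sketch below computes them for a small made-up DataFrame; the helper name `summarize` and the sample values are hypothetical. Note that `Series.std()` returns the sample standard deviation (ddof=1) by default, and `Series.mode()` can return several values, as in the connected_handling_time row of the table.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; not rows from the real dataset.
df = pd.DataFrame({
    "Item_price": [499.0, 999.0, 999.0, np.nan, 15999.0],
    "CSAT Score": [5, 5, 1, 4, 5],
})

def summarize(series: pd.Series) -> pd.Series:
    """Return the summary statistics used in the table for one numeric column."""
    modes = series.mode().tolist()          # may contain more than one value
    return pd.Series({
        "Count (Non-Null)": series.count(),
        "Missing": series.isna().sum(),
        "Mean": series.mean(),              # NaN values are skipped
        "Median": series.median(),
        "Mode": ", ".join(str(m) for m in modes),
        "Std Dev": series.std(),            # sample standard deviation (ddof=1)
        "Min": series.min(),
        "Max": series.max(),
    })

# One column of statistics per numeric feature.
stats_table = pd.DataFrame({col: summarize(df[col]) for col in df.columns})

print(stats_table)
```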
Insights for Each Column:
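Item_price: The mean (5660.77) is far above the median (979.0), which indicates a right-skewed distribution: most items are inexpensive and a few expensive ones pull the mean up. The maximum (164999.0) is much higher than the mean, so outliers exist.
connected_handling_time: The mean (462.40) is somewhat above the median (427.0), so a few long cases pull the average up (the maximum is 1986.0). The standard deviation (246.30) shows how much handling times vary from ticket to ticket. Only 242 rows have a value, so these figures describe a small subset of tickets.
CSAT Score: Ranges from 1 to 5. A mean of 4.24 with a median and mode of 5 indicates overall high customer satisfaction. The standard deviation (1.38) reflects the variation in customers’ feedback.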
Why it matters:
Helps identify central tendency, spread, and anomalies in numeric features. Important for later visualizations (histograms, boxplots, correlation).
3. Distribution Plots (Histograms + KDE)
Insights:
Item_price: Peak at low prices → most products are inexpensive. Long right tail → a few expensive products exist (right-skewed, with outliers). Customer satisfaction may vary slightly with price, but most interactions concern low-value products.
Insights:
connected_handling_time: Peak at short times → most requests are resolved quickly. Long tail → rare cases take much longer and may lower CSAT. Process improvements could reduce these extreme cases.
Insights:
CSAT Score: Peak at 4–5 → the majority of customers are satisfied. Smaller peaks at 1–2 → dissatisfied customers. Service quality is generally good, but the low-CSAT cases need investigation.
Purpose:
To visualize the frequency of numeric values and the shape of their distribution. A KDE (kernel density estimate) overlays a smooth estimate of the probability density on the histogram.
Why it matters:
Helps identify data distribution, skewness, and outliers visually. Guides decisions for scaling, transformations, or outlier treatment.
Key Takeaway: Histograms + KDE show data concentration, spread, skewness, and potential outliers.
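The numbers behind a histogram, and the skewness it reveals, can also be checked without drawing anything. The sketch below uses a made-up price series with one very expensive item; the values are hypothetical, and the log transform is shown only as one common option for right-skewed data, not as the treatment used in this assignment. For the plot itself, seaborn's `histplot(..., kde=True)` is one way to draw a histogram with a KDE overlay.

```python
import numpy as np
import pandas as pd

# Hypothetical prices: mostly cheap items plus one expensive outlier.
prices = pd.Series(
    [199, 299, 399, 499, 599, 799, 999, 999, 1499, 45000], dtype="float64"
)

# Bin counts are what the histogram bars would show.
counts, bin_edges = np.histogram(prices, bins=5)

# Positive skewness means a long right tail; the mean then exceeds the median.
skewness = prices.skew()

# log1p compresses large values and reduces (but here does not remove) the skew.
log_skewness = np.log1p(prices).skew()

print(counts, round(skewness, 2), round(log_skewness, 2))
```

Nine of the ten values fall in the first bin and the outlier sits alone in the last one, which is the "peak at low prices, long right tail" pattern described above.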
4. Count Plots for Categorical Features
Interpretation:
channel_name: Shows which communication channel (email, chat, phone) is used most. High-count channels → heavier workload → monitor performance. Low-count channels → fewer interactions → may need fewer resources.
Interpretation:
Category: Reveals top complaint/product categories. Categories with high counts → frequently reported issues → prioritize improvements.
Interpretation:
Sub-category (Top 10 only): Only top 10 plotted → readable chart. Some sub-categories dominate → indicates specific problem areas within a category.
Interpretation:
Agent Shift: Visualizes distribution of tickets per shift. Unequal distribution → may explain differences in CSAT by shift.
Interpretation:
Tenure Bucket: Shows experience levels of agents. Majority in mid-range → may influence handling efficiency and satisfaction.
Purpose:
To see frequency of each category.
Why it matters:
Identifies hotspots, workload distribution, and potential improvement areas.
Explanation by Column:
- channel_name: Which communication channel (email, chat, phone) is most used. Insights: focus training/resources on channels with high volume.
- category: Most reported product/service categories. Helps identify problem-prone product lines.
- Sub-category (Top 10 only): Only top 10 plotted → prevents x-axis overlap. Shows detailed product/service issues clearly.
- Agent Shift: Shows distribution of workload across shifts. Could relate to CSAT variation by shift.
- Tenure Bucket: Shows experience distribution among agents. Helps correlate experience with customer satisfaction.
Key Takeaway: Count plots identify volume hotspots, workload, and potential problem areas.
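The bar heights in a count plot are category frequencies, which pandas exposes through `value_counts()`. The sketch below uses made-up sub-category labels (hypothetical, not taken from the dataset) and keeps the top 2 instead of the top 10 so the result is easy to read. Seaborn's `countplot` is one way to draw the chart from the same column.

```python
import pandas as pd

# Hypothetical sub-category labels.
sub_categories = pd.Series(
    ["Return request", "Delayed", "Return request",
     "Refund", "Delayed", "Return request"],
    name="Sub-category",
)

# Frequencies, sorted from most to least common.
counts = sub_categories.value_counts()

# Keep only the most frequent categories (the assignment uses the top 10).
top_n = counts.head(2)

# Share of all tickets per category, useful for comparing workload.
share = sub_categories.value_counts(normalize=True)

print(top_n)
```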
5. Bar Plots (Average CSAT Score by Category)
Insights:
Channels with low average CSAT → need attention. High average CSAT → best practices can be replicated.
Interpretation:
Example of how to read the chart: if Phone and Chat both average 4.8, both channels are performing very well; an Email average of 4.2 would be moderate by comparison.
Insights:
Top 10 categories only → readable chart. Low CSAT categories → identify product/service issues.
Interpretation:
Helps prioritize improvement areas based on average CSAT.
Purpose:
Shows average satisfaction (CSAT) per category/channel.
Why it matters:
Supports data-driven decision-making for improving service quality.
Key Takeaway: Bar plots highlight which channels/categories impact customer satisfaction and prioritize improvements.
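The values plotted in such a bar chart are group means. The sketch below computes the average CSAT per channel on made-up data (the scores are hypothetical), and also reports the ticket count per group, since an average based on very few tickets is less reliable. Seaborn's `barplot` is one way to draw the result.

```python
import pandas as pd

# Hypothetical tickets: two per channel.
df = pd.DataFrame({
    "channel_name": ["phone", "phone", "chat", "chat", "email", "email"],
    "CSAT Score": [5, 4, 5, 5, 3, 4],
})

# Average CSAT and number of tickets per channel, lowest average first
# so the channels needing attention appear at the top.
summary = (
    df.groupby("channel_name")["CSAT Score"]
      .agg(avg_csat="mean", tickets="count")
      .sort_values("avg_csat")
)

print(summary)
```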
6. Boxplots (Numeric Features vs CSAT Score)
Insights:
Median item price fairly consistent across CSAT scores → price does not strongly affect satisfaction. Outliers may indicate very high-value purchases with low satisfaction.
Insights:
Median handling time higher for lower CSAT scores → longer handling → unhappy customers. Outliers at high handling times → significant dissatisfaction risk.
Purpose:
Visualizes the spread of numeric features across CSAT ratings.
Why it matters:
Helps identify patterns, outliers, and relationships visually. Guides business decisions (e.g., reduce handling time for high-value items).
Boxplot components:
- Box: Middle 50% of data (IQR)
- Line inside box: Median
- Whiskers: Range excluding outliers
- Dots outside whiskers: Outliers
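The components listed above can be computed directly. The sketch below assumes the common 1.5 × IQR rule for the whiskers, which is the default in matplotlib and seaborn boxplots; the report does not state which rule its plots used, and the handling times are made up.

```python
import pandas as pd

# Hypothetical handling times with one extreme case.
handling_time = pd.Series(
    [120, 180, 200, 240, 260, 300, 320, 360, 1900], dtype="float64"
)

q1 = handling_time.quantile(0.25)   # lower edge of the box
median = handling_time.median()     # line inside the box
q3 = handling_time.quantile(0.75)   # upper edge of the box
iqr = q3 - q1                       # height of the box (middle 50%)

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (handling_time < lower_fence) | (handling_time > upper_fence)
outliers = handling_time[is_outlier]

# Whiskers extend to the most extreme points that are not outliers.
whisker_low = handling_time[~is_outlier].min()
whisker_high = handling_time[~is_outlier].max()

print(q1, median, q3, outliers.tolist())
```

To compare groups, as in the plots of this section, the same calculation would be repeated for the rows at each CSAT score.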
Key Takeaway: Boxplots reveal relationships, spread, medians, and outliers. Helps identify process improvements.
7. Correlation Heatmap
Insights:
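connected_handling_time vs CSAT Score: Negative correlation → longer handling times tend to go with lower satisfaction. Correlation alone does not show that one causes the other, and only 242 rows have a handling time.
Item_price vs CSAT Score: Weak correlation → price is not strongly linked to satisfaction.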
Purpose:
Shows linear relationships between numeric features.
Why it matters:
Helps identify features that influence satisfaction. Useful for predictive modeling (e.g., machine learning).
Correlation range:
- +1 → perfect positive linear correlation
- -1 → perfect negative linear correlation
- 0 → no linear correlation
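The matrix behind a correlation heatmap can be computed with `DataFrame.corr()`. The sketch below assumes Pearson correlation (the pandas default) on made-up values chosen so that handling time falls as CSAT rises. `corr()` excludes missing values pairwise, which matters for sparse columns such as connected_handling_time. Seaborn's `heatmap` is one way to draw the resulting matrix.

```python
import pandas as pd

# Hypothetical values; not rows from the real dataset.
df = pd.DataFrame({
    "connected_handling_time": [100, 200, 300, 400, 500],
    "Item_price": [500, 100, 400, 200, 300],
    "CSAT Score": [5, 5, 4, 2, 1],
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr(method="pearson")

time_vs_csat = corr.loc["connected_handling_time", "CSAT Score"]
price_vs_csat = corr.loc["Item_price", "CSAT Score"]

print(corr.round(2))
```

In this toy data the handling-time correlation is strongly negative and the price correlation is weak, mirroring the pattern described in the insights above.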
Key Takeaway: Correlation heatmap helps identify numeric features influencing satisfaction, guiding business decisions and ML models.