Adult Income Dataset - EDA Report with Interactive Tutorials

📚 Analysis Methodology

This report analyzes tabular data (structured CSV) with a focus on data quality, feature distributions, and relationships between features. Unlike text analysis, it examines:

Analysis Type	Purpose	Key Metrics
Missing Values	Data quality assessment	Missing %, Patterns, Imputation strategy
Numerical Features	Distribution & outliers	Mean, Median, Std, IQR, Outliers
Categorical Features	Category frequencies	Unique values, Mode, Cardinality
Correlation	Feature relationships	Pearson r, Multicollinearity
Target Analysis	Class balance & relationships	Distribution, Target vs features

💡 Key Principle: Understand data quality and distributions BEFORE applying machine learning models.

📊 Dataset Overview

Metric	Value
Training Samples	32,561
Test Samples	16,281
Total Features	14
Numerical Features	6
Categorical Features	8

Target Variable: income

Class	Count (training samples)	Share
<=50K	24,720	75.9%
>50K	7,841	24.1%

📋 Feature Descriptions

Feature Name	Type	Description
age	Numerical	Age of the individual in years
workclass	Categorical	Type of employer (Private, Self-emp, Government, etc.)
fnlwgt	Numerical	Final weight - number of people the census believes the entry represents
education	Categorical	Highest level of education achieved (Bachelors, HS-grad, Masters, etc.)
education_num	Numerical	Education level represented as a number (higher = more education)
marital_status	Categorical	Marital status (Married, Divorced, Never-married, etc.)
occupation	Categorical	Type of occupation (Prof-specialty, Craft-repair, Exec-managerial, etc.)
relationship	Categorical	Relationship status (Husband, Wife, Own-child, Unmarried, etc.)
race	Categorical	Race (White, Black, Asian-Pac-Islander, Amer-Indian-Eskimo, Other)
sex	Categorical	Biological sex (Male, Female)
capital_gain	Numerical	Capital gains income in dollars
capital_loss	Numerical	Capital loss in dollars
hours_per_week	Numerical	Number of hours worked per week
native_country	Categorical	Country of origin (United-States, Mexico, Philippines, etc.)
income	Target	Income level (≤50K or >50K per year)

🔍 Missing Values Analysis

Metric	Value
Total Missing Cells	4,262
Columns with Missing Values	3

Feature	Missing Count	Missing %	Severity
occupation	1,843	5.66%	⚠️ Medium
workclass	1,836	5.64%	⚠️ Medium
native_country	583	1.79%	✓ Low
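The per-column counts and percentages above could be reproduced along the following lines. This is a minimal sketch, assuming pandas and assuming that missing entries appear as "?" in the raw CSV (a common convention for this dataset, but not stated in this report). The toy frame stands in for the real data, and `missing_summary` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing count and percentage, worst column first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have gaps.
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)

# Toy stand-in for the real data; "?" marks a missing entry (assumption).
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", "Private", "Private"],
    "occupation": ["Adm-clerical", "?", "?", "Handlers-cleaners"],
}).replace("?", np.nan)

print(missing_summary(toy))
```

Converting the placeholder to NaN first matters: otherwise "?" is counted as an ordinary category and the missing count stays at zero.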

📈 Numerical Features Distribution

Feature Statistics

Feature	Mean	Median	Std
Age	38.58	37.00	13.64
Fnlwgt	189778.37	178356.00	105549.98
Education Num	10.08	10.00	2.57
Capital Gain	1077.65	0.00	7385.29
Capital Loss	87.30	0.00	402.96
Hours Per Week	40.44	40.00	12.35
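Statistics of this kind can be computed in one pass with pandas. The sketch below uses an illustrative toy frame rather than the real data, and it assumes the sample standard deviation (pandas' default, ddof=1); the report does not say which variant it used. The extra `mean_minus_median` column is an addition for illustration: a mean far above the median, as with Capital Gain (1077.65 vs 0.00), points to a right-skewed feature.

```python
import pandas as pd

# Toy stand-in for two numerical columns (illustrative values only).
toy = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "capital_gain": [0, 0, 0, 4000],
})

# One row per feature with the three statistics reported above.
stats = toy.agg(["mean", "median", "std"]).T  # std is the sample std (ddof=1)

# A large positive gap between mean and median hints at right skew.
stats["mean_minus_median"] = stats["mean"] - stats["median"]
print(stats)
```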

📊 Categorical Features Distribution

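The metrics named in the methodology table for categorical features (unique values, mode, cardinality) can be gathered as sketched below. This is an illustration on a toy frame, not the report's own code; `categorical_summary` is a hypothetical helper, and treating every non-numeric column as categorical is an assumption.

```python
import pandas as pd

def categorical_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Cardinality, most frequent value, and its share for each non-numeric column."""
    cats = df.select_dtypes(exclude="number")
    rows = {}
    for col in cats.columns:
        freq = cats[col].value_counts()  # sorted by frequency, NaN excluded
        rows[col] = {
            "unique": int(freq.size),
            "mode": freq.index[0],
            "mode_pct": round(float(freq.iloc[0] / freq.sum() * 100), 2),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Toy stand-in for the real data.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["Private", "Private", "State-gov", "Private"],
    "sex": ["Male", "Female", "Male", "Male"],
})

summary = categorical_summary(toy)
print(summary)
```

A high `mode_pct` flags a column dominated by one category, which is worth knowing before one-hot encoding it.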

🎯 Target Variable Distribution

Class	Count	Percentage
<=50K	24,720	75.92%
>50K	7,841	24.08%
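As a quick check of the class balance, the percentages and the majority-to-minority ratio follow directly from the two counts in the table. The sketch below starts from those counts; on the raw data the same Series would come from something like `df["income"].value_counts()`, where `df` is a hypothetical name for the loaded frame.

```python
import pandas as pd

# Class counts as reported in the table above.
counts = pd.Series({"<=50K": 24720, ">50K": 7841})

# Share of each class in percent.
pct = (counts / counts.sum() * 100).round(2)

# Majority-to-minority ratio, a simple imbalance indicator.
ratio = counts.max() / counts.min()

print(pct)
print(f"imbalance ratio: {ratio:.2f}:1")
```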

🔗 Correlation Analysis

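A check for strongly correlated pairs could look like the sketch below. It assumes Pearson correlation over the numerical columns only and a threshold of |r| > 0.7, matching the cut-off quoted in the Key Insights; `high_corr_pairs` is a hypothetical helper and the toy frame is illustrative.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """Return numeric column pairs whose absolute Pearson r exceeds the threshold."""
    corr = df.select_dtypes(include="number").corr(method="pearson")
    # Keep the strict upper triangle so each pair is listed once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs.abs() > threshold]

# Toy data: "b" is an exact multiple of "a"; "c" is only weakly related to both.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
    "label": list("xyxyx"),
})

result = high_corr_pairs(toy)
print(result)
```

Pearson r only captures linear association between numerical features, so an empty result does not rule out nonlinear dependence or redundancy involving categorical columns.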

📦 Outlier Detection (IQR Method)

Feature	Outlier Count	Percentage	Severity	Valid Range (IQR)
Age	143	0.44%	✓ Low	[-2.00, 78.00]
Fnlwgt	992	3.05%	⚠️ Medium	[-61009.00, 415887.00]
Education Num	1,198	3.68%	⚠️ Medium	[4.50, 16.50]
Capital Gain	2,712	8.33%	⚠️ High	[0.00, 0.00]
Capital Loss	1,519	4.67%	⚠️ Medium	[0.00, 0.00]
Hours Per Week	9,008	27.66%	⚠️ High	[32.50, 52.50]
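The valid ranges in the table follow the usual IQR rule: values below Q1 − k·IQR or above Q3 + k·IQR are flagged. The sketch below assumes the conventional multiplier k = 1.5 and pandas' default linear-interpolation quantiles; the report does not state either, and `iqr_bounds` and `iqr_outliers` are hypothetical helper names. The second toy series shows why a feature that is zero for most rows, such as Capital Gain, ends up with a range of [0.00, 0.00]: Q1 and Q3 are both zero, so every non-zero value is flagged.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Values that fall outside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return values[(values < low) | (values > high)]

# A spread-out feature with one extreme value.
spread = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(iqr_bounds(spread), iqr_outliers(spread).tolist())

# A mostly-zero feature: the IQR collapses to 0 and the fences to [0, 0].
mostly_zero = pd.Series([0] * 9 + [5000])
print(iqr_bounds(mostly_zero), iqr_outliers(mostly_zero).tolist())
```

For such zero-inflated or tightly clustered features, a high outlier percentage says more about the shape of the distribution than about data errors.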

🎯 Target vs Categorical Features

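One common way to relate a categorical feature to the target is a row-normalized cross-tabulation, which gives the share of each income class within every category. The sketch below is an illustration on toy rows, not the report's own computation; the column names follow the feature table above.

```python
import pandas as pd

# Toy stand-in for the real data.
toy = pd.DataFrame({
    "education": ["Bachelors", "Bachelors", "HS-grad", "HS-grad", "HS-grad", "Masters"],
    "income": [">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

# normalize="index" makes each row sum to 1: class shares within a category.
rates = pd.crosstab(toy["education"], toy["income"], normalize="index")

# Sort categories by their share of the >50K class.
print(rates.sort_values(">50K", ascending=False))
```

Because the classes are imbalanced, comparing these within-category shares against the overall >50K share (24.08%) is more informative than comparing raw counts.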

📝 Sample Data Rows

📊 Samples for: <=50K

age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country	income
39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K

📊 Samples for: >50K

age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country	income
52	Self-emp-not-inc	209642	HS-grad	9	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	45	United-States	>50K
31	Private	45781	Masters	14	Never-married	Prof-specialty	Not-in-family	White	Female	14084	50	United-States	>50K
42	Private	159449	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	5178	40	United-States	>50K

💡 Key Insights

Dataset Characteristics

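Dataset contains 48,842 samples (32,561 training + 16,281 test) with 14 features; the counts and percentages in this report are based on the 32,561 training samples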
⚠️ Imbalanced dataset: <=50K (75.9%) vs >50K (24.1%) - ratio 3.2:1
⚠️ Found missing values in 3 columns
• occupation: 1,843 missing (5.66%)
• workclass: 1,836 missing (5.64%)
• native_country: 583 missing (1.79%)
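💡 Suggestion: All three columns are categorical, so mode (not median) imputation applies. Only native_country is below the 5% threshold; occupation and workclass slightly exceed it and may be better served by a separate "Unknown" category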
⚠️ Detected 15,572 outliers across 6 features
• hours_per_week: 9,008 outliers (27.66%)
• capital_gain: 2,712 outliers (8.33%)
• capital_loss: 1,519 outliers (4.67%)
💡 Suggestion: Investigate features with >5% outliers; they may need capping or transformation. For capital_gain and capital_loss the IQR range is [0.00, 0.00], so every non-zero value is flagged
No high correlations found (no pair with |r| > 0.7), so the numerical features show little linear redundancy
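The imputation and outlier suggestions above can be sketched as follows. This is one possible reading under stated assumptions: mode imputation for a categorical column, percentile capping at the 1st and 99th percentiles, and a log1p transform as the "transformation". The percentile limits and the choice of log1p are illustrative picks, not recommendations from the report, and `impute_mode` and `cap_to_percentiles` are hypothetical helpers.

```python
import numpy as np
import pandas as pd

def impute_mode(series: pd.Series) -> pd.Series:
    """Fill missing entries with the most frequent value."""
    return series.fillna(series.mode().iloc[0])

def cap_to_percentiles(series: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Clip values to the given lower and upper percentiles."""
    return series.clip(lower=series.quantile(lower), upper=series.quantile(upper))

# Toy stand-ins for native_country, hours_per_week and capital_gain.
country = pd.Series(["United-States", None, "Mexico", "United-States"])
hours = pd.Series([40, 40, 45, 99, 1, 40])
gains = pd.Series([0, 0, 0, 14084])

filled = impute_mode(country)
capped = cap_to_percentiles(hours)
logged = np.log1p(gains)  # log(1 + x) keeps zeros at zero and compresses the tail

print(filled.tolist())
print(capped.tolist())
print(logged.round(2).tolist())
```

To avoid leakage, the mode and the percentile limits would normally be computed on the training samples only and then applied unchanged to the test samples.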
