💰 Adult Income Dataset

Exploratory Data Analysis Report with Interactive Tutorials

📚 Analysis Methodology

This report analyzes tabular data in structured CSV format, focusing on data quality, feature distributions, and relationships between features. Unlike a text analysis, it examines:

Analysis Type | Purpose | Key Metrics
Missing Values | Data quality assessment | Missing %, Patterns, Imputation strategy
Numerical Features | Distribution & outliers | Mean, Median, Std, IQR, Outliers
Categorical Features | Category frequencies | Unique values, Mode, Cardinality
Correlation | Feature relationships | Pearson r, Multicollinearity
Target Analysis | Class balance & relationships | Distribution, Target vs features
💡 Key Principle: Understand data quality and distributions BEFORE applying machine learning models.

📊 Dataset Overview

Training Samples: 32,561
Test Samples: 16,281
Total Features: 14
Numerical Features: 6
Categorical Features: 8

Target Variable: income

<=50K: 24,720 (75.9%)
>50K: 7,841 (24.1%)

📋 Feature Descriptions

Feature Name | Type | Description
age | Numerical | Age of the individual in years
workclass | Categorical | Type of employer (Private, Self-emp, Government, etc.)
fnlwgt | Numerical | Final weight: number of people the census believes the entry represents
education | Categorical | Highest level of education achieved (Bachelors, HS-grad, Masters, etc.)
education_num | Numerical | Education level represented as a number (higher = more education)
marital_status | Categorical | Marital status (Married, Divorced, Never-married, etc.)
occupation | Categorical | Type of occupation (Prof-specialty, Craft-repair, Exec-managerial, etc.)
relationship | Categorical | Relationship status (Husband, Wife, Own-child, Unmarried, etc.)
race | Categorical | Race (White, Black, Asian-Pac-Islander, Amer-Indian-Eskimo, Other)
sex | Categorical | Biological sex (Male, Female)
capital_gain | Numerical | Capital gains income in dollars
capital_loss | Numerical | Capital loss in dollars
hours_per_week | Numerical | Number of hours worked per week
native_country | Categorical | Country of origin (United-States, Mexico, Philippines, etc.)
income | Target | Income level (<=50K or >50K per year)

🔍 Missing Values Analysis

Total Missing Cells: 4,262
Columns with Missing: 3

Feature | Missing Count | Missing % | Severity
occupation | 1,843 | 5.66% | ⚠️ Medium
workclass | 1,836 | 5.64% | ⚠️ Medium
native_country | 583 | 1.79% | ✓ Low
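
The per-column counts above can be reproduced with a short pandas sketch. It assumes the data is already loaded into a DataFrame and that missing entries appear either as NaN or as the placeholder string "?" (the convention in the raw UCI Adult files; adjust it if your copy differs). The function name missing_summary and the small demo frame are illustrative and not taken from the report's own code.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame, placeholder: str = "?") -> pd.DataFrame:
    """Count missing cells per column, treating `placeholder` as missing."""
    # A cell is missing if it is NaN/None or equals the placeholder string.
    is_missing = df.isna() | df.isin([placeholder])
    counts = is_missing.sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing cell, largest first.
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)


# Tiny illustrative frame (not the real dataset).
demo = pd.DataFrame({
    "age": [39, 50, 38, 52],
    "workclass": ["State-gov", "?", "Private", "?"],
    "occupation": ["?", "?", None, "Exec-managerial"],
})
summary = missing_summary(demo)
print(summary)
```

If the file is read with leading spaces intact (for example " ?"), stripping whitespace at load time is needed before the placeholder will match.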

📈 Numerical Features Distribution

Feature Statistics

Age
Mean: 38.58
Median: 37.00
Std: 13.64
Fnlwgt
Mean: 189778.37
Median: 178356.00
Std: 105549.98
Education Num
Mean: 10.08
Median: 10.00
Std: 2.57
Capital Gain
Mean: 1077.65
Median: 0.00
Std: 7385.29
Capital Loss
Mean: 87.30
Median: 0.00
Std: 402.96
Hours Per Week
Mean: 40.44
Median: 40.00
Std: 12.35
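
A minimal sketch of how statistics like these could be computed with pandas follows. The report does not say whether its standard deviation is the sample or population version; pandas defaults to the sample version (ddof=1), which is an assumption here. The helper name numeric_summary and the demo values (taken from the sample rows in this report) are illustrative only.

```python
import pandas as pd


def numeric_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Mean, median, and sample std for every numerical column."""
    numeric = df.select_dtypes(include="number")  # skips categorical columns
    # agg() returns one row per statistic; transpose to one row per feature.
    return numeric.agg(["mean", "median", "std"]).T.round(2)


# Four rows borrowed from the sample-data section, for illustration only.
demo = pd.DataFrame({
    "age": [39, 50, 38, 52],
    "hours_per_week": [40, 13, 40, 45],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private", "Self-emp-not-inc"],
})
stats = numeric_summary(demo)
print(stats)
```

A large gap between mean and median, as with capital_gain (mean 1077.65, median 0.00), is a quick signal of a heavily skewed column.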

📊 Categorical Features Distribution

🎯 Target Variable Distribution

Class | Count | Percentage
<=50K | 24,720 | 75.92%
>50K | 7,841 | 24.08%
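
The percentages and the imbalance ratio follow directly from the class counts. The sketch below rebuilds a label column from the two counts reported above and recomputes them; class_balance is a hypothetical helper name, and in practice the input would be the income column of the loaded training data.

```python
import pandas as pd


def class_balance(target: pd.Series) -> pd.DataFrame:
    """Count and percentage of each class in a label column."""
    counts = target.value_counts()  # sorted by count, largest first
    pct = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({"count": counts, "pct": pct})


# Rebuild the label column from the counts reported in the table above.
labels = pd.Series(["<=50K"] * 24720 + [">50K"] * 7841, name="income")
balance = class_balance(labels)
imbalance_ratio = balance["count"].max() / balance["count"].min()

print(balance)
print(f"majority:minority ratio = {imbalance_ratio:.2f}:1")
```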

🔗 Correlation Analysis

📦 Outlier Detection (IQR Method)

Feature | Outlier Count | Percentage | Severity | Valid Range (IQR)
Age | 143 | 0.44% | ✓ Low | [-2.00, 78.00]
Fnlwgt | 992 | 3.05% | ⚠️ Medium | [-61009.00, 415887.00]
Education Num | 1,198 | 3.68% | ⚠️ Medium | [4.50, 16.50]
Capital Gain | 2,712 | 8.33% | ⚠️ High | [0.00, 0.00]
Capital Loss | 1,519 | 4.67% | ⚠️ Medium | [0.00, 0.00]
Hours Per Week | 9,008 | 27.66% | ⚠️ High | [32.50, 52.50]
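
The valid ranges above are what the IQR rule produces: values outside [Q1 - k*IQR, Q3 + k*IQR] are flagged. The sketch below assumes the conventional multiplier k = 1.5, which the report does not state explicitly; under that assumption the reported Age range [-2.00, 78.00] corresponds to quartiles of 28 and 48. The function names are illustrative.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Lower and upper fences of the IQR rule (k = 1.5 is an assumption)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Values that fall strictly outside the IQR fences."""
    lower, upper = iqr_bounds(series, k)
    return series[(series < lower) | (series > upper)]


# A column with quartiles 28 and 48 reproduces the reported Age fences.
ages = pd.Series([28, 28, 48, 48])
age_bounds = iqr_bounds(ages)

# Zero-inflated column, similar in shape to capital_gain: Q1 = Q3 = 0,
# so the fences collapse to [0, 0] and every nonzero value is flagged.
gains = pd.Series([0] * 9 + [2174])
gain_bounds = iqr_bounds(gains)
gain_outliers = iqr_outliers(gains)

print(age_bounds, gain_bounds, len(gain_outliers))
```

This is why Capital Gain and Capital Loss show a valid range of [0.00, 0.00]: when at least 75% of values are zero, the rule treats every nonzero entry as an outlier.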

🎯 Target vs Categorical Features

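
One common way to relate a categorical feature to the target is a row-normalized cross-tabulation, which gives the share of each income class within each category. The sketch below is an illustration of that idea, not the report's own method; it uses only the six sample rows shown in this report, so the resulting rates are not representative of the full dataset.

```python
import pandas as pd

# Education and income values from the six sample rows in this report.
demo = pd.DataFrame({
    "education": ["Bachelors", "Bachelors", "HS-grad",
                  "HS-grad", "Masters", "Bachelors"],
    "income": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"],
})

# normalize="index" makes each row sum to 1, i.e. class shares per category.
rates = pd.crosstab(demo["education"], demo["income"], normalize="index").round(3)
print(rates)
```

The same call can be repeated for any of the eight categorical features by swapping the first argument.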

📝 Sample Data Rows

📊 Samples for: <=50K

age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country income
39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K

📊 Samples for: >50K

age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country income
52 Self-emp-not-inc 209642 HS-grad 9 Married-civ-spouse Exec-managerial Husband White Male 0 0 45 United-States >50K
31 Private 45781 Masters 14 Never-married Prof-specialty Not-in-family White Female 14084 0 50 United-States >50K
42 Private 159449 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 5178 0 40 United-States >50K

💡 Key Insights

Dataset Characteristics

  • Dataset contains 48,842 samples in total (32,561 training + 16,281 test) with 14 features
  • ⚠️ Imbalanced training set: <=50K (75.9%) vs >50K (24.1%), a ratio of about 3.2:1
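  • ⚠️ Found missing values in 3 columns
    ◦ occupation: 1,843 missing (5.66%)
    ◦ workclass: 1,836 missing (5.64%)
    ◦ native_country: 583 missing (1.79%)
  • 💡 Suggestion: Use mode/median imputation for columns with <5% missing. All three affected columns are categorical, so mode imputation is the applicable choice. Only native_country is below the 5% threshold; occupation and workclass slightly exceed it and may need a different strategy.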
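  • ⚠️ Detected 15,572 outliers across 6 features (IQR method)
    ◦ hours_per_week: 9,008 outliers (27.66%)
    ◦ capital_gain: 2,712 outliers (8.33%)
    ◦ capital_loss: 1,519 outliers (4.67%)
  • 💡 Suggestion: Investigate features with >5% outliers, which may need capping or transformation. For capital_gain and capital_loss the valid range is [0.00, 0.00] because at least 75% of values are zero, so every nonzero value is flagged; these counts likely reflect zero-inflation rather than data errors.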
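  • No high correlations found (|r| > 0.7) among the numerical features, so there is no sign of strong linear relationships or multicollinearity. Pearson r measures only linear association, so this does not rule out other kinds of dependence.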
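
The correlation finding above can be checked with a small pandas sketch that lists every pair of numerical columns whose absolute Pearson r exceeds a threshold. The 0.7 threshold comes from the report; the function name high_correlations and the toy columns x, y, z are hypothetical and used only to show the mechanics.

```python
import pandas as pd


def high_correlations(df: pd.DataFrame, threshold: float = 0.7) -> list[tuple]:
    """Pairs of numerical columns with |Pearson r| above `threshold`."""
    corr = df.select_dtypes(include="number").corr(method="pearson")
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs


# Toy data: y is an exact multiple of x, z is only weakly related to x.
demo = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 1, 4, 2, 3],
})
pairs = high_correlations(demo)
print(pairs)
```

An empty list from this function on the six numerical features would match the report's conclusion.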