📰 BBC News Dataset

Exploratory Data Analysis Report with Interactive Tutorials

📚 Analysis Methodology

This report uses two text processing approaches, chosen according to the analysis goal:

Analysis types, the text version each one uses, and the reason:

  • Basic Statistics (word/character counts): 🔴 Raw Data. Measures actual document size.
  • Stop Words Analysis: 🔴 Raw Data. Analyzes stopword frequency.
  • Word Frequency (top overall words): 🟢 Stopwords Removed. Finds meaningful keywords.
  • Category Keywords: 🟢 Stopwords Removed. Surfaces category-specific terms.
  • Vocabulary Richness: 🟢 Stopwords Removed. Measures content vocabulary diversity.
  • TF-IDF Terms: 🟢 Stopwords Removed. Identifies the most distinctive terms.
  • N-grams (Bigrams): 🟢 Stopwords Removed. Captures meaningful phrase patterns.
  • Distributions (word/char histograms): 🔴 Raw Data. Shows actual text lengths.

💡 Key Principle: Use raw data for size/distribution analysis, use filtered data (stopwords removed) for content/semantic analysis.
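
The split between raw and filtered text can be sketched in a few lines of Python. This is a minimal illustration, not the report's actual pipeline: the tokenizer, the function names, and the short STOP_WORDS set (copied from the examples listed in this report, not the full list) are all assumptions.

```python
import re

# Hypothetical subset of the stop word list, taken from the examples in this report.
STOP_WORDS = {"the", "and", "for", "that", "with", "was",
              "has", "are", "been", "will", "said"}

def tokenize(text):
    """Lowercase the text and extract alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def raw_stats(text):
    """Size statistics computed on the unfiltered text."""
    return {"words": len(tokenize(text)), "chars": len(text)}

def content_tokens(text):
    """Tokens that remain after stop word removal, for content analysis."""
    return [t for t in tokenize(text) if t not in STOP_WORDS]

sample = "The UK economy was facing major risks, the group said."
print(raw_stats(sample))       # size analysis uses every word
print(content_tokens(sample))  # content analysis uses the filtered tokens
```

Keeping the two paths as separate functions makes it harder to compute a length statistic on filtered text by accident.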

📊 Dataset Overview

Note: Statistics include all words (before stop words removal)

  • Total Articles: 2225
  • Categories: 5
  • Avg Words/Article (All): 384
  • Avg Chars/Article: 2260

📈 Category Distribution

Category Count Percentage Avg Words

🚫 Stop Words Analysis

🔴 Raw Data - Analyzing stopword frequency from original text

Stop Words List

📊 Word Count Distribution

🔴 Raw Data - Distribution of actual document sizes

📏 Character Count Distribution

🔴 Raw Data - Distribution of actual character counts

📚 Vocabulary Richness by Category

🟢 Stopwords Removed - Measuring content vocabulary diversity

Category Total Words Unique Words Articles
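
The columns above (total words, unique words, articles) can be computed per category as in the sketch below. It assumes the tokens have already had stop words removed. The function name and the extra type-token ratio field are illustrative assumptions, not figures this report publishes.

```python
from collections import defaultdict

def vocabulary_richness(docs):
    """docs: iterable of (category, tokens) pairs, tokens already stop-word filtered."""
    totals = defaultdict(int)    # total token count per category
    vocab = defaultdict(set)     # distinct tokens per category
    articles = defaultdict(int)  # number of articles per category
    for category, tokens in docs:
        totals[category] += len(tokens)
        vocab[category].update(tokens)
        articles[category] += 1
    return {
        c: {
            "total_words": totals[c],
            "unique_words": len(vocab[c]),
            "articles": articles[c],
            # Assumed richness measure: unique words divided by total words.
            "type_token_ratio": len(vocab[c]) / totals[c] if totals[c] else 0.0,
        }
        for c in articles
    }

# Toy data for illustration only, not drawn from the dataset.
toy_docs = [
    ("sport", ["match", "win", "match", "coach"]),
    ("sport", ["coach", "title"]),
    ("tech", ["phone", "network", "phone"]),
]
richness = vocabulary_richness(toy_docs)
print(richness)
```

Raw unique-word counts grow with the amount of text, so categories with more articles tend to look richer. A ratio or a fixed-size sample is one way to make categories more comparable.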

🔑 Top 50 Words by Category

🟢 Stopwords Removed - Category-specific meaningful keywords

🎯 TF-IDF Top Terms by Category

🟢 Stopwords Removed - Most distinctive terms per category
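
As a rough sketch of how distinctive terms can be scored, the example below pools each category's filtered tokens into one document and ranks terms by term frequency times a smoothed inverse document frequency. The pooling step, the smoothing formula, and the function name are assumptions for illustration. The report does not state its exact TF-IDF configuration.

```python
import math
from collections import Counter

def tfidf_by_category(category_tokens, top_n=3):
    """category_tokens: dict mapping category -> list of stop-word-filtered tokens."""
    n = len(category_tokens)
    counts = {c: Counter(tokens) for c, tokens in category_tokens.items()}

    # Document frequency: in how many categories each term appears.
    df = Counter()
    for counter in counts.values():
        df.update(counter.keys())

    result = {}
    for c, counter in counts.items():
        total = sum(counter.values())
        scores = {
            # tf = relative frequency; idf = ln((1 + n) / (1 + df)) + 1 (smoothed)
            term: (freq / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, freq in counter.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        result[c] = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    return result

# Toy data for illustration only.
toy_categories = {
    "sport": ["match", "coach", "match", "game"],
    "tech": ["phone", "game", "network", "phone"],
}
top_terms = tfidf_by_category(toy_categories)
print(top_terms)
```

In the toy data, "game" appears in both categories, so it ranks below terms of equal frequency that are unique to one category.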

🔗 N-gram Analysis (Bigrams)

🟢 Stopwords Removed - Meaningful phrase patterns (TF-IDF weighted)
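
A minimal sketch of bigram extraction follows. It counts raw bigram frequencies only; the TF-IDF weighting mentioned above is left out to keep the example short, and the function names are hypothetical.

```python
from collections import Counter

def bigrams(tokens):
    """Return adjacent token pairs from a single document."""
    return list(zip(tokens, tokens[1:]))

def top_bigrams(docs, top_n=5):
    """docs: iterable of token lists (stop words already removed)."""
    counts = Counter()
    for tokens in docs:
        # Pair tokens per document so bigrams never span two articles.
        counts.update(bigrams(tokens))
    return counts.most_common(top_n)

# Toy data for illustration only.
toy_docs = [
    ["prime", "minister", "tony", "blair"],
    ["prime", "minister", "questions"],
]
print(top_bigrams(toy_docs, top_n=1))
```

One caveat: removing stop words before pairing makes words adjacent that were separated in the original text, so some bigrams will not correspond to literal phrases.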

🔄 Category Similarity Matrix

🟢 Stopwords Removed - Content-based category similarity (mean pairwise)

Interpretation Guide

  • 1.000: Same category (perfect match)
  • 0.45 - 0.80: High similarity (⚠️ potential confusion)
  • 0.15 - 0.45: Moderate similarity (acceptable)
  • 0.00 - 0.15: Low similarity (✓ easy to distinguish)

💡 Key Insight: Lower similarity between different categories is better for classification. Category pairs with high similarity may be harder for models to distinguish.
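
One way to read "mean pairwise" similarity is sketched below: average the cosine similarity over every pair of articles drawn from two different categories. The sketch uses plain term counts as vectors and fixes the diagonal at 1.0 to match the guide above. Both choices, along with the function names, are assumptions; the report's own vectorization (for example TF-IDF) may differ.

```python
import math
from collections import Counter
from itertools import product

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def category_similarity(docs_by_category):
    """docs_by_category: dict mapping category -> list of token lists."""
    vectors = {c: [Counter(d) for d in docs] for c, docs in docs_by_category.items()}
    matrix = {}
    for c1 in vectors:
        for c2 in vectors:
            if c1 == c2:
                # Assumption: the diagonal is reported as a perfect match.
                matrix[(c1, c2)] = 1.0
                continue
            sims = [cosine(a, b) for a, b in product(vectors[c1], vectors[c2])]
            matrix[(c1, c2)] = sum(sims) / len(sims) if sims else 0.0
    return matrix

# Toy data for illustration only.
toy = {
    "business": [["market", "growth"], ["bank", "market"]],
    "politics": [["election", "vote"], ["market", "vote"]],
}
similarity = category_similarity(toy)
print(round(similarity[("business", "politics")], 3))
```

In the toy data, the shared word "market" is the only source of overlap, so the off-diagonal score stays low.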

🔤 Most Frequent Words (Overall)

🟢 Stopwords Removed - Overall meaningful content words

📏 Text Statistics

Note: All statistics based on raw text (before stop words removal)

  • Min Words: 89
  • Max Words: 4432
  • Median Words: 332
  • Std Dev: 238
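
These summary values can be reproduced from a list of per-article word counts with Python's standard library, as sketched below. The function name is hypothetical, and the use of the sample standard deviation is an assumption, since the report does not say whether it uses the sample or population formula.

```python
import statistics

def length_stats(word_counts):
    """Summary statistics for a list of per-article word counts (raw text)."""
    return {
        "min": min(word_counts),
        "max": max(word_counts),
        "mean": statistics.mean(word_counts),
        "median": statistics.median(word_counts),
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "std_dev": statistics.stdev(word_counts),
    }

# Toy word counts for illustration only, not taken from the dataset.
toy_counts = [100, 200, 300, 400]
stats = length_stats(toy_counts)
print(stats)
```

In this report the mean (384 words) sits above the median (332 words) and the maximum is 4432 words, which suggests a right-skewed length distribution with a few very long articles.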

📝 Sample Articles

Business

Sample 1: UK economy facing 'major risks' The UK manufacturing sector will continue to face "serious challenges" over the next two years, the British Chamber of Commerce (BCC) has said. The group's quarterly su...
Sample 2: Aids and climate top Davos agenda Climate change and the fight against Aids are leading the list of concerns for the first day of the World Economic Forum in the Swiss resort of Davos. Some 2,000 busi...

Entertainment

Sample 1: Musicians to tackle US red tape Musicians' groups are to tackle US visa regulations which are blamed for hindering British acts' chances of succeeding across the Atlantic. A singer hoping to perform i...
Sample 2: U2's desire to be number one U2, who have won three prestigious Grammy Awards for their hit Vertigo, are stubbornly clinging to their status as one of the biggest bands in the world. The most popular ...

Politics

Sample 1: Baron Kinnock makes Lords debut Former Labour leader Neil Kinnock has officially been made a life peer during a ceremony in the House of Lords. He will be known Baron Kinnock of Bedwellty - after his ...
Sample 2: Howard taunts Blair over splits Tony Blair's feud with Gordon Brown is damaging the way the UK is governed, Tory leader Michael Howard has claimed in a heated prime minister's questions. Mr Howard ask...

Sport

Sample 1: Fuming Robinson blasts officials England coach Andy Robinson insisted he was "livid" after his side were denied two tries in Sunday's 19-13 Six Nations loss to Ireland in Dublin. Mark Cueto's first-ha...
Sample 2: Veteran Martinez wins Thai title Conchita Martinez won her first title in almost five years with victory over Anna-Lena Groenefeld at the Volvo Women's Open in Pattaya, Thailand. The 32-year-old Spani...

Tech

Sample 1: Mobiles rack up 20 years of use Mobile phones in the UK are celebrating their 20th anniversary this weekend. Britain's first mobile phone call was made across the Vodafone network on 1 January 1985 by...
Sample 2: Broadband steams ahead in the US More and more Americans are joining the internet's fast lane, according to official figures. The number of people and business connected to broadband jumped by 38% in ...

💡 Key Insights

Dataset Characteristics

  • Balanced Dataset: Categories are relatively well-balanced (17-23% each)
  • Article Length: Average 384 words per article (all words included)
  • Vocabulary: Rich vocabulary with diverse topics
  • Quality: Professional news articles with consistent formatting

Data Processing Notes

  • Raw Statistics: Word/character counts include ALL words (the, and, for, ...)
  • Vocabulary Analysis: Word frequency and unique words calculated AFTER removing ~40 common stop words
  • Stop Words Removed: the, and, for, that, with, was, has, are, been, will, said, etc.
  • Why Mixed Approach: Length stats show real article size; frequency stats show meaningful vocabulary