Scikit-Learn: Datasets & Train-Test Split
Simple, Clear Explanation for Every Student – Digital E-Filing Coach | Amanuddin Education
What is Scikit-Learn? – Simple Introduction
Imagine you want to teach a computer to recognize whether an email is spam or not spam. Scikit-Learn is a Python library (a set of ready-made tools) that helps you do exactly that – it is a machine learning library.
Why Do We Need Datasets?
A machine learning model is like a student. It learns from examples (data). Without data, the model cannot learn anything. A dataset is a collection of information, like a big spreadsheet with rows and columns.
- A dataset is a table of information with rows (examples) and columns (features)
- The model studies the dataset to find patterns
- After learning, it can make predictions on new, unseen data
- Scikit-Learn comes with ready-made practice datasets built inside it
| Concept | Real Life Example | In Machine Learning |
|---|---|---|
| Dataset | Exam question paper with answers | Table of data used to train model |
| Feature (X) | Student's name, marks, attendance | Input columns the model studies |
| Label (y) | Pass or Fail result | Output the model must predict |
| Training | Student studying before exam | Model learning from 80% of data |
| Testing | Student giving the actual exam | Model tested on 20% unseen data |
Dataset Components – Input Features (X) & Output Variable (y)
Every dataset has two main parts. Think of it like a hospital patient record: the patient details (age, blood pressure, symptoms) are the Input Features (X), and the final diagnosis (sick or healthy) is the Output Variable (y).
Input Features – X (Independent Variables)
- Also called: Features, Predictors, Independent Variables
- Example: for a used-car price prediction dataset, Name, Year, KM Driven, and Fuel Type are all features (X)
- X is usually a 2D table (matrix) with many rows and columns
- In code: X = df.drop('Price', axis=1) means "everything except the price column"
Output Variable – y (Target / Dependent Variable)
- Also called: Target, Label, Dependent Variable, Response Variable
- Two types: Classification (discrete values like 0/1, Yes/No) and Regression (continuous values like price, temperature)
- Car example: Price is the target y (we want to predict the price)
- Medical example: 0 = Healthy, 1 = Sick – this is a classification target
| Property | Input Features (X) | Output Variable (y) |
|---|---|---|
| Also Called | Independent Variables, Features | Target, Label, Dependent Variable |
| Purpose | Given to model as input information | What the model must predict |
| Shape | 2D Matrix (rows × columns) | 1D Array (one value per row) |
| Car Example | Name, Year, KM Driven, Fuel | Price |
| Medical Example | Age, BP, Sugar Level, Weight | 0 = Healthy / 1 = Sick |
| Type | Numbers, Text, Categories | Discrete (0,1) or Continuous |
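The X/y shapes in the table above can be sketched with a tiny, made-up car DataFrame (the column names and values here are illustrative only, not real data):

```python
import pandas as pd

# A tiny, hypothetical car table (illustrative values)
df = pd.DataFrame({
    "Year": [2015, 2018, 2020],
    "KM_Driven": [60000, 30000, 10000],
    "Price": [300000, 500000, 800000],
})

X = df.drop("Price", axis=1)  # 2D feature matrix: everything except the target
y = df["Price"]               # 1D target array: one value per row

print(X.shape)  # (3, 2) -> 2D matrix (rows x columns)
print(y.shape)  # (3,)   -> 1D array
```

Note how X keeps two dimensions while y is a single column of answers, exactly as the Shape row of the table describes.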
Two Types of Output (y)
Built-in Datasets in Scikit-Learn (sklearn.datasets)
Scikit-Learn is so helpful that it comes with free, ready-made practice datasets built inside the library. You don't need to download anything – just call a function and get data immediately!
Module: sklearn.datasets
- This is the sub-module (a folder inside Scikit-Learn) that contains all the built-in datasets
- Import it with: from sklearn import datasets
- It contains toy datasets (small, for learning) and real-world datasets (larger)
- All datasets follow the same structure: .data for features, .target for labels
load_iris – The Most Famous Practice Dataset
- Task: classify which type of flower it is (0, 1, or 2) – a Classification problem
- Features (X): Sepal Length, Sepal Width, Petal Length, Petal Width
- Target (y): 0 = Setosa, 1 = Versicolour, 2 = Virginica
- Code: iris = datasets.load_iris()
load_breast_cancer – Medical Diagnosis Dataset
- Task: classify a tumor as malignant (0) or benign (1) – Binary Classification
- Features (X): 30 measurements of the tumor (radius, texture, perimeter, area, etc.)
- Target (y): 0 = Malignant, 1 = Benign
- Code: cancer = datasets.load_breast_cancer()
Dataset Attributes: .data and .target
| Dataset | Rows | Features (X) | Target (y) | Problem Type | Real Use |
|---|---|---|---|---|---|
| load_iris | 150 | 4 | 3 classes (0,1,2) | Classification | Flower identification |
| load_breast_cancer | 569 | 30 | 2 classes (0,1) | Classification | Medical diagnosis |
| load_digits | 1797 | 64 | 10 classes (0–9) | Classification | Handwritten digit recognition |
| load_boston (removed in Scikit-Learn 1.2) | 506 | 13 | Continuous price | Regression | House price prediction |
| load_wine | 178 | 13 | 3 classes | Classification | Wine quality classification |
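The row and feature counts in the table can be verified directly in code. This sketch loops over three of the loaders (load_boston is skipped because it was removed in recent Scikit-Learn versions):

```python
from sklearn.datasets import load_iris, load_digits, load_wine

# Each loader returns a Bunch object with .data (X) and .target (y)
for loader in (load_iris, load_digits, load_wine):
    bunch = loader()
    rows, features = bunch.data.shape
    classes = len(set(bunch.target))
    print(loader.__name__, rows, "rows,", features, "features,", classes, "classes")
```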
Sample Code: Loading a Built-in Dataset
from sklearn import datasets
import pandas as pd
# Load iris dataset
iris = datasets.load_iris()
# .data gives you X (features)
X = iris.data
print("Features Shape:", X.shape) # Output: (150, 4)
# .target gives you y (labels)
y = iris.target
print("Target Shape:", y.shape) # Output: (150,)
print("Classes:", iris.target_names) # Output: ['setosa' 'versicolor' 'virginica']
# Load breast cancer dataset
cancer = datasets.load_breast_cancer()
print("Cancer Features:", cancer.data.shape) # (569, 30)
Train-Test Split Process – train_test_split()
Here is the most important concept! Imagine you are a teacher. You have 100 exam questions. You use 80 questions to teach the student and keep 20 questions hidden to test the student later. That is exactly what Train-Test Split does!
Function: train_test_split()
- Comes from: from sklearn.model_selection import train_test_split
- It randomly shuffles and splits your data into training and testing portions
- The test set is never shown to the model during training – it is kept completely hidden
- After training, we use the test set to check whether the model performs well on new data
Parameters of train_test_split()
| Parameter | What It Does | Common Value | Example |
|---|---|---|---|
| X | Your input features matrix | Required | X (all feature columns) |
| y | Your target labels array | Required | y (target column) |
| test_size | What fraction goes to testing | 0.2 (= 20%) | 0.2 means 20% for test |
| train_size | What fraction goes to training | 0.8 (= 80%) | Auto-calculated if test_size given |
| random_state | Sets a seed for reproducibility | Any number (e.g., 42) | 42 means the split is always the same |
| shuffle | Shuffle data before splitting | True (default) | shuffle=True recommended |
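A quick sketch of what the shuffle parameter in the table controls, using a tiny made-up array. With shuffle=False the original row order is kept, so the split is easy to see: the first 80% of rows become the training set and the last 20% become the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(10, 1)  # 10 rows, feature values 0..9
y = np.arange(10)                 # matching labels 0..9

# shuffle=False keeps the original order: first 80% train, last 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print(y_train)  # [0 1 2 3 4 5 6 7]
print(y_test)   # [8 9]
```

With the default shuffle=True, the same call would instead pick the test rows at random.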
Output Variables: What Does the Function Return?
X_train, X_test, y_train, y_test
| Output Variable | What It Contains | Size | Used For |
|---|---|---|---|
| X_train | Training features (input) | 80% of data | Model learns from this |
| X_test | Testing features (input) | 20% of data | Model makes predictions on this |
| y_train | Training labels (correct answers) | 80% of data | Model uses these to learn |
| y_test | Testing labels (correct answers) | 20% of data | We compare predictions vs these |
What is random_state? – Very Important!
- Without random_state: every time you run the code, the data is split differently, so results change each run
- With random_state=42: the split is always the same, so you can reproduce and compare results
- Any integer works (42 is popular by convention, but 0, 1, or 100 work just as well)
- For scientific/academic work, always set random_state so experiments are reproducible
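A minimal demonstration of the reproducibility point above: splitting the same made-up data twice with random_state=42 produces an identical test set both times.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(20, 1)  # 20 rows of made-up feature values
y = np.arange(20)                 # matching labels

# Same seed -> identical split on every call (and every run)
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print((X_test_a == X_test_b).all())  # True
```

Omitting random_state would let the two calls draw different random test rows.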
Complete Code Example
from sklearn import datasets
from sklearn.model_selection import train_test_split
# Step 1: Load dataset
iris = datasets.load_iris()
X = iris.data # Features (150 rows × 4 columns)
y = iris.target # Labels (150 values: 0, 1, or 2)
# Step 2: Split into Train (80%) and Test (20%)
X_train, X_test, y_train, y_test = train_test_split(
X, # Features
y, # Labels
test_size=0.2, # 20% goes to test set
random_state=42 # Ensures same split every time
)
# Step 3: Check sizes
print("X_train shape:", X_train.shape) # (120, 4) – 80% of 150
print("X_test shape:", X_test.shape) # (30, 4) – 20% of 150
print("y_train shape:", y_train.shape) # (120,)
print("y_test shape:", y_test.shape) # (30,)
Workflow – Training, Testing, Evaluation & Goal
Now let us understand the complete step-by-step workflow of how machine learning actually works after you split your data.
Step 1 – Training: Using 80% of the Data
- Model uses: X_train + y_train together
- Like a student reading a textbook with answers – the model sees both questions and answers
- Code: model.fit(X_train, y_train)
- The model adjusts its internal settings (called "parameters") to reduce mistakes
Step 2 – Testing: Using 20% of the Data
- The model has NEVER seen this data before – this ensures a fair test
- Code: y_pred = model.predict(X_test)
- y_test (the correct answers) is kept hidden from the model until evaluation
- Using training data for testing would be like giving the student the same questions again – that would be cheating!
Step 3 – Evaluation: Compare Predictions with y_test
- Metric for Classification: Accuracy Score = correct predictions ÷ total predictions × 100%
- Metrics for Regression: Mean Squared Error (MSE), R² Score
- Code: from sklearn.metrics import accuracy_score
- Code: accuracy_score(y_test, y_pred)
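The accuracy formula above can be checked by hand against accuracy_score, using small made-up label arrays:

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_test = np.array([0, 1, 1, 0, 1])  # made-up true labels
y_pred = np.array([0, 1, 0, 0, 1])  # made-up predictions (one mistake)

# accuracy = correct predictions / total predictions
manual = (y_test == y_pred).sum() / len(y_test)

print(manual)                          # 0.8
print(accuracy_score(y_test, y_pred))  # 0.8 -> same result
```

Four of the five predictions match, so both the hand calculation and Scikit-Learn agree on 80% accuracy.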
Step 4 – Goal: Check Model Accuracy
| Phase | Data Used | What Happens | Code |
|---|---|---|---|
| Training | X_train + y_train (80%) | Model learns from examples | model.fit(X_train, y_train) |
| Testing | X_test only (20%) | Model makes predictions | y_pred = model.predict(X_test) |
| Evaluation | y_pred vs y_test | We measure model accuracy | accuracy_score(y_test, y_pred) |
| Goal | Final metric | Check generalization ability | Print final score |
Implementation Tools – Scikit-Learn, Pandas & df.drop()
When working with Scikit-Learn in real projects, you typically use three main tools together. Think of them as a team where each member does a specific job.
Tool 1: Scikit-Learn – The ML Engine
- Install: pip install scikit-learn
- Provides: 40+ ML algorithms ready to use
- Key functions: train_test_split, fit(), predict(), accuracy_score()
- Import example: from sklearn.model_selection import train_test_split
Tool 2: Pandas – Data Loading & Cleaning
- Install: pip install pandas
- Load data: df = pd.read_csv('data.csv')
- Explore data: df.head(), df.info(), df.describe()
- Clean data: df.dropna(), df.fillna(), df.rename()
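A small sketch of the cleaning functions listed above, applied to a hypothetical DataFrame with one missing value (the column names and values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: the third row is missing its Year
df = pd.DataFrame({"Year": [2015, 2018, np.nan], "Price": [3.0, 5.0, 8.0]})

dropped = df.dropna()             # option 1: remove rows with missing values
filled = df.fillna({"Year": 0})   # option 2: fill missing Year with a placeholder

print(len(dropped))             # 2  -> the incomplete row was removed
print(filled["Year"].tolist())  # [2015.0, 2018.0, 0.0]
```

Whether to drop or fill depends on the dataset; filling keeps all rows but introduces a placeholder value the model will see.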
Tool 3: df.drop() – Removing the Target from X
- Purpose: remove the target column from the feature set
- Usage: X = df.drop('Price', axis=1) removes the 'Price' column
- axis=1 means column; axis=0 means row
- Then: y = df['Price'] gets the target column separately
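A minimal sketch of the axis argument, using a made-up two-row DataFrame: axis=1 removes a column by name, while axis=0 removes a row by its index label.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"Name": ["A", "B"], "Year": [2015, 2018], "Price": [3, 5]})

X = df.drop("Price", axis=1)      # axis=1 drops a COLUMN (the target)
row_dropped = df.drop(0, axis=0)  # axis=0 drops a ROW (index label 0)

print(list(X.columns))   # ['Name', 'Year']
print(len(row_dropped))  # 1
```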
Complete Real-World Code (Using All Three Tools)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# ---- Step 1: Load data using Pandas ----
df = pd.read_csv('car_data.csv')
print(df.head()) # See first 5 rows
print(df.info()) # Check column types
# ---- Step 2: Separate X and y using df.drop() ----
X = df.drop('Sold', axis=1) # Features: everything except 'Sold' column
y = df['Sold'] # Target: just the 'Sold' column
# ---- Step 3: Split using train_test_split ----
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# ---- Step 4: Train model using Scikit-Learn ----
model = DecisionTreeClassifier()
model.fit(X_train, y_train) # Train on 80%
# ---- Step 5: Predict and Evaluate ----
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
| Tool | Main Job | Key Functions | When to Use |
|---|---|---|---|
| Scikit-Learn | ML algorithms & utilities | train_test_split, fit, predict, accuracy_score | Building and evaluating models |
| Pandas | Data loading & cleaning | read_csv, head, dropna, fillna | Loading and preparing raw data |
| df.drop() | Separate X from y | df.drop('col', axis=1) | Before splitting data |
Flowchart – Complete ML Workflow with Scikit-Learn
1. Import libraries: pandas, sklearn, numpy
2. Load data: pd.read_csv() OR sklearn.datasets.load_iris()
3. Explore data: df.head() | df.info() | df.describe()
4. Clean data: handle missing values | encode categories
5. Separate features and target: X = df.drop('target', axis=1) | y = df['target']
6. Split: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) – giving X_train (80% features), y_train (80% labels), X_test (20% features), y_test (20% labels)
7. Train: model.fit(X_train, y_train)
8. Predict: y_pred = model.predict(X_test)
9. Evaluate: accuracy_score(y_test, y_pred)
10. If the model is GOOD, deploy; otherwise tune the model and go back to step 7
Mind Map – Scikit-Learn: Datasets & Train-Test Split
Roadmap – Complete Learning Path for Scikit-Learn & ML
1. Foundation – Python Basics: Variables; Lists & Loops; Functions; pip install
2. Data Tools – NumPy & Pandas: Arrays; DataFrames; read_csv(); df.drop()
3. Datasets – Loading & EDA: load_iris(); load_breast_cancer(); Explore Data; Visualize
4. Splitting – Train-Test Split: Separate X & y; train_test_split(); test_size=0.2; random_state
5. Modeling – Training ML Models: Decision Tree; Random Forest; model.fit(); SVM, KNN
6. Evaluation – Metrics & Tuning: accuracy_score(); Confusion Matrix; Cross-Validation; GridSearchCV
7. Deployment – Real-World Usage: Save with pickle; Flask / FastAPI; Predict New Data; Monitor Results
| Stage | Topic | Key Libraries | Time (Approx.) |
|---|---|---|---|
| 1. Foundation | Python Basics | Python, pip | 1–2 weeks |
| 2. Data Tools | NumPy & Pandas | numpy, pandas | 1–2 weeks |
| 3. Datasets | Loading & EDA | sklearn.datasets, matplotlib | 1 week |
| 4. Splitting | Train-Test Split | sklearn.model_selection | 2–3 days |
| 5. Modeling | Training ML Models | sklearn classifiers/regressors | 2–3 weeks |
| 6. Evaluation | Metrics & Tuning | sklearn.metrics, GridSearchCV | 1–2 weeks |
| 7. Deployment | Real-World Usage | pickle, Flask, FastAPI | 2–4 weeks |
This resource is for educational purposes only and does not constitute professional technical or legal advice. All code examples are simplified for learning. Always refer to the official Scikit-Learn documentation at scikit-learn.org for production use. Content prepared by Digital E-Filing Coach – Amanuddin Education.
