Scikit-Learn: Datasets & Train-Test Split
Simple, Clear Explanation for Every Student – Digital E-Filing Coach | Amanuddin Education
What is Scikit-Learn? – Simple Introduction
Imagine you want to teach a computer to recognize whether an email is spam or not spam. Scikit-Learn is a Python library (a set of ready-made tools) that helps you do exactly that – it is a machine learning library.
Why Do We Need Datasets?
A machine learning model is like a student. It learns from examples (data). Without data, the model cannot learn anything. A dataset is a collection of information, like a big spreadsheet with rows and columns.
- A dataset is a table of information with rows (examples) and columns (features)
- The model studies the dataset to find patterns
- After learning, it can make predictions on new, unseen data
- Scikit-Learn comes with ready-made practice datasets built inside it
| Concept | Real Life Example | In Machine Learning |
|---|---|---|
| Dataset | Exam question paper with answers | Table of data used to train model |
| Feature (X) | Student's name, marks, attendance | Input columns the model studies |
| Label (y) | Pass or Fail result | Output the model must predict |
| Training | Student studying before exam | Model learning from 80% of data |
| Testing | Student giving the actual exam | Model tested on 20% unseen data |
Dataset Components – Input Features (X) & Output Variable (y)
Every dataset has two main parts. Think of it like a hospital patient record: the patient details (age, blood pressure, symptoms) are the Input Features (X), and the final diagnosis (sick or healthy) is the Output Variable (y).
Input Features – X (Independent Variables)
- Also called: Features, Predictors, Independent Variables
- Example: for a used-car price prediction dataset, Name, Year, KM Driven, and Fuel Type are all features (X)
- X is usually a 2D table (matrix) with many rows and columns
- In code: X = df.drop('Price', axis=1) means "everything except the price column"
Output Variable – y (Target / Dependent Variable)
- Also called: Target, Label, Dependent Variable, Response Variable
- Two types: Classification (discrete values like 0/1, Yes/No) and Regression (continuous values like price, temperature)
- Car example: Price is the target y (we want to predict the price)
- Medical example: 0 = Healthy, 1 = Sick – this is a classification target
| Property | Input Features (X) | Output Variable (y) |
|---|---|---|
| Also Called | Independent Variables, Features | Target, Label, Dependent Variable |
| Purpose | Given to model as input information | What the model must predict |
| Shape | 2D Matrix (rows × columns) | 1D Array (one value per row) |
| Car Example | Name, Year, KM Driven, Fuel | Price |
| Medical Example | Age, BP, Sugar Level, Weight | 0 = Healthy / 1 = Sick |
| Type | Numbers, Text, Categories | Discrete (0,1) or Continuous |
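The X/y shapes in the table above can be sketched with a tiny, made-up car DataFrame (the column names and values here are illustrative only, not real data):

```python
import pandas as pd

# A tiny, hypothetical car table (illustrative values)
df = pd.DataFrame({
    "Year": [2015, 2018, 2020],
    "KM_Driven": [60000, 30000, 10000],
    "Price": [300000, 500000, 800000],
})

X = df.drop("Price", axis=1)  # 2D feature matrix: everything except the target
y = df["Price"]               # 1D target array: one value per row

print(X.shape)  # (3, 2) -> 2D matrix (rows x columns)
print(y.shape)  # (3,)   -> 1D array
```

Note how X keeps two dimensions while y is a single column of answers, exactly as the Shape row of the table describes.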
Two Types of Output (y)
Built-in Datasets in Scikit-Learn (sklearn.datasets)
Scikit-Learn is so helpful that it comes with free, ready-made practice datasets built inside the library. You don't need to download anything – just call a function and get data immediately!
Module: sklearn.datasets
- This is the sub-module (a folder inside Scikit-Learn) that contains all the built-in datasets
- Import it with: from sklearn import datasets
- It contains toy datasets (small, for learning) and real-world datasets (larger)
- All datasets follow the same structure: .data for features, .target for labels
load_iris – The Most Famous Practice Dataset
- Task: classify which type of flower it is (0, 1, or 2) – a Classification problem
- Features (X): Sepal Length, Sepal Width, Petal Length, Petal Width
- Target (y): 0 = Setosa, 1 = Versicolour, 2 = Virginica
- Code: iris = datasets.load_iris()
load_breast_cancer – Medical Diagnosis Dataset
- Task: classify a tumor as malignant (0) or benign (1) – Binary Classification
- Features (X): 30 measurements of the tumor (radius, texture, perimeter, area, etc.)
- Target (y): 0 = Malignant, 1 = Benign
- Code: cancer = datasets.load_breast_cancer()
Dataset Attributes: .data and .target
| Dataset | Rows | Features (X) | Target (y) | Problem Type | Real Use |
|---|---|---|---|---|---|
| load_iris | 150 | 4 | 3 classes (0,1,2) | Classification | Flower identification |
| load_breast_cancer | 569 | 30 | 2 classes (0,1) | Classification | Medical diagnosis |
| load_digits | 1797 | 64 | 10 classes (0–9) | Classification | Handwritten digit recognition |
| load_boston (removed in Scikit-Learn 1.2) | 506 | 13 | Continuous price | Regression | House price prediction |
| load_wine | 178 | 13 | 3 classes | Classification | Wine quality classification |
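The row and feature counts in the table can be verified directly in code. This sketch loops over three of the loaders (load_boston is skipped because it was removed in recent Scikit-Learn versions):

```python
from sklearn.datasets import load_iris, load_digits, load_wine

# Each loader returns a Bunch object with .data (X) and .target (y)
for loader in (load_iris, load_digits, load_wine):
    bunch = loader()
    rows, features = bunch.data.shape
    classes = len(set(bunch.target))
    print(loader.__name__, rows, "rows,", features, "features,", classes, "classes")
```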
Sample Code: Loading a Built-in Dataset
from sklearn import datasets
import pandas as pd
# Load iris dataset
iris = datasets.load_iris()
# .data gives you X (features)
X = iris.data
print("Features Shape:", X.shape) # Output: (150, 4)
# .target gives you y (labels)
y = iris.target
print("Target Shape:", y.shape) # Output: (150,)
print("Classes:", iris.target_names) # Output: ['setosa' 'versicolor' 'virginica']
# Load breast cancer dataset
cancer = datasets.load_breast_cancer()
print("Cancer Features:", cancer.data.shape) # (569, 30)
Train-Test Split Process – train_test_split()
Here is the most important concept! Imagine you are a teacher. You have 100 exam questions. You use 80 questions to teach the student and keep 20 questions hidden to test the student later. That is exactly what Train-Test Split does!
Function: train_test_split()
- Comes from: from sklearn.model_selection import train_test_split
- It randomly shuffles and splits your data into training and testing portions
- The test set is never shown to the model during training – it is kept completely hidden
- After training, we use the test set to check whether the model performs well on new data
Parameters of train_test_split()
| Parameter | What It Does | Common Value | Example |
|---|---|---|---|
| X | Your input features matrix | Required | X (all feature columns) |
| y | Your target labels array | Required | y (target column) |
| test_size | What fraction goes to testing | 0.2 (= 20%) | 0.2 means 20% for test |
| train_size | What fraction goes to training | 0.8 (= 80%) | Auto-calculated if test_size given |
| random_state | Sets a seed for reproducibility | Any number (e.g., 42) | 42 means the split is always the same |
| shuffle | Shuffle data before splitting | True (default) | shuffle=True recommended |
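A quick sketch of what the shuffle parameter in the table controls, using a tiny made-up array. With shuffle=False the original row order is kept, so the split is easy to see: the first 80% of rows become the training set and the last 20% become the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(10, 1)  # 10 rows, feature values 0..9
y = np.arange(10)                 # matching labels 0..9

# shuffle=False keeps the original order: first 80% train, last 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print(y_train)  # [0 1 2 3 4 5 6 7]
print(y_test)   # [8 9]
```

With the default shuffle=True, the same call would instead pick the test rows at random.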
Output Variables: What Does the Function Return?
X_train, X_test, y_train, y_test
| Output Variable | What It Contains | Size | Used For |
|---|---|---|---|
| X_train | Training features (input) | 80% of data | Model learns from this |
| X_test | Testing features (input) | 20% of data | Model makes predictions on this |
| y_train | Training labels (correct answers) | 80% of data | Model uses these to learn |
| y_test | Testing labels (correct answers) | 20% of data | We compare predictions vs these |
What is random_state? – Very Important!
- Without random_state: every time you run the code, the data is split differently, so results change each run
- With random_state=42: the split is always the same, so you can reproduce and compare results
- Any integer works (42 is popular by convention, but 0, 1, or 100 work just as well)
- For scientific/academic work, always set random_state so experiments are reproducible
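A minimal demonstration of the reproducibility point above: splitting the same made-up data twice with random_state=42 produces an identical test set both times.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(20, 1)  # 20 rows of made-up feature values
y = np.arange(20)                 # matching labels

# Same seed -> identical split on every call (and every run)
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print((X_test_a == X_test_b).all())  # True
```

Omitting random_state would let the two calls draw different random test rows.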
Complete Code Example
from sklearn import datasets
from sklearn.model_selection import train_test_split
# Step 1: Load dataset
iris = datasets.load_iris()
X = iris.data # Features (150 rows × 4 columns)
y = iris.target # Labels (150 values: 0, 1, or 2)
# Step 2: Split into Train (80%) and Test (20%)
X_train, X_test, y_train, y_test = train_test_split(
X, # Features
y, # Labels
test_size=0.2, # 20% goes to test set
random_state=42 # Ensures same split every time
)
# Step 3: Check sizes
print("X_train shape:", X_train.shape) # (120, 4) – 80% of 150
print("X_test shape:", X_test.shape) # (30, 4) – 20% of 150
print("y_train shape:", y_train.shape) # (120,)
print("y_test shape:", y_test.shape) # (30,)
Workflow – Training, Testing, Evaluation & Goal
Now let us understand the complete step-by-step workflow of how machine learning actually works after you split your data.
Step 1 – Training: Using 80% of the Data
- Model uses: X_train + y_train together
- Like a student reading a textbook with answers – the model sees both questions and answers
- Code: model.fit(X_train, y_train)
- The model adjusts its internal settings (called "parameters") to reduce mistakes
Step 2 – Testing: Using 20% of the Data
- The model has NEVER seen this data before – this ensures a fair test
- Code: y_pred = model.predict(X_test)
- y_test (the correct answers) is kept hidden from the model until evaluation
- Using training data for testing would be like giving the student the same questions again – that would be cheating!
Step 3 – Evaluation: Compare Predictions with y_test
- Metric for Classification: Accuracy Score = correct predictions ÷ total predictions × 100%
- Metrics for Regression: Mean Squared Error (MSE), R² Score
- Code: from sklearn.metrics import accuracy_score
- Code: accuracy_score(y_test, y_pred)
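The accuracy formula above can be checked by hand against accuracy_score, using small made-up label arrays:

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_test = np.array([0, 1, 1, 0, 1])  # made-up true labels
y_pred = np.array([0, 1, 0, 0, 1])  # made-up predictions (one mistake)

# accuracy = correct predictions / total predictions
manual = (y_test == y_pred).sum() / len(y_test)

print(manual)                          # 0.8
print(accuracy_score(y_test, y_pred))  # 0.8 -> same result
```

Four of the five predictions match, so both the hand calculation and Scikit-Learn agree on 80% accuracy.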
Step 4 – Goal: Check Model Accuracy
| Phase | Data Used | What Happens | Code |
|---|---|---|---|
| Training | X_train + y_train (80%) | Model learns from examples | model.fit(X_train, y_train) |
| Testing | X_test only (20%) | Model makes predictions | y_pred = model.predict(X_test) |
| Evaluation | y_pred vs y_test | We measure model accuracy | accuracy_score(y_test, y_pred) |
| Goal | Final metric | Check generalization ability | Print final score |
Implementation Tools – Scikit-Learn, Pandas & df.drop()
When working with Scikit-Learn in real projects, you typically use three main tools together. Think of them as a team where each member does a specific job.
Tool 1: Scikit-Learn – The ML Engine
- Install: pip install scikit-learn
- Provides: 40+ ML algorithms ready to use
- Key functions: train_test_split, fit(), predict(), accuracy_score()
- Import example: from sklearn.model_selection import train_test_split
Tool 2: Pandas – Data Loading & Cleaning
- Install: pip install pandas
- Load data: df = pd.read_csv('data.csv')
- Explore data: df.head(), df.info(), df.describe()
- Clean data: df.dropna(), df.fillna(), df.rename()
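A small sketch of the cleaning functions listed above, applied to a hypothetical DataFrame with one missing value (the column names and values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: the third row is missing its Year
df = pd.DataFrame({"Year": [2015, 2018, np.nan], "Price": [3.0, 5.0, 8.0]})

dropped = df.dropna()             # option 1: remove rows with missing values
filled = df.fillna({"Year": 0})   # option 2: fill missing Year with a placeholder

print(len(dropped))             # 2  -> the incomplete row was removed
print(filled["Year"].tolist())  # [2015.0, 2018.0, 0.0]
```

Whether to drop or fill depends on the dataset; filling keeps all rows but introduces a placeholder value the model will see.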
Tool 3: df.drop() – Removing the Target from X
- Purpose: remove the target column from the feature set
- Usage: X = df.drop('Price', axis=1) removes the 'Price' column
- axis=1 means column; axis=0 means row
- Then: y = df['Price'] gets the target column separately
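A minimal sketch of the axis argument, using a made-up two-row DataFrame: axis=1 removes a column by name, while axis=0 removes a row by its index label.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"Name": ["A", "B"], "Year": [2015, 2018], "Price": [3, 5]})

X = df.drop("Price", axis=1)      # axis=1 drops a COLUMN (the target)
row_dropped = df.drop(0, axis=0)  # axis=0 drops a ROW (index label 0)

print(list(X.columns))   # ['Name', 'Year']
print(len(row_dropped))  # 1
```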
Complete Real-World Code (Using All Three Tools)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# ---- Step 1: Load data using Pandas ----
df = pd.read_csv('car_data.csv')
print(df.head()) # See first 5 rows
print(df.info()) # Check column types
# ---- Step 2: Separate X and y using df.drop() ----
X = df.drop('Sold', axis=1) # Features: everything except 'Sold' column
y = df['Sold'] # Target: just the 'Sold' column
# ---- Step 3: Split using train_test_split ----
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# ---- Step 4: Train model using Scikit-Learn ----
model = DecisionTreeClassifier()
model.fit(X_train, y_train) # Train on 80%
# ---- Step 5: Predict and Evaluate ----
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
| Tool | Main Job | Key Functions | When to Use |
|---|---|---|---|
| Scikit-Learn | ML algorithms & utilities | train_test_split, fit, predict, accuracy_score | Building and evaluating models |
| Pandas | Data loading & cleaning | read_csv, head, dropna, fillna | Loading and preparing raw data |
| df.drop() | Separate X from y | df.drop('col', axis=1) | Before splitting data |
Flowchart – Complete ML Workflow with Scikit-Learn
1. Import libraries: pandas, sklearn, numpy
2. Load data: pd.read_csv() OR sklearn.datasets.load_iris()
3. Explore data: df.head() | df.info() | df.describe()
4. Clean data: handle missing values | encode categories
5. Separate features and target: X = df.drop('target', axis=1) | y = df['target']
6. Split: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) – giving X_train (80% features), y_train (80% labels), X_test (20% features), y_test (20% labels)
7. Train: model.fit(X_train, y_train)
8. Predict: y_pred = model.predict(X_test)
9. Evaluate: accuracy_score(y_test, y_pred)
10. If the model is GOOD, deploy; otherwise tune the model and go back to step 7
Mind Map – Scikit-Learn: Datasets & Train-Test Split
Roadmap – Complete Learning Path for Scikit-Learn & ML
1. Foundation – Python Basics: Variables; Lists & Loops; Functions; pip install
2. Data Tools – NumPy & Pandas: Arrays; DataFrames; read_csv(); df.drop()
3. Datasets – Loading & EDA: load_iris(); load_breast_cancer(); Explore Data; Visualize
4. Splitting – Train-Test Split: Separate X & y; train_test_split(); test_size=0.2; random_state
5. Modeling – Training ML Models: Decision Tree; Random Forest; model.fit(); SVM, KNN
6. Evaluation – Metrics & Tuning: accuracy_score(); Confusion Matrix; Cross-Validation; GridSearchCV
7. Deployment – Real-World Usage: Save with pickle; Flask / FastAPI; Predict New Data; Monitor Results
| Stage | Topic | Key Libraries | Time (Approx.) |
|---|---|---|---|
| 1. Foundation | Python Basics | Python, pip | 1–2 weeks |
| 2. Data Tools | NumPy & Pandas | numpy, pandas | 1–2 weeks |
| 3. Datasets | Loading & EDA | sklearn.datasets, matplotlib | 1 week |
| 4. Splitting | Train-Test Split | sklearn.model_selection | 2–3 days |
| 5. Modeling | Training ML Models | sklearn classifiers/regressors | 2–3 weeks |
| 6. Evaluation | Metrics & Tuning | sklearn.metrics, GridSearchCV | 1–2 weeks |
| 7. Deployment | Real-World Usage | pickle, Flask, FastAPI | 2–4 weeks |
This resource is for educational purposes only and does not constitute professional technical or legal advice. All code examples are simplified for learning. Always refer to the official Scikit-Learn documentation at scikit-learn.org for production use. Content prepared by Digital E-Filing Coach – Amanuddin Education.
