
Scikit-Learn: Datasets & Train-Test Split | Digital E-Filing Coach

🤖 Scikit-Learn: Datasets & Train-Test Split

Simple, Clear Explanation for Every Student - Digital E-Filing Coach | Amanuddin Education

📌 What is Scikit-Learn? - Simple Introduction

Imagine you want to teach a computer to recognize whether an email is spam or not spam. Scikit-Learn is a Python library (a set of ready-made tools) that helps you do exactly that - it is a Machine Learning library.

💡 Simple Definition: Scikit-Learn = A free Python toolbox that lets you build, train, and test Machine Learning models without writing all the math from scratch.

Why Do We Need Datasets?

A machine learning model is like a student. It learns from examples (data). Without data, the model cannot learn anything. A dataset is a collection of information, like a big spreadsheet with rows and columns.

  • 📊 A dataset is a table of information with rows (examples) and columns (features)
  • 🎯 The model studies the dataset to find patterns
  • ✅ After learning, it can make predictions on new, unseen data
  • 📦 Scikit-Learn comes with ready-made practice datasets built inside it
| Concept | Real Life Example | In Machine Learning |
|---|---|---|
| Dataset | Exam question paper with answers | Table of data used to train the model |
| Feature (X) | Student's name, marks, attendance | Input columns the model studies |
| Label (y) | Pass or Fail result | Output the model must predict |
| Training | Student studying before the exam | Model learning from 80% of the data |
| Testing | Student taking the actual exam | Model tested on 20% unseen data |
📦 Dataset Components - Input Features (X) & Output Variable (y)

Every dataset has two main parts. Think of it like a hospital patient record - the patient details (age, blood pressure, symptoms) are the Input Features (X), and the final diagnosis (sick or healthy) is the Output Variable (y).

🔷 Input Features - X (Independent Variables)

What is X? These are the columns of data that the model uses to learn. They are called "independent variables" because they are given to the model as information.
  • ๐Ÿ“ Also called: Features, Predictors, Independent Variables
  • ๐Ÿš— Example: For a used-car price prediction dataset: Name, Year, KM Driven, Fuel Type โ€” these are all features (X)
  • ๐Ÿ“Š X is usually a 2D table (matrix) with many rows and columns
  • ๐Ÿงฎ In code: X = df.drop('Price', axis=1) โ€” means "everything except the price column"
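To make the X/y separation concrete, here is a minimal sketch using a tiny made-up car table (all column names and values are invented for illustration only):

```python
import pandas as pd

# Hypothetical used-car data (every value here is made up)
df = pd.DataFrame({
    "Name": ["Swift", "i20", "Baleno"],
    "Year": [2015, 2018, 2020],
    "KM_Driven": [60000, 30000, 15000],
    "Price": [350000, 550000, 700000],
})

X = df.drop("Price", axis=1)  # features: every column except the target
y = df["Price"]               # target: the single column we want to predict

print(X.columns.tolist())  # ['Name', 'Year', 'KM_Driven']
print(X.shape, y.shape)    # (3, 3) (3,)
```

Note that X stays a 2D table (3 rows × 3 columns) while y is a 1D column (3 values), exactly as the table below describes.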

🔶 Output Variable - y (Target / Dependent Variable)

What is y? This is the answer the model must predict. It is called "dependent" because its value depends on the input features.
  • 🎯 Also called: Target, Label, Dependent Variable, Response Variable
  • Two types: Classification (discrete values like 0/1, Yes/No) and Regression (continuous values like price, temperature)
  • 🚗 Car example: Price is the target y (we want to predict the price)
  • 🩺 Medical example: 0 = Healthy, 1 = Sick - this is a classification target
| Property | Input Features (X) | Output Variable (y) |
|---|---|---|
| Also Called | Independent Variables, Features | Target, Label, Dependent Variable |
| Purpose | Given to the model as input information | What the model must predict |
| Shape | 2D matrix (rows × columns) | 1D array (one value per row) |
| Car Example | Name, Year, KM Driven, Fuel | Price |
| Medical Example | Age, BP, Sugar Level, Weight | 0 = Healthy / 1 = Sick |
| Type | Numbers, text, categories | Discrete (0, 1) or continuous |

📌 Two Types of Output (y)

Classification (Discrete): Output is a category. Example: Spam = 1, Not Spam = 0. The answer is one of a few fixed choices.
Regression (Continuous): Output is a number. Example: House Price = ₹45,00,000. The answer can be any number in a range.
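The two kinds of target can be seen side by side in a small sketch (the label values below are made up for illustration):

```python
import numpy as np

# Classification target: one of a few fixed categories (discrete)
y_spam = np.array([0, 1, 1, 0, 1])   # 1 = spam, 0 = not spam
print(np.unique(y_spam))             # [0 1] -- only two possible answers

# Regression target: any number in a range (continuous)
y_price = np.array([4500000.0, 3200000.0, 5100000.0])  # made-up house prices
print(y_price.min(), y_price.max())  # 3200000.0 5100000.0
```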
🗄️ Built-in Datasets in Scikit-Learn (sklearn.datasets)

Scikit-Learn is so helpful that it comes with free, ready-made practice datasets built inside the library. You don't need to download anything - just call a function and get data immediately!

๐ŸŽ Think of it like this: Scikit-Learn gives you free sample question papers (datasets) so you can practice machine learning without searching for data online.

📂 Module: sklearn.datasets

  • This is the sub-module (folder inside Scikit-Learn) that contains all the built-in datasets
  • Import it with: from sklearn import datasets
  • It contains toy datasets (small, for learning) and real-world datasets (larger)
  • All datasets follow the same structure: .data for features, .target for labels
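One quick way to see what is available is to list the load_* functions inside the module (a small sketch; the exact list depends on your scikit-learn version):

```python
from sklearn import datasets

# Every small built-in dataset is exposed as a load_* function
loaders = [name for name in dir(datasets) if name.startswith("load_")]
print(loaders)  # includes 'load_iris', 'load_breast_cancer', 'load_digits', ...
```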

🌸 load_iris - The Most Famous Practice Dataset

What is Iris? Data about 3 types of iris flowers. Has 150 rows and 4 feature columns. A beginner's favorite!
  • 🌺 Task: classify which type of flower it is (0, 1, or 2) - a Classification problem
  • 📊 Features (X): Sepal Length, Sepal Width, Petal Length, Petal Width
  • 🎯 Target (y): 0 = Setosa, 1 = Versicolor, 2 = Virginica
  • 💻 Code: iris = datasets.load_iris()

🩺 load_breast_cancer - Medical Diagnosis Dataset

What is the Breast Cancer dataset? Data about tumors - is the tumor malignant (dangerous) or benign (safe)? 569 rows, 30 features.
  • 🔬 Task: classify the tumor as malignant (0) or benign (1) - Binary Classification
  • 📊 Features (X): 30 measurements of the tumor (radius, texture, perimeter, area, etc.)
  • 🎯 Target (y): 0 = Malignant, 1 = Benign
  • 💻 Code: cancer = datasets.load_breast_cancer()

🔑 Dataset Attributes: .data and .target

Every built-in dataset object has two key parts: .data (the X features) and .target (the y labels).
| Dataset | Rows | Features (X) | Target (y) | Problem Type | Real Use |
|---|---|---|---|---|---|
| load_iris | 150 | 4 | 3 classes (0, 1, 2) | Classification | Flower identification |
| load_breast_cancer | 569 | 30 | 2 classes (0, 1) | Classification | Medical diagnosis |
| load_digits | 1797 | 64 | 10 classes (0-9) | Classification | Handwritten digit recognition |
| load_boston (removed in scikit-learn 1.2) | 506 | 13 | Continuous price | Regression | House price prediction |
| load_wine | 178 | 13 | 3 classes | Classification | Wine type classification |

💻 Sample Code: Loading a Built-in Dataset

from sklearn import datasets
import pandas as pd

# Load iris dataset
iris = datasets.load_iris()

# .data gives you X (features)
X = iris.data
print("Features Shape:", X.shape)    # Output: (150, 4)

# .target gives you y (labels)
y = iris.target
print("Target Shape:", y.shape)       # Output: (150,)
print("Classes:", iris.target_names)  # Output: ['setosa' 'versicolor' 'virginica']

# View the features as a labelled table with Pandas
df = pd.DataFrame(iris.data, columns=iris.feature_names)
print(df.head())                      # First 5 rows with column names

# Load breast cancer dataset
cancer = datasets.load_breast_cancer()
print("Cancer Features:", cancer.data.shape)  # (569, 30)
✂️ Train-Test Split Process - train_test_split()

Here is the most important concept! Imagine you are a teacher. You have 100 exam questions. You use 80 questions to teach the student, and keep 20 questions hidden to test the student later. That is exactly what Train-Test Split does!

🎓 Train-Test Split Rule: Split your dataset into two parts - one part for the model to learn (training set) and one part to test if the model actually learned correctly (test set).

๐Ÿ“ Function: train_test_split()

  • 📦 Comes from: from sklearn.model_selection import train_test_split
  • 🔀 It randomly shuffles and splits your data into training and testing portions
  • 🚫 The test set is never shown to the model during training - it is kept completely hidden
  • ✅ After training, we use the test set to check whether the model performs well on new data

โš™๏ธ Parameters of train_test_split()

| Parameter | What It Does | Common Value | Example |
|---|---|---|---|
| X | Your input feature matrix | Required | X (all feature columns) |
| y | Your target label array | Required | y (target column) |
| test_size | Fraction of the data used for testing | 0.2 (= 20%) | test_size=0.2 sends 20% to the test set |
| train_size | Fraction of the data used for training | 0.8 (= 80%) | Inferred automatically if test_size is given |
| random_state | Seed for reproducibility | Any integer (e.g., 42) | random_state=42 gives the same split every run |
| shuffle | Shuffle the data before splitting | True (default) | shuffle=True is recommended |
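A tiny sketch showing these parameters in action on a made-up 10-row dataset (the numbers are arbitrary; only the shapes matter):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up dataset: 10 rows, 2 feature columns
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# test_size=0.3 -> 3 of the 10 rows go to the test set, 7 stay for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, shuffle=True
)
print(len(X_train), len(X_test))  # 7 3
```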

📤 Output Variables: What Does the Function Return?

The function returns 4 values in this exact order: X_train, X_test, y_train, y_test
| Output Variable | What It Contains | Size (with test_size=0.2) | Used For |
|---|---|---|---|
| X_train | Training features (input) | 80% of the data | The model learns from this |
| X_test | Testing features (input) | 20% of the data | The model makes predictions on this |
| y_train | Training labels (correct answers) | 80% of the data | The model uses these to learn |
| y_test | Testing labels (correct answers) | 20% of the data | We compare predictions against these |

💡 What is random_state? - Very Important!

  • 🎲 Without random_state: every time you run the code, the data is split differently, so results change from run to run
  • 🔒 With random_state=42: the split is always the same, so you can reproduce and compare results
  • 📌 Any integer works (42 is just a popular convention); different integers give different, but equally reproducible, splits
  • 🔬 For scientific/academic work, always set random_state so experiments are reproducible
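The reproducibility claim is easy to verify yourself - calling the function twice with the same seed returns identical splits (a small sketch on made-up data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(12).reshape(6, 2)   # made-up 6-row dataset
y = np.array([0, 1, 0, 1, 0, 1])

# Same random_state -> exactly the same split on every run
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.5, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.5, random_state=42)
print(np.array_equal(a_train, b_train))  # True
print(np.array_equal(a_test, b_test))    # True
```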

💻 Complete Code Example

from sklearn import datasets
from sklearn.model_selection import train_test_split

# Step 1: Load dataset
iris = datasets.load_iris()
X = iris.data    # Features (150 rows × 4 columns)
y = iris.target  # Labels (150 values: 0, 1, or 2)

# Step 2: Split into Train (80%) and Test (20%)
X_train, X_test, y_train, y_test = train_test_split(
    X,              # Features
    y,              # Labels
    test_size=0.2,  # 20% goes to test set
    random_state=42 # Ensures same split every time
)

# Step 3: Check sizes
print("X_train shape:", X_train.shape)  # (120, 4) - 80% of 150
print("X_test shape:",  X_test.shape)   # (30, 4)  - 20% of 150
print("y_train shape:", y_train.shape)  # (120,)
print("y_test shape:",  y_test.shape)   # (30,)
⚙️ Workflow - Training, Testing, Evaluation & Goal

Now let us understand the complete step-by-step workflow of how machine learning actually works after you split your data.

Step 1 - Training: Using 80% of the Data

🎓 Training = Learning Phase. The model is shown X_train (features) and y_train (correct answers) together. It learns patterns, rules, and relationships by seeing many examples.
  • ✅ Model uses: X_train + y_train together
  • 📚 Like a student reading a textbook with answers - the model sees both questions and answers
  • 💻 Code: model.fit(X_train, y_train)
  • 📈 The model adjusts its internal settings (called "parameters") to reduce mistakes

Step 2 - Testing: Using 20% of the Data

๐Ÿ“ Testing = Exam Phase. The model is given only X_test (features, NO answers). It must predict y_pred on its own. This is like the real exam!
  • ๐Ÿšซ The model has NEVER seen this data before โ€” this ensures a fair test
  • ๐Ÿ’ป Code: y_pred = model.predict(X_test)
  • ๐Ÿ“Œ y_test (correct answers) are kept hidden from the model until evaluation
  • โš ๏ธ Using training data for testing would be like giving the student the same questions โ€” it would cheat!

Step 3 - Evaluation: Compare Predictions with y_test

📊 Evaluation = Checking the Answer Sheet. We compare the model's predictions (y_pred) against the actual correct answers (y_test) to measure how accurate the model is.
  • 📏 Metric for Classification: Accuracy Score = correct predictions ÷ total predictions × 100%
  • 📏 Metrics for Regression: Mean Squared Error (MSE), R² Score
  • 💻 Code: from sklearn.metrics import accuracy_score
  • 💻 Code: accuracy_score(y_test, y_pred)
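The accuracy formula above can be checked by hand on a tiny made-up answer sheet - computing it manually gives the same number as accuracy_score:

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_test = np.array([0, 1, 1, 0, 1])   # actual answers (made up)
y_pred = np.array([0, 1, 0, 0, 1])   # model's predictions (made up)

# accuracy = correct predictions / total predictions
manual = np.mean(y_test == y_pred)
print(manual)                          # 0.8 -> 4 out of 5 correct
print(accuracy_score(y_test, y_pred))  # 0.8 -- same result
```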

Step 4 - Goal: Check Model Accuracy

🎯 The Ultimate Goal is to find how well the model can predict on NEW, unseen data - this is called generalization ability. A model that scores 95% on test data is much more useful than one that only memorized the training data.
| Phase | Data Used | What Happens | Code |
|---|---|---|---|
| 🎓 Training | X_train + y_train (80%) | Model learns from examples | model.fit(X_train, y_train) |
| 📝 Testing | X_test only (20%) | Model makes predictions | y_pred = model.predict(X_test) |
| 📊 Evaluation | y_pred vs y_test | We measure model accuracy | accuracy_score(y_test, y_pred) |
| 🎯 Goal | Final metric | Check generalization ability | Print the final score |
🛠️ Implementation Tools - Scikit-Learn, Pandas & df.drop()

When working with Scikit-Learn in real projects, you typically use three main tools together. Think of them as a team where each member does a specific job.

🔬 Tool 1: Scikit-Learn - The ML Engine

Scikit-Learn is the main ML library. It provides all the machine learning algorithms (like Decision Tree, Random Forest, SVM) and utility functions (like train_test_split, accuracy_score).
  • 📦 Install: pip install scikit-learn
  • 🤖 Provides: 40+ ML algorithms ready to use
  • 🔧 Key functions: train_test_split(), fit(), predict(), accuracy_score()
  • 📚 Import example: from sklearn.model_selection import train_test_split

๐Ÿผ Tool 2: Pandas โ€” Data Loading & Cleaning

Pandas is the data handling library. You use it to load data from CSV/Excel files, clean messy data, handle missing values, and prepare the dataset before passing it to Scikit-Learn.
  • 📦 Install: pip install pandas
  • 📂 Load data: df = pd.read_csv('data.csv')
  • 🔍 Explore data: df.head(), df.info(), df.describe()
  • 🧹 Clean data: df.dropna(), df.fillna(), df.rename()

โœ‚๏ธ Tool 3: df.drop() โ€” Removing Target from X

df.drop() is a Pandas method used to remove a row or column from a DataFrame. It is critical here because we need to separate X (features) from y (target) before training.
  • 🎯 Purpose: Remove the target column from the feature set
  • 💻 Usage: X = df.drop('Price', axis=1) - removes the 'Price' column
  • 📌 axis=1 means column; axis=0 means row
  • 📝 Then: y = df['Price'] - get the target column separately

💻 Complete Real-World Code (Using All Three Tools)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# ---- Step 1: Load data using Pandas ----
# ---- Step 1: Load data using Pandas ----
df = pd.read_csv('car_data.csv')
print(df.head())       # See the first 5 rows
df.info()              # Check column types (info() prints its own report)

# ---- Step 2: Separate X and y using df.drop() ----
# (Text/category columns in X must be encoded as numbers before training)
X = df.drop('Sold', axis=1)  # Features: everything except the 'Sold' column
y = df['Sold']               # Target: just the 'Sold' column

# ---- Step 3: Split using train_test_split ----
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ---- Step 4: Train model using Scikit-Learn ----
model = DecisionTreeClassifier()
model.fit(X_train, y_train)  # Train on 80%

# ---- Step 5: Predict and Evaluate ----
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
      
| Tool | Main Job | Key Functions | When to Use |
|---|---|---|---|
| 🔬 Scikit-Learn | ML algorithms & utilities | train_test_split, fit, predict, accuracy_score | Building and evaluating models |
| 🐼 Pandas | Data loading & cleaning | read_csv, head, dropna, fillna | Loading and preparing raw data |
| ✂️ df.drop() | Separating X from y | df.drop('col', axis=1) | Before splitting the data |
🔄 Flowchart - Complete ML Workflow with Scikit-Learn
🚀 START: Machine Learning Project
↓
📂 Step 1: Import Libraries
pandas, sklearn, numpy
↓
📊 Step 2: Load Dataset
pd.read_csv() OR sklearn.datasets.load_iris()
↓
🔍 Step 3: Explore Data
df.head() | df.info() | df.describe()
↓
🧹 Step 4: Clean & Prepare Data
Handle missing values | Encode categories
↓
✂️ Step 5: Separate X and y
X = df.drop('target', axis=1) | y = df['target']
↓
🔀 Step 6: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
↓
🎓 X_train (80% Features) | 📝 y_train (80% Labels) | 🔍 X_test (20% Features) | 🎯 y_test (20% Labels)
↓
🤖 Step 7: Train Model
model.fit(X_train, y_train)
↓
🧪 Step 8: Make Predictions
y_pred = model.predict(X_test)
↓
📊 Step 9: Evaluate Model
accuracy_score(y_test, y_pred)
↓
✅ Accuracy > 90%: Model is GOOD → Deploy
❌ Accuracy < 70%: Tune the model → Go back to Step 7
↓
🎉 END: Model Ready for Real-World Use
🧠 Mind Map - Scikit-Learn: Datasets & Train-Test Split
Scikit-Learn: Datasets & Train-Test Split
  • 📦 Dataset Components → Input Features X (independent variables) | Output Variable y (target) | Classification: 0/1 | Regression: continuous | Example: name, year, km
  • 🗄️ Built-in Datasets → sklearn.datasets | load_iris (150 rows) | load_breast_cancer (569 rows) | load_digits | .data → features | .target → labels
  • ✂️ Train-Test Split → train_test_split() | test_size=0.2 (20%) | random_state (seed) | X_train (80%) | X_test (20%) | y_train (80%) | y_test (20%)
  • ⚙️ Workflow → Training: 80% of data, model.fit(X_train, y_train) | Testing: 20% of data, model.predict(X_test) | Evaluation: y_pred vs y_test | Goal: check accuracy
  • 🛠️ Implementation Tools → Scikit-Learn (ML) | Pandas (data loading) | df.drop() (remove target) | NumPy (arrays) | accuracy_score()
🗺️ Roadmap - Complete Learning Path for Scikit-Learn & ML
① Python Basics
  • Variables
  • Lists & Loops
  • Functions
  • pip install
→
② NumPy & Pandas
  • Arrays
  • DataFrames
  • read_csv()
  • df.drop()
→
③ Datasets & EDA
  • load_iris()
  • load_breast_cancer()
  • Explore Data
  • Visualize
→
④ Train-Test Split
  • Separate X & y
  • train_test_split()
  • test_size=0.2
  • random_state
→
⑤ Train a Model
  • Decision Tree
  • Random Forest
  • model.fit()
  • SVM, KNN
→
⑥ Evaluate & Tune
  • accuracy_score()
  • Confusion Matrix
  • Cross-Validation
  • GridSearchCV
→
⑦ Deploy Model
  • Save with pickle
  • Flask / FastAPI
  • Predict New Data
  • Monitor Results
| Stage | Topic | Key Libraries | Time (Approx.) |
|---|---|---|---|
| ① Foundation | Python Basics | Python, pip | 1-2 weeks |
| ② Data Tools | NumPy & Pandas | numpy, pandas | 1-2 weeks |
| ③ Datasets | Loading & EDA | sklearn.datasets, matplotlib | 1 week |
| ④ Splitting | Train-Test Split | sklearn.model_selection | 2-3 days |
| ⑤ Modeling | Training ML Models | sklearn classifiers/regressors | 2-3 weeks |
| ⑥ Evaluation | Metrics & Tuning | sklearn.metrics, GridSearchCV | 1-2 weeks |
| ⑦ Deployment | Real-World Usage | pickle, Flask, FastAPI | 2-4 weeks |
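Stage ⑦ mentions saving a model with pickle. A minimal sketch of that idea, using the iris data from earlier (here the model is serialized to bytes; in a real project you would write those bytes to a .pkl file with pickle.dump):

```python
import pickle
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Train a small model on the iris data
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Serialize the trained model, then restore it
blob = pickle.dumps(model)
loaded = pickle.loads(blob)

# The restored model predicts exactly like the original
same = bool((loaded.predict(X_test) == model.predict(X_test)).all())
print(same)  # True
```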
โš ๏ธ Educational Disclaimer:
This resource is for educational purposes only and does not constitute professional technical or legal advice. All code examples are simplified for learning. Always refer to official Scikit-Learn documentation at scikit-learn.org for production use. Content prepared by Digital E-Filing Coach โ€” Amanuddin Education.

