Data Science: 2. Data Preprocessing using scikit-learn

  1. Encoding: Encoding means converting information from one format to another. It is needed whenever a data set contains categorical values. Most of the time, categorical values are in label form (e.g. yes, no, true, false), and the computer cannot use them directly as features because it works with numbers. So we have to assign a unique number to each category, and that process is called 'Encoding'. Three common encoding techniques are Label Encoding, Ordinal Encoding, and One Hot Encoding.
  2. Normalization: Normalization is used to scale the data of an attribute so that it falls in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally useful for classification algorithms. When there are multiple attributes with values on different scales, this may lead to poor data models while performing data mining operations, so the attributes are normalized to bring them all onto the same scale. There are three methods of data normalization: 1. Decimal Scaling, 2. Min-Max Normalization, and 3. z-Score Normalization (zero-mean normalization).
  3. Standardization: Standardization is a type of normalization (the z-score normalization mentioned above) that ensures the transformed data have mean = 0 and standard deviation = 1. Each value x is replaced by (x - mean) / standard deviation. It is a useful technique for transforming attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.
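  4. Imputing Missing values: Your data set may contain some null values in features that are still important, and these nulls can alter your calculations. The simplest option is to drop them: pandas provides the built-in method dropna(), which drops every row that has a null value in any of its columns. To fill in (impute) the missing values instead, scikit-learn provides SimpleImputer, which replaces each null with a statistic such as the column mean, median, or most frequent value.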
  5. Discretization: Discretization is the process of putting values into buckets so that there are a limited number of possible states. The buckets themselves are treated as ordered and discrete values. You can discretize both numeric and string columns.
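
For the encoding step (item 1), here is a minimal sketch of the three techniques using scikit-learn. The toy arrays `answers` and `sizes` are made up for illustration, and the category order small < medium < large is an assumption.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder

# Hypothetical toy data, not taken from any real data set
answers = np.array(["yes", "no", "yes", "no"])
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])

# Label encoding: classes are sorted alphabetically, so "no" -> 0 and "yes" -> 1
label_encoder = LabelEncoder()
answer_codes = label_encoder.fit_transform(answers)

# Ordinal encoding: we state the order ourselves (assumed: small < medium < large)
ordinal_encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = ordinal_encoder.fit_transform(sizes)

# One hot encoding: one 0/1 column per category; columns follow the
# alphabetical order large, medium, small
onehot_encoder = OneHotEncoder()
size_onehot = onehot_encoder.fit_transform(sizes).toarray()

print(answer_codes)        # [1 0 1 0]
print(size_codes.ravel())  # [0. 2. 1. 0.]
print(size_onehot)
```

Note that scikit-learn intends LabelEncoder for target labels, while OrdinalEncoder and OneHotEncoder are meant for feature columns.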
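
For normalization (item 2), the sketch below shows min-max normalization with scikit-learn's MinMaxScaler and decimal scaling written by hand with NumPy, since scikit-learn has no dedicated decimal-scaling class as far as I know. The `data` array is a made-up example, and the decimal-scaling code assumes every column has a non-zero maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two attributes on very different scales
data = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [3.0, 600.0]])

# Min-max normalization: x' = (x - min) / (max - min), computed per column,
# so every column ends up in the range 0.0 to 1.0
min_max_scaled = MinMaxScaler().fit_transform(data)

# Decimal scaling: x' = x / 10^j, where j is the smallest integer that
# makes the largest absolute value in the column less than 1
max_abs = np.abs(data).max(axis=0)
j = np.floor(np.log10(max_abs)) + 1
decimal_scaled = data / (10.0 ** j)

print(min_max_scaled)
print(decimal_scaled)
```

After either transformation both columns live on a comparable scale, so neither attribute dominates just because its raw numbers are larger.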
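
For standardization (item 3), a minimal sketch with scikit-learn's StandardScaler follows. The `data` array is again a made-up example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two attributes with different means and spreads
data = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [3.0, 600.0]])

# StandardScaler applies z = (x - mean) / std to each column
scaler = StandardScaler()
standardized = scaler.fit_transform(data)

print(scaler.mean_)               # per-column means learned from the data
print(standardized.mean(axis=0))  # approximately 0 for every column
print(standardized.std(axis=0))   # approximately 1 for every column
```

In practice the scaler is usually fitted on the training data only and then reused with `scaler.transform(...)` on new data, so that both are scaled with the same mean and standard deviation.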
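
For missing values (item 4), the sketch below contrasts dropping incomplete rows with filling them in. The row dropping is done with a plain NumPy mask, which has the same effect that dropna() has on a pandas DataFrame, and the filling uses scikit-learn's SimpleImputer with the column mean. The `data` array is a made-up example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with two missing entries (np.nan)
data = np.array([[1.0, 2.0],
                 [np.nan, 4.0],
                 [3.0, np.nan]])

# Option 1: drop every row that contains at least one missing value
complete_rows = data[~np.isnan(data).any(axis=1)]

# Option 2: impute, i.e. replace each missing value with its column mean
imputer = SimpleImputer(strategy="mean")
imputed = imputer.fit_transform(data)

print(complete_rows)  # only the first row survives
print(imputed)        # nan in column 0 becomes 2.0, nan in column 1 becomes 3.0
```

Dropping is simple but here it throws away two of the three rows, which is why imputing is often preferred when data is scarce.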
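
For discretization (item 5), here is a minimal sketch for a numeric column using scikit-learn's KBinsDiscretizer. The `ages` values, the choice of three buckets, and the equal-width strategy are all assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical numeric column
ages = np.array([[5.0], [15.0], [25.0], [35.0], [45.0], [55.0]])

# Three equal-width buckets between the minimum and maximum value;
# encode="ordinal" returns the bucket index (0, 1 or 2) for each value
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
age_buckets = discretizer.fit_transform(ages)

print(discretizer.bin_edges_[0])  # the bucket boundaries learned from the data
print(age_buckets.ravel())        # [0. 0. 1. 1. 2. 2.]
```

The bucket indices are ordered, which matches the idea that buckets are treated as ordered, discrete values.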




IT Engineering at Charusat University

Thakkar Krunal Balkrishna
