Data science | Data preprocessing using scikit-learn
This blog walks through the steps for preprocessing data with the scikit-learn library. Data preprocessing is an important step in the data mining process.
It transforms raw data into a useful and efficient format, and it usually involves the following steps:
- Data Cleaning
It is the process of removing or correcting incorrect, inaccurate, and incomplete data in the dataset, and of filling in missing values. It is the most important step, as it ensures that the data is ready for downstream needs.
- Data Integration
The process of combining data from multiple sources into a single dataset.
- Data Reduction
This technique reduces the amount of data, which makes analysis easier while still producing the same or nearly the same result. It also needs less storage space.
- Data Transformation
A change made to the structure or format of the data is called data transformation. This step can be simple or complex, depending on the requirements.
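To make the four steps concrete, here is a minimal sketch that runs each of them once on a tiny made-up dataset. The tables, the column names (`id`, `height_cm`, `score`, `notes`), and the choice of `StandardScaler` are assumptions for illustration only; a real project would pick a cleaning, reduction, and transformation method that suits its data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two hypothetical sources that share an "id" column.
students = pd.DataFrame({"id": [1, 2, 3, 4],
                         "height_cm": [150.0, 160.0, None, 170.0]})
scores = pd.DataFrame({"id": [1, 2, 3, 4],
                       "score": [55, 70, 65, 90],
                       "notes": ["", "", "", ""]})

# Data integration: combine both sources into a single dataset.
merged = students.merge(scores, on="id")

# Data cleaning: here, simply drop rows that contain missing values.
cleaned = merged.dropna()

# Data reduction (simplest form): keep only the columns needed for analysis.
reduced = cleaned[["height_cm", "score"]]

# Data transformation: rescale each column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(reduced)

print(merged.shape, cleaned.shape, scaled.shape)
```

Dropping columns is only one way to reduce data; sampling rows or dimensionality reduction are common alternatives.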
Implementation
Importing data and displaying information about the dataset.
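A minimal sketch of this step with pandas might look like the following. The CSV content and column names are hypothetical stand-ins for the real data file, which is not shown here; with an actual file you would pass its path to `pd.read_csv` instead of the in-memory text.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real data file.
csv_text = """name,gender,height_cm,city
Asha,F,158.0,Pune
Ben,M,,Delhi
Chen,M,172.5,Mumbai
Dia,F,165.0,
"""

# With a real file this would be pd.read_csv("path/to/file.csv").
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())          # first rows of the dataset
df.info()                 # column names, dtypes, and non-null counts
print(df.describe())      # summary statistics for numeric columns
print(df.isnull().sum())  # number of missing values per column
```

`info()` and `isnull().sum()` are a quick way to spot which columns are categorical and which contain missing values before encoding or cleaning them.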
Data Encoding
Encoding is a way to convert categorical input data into numeric form.
First, I used label encoding, which assigns each category an integer based on the alphabetical (sorted) order of the category values.
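As a small illustration of label encoding, the sketch below uses scikit-learn's `LabelEncoder` on a made-up list of colours (the values are an assumption, not the blog's dataset).

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical values.
colors = ["red", "green", "blue", "green"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# Categories are stored in sorted order; each gets its position as its code.
print(list(encoder.classes_))  # ['blue', 'green', 'red']
print(codes.tolist())          # [2, 1, 0, 1]
```

Note that scikit-learn documents `LabelEncoder` as intended for target labels; for input feature columns, `OrdinalEncoder` offers similar behaviour. The integer codes also imply an order (blue < green < red) that may not be meaningful for the data.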
One-hot encoding serves the same purpose in a different way. The label encoder assigns a single number to each category, whereas the one-hot encoder creates a separate binary (0/1) column for each category. So if a column has 3 categories, the one-hot encoder will add 3 new columns to your dataset, one per category.
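The sketch below shows one way to apply `OneHotEncoder` to a single column and put the result back into a DataFrame. The `size` and `color` columns are hypothetical, and replacing the original column with the new ones is a design choice made here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical dataset with one categorical column.
df = pd.DataFrame({"size": [1, 2, 3, 2],
                   "color": ["red", "green", "blue", "green"]})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it to a dense array.
onehot = encoder.fit_transform(df[["color"]]).toarray()

# Build one column name per category found in "color" (sorted order).
columns = [f"color_{c}" for c in encoder.categories_[0]]
encoded = pd.DataFrame(onehot, columns=columns, index=df.index)

# Replace the original column with the three binary columns.
result = pd.concat([df.drop(columns="color"), encoded], axis=1)
print(result)
```

Each row has exactly one `1.0` among the `color_*` columns, so no artificial ordering is introduced between categories.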
Removing null values
Missing data are values that are not recorded in a dataset. They can be a single value missing from a single cell or an entire observation (row) that is missing. Missing data can occur in both continuous variables (e.g. height of students) and categorical variables (e.g. gender of a population).
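Two common ways to handle missing values are sketched below: dropping the affected rows, or filling the gaps. The small DataFrame is made up, and the choice of the mean for the continuous column and the most frequent value for the categorical column is an assumption; other strategies may suit a given dataset better.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing continuous and one missing categorical value.
df = pd.DataFrame({
    "height_cm": [150.0, np.nan, 170.0, 160.0],
    "gender": ["F", "M", np.nan, "F"],
})

print(df.isnull().sum())  # count missing values per column

# Option 1: remove every row that contains a missing value.
dropped = df.dropna()

# Option 2: fill the gaps instead of removing rows.
filled = df.copy()
# Continuous column: replace missing values with the column mean.
imputer = SimpleImputer(strategy="mean")
filled["height_cm"] = imputer.fit_transform(df[["height_cm"]]).ravel()
# Categorical column: replace missing values with the most frequent category.
filled["gender"] = df["gender"].fillna(df["gender"].mode()[0])

print(dropped)
print(filled)
```

Dropping rows is simplest but discards data; filling keeps every row at the cost of introducing estimated values.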
GitHub link: https://github.com/amisavaliya/DSpracticals.git