Data Pre-processing with Data Reduction Techniques in Python

AMI SAVALIYA
5 min read · Oct 28, 2021

Datasets nowadays are very detailed; including more features makes the model more complex, and the model may overfit the data. Some features can be mere noise and potentially harm the model. By removing those unimportant features, the model may generalize better.

We will apply several feature selection methods to the same dataset and compare their performance, following the scikit-learn documentation.

Dataset Used

The dataset used for carrying out data reduction is the ‘Iris’ dataset, available in the sklearn.datasets library.

The data has four features. To test the effectiveness of the different feature selection methods, we add some noise features to the dataset.

The dataset now has 8 features: 4 are informative and the other 4 are noise.
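Here is a minimal sketch of that setup (the Gaussian noise, the random seed, and variable names such as X_noisy are assumptions for illustration, not taken from the original code):

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the Iris data: 150 samples, 4 informative features.
X, y = load_iris(return_X_y=True)

# Append 4 random (noise) features so the selectors have something
# irrelevant to discard. Gaussian noise and the seed are assumptions.
rng = np.random.RandomState(42)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 4))])
print(X_noisy.shape)  # (150, 8)
```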

We only select features based on information from the training set, not the whole dataset. We should hold out part of the dataset as a test set to evaluate both the feature selection and the model performance, so that no information from the test set is seen while we conduct feature selection and train the model.
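A minimal split along those lines (the 70/30 ratio and random_state are assumptions):

```python
from sklearn.model_selection import train_test_split

# Hold out a test set; feature selection is fitted on the training part only.
X_train, X_test, y_train, y_test = train_test_split(
    X_noisy, y, test_size=0.3, stratify=y, random_state=0
)
```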

Principal Component Analysis (PCA)

We can speed up the fitting of a machine learning algorithm by changing the optimization algorithm. An even more common way of speeding up a machine learning algorithm is to use Principal Component Analysis (PCA). It is a technique for reducing the dimensionality of large datasets, increasing interpretability while at the same time minimizing information loss.

For a lot of machine learning applications, it helps to be able to visualize your data. Visualizing two- or three-dimensional data is not that challenging, but the Iris dataset used here is four-dimensional. We will use PCA to reduce that four-dimensional data to 2 or 3 dimensions so that you can plot it and hopefully understand the data better.

So, now let's execute PCA for visualization on the Iris dataset.

PCA Projection to 2D

The original data has four columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects the original data, which is four-dimensional, into two dimensions. The new components are just the two main dimensions of variation.

We then concatenate the principal-component DataFrame with the target column along axis = 1; finalDf is the final DataFrame before plotting the data.
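A minimal sketch of that projection, assuming the usual scikit-learn/pandas workflow (standardizing first and the exact column names are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# PCA works on the original 4 Iris features (X, y as loaded above).
# Standardize the features first, then project onto 2 components.
X_scaled = StandardScaler().fit_transform(X)
principal_components = PCA(n_components=2).fit_transform(X_scaled)

principalDf = pd.DataFrame(
    principal_components,
    columns=["principal component 1", "principal component 2"],
)
finalDf = pd.concat([principalDf, pd.Series(y, name="target")], axis=1)
print(finalDf.head())
```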

Now, let's visualize the DataFrame by executing the following code:
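For example, a 2D scatter plot with one colour per class (a matplotlib sketch; the colours and figure size are arbitrary choices):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 8))
ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.set_title("2 component PCA")

# One colour per Iris class: 0 = setosa, 1 = versicolor, 2 = virginica.
for target, colour in zip([0, 1, 2], ["r", "g", "b"]):
    subset = finalDf[finalDf["target"] == target]
    ax.scatter(
        subset["principal component 1"],
        subset["principal component 2"],
        c=colour, s=50,
    )
ax.legend(["setosa", "versicolor", "virginica"])
ax.grid()
plt.show()
```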

PCA Projection to 3D

The original data has 4 columns (sepal length, sepal width, petal length, and petal width). This section projects the original data, which is four-dimensional, into 3 dimensions. The new components are just the three main dimensions of variation.
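Continuing from the 2D sketch above (reusing X_scaled and the PCA import), we simply keep three components instead of two:

```python
# Same scaled data, but keep the first three principal components.
pca3 = PCA(n_components=3)
components_3d = pca3.fit_transform(X_scaled)
print(pca3.explained_variance_ratio_)  # share of variance kept per component
```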

Now let's visualize it in a 3D graph:
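A possible 3D scatter of those components (again a sketch; the styling choices are arbitrary):

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (needed on older Matplotlib)

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(projection="3d")
ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.set_zlabel("Principal Component 3")
ax.set_title("3 component PCA")

# One colour per Iris class, same convention as the 2D plot.
for target, colour in zip([0, 1, 2], ["r", "g", "b"]):
    mask = y == target
    ax.scatter(
        components_3d[mask, 0],
        components_3d[mask, 1],
        components_3d[mask, 2],
        c=colour, s=50,
    )
ax.legend(["setosa", "versicolor", "virginica"])
plt.show()
```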

Variance Threshold

Variance Threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn't meet some threshold. By default, it removes all zero-variance features. Our dataset has no zero-variance features, so our data isn't affected here.
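A quick sketch of how this looks with scikit-learn's VarianceThreshold on our training data (the explicit threshold=0.0 just spells out the default):

```python
from sklearn.feature_selection import VarianceThreshold

# Drop features whose variance is below the threshold
# (threshold=0.0, the default, only removes constant columns).
selector = VarianceThreshold(threshold=0.0)
X_train_vt = selector.fit_transform(X_train)
print(selector.variances_)  # variance of each training feature
print(X_train_vt.shape)     # no columns removed for this data
```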

Univariate Feature Selection

Univariate feature selection works by selecting the best features based on univariate statistical tests.

We compare each feature to the target variable to see whether there is a statistically significant relationship between them.

When we analyze the relationship between one feature and the target variable, we ignore the other features. That is why it is called ‘univariate’.

Each feature gets its own test score.

Finally, all the test scores are compared, and the features with the top scores are selected.

1. f_classif

Also known as the ANOVA F-test, it computes the ANOVA F-value between each feature and the target and is the default scoring function of SelectKBest for classification tasks.

2. chi2

This score can be used to select the features with the highest values of the chi-squared test statistic, computed between each feature and the classes. The data must contain only non-negative features, such as booleans or frequencies (e.g., term counts in document classification).

3. mutual_info_classif

Mutual information estimates the dependency between each feature and the target. It comes in two variants:

mutual_info_classif for classification

mutual_info_regression for regression

A sketch comparing all three scoring functions follows this list.
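Here is a sketch that runs all three scoring functions through SelectKBest on the training data, keeping the 4 best features each time (the shift applied before chi2 is an assumption to keep the noisy columns non-negative):

```python
from sklearn.feature_selection import SelectKBest, f_classif, chi2, mutual_info_classif

# chi2 requires non-negative inputs, so shift the noisy training data.
X_train_pos = X_train - X_train.min(axis=0)

for score_func in (f_classif, chi2, mutual_info_classif):
    X_in = X_train_pos if score_func is chi2 else X_train
    selector = SelectKBest(score_func=score_func, k=4).fit(X_in, y_train)
    # Indices of the 4 features each test considers most informative.
    print(score_func.__name__, selector.get_support(indices=True))
```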

Recursive Feature Elimination

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features, and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of selected features is eventually reached.
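A minimal RFE sketch using a logistic-regression estimator (the choice of estimator and n_features_to_select=4 are assumptions):

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursively drop the weakest feature until 4 remain, using the
# logistic-regression coefficients as the importance signal.
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=estimator, n_features_to_select=4, step=1)
rfe.fit(X_train, y_train)
print(rfe.support_)  # boolean mask of kept features
print(rfe.ranking_)  # 1 = selected; higher ranks were eliminated earlier
```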

In summary, we have seen how to apply different feature selection methods to the same data and evaluate their performance.
