T3
Data Quality
Cleaning and manipulating data is often considered:
- The most time-consuming task
- The least enjoyable task (by some!)
Problems:
- Missing Values: Information that is not available because it wasn't collected or because it is sensitive; features that are not applicable in all cases
- Duplicated Records: Same (or similar) data collected from different sources
- Noise: Modifications to the original records (data that is corrupted or distorted) due to technological limitations, sensor error or even human error
- Outliers: A data point that differs significantly from other observations
Data Exploration
What to do?
- Understand the data and its characteristics
- Evaluate its quality
- Find patterns and relevant information
How?
- Central Tendency: average, mode, median…
- Statistical dispersion: variance, standard deviation, interquartile range…
- Probability distribution: Gaussian, Uniform, Exponential…
- Correlation/Dependence: between pairs of features, with the dependent feature…
- Data visualization: tables, charts, boxplots, scatter plots, histograms, …
- Do the values of one categorical variable depend on the values of another categorical variable?
- The chi-square test of independence (also known as the chi-square test of association) answers this question; see the sketch after this list.
- Do we want to keep highly-correlated features?
- Both positively and negatively correlated ones?
- What about the correlation between the dependent and the independent features?
- Independent variables (X): input features/input vector
- Dependent variable (y): target/class/label
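A minimal sketch of these exploration steps with pandas and SciPy; the DataFrame and its column names (`city`, `churn`, `age`, `income`) are made up for illustration:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical dataset: two categorical and two numeric features
df = pd.DataFrame({
    "city": ["Porto", "Lisbon", "Porto", "Braga", "Lisbon", "Porto"],
    "churn": ["yes", "no", "yes", "no", "no", "yes"],
    "age": [23, 45, 31, 52, 40, 29],
    "income": [1200, 2500, 1700, 3000, 2300, 1500],
})

# Central tendency and dispersion in one call
print(df[["age", "income"]].describe())

# Chi-square test of association: does `churn` depend on `city`?
contingency = pd.crosstab(df["city"], df["churn"])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2={chi2:.2f}, p={p_value:.3f}")  # small p -> likely dependent

# Pairwise correlation between numeric features (Pearson by default)
print(df[["age", "income"]].corr())
```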
Data Visualization
Graphs!!!
- Histplots
- Boxplots
- Trendlines
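A minimal sketch of these three plots with seaborn and matplotlib, on synthetic data (the feature names are made up):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(42)
df = pd.DataFrame({"age": rng.normal(35, 10, 200)})
df["income"] = 50 * df["age"] + rng.normal(0, 300, 200)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
sns.histplot(df["age"], ax=axes[0])                    # distribution of one feature
sns.boxplot(y=df["income"], ax=axes[1])                # median, IQR, outliers
sns.regplot(x="age", y="income", data=df, ax=axes[2])  # scatter plot + trendline
plt.tight_layout()
plt.show()
```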
Data Preparation
Basics

A set of basic data preparation techniques can be used:
- Union/intersection of columns;
- Concatenation;
- Sorters;
- Filters (column, row, nominal, rule-based, …);
- Basic aggregations (counts, unique, mean/sum, …)
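A minimal sketch of these basics with pandas (the `sales`/`extra` tables are made up):

```python
import pandas as pd

sales = pd.DataFrame({"store": ["A", "A", "B", "B"], "units": [10, 15, 7, 12]})
extra = pd.DataFrame({"store": ["C"], "units": [5]})

print(sales.columns.intersection(extra.columns))  # intersection of columns
combined = pd.concat([sales, extra])              # concatenation of rows
combined = combined.sort_values("units")          # sorting
filtered = combined[combined["units"] > 6]        # rule-based row filter
print(filtered["store"].nunique())                # unique count
print(combined.groupby("store")["units"].sum())   # basic aggregation
```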
Advanced
How?
- Feature scaling
- Outlier detection
- Feature selection
- Missing Values treatment
- Nominal value discretization
- Binning
- Feature Engineering
Data Preparation: Feature Scaling
Many classifiers use distance metrics (e.g., Euclidean distance) and, if one feature has a broad range of values, the distance will be governed by this particular feature. Hence, ranges should be normalized so that each feature contributes proportionately to the final distance.
Normalization
Rescaling data so that all values fall within the range of 0 and 1, for example.
Standardization / Z-score normalization
Rescaling the distribution of values so that the mean of observed values is 0 and the standard deviation is 1. Assumes observations fit a Gaussian distribution with a well-behaved mean and standard deviation, which may not always be the case.
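A minimal sketch of both techniques with scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 500.0]])

# Normalization: rescale each feature to the [0, 1] range
X_norm = MinMaxScaler().fit_transform(X)

# Standardization: each feature gets mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0), X_std.std(axis=0))  # ~0 and ~1

# In practice, fit the scaler on the training split only and
# reuse it to transform the test split (avoids data leakage).
```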
Data Preparation: Outlier Detection
- Statistical-based strategy: Z-Score, Box Plots, …
- Knowledge-based strategy: Based on domain knowledge. For example, exclude everyone with a monthly salary higher than 1M € …
- Model-based strategy: Using models such as one-class SVMs, isolation forests, clustering, …
To keep the dataset size, we may want to cap outliers instead of dropping them. However, capping can affect the distribution of the data!
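A minimal sketch of the statistical and model-based strategies, plus capping, on synthetic data:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
s = pd.Series(np.append(rng.normal(12, 1, 100), 300.0))  # 300 is the outlier

# Statistical: flag points more than 3 standard deviations from the mean
z = (s - s.mean()) / s.std()
print(s[z.abs() > 3])

# Capping: clip to the 1st/99th percentiles instead of dropping rows
capped = s.clip(lower=s.quantile(0.01), upper=s.quantile(0.99))

# Model-based: an isolation forest labels outliers with -1
labels = IsolationForest(random_state=0).fit_predict(s.to_frame())
print(s[labels == -1])
```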
Data Preparation: Feature Selection / Dimensionality Reduction
Which features should we use to create a predictive model? Select a subset of the most important features to reduce dimensionality.
- May significantly affect the performance of a model
- Reduces overfitting (less opportunity to make decisions based on noise)
- Improves accuracy
- Helps reduce the complexity of a model (reduces training time)
What can we remove?
- Redundant features (duplicate)
- Irrelevant and unneeded features (non-useful)
Process:
- Remove a feature if the percentage of missing values is higher than a threshold;
- Use the chi-square test to measure the degree of dependency between a feature and the target class;
- Remove a feature if its data are highly skewed;
- Remove a feature if it has a low standard deviation;
- Remove features that are highly correlated with each other.
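A minimal sketch of this filter-style process; the feature matrix and its column names are made up, and the correlation step is only outlined:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical non-negative features and a binary target
X = pd.DataFrame({"f1": [1, 2, 3, 4], "f2": [0, 0, 0, 1], "f3": [5, 1, 4, 2]})
y = [0, 0, 1, 1]

X = X.loc[:, X.isna().mean() <= 0.5]  # drop features with >50% missing values
X = X.loc[:, X.std() > 0.1]           # drop near-constant features

corr = X.corr().abs()  # inspect this matrix and drop one of each
                       # highly correlated pair of features

# Chi-square filter: keep the 2 features most dependent on the target
X_best = SelectKBest(chi2, k=2).fit_transform(X, y)
```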
Principal Component Analysis (PCA): a technique to reduce the dimension of the feature space. The goal is to reduce the number of features without losing too much information. A popular application of PCA is visualizing higher-dimensional data.
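A minimal PCA sketch with scikit-learn, projecting the 4-feature Iris dataset to 2 dimensions for plotting:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

pca = PCA(n_components=2)          # keep the 2 directions of highest variance
X_2d = pca.fit_transform(X)
print(X_2d.shape)                      # (150, 2)
print(pca.explained_variance_ratio_)   # share of variance kept per component
```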
Use an ML algorithm to select the most important features! Treat feature selection as a search problem: prepare different combinations of features, then evaluate and compare them. Measure the "usefulness" of features based on the classifier's performance.
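A minimal sketch of this wrapper-style search using scikit-learn's recursive feature elimination:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)  # 30 features

# Repeatedly fit the classifier and drop the weakest feature
# until only 5 remain
selector = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5)
selector.fit(X, y)
print(selector.support_)  # boolean mask over the 30 features
```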
Some algorithms already have built-in feature selection. Lasso, for example, has its own: if a feature's weight is zero, then it has no importance! Regularization constrains/regularizes or shrinks the coefficient estimates towards zero.
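A minimal Lasso sketch on scikit-learn's diabetes dataset; features whose coefficients are shrunk to exactly zero are effectively dropped:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# L1 regularization shrinks coefficients; some become exactly zero
lasso = Lasso(alpha=1.0).fit(X, y)
print(lasso.coef_)  # zero-weight features carry no importance
```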
Data Preparation: Missing Values
Options on how to deal with them:
- Remove
- Mean
- Interpolation
- Mask
- ....
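A minimal sketch of these options with pandas, on a made-up series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

print(s.dropna())          # remove the missing entries
print(s.fillna(s.mean()))  # replace them with the mean
print(s.interpolate())     # linear interpolation between neighbours
print(s.isna())            # boolean mask that marks what was missing
```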
Data Preparation: Nominal Value Discretization
Categorical data, often called nominal data, are variables that contain label values rather than numeric ones. Several methods may be applied:
- one-hot encoding - one binary column per category. Suitable when no ordinal relationship exists between categories; however, for a huge number of categories it is not so good...
- label encoding - one integer per category. Integer values have a natural ordered relationship between each other, and ML models may infer such a relationship (best reserved for ordinal data).
- binary encoding - each category's integer code is written in binary digits, producing far fewer columns than one-hot encoding for high-cardinality features.
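A minimal sketch of the first two encodings; binary encoding is available in third-party packages such as `category_encoders` (not shown here):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category
print(pd.get_dummies(df["colour"]))

# Label encoding: one integer per category (implies an order!)
print(LabelEncoder().fit_transform(df["colour"]))  # e.g. [2, 1, 0, 1]
```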
Data Preparation: Binning / Discretization
Binning can make the model more robust and prevent overfitting. However, it can penalize the model's performance, since every time you bin something you sacrifice information.
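A minimal binning sketch with pandas (bin edges and labels are illustrative):

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 40, 67, 82])

# Replace exact ages with coarser, more robust categories
binned = pd.cut(ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"])
print(binned)
```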
Data Preparation: Feature Engineering
The process of creating new features! The goal is to improve the performance of ML models.
Example: from the creation date of an observation (2021-10-29 16h30) what can we extract?
We may extract new features such as:
- Year, month and day
- Hour and minutes
- Day of week (Friday)
- Is Weekend? (No)
- Is Holiday? (No)
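A minimal sketch of extracting these features with pandas; the holiday check would typically use a calendar lookup (e.g. the third-party `holidays` package), so it is only noted in a comment:

```python
import pandas as pd

ts = pd.Timestamp("2021-10-29 16:30")  # the creation date from the example

features = {
    "year": ts.year,                  # 2021
    "month": ts.month,                # 10
    "day": ts.day,                    # 29
    "hour": ts.hour,                  # 16
    "minute": ts.minute,              # 30
    "day_of_week": ts.day_name(),     # 'Friday'
    "is_weekend": ts.dayofweek >= 5,  # False (Saturday=5, Sunday=6)
    # "is_holiday": would need a holiday calendar lookup
}
print(features)
```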