T3

Data Quality

Problems:

  1. Missing Values: Information that is not available because it wasn’t collected or because it was sensitive; features that are not applicable in all cases
  2. Duplicated Records: Same (or similar) data collected from different sources
  3. Noise: Modifications to the original records (data that is corrupted or distorted) due to technological limitations, sensor error or even human error
  4. Outliers: A data point that differs significantly from other observations
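
A minimal sketch of how the first two problems might be checked, assuming records are stored as a list of dicts and that a missing value is encoded as None. The toy records and the helper names (missing_counts, duplicated_rows) are illustrative only. Only exact duplicates are caught; "similar" records would need fuzzy matching.

```python
from collections import Counter

# Hypothetical toy records; None marks a missing value (an assumption).
records = [
    {"id": 1, "age": 34, "city": "Porto"},
    {"id": 2, "age": None, "city": "Braga"},
    {"id": 3, "age": 51, "city": None},
    {"id": 1, "age": 34, "city": "Porto"},  # exact duplicate of the first record
]


def missing_counts(rows):
    """Count missing (None) values per feature."""
    counts = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None:
                counts[key] += 1
    return dict(counts)


def duplicated_rows(rows):
    """Return every row that exactly repeats an earlier row."""
    seen = set()
    dupes = []
    for row in rows:
        key = tuple(sorted(row.items()))  # order-independent fingerprint
        if key in seen:
            dupes.append(row)
        else:
            seen.add(key)
    return dupes


print(missing_counts(records))
print(duplicated_rows(records))
```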

Data Exploration

What to do?

  1. Understand the data and its characteristics
  2. Evaluate its quality
  3. Find patterns and relevant information

How?

  1. Central Tendency: average, mode, median…
  2. Statistical dispersion: variance, standard deviation, interquartile range…
  3. Probability distribution: Gaussian, Uniform, Exponential…
  4. Correlation/Dependence: between pairs of features, with the dependent feature…
  5. Data visualization: tables, charts, boxplots, scatter plots, histograms, …
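
The first four items above can be sketched with Python's standard library. This is only an illustration: the helper names (summarize, pearson) are made up for these notes, and the correlation shown is Pearson's, which is one of several possible dependence measures.

```python
import math
import statistics


def summarize(values):
    """Central tendency and dispersion measures for one numeric feature."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "std_dev": statistics.stdev(values),      # sample standard deviation
        "iqr": q3 - q1,                           # interquartile range
    }


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric features."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


print(summarize([2, 4, 4, 5, 7, 9, 11]))
print(pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
```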

Data Visualization

Graphs!!!

Data Preparation

Basics


A set of basic data preparation techniques can be used:

  1. Union/intersection of columns;
  2. Concatenation;
  3. Sorters;
  4. Filters (column, row, nominal, rule-based, …);
  5. Basic aggregations (counts, unique, mean/sum, …)
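
A rough sketch of these techniques on plain Python lists of dicts. The two toy tables and their column names are hypothetical; in practice a dataframe library or a workflow tool would usually do this.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy tables with slightly different columns.
sales_a = [{"region": "north", "amount": 120}, {"region": "south", "amount": 80}]
sales_b = [{"region": "north", "amount": 200, "channel": "web"},
           {"region": "east", "amount": 50, "channel": "shop"}]

# Union / intersection of columns
union_cols = set(sales_a[0]) | set(sales_b[0])
common_cols = set(sales_a[0]) & set(sales_b[0])

# Concatenation: stack the rows, keeping only the shared columns
rows = [{c: r[c] for c in common_cols} for r in sales_a + sales_b]

# Sorter: order rows by amount, largest first
by_amount = sorted(rows, key=lambda r: r["amount"], reverse=True)

# Rule-based row filter and a column filter
large = [r for r in rows if r["amount"] >= 100]
amounts_only = [{"amount": r["amount"]} for r in rows]

# Basic aggregations per region: count, sum, mean
groups = defaultdict(list)
for r in rows:
    groups[r["region"]].append(r["amount"])
summary = {
    region: {"count": len(v), "sum": sum(v), "mean": mean(v)}
    for region, v in groups.items()
}

print(summary)
```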

Advanced

How?

Data Preparation: Feature Scaling

Normalizing the range of the independent features

Normalization
Rescaling data so that all values fall within a fixed range [a, b], for example between 0 and 1 (a = 0, b = 1).

$$z = (b - a)\,\frac{x - \min(x)}{\max(x) - \min(x)} + a$$
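
A small sketch of min-max normalization to a target range [a, b], i.e. z = (b - a) * (x - min(x)) / (max(x) - min(x)) + a. The function name min_max_scale and the choice to raise an error on a constant feature are assumptions made for this example.

```python
def min_max_scale(values, a=0.0, b=1.0):
    """Rescale values linearly so the minimum maps to a and the maximum to b."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no range to rescale (assumed handling).
        raise ValueError("cannot scale a constant feature")
    return [(b - a) * (x - lo) / (hi - lo) + a for x in values]


print(min_max_scale([10, 20, 30]))          # default range [0, 1]
print(min_max_scale([10, 20, 30], -1, 1))   # custom range [-1, 1]
```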

Standardization / Z-score normalization
Rescaling the distribution of values so that the mean of observed values is 0 and the standard deviation is 1. Assumes observations fit a Gaussian distribution with a well-behaved mean and standard deviation, which may not always be the case.

$$z = \frac{x_i - \mu}{\sigma}$$
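
A minimal sketch of standardization, z = (x_i - mean) / std. It assumes the population standard deviation is used (dividing by n); some tools use the sample version instead. The function name standardize is illustrative.

```python
import statistics


def standardize(values):
    """Return z-scores so that the result has mean 0 and standard deviation 1."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(x - mu) / sigma for x in values]


print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))
```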

Data Preparation: Outlier Detection

  1. Statistical-based strategy: Z-Score, Box Plots, …
  2. Knowledge-based strategy: Based on domain knowledge. For example, exclude everyone with a monthly salary higher than 1M € …
  3. Model-based strategy: Using models such as one-class SVMs, isolation forests, clustering, …
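
A sketch of the statistical and knowledge-based strategies (model-based ones need a machine learning library and are left out). The salary data are made up, the 1.5 × IQR fences are the usual box plot convention, and the thresholds are assumptions. Note that with the population standard deviation no point among n observations can reach |z| above sqrt(n - 1), so a threshold of 3 never fires on very small samples.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return []
    return [x for x in values if abs(x - mu) / sigma > threshold]


def iqr_outliers(values, k=1.5):
    """Flag values outside the box plot fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


def knowledge_based_outliers(values, cap=1_000_000):
    """Domain rule from the notes: monthly salary higher than 1M is excluded."""
    return [x for x in values if x > cap]


# Hypothetical monthly salaries with one extreme value.
salaries = [1200, 1300, 1250, 1400, 1350, 1280, 1320, 2_500_000]

print(zscore_outliers(salaries, threshold=2.0))
print(iqr_outliers(salaries))
print(knowledge_based_outliers(salaries))
```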

Data Preparation: Feature Selection / Dimensionality Reduction

What can we remove:

Process:

  1. Remove a feature if the percentage of missing values is higher than a threshold;
  2. Use the chi-square test to measure the degree of dependency between a feature and the target class;
  3. Remove feature if data are highly skewed;
  4. Remove feature if low standard deviation;
  5. Remove features that are highly correlated between each other.
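
A rough sketch of steps 1, 4 and 5 (missing-value threshold, low standard deviation, high pairwise correlation). The chi-square and skewness checks are not covered. The threshold values, the None encoding for missing values, the toy columns and the helper names are all assumptions for illustration.

```python
import math
import statistics


def missing_ratio(column):
    """Fraction of missing (None) entries in a column."""
    return sum(v is None for v in column) / len(column)


def pearson(xs, ys):
    """Pearson correlation on the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    ax = [p[0] for p in pairs]
    ay = [p[1] for p in pairs]
    mx, my = statistics.fmean(ax), statistics.fmean(ay)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    denom = math.sqrt(sum((x - mx) ** 2 for x in ax) * sum((y - my) ** 2 for y in ay))
    return cov / denom if denom else 0.0


def select_features(columns, max_missing=0.5, min_std=1e-8, max_corr=0.95):
    """Keep features that pass the three filters, scanning in dict order."""
    kept = {}
    for name, col in columns.items():
        if missing_ratio(col) > max_missing:
            continue  # step 1: too many missing values
        present = [v for v in col if v is not None]
        if len(present) < 2 or statistics.pstdev(present) < min_std:
            continue  # step 4: (almost) constant feature
        if any(abs(pearson(col, other)) > max_corr for other in kept.values()):
            continue  # step 5: redundant with a feature already kept
        kept[name] = col
    return kept


# Hypothetical toy dataset stored column-wise.
columns = {
    "height_cm": [170, 180, 165, 175],
    "height_in": [66.9, 70.9, 65.0, 68.9],    # nearly a copy of height_cm
    "constant": [1, 1, 1, 1],
    "mostly_missing": [None, None, None, 3],
    "age": [23, 45, 31, 52],
}

print(list(select_features(columns)))
```

Which of two highly correlated features gets dropped depends on the scan order here; a real pipeline might instead keep the one more related to the target.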

Data Preparation: Missing Values

Options on how to deal with it:

Data Preparation: Nominal Value Discretization

Data Preparation: Binning / Discretization

Binning: group numeric data into intervals, the so-called bins.
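
A minimal sketch of equal-width binning, one common way to build the intervals (equal-frequency bins are another). The function name and the ages used are illustrative assumptions.

```python
def equal_width_bins(values, n_bins):
    """Assign each value the index (0 .. n_bins - 1) of its equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    indices = []
    for x in values:
        idx = int((x - lo) / width) if width > 0 else 0
        indices.append(min(idx, n_bins - 1))  # the maximum goes into the last bin
    return indices


ages = [18, 25, 37, 52, 66, 78]
print(equal_width_bins(ages, 3))  # bins: [18, 38), [38, 58), [58, 78]
```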

Data Preparation: Feature Engineering