Simple Outlier Detector using Python

Asad Ali
3 min read · Jul 30, 2021

Anomaly detection is a set of methods for identifying points or trends that deviate unusually from the distribution of the observed data, typically measured in terms of variance or distance. These methods have been widespread in industry since Walter Shewhart introduced control charts for production processes in the 1920s. A cluster of points or a pattern forming unusually on one side of the chart indicated a special-cause variation in the production process. With the advent of big data and high-dimensional data, more refined methods such as time series analysis, autoencoders, and variational autoencoders came into use, depending on the data being analyzed and the problem at hand. Most of these methods calculate some form of distance or variance of a single point or cluster from the learned data distribution.
Today anomaly detection methods are used everywhere, from website customer analysis and security analysis of software and systems to defense, healthcare, and manufacturing.

Let's start by building a simple outlier detector. First, let's import the required libraries.
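The original import listing is not reproduced in this copy. As an assumption, NumPy alone is enough for a detector of this kind:

```python
# Assumed dependency: NumPy only (the article's own import list is not shown).
import numpy as np
```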

We are going to build our outlier model in four steps:

1. First, we will create a function to calculate the mean and standard deviation of each feature across the training examples.

2. Then we will calculate the probability density of a single data row, feature by feature, given that mean and standard deviation and using the normal distribution formula.

3. Then, for each multidimensional data point, we will calculate the likelihood as the product of its per-feature densities.

4. Finally, based on labeled training data, we will calculate the threshold that gives the best classification, as measured by the F1 score.
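The formulas for steps 2 and 3 are not reproduced in this copy. Assuming the standard univariate normal density, and writing x_j for the j-th feature of a row with per-feature mean μ_j and standard deviation σ_j (symbols chosen here for illustration), they can be sketched as:

```latex
p(x_j; \mu_j, \sigma_j) = \frac{1}{\sigma_j \sqrt{2\pi}} \exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)

p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j)
```

Under this reading, a row x is treated as an outlier when p(x) falls below the threshold chosen in step 4. Taking the product treats the features as independent.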

Let's create two functions for the first two steps above:
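The original listing is not available here, so the following is a minimal sketch of what these two functions might look like. The names `estimate_gaussian` and `gaussian_pdf` and the sample array are illustrative assumptions, not the author's code.

```python
import numpy as np

def estimate_gaussian(X):
    """Return the per-feature mean and standard deviation of X (rows are examples)."""
    X = np.asarray(X, dtype=float)
    # axis=0 aggregates over rows, giving one value per feature (column).
    # np.std defaults to the population standard deviation (ddof=0).
    return X.mean(axis=0), X.std(axis=0)

def gaussian_pdf(x, mean, std):
    """Element-wise normal density of x for the given mean and standard deviation."""
    x = np.asarray(x, dtype=float)
    coeff = 1.0 / (std * np.sqrt(2.0 * np.pi))
    return coeff * np.exp(-((x - mean) ** 2) / (2.0 * std ** 2))

# Hypothetical sample: three examples with two features each.
X_train = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]])
mean, std = estimate_gaussian(X_train)
densities = gaussian_pdf(X_train[0], mean, std)  # one density per feature
```

Because the arithmetic is element-wise, the same `gaussian_pdf` works for a single value, a single row, or a whole matrix of rows.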

Now let's write a function to calculate the likelihood of each data row with multiple features, which is essentially the product of the individual feature probabilities in that row.
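A possible sketch of such a function is shown below. The name `row_likelihood` is an assumption, and the density is computed inline so that the block runs on its own.

```python
import numpy as np

def row_likelihood(X, mean, std):
    """Likelihood of each row: the product of its per-feature normal densities."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    per_feature = np.exp(-((X - mean) ** 2) / (2.0 * std ** 2)) / (std * np.sqrt(2.0 * np.pi))
    # Multiply across features (axis=1) to get one value per row.
    return per_feature.prod(axis=1)

# Hypothetical parameters: two features, each with mean 0 and standard deviation 1.
mean = np.array([0.0, 0.0])
std = np.array([1.0, 1.0])
probs = row_likelihood([[0.0, 0.0], [3.0, 3.0]], mean, std)
```

The row far from the mean gets a much smaller likelihood than the row at the mean. With many features the product can underflow, in which case summing log-densities is a common alternative.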

Now let's write a function to select the threshold. The function takes the outlier labels of a training set and the corresponding probabilities. We will use a simple linear search between the minimum and maximum probabilities with a fixed step size, and select the threshold that gives the maximum F1 score.
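The search could look roughly like the sketch below. The function names, the `n_steps` parameter, and the hand-written F1 helper are assumptions made to keep the example free of extra dependencies. It also assumes that label 1 marks an outlier and that rows with a probability below the threshold are flagged.

```python
import numpy as np

def f1_score_binary(y_true, y_pred):
    """F1 score for binary labels where 1 marks an outlier."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correctly flagged
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)

def select_threshold(y_true, probs, n_steps=1000):
    """Linear search over thresholds between min(probs) and max(probs)."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs, dtype=float)
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(probs.min(), probs.max(), n_steps):
        y_pred = (probs < eps).astype(int)  # low probability means outlier
        f1 = f1_score_binary(y_true, y_pred)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Hypothetical labels and probabilities: only the last row is an outlier.
y = np.array([0, 0, 0, 1])
p = np.array([0.20, 0.15, 0.18, 0.001])
eps, f1 = select_threshold(y, p)
```

Because the comparison is strict, ties keep the smallest threshold that reaches the best score.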

Now let's test our model. We can create some sample training data, calculate the threshold value, and apply it to the test set.
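As an illustration, here is a self-contained end-to-end sketch on synthetic data. The data, seed, and helper names are all made up for this example, and compact versions of the helpers are repeated so the block runs on its own.

```python
import numpy as np

def fit(X):
    """Per-feature mean and standard deviation."""
    return X.mean(axis=0), X.std(axis=0)

def likelihood(X, mean, std):
    """Product of per-feature normal densities for each row."""
    dens = np.exp(-((X - mean) ** 2) / (2.0 * std ** 2)) / (std * np.sqrt(2.0 * np.pi))
    return dens.prod(axis=1)

def f1(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return 0.0 if tp == 0 else 2.0 * tp / (2.0 * tp + fp + fn)

def select_threshold(y_true, probs, n_steps=1000):
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(probs.min(), probs.max(), n_steps):
        score = f1(y_true, (probs < eps).astype(int))
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1

# Synthetic training set: 200 ordinary points plus 5 labeled outliers.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-8.0, 8.0], [8.0, -8.0], [-8.0, -8.0], [9.0, 0.0]])
X_train = np.vstack([normal, outliers])
y_train = np.concatenate([np.zeros(200, dtype=int), np.ones(5, dtype=int)])

mean, std = fit(X_train)
eps, best_f1 = select_threshold(y_train, likelihood(X_train, mean, std))

# Test set: one point near the center and one far away.
X_test = np.array([[0.1, -0.2], [7.5, -7.0]])
y_pred = (likelihood(X_test, mean, std) < eps).astype(int)
print("threshold:", eps, "train F1:", best_f1, "test predictions:", y_pred)
```

With points this well separated, the far-away test row is flagged and the central one is not. Real data is rarely this clean, so the threshold is better chosen on a labeled set that was held out from fitting.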
