15 Core Stats concepts for Data Scientists explained using Numpy

Asad Ali
7 min readJun 17, 2021

--

Data Science is a vast field. Mastery of data science discipline requires extensive study of the body of knowledge and practice. Below are some of the core statistical ideas that any aspiring data scientist should know. All of the below micro-ideas and toy examples are illustrated using the Numpy python library.

  1. Basic Matrix Operations. Linear Algebra is the foundation of machine learning. The basic matrix operations are listed below
import numpy as npMatrixA = np.random.rand(10,10)MatrixB = np.random.rand(10,10)MatrixC = np.zeros((10,10))def add_matrix(A:np.array,B:np.array) -> np.array:"""Adds two numpy matrix elements wise"""ar,ac = A.shapebr,bc = B.shapeassert ar == brassert ac ==bcreturn A + Bdef sub_matrix(A:np.array,B:np.array) -> np.array:"""Subs two numpy matrix elements wise"""ar,ac = A.shapebr,bc = B.shapeassert ar == brassert ac ==bcreturn A - Bdef matmul(a:np.array,b:np.array) -> np.array:"""Multiplies A with B and return matrix slowly"""ar,ac = a.shapebr,bc = b.shapeassert ac==brc = np.zeros(ar,ac)for i in range(ar):for j in range(bc):for k in range(ac):c[i,j] += a[i,k] * b[k,j]return cdef faster_matmul(a,b):"""Multiplies A with B and return matrix using broadcasting"""ar,ac = a.shapebr,bc = b.shapeassert ac==brreturn a@bdef fast_matmul(a,b):"""Multiplies A with B and return matrix using einstien notation"""ar,ac = a.shapebr,bc = b.shapeassert ac==brreturn np.einsum('ik,kj->ij',a,b)def hadmard_prod(a,b):"""Calculate the element wise matmul"""ar,ac = A.shapebr,bc = B.shapeassert ar == brassert ac ==bcreturn a*b

2. Correlation

Correlation is one of the most important tools for feature engineering and understanding problem features. The correlation/Covariance matrix can explain how each feature is related to each other and thus allows us to select the best feature list for the target variable. There are four main types of correlations: Pearson correlation, Kendall rank correlation, Spearman correlation, and the Point-Biserial correlation

def NaivecorrelationCoefficient(X, Y, n) :sum_X = 0sum_Y = 0sum_XY = 0squareSum_X = 0squareSum_Y = 0i = 0while i < n :# sum of elements of array X.sum_X = sum_X + X[i]# sum of elements of array Y.sum_Y = sum_Y + Y[i]# sum of X[i] * Y[i].sum_XY = sum_XY + X[i] * Y[i]# sum of square of array elements.squareSum_X = squareSum_X + X[i] * X[i]squareSum_Y = squareSum_Y + Y[i] * Y[i]i = i + 1# use formula for calculating correlation# coefficient.corr = (float)(n * sum_XY - sum_X * sum_Y)/(float)(math.sqrt((n * squareSum_X -sum_X * sum_X)* (n * squareSum_Y -sum_Y * sum_Y)))return corrdef numpy_cov_matrix(data1,data2):return np.cov(data1,data2)from numpy import cov# seed random number generatorseed(1)# prepare datadata1 = 20 * randn(1000) + 100data2 = data1 + (10 * randn(1000) + 50)# calculate covariance matrixprint(numpy_cov_matrix)# we can even try to implement convolution using pure Numpydef convolution2d(image, kernel, bias):m, n = kernel.shapeif (m == n):y, x = image.shapey = y - m + 1x = x - m + 1new_image = np.zeros((y,x))for i in range(y):for j in range(x):new_image[i][j] = np.sum(image[i:i+m, j:j+m]*kernel) + biasreturn new_image

3. Further Matrix Operations: Inverse, Identity, and Orthogonal Matrix

# Inverse using numpydef find_inverse(A):return np.linalg.inv(A)def Identity(size):"""Go row wise and for each diag , make 1 else 0"""for row in range(0, size):for col in range(0, size):     if (row == col):     print("1 ", end=" ")     else:     print("0 ", end=" ")     print()#numpy implementationI = np.indentity(10)#Random matrixM = np.random.rand(10,10)#Mutilpling by Identity producesM_I = M@Iprint(M)

4. Norm of a vector

The length of the vector is referred to as the vector norm or the vector’s magnitude, vector norm is an important concept in machine learning normally used in optimization functions. Mostly L1 and L2 types of norms are used

def get_norm_L2(A):return np.sqrt(np.sum([i*i for i in A ]))def numpy_norm(A,k):
return np.linalg.norm(A,k)

5. Eigen Vector and Eigen Values

Eigenvector values are used in numerous fields of modeling of dynamic systems. Eigenvectors also play an important concept in machine learning and are essentially a linear transformation that simply scales the vector and does not change the direction

Ax = Lambda(x) ; Where A is the vector, x is eigenvector ; Lambda is the scalar value

def EigenValues(A):w, v = np.linalg.eig(m)return w,v

6. Point estimates, sampling, and confidence intervals

data = np.random.normal(10,2,1000)def bootstrap_sample(data, N):"""get bootstrap samples of size N from Data"""return np.random.choice(data,size=N,replace=True)bootstrap = []no_of_collections = 10for i in range(no_of_collections):bootstrap.append(bootstrap_sample(data))np.percentile(np.array(bootstrap,2.5),np.array(bootstrap,97.5))point_estimates = np.mean(data)

7. PCA — Principle Component Analysis

Principal component analysis (PCA) is a technique for reducing the dimensionality of such datasets, increasing interpretability but at the same time minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance.

def numpy_pca(data):data = data - data.mean(axis=0)cov = np.cov(data.T)/ data.shape[0]v,w = np.linalg.eig(cov)idx = v.argsort()[::-1]v = v[idx]w = [:,idx]res = data.dot(w[:,:k])return res

8. Central Limit theorem

Used a lot in analytics when combined with resampling basically, the central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed

from scipy.stats import normimport numpy as npx = np.linspace(-50.0, 50.0, 10000)y1 = norm.pdf(x)y2 = np.norm(x+2)rolls = randint(1, 7, 50)# generate random dice rollsfrom numpy.random import seedfrom numpy.random import randintfrom numpy import mean# seed the random number generatorseed(1)# generate a sample of die rollsrolls = randint(1, 7, 50)print(rolls)print(mean(rolls))# generate random dice rollsfrom numpy.random import seedfrom numpy.random import randintfrom numpy import mean# seed the random number generatorseed(1)# generate a sample of die rollsrolls = randint(1, 7, 50)print(rolls)print(mean(rolls))from numpy.random import seedfrom numpy.random import randintfrom numpy import meanfrom matplotlib import pyplot# seed the random number generatorseed(1)# calculate the mean of 50 dice rolls 1000 timesmeans = [mean(randint(1, 7, 50)) for _ in range(1000)]# plot the distribution of sample meanspyplot.hist(means)pyplot.show()

9. Various Probability Distributions

Generate prob. distributions#poissonsample = stats.poisson.rvs(2, size=100)#exponsample = stats.expon.rvs(scale=5, size=100)#weibullsample = stats.weibull_min.rvs(1.5, scale=5000, size=100)

10. Resampling and probability

import random, time#Lets estimate probabilities of virus infections across different groups from base probabilities and resamplingprobabilities= { "PersonA" : 0.50, "PersonB" : .25, "PersonC" : .25, "PersonBD" : .125,"PersonBE" : .125, "PersonCD" : .125, "PersonCE" : .125}def check_probability(person_name, n_repetitions = 100):prob_comp, repetitions = 0, 0p_person = probabilities[person_name]while repetitions < n_repetitions:x = random.random()if x <= p_person:prob_comp += 1repetitions += 1print ("{0} % chance of getting virus on {1}".format(round(prob_comp/100.0, 2), person_name))for key in probabilities:check_probability(key, 1000)

11. Likelihood function

#lets create simple likehood functions for binomial coin toss distributions
import numpy as np
theta = 0.2 # Probability of H is 0.2, hence NOT a fair coindata = [0, 1, 0, 1, 1, 1, 0, 0, 1, 1] # T, H, T, H, H, ....def likelihood(theta, h):
"""
returns the prob value
"""
return (theta)**(h)*(1-theta)**(1-h)likelihood(theta, 1) # 0.2likelihood(theta, 0) # 0.8singlethrow = [likelihood(theta, x) for x in data]# returns the likehood of data if the parameters are trueprob1 = np.prod(singlethrow)

12. Bayes Rule

def bayes_theorem(p_a, p_b_given_a, p_b_given_not_a):# calculate P(not A)not_a = 1 - p_a# calculate P(B)p_b = p_b_given_a * p_a + p_b_given_not_a * not_a# calculate P(A|B)p_a_given_b = (p_b_given_a * p_a) / p_breturn p_a_given_b# P(A)p_a = 0.0002# P(B|A)p_b_given_a = 0.85# P(B|not A)p_b_given_not_a = 0.05# calculate P(A|B)result = bayes_theorem(p_a, p_b_given_a, p_b_given_not_a)# summarizeprint('P(A|B) = %.3f%%' % (result * 100))

13. Bayesian Inference

# lets estimate of postive virus cases in a districtnum_postive = 312num_tested = 1000# lets set up my priorfrom scipy.stats import binomimport numpy as npprops = np.arange(0.0, 1.01, 0.01)x = binom.pmf(312,1000,props)import matplotlib.pyplot as pltprior = np.ones(len(props)) * .002marginal_likelihood = x * prior.sum()posterior = x * prior / marginal_likelihoodprint(posterior)

14.MLE Maximum Likelihood

from scipy.optimize import minimizeimport scipy.stats as statsN = 100x = np.linspace(0,20,N)ϵ = np.random.normal(loc = 0.0, scale = 5.0, size = N)y = 3*x + ϵdef MLERegression(params):intercept, beta, sd = params[0], params[1], params[2] # inputs are guesses at our parametersyhat = intercept + beta*x # predictions# next, we flip the Bayesian question# compute PDF of observed values normally distributed around mean (yhat)# with a standard deviation of sdnegLL = -np.sum( stats.norm.logpdf(y, loc=yhat, scale=sd) )# return negative LLreturn(negLL)results = minimize(MLERegression, guess, method = 'Nelder-Mead',options={'disp': True})

15. KL Divergence

def KL(a, b):a = np.asarray(a, dtype=np.float)b = np.asarray(b, dtype=np.float)return np.sum(np.where(a != 0, a * np.log(a / b), 0))values1 = [1.346112,1.337432,1.246655]values2 = [1.033836,1.082015,1.117323]
print(KL(values1,values2))

--

--

Asad Ali
Asad Ali

Written by Asad Ali

Data Science, Analytics, and Machine Learning Professional.

No responses yet