Optimization approach for estimation of Linear Regression
As part of data science, we focus on estimation algorithms for classification and regression problems. Of course, the solution would involve unsupervised learning algorithms too. However, optimization doesn’t normally appear in the curriculum of a typical data science learning schedule. This could be because solutions using optimization is rare.
The objective of this article is to give higher level exposure to optimization by comparing it with a familiar approach to Linear Regression solution.
Even though, direct applications are less, optimization forms the pillar of datascience as it form the building block of many an algorithm. It is just that the user is not directly exposed to it. The structure of most algorithms starts with defining a cost function and ends with minimizing the it using an optimization routine. For example, Support Vector Machines (SVM) algorithm apply constrained optimization to estimate optimal weights.
Application of optimization for solution development is rare. We are aware of the categorization of Analytics as descriptive, predicitive and prescriptive. In fact, optimization is the building block of most prescriptive solutions. The chance of converting a predictive algorithm to prescriptive algorithm is not very common but when it is done, benefit is immense. Consider following examples that applied optimization to add value to predictive solutions (create prescriptive solution).
Advertisement Allocation:- It is quite common to estimate the sensitivity of sales to various types of advertisements (TV, Newspaper, Radio etc.). This information will be applied to estimate volume due to these advertisements and RoI. This is the typical output of a market mix modeling exercise. This result can be extended to estimate the optimum allocation of advertisement budget to maximise sales (a prescriptive solution).
Optimal Premium:- Auto insurers know that quoted premium influence the probability of renewal of a policy. In a typical renewal model, premium amount is part of the estimation of probability of renewal. However, this model can be improved to estimate the optimal premium for various classes of customers to maximize sales or profit.
The possible applications that can add value to a predictive applications are many. The bottle neck is that the professionals must recognize an opportunity for this application as they work on various problems. Hence, it is critical that data scientist must have working knowledge of the field of optimization.
The objective of this article is to apply optimization for estimating Linear Regression. The reason why linear regression was chosen as this is one of the simplest and most likely the first algorithm learnt. We understand that linear regression is estimated using the concept of Gradient Descent which is an iterative approach. Lets take a look at the result so that we can compare with solution developed using Optimization approach.
Linear Regression using Scikit Learn
Data:- Let’s use Boston house price data available with sklearn. Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. The objective is to predict median value of a house of this town as a function of crime, pollution, education quality etc. The code segment below provide steps of building a model using scikit learn library.
The estimated coefficients are provided below
coefs
CRIM -0.928146
ZN 1.081569
INDUS 0.140900
CHAS 0.681740
NOX -2.056718
RM 2.674230
AGE 0.019466
DIS -3.104044
RAD 2.662218
TAX -2.076782
PTRATIO -2.060607
B 0.849268
LSTAT -3.743627
const 22.532806
The performance of the model was evaluated using r-square and RMSE.
Root Means Squared Error: 4.679191295697282
R-Square: 0.7406426641094094
Linear Regression solution using Optimization
Now let’s solve linear regression through optimization approach. We will use optimize library of Scipy. Minimise function of this library is useful for this purpose. Our objective is to minimise the Mean Squared Error and estimate the optimal weights that will help us to achieve it.
The optimization setup got following elements that are logically connected.
Objective function:- This function carries the metric that we are attempting to minimise. In the case of Linear Regression, we attempt to mininse Mean Squared Error.
Bounds:- Bounds are a limit on the value of weights. The algorithm will consider only the values between the bounds while estiamting the optimals weights.
Initial Guess:- Array of real elements of size (n,), where ’n’ is the number of independent variables.
Method:- There are various algorithms available for solving the optimization problem. The choice depends on the nature of the problem; whether it is linear or non-linear, dimensionality, does a derivative exist or not, nature of constraints etc.
We chose SLSQP (Sequential Least SQuares Programming) as it minimize a function of several variables with any combination of bounds, equality and inequality constraints. This method wraps the SLSQP Optimization subroutine originally implemented by Dieter Kraft.
Solution:-
fun: 21.894831519076373
jac: array([ 5.14984131e-05, -1.06334686e-04, 1.38998032e-04, -6.43730164e-05, -8.58306885e-05, 2.81333923e-05, 2.16245651e-04, 1.38998032e-04, -9.13143158e-05, 6.93798065e-05, 1.02281570e-04, -1.51395798e-04, -4.10795212e-04, -3.09944153e-05])
message: 'Optimization terminated successfully.'
nfev: 243
nit: 15
njev: 15
status: 0
success: True
x: array([-9.27963609e-01, 1.08141254e+00, 1.41149929e-01, 6.81704556e-01, -2.05689787e+00, 2.67403329e+00, 2.00840569e-02, -3.10356986e+00, 2.66177517e+00, -2.07635534e+00, -2.06060232e+00, 8.49091901e-01, -3.74436647e+00, 2.25327923e+01])
The solution provides optimised value of the objective function (fun: 21.894831519076373). This indicate that when weights are optimum the model MSE is equals to 21.894831519076373. It shows that optimization was successful (success: True). The ‘x’ values are the weights of each of the variables (corresponding to coefficients of sklearn regression).
Model Performance:-
Root Means Squared Error: 4.679191331744874
R-Square: 0.7406426601133311
The model performance is quite comparable to sklearn approach (gradient descent). We notice some difference at low decimal places which indicates that algorithms are different but very close in performance.
weights
CRIM -0.927964
ZN 1.081413
INDUS 0.141150
CHAS 0.681705
NOX -2.056898
RM 2.674033
AGE 0.020084
DIS -3.103570
RAD 2.661775
TAX -2.060602
B 0.849092
LSTAT -3.744366
const 22.532792
The chart below is hardly distinguishable compared to the chart of sklearn approach which is not a surprise.
Conclusions:-
The article compared two approaches to solving regression problem. First solution was the common gradient descent approach using sklearn library. Secondly we solved the regression problem using the optimization approach. The results are compared using a number of metrics and indicators. Mostly the results were indistinguishable which indicates the applicability of optimization for minimising a non-linear function.
References:-
Kraft, D. A software package for sequential quadratic programming. 1988. Tech. Rep. DFVLR-FB 88–28, DLR German Aerospace Center — Institute for Flight Mechanics, Koln, Germany.
https://towardsdatascience.com/linear-regression-using-gradient-descent-97a6c8700931
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html#id8