Although this documentation covers the full suite of regression algorithms, this section takes a deep dive into linear regression and how to use it.

Linear regression models the relationship between variables by fitting a linear equation (in the simplest case, a straight line) to the observed data. One variable is treated as the explanatory, or independent, variable (e.g. your income), and the other as the dependent variable (e.g. your expenses). In essence, it shows how much of the variation in the dependent variable can be captured by changes in the independent variables.
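To make the income and expenses example concrete, here is a minimal sketch of fitting a straight line by ordinary least squares in plain Python. The data values and the `fit_line` helper are made up for illustration; they are not part of the platform.

```python
def fit_line(x, y):
    """Ordinary least squares for a single explanatory variable.

    Returns (slope, intercept) of the line y = intercept + slope * x.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative (made-up) data: monthly income and monthly expenses
income = [2000, 2500, 3000, 3500, 4000]
expenses = [1500, 1800, 2100, 2400, 2700]

slope, intercept = fit_line(income, expenses)

# Predicted expenses for an income that is not in the data
predicted = intercept + slope * 3200
```

Here the slope is the coefficient: it estimates how much expenses change for each additional unit of income.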

In the field, the dependent variable is also called the response, the target, or the factor of interest, e.g. sales of a product, pricing, performance, or risk. Independent variables are also called explanatory variables or predictors, because they explain the factors that influence the dependent variable. The degree of each factor's impact is measured by its “parameter estimate” or “coefficient”. These coefficients are tested for statistical significance, typically by building confidence intervals around them, so that the model we are building is statistically robust and based on objective data. An elasticity derived from a coefficient tells us by what percentage the dependent variable changes for a one percent change in that factor.

The key factor in any model is a sound understanding of the domain and its business application.

What can it be used for?

<aside> 💡 Linear regression is a powerful technique that can be used to generate insights into consumer behavior, business performance, and the factors that influence profitability. It can also be used to evaluate trends and make estimates or forecasts. For example, if a company’s sales have increased steadily every month for the past few years, the company could fit a linear regression to its monthly sales data and use it to forecast sales in future months.

</aside>

<aside> 💡 Linear regression can also be used to analyze the effect of marketing, pricing, and promotions on the sales of a product. For instance, if company ABC wants to know whether the funds it has invested in marketing a particular brand have delivered a substantial return on investment, it can use linear regression. The beauty of linear regression is that it lets us isolate the impact of each marketing campaign while controlling for the other factors that could influence sales.

</aside>

We can use linear regression to tackle day-to-day problems such as supporting decision making, minimizing errors, increasing operational efficiency, discovering new insights, and building predictive analytics.

Linear regression has limited applicability in some business situations because it works only when the dependent variable is continuous, but it remains a very well-known technique wherever it does apply. It assumes a linear relationship between the independent and dependent variables. Note that a transformation can sometimes be applied to a nonlinear relationship so that it can be handled by a linear regression model.

How to use it?

First, we need to identify the explanatory variables and the dependent variable. We call the explanatory features x_columns and the feature to predict y_column. The goal is to build a model that predicts y_column from x_columns.

With these identified, we build a dataset that contains only the features to be used, that is, only x_columns and y_column. If your dataset or source contains more features than the model will use, we recommend using the CustomSQL operator to select only the desired features.
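The column selection itself is simple. The sketch below shows the idea in plain Python; the column names, the sample rows, and the `select_columns` helper are hypothetical, and the exact query syntax accepted by the CustomSQL operator is not shown here.

```python
def select_columns(rows, x_columns, y_column):
    """Keep only the explanatory columns and the column to predict."""
    keep = list(x_columns) + [y_column]
    return [{name: row[name] for name in keep} for row in rows]


# Hypothetical source rows with more columns than the model needs
source_rows = [
    {"customer_id": 1, "income": 2000, "age": 31, "expenses": 1500},
    {"customer_id": 2, "income": 2500, "age": 45, "expenses": 1800},
]

# In SQL terms this is roughly: SELECT income, expenses FROM <source table>
training_rows = select_columns(source_rows, ["income"], "expenses")
```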

Once the dataset is built, it is fed into the model.

To do this, proceed as follows:

  1. TRAINING DATA: Remember the training table that can be prepared with CustomSQL? Add it to your model here.
  2. X_COLUMNS: Specify the features that act as explanatory (independent) variables. You can provide one or more; at least one feature is required.
  3. Y_COLUMN: Specify the feature to be predicted (only one is allowed).
  4. TRAINING PARAMS: If you want to edit the model's training parameters, you can do so here, using the same format found in Spark's own documentation. This part is optional; fill it in only if you want to customize training.
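The four inputs above can be pictured as a small configuration. The sketch below is an assumption for illustration only: the dictionary layout, key names, table name, and `validate_config` helper are hypothetical and do not reflect the platform's actual format. The training parameter names (`maxIter`, `regParam`, `elasticNetParam`) are taken from Spark ML's `LinearRegression`.

```python
def validate_config(config):
    """Check the two rules stated above: at least one x column, exactly one y column."""
    x_columns = config.get("x_columns", [])
    y_column = config.get("y_column")
    if not isinstance(x_columns, list) or len(x_columns) < 1:
        raise ValueError("x_columns needs at least one feature")
    if not isinstance(y_column, str) or not y_column:
        raise ValueError("y_column must be exactly one column name")
    return True


config = {
    "training_data": "training_table",  # hypothetical table name
    "x_columns": ["income", "age"],     # one or more explanatory features
    "y_column": "expenses",             # the single feature to predict
    # Optional; parameter names follow Spark ML's LinearRegression.
    # How the platform expects them to be written is an assumption here.
    "training_params": {"maxIter": 100, "regParam": 0.1, "elasticNetParam": 0.0},
}

is_valid = validate_config(config)
```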