Although we offer the full suite of classification algorithms, this documentation will focus on Logistic Regression.

Logistic Regression, also known as Logit Regression or the Logit Model, is a statistical model used to estimate the probability of an event occurring given some previously observed data.

This model is a popular method for predicting a categorical response: it can be used to predict either a binary outcome or a multiclass outcome.

Logistic Regression works with binary data, where the event either happens (1) or does not happen (0), or with categorical data of up to 140 categories, each category encoded as a positive integer value.

It is useful where linear regression is not applicable: in a linear regression the output variable is quantitative, but what if the variable is qualitative? This is where logistic regression comes in. Through the logistic (logit) transformation it maps the output to a number between 0 and 1 that represents the probability of belonging to one of the possible categories. In other words, it can be seen as a special case of the linear model in which the target variable is categorical in nature and the log of the odds is modelled as a linear function of the predictors. Logistic Regression predicts the probability of occurrence of a binary event using the logit function.
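As a minimal illustration of that transformation (the coefficient and predictor values below are made up for the example, not produced by the platform), the model computes a linear combination of the predictors (the log of the odds) and passes it through the logistic, or sigmoid, function to obtain a probability between 0 and 1:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients of a fitted logistic regression model
intercept = -1.5
coefficients = np.array([0.8, 2.1])

# One observation with two predictor values (the x_columns)
x = np.array([1.0, 0.5])

# Linear combination (the log of the odds), then the probability of the event
log_odds = intercept + np.dot(coefficients, x)
probability = sigmoid(log_odds)

print(f"log-odds: {log_odds:.3f}, probability of the event: {probability:.3f}")
```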

From this it follows that its main use is to solve classification problems, where the predictors (x) can be binary or multinomial variables and a categorical answer (y) is predicted. Classification involves looking at data and assigning a class (or a label) to it.

What can it be used for?

<aside> 💡 The Logistic Regression model is a powerful data analysis tool that can help decision makers identify the market segments most likely to respond to certain marketing actions.

</aside>

In the marketing field it is common practice to study certain behaviours using multiple-answer nominal scales: a checklist of answers from which the respondent can choose one or more categories. Such questions are stored in databases as dichotomous variables, which can be used directly in logistic regression models, whereas other multivariate data analysis models require continuous variables. The regression model can thus relate a specific future action to certain present behaviours. Based on such dependences, companies can establish marketing strategies aimed at specific market segments.

Another potential use is in a Sales-Marketing system. Most such systems use live sales agents who contact customers through a personal visit or by phone. There is a cost associated with generating leads and converting them into clients, and reaching out to every customer is both time and cost intensive. With data such as the average cost of reaching a customer and the invoice amount, plus conversion rates, it is possible to identify a target audience and decide on a budget. For this, it is natural to ask certain questions: what cost is involved in achieving the highest ROI and profit, and where is the Profit-ROI equilibrium point? Using our logistic regression model it is possible to predict the probability that an observation falls into each of the possible categories. In the Sales-Marketing system, the outcome of the previous marketing campaign is the dependent variable, and the independent variables are previous campaign metrics such as “number of times the customer has been reached in the past”, “number of days since the last purchase”, etc., plus some demographic attributes of the customers. With the output of the classifier, the predicted “probability of an event occurring” can be compared against a cut-off value to decide whether the customer will or will not buy the product.
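As a rough sketch of that last step (the sample data, column meanings, and cut-off value below are illustrative assumptions, not part of the platform), one could fit a logistic regression on past campaign data and then apply a threshold to the predicted probabilities:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative campaign data: [times reached in the past, days since last purchase]
X = np.array([[1, 120], [5, 10], [3, 45], [8, 5], [2, 200], [6, 30]])
y = np.array([0, 1, 0, 1, 0, 1])  # 1 = bought in the previous campaign

model = LogisticRegression().fit(X, y)

# Probability of buying for new customers
new_customers = np.array([[4, 20], [1, 180]])
buy_probability = model.predict_proba(new_customers)[:, 1]

# Apply a cut-off to segregate likely buyers from non-buyers
cutoff = 0.5  # assumed threshold; in practice chosen from costs and ROI targets
will_buy = buy_probability >= cutoff
print(buy_probability, will_buy)
```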

How to use it?

First of all we need to identify the variables: the independent variables, or predictors, and the dependent, or target, variable. These are the features we work with: the goal is to predict y_column from the features in x_columns. The aim is to come up with a model that classifies / identifies, via y_column, the group with the highest probability of conversion, and also estimates the profitability of targeting that group.

Knowing this, we proceed to build the dataset with only the features to be used, that is, only x_columns and y_column. If your dataset or source contains more features than those that will be used in the model, it is recommended to use the CustomSQL operator to obtain only the desired features.
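Conceptually, the CustomSQL operator performs the same column selection as the following pandas sketch; the file name and column names here are made-up examples, not platform requirements:

```python
import pandas as pd

# Hypothetical source table with more columns than the model needs
data = pd.read_csv("customers.csv")

# Keep only the predictors (x_columns) and the target (y_column)
x_columns = ["times_reached", "days_since_last_purchase"]
y_column = "bought_product"

training_data = data[x_columns + [y_column]]
```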

With the dataset constructed this way, the data is fed into the model.

To do this, proceed as follows:

  1. TRAINING DATA: Remember the training table that can be prepared with a CustomSQL? Add it to your model.
  2. X_COLUMNS: Here you indicate the features that act as independent variables (predictors); there can be one or more (minimum one feature).
  3. Y_COLUMN: Here you indicate the feature to be predicted / classified (only one is allowed). A sketch of the equivalent workflow in code is shown after this list.
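For readers who want to see what these three inputs correspond to outside the platform, here is a minimal sketch using scikit-learn; the data file and column names are assumptions for the example only:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# 1. TRAINING DATA: the table prepared earlier (hypothetical file name)
training_data = pd.read_csv("training_table.csv")

# 2. X_COLUMNS: one or more independent variables / predictors
x_columns = ["times_reached", "days_since_last_purchase"]

# 3. Y_COLUMN: the single feature to be predicted / classified
y_column = "bought_product"

model = LogisticRegression()
model.fit(training_data[x_columns], training_data[y_column])

# Predicted class and probability of the event for each row
predictions = model.predict(training_data[x_columns])
probabilities = model.predict_proba(training_data[x_columns])[:, 1]
```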