Friday, April 15, 2011

Multiple Regression Analysis


Multiple Regression Analysis can be a powerful tool for evaluating multiple process or product inputs versus a single output characteristic, which is often a balanced scorecard element.  The results can provide insight into what makes a difference, what doesn't make a difference, and, unfortunately, what the data just can't tell you.

The keys to the analysis boil down to how you structure your data set for the multiple regression.  A common stratification factor that aligns all of the data elements is paramount, and the most common one is the date associated with the data elements, both the input X's and the output Y.  If any data is missing for a particular date, the best practice is to remove all of the data elements for that date from the model.  Maintaining balance is critical.
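Here is a minimal sketch of that structuring step in Python with pandas; the file names and column names (process_inputs.csv, scorecard_output.csv, date) are hypothetical stand-ins for your own data:

```python
import pandas as pd

# Hypothetical files: each has a "date" column plus one or more measurements.
inputs = pd.read_csv("process_inputs.csv", parse_dates=["date"])     # the X's
outputs = pd.read_csv("scorecard_output.csv", parse_dates=["date"])  # the Y

# Align everything on the common date key.
data = inputs.merge(outputs, on="date", how="inner")

# If any element is missing for a date, drop the whole row from the
# model data to keep the data set balanced.
data = data.dropna()
```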

Before throwing all of the data into a multiple regression analysis using a statistical analysis package such as JMP or Minitab, look at the data graphically with a matrix scatterplot.  Look for relationships, either positive or negative, that relate an input X to the output Y.  Also, be wary of highly correlated X-to-X relationships.  That could be collinearity, which can lead to misinterpretation of your regression model.
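If you would rather script the graphical look than click through a package, here is a sketch using the merged `data` frame from the previous snippet:

```python
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Scatterplot matrix: scan the Y row/column for X-vs-Y relationships,
# and the rest of the grid for suspiciously strong X-vs-X patterns.
scatter_matrix(data.drop(columns=["date"]), figsize=(10, 10), diagonal="hist")
plt.show()

# A numeric companion to the picture: pairwise correlations.
print(data.drop(columns=["date"]).corr())
```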

The graphic look at the relationships will point out the X variables that don't need to be included in the model to start with.  During the multiple regression analysis, multicollinearity can be dealt with using variance inflation factors (VIFs); a common rule of thumb is to keep VIF values at 5 or less.  Depending on the statistics software package, during the first iteration of your regression model you will be presented with T statistics or F statistics.  These statistics are used to generate the all-important p-values.
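The same first iteration can be run in Python with statsmodels; this sketch assumes the `data` frame above, with a hypothetical output column named "output":

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

y = data["output"]                                    # hypothetical Y name
X = sm.add_constant(data.drop(columns=["date", "output"]))

# First pass: the summary prints the T statistics and their p-values.
model = sm.OLS(y, X).fit()
print(model.summary())

# VIF for each X (column 0 is the constant, so start at 1);
# flag anything over 5 for a closer look.
for i, name in enumerate(X.columns[1:], start=1):
    print(name, round(variance_inflation_factor(X.values, i), 2))
```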

Pooling is the elimination of the non-significant variables.  The first step in the reduction of variables is to identify the X's that have either F <= 1 or an absolute value of T <= 1.  These should be removed from the model.  Now re-run the regression model and check the VIFs.  Typically, any X with a VIF greater than 5 can be removed if it also has a large p-value.  The caution here is to take them out one at a time and re-run the model.  I forgot to tell you: that is why it is called multiple regression analysis; you will be doing this multiple times.
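One way to script the one-at-a-time loop is a simplified, p-value-based backward elimination; this is not the exact T/F-then-VIF sequence above, just a sketch of the refit-and-remove rhythm, continuing from the `y` and `X` objects in the previous snippet:

```python
import statsmodels.api as sm

# Remove the single least significant X, refit, and repeat.
# (This is the "multiple" in multiple regression analysis.)
X_reduced = X.copy()
while True:
    model = sm.OLS(y, X_reduced).fit()
    pvals = model.pvalues.drop("const")      # keep the intercept in the model
    if pvals.empty or pvals.max() <= 0.05:
        break                                # everything left is significant
    worst = pvals.idxmax()
    print(f"removing {worst} (p = {pvals[worst]:.3f})")
    X_reduced = X_reduced.drop(columns=[worst])
```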

Now that your regression model has been reduced to the significant X's, you should see the following (a scripted check of all three follows the list).

  • X variables have p values less than 0.05.
  • R squared is 65% or greater.
  • VIFs are all less than 5.
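Continuing from the earlier sketches, those three criteria can be verified in a few lines:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Final checks on the reduced model.
final = sm.OLS(y, X_reduced).fit()

assert (final.pvalues.drop("const") < 0.05).all(), "a non-significant X remains"
assert final.rsquared >= 0.65, f"R squared is only {final.rsquared:.0%}"
for i, name in enumerate(X_reduced.columns[1:], start=1):
    vif = variance_inflation_factor(X_reduced.values, i)
    assert vif < 5, f"{name} has VIF = {vif:.1f}"
print("Model passes all three checks.")
```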

The next question after all of this work has been completed is, “Do the relationships in the model make sense?”  Use your process and product knowledge to give the results a sanity check.  If it makes sense, then you are ready to use the mathematical model you developed to make predictions.  The conservative approach is to stay within the minimum and maximum values of each significant X variable that was used to develop your model; remember, that is the only data you have, and it is the basis for the model.
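A sketch of that conservative approach, again continuing from the fitted `final` model: refuse to predict for inputs outside the observed range.  The `predict` helper here is hypothetical, and `new_x` must supply a value for every significant X.

```python
import pandas as pd

# Observed range of each significant X in the model-building data.
lo = X_reduced.drop(columns=["const"]).min()
hi = X_reduced.drop(columns=["const"]).max()

def predict(new_x):
    """new_x: dict of {X name: value}; refuses to extrapolate."""
    for name, value in new_x.items():
        if not (lo[name] <= value <= hi[name]):
            raise ValueError(f"{name}={value} outside [{lo[name]}, {hi[name]}]")
    row = pd.DataFrame([new_x],
                       columns=[c for c in X_reduced.columns if c != "const"])
    row.insert(0, "const", 1.0)   # match the fitted design matrix
    return final.predict(row)[0]
```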

Here is why.  There was a study done in Oslo.  The population of storks was increasing along with the number of babies being born, which is a strong positive correlation.  Therefore, we can declare that Storks bring Babies!  The truth is, storks nest near warm chimneys, so as the population grew, more houses were built, which meant more chimneys and thus more nesting places for the birds.  Remember, your a priori knowledge about products, processes, and services should be used to give your multiple regression model results a sanity check.
