Simplified Discussion on How DataRevelation Works (US patent 7,225,113)

                                                                 Purpose

DataRevelation automatically finds a world-class statistical model and approximates missing data.  It is hoped that the power and ease of use of this tool will shed new light on your specific data and open new avenues of thought about data analysis.

                                                                 Details                              

DataRevelation's models take this generalized form:  y = k1f(x1) + k2f(x2) + k3f(x3) + ... + b, where f(x1) is some function of independent variable x1, f(x2) is some function of independent variable x2, etc.  The constants k1, k2, etc. will be omitted from the following discussion for simplicity.

The first task is to determine which independent variable is most important.  This is found by plotting the dependent variable (y) versus common transformations of each independent variable.  The one with the best fit is determined to be the most important.
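The ranking step above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions, not DataRevelation's actual code: the list of "common transformations" (TRANSFORMS), the use of R-squared of a straight-line fit as the measure of "best fit", and the names r_squared, best_transform, and most_important are all hypothetical choices made for this example.

```python
import math

# Hypothetical set of "common transformations"; the source does not list them.
TRANSFORMS = {
    "identity": lambda v: v,
    "square": lambda v: v * v,
    "sqrt": lambda v: math.sqrt(v),
    "log": lambda v: math.log(v),
    "reciprocal": lambda v: 1.0 / v,
}

def r_squared(xs, ys):
    """R-squared of the least-squares straight line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column cannot explain anything
    return sxy * sxy / (sxx * syy)

def best_transform(xs, ys):
    """Return (name, R-squared) of the best-fitting transformation of xs."""
    best = ("identity", -1.0)
    for name, f in TRANSFORMS.items():
        try:
            tx = [f(x) for x in xs]
        except (ValueError, ZeroDivisionError):
            continue  # transformation is undefined for this data
        score = r_squared(tx, ys)
        if score > best[1]:
            best = (name, score)
    return best

def most_important(columns, ys):
    """Pick the variable whose best transformation fits ys most closely."""
    scores = {var: best_transform(xs, ys) for var, xs in columns.items()}
    winner = max(scores, key=lambda var: scores[var][1])
    return winner, scores[winner]

# Toy data: y depends strongly on log(x1) and only weakly on x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.0 * math.log(a) + 0.1 * b for a, b in zip(x1, x2)]
winner, (transform, score) = most_important({"x1": x1, "x2": x2}, y)
print(winner, transform, round(score, 4))
```

On this toy data x1 is selected as the most important variable, because some transformation of x1 explains almost all of the variation in y.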

The next step is to determine the optimum function of the most important independent variable (let's assume it is x1).  DataRevelation uses a combination of finesse and brute strength to find the transformation of x1 that gives the best fit when plotted against the dependent variable (y).
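The source does not say what the "finesse and brute strength" search consists of.  As one plausible reading, the sketch below tries a coarse grid of power transformations and then refines the grid around the best exponent.  The power family, the grid limits and step sizes, and the name search_exponent are assumptions made for this example only; x is assumed to be strictly positive.

```python
import math

def r_squared(xs, ys):
    """R-squared of the least-squares straight line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def power(x, p):
    # On the usual ladder of powers, p = 0 stands for the logarithm.
    return math.log(x) if p == 0 else x ** p

def search_exponent(xs, ys, lo=-3.0, hi=3.0, coarse=0.5, fine=0.05):
    """Coarse grid search over exponents, then a finer grid around the winner.

    Assumes every x is strictly positive so that all powers are defined.
    """
    def score(p):
        return r_squared([power(x, p) for x in xs], ys)

    def grid(start, stop, step):
        n = int(round((stop - start) / step))
        return [round(start + i * step, 10) for i in range(n + 1)]

    best = max(grid(lo, hi, coarse), key=score)                     # brute strength
    best = max(grid(best - coarse, best + coarse, fine), key=score)  # refinement
    return best, score(best)

# Toy data generated from y = 2 * x**1.5 + 4, so the search should find 1.5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x ** 1.5 + 4.0 for x in xs]
p, r2 = search_exponent(xs, ys)
print(p, r2)
```

A grid search like this is simple and robust but only as precise as its finest step; a real implementation may well use a different family of functions or a smarter search.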

This yields a model with only one independent variable taking this form: y = f(x1) + b1.  The goal of the next step is to determine the next most important independent variable (let's assume it is x2) and create a model of the form y = f(x1) + f(x2) + b2.  The component of the dependent variable (y) that can be attributed to the most important independent variable (x1) is removed by subtracting the calculated values of y from the observed values of y.  This is equivalent to y - f(x1) - b1 = f(x2) + b''.  The quantity y - f(x1) - b1 is also called the residual of x1, which can be written as residual(x1).

The residual(x1) is then plotted against common transformations of each remaining independent variable.  The one with the best fit is determined to be the second most important independent variable.  Finesse and brute strength are then used to find the optimum function of x2, the one with the best fit when plotted against residual(x1).  This yields a model with two independent variables taking this form: y = f(x1) + f(x2) + b2, where b2 = b1 + b''.

A third independent variable (let's assume x3) is then added to the model, which will have this form: y = f(x1) + f(x2) + f(x3) + b3.  The component of the dependent variable (y) that can be attributed to the two most important independent variables (x1 and x2) is removed by subtracting the calculated values of y from the observed values of y.  This is equivalent to y - f(x1) - f(x2) - b2 = f(x3) + b'''.  The quantity y - f(x1) - f(x2) - b2 can be written residual(x1,x2).

The residual(x1,x2) is plotted against common transformations of the remaining independent variables in order to find the third most important independent variable, and the cycle repeats itself until all of the independent variables are added to the model. 
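The whole cycle described above (fit the best variable, subtract its contribution, repeat on the residual) can be sketched as a short loop.  This is a minimal illustration, not the patented procedure: the transformation list, the straight-line least-squares fit used for each stage, and the names fit_line and stagewise_fit are assumptions introduced for this example.

```python
import math

# Hypothetical list of candidate transformations (not taken from the source).
TRANSFORMS = {
    "identity": lambda v: v,
    "square": lambda v: v * v,
    "sqrt": lambda v: math.sqrt(v),
    "log": lambda v: math.log(v),
}

def fit_line(ts, ys):
    """Least-squares fit ys ~ k * ts + b; returns (k, b, R-squared)."""
    n = len(ts)
    mt, my = sum(ts) / n, sum(ys) / n
    stt = sum((t - mt) ** 2 for t in ts)
    syy = sum((y - my) ** 2 for y in ys)
    sty = sum((t - mt) * (y - my) for t, y in zip(ts, ys))
    if stt == 0 or syy == 0:
        return 0.0, my, 0.0
    k = sty / stt
    return k, my - k * mt, sty * sty / (stt * syy)

def stagewise_fit(columns, ys):
    """Add one transformed variable per stage, always fitting the residual."""
    residual = list(ys)
    remaining = dict(columns)
    terms, intercept = [], 0.0
    while remaining:
        best = None
        for var, xs in remaining.items():
            for tname, f in TRANSFORMS.items():
                try:
                    tx = [f(x) for x in xs]
                except (ValueError, ZeroDivisionError):
                    continue  # transformation undefined for this column
                k, b, r2 = fit_line(tx, residual)
                if best is None or r2 > best[0]:
                    best = (r2, var, tname, k, b, tx)
        if best is None:
            break
        _, var, tname, k, b, tx = best
        terms.append((var, tname, k))
        intercept += b  # the stage intercepts b1, b'', b''' add up
        residual = [r - (k * t + b) for r, t in zip(residual, tx)]
        del remaining[var]
    return terms, intercept, residual

# Toy data on a balanced grid: y = 2 * x1**2 + 3 * x2 + 1.
grid = [1.0, 2.0, 3.0, 4.0]
x1 = [a for a in grid for _ in grid]
x2 = [b for _ in grid for b in grid]
y = [2.0 * a * a + 3.0 * b + 1.0 for a, b in zip(x1, x2)]
terms, intercept, residual = stagewise_fit({"x1": x1, "x2": x2}, y)
print(terms, intercept)
```

Because the toy variables are laid out on a balanced grid, the stage-by-stage fit recovers the generating formula: the square of x1 enters first, x2 enters second, and the stage intercepts sum to the overall constant.  With correlated variables, a stagewise fit of this kind will generally differ from fitting all terms at once.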

In summary, transformations of the independent variables are added to the model in a step-by-step method using residual analysis starting with the most important variable and proceeding to the least important variable.  See the directions for further thoughts.

                                                              Missing Data Approximations

This methodology automatically yields a very good model with just a few terms, and it also allows missing data to be approximated automatically.

If there is missing data for the most important independent variable, x1, then those records are temporarily ignored while determining the model with only one independent variable, which takes the form y = f(x1) + b1.  Each missing value of x1 is then solved for from the known values of y and b1 and the specifics of f(x1).  Subsequent steps of the program use the estimates for the missing data of x1.

If there is missing data for the next most important independent variable, x2, then those records are temporarily ignored while determining the optimum function of x2.  A model with this form is created:  y - f(x1) - b1 = f(x2) + b''.  The values of y, f(x1), b1, b'', and the specifics of f(x2) are known, so the missing data of x2 can be solved for.  Subsequent steps of the program use the estimates for the missing data of x2.

Missing data of all remaining variables are handled in this way. 
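Solving for a missing value amounts to inverting the fitted function.  The sketch below shows the idea for the first variable, restoring the constant k1 that the discussion omits: fit y = k f(x1) + b on the complete records, then set x1 = f_inverse((y - b) / k) where x1 is missing.  It assumes that f has a known inverse on the relevant range, and the names fit_line and impute_by_inversion are hypothetical; how DataRevelation handles functions that are not one-to-one is not stated in the source.

```python
import math

def fit_line(ts, ys):
    """Least-squares fit ys ~ k * ts + b; returns (k, b)."""
    n = len(ts)
    mt, my = sum(ts) / n, sum(ys) / n
    stt = sum((t - mt) ** 2 for t in ts)
    sty = sum((t - mt) * (y - my) for t, y in zip(ts, ys))
    k = sty / stt
    return k, my - k * mt

def impute_by_inversion(xs, ys, f, f_inverse):
    """Fit on complete records, then solve y = k * f(x) + b for each missing x.

    Missing values are marked with None.  f_inverse must undo f.
    """
    known = [(f(x), y) for x, y in zip(xs, ys) if x is not None]
    k, b = fit_line([t for t, _ in known], [y for _, y in known])
    if k == 0:
        raise ValueError("y does not depend on x; nothing to invert")
    filled = [x if x is not None else f_inverse((y - b) / k)
              for x, y in zip(xs, ys)]
    return filled, k, b

# Toy data: y = 2 * log(x1) + 1, with two x1 values missing (true values 3 and 6).
true_x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0 * math.log(x) + 1.0 for x in true_x1]
x1 = [1.0, 2.0, None, 4.0, 5.0, None]
filled, k, b = impute_by_inversion(x1, y, math.log, math.exp)
print([round(v, 6) for v in filled])
```

For x2 and later variables, the same function would be applied with the residual in place of y.  Note that a value estimated this way makes its record fit the current stage exactly, so with noisy data the estimate absorbs whatever error that record carries.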
