BACKGROUND IMAGE: iSTOCK/GETTY IMAGES
Predictive modeling is a process that uses data mining and probability to forecast outcomes. Each model is made up of a number of predictors, which are variables that are likely to influence future results. Once data has been collected for relevant predictors, a statistical model is formulated. The model may employ a simple linear equation, or it may be a complex neural network, mapped out by sophisticated software. As additional data becomes available, the statistical analysis model is validated or revised.
Applications of predictive modeling
Predictive modeling is often associated with meteorology and weather forecasting, but it has many applications in business.
One of the most common uses of predictive modeling is in online advertising and marketing. Modelers use web surfers' historical data, running it through algorithms to determine what kinds of products users might be interested in and what they are likely to click on.
Bayesian spam filters use predictive modeling to identify the probability that a given message is spam. In fraud detection, predictive modeling is used to identify outliers in a data set that point toward fraudulent activity. And in customer relationship management (CRM), predictive modeling is used to target messaging to customers who are most likely to make a purchase. Other applications include capacity planning, change management, disaster recovery (DR), engineering, physical and digital security management and city planning.
Although it may be tempting to think that big data makes predictive models more accurate, statistical theorems show that, after a certain point, feeding more data into a predictive analytics model does not improve accuracy. Analyzing representative portions of the available information -- sampling -- can help speed development time on models and enable them to be deployed more quickly.
Once data scientists gather this sample data, they must select the right model. Linear regressions are among the simplest types of predictive models. Linear models essentially take two variables that are correlated -- one independent and the other dependent -- and plot one on the x-axis and one on the y-axis. The model applies a best fit line to the resulting data points. Data scientists can use this to predict future occurrences of the dependent variable.
Other more complex predictive models include decision trees, k-means clustering and Bayesian inference, to name just a few potential methods.
The most complex area of predictive modeling is the neural network. This type of machine learning model independently reviews large volumes of labeled data in search of correlations between variables in the data. It can detect even subtle correlations that only emerge after reviewing millions of data points. The algorithm can then make inferences about unlabeled data files that are similar in type to the data set it trained on. Neural networks form the basis of many of today's examples of artificial intelligence (AI), including image recognition, smart assistants and natural language generation (NLG).
Predictive modeling considerations
One of the most frequently overlooked challenges of predictive modeling is acquiring the right data to use when developing algorithms. By some estimates, data scientists spend about 80% of their time on this step.
While predictive modeling is often considered to be primarily a mathematical problem, users must plan for the technical and organizational barriers that might prevent them from getting the data they need. Often, systems that store useful data are not connected directly to centralized data warehouses. Also, some lines of business may feel that the data they manage is their asset, and they may not share it freely with data science teams.
Another potential stumbling block for predictive modeling initiatives is making sure projects address real business challenges. Sometimes, data scientists discover correlations that seem interesting at the time and build algorithms to investigate the correlation further. However, just because they find something that is statistically significant doesn't mean it presents an insight the business can use. Predictive modeling initiatives need to have a solid foundation of business relevance.