Predictive modeling, also called predictive analytics, is a mathematical process that seeks to predict future events or outcomes by analyzing patterns in historical and current data. The goal of predictive modeling is to answer this question: "Based on known past behavior, what is most likely to happen in the future?"
Once data has been collected, the analyst selects and trains statistical models using historical data. Although it may be tempting to think that big data makes predictive models more accurate, statistical research shows that, beyond a certain point, feeding more data into a predictive analytics model does not improve accuracy. The statistician George Box's adage "All models are wrong, but some are useful" is often cited as a caution against relying solely on predictive models to determine future action.
In many use cases, including weather predictions, multiple models are run simultaneously and results are aggregated to create one final prediction. This approach is known as ensemble modeling. As additional data becomes available, the statistical analysis will either be validated or revised.
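The aggregation step at the heart of ensemble modeling can be sketched in a few lines. The three "models" below are hypothetical stand-ins (plain functions with hand-picked coefficients), not real trained forecasters; the point is only how their outputs are averaged into one final prediction.

```python
# Ensemble sketch: average the outputs of several simple "models".
# The models are illustrative placeholders, not trained on real data.

def model_a(x):
    return 2.0 * x + 1.0   # e.g., one linear fit

def model_b(x):
    return 2.2 * x + 0.5   # a slightly different fit

def model_c(x):
    return 1.8 * x + 1.4   # another variant

def ensemble_predict(x, models):
    """Aggregate by averaging each model's prediction."""
    preds = [m(x) for m in models]
    return sum(preds) / len(preds)

print(ensemble_predict(10.0, [model_a, model_b, model_c]))
```

Real ensembles often weight the members by past accuracy or use voting for classification, but simple averaging is a common baseline.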
Applications of predictive modeling
Predictive modeling is often associated with meteorology and weather forecasting, but it has many applications in business.
One of the most common uses of predictive modeling is in online advertising and marketing. Modelers use web surfers' historical data, running it through algorithms to determine what kinds of products users might be interested in and what they are likely to click on.
Bayesian spam filters use predictive modeling to identify the probability that a given message is spam. In fraud detection, predictive modeling is used to identify outliers in a data set that point toward fraudulent activity. And in customer relationship management (CRM), predictive modeling is used to target messaging to customers who are most likely to make a purchase. Other applications include capacity planning, change management, disaster recovery (DR), engineering, physical and digital security management and city planning.
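A Bayesian spam filter of the kind mentioned above can be sketched with Bayes' rule: combine per-word likelihoods with a prior to get the probability a message is spam. The word statistics below are made-up illustrative numbers, not trained from a real corpus.

```python
# Minimal Bayesian spam-filter sketch. Likelihoods and priors are
# hypothetical; a real filter would estimate them from labeled mail.

from math import prod

# P(word | spam) and P(word | ham) -- illustrative values
likelihood_spam = {"free": 0.6, "offer": 0.5, "meeting": 0.05}
likelihood_ham = {"free": 0.1, "offer": 0.1, "meeting": 0.4}

P_SPAM, P_HAM = 0.5, 0.5  # prior: assume half of all mail is spam

def spam_probability(words):
    # Bayes' rule with a naive independence assumption between words
    ps = P_SPAM * prod(likelihood_spam.get(w, 0.5) for w in words)
    ph = P_HAM * prod(likelihood_ham.get(w, 0.5) for w in words)
    return ps / (ps + ph)

print(spam_probability(["free", "offer"]))  # high probability of spam
print(spam_probability(["meeting"]))        # low probability of spam
```

The independence assumption between words is what makes the filter "naive" Bayesian, and it is the same simplification most production spam filters start from.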
Analyzing representative portions of the available information -- sampling -- can help speed development time on models and enable them to be deployed more quickly.
Once data scientists gather this sample data, they must select the right model. Linear regression is among the simplest types of predictive models. A linear model takes two correlated variables -- one independent, one dependent -- and plots the independent variable on the x-axis and the dependent variable on the y-axis. The model fits a best-fit line to the resulting data points, which data scientists can use to predict future values of the dependent variable.
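The best-fit line described above has a closed-form solution: the slope is the covariance of the two variables divided by the variance of the independent variable. The sketch below fits a line to toy data (the ad-spend numbers are invented for illustration) and then forecasts the dependent variable at a new point.

```python
# Least-squares line fit by hand, then a prediction for a new x value.
# The data is synthetic and purely illustrative.

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data: monthly ad spend (x) vs. sales (y)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.0, 8.1, 9.9]

slope, intercept = fit_line(xs, ys)
predicted = slope * 6 + intercept  # forecast the dependent variable at x = 6
print(slope, intercept, predicted)
```

In practice, libraries such as scikit-learn or statsmodels handle the fitting, but the underlying arithmetic is exactly this.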
Some of the most popular methods include:
- Decision trees. Decision tree algorithms take data (mined, open source, internal) and graph it out in branches to display the possible outcomes of various decisions. Decision trees classify and predict response variables based on past decisions, can be used with incomplete data sets, and are easily explainable and accessible for novice data scientists.
- Time series analysis. This technique forecasts events from a sequence of observations ordered in time. Analysts predict future events by analyzing past trends and extrapolating from there.
- Logistic regression. This statistical method estimates the probability that an observation belongs to a particular category based on one or more input variables. As more data is brought in, the algorithm's ability to sort and classify observations improves, and predictions can be made about which class a new observation falls into.
The most complex area of predictive modeling is the neural network. This type of machine learning model independently reviews large volumes of labeled data in search of correlations between variables in the data. It can detect even subtle correlations that only emerge after reviewing millions of data points. The algorithm can then make inferences about unlabeled data files that are similar in type to the data set it trained on. Neural networks form the basis of many of today's examples of artificial intelligence (AI), including image recognition, smart assistants and natural language generation (NLG).
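To make the neural network idea concrete, the sketch below runs the forward pass of a tiny network with one hidden layer. The weights are hand-set for illustration; a real network would learn them from large volumes of labeled data via backpropagation.

```python
# Forward pass of a minimal neural network: 2 inputs -> 2 hidden
# units -> 1 output. All weights here are illustrative, not trained.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    # each hidden unit: weighted sum of inputs, plus bias, through sigmoid
    hidden = [sigmoid(sum(i * w for i, w in zip(inputs, ws)) + b)
              for ws, b in zip(hidden_w, hidden_b)]
    # output unit: weighted sum of hidden activations, through sigmoid
    return sigmoid(sum(h * w for h, w in zip(hidden, out_w)) + out_b)

hidden_w = [[2.0, -1.0], [-1.5, 2.5]]  # one weight row per hidden unit
hidden_b = [0.1, -0.2]
out_w, out_b = [1.0, 1.0], -1.0

print(forward([0.5, 0.8], hidden_w, hidden_b, out_w, out_b))
```

Stacking many such layers, with millions of learned weights, is what lets production networks find the subtle correlations described above.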
Common algorithms for predictive modeling
Random Forest. An algorithm that combines many decorrelated decision trees, using classification or regression to organize and label vast amounts of data.
Gradient boosted model. An algorithm that, like Random Forest, uses several decision trees, but the trees are built sequentially rather than independently: each tree corrects the errors of the previous one, producing a progressively more accurate picture.
K-Means. A clustering algorithm that groups data points by similarity and is popular for personalized retail offers, since it can find similar customers within a large group.
Prophet. A forecasting procedure that is especially effective for capacity planning. It handles time series data and is relatively flexible.
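The K-Means algorithm listed above alternates between two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points, repeating until the clusters stabilize. The one-dimensional sketch below groups made-up customer-spend figures; the numbers and the two-cluster setup are assumptions for illustration.

```python
# One-dimensional k-means sketch: assignment step, then update step,
# repeated. Data and initial centroids are illustrative.

def kmeans_1d(points, centroids, iters=20):
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        # assignment step: index of the nearest centroid for each point
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # update step: move each centroid to its cluster's mean
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

spend = [10, 12, 11, 95, 100, 98]  # two obvious spending groups
centroids, clusters = kmeans_1d(spend, [0.0, 50.0])
print(centroids)
```

Real deployments run on many features at once (using Euclidean distance instead of absolute difference) and choose the number of clusters empirically.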
Predictive modeling tools
Before deploying a predictive modeling tool, it is crucial for your organization to ask some questions: Who will be running the software? What is the use case for these tools? What other tools will your predictive analytics interact with? What is the budget?
Different tools have different data literacy requirements, are effective in different use cases, work best alongside particular software and vary widely in cost. Once your organization has clarity on these issues, comparing tools becomes easier.
- Sisense. Business intelligence software aimed at a variety of companies that offers a range of business analytics features and requires minimal IT background.
- Oracle Crystal Ball. A spreadsheet-based application aimed at engineers, strategic planners and scientists across industries that can be used for predictive modeling, forecasting, simulation and optimization.
- IBM SPSS Predictive Analytics Enterprise. A business intelligence platform that supports open source integration and features descriptive and predictive analysis as well as data preparation.
- SAS Advanced Analytics. A program that offers algorithms that identify the likelihood of future outcomes and can be used for data mining, forecasting and econometrics.
Predictive modeling considerations
One of the most frequently overlooked challenges of predictive modeling is acquiring the amount of data needed and sorting out the right data to use when developing algorithms. By some estimates, data scientists spend about 80% of their time on this step. Data collection is important but limited in usefulness if this data is not properly managed and cleaned.
Once the data has been sorted, organizations must be careful to avoid overfitting, in which a model appears very accurate because it has memorized the specifics of its training data rather than learned how to generalize to new data.
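The danger of judging a model only on its training data can be shown with an extreme case: a "model" that simply memorizes every training example scores perfectly on data it has seen and near chance on anything new. The data below is synthetic and the scenario is contrived for illustration, but the evaluation pattern (hold out a test set the model never saw) is the standard defense against overfitting.

```python
# Why held-out evaluation matters: a memorizing "model" vs. one that
# learned the underlying rule. All data here is synthetic.

import random

random.seed(0)

def make_data(n):
    # true rule: label is 1 when x > 0.5
    xs = [random.random() for _ in range(n)]
    return [(x, int(x > 0.5)) for x in xs]

train, test = make_data(50), make_data(50)

# An "overfit" model: a lookup table of the exact training examples.
memorized = dict(train)

def memorizer(x):
    return memorized.get(x, 0)  # blind default for anything unseen

# A model that actually learned the underlying rule.
def threshold_model(x):
    return int(x > 0.5)

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

print(accuracy(memorizer, train))        # perfect on the data it saw
print(accuracy(memorizer, test))         # near chance on new data
print(accuracy(threshold_model, test))   # generalizes
```

A large gap between training and test accuracy, as the memorizer shows here, is the classic signature of overfitting.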
While predictive modeling is often considered to be primarily a mathematical problem, users must plan for the technical and organizational barriers that might prevent them from getting the data they need. Often, systems that store useful data are not connected directly to centralized data warehouses. Also, some lines of business may feel that the data they manage is their asset, and they may not share it freely with data science teams.
Another potential stumbling block for predictive modeling initiatives is making sure projects address real business challenges. Sometimes, data scientists discover correlations that seem interesting at the time and build algorithms to investigate the correlation further. However, just because they find something that is statistically significant doesn't mean it presents an insight the business can use. Predictive modeling initiatives need to have a solid foundation of business relevance.