To be, or not to be, that is the question:
Whether 'tis Nobler in the mind to suffer
The Slings and Arrows of outrageous Fortune,
Or to take Arms against a Sea of troubles,
And by opposing end them?
Hamlet's existential angst about the human condition is well known, but a lesser affliction besets data scientists when it comes to the data and techniques they use to extract predictive insights.
The dramatic question data scientists grapple with is:
Should you keep the data as is, or should you transform it?
Data used in machine learning applications is typically one of two varieties: numeric or nominal. Numeric data, as the name suggests, is based on numbers that can be ordered. Nominal data, on the other hand, consists of labels and has no inherent ordering.
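To make the distinction concrete, here is a minimal sketch in Python (the column names and values are hypothetical) showing how the two varieties typically appear side by side in a data set:

    import pandas as pd

    # Hypothetical opportunity data. "amount" is numeric: its values can
    # be ordered and averaged. "product" is nominal: labels with no
    # inherent ordering.
    deals = pd.DataFrame({
        "amount":  [12000, 4500, 87000, 4500],
        "product": ["Widget", "Gadget", "Widget", "Sprocket"],
    })

    print(deals.dtypes)  # amount -> int64, product -> object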
Most machine learning techniques also place restrictions on the kinds of data they can use. Neural networks (deep learning), for example, work best on numeric data rather than nominal data, and many studies suggest they do not perform as well even when the nominal data is converted into a numeric representation.
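The most common such conversion is one-hot encoding, which turns each distinct label into its own 0/1 column. A minimal sketch, reusing the hypothetical data from above:

    import pandas as pd

    deals = pd.DataFrame({
        "product": ["Widget", "Gadget", "Widget", "Sprocket"],
    })

    # One-hot encoding: one 0/1 column per distinct label. A variable
    # with tens of thousands of distinct values would produce tens of
    # thousands of mostly-zero columns, which is why this conversion
    # scales poorly.
    encoded = pd.get_dummies(deals, columns=["product"])
    print(encoded)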
Decision-tree-based models work well with both numeric and nominal data. Decision trees, however, can grow very large depending on the number of distinct values a nominal variable takes. For this reason, many predictive analytics toolsets cap the number of distinct values any nominal variable may take when they build models such as random forests.
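One common way to enforce such a cap, sketched below with hypothetical names and an arbitrary limit, is to keep only the most frequent labels and lump the remainder into an "Other" bucket before training:

    import pandas as pd

    def cap_cardinality(series: pd.Series, max_levels: int = 32) -> pd.Series:
        """Keep the max_levels most frequent labels; map the rest to "Other".

        The cap of 32 is purely illustrative; real toolsets choose their
        own limits for nominal variables in models such as random forests.
        """
        keep = series.value_counts().nlargest(max_levels).index
        return series.where(series.isin(keep), other="Other")

    # Example: products = cap_cardinality(deals["product"], max_levels=32)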
One of the most common classes of problems for which models are created is binary classification. Either a lead becomes an opportunity or it does not. An opportunity results in a win or it does not. A customer renews their subscription or they do not. When modeling such problems, the data often contains both nominal and numeric variables. The product being purchased, for example, is often modeled as a nominal variable. Even if each opportunity has only a single product, the number of products and configurations available to sell can be fairly large: tens of thousands for large B2B enterprises, or millions for B2C companies.
In such situations, we found that we could replace the product label in the data with the qualification rate or win rate of the leads or deals containing that product, with no loss of performance. Technically speaking, this replaces the nominal value with the conditional probability of the label belonging to the positive class in the binary classification problem: each product p is mapped to an estimate of P(positive | product = p).
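A minimal sketch of this transformation, assuming a pandas DataFrame with hypothetical "product" and "won" (0/1) columns:

    import pandas as pd

    def win_rate_encode(df: pd.DataFrame,
                        nominal_col: str,
                        label_col: str) -> pd.Series:
        """Replace each nominal value v with the empirical conditional
        probability P(label = 1 | nominal = v), estimated from the data."""
        win_rates = df.groupby(nominal_col)[label_col].mean()
        return df[nominal_col].map(win_rates)

    deals = pd.DataFrame({
        "product": ["Widget", "Gadget", "Widget", "Sprocket", "Widget"],
        "won":     [1,        0,        0,        1,          1],
    })

    # Widget: 2 wins out of 3 -> 0.667; Gadget -> 0.0; Sprocket -> 1.0.
    deals["product_win_rate"] = win_rate_encode(deals, "product", "won")
    print(deals)

In practice the rates would be estimated on training data only, and rare products would typically be smoothed toward the overall rate, to avoid leaking the label into the feature; the sketch shows only the core idea.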
At DxContinuum, we had a data set with about 50 different variables and 200 million rows. The most important variable was a nominal variable with about 70,000 distinct values. Without any transformation, the resulting decision tree was massive: more than 30GB in size. Once we transformed the variable into a number, the resulting model performed just as well from a quality-of-prediction perspective, but it was an order of magnitude smaller, leading to much faster execution with far fewer computing resources.
These are the kinds of advanced transformations available in the patented DxContinuum platform that can banish the slings and arrows of outrageous data facing today's data scientists. In addition to making their lives easier and accelerating their output, using DxContinuum means they won't have to grapple with existential questions about their data anymore.