DxContinuum Blog

Nominal or Numeric -- That is the Question

October 26, 2016
By Dr. Kannan Govindarajan
To be, or not to be, that is the question:
Whether 'tis Nobler in the mind to suffer
The Slings and Arrows of outrageous Fortune,
Or to take Arms against a Sea of troubles,
And by opposing end them?

Hamlet's existential angst about the human condition is well known, but a lesser affliction besets data scientists when it comes to the data and techniques used to extract predictive insights.

The dramatic question data scientists grapple with is:

Should you keep the data as is, or should you transform it?

Data used in machine learning is typically of one of two varieties: numeric or nominal. Numeric data, as the name suggests, consists of numbers that can be ordered. Nominal data, on the other hand, consists of labels with no inherent order.

Most machine learning techniques also place restrictions on the kinds of data they accept. Techniques like neural networks (deep learning), for example, work best on numeric data rather than nominal data, and many studies suggest they often do not perform as well even when the nominal data is converted into a numeric representation.

Decision-tree-based models work well with both numeric and nominal data. Decision trees, however, can grow very large depending on the number of distinct values a nominal variable takes. For this reason, many predictive analytics toolsets limit the number of distinct values any nominal variable can take when building models such as random forests.

Among the most common classes of problems for which models are built are binary classification problems: a lead becomes an opportunity or it does not; an opportunity results in a win or it does not; a customer renews their subscription or they do not. When modeling such problems, the data often contains both nominal and numeric variables. The product being purchased, for example, is often modeled as a nominal variable. Even if each opportunity has only a single product, the number of products and configurations available to sell can be very large -- tens of thousands for large B2B enterprises, or millions for B2C companies.

In such situations we found that we can replace the product label in the data with the qualification rate or win rate of the leads or deals that contain that product, without loss of predictive performance. Technically speaking, this replaces the nominal value with the conditional probability that the label belongs to the positive class in the binary classification problem.
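The transformation described above can be sketched in a few lines. This is a minimal illustration of the general idea, not DxContinuum's actual implementation; the variable names (`products`, `won`) are hypothetical:

```python
# Replace each nominal value (e.g. a product label) with the observed rate
# of the positive outcome (e.g. win rate) among rows carrying that value --
# i.e. the conditional probability of the positive class given the category.
from collections import defaultdict

def target_encode(categories, outcomes):
    """categories: nominal value per row; outcomes: 0/1 label per row."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for cat, y in zip(categories, outcomes):
        totals[cat] += 1
        positives[cat] += y
    rates = {cat: positives[cat] / totals[cat] for cat in totals}
    return [rates[cat] for cat in categories]

products = ["A", "B", "A", "A", "B"]   # hypothetical nominal variable
won      = [ 1,   0,   0,   1,   1 ]   # hypothetical binary labels
print(target_encode(products, won))
# product "A" wins 2 of 3 deals, "B" wins 1 of 2
```

In practice the rates would be computed on training data only (often with smoothing for rare categories) to avoid leaking labels, but the core idea is this single numeric column in place of tens of thousands of distinct labels.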

At DxContinuum, we had a data set with about 50 variables and 200 million rows. The most important variable was a nominal variable with about 70,000 distinct possible values. Without any transformations, the resulting decision tree was massive -- more than 30GB in size. Once we transformed the variable into a number, the resulting model performed just as well in prediction quality but was an order of magnitude smaller, leading to much faster execution with far less computing resources.

These are the kinds of advanced transformations available in the patented DxContinuum platform that can banish the slings and arrows of outrageous data facing today's data scientists. In addition to making their lives easier and accelerating their output, using DxContinuum means they won't have to grapple with existential issues surrounding their data any more.

Written by Dr. Kannan Govindarajan

DxContinuum Co-Founder & Chief Product Officer Kannan’s 15+ year career in software and services has spanned multiple functions and businesses. In the first 10 years of his career in R&D, he shipped products for Oracle and HP focused on Java and web services middleware technologies. He was one of the architects of the team that created one of the earliest web services implementations, and represented HP in standards bodies such as UDDI. He co-founded a new research program in HP Labs for creating technology for automating business processes for services businesses. Subsequently, he was Chief Technologist of HP’s IT Outsourcing services and implemented some of the ideas from his research. Kannan has multiple patents and publications in peer-reviewed conferences and journals in a variety of areas. Kannan has spent the last 6+ years in Product Management/marketing/strategy roles where he managed the Application Operations Service Line in HP Outsourcing Services, and ran product marketing for HP’s data warehousing product. Most recently, as Director Strategy, he drove thought leadership for HP through the mega-trends project, and led business planning for Cloud applications and infrastructure services including the go-to-market model for cloud offerings in HP Enterprise services. Kannan has a Bachelors in Computer Science and Engineering from the Indian Institute of Technology Madras, a PhD in Computer Science from the State University of New York at Buffalo, and a Masters in Management from MIT Sloan School of Management.