Elipsa

How Much Data Does AI Need?

One of the biggest hang-ups for people thinking about where to begin with machine learning is the question: how much data do I need to train a machine learning algorithm?

The assumption is that you always need more than you have. Many never get off the ground on their path towards AI efficiencies because they rule themselves out due to a perceived lack of data.

However, you may be surprised to know that this is not always the case, and that you can even get up and running with some use cases without any historical data at all.

What Do We Mean by AI for IoT?

To start, there are many sub-disciplines of artificial intelligence. Computer vision, natural language processing, and machine learning are all slightly different from each other, and each has a different answer to the question of data requirements.

For this post, we are focused on machine learning for analyzing sensor and machine data, and particularly on how much of that data you need to build an effective and accurate model.

To further break it down, there are various types of machine learning algorithms that you can apply, each having its own applicability to a given use case. The three that we will explore are predicting an event, predicting a value, and predicting outliers.

The answer to how much data you need to build an effective model lies partly in which of those three algorithm types is being applied to your use case.

Types of Machine Learning for IoT

Predicting an event could be used for monitoring a threshold, such as whether CO2 will exceed 800 ppm in the next 30 minutes, or for diagnosing a failure, such as whether an HVAC failure was due to a coil problem. In machine learning speak, these algorithms are called classification algorithms.
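To make the threshold example concrete, here is a minimal sketch of how labels for such a classification problem could be built from raw readings. The function name, the 10-minute sampling interval, and the readings themselves are assumptions made up for illustration. Only the 800 ppm threshold and the 30-minute horizon come from the example above.

```python
from datetime import datetime, timedelta


def label_threshold_events(readings, threshold_ppm=800.0,
                           horizon=timedelta(minutes=30)):
    """Return one 0/1 label per reading.

    A reading is labeled 1 if any later reading within `horizon`
    exceeds `threshold_ppm`, otherwise 0. `readings` must be a list of
    (timestamp, ppm) tuples sorted by timestamp.
    """
    labels = []
    for i, (ts, _) in enumerate(readings):
        exceeded = 0
        for later_ts, later_ppm in readings[i + 1:]:
            if later_ts - ts > horizon:
                break  # sorted by time, so nothing further can qualify
            if later_ppm > threshold_ppm:
                exceeded = 1
                break
        labels.append(exceeded)
    return labels


# Hypothetical CO2 readings taken every 10 minutes.
start = datetime(2024, 1, 1, 9, 0)
ppm_values = [600, 650, 700, 760, 820, 780, 700]
readings = [(start + timedelta(minutes=10 * i), ppm)
            for i, ppm in enumerate(ppm_values)]

labels = label_threshold_events(readings)
print(labels)
```

Each label answers the question for its own row: given the readings at this moment, did CO2 go over the threshold within the next 30 minutes? A classification algorithm would then be trained to predict that column from the sensor values.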

Predicting a value could be used for predicting a specific number, such as how many hours of uptime remain or what percentage of contaminants will exist in the final product based on current machine settings. In machine learning speak, these algorithms are called regression algorithms.

Finally, predicting an outlier can be used for areas such as predictive maintenance, where the machine learning algorithm takes in data that is known to be from the machine running under normal circumstances so that it can learn the state of normal and effectively monitor for abnormal activity. In addition to outlier detection, this is often called anomaly detection.

Example Use Case: Predictive Maintenance of HVAC

Machine learning on your sensor and machine data is all about finding patterns that can be used to predict answers to the question you train it on. The amount of data needed for training is based on the complexity of that question, or what you’re looking to predict, and whether there are clear patterns in the historical data provided.

Let’s explore each of the three types of machine learning algorithms under the specific use case of machine failure. In all cases, we will be looking at an HVAC system with 5 data points: vibration, motor RPM, intake temperature, discharge temperature, and pressure.

Predicting an Event: Failure Diagnosis

For predicting an event, we are not going to look to predict failure in advance. Instead, we would look to use AI to instantly diagnose the failure once it occurs. Knowing the cause right away streamlines the fix, which could save a tremendous amount of time and money.

In order to predict or classify an event, you need what is known as labeled data. So, if you want to visualize it, think of a spreadsheet with your 5 sensors as columns, and then a sixth column with the labels or answers to your question. So, for example, if you want to know if the HVAC stopped because of a failed coil, your “answers” column needs to align points in time where the HVAC in fact failed because of a coil with the sensor readings at the point of failure. If you do not have instances of failed coils in your data, you can’t train a model to predict that as a cause.

So, under this circumstance, how much data do we need? It depends on the complexity of the question and the complexity of the data.

In the HVAC example, if we’re trying to diagnose one particular cause of failure, such as a coil, we would need far fewer examples than if we were trying to predict 5 different kinds of failure. In addition, if we expanded from the 5 sensor readings to 7 or 10 different sensors, the complexity of the data would grow in a way that would again require more data to be able to find patterns.

In all likelihood, you need to start with at least a few hundred examples of each failure, but this is certainly an instance where more data is better.
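The labeled "spreadsheet" described above, and the check for whether each failure type has enough examples, can be sketched in a few lines. Everything here is illustrative: the sensor values, the failure labels "motor" and "leak", and the helper names are made up, and the toy threshold of 2 stands in for the few hundred examples per failure suggested above.

```python
from collections import Counter

# Columns: vibration, rpm, intake_temp, discharge_temp, pressure, label.
# The sixth column holds the "answer" for each row. All values are made up.
rows = [
    (0.82, 1180, 21.5, 13.0, 2.1, "coil"),
    (0.79, 1175, 22.0, 13.4, 2.0, "coil"),
    (1.95, 640, 21.8, 19.9, 2.2, "motor"),
    (0.80, 1190, 21.9, 12.8, 0.9, "leak"),
]


def label_counts(rows):
    """Count how many labeled examples exist for each failure cause."""
    return Counter(row[-1] for row in rows)


def underrepresented(rows, min_examples):
    """Return the failure causes with fewer than `min_examples` rows."""
    counts = label_counts(rows)
    return sorted(label for label, n in counts.items() if n < min_examples)


counts = label_counts(rows)
too_few = underrepresented(rows, min_examples=2)
print(counts)
print(too_few)
```

A failure cause that never appears in the label column has a count of zero, which is the situation described above where a model cannot be trained to predict it at all.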

Predicting a Value: Time Till Failure

In this instance, we are going to take the approach of predicting how many hours until a failure.

Again, in this case, we would visualize the training data as a spreadsheet with a column for each of the five sensor values. Similar to predicting an event, the sixth column would be what we are trying to predict, but here it would be the amount of time until the next failure for a given point-in-time snapshot.

So for example, if the machine failed at 5 pm, our training data would include sensor readings in a particular row at say 1 pm and the column to train on would be the value of 4 since there are 4 hours remaining until the machine failed.
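The 1 pm and 5 pm example can be worked through in code. This is a small sketch rather than a prescribed method: the function name and the calendar date are assumptions for illustration, and it assumes you have a list of known failure timestamps to label against.

```python
from datetime import datetime


def hours_until_failure(snapshot_time, failure_times):
    """Hours from a sensor snapshot to the next known failure.

    Returns None when no failure is recorded at or after the snapshot,
    since such a row cannot be given a time-till-failure label.
    """
    upcoming = [f for f in failure_times if f >= snapshot_time]
    if not upcoming:
        return None
    return (min(upcoming) - snapshot_time).total_seconds() / 3600.0


# Hypothetical date; the machine failed at 5 pm.
failure_times = [datetime(2024, 3, 4, 17, 0)]
snapshot = datetime(2024, 3, 4, 13, 0)  # sensor readings taken at 1 pm

label = hours_until_failure(snapshot, failure_times)
print(label)
```

Rows recorded after the last known failure get no label, which is one reason this approach needs multiple observed failures before there is much usable training data.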

As you can guess based on what we discussed for predicting an event, in order to train a model to predict time till failure, you need examples of failures so you can provide the label of how much time remained at each sensor snapshot.

So, again, back to the question of how much data. In this case, you need examples of failures, and you need multiple failures. If you have a machine that goes weeks or even months between failures, most of your data will come from the time between failures, so you will need to collect a large number of data points in order to train a model with enough failures in it.

In addition, unlike predicting an event such as the cause of failure, which has up to say 5 options in our example, you might be looking to predict a number of hours ranging between say 0 and 1000. So, the number of possible answers when predicting a value could be drastically higher, requiring a lot more data to train an accurate model.

Outlier Detection: Intelligent Asset Monitoring

This brings us to outlier detection, saving the best for last. To this point, predicting an event or value sounds like a daunting task in the potential amount of labeled data needed for training. However, if you have the data or are able to collect it, the benefits to the predictions could be enormous.

In addition, the advantage of IoT is that machines and sensors are tracking data every second, generating a lot of data in a short period of time.

Outlier detection can be the saving grace of data requirements and the easiest example of machine learning to get you going. With outlier detection, we would visualize the data the same way as before as rows of 5 columns with each column representing our point in time sensor readings. The difference is that for outlier detection we do not provide labeled data.

This is because outlier detection is not trying to use the sensors to learn a particular answer but instead it is trying to learn the patterns between the sensor readings.

With outlier detection, you are effectively teaching the system what normal looks like so that you can make predictions against that model in real-time to monitor for abnormal events.
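As a rough illustration of learning "normal" from unlabeled data, the sketch below fits a per-sensor mean and standard deviation and flags any reading that strays too far from them. This is a deliberately simple stand-in, not the method any particular product uses: it checks each sensor on its own and so ignores the relationships between sensors that a real outlier detection model would learn. The sensor values, the function names, and the limit of 3 standard deviations are all assumptions.

```python
from statistics import mean, stdev

# Unlabeled rows recorded while the machine is known to be running normally.
# Columns: vibration, rpm, intake_temp, discharge_temp, pressure (made up).
normal_rows = [
    (0.80, 1200, 21.0, 13.0, 2.0),
    (0.82, 1195, 21.5, 13.2, 2.1),
    (0.78, 1205, 20.8, 12.9, 1.9),
    (0.81, 1198, 21.2, 13.1, 2.0),
    (0.79, 1202, 21.1, 13.0, 2.0),
]


def fit_normal(rows):
    """Learn a (mean, standard deviation) pair for each sensor column."""
    columns = list(zip(*rows))
    return [(mean(col), stdev(col)) for col in columns]


def is_outlier(row, model, z_limit=3.0):
    """Flag a row if any sensor is more than z_limit std devs from its mean."""
    return any(abs(x - mu) > z_limit * sigma
               for x, (mu, sigma) in zip(row, model))


model = fit_normal(normal_rows)

typical = (0.80, 1201, 21.1, 13.05, 2.0)
high_vibration = (1.40, 1199, 21.0, 13.0, 2.0)

print(is_outlier(typical, model))
print(is_outlier(high_vibration, model))
```

Note that no label column appears anywhere: the only input is sensor readings from a period believed to be normal, which is why this kind of model can be started with far less preparation.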

So, for the question of how much data, it depends on how normal your training data is. If the sensor readings come from a period when you know the machine was functioning properly, then you can build the model with smaller amounts of data. For example, we have seen instances of being able to train an initial model after only a few days of data collection. However, we recommend collecting at least a month’s worth of data.

As with predicting an event or a value, if the number of sensors that you are monitoring expanded to 7 or 10, the complexity of the data would increase to where you would need more data to train on. However, the key is that outlier detection use cases do not require the time-consuming task of labeling data, and they can generally get started with less initial data.


There are certainly use cases where large amounts of historical data are necessary in order to train an accurate predictive model. However, it is not always the case. Certainly, more data is often better but the amount of data that you need depends on the complexity of the use case and data.

As we noted, anomaly detection for predictive maintenance oftentimes requires the lightest lift to get started. For other use cases, there might be some trial and error involved to determine if you have enough data to be effective. Luckily, Elipsa enables users to build and test models for free.

With our no-code solution, you can upload your data to predict an event or value and instantly see the results. If the level of accuracy is not to your requirements, you likely need more data.

Don't let the misconception of data quantity stop you from utilizing AI to extract more value from your IoT data.

To get started exploring no-code AI, check us out at

