Anomaly detection is an important problem that has been studied in a wide variety of fields. This article presents a well-organized overview of anomaly detection.
What is Anomaly Detection?
Anomaly detection is the task of finding patterns in data that do not conform to expected behaviour. In layman’s terms, it is a technique for identifying unusual patterns that deviate from the normal ones. Depending on the application domain, these non-conforming patterns are referred to as anomalies, outliers, peculiarities or contaminants.
Anomalies in data can be caused by credit card fraud, cyber-intrusion, terrorist activity, system failure and similar events. Analysts are interested in these anomalies: the “interestingness”, or real-life relevance, of anomalies is a key feature of anomaly detection.
Anomalies Vs Outliers
Anomalies and outliers are the two terms used most commonly in the context of anomaly detection. Anomalies are patterns of different data within the given data, whereas outliers are merely extreme data points within it. If the data is not grouped properly, anomalies may be dismissed as outliers. To put it another way, a giraffe in a herd of camels is an outlier, yet a dwarf camel in the same herd is not. All outliers are anomalies, but not all anomalies are outliers.
Anomaly Vs Noise
Noise refers to erroneous values or contaminated objects in a dataset: for example, a weight recorded incorrectly, or a lemon mixed in with a package of limes. Noise does not always produce unusual values or objects, so it can be difficult to notice, but it is still noise. Noise isn’t “interesting” except as a way to assess the data’s quality and accuracy. Anomalies may be interesting, provided they are not a result of noise. While noise and anomalies are related, they are distinct concepts.
Types of Anomaly
Anomalies can be classified into the following three categories:
(i) Point anomalies: If a single data instance is anomalous in comparison to the rest of the data, it is referred to as a “point anomaly.”
(ii) Contextual (or conditional) anomalies: A contextual anomaly occurs when a data instance is anomalous in one context but not in another. Time-series data is a common setting: a temperature reading that is ordinary in summer may be anomalous in winter.
(iii) Collective anomalies: A collective anomaly occurs when a group of related data instances is anomalous in relation to the overall data set, even if the individual instances are not anomalous on their own.
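The difference between a point anomaly and a contextual anomaly can be made concrete with a small sketch. It assumes a simple z-score rule with a cutoff of 2.0, and the function names (`point_anomalies`, `contextual_anomalies`) and the temperature readings are hypothetical, chosen only for illustration. It is one possible way to express the idea, not a recommended detector.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscores(values):
    """Z-score of each value relative to the list it belongs to."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def point_anomalies(values, threshold=2.0):
    """Indices of values that are extreme relative to ALL the data."""
    return [i for i, z in enumerate(zscores(values)) if abs(z) > threshold]


def contextual_anomalies(records, threshold=2.0):
    """Indices of (context, value) records that are extreme within their own context."""
    by_context = defaultdict(list)
    for index, (context, value) in enumerate(records):
        by_context[context].append((index, value))
    flagged = []
    for members in by_context.values():
        scores = zscores([value for _, value in members])
        flagged.extend(idx for (idx, _), z in zip(members, scores) if abs(z) > threshold)
    return sorted(flagged)


# Hypothetical temperature readings; the last winter reading (25) looks like summer.
winter = [2, 3, 1, 4, 2, 3, 2, 25]
summer = [24, 26, 25, 27, 25, 26, 24, 25]
records = [("winter", t) for t in winter] + [("summer", t) for t in summer]

global_flags = point_anomalies([t for _, t in records])  # context ignored
context_flags = contextual_anomalies(records)            # context respected
```

Scored against all readings at once, the value 25 is unremarkable, so `global_flags` is empty. Scored only against other winter readings, it stands out, which is what makes it contextual rather than a point anomaly.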
The labels attached to a data instance indicate whether it is normal or anomalous. Because labelling is typically done manually by a human expert, obtaining the labelled training data set requires a significant amount of effort and money. Anomaly detection techniques can operate in one of three modes depending on the labels:
- Supervised Anomaly Detection
- Unsupervised Anomaly Detection
- Semi-Supervised Anomaly Detection
Supervised Anomaly Detection
Techniques in this category assume a training data set with labelled instances for both the normal and the anomaly class. In such cases, a typical approach is to build a predictive model for normal vs. anomaly classes. Two major issues arise in supervised anomaly detection. First, anomalous instances are far fewer than normal instances in the training data, so the class distribution is heavily imbalanced. Second, obtaining accurate and representative labels, especially for the anomaly class, is usually difficult.
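The class-imbalance issue can be illustrated with a minimal sketch. It assumes a single numeric feature, labels of 0 (normal) and 1 (anomaly), and a hypothetical training set with 95 normal and 5 anomalous instances. The `fit_threshold` helper is an illustrative stand-in for a real predictive model: it simply picks the cutoff that maximizes F1 on the labelled data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true anomalies (label 1) that were detected."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the anomaly class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def fit_threshold(values, labels):
    """Pick the cutoff (predict anomaly if value >= cutoff) with the best F1."""
    best_cutoff, best_f1 = None, -1.0
    for cutoff in sorted(set(values)):
        preds = [1 if v >= cutoff else 0 for v in values]
        score = f1_score(labels, preds)
        if score > best_f1:
            best_cutoff, best_f1 = cutoff, score
    return best_cutoff


# Hypothetical, heavily imbalanced training set: 95 normal, 5 anomalous.
values = [10 + (i % 5) for i in range(95)] + [30, 32, 35, 31, 40]
labels = [0] * 95 + [1] * 5

# A "model" that always predicts normal looks accurate but finds nothing.
always_normal = [0] * len(labels)
baseline_accuracy = accuracy(labels, always_normal)
baseline_recall = recall(labels, always_normal)

# Selecting the cutoff by F1 focuses on the rare class instead.
cutoff = fit_threshold(values, labels)
model_preds = [1 if v >= cutoff else 0 for v in values]
model_recall = recall(labels, model_preds)
```

The always-normal baseline scores high on accuracy while detecting no anomalies at all, which is why metrics focused on the anomaly class (recall, precision, F1) are usually more informative here than plain accuracy.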
Unsupervised Anomaly Detection
The unsupervised mode doesn’t require any labelled training data and is hence the most extensively used mode. Techniques in this category rely on the assumption that normal instances are far more frequent than anomalies in the test data. If this assumption does not hold, such techniques will suffer from a high false alarm rate.
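As a rough illustration of the unsupervised mode, the sketch below scores each point by its distance to its k-th nearest neighbour, using no labels at all. This is one common family of unsupervised techniques, chosen here as an example rather than taken from the article. The names `knn_scores` and `top_anomalies`, the value k=3, and the sample points are all assumptions.

```python
import math


def knn_scores(points, k=3):
    """Anomaly score of each point: distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_anomalies(points, k=3, n=1):
    """Indices of the n points with the largest k-NN distance."""
    scores = knn_scores(points, k)
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:n]


# Hypothetical unlabelled data: a dense cluster plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8), (8, 8)]
suspects = top_anomalies(points, k=3, n=1)
```

The scoring only works because the isolated point is rare. If anomalies were as frequent as normal instances and formed their own dense cluster, their neighbour distances would be small too, and the approach would either miss them or raise false alarms, which is the failure mode described above.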
Semi-Supervised Anomaly Detection
Semi-supervised techniques assume that the training data contains labelled instances only for the normal class. Since they do not require labels for the anomaly class, they are more widely applicable than supervised techniques.
For example, in spacecraft fault detection, an anomaly scenario would signify an accident, which is not easy to model. The typical approach used in such techniques is to build a model for the class corresponding to normal behaviour, and use the model to identify anomalies in the test data.
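The "model normal behaviour, then test against it" approach can be sketched as follows. The sketch assumes normal behaviour is summarized by the mean and standard deviation of a single sensor reading, and that anything more than three standard deviations away counts as anomalous. The `NormalModel` class, the cutoff `k`, and the readings are hypothetical.

```python
from statistics import mean, stdev


class NormalModel:
    """Toy model of normal behaviour, fitted on normal-class data only."""

    def __init__(self, k=3.0):
        self.k = k  # assumed cutoff, in standard deviations
        self.mu = None
        self.sigma = None

    def fit(self, normal_values):
        """Learn the centre and spread of the normal class."""
        self.mu = mean(normal_values)
        self.sigma = stdev(normal_values)
        return self

    def is_anomaly(self, x):
        """Flag a reading that falls outside the learned normal range."""
        return abs(x - self.mu) > self.k * self.sigma


# Hypothetical training readings, all recorded during normal operation.
normal_readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7, 20.0, 19.9]
model = NormalModel(k=3.0).fit(normal_readings)

# Unlabelled test readings, scored against the model of normal behaviour.
test_readings = [20.1, 19.6, 23.5, 20.4]
flags = [model.is_anomaly(x) for x in test_readings]
```

No example of a fault is needed to fit the model, which matches the spacecraft scenario: only the reading that departs from everything seen during normal operation is flagged.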
Considerations while detecting anomalies for applications
For every application of anomaly detection, we must identify and evaluate four crucial characteristics:
- The notion of anomaly
- Nature of the data
- Challenges associated with detecting anomalies
- Existing anomaly detection techniques