Isolation Forests for Symbolic Data as a Tool for Outlier Mining
Keywords:
isolation forests, outliers, symbolic data analysis
Abstract
Aim: Outlier detection is a key part of any data analysis. Although the literature offers many definitions of outliers, all of them emphasize that outliers are objects that differ in some way from the other objects in the dataset. Many approaches have been proposed, compared, and analyzed for classical data. However, few papers address outlier detection in symbolic data analysis. This paper aims to propose an adaptation of isolation forests to symbolic data.
Methodology: An isolation forest for symbolic data is used to detect outliers in four artificial datasets with a known cluster structure and a known number of outliers.
Results: The results show that the isolation forest for symbolic data is a fast and efficient tool for outlier mining.
Implications and recommendations: As the isolation forest for symbolic data proved to be an efficient tool for outlier detection in artificial data, further studies should focus on real datasets that contain outliers (e.g. a credit card fraud dataset), and the approach should be compared with other outlier mining tools (e.g. DBSCAN). The authors recommend using the same initial settings for the isolation forest for symbolic data as those proposed for the isolation forest for classical data.
Originality/value: This paper is the first of its kind: rather than addressing outlier detection only in general, it extends the well-known isolation forest model to symbolic data.
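To make the recommendation on initial settings concrete, the following is a minimal sketch, not the authors' implementation. It assumes interval-valued symbolic data encoded as midpoints and half-ranges (an illustrative encoding chosen here, not taken from the paper), hypothetical artificial data, and scikit-learn's classical IsolationForest with the commonly cited classical settings of 100 trees and a subsample size of 256.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical artificial data: objects described by 2 interval-valued variables.
n_inliers, n_outliers = 300, 10
in_centers = rng.normal(loc=0.0, scale=1.0, size=(n_inliers, 2))
in_half = rng.uniform(0.1, 0.5, size=(n_inliers, 2))
out_centers = rng.uniform(6.0, 8.0, size=(n_outliers, 2))
out_half = rng.uniform(0.1, 0.5, size=(n_outliers, 2))

# Lower and upper bounds of each interval; outliers occupy the last rows.
lower = np.vstack([in_centers - in_half, out_centers - out_half])
upper = np.vstack([in_centers + in_half, out_centers + out_half])

# Assumed encoding (not from the paper): midpoint and half-range per interval.
features = np.hstack([(lower + upper) / 2.0, (upper - lower) / 2.0])

# Classical settings: 100 trees, subsample size 256.
forest = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
forest.fit(features)

# score_samples returns lower values for more anomalous objects.
scores = forest.score_samples(features)
most_anomalous = np.argsort(scores)[:n_outliers]
print("Indices of the lowest-scoring objects:", sorted(most_anomalous.tolist()))
```

In this sketch the planted outliers sit in rows 300 to 309, so comparing those indices with the printed ones gives a quick check of how well the forest isolates them under the assumed encoding.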
License
Copyright (c) 2024 Marcin Pełka, Andrzej Dudek
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Accepted 2024-03-13
Published 2024-05-10