Abstract:
In the contemporary world, humans face the challenge of the uncontrolled growth and accumulation of huge volumes of data. Advances in ubiquitous technologies contribute greatly to the complexity of this data in terms of volume and heterogeneity. Since customary data mining methods fail to process such data, extended and optimized techniques are needed to deal with big data. Clustering, or segmentation, is one of the most widely used unsupervised data analysis techniques in Data Science; it groups data points based on similarity. Among density-based clustering methods, DENCLUE offers several striking characteristics that other clustering techniques do not possess: it is established on a solid mathematical foundation and works with noisy, high-dimensional feature vectors. In this research, we propose a model for parallelizing DENCLUE using MapReduce, tailored to the big data scenario. We implement a new approach, MapReduce-based DENCLUE (MR-DENCLUE), that can produce clusters of data points residing and running on parallel machines. In our experiments on different datasets (UCI SEEDS, ABALONE, and MFCCs), run on a Hadoop cluster with one master node and four data nodes, the execution times of MR-DENCLUE were recorded at 48.22% and 73.8% of the execution times of DENCLUE, respectively. Our experiments thus confirm that the proposed MR-DENCLUE delivers efficient clustering in terms of speed when tested on different datasets. Moreover, the proposed method is designed to scale across Hadoop data nodes, so additional data nodes can be added to accommodate massive datasets. In this way, we deliver a scheme that addresses velocity and volume, two major characteristics of big data analytics.
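To make the clustering core concrete: DENCLUE models the overall density as a sum of kernel functions centered at the data points and assigns each point to the density attractor reached by hill climbing on that density. The following is a minimal, single-machine sketch in Python/NumPy using a mean-shift-style hill-climbing update; the bandwidth `h`, noise threshold `xi`, and attractor-merging tolerance are illustrative assumptions, not parameters taken from this paper.

```python
import numpy as np

def gaussian_density(x, data, h):
    """Unnormalized Gaussian kernel density estimate at point x."""
    diffs = (data - x) / h
    return np.exp(-0.5 * np.sum(diffs ** 2, axis=1)).sum()

def hill_climb(x, data, h, tol=1e-4, max_iter=100):
    """Climb from x to its density attractor (mean-shift-style update)."""
    for _ in range(max_iter):
        w = np.exp(-0.5 * np.sum(((data - x) / h) ** 2, axis=1))
        x_new = (w[:, None] * data).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

def denclue(data, h=0.5, xi=0.0, attractor_tol=0.5):
    """Label each point by the cluster of its density attractor.

    Points whose attractor density falls below xi are labeled noise (-1).
    All parameter defaults here are illustrative assumptions.
    """
    attractors, labels = [], []
    for p in data:
        a = hill_climb(p.astype(float), data, h)
        if gaussian_density(a, data, h) < xi:
            labels.append(-1)          # low-density attractor -> noise
            continue
        for i, b in enumerate(attractors):
            if np.linalg.norm(a - b) < attractor_tol:
                labels.append(i)       # merge with an existing attractor
                break
        else:
            attractors.append(a)       # new attractor -> new cluster
            labels.append(len(attractors) - 1)
    return np.array(labels)
```

In a MapReduce setting such as the one described here, the hill-climbing work for partitions of the data would run in map tasks on separate data nodes, with a reduce phase merging nearby attractors into global clusters; the sketch above shows only the sequential core of the algorithm.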