Abstract:
Although data mining algorithms in many application areas invariably operate on
centralized data, in practice related information is often acquired and stored at
geographically distributed locations due to organizational or operational
constraints. Centralizing such data before analysis may be neither desirable
nor feasible for most practical applications, owing to efficiency concerns and
limitations on resources such as network bandwidth. Moreover, data
preprocessing and data mining algorithms are known to be both compute- and
data-intensive. The Grid computing community promises to offer infrastructures
that allow on-demand access to distributed resources [1].
The proposed and implemented solution uses Grid infrastructure to
perform mining on the given data sets. In this technique, data is mined locally at
each site and suitable representatives are extracted. These representative
models are then sent to a global server site, where global models are formed
from the local representatives. This approach increases efficiency by
reducing the computational cost and the bandwidth required for transmission.
The experimental results support this hypothesis by showing
the efficiency difference between centralized data mining and distributed
mining with the proposed approach on the same data sets.
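
The local-to-global flow described above can be illustrated with a minimal sketch. The abstract does not specify what form the local representatives take, so the example below assumes the simplest possible one: each site summarizes its data as sufficient statistics (count, sum, sum of squares), and the global server merges these summaries into a global mean and variance. The names LocalModel, mine_locally, and build_global_model are hypothetical and are not taken from the proposed system.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class LocalModel:
    """Hypothetical local representative: sufficient statistics of one site's data."""
    count: int
    total: float
    total_sq: float


def mine_locally(values: Iterable[float]) -> LocalModel:
    """Runs at a data site; only this three-number summary leaves the site."""
    vals = list(values)
    return LocalModel(len(vals), sum(vals), sum(v * v for v in vals))


def build_global_model(local_models: List[LocalModel]) -> Dict[str, float]:
    """Runs at the global server; merges local summaries without seeing raw data."""
    n = sum(m.count for m in local_models)
    if n == 0:
        raise ValueError("no data reported by any site")
    mean = sum(m.total for m in local_models) / n
    # Population variance from merged statistics: E[x^2] - (E[x])^2
    variance = sum(m.total_sq for m in local_models) / n - mean * mean
    return {"count": n, "mean": mean, "variance": variance}


# Three illustrative sites holding different amounts of (made-up) data.
sites = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
global_model = build_global_model([mine_locally(s) for s in sites])
```

In this sketch each site transmits three numbers regardless of how many records it holds, which is the kind of bandwidth saving the approach relies on. For this particular choice of representative the merged model matches what centralized processing would compute; richer representatives such as cluster centroids would generally give only an approximation.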