Improving

Cluster analysis is one of the main methods of data analysis, and the k-means clustering algorithm is a main technique used in many practical applications. But the original k-means algorithm is computationally expensive, and the quality of the final result depends heavily on the choice of the initial centroids, which are selected at random. Many improvements have already been proposed to improve the performance of k-means, but most of them require supplementary inputs, such as threshold values for the number of data points in a set. This article proposes a new method to find better initial centroids and an efficient way to assign data points to appropriate clusters with reduced time complexity. The algorithm is easy to implement and improves the efficiency of k-means. It requires only a simple data structure to maintain certain information in each iteration for use in the next iteration.

INTRODUCTION

Advances in scientific data collection methods have resulted in the large-scale accumulation of promising data pertaining to diverse fields of science. Owing to the development of novel techniques for generating and collecting data, the rate of growth of scientific databases has become tremendous. Hence it is practically impossible to extract useful information from them by using conventional database analysis techniques. Effective mining methods are absolutely essential to unearth implicit information from huge databases.

Clustering is an important tool for a variety of applications in data mining, statistical data analysis, data compression and vector quantization. Clustering is a division of data into groups of similar objects. Each group consists of objects that are similar to one another and dissimilar to objects of other groups. From the machine learning perspective, clustering can be viewed as unsupervised learning of concepts. Unsupervised machine learning means that clustering does not depend on predefined classes and training examples while classifying the data objects.

[...] high dimensionality, insensitivity to order of attributes, [...] interoperability and usability. Cluster analysis is one of the primary data analysis tools in data mining. Clustering algorithms are mainly divided into two categories: hierarchical algorithms and partition algorithms. A hierarchical clustering algorithm divides the given data set into smaller subsets in a hierarchical fashion. A partition clustering algorithm partitions the data set into the desired number of sets in a single step.

Numerous methods have been proposed to solve the clustering problem. The most popular clustering method is the k-means clustering algorithm, developed by MacQueen in 1967. Its simplicity has led to its use in several fields. The k-means clustering algorithm is prominent because of its ability to cluster massive data rapidly and efficiently. However, the computational complexity of the original k-means algorithm is very high, especially for large data sets. Moreover, this algorithm results in different types of clusters depending on the random choice of initial centroids. The effectiveness of a [...]

KMEANS CLUSTERING ALGORITHM

The process classifies a given data set into k disjoint clusters, where the value of k is fixed in advance. The algorithm consists of two separate phases. The first phase defines k centroids, one for each cluster. The next phase takes each point belonging to the data set and associates it with the nearest centroid. Euclidean distance is generally used to determine the distance between data points and centroids. When all the points are included in some cluster, the first step is completed and an initial grouping is ready. At this point we must recalculate the centroids, since the inclusion of new points can lead to a change in the cluster centroids. Once the k new centroids are found, a new binding is created between the same data points and the nearest new centroid, generating a loop.

As a result of this loop, the k centroids may change their position gradually. In the end, a situation is reached where the centroids no longer move. This indicates the convergence criterion for clustering.
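
The two quantities used in this loop can be written out explicitly. The notation below is an assumption introduced for illustration and does not appear in the original text: x is a data point with p attributes, c_j is the centroid of cluster j, and S_j is the set of points currently assigned to cluster j.

```latex
d(x, c_j) = \sqrt{\sum_{m=1}^{p} \left( x_m - c_{jm} \right)^2},
\qquad
c_j = \frac{1}{|S_j|} \sum_{x \in S_j} x
```

Each point is assigned to the cluster whose centroid minimizes d(x, c_j), and each centroid is then recomputed as the mean of its assigned points.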

Algorithm 1: The k-means clustering algorithm

Input:
D = {d1, d2, ..., dn} // set of n data items
k // number of desired clusters

Output:
A set of k clusters.

Steps:
1. Arbitrarily choose k data items from D as initial centroids;
2. Repeat
   Assign each item di to the cluster which has the closest centroid;
   Calculate the new mean for each cluster;
   Until the convergence criterion is met.
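
The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation. The function name `kmeans`, the `max_iter` and `seed` parameters, the choice of "no item changes cluster" as the convergence criterion, and the handling of empty clusters are all assumptions made here, since the text does not specify them.

```python
import math
import random


def kmeans(data, k, max_iter=100, seed=None):
    """Plain k-means on a list of equal-length numeric tuples (a sketch)."""
    rng = random.Random(seed)
    # Step 1: arbitrarily choose k data items as the initial centroids.
    centroids = rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        # Step 2a: assign each item to the cluster with the closest centroid
        # (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(point, centroids[j]))
            for point in data
        ]
        # Convergence criterion (assumed): no item changed its cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 2b: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels
```

Calling `kmeans(points, 10, seed=1)` on a list of tuples returns the ten centroids and one cluster index per point. Because the initial centroids are chosen at random, different seeds can produce different clusterings.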

K-means appears to give partitions which are reasonably efficient in the sense of within-class variance, corroborated to some extent by mathematical analysis and practical experience. Also, the k-means procedure is easily programmed and is computationally economical, so that it is feasible to process very large samples on a digital computer.

ENHANCED ALGORITHM (Improved k-means Algorithm)

Input:
D = {d1, d2, ..., dn} // set of n data items
k // number of desired clusters

Output:
A set of k clusters.

Steps:
Phase 1: Determine the initial centroids of the clusters by using Algorithm 3.
Phase 2: Assign each data point to the appropriate cluster by using Algorithm 4.
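
Algorithms 3 and 4 are not reproduced in this text, so the sketch below is not the authors' procedure. It only illustrates, under stated assumptions, how a simple data structure can carry information from one iteration to the next during the assignment phase. The function name `reassign_with_cache` and the lists `labels` and `nearest_dist` are hypothetical, and the rule "keep a point in its cluster if it is no farther from that cluster's updated centroid than it was before" is an assumed heuristic, not a guarantee that the kept cluster is the closest one.

```python
import math


def reassign_with_cache(data, centroids, labels, nearest_dist):
    """One assignment pass that reuses the previous pass's results.

    labels[i] is the cluster of data[i] from the previous iteration and
    nearest_dist[i] its distance to that cluster's previous centroid.
    Both lists are updated in place. Returns how many points needed a
    full search over all centroids.
    """
    full_searches = 0
    for i, point in enumerate(data):
        d_same = math.dist(point, centroids[labels[i]])
        if d_same <= nearest_dist[i]:
            # Assumed shortcut: the point did not move away from its own
            # centroid, so it stays in the same cluster.
            nearest_dist[i] = d_same
        else:
            # Otherwise compare against every centroid, as in plain k-means.
            full_searches += 1
            best = min(range(len(centroids)),
                       key=lambda j: math.dist(point, centroids[j]))
            labels[i] = best
            nearest_dist[i] = math.dist(point, centroids[best])
    return full_searches
```

The point of such a cache is that many points need only one distance computation per iteration instead of k, which is one way the time spent in the assignment phase could be reduced.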

RESULTS

The standard k-means algorithm is implemented first, followed by the improved k-means algorithm. The improved k-means algorithm can be used to determine the cluster centroids. The experimental results for the k-means algorithm, which takes more time because of its greater complexity, are discussed for different data sets. The resulting clusters of the k-means algorithm on normally distributed data are presented. Normally distributed data points are used because they are easy to generate and convenient for our data sets. The number of clusters and the number of data points are given by the user during the execution of the program. The number of data points is 1000 and the number of clusters is 10 (k = 10).

The algorithm is run repeatedly to obtain an efficient output. The cluster centres (centroids) are calculated as the mean value of each cluster, and clusters are formed depending on the distance between the data points. For different input data points, the algorithm provides different types of output.

Improved k-means is better than k-means in