Clustering, when combined with association rules and neural networks, can help
predict future events with a probability that is difficult to reach through
unaided human analysis.
Clustering algorithms help find hidden clusters in data, and interesting
applications arise wherever clusters form, whether they are groups of people
with similar habits or groups of similar events.
Data mining helps reveal patterns that would normally go unnoticed in a given
data set.
Search result grouping: In the intelligent grouping of files and websites,
clustering may be used to create a more relevant set of search results than
conventional search engines such as Google provide. A number of web-based
clustering tools, such as Clusty, are currently available.
Slippy map optimization: Flickr's map of photos and other map sites use
clustering to reduce the number of markers on a map. This makes the map faster
to display and reduces visual clutter.
IMRT segmentation: Clustering can be used to divide a fluence map into distinct
regions for conversion into deliverable fields in MLC-based Radiation Therapy.
Grouping of shopping items: Clustering can be used to group all the shopping
items available on the web into a set of unique products.
Mathematical chemistry: Clustering is used to find structural similarity among
compounds; for example, 3000 chemical compounds were clustered in the space of
90 topological indices.
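The marker-reduction idea mentioned for slippy maps can be illustrated with a very simple grid-based grouping. This is only a sketch under stated assumptions: the function name `cluster_markers`, the `cell_size` parameter and the sample coordinates are hypothetical, and the grid approach is one possible technique, not necessarily what Flickr or any other map site uses.

```python
from collections import defaultdict


def cluster_markers(points, cell_size):
    """Group (x, y) markers by the square grid cell they fall into.

    Hypothetical helper: each occupied cell becomes one summary marker
    placed at the mean position of the points inside it.
    """
    cells = defaultdict(list)
    for x, y in points:
        key = (int(x // cell_size), int(y // cell_size))
        cells[key].append((x, y))

    clusters = []
    for members in cells.values():
        cx = sum(p[0] for p in members) / len(members)
        cy = sum(p[1] for p in members) / len(members)
        clusters.append({"center": (cx, cy), "count": len(members)})
    return clusters


# Made-up marker positions: three close together, two far away.
markers = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (10.0, 10.0), (11.0, 10.5)]
summary = cluster_markers(markers, cell_size=5.0)
for marker in summary:
    print(marker)
```

Five markers collapse into two summary markers here. A larger `cell_size` merges more aggressively, which is roughly what happens when a map is zoomed out.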
Clustering
Data clustering algorithms can be hierarchical or partitional. Hierarchical
algorithms find successive
clusters using previously established clusters, whereas partitional algorithms
determine all clusters at once. Hierarchical algorithms can be agglomerative
("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each
element as a separate cluster and merge them into successively larger clusters.
Divisive algorithms begin with the whole set and proceed to divide it into
successively smaller clusters.
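The agglomerative ("bottom-up") procedure described above can be sketched in a few lines. The text does not say how the distance between two clusters is measured, so this sketch assumes single linkage (the distance between the closest pair of members) and Euclidean distance. The function names and sample points are illustrative only.

```python
import math


def single_linkage(a, b):
    """Distance between two clusters: the closest pair of members (assumed)."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerative(points, num_clusters):
    """Start with one cluster per point and merge the two closest clusters
    until only num_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters


# Made-up sample data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 0.0)]
result = agglomerative(points, num_clusters=3)
print(result)
```

This brute-force version re-scans every pair of clusters on each merge, which is fine for a handful of points but too slow for large data sets. A divisive algorithm would run in the opposite direction, starting from one cluster that holds all points.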
Two-way clustering, co-clustering or biclustering are clustering methods where
not only the objects are clustered but also the features of the objects, i.e.,
if the data is represented in a data matrix, the rows and columns are clustered
simultaneously.
Distance measure
An important step in any clustering is to select a distance measure, which will
determine how the similarity of two elements is calculated. This will influence
the shape of the clusters, as some elements may be close to one another
according to one distance and farther away according to another. For example, in
a 2-dimensional space, the distance between the point (x=1, y=0) and the origin
(x=0, y=0) is always 1 according to the usual norms, but the distance between
the point (x=1, y=1) and the origin is 2, the square root of 2 (about 1.414),
or 1 under the 1-norm, 2-norm or infinity-norm distance, respectively.
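The example above can be checked numerically. The following sketch computes the three norms of a 2-dimensional point measured from the origin; the function name `norms_from_origin` is only illustrative.

```python
import math


def norms_from_origin(x, y):
    """Return the 1-norm, 2-norm and infinity-norm of the point (x, y)."""
    one_norm = abs(x) + abs(y)        # Manhattan / taxicab
    two_norm = math.hypot(x, y)       # Euclidean
    inf_norm = max(abs(x), abs(y))    # maximum norm
    return one_norm, two_norm, inf_norm


print(norms_from_origin(1, 0))  # all three norms agree: 1
print(norms_from_origin(1, 1))  # 2, about 1.414, and 1
```

The three norms agree along an axis but differ on the diagonal, which is why the choice of distance measure can change which elements count as neighbours.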
Common distance functions:
The Euclidean distance (also called distance as the crow flies or 2-norm
distance). A review of cluster analysis in health psychology research found that
the most common distance measure in published studies in that research area is
the Euclidean distance or the squared Euclidean distance.
The Manhattan distance (also called taxicab norm or 1-norm)
The maximum norm
The Mahalanobis distance corrects data for different scales and correlations in
the variables
The angle between two vectors can be used as a distance measure when clustering
high dimensional data. See Inner product space.
The Hamming distance (an edit distance restricted to substitutions) measures the
minimum number of substitutions required to change one member into another of
equal length
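Two of the distance functions listed above are easy to write directly. This is a minimal sketch using only the standard library; the function names and sample inputs are chosen for illustration.

```python
import math


def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def angular_distance(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    cosine = dot / (norm_u * norm_v)
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding error
    return math.acos(cosine)


print(hamming_distance("karolin", "kathrin"))      # three positions differ
print(angular_distance((1.0, 0.0), (0.0, 1.0)))    # a right angle, pi/2
```

The angle ignores vector length and depends only on direction, which is one reason it is used for high-dimensional data.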
A cluster diagram is shown on the left.
Biology
In biology, clustering has many applications:
In imaging, data clustering may take different forms depending on the data
dimensionality. For example, the SOCR EM Mixture model segmentation activity and
applet show how to obtain point, region or volume classification using the
online SOCR computational libraries.
In the fields of plant and animal ecology, clustering is used to describe and to
make spatial and temporal comparisons of communities (assemblages) of organisms
in heterogeneous environments; it is also used in plant systematics to generate
artificial phylogenies or clusters of organisms (individuals) at the species,
genus or higher level that share a number of attributes.
In computational biology and bioinformatics:
In transcriptomics, clustering is used to build groups of genes with related
expression patterns (also known as coexpressed genes). Often such groups contain
functionally related proteins, such as enzymes for a specific pathway, or genes
that are co-regulated. High-throughput experiments using expressed sequence tags
(ESTs) or DNA microarrays can be a powerful tool for genome annotation, a
general aspect of genomics.
In sequence analysis, clustering is used to group homologous sequences into gene
families. This is a very important concept in bioinformatics, and evolutionary
biology in general. See evolution by gene duplication.
In high-throughput genotyping platforms, clustering algorithms are used to
automatically assign genotypes.
Market research
Cluster analysis is widely used in market research when working with
multivariate data from surveys and test panels. Market researchers use cluster
analysis to partition the general population of consumers into market segments
and to better understand the relationships between different groups of
consumers/potential customers.
Segmenting the market and determining target markets
Product positioning
New product development
Selecting test markets (see: experimental techniques)
Other applications
Social network analysis: In the study of social networks, clustering may be used
to recognize communities within large groups of people.
Image segmentation: Clustering can be used to divide a digital image into
distinct regions for border detection or object recognition.
Data mining: Many data mining applications involve partitioning data items into
related subsets; the marketing applications discussed above represent some
examples. Another common application is the division of documents, such as World
Wide Web pages, into genres.
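As an illustration of the image segmentation use listed above, the sketch below splits grayscale pixel intensities into a dark and a bright region. It uses k-means, a common partitional algorithm that this text does not name specifically, so treat it as one possible choice. The pixel values, starting centers and function name are made up for the example.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on scalar values, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: put each value in the group of its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its old center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    labels = [
        min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        for v in values
    ]
    return centers, labels


# Hypothetical grayscale intensities (0 = black, 255 = white).
pixels = [12, 15, 10, 200, 210, 205, 14, 198]
centers, labels = kmeans_1d(pixels, centers=[0, 255])
print(centers)
print(labels)
```

Each pixel receives the label of its nearest center, so the label list marks which region the pixel belongs to. A real image would normally be clustered on color or texture features as well as intensity.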