Clustering for web information retrieval
Supervised learning: a program that performs a task as well as humans do. There is a well-defined task or target function. The training data is provided by humans. Performance is measured by error/accuracy on the task.
Unsupervised learning: a program that finds some kind of structure in the data. The task is not clearly defined. No labeled training data is provided, and there is no single performance measure, though some evaluation metrics exist.
Clustering is the most common form of unsupervised learning. It is the process of grouping a set of physical or abstract objects into classes of similar objects.
It can be used in IR (information retrieval) to improve recall in search applications and to make search results easier to navigate.
Example 1: Improving Recall
Cluster hypothesis: documents with similar text are related. Thus, when a query matches a document D, also return the other documents in the cluster containing D.
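The recall-improvement idea above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name expand_with_clusters, the two lookup tables, and the document IDs are hypothetical, and it assumes the documents have already been clustered by some earlier step.

```python
def expand_with_clusters(matched_docs, doc_to_cluster, cluster_to_docs):
    """Return the matched documents followed by the other members of their clusters."""
    results = list(matched_docs)
    seen = set(matched_docs)
    for doc in matched_docs:
        cluster_id = doc_to_cluster.get(doc)
        if cluster_id is None:
            # Document was never clustered: nothing to add for it.
            continue
        for neighbor in cluster_to_docs[cluster_id]:
            if neighbor not in seen:
                seen.add(neighbor)
                results.append(neighbor)
    return results


# Hypothetical precomputed clustering: cluster id -> member documents.
cluster_to_docs = {0: ["d1", "d2", "d3"], 1: ["d4", "d5"]}
doc_to_cluster = {
    doc: cid for cid, docs in cluster_to_docs.items() for doc in docs
}

# The query matched only d2; its cluster mates d1 and d3 are returned too.
expanded = expand_with_clusters(["d2"], doc_to_cluster, cluster_to_docs)
print(expanded)
```

The directly matched documents stay at the front of the result, so the added cluster members raise recall without displacing the original hits.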
1. Flat clustering divides objects into groups (clusters).
2. Hierarchical clustering organizes clusters in a subsuming hierarchy.
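To make the difference between the two kinds of clustering concrete, here is a minimal sketch, assuming one-dimensional integer points and single-link (closest pair) distance. It uses bottom-up agglomerative merging, which is one common way to build a hierarchy and is not named in these notes. The function names and sample points are hypothetical. Stopping the merging at k clusters yields a flat clustering, while the full merge history describes the hierarchy.

```python
def single_link_distance(a, b):
    """Distance between two clusters: the smallest gap between any pair of members."""
    return min(abs(x - y) for x in a for y in b)


def agglomerative(points, k=1):
    """Merge the two closest clusters repeatedly until k clusters remain.

    Returns (clusters, merges), where merges lists the pairs joined, in order.
    """
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges


points = [1, 2, 10, 11, 20]

# Flat clustering: stop at 3 groups.
flat, _ = agglomerative(points, k=3)

# Hierarchical clustering: merge down to one root and keep the merge history.
_, hierarchy = agglomerative(points, k=1)

print(sorted(sorted(c) for c in flat))
for left, right in hierarchy:
    print(left, "+", right)
```

Each entry in the merge history is a parent cluster subsuming two children, which is the subsuming hierarchy from item 2. The pairwise search is quadratic per merge, so this is suitable only for small illustrations.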