By Taylor J.

The different approaches for outlier detection can be broadly categorized into three types [54]:
• Statistical approach: The data distribution or the probability model of the data set is the primary factor.
• Distance-based approach: The classical definition of an outlier in this context is: an object O in a data set T is a DB(p, D)-outlier if at least a fraction p of the objects in T lie at a distance greater than D from O [77].
• Deviation-based approach: Deviations from the main characteristics of the objects are considered.
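The DB(p, D) definition can be checked directly by counting, for each object, how many objects lie farther away than D. The following is a minimal sketch, assuming points are numeric tuples and distance is Euclidean (the text does not fix a metric); the function names `is_db_outlier` and `db_outliers` are hypothetical. The fraction is taken over all of T, including O itself, which is one literal reading of the definition.

```python
import math

def is_db_outlier(index, points, p, D):
    """Return True if points[index] is a DB(p, D)-outlier.

    Assumption: Euclidean distance; the fraction is computed over all
    objects in the data set, including the object itself.
    """
    o = points[index]
    # Count the objects lying at a distance greater than D from o.
    far = sum(1 for q in points if math.dist(o, q) > D)
    return far >= p * len(points)

def db_outliers(points, p, D):
    """Return the indices of all DB(p, D)-outliers in points."""
    return [i for i in range(len(points)) if is_db_outlier(i, points, p, D)]

# Four points clustered near the origin and one far away.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
outlier_indices = db_outliers(data, p=0.8, D=3)
```

This brute-force check compares every pair of objects, so it costs quadratic time in the size of the data set; it is meant to illustrate the definition, not to scale.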

The composite TF–IDF weight is the product of the TF and IDF components for a particular term. The TF component gives more importance to terms that occur frequently in a document. However, if a term occurs frequently in most of the documents in the document set, then in all probability it does little to distinguish one document from another. The IDF factor accounts for this by reducing the weight of such terms. The above schemes are based strictly on the terms occurring in the documents and are referred to as the vector space representation. An alternative to this strategy is latent semantic indexing (LSI).
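The product of the two components can be illustrated with a short sketch. The text does not give exact formulas, so this assumes one common choice: TF as the raw count of a term in a document and IDF as log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    Assumptions: TF is the raw term count and IDF is log(N / df).
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "sat"],
    ["the", "bird"],
]
w = tf_idf(docs)
```

Here "the" occurs in every document, so its IDF is log(1) = 0 and its weight vanishes everywhere, while "cat", which occurs in only one document, keeps a positive weight.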

A tail strength measure for assessing the overall univariate significance in a dataset (2006), by Taylor J.

