List of text mining methods

Text mining is the process of extracting information from unstructured text and finding patterns or relations in it. Different text mining methods are used depending on their suitability for a given data set. Below is a list of text mining methods.

Centroid-based Clustering: Unsupervised learning method. Each cluster is represented by a central point (centroid), and data points are assigned to the cluster with the nearest centroid.
- Fast Global KMeans: A variant designed to accelerate Global KMeans.
- Global KMeans: An algorithm that begins with one cluster and then divides the data into more clusters until the required number is reached.
- KMeans: An algorithm that requires two inputs: K (the number of clusters) and a set of data.
- FW-KMeans: Used with the vector space model. Weights features to reduce the effect of noise.
- Two-Level-KMeans: The regular KMeans algorithm runs first. Clusters that do not reach the threshold are then selected for subdivision into subclasses.
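
The basic KMeans procedure can be illustrated with a minimal sketch. The function name `kmeans`, its parameters, and the random initialization are assumptions made for this example; it follows the commonly described assign-then-update loop and is not taken from any particular library.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster `points` (tuples of floats) into `k` groups.

    Hypothetical helper for illustration: returns a list of cluster
    labels (one per point) and the final centroids.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k input points as initial centroids
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        new_assignment = [
            min(
                range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
            )
            for point in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids
```

In text mining the points would typically be document vectors, for example term-weight vectors, rather than the low-dimensional tuples this sketch is easiest to try with.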
Cluster Algorithm
- Hierarchical Clustering
  - Agglomerative Clustering: Bottom-up approach. Each data point starts as its own small cluster, and clusters are then merged to form larger clusters.
  - Divisive Clustering: Top-down approach. Large clusters are split into smaller clusters.
- Density-based Clustering: Clusters are determined by the density of data points. Regions where points are densely packed form clusters, separated by sparser regions.
  - DBSCAN
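- Distribution-based Clustering: Clusters are formed by assuming that the data points are generated from probability distributions and grouping points that likely belong to the same distribution.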
  - Expectation-maximization algorithm
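
The bottom-up approach of agglomerative clustering can be sketched as follows. This is a minimal illustration that assumes single linkage (cluster distance is the distance between the closest pair of members) and Euclidean distance. Both choices, and the function name `agglomerative`, are assumptions made for this example.

```python
def agglomerative(points, num_clusters):
    """Merge the two closest clusters until `num_clusters` remain.

    Hypothetical helper for illustration; `points` is a list of tuples.
    """
    clusters = [[p] for p in points]  # start: every point is its own cluster

    def distance(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(
            sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
            for p in a
            for q in b
        )

    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]  # safe because j > i
    return clusters
```

A divisive method would run in the opposite direction, starting from one cluster that contains every point and splitting it repeatedly.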
Collocation
Stemming Algorithm
- Truncating Methods: Removing the suffix or prefix of a word.
  - Lovins Stemmer: Removes longest suffix.
  - Porter's Stemmer: Removes suffixes by applying a fixed sequence of rules in several steps. Porter's related Snowball framework allows programmers to define their own stemmers.
- Statistical Methods: Apply a statistical procedure, which typically results in affixes being removed.
  - N-Gram Stemmer: Based on n-grams, sets of 'n' consecutive characters taken from a word. Words that share a high proportion of n-grams are treated as related.
  - Hidden Markov Model (HMM) Stemmer: Transitions between states are governed by probability functions.
  - Yet Another Suffix Stripper (YASS) Stemmer: Hierarchical approach to creating clusters. The clusters are then treated as classes of elements, and their centroids are the stems.
- Inflectional & Derivational Methods
  - Krovetz Stemmer: Changes words to word stems that are valid English words.
  - Xerox Stemmer: Removes prefixes.
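
The idea behind n-gram based stemming can be illustrated by comparing the character bigrams of two words. The sketch below assumes the Dice coefficient over sets of unique bigrams as the similarity measure. The function names are hypothetical, and a full stemmer would go on to cluster words using these similarities.

```python
def bigrams(word):
    """Return the set of unique pairs of consecutive characters in `word`."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_similarity(a, b):
    """Dice coefficient of the two words' bigram sets (0.0 to 1.0)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0  # words shorter than two characters have no bigrams
    return 2 * len(x & y) / (len(x) + len(y))
```

For example, "statistics" has 7 unique bigrams, "statistical" has 8, and they share 6, so `dice_similarity("statistics", "statistical")` is 2 × 6 / (7 + 8) = 0.8.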
Term Frequency
- Term Frequency Inverse Document Frequency
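
A minimal sketch of term frequency–inverse document frequency weighting is shown below. It assumes one common variant: term frequency as the relative frequency of the term in the document, inverse document frequency as log(N / df), and whitespace tokenization. Other variants use different scaling or smoothing, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document in `documents`."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # tf (relative frequency) times idf (log of N / df)
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights
```

Under this variant a term that occurs in every document receives a weight of zero, because log(N / N) = 0.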
Topic Modeling
- Latent Semantic Analysis (LSA)
- Latent Dirichlet Allocation (LDA)
- Non-Negative Matrix Factorization (NMF)
- Bidirectional Encoder Representations from Transformers (BERT)
Wordscores: First estimates scores for word types based on reference texts. Then applies these word scores to a text that is not a reference text to get a document score. Lastly, the scores of the non-reference documents are rescaled so that they can be compared with the reference texts.
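
The first two Wordscores steps can be sketched as follows. The sketch assumes the commonly described formulation in which a word's score is the average of the reference scores, weighted by the word's relative frequency in each reference text, and a document's score is the frequency-weighted mean of the scores of its scored words. The function names, whitespace tokenization, and sample inputs are assumptions, and the final rescaling step is omitted.

```python
from collections import Counter


def word_scores(reference_texts, reference_scores):
    """Estimate a score for every word type seen in the reference texts."""
    freqs = []
    for text in reference_texts:
        counts = Counter(text.lower().split())
        total = sum(counts.values())
        # Relative frequency of each word within this reference text.
        freqs.append({w: c / total for w, c in counts.items()})
    scores = {}
    for w in set().union(*freqs):
        total = sum(f.get(w, 0.0) for f in freqs)
        # Each reference score is weighted by the share of the word's
        # relative frequency that comes from that reference text.
        scores[w] = sum(
            f.get(w, 0.0) / total * a for f, a in zip(freqs, reference_scores)
        )
    return scores


def document_score(text, scores):
    """Score a non-reference text; words without a score are ignored."""
    counts = Counter(w for w in text.lower().split() if w in scores)
    total = sum(counts.values())
    if total == 0:
        return None  # no scored words, so no score can be computed
    return sum(c / total * scores[w] for w, c in counts.items())
```

With hypothetical reference texts `"tax cut tax"` scored -1.0 and `"spend welfare spend"` scored 1.0, the word "tax" receives -1.0 and "spend" receives 1.0, so a text containing each word once scores 0.0.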

