I have been pondering this for a while. My news aggregator has been running for almost two years now, flawlessly connecting to the RSS feeds of the sites I specify and recording their articles in my free-tier MongoDB database.
It's a useful setup: I can easily see the latest trends in the areas that interest me. The aggregator currently collects articles on gaming, video production, films, photography, general tech news and travel.
It has its downsides, though. The feed reader collects a lot of information, and scrolling through everything to spot keywords can be time consuming. That is what led me to look at processing the articles algorithmically, and to topic modelling in particular.
Latent Dirichlet Allocation (LDA) is a fascinating topic! Essentially, LDA is a generative statistical model often used in natural language processing to discover the underlying topics in a collection of documents.
Here’s a brief rundown of how LDA works:
1. **Documents and Words:** It starts with a set of documents, each composed of words.
2. **Topics:** LDA assumes a fixed number of topics, chosen in advance, shared across the documents. Each topic is a probability distribution over words.
3. **Document-Topic Distribution:** Each document is treated as a mixture of these topics, each present in a different proportion.
4. **Word-Topic Assignment:** LDA assigns each word in each document to one of the topics.
The goal of LDA is to find the hidden structure in the dataset, by discovering both the set of topics and the topic distribution in each document.
In simple terms, LDA helps in identifying what topics exist in a set of documents and how these topics are distributed among the documents. It's like uncovering the themes running through a collection of books without having to read each one in full.
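To make the rundown above a little more concrete, here is a minimal sketch of fitting LDA with collapsed Gibbs sampling, which is one common inference method and not necessarily what any particular library uses internally. The class name `TinyLda`, the method `Fit` and the default values for `alpha`, `beta` and `iterations` are my own illustrative choices, and the documents are assumed to be already encoded as arrays of vocabulary indices.

```csharp
using System;
using System.Linq;

public static class TinyLda
{
    // docs[d][n] is the vocabulary index of the n-th word in document d.
    // Returns the estimated topic proportions, one row per document.
    public static double[][] Fit(int[][] docs, int vocabSize, int numTopics,
                                 double alpha = 0.1, double beta = 0.01,
                                 int iterations = 200, int seed = 42)
    {
        var rng = new Random(seed);
        var docTopic = new int[docs.Length, numTopics];  // words in document d assigned to topic k
        var topicWord = new int[numTopics, vocabSize];   // times word w is assigned to topic k
        var topicTotal = new int[numTopics];             // total words assigned to topic k
        var z = docs.Select(d => new int[d.Length]).ToArray(); // topic of each word occurrence

        // Start from a random topic assignment for every word occurrence.
        for (int d = 0; d < docs.Length; d++)
            for (int n = 0; n < docs[d].Length; n++)
            {
                int k = rng.Next(numTopics);
                z[d][n] = k;
                docTopic[d, k]++; topicWord[k, docs[d][n]]++; topicTotal[k]++;
            }

        var weights = new double[numTopics];
        for (int it = 0; it < iterations; it++)
            for (int d = 0; d < docs.Length; d++)
                for (int n = 0; n < docs[d].Length; n++)
                {
                    int w = docs[d][n], old = z[d][n];
                    // Remove the current assignment from the counts.
                    docTopic[d, old]--; topicWord[old, w]--; topicTotal[old]--;

                    // Unnormalised probability of each topic given all other assignments.
                    double sum = 0;
                    for (int k = 0; k < numTopics; k++)
                    {
                        weights[k] = (docTopic[d, k] + alpha) * (topicWord[k, w] + beta)
                                     / (topicTotal[k] + vocabSize * beta);
                        sum += weights[k];
                    }

                    // Sample a new topic in proportion to those weights.
                    double u = rng.NextDouble() * sum;
                    int pick = 0;
                    while (pick < numTopics - 1 && (u -= weights[pick]) > 0) pick++;

                    z[d][n] = pick;
                    docTopic[d, pick]++; topicWord[pick, w]++; topicTotal[pick]++;
                }

        // Turn the final counts into smoothed topic proportions per document.
        var theta = new double[docs.Length][];
        for (int d = 0; d < docs.Length; d++)
        {
            theta[d] = new double[numTopics];
            double denom = docs[d].Length + numTopics * alpha;
            for (int k = 0; k < numTopics; k++)
                theta[d][k] = (docTopic[d, k] + alpha) / denom;
        }
        return theta;
    }
}
```

Calling `TinyLda.Fit(docs, vocabSize: 4, numTopics: 2)` on a toy corpus returns one row per document, and each row sums to 1. For a real feed a library implementation would be a better fit; this is only meant to show the moving parts.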
Both Azure services and C# libraries can help with Latent Dirichlet Allocation (LDA).
Azure services:
1. Azure Machine Learning: Azure Machine Learning provides a built-in component for LDA.
2. Azure Functions: You can use Azure Functions with Python to perform LDA.
C# libraries:
1. Infer.NET: This is a machine learning framework from Microsoft Research whose tutorials and examples include an implementation of LDA. https://dotnet.github.io/infer/userguide/Infer.NET%20tutorials%20and%20examples.html
2. Microsoft ML.NET: ML.NET has a `LatentDirichletAllocationEstimator`, which uses LightLDA to transform text into a vector indicating the similarity of the text with each topic identified.
3. GitHub: There are also community-driven implementations of LDA in C#, such as the one available on GitHub by hyunjong-lee. https://github.com/hyunjong-lee/LatentDirichletAllocation
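As an illustration of the ML.NET option, the sketch below follows the general shape of the ML.NET text pipeline: normalise, tokenise, remove stop words, map tokens to keys, build n-grams, then apply LDA. It assumes the `Microsoft.ML` NuGet package is installed. The class names `Article`, `ArticleTopics` and `MlNetLdaSketch`, the method `TopicsFor` and the default of three topics are hypothetical choices for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

public class Article
{
    public string Text { get; set; } = "";
}

public class ArticleTopics
{
    // One score per topic for the article.
    public float[] Features { get; set; } = Array.Empty<float>();
}

public static class MlNetLdaSketch
{
    // Fits LDA on the corpus and returns the topic vector for a single article.
    public static float[] TopicsFor(IEnumerable<string> corpus, string article, int numberOfTopics = 3)
    {
        var mlContext = new MLContext(seed: 0);
        var rows = corpus.Select(t => new Article { Text = t }).ToList();
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // Text clean-up steps followed by the LDA transform.
        var pipeline = mlContext.Transforms.Text.NormalizeText("NormalizedText", "Text")
            .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "NormalizedText"))
            .Append(mlContext.Transforms.Text.RemoveDefaultStopWords("Tokens"))
            .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
            .Append(mlContext.Transforms.Text.ProduceNgrams("Tokens"))
            .Append(mlContext.Transforms.Text.LatentDirichletAllocation(
                "Features", "Tokens", numberOfTopics: numberOfTopics));

        var model = pipeline.Fit(data);
        var engine = mlContext.Model.CreatePredictionEngine<Article, ArticleTopics>(model);
        return engine.Predict(new Article { Text = article }).Features;
    }
}
```

Refitting the model on every call is wasteful; in practice the fitted model would be trained once and reused. A corpus of only a handful of articles is also unlikely to give meaningful topics.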
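A rough outline for trying LDA with Infer.NET:
1. **Install Infer.NET** (published on NuGet as `Microsoft.ML.Probabilistic`, with the model compiler in `Microsoft.ML.Probabilistic.Compiler`):
```powershell
Install-Package Microsoft.ML.Probabilistic -Version [version]
Install-Package Microsoft.ML.Probabilistic.Compiler -Version [version]
```
2. **Model Configuration**:
- Define the number of topics, the Dirichlet hyperparameters alpha (document-topic prior) and beta (topic-word prior), and any other settings for your model.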
3. **Load Data**:
- Load and preprocess your text data.
4. **Run LDA**:
```csharp
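// Sketch only: LatentDirichletAllocationModel is a hypothetical wrapper, not a type shipped with Infer.NET.
// The Infer.NET LDA example (linked below) defines its own model classes on top of the library.
var lda = new LatentDirichletAllocationModel(numTopics, alpha, beta);
var topics = lda.Run(textData);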
```
5. **Evaluate and Adjust**:
- Inspect the most probable words in each topic, and adjust the number of topics, alpha or beta if the topics are not coherent.
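The "load and preprocess" step in the outline above usually means turning raw article text into arrays of word ids. Here is a minimal sketch of that step using only the standard library. The class name `TextPrep`, the tiny stop-word list and the ASCII-only tokenisation rule are simplifying assumptions for illustration, not something Infer.NET requires.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TextPrep
{
    // A tiny illustrative stop-word list; a real one would be much longer.
    private static readonly HashSet<string> StopWords = new HashSet<string>
        { "an", "and", "the", "of", "to", "in", "is", "for", "on", "with" };

    // Lower-cases the text, keeps runs of ASCII letters and drops stop words and single letters.
    public static List<string> Tokenize(string text) =>
        Regex.Matches(text.ToLowerInvariant(), "[a-z]+")
             .Cast<Match>()
             .Select(m => m.Value)
             .Where(t => t.Length > 1 && !StopWords.Contains(t))
             .ToList();

    // Gives every distinct token an integer id (in order of first appearance)
    // and encodes each document as an array of those ids.
    public static (int[][] Docs, List<string> Vocabulary) Encode(IEnumerable<string> texts)
    {
        var ids = new Dictionary<string, int>();
        var vocabulary = new List<string>();
        var docs = new List<int[]>();
        foreach (var text in texts)
        {
            var encoded = new List<int>();
            foreach (var token in Tokenize(text))
            {
                if (!ids.TryGetValue(token, out int id))
                {
                    id = vocabulary.Count;
                    ids[token] = id;
                    vocabulary.Add(token);
                }
                encoded.Add(id);
            }
            docs.Add(encoded.ToArray());
        }
        return (docs.ToArray(), vocabulary);
    }
}
```

For example, encoding "The new camera and the lens" and "Camera review" gives the vocabulary new, camera, lens, review and the documents `[0, 1, 2]` and `[1, 3]`.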
Further reading:
- https://code-maze.com/csharp-machine-learning-probabilistic-programming-with-infer-net/
- https://dotnet.github.io/infer/userguide/Frequently%20Asked%20Questions.html
- https://dotnet.github.io/infer/userguide/Latent%20Dirichlet%20Allocation.html
I am going to look into Infer.NET and try to train a model on my news data, then give it my preferences for what I'd like to see. That should be enough for a working proof of concept.
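I have not settled on how the preferences will be applied yet. One possible approach, sketched here purely as an assumption, is to give each topic a weight and rank articles by the dot product of their topic mixture with those weights. The names `PreferenceRanking`, `Score` and `Rank` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PreferenceRanking
{
    // Scores an article as the dot product of its topic mixture and the per-topic weights.
    public static double Score(double[] topicMixture, double[] preferences) =>
        topicMixture.Zip(preferences, (t, p) => t * p).Sum();

    // Returns article titles ordered from most to least relevant.
    public static List<string> Rank(
        IEnumerable<(string Title, double[] Topics)> articles, double[] preferences) =>
        articles.OrderByDescending(a => Score(a.Topics, preferences))
                .Select(a => a.Title)
                .ToList();
}
```

With two topics and preference weights of 0 and 1, an article that is mostly about the second topic is ranked above one that is mostly about the first.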
I will update this blog post in the future.