# Topic Modeling Evaluation

A toolkit to quickly evaluate topic model goodness over the number of topics.
## Metrics

Coherence measures used to evaluate model goodness:

- `u_mass` - the fastest method. A corpus should be provided; if texts are provided instead, they are converted to a corpus using the dictionary.
- `c_v`, `c_uci` (also known as `c_pmi`), and `c_npmi` - tokenized texts should be provided (a corpus isn't needed).
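For reference, below is a minimal sketch of how these coherence measures can be computed directly with gensim's `CoherenceModel`. The use of gensim and the toy symptom corpus are assumptions for illustration only, not this toolkit's API; real evaluations need a much larger corpus.

```python
# A minimal sketch, assuming gensim; the toy corpus is illustrative only.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

texts = [["cough", "fever", "fatigue"],
         ["fever", "headache", "nausea"],
         ["cough", "shortness", "breath"]]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)

# 'u_mass' is computed from the corpus; 'c_v', 'c_uci', and 'c_npmi'
# are computed from the tokenized texts
for measure in ["u_mass", "c_v", "c_uci", "c_npmi"]:
    cm = CoherenceModel(model=lda, texts=texts, corpus=corpus,
                        dictionary=dictionary, coherence=measure)
    print(measure, cm.get_coherence())
```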
## Examples

### Example 1: estimate metrics for one topic model with a specific number of topics

```python
from tm_eval import *

# Input dataset, output location, and model settings
input_file = "datasets/covid19_symptoms.pickle"
output_folder = "outputs"
model_name = "symptom"
num_topics = 10

# Evaluate all coherence metrics for an LDA model with `num_topics` topics
results = evaluate_all_metrics_from_lda_model(input_file=input_file,
                                              output_folder=output_folder,
                                              model_name=model_name,
                                              num_topics=num_topics)
print(results)
```
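The returned `results` should report, for the trained model, the coherence scores listed under Metrics (`u_mass`, `c_v`, `c_uci`, and `c_npmi`).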
### Example 2: find model goodness change over the number of topics

```python
from tm_eval import *

if __name__ == "__main__":
    input_file = "datasets/covid19_symptoms.pickle"
    output_folder = "outputs"
    model_name = "symptom"
    start = 2
    end = 5

    # Evaluate metrics for models with `start` to `end` topics
    list_results = explore_topic_model_metrics(input_file=input_file,
                                               output_folder=output_folder,
                                               model_name=model_name,
                                               start=start,
                                               end=end)

    # Save the metric changes to a CSV file, then plot them
    show_topic_model_metric_change(list_results, save=True,
                                   save_path=f"{output_folder}/metrics.csv")
    plot_tm_metric_change(csv_path=f"{output_folder}/metrics.csv",
                          save=True, save_folder=output_folder)
```
## Output results

Running Example 2 writes the per-topic-count metrics to `outputs/metrics.csv` and saves the corresponding metric-change plots to the output folder.
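The saved CSV can then be inspected to pick a topic count. The sketch below assumes a plain tabular layout; the column names used (`num_topics`, `c_v`) are assumptions about the CSV schema, not documented names.

```python
import pandas as pd

# Load the saved metrics; the column names below ("num_topics", "c_v")
# are assumed for illustration and may differ in the actual CSV
df = pd.read_csv("outputs/metrics.csv")
print(df)

# For example, choose the topic count that maximizes c_v coherence
best = df.loc[df["c_v"].idxmax()]
print("Best number of topics by c_v:", best["num_topics"])
```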
## License

The `tm-eval` toolkit is provided by Donghua Chen under the MIT License.
## References
- Topic Modeling in Python: Latent Dirichlet Allocation (LDA)
- Evaluate Topic Models: Latent Dirichlet Allocation (LDA)