Semantic Word Cloud for Thai and English
swordcloud: A semantic word cloud generator that uses t-SNE and k-means clustering to visualize words in high-dimensional semantic space. Based on A. Mueller's wordcloud module, swordcloud can generate semantic word clouds from Thai and English texts using any word vector model.
Content
- Installation
- Usage
  2.1 Initialize SemanticWordCloud instance
  2.2 Generate from Raw Text
  2.3 Generate from Word Frequencies
  2.4 Generate k-means Cluster Clouds
  2.5 Recolor Words
  2.6 Export Word Clouds
- Color "Functions"
Installation
swordcloud can be installed using pip:
pip install swordcloud
Optionally, to embed fonts directly into the generated SVGs, install the embedfont extra as well:
pip install swordcloud[embedfont]
As of version 0.0.10, the exact list of dependencies is as follows:
python >= 3.8
numpy >= 1.21.0
pillow
matplotlib >= 1.5.3
gensim >= 4.0.0
pandas
pythainlp >= 3.1.0
k-means-constrained
scikit-learn
fonttools (optional)
Usage
All code below can also be found in the example folder.
Initialize SemanticWordCloud instance
For most use cases, the SemanticWordCloud class is the main API that users will interact with.
from swordcloud import SemanticWordCloud
from swordcloud.color_func import SingleColorFunc

wordcloud = SemanticWordCloud(
    language = 'TH',
    width = 1600,
    height = 800,
    max_font_size = 150,
    prefer_horizontal = 1,
    color_func = SingleColorFunc('black')
)
Please refer to the documentation in src/swordcloud/wordcloud.py or in your IDE for more detail about various options available for customizing the word cloud.
Generate from Raw Text
# Read one text per line, closing the file when done
with open('raw_text.txt', encoding='utf-8') as f:
    raw_text = [line.strip() for line in f]
wordcloud.generate_from_text(raw_text, random_state=42)
Generate from Word Frequencies
# Each line holds a word and its count, separated by a tab
freq = {}
with open("word_frequencies.tsv", encoding="utf-8") as f:
    for line in f:
        word, count = line.strip().split('\t')
        freq[word] = int(count)
wordcloud.generate_from_frequencies(freq, random_state=42)
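The loader above expects one tab-separated word and count per line. If no such file exists yet, it can be produced from a list of tokens. The following is a minimal sketch using only the standard library; the helper names write_frequency_tsv and read_frequency_tsv are hypothetical and not part of swordcloud, and the whitespace split is a stand-in for a real tokenizer (Thai text would need a proper word tokenizer, for example from pythainlp).

```python
import os
import tempfile
from collections import Counter


def write_frequency_tsv(tokens, path):
    """Count tokens and write them as 'word<TAB>count' lines (hypothetical helper)."""
    counts = Counter(tokens)
    with open(path, "w", encoding="utf-8") as f:
        # most_common() lists the most frequent words first
        for word, count in counts.most_common():
            f.write(f"{word}\t{count}\n")
    return counts


def read_frequency_tsv(path):
    """Read 'word<TAB>count' lines back into a dict (hypothetical helper)."""
    freq = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            word, count = line.split("\t")
            freq[word] = int(count)
    return freq


# Toy English input; whitespace splitting is only for illustration.
tokens = "the cat sat on the mat the end".split()
tsv_path = os.path.join(tempfile.mkdtemp(), "word_frequencies.tsv")
write_frequency_tsv(tokens, tsv_path)
freq = read_frequency_tsv(tsv_path)
```

The resulting freq dictionary has the same shape as the one passed to generate_from_frequencies above.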
Generate k-means Cluster Clouds
from swordcloud.color_func import FrequencyColorFunc

wordcloud = SemanticWordCloud(
    language = 'TH',
    width = 2400,
    height = 1200,
    max_font_size = 150,
    prefer_horizontal = 1
)
# Either generate the clusters from raw text...
wordcloud.generate_from_text(raw_text, kmeans=6, random_state=42, plot_now=False)
# ...or from word frequencies, if they are already available
wordcloud.generate_kmeans_cloud(freq, n_clusters=6, random_state=42, plot_now=False)
# Each cluster's cloud is stored in sub_clouds and can be recolored on its own
for cloud, color in zip(wordcloud.sub_clouds, ["red", "blue", "brown", "green", "black", "orange"]):
    cloud.recolor(FrequencyColorFunc(color), plot_now=False)
    cloud.show()
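Conceptually, each sub-cloud groups words whose vectors lie close together. As a rough illustration only, the sketch below clusters a handful of made-up 2-D vectors with scikit-learn's plain KMeans. This is not swordcloud's internal code (the package lists k-means-constrained among its dependencies, so its clustering likely differs), and the toy_vectors values are invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented 2-D "word vectors"; real models have hundreds of dimensions.
toy_vectors = {
    "cat": [0.90, 0.10],
    "dog": [0.80, 0.20],
    "bird": [0.85, 0.15],
    "car": [0.10, 0.90],
    "bus": [0.20, 0.80],
    "train": [0.15, 0.85],
}

words = list(toy_vectors)
X = np.array([toy_vectors[w] for w in words])

# Plain k-means with a fixed seed so the run is reproducible.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Group words by cluster label; each group would become one sub-cloud.
clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(int(label), []).append(word)

for label, members in sorted(clusters.items()):
    print(label, members)
```

Which group receives label 0 or 1 is arbitrary, so downstream code should not rely on a specific label order.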
Recolor Words
from swordcloud.color_func import RandomColorFunc
wordcloud.recolor(RandomColorFunc, random_state=42)
Export Word Clouds
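# As an image object
img = wordcloud.to_image()
# As an image file
wordcloud.to_file('wordcloud.png')
# As SVG
svg = wordcloud.to_svg()
# As SVG with the font embedded (needs the embedfont extra)
svg = wordcloud.to_svg(embed_font=True)
# As an array
array = wordcloud.to_array()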
Color "Functions"
A number of built-in color "functions" can be accessed from swordcloud.color_func:
from swordcloud.color_func import <your_color_function_here>
The list of available functions is as follows:
- RandomColorFunc (default)
  Return a random color.
- ColorMapFunc
  Return a random color from the user-specified matplotlib colormap.
- ImageColorFunc
  Use a user-provided colored image array to determine word color at each position on the canvas.
- SingleColorFunc
  Always return the user-specified color every single time, resulting in every word having the same color.
- ExactColorFunc
  Use a user-provided color dictionary to determine exactly which word should have which color.
- FrequencyColorFunc
  Assign colors based on word frequencies, with less frequent words having lighter colors. The base color is specified by the user.
All the above functions, except RandomColorFunc, which cannot be customized further, must be initialized before being passed to the SemanticWordCloud class. For example:
from swordcloud.color_func import ColorMapFunc

color_func = ColorMapFunc("magma")
wordcloud = SemanticWordCloud(
    ...
    color_func = color_func
    ...
)
Users can also implement their own color functions, provided that they are callable with the following signature:
Input:
- word: str
  The word being colored.
- frequency: float
  Frequency of the word on a scale from 0 to 1.
- font_size: int
  Font size of the word.
- position: tuple[int, int]
  Coordinate of the top-left point of the word's bounding box on the canvas.
- orientation: PIL.Image.Transpose | None
  pillow's orientation.
- font_path: str
  Path to the font file (OTF or TTF).
- random_state: random.Random
  Python's random.Random object.
Return:
Any object that can be interpreted as a color by pillow. See pillow's documentation for more detail.
Internally, arguments to color functions are always passed as keyword arguments, so they can be in any order. However, if your function only uses some of them, make sure to include **kwargs at the end of the function header so that the other arguments do not cause an error.
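As an illustration of the signature described above, here is a minimal sketch of a custom color function. The name blue_by_frequency and its color scheme are made up for this example; it uses only word, frequency and random_state, and absorbs the remaining arguments with **kwargs. It returns an "hsl(...)" string, which is one of the color formats pillow understands. The call at the bottom simulates the keyword-only calling convention by hand and does not involve swordcloud itself.

```python
import random


def blue_by_frequency(word, frequency, random_state=None, **kwargs):
    """Hypothetical color function: more frequent words get a darker blue."""
    # Fall back to a fresh generator if none is supplied.
    rng = random_state if random_state is not None else random.Random()
    # Small random variation around a blue hue.
    hue = 200 + rng.randint(-10, 10)
    # frequency 0 -> 70% lightness, frequency 1 -> 30% lightness
    lightness = int(70 - 40 * frequency)
    return f"hsl({hue}, 80%, {lightness}%)"


# Simulated call: every argument is passed by keyword, in arbitrary order.
color = blue_by_frequency(
    font_path="font.ttf",  # placeholder path, unused by this function
    position=(10, 20),
    orientation=None,
    font_size=120,
    random_state=random.Random(42),
    frequency=1.0,
    word="example",
)
print(color)
```

Because the unused arguments land in **kwargs, the function keeps working however many of them are supplied. Such a function would then be passed as color_func when constructing SemanticWordCloud, in the same way as the built-in ones.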