connlp

A collection of Python code for analyzing text data in the construction industry. It mainly repackages existing Python libraries for Natural Language Processing (NLP).

0.0.18

PyPI

Maintainers: 1

Project Information

Supported by C!LAB (@Seoul Nat'l Univ.)

Contributors

Seonghyeon Boris Moon (blank54@snu.ac.kr, https://github.com/blank54/)
Sehwan Chung (hwani751@snu.ac.kr)
Jungyeon Kim (janykjy@snu.ac.kr)

Initialize

Setup

Install connlp with pip.

pip install connlp

Install the packages listed in requirements.txt.

cd WORKSPACE
wget -O requirements_connlp.txt https://raw.githubusercontent.com/blank54/connlp/master/requirements.txt
pip install -r requirements_connlp.txt

Test

If the code below runs with no error, connlp is installed successfully.

from connlp.test import hello
hello()

# 'Helloworld'

Preprocess

The preprocessing module supports English and Korean.
NOTE: There is no plan to support other languages (as of 2021.04.02).

Normalizer

Normalizer normalizes the input text by eliminating trash characters and retaining numbers, alphabetic characters, and punctuation marks.

from connlp.preprocess import Normalizer
normalizer = Normalizer()

normalizer.normalize(text='I am a boy!')

# 'i am a boy'

EnglishTokenizer

EnglishTokenizer tokenizes English input text based on word spacing.
N-gram-based tokenization is in preparation.

from connlp.preprocess import EnglishTokenizer
tokenizer = EnglishTokenizer()

tokenizer.tokenize(text='I am a boy!')

# ['I', 'am', 'a', 'boy!']
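
Tokenization based on word spacing behaves like Python's built-in str.split, which is consistent with the output above, where the punctuation stays attached to 'boy!'. The following is a minimal sketch of that idea, not connlp's actual implementation; whitespace_tokenize is a hypothetical helper name.

```python
def whitespace_tokenize(text):
    """Split text on runs of whitespace; punctuation stays attached to words."""
    return text.split()


tokens = whitespace_tokenize('I am a boy!')
print(tokens)
```

Because punctuation is not separated, running Normalizer first is a natural way to obtain clean tokens.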

KoreanTokenizer

KoreanTokenizer tokenizes Korean input text, using either a pre-trained or an unsupervised approach.

The pre-trained method is recommended unless you have a large corpus. It is the default setting.

To use a pre-trained tokenizer, select the analyzer you want to use. The available analyzers are based on KoNLPy (https://konlpy.org/ko/latest/api/konlpy.tag/), a Python package for Korean language processing. The default analyzer is Hannanum.

from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=True, analyzer='Hannanum')

If your corpus is large, you may use the unsupervised method, which is based on soynlp (https://github.com/lovit/soynlp), an unsupervised text analyzer for Korean.

from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=False)

train

If your KoreanTokenizer is pre-trained, you can skip this step.

Otherwise (i.e., you are using the unsupervised approach), the KoreanTokenizer object first needs to be trained on an (unlabeled) corpus. A 'word score' is calculated for every subword in the corpus.

from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=False)

docs = ['코퍼스의 첫 번째 문서입니다.', '두 번째 문서입니다.', '마지막 문서']

tokenizer.train(text=docs)
print(tokenizer.word_score)

# {'서': 0.0, '코': 0.0, '째': 0.0, '.': 0.0, '의': 0.0, '마': 0.0, '막': 0.0, '번': 0.0, '문': 0.0, '코퍼': 1.0, '번째': 1.0, '마지': 1.0, '문서': 1.0, '코퍼스': 1.0, '문서입': 0.816496580927726, '마지막': 1.0, '코퍼스의': 1.0, '문서입니': 0.8735804647362989, '문서입니다': 0.9036020036098448, '문서입니다.': 0.9221079114817278}

tokenize

If you are using a pre-trained KoreanTokenizer, the selected KoNLPy analyzer will tokenize the input sentence based on morphological analysis.

from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=True, analyzer='Hannanum')

docs = ['코퍼스의 첫 번째 문서입니다.', '두 번째 문서입니다.', '마지막 문서']
doc = docs[0] # '코퍼스의 첫 번째 문서입니다.'
tokenizer.tokenize(doc)

# ['코퍼스', '의', '첫', '번째', '문서', '입니다', '.']

If you are using an unsupervised KoreanTokenizer, tokenization is based on the 'word score' calculated by the KoreanTokenizer.train method.

For each blank-separated token, the subword with the maximum 'word score' is selected as an individual 'word' and separated from the remaining part.

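from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=False)

docs = ['코퍼스의 첫 번째 문서입니다.', '두 번째 문서입니다.', '마지막 문서']
tokenizer.train(text=docs) # the unsupervised tokenizer needs word scores before tokenizing
doc = docs[0] # '코퍼스의 첫 번째 문서입니다.'
tokenizer.tokenize(doc)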

# ['코퍼스의', '첫', '번째', '문서', '입니다.']
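
The word-score rule described above can be sketched in plain Python. This is an illustration under stated assumptions, not the soynlp or connlp implementation: the helper names are hypothetical, subwords missing from the score table are assumed to score 0.0, and ties are assumed to go to the longer subword. The scores are copied from the train example.

```python
def split_by_word_score(token, word_score):
    """Split one blank-separated token into [word] or [word, remainder]."""
    prefixes = [token[:i] for i in range(1, len(token) + 1)]
    # Highest score wins; ties go to the longer prefix (an assumption).
    best = max(prefixes, key=lambda p: (word_score.get(p, 0.0), len(p)))
    remainder = token[len(best):]
    return [best, remainder] if remainder else [best]


def tokenize_by_word_score(sentence, word_score):
    """Apply the split to every blank-separated token of a sentence."""
    tokens = []
    for token in sentence.split():
        tokens.extend(split_by_word_score(token, word_score))
    return tokens


# A subset of the word scores printed in the train example.
word_score = {
    '코퍼': 1.0, '코퍼스': 1.0, '코퍼스의': 1.0, '번째': 1.0, '문서': 1.0,
    '문서입': 0.816496580927726, '문서입니': 0.8735804647362989,
    '문서입니다': 0.9036020036098448, '문서입니다.': 0.9221079114817278,
}
result = tokenize_by_word_score('코퍼스의 첫 번째 문서입니다.', word_score)
print(result)
```

With these assumptions the sketch reproduces the tokenization shown above: '문서' (score 1.0) beats every longer prefix of '문서입니다.', so the token is split into '문서' and '입니다.'.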

StopwordRemover

StopwordRemover removes stopwords from a given sentence based on a user-customized stopword list.
Before using StopwordRemover, the user should normalize and tokenize the docs.

from connlp.preprocess import Normalizer, EnglishTokenizer, StopwordRemover
normalizer = Normalizer()
eng_tokenizer = EnglishTokenizer()
stopword_remover = StopwordRemover()

docs = ['I am a boy!', 'He is a boy..', 'She is a girl?']
tokenized_docs = []

for doc in docs:
    normalized_doc = normalizer.normalize(text=doc)
    tokenized_doc = eng_tokenizer.tokenize(text=normalized_doc)
    tokenized_docs.append(tokenized_doc)

print(docs)
print(tokenized_docs)

# ['I am a boy!', 'He is a boy..', 'She is a girl?']
# [['i', 'am', 'a', 'boy'], ['he', 'is', 'a', 'boy'], ['she', 'is', 'a', 'girl']]

The user should prepare a customized stopword list (i.e., a stoplist).
The stoplist file should contain the user-customized stopwords separated by '\n' and should be in ".txt" format.

a
is
am

Initiate the StopwordRemover with the filepath of the user-customized stopword list.
If no stoplist exists at the filepath, the stoplist remains an empty list.

fpath_stoplist = 'test/thesaurus/stoplist.txt'
stopword_remover.initiate(fpath_stoplist=fpath_stoplist)

print(stopword_remover)

# <connlp.preprocess.StopwordRemover object at 0x7f163e70c050>

The user can count the word frequencies and figure out additional stopwords based on the results.

stopword_remover.count_freq_words(docs=tokenized_docs)

# ========================================
# Word counts
#   | [1] a: 3
#   | [2] boy: 2
#   | [3] is: 2
#   | [4] i: 1
#   | [5] am: 1
#   | [6] he: 1
#   | [7] she: 1
#   | [8] girl: 1
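
A frequency table like the one above can be reproduced with the standard library. This is a minimal sketch using collections.Counter, not the code behind count_freq_words; the print format is only an approximation of the output shown.

```python
from collections import Counter

tokenized_docs = [['i', 'am', 'a', 'boy'],
                  ['he', 'is', 'a', 'boy'],
                  ['she', 'is', 'a', 'girl']]

# Count every token across all documents.
word_counts = Counter(word for doc in tokenized_docs for word in doc)

# most_common() sorts by count; ties keep first-seen order.
for rank, (word, count) in enumerate(word_counts.most_common(), start=1):
    print(f'[{rank}] {word}: {count}')
```

Words that are frequent but carry little meaning, such as 'a' and 'is' here, are typical candidates for the stoplist.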

After finalizing the stoplist, use the remove method to remove the stopwords from the text.

stopword_removed_docs = []
for doc in tokenized_docs:
    stopword_removed_docs.append(stopword_remover.remove(sent=doc))

print(stopword_removed_docs)

# [['i', 'boy'], ['he', 'boy'], ['she', 'girl']]
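
Conceptually, stoplist-based removal is a set-membership filter. The sketch below parses the '\n'-separated stoplist format described above from a string instead of a file and filters the tokenized docs. It illustrates the idea only and is not the StopwordRemover implementation.

```python
# Contents of a stoplist file: one stopword per line.
stoplist_text = 'a\nis\nam'

# Ignore empty lines and surrounding whitespace.
stoplist = {line.strip() for line in stoplist_text.split('\n') if line.strip()}

tokenized_docs = [['i', 'am', 'a', 'boy'],
                  ['he', 'is', 'a', 'boy'],
                  ['she', 'is', 'a', 'girl']]

# Keep only the tokens that are not in the stoplist.
stopword_removed_docs = [[word for word in doc if word not in stoplist]
                         for doc in tokenized_docs]
print(stopword_removed_docs)
```

Using a set keeps each membership test constant-time, which matters once the stoplist grows.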

The user can check which stopwords were removed with the check_removed_words method.

stopword_remover.check_removed_words(docs=tokenized_docs, stopword_removed_docs=stopword_removed_docs)

# ========================================
# Check stopwords removed
#   | [1] BEFORE: a(3) ->
#   | [2] BEFORE: boy -> AFTER: boy(2)
#   | [3] BEFORE: is(2) ->
#   | [4] BEFORE: i -> AFTER: i(1)
#   | [5] BEFORE: am(1) ->
#   | [6] BEFORE: he -> AFTER: he(1)
#   | [7] BEFORE: she -> AFTER: she(1)
#   | [8] BEFORE: girl -> AFTER: girl(1)

Embedding

Vectorizer

Vectorizer includes several text embedding methods that have been commonly used for decades.

tfidf

TF-IDF is one of the most commonly used techniques for text vectorization.
The TF-IDF model computes the term frequency (TF) and inverse document frequency (IDF) from the given documents.
The results include the following.

TF-IDF Vectorizer (an instance of 'sklearn.feature_extraction.text.TfidfVectorizer')
TF-IDF Matrix
TF-IDF Vocabulary

from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()

docs = ['I am a boy', 'He is a boy', 'She is a girl']
tfidf_vectorizer, tfidf_matrix, tfidf_vocab = vectorizer.tfidf(docs=docs)
type(tfidf_vectorizer)

# <class 'sklearn.feature_extraction.text.TfidfVectorizer'>

The user can get a document vector by indexing the tfidf_matrix.

tfidf_matrix[0]

# (0, 2)    0.444514311537431
# (0, 0)    0.34520501686496574
# (0, 1)    0.5844829010200651
# (0, 5)    0.5844829010200651

The tfidf_vocab maps every token to its column index in the TF-IDF matrix.

print(tfidf_vocab)

# {'i': 5, 'am': 1, 'a': 0, 'boy': 2, 'he': 4, 'is': 6, 'she': 7, 'girl': 3}
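
The weights in tfidf_matrix[0] can be checked by hand. The sketch below assumes scikit-learn's default TfidfVectorizer weighting (smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization) and lowercased tokens. Under those assumptions it gives the same values as the example output for the first document. tfidf_vector is a hypothetical helper, not part of connlp.

```python
import math
from collections import Counter


def tfidf_vector(doc, docs):
    """Return {term: weight} for one tokenized doc, L2-normalized."""
    n_docs = len(docs)
    weights = {}
    for term, tf in Counter(doc).items():
        df = sum(1 for d in docs if term in d)       # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


docs = [['i', 'am', 'a', 'boy'],
        ['he', 'is', 'a', 'boy'],
        ['she', 'is', 'a', 'girl']]
vector = tfidf_vector(docs[0], docs)
for term in ('a', 'boy', 'am', 'i'):
    print(term, vector[term])
```

'a' appears in every document, so it gets the lowest weight. 'i' and 'am' appear only in the first document, so they get the highest.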

word2vec

Word2Vec is a distributed representation language model for word embedding.
The Word2Vec model is trained on tokenized docs and returns word vectors.
The result is an instance of 'gensim.models.word2vec.Word2Vec'.

from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()

docs = ['I am a boy', 'He is a boy', 'She is a girl']
tokenized_docs = [tokenizer.tokenize(text=doc) for doc in docs]
w2v_model = vectorizer.word2vec(docs=tokenized_docs)
type(w2v_model)

# <class 'gensim.models.word2vec.Word2Vec'>

The user can get a word vector through the .wv attribute.

w2v_model.wv['boy']

# [-2.0130998e-03 -3.5652996e-03  2.7793974e-03 ...]

The Word2Vec model returns the topn most similar words together with their similarity scores.

w2v_model.wv.most_similar('boy', topn=3)

# [('He', 0.05311150848865509), ('a', 0.04154288396239281), ('She', -0.029122961685061455)]
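
The similarity scores returned by most_similar are cosine similarities between word vectors. The sketch below shows that ranking on a toy vocabulary. The two-dimensional vectors are made up for illustration, and the helper functions are hypothetical, not gensim's API.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, invented for this example only.
toy_vectors = {'boy': [1.0, 0.0], 'girl': [0.8, 0.6], 'table': [0.0, 1.0]}
ranking = most_similar('boy', toy_vectors, topn=2)
print(ranking)
```

With a corpus as small as the three sentences above, the trained vectors stay close to their random initialization, which is why the similarity scores in the example output are near zero.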

word2vec (update)

The user can update the Word2Vec model with new data.

new_docs = ['Tom is a man', 'Sally is not a boy']
tokenized_new_docs = [tokenizer.tokenize(text=doc) for doc in new_docs]
w2v_model_updated = vectorizer.word2vec_update(w2v_model=w2v_model, new_docs=tokenized_new_docs)

w2v_model_updated.wv['man']

# [4.9649975e-03  3.8002312e-04 -1.5773597e-03 ...]

doc2vec

Doc2Vec is a distributed representation language model for embedding longer text (e.g., sentences, paragraphs, documents).
The Doc2Vec model is trained on tokenized docs with tags and returns document vectors.
The result is an instance of 'gensim.models.doc2vec.Doc2Vec'.

from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()

docs = ['I am a boy', 'He is a boy', 'She is a girl']
tagged_docs = [(idx, tokenizer.tokenize(text=doc)) for idx, doc in enumerate(docs)]
d2v_model = vectorizer.doc2vec(tagged_docs=tagged_docs)
type(d2v_model)

# <class 'gensim.models.doc2vec.Doc2Vec'>

The Doc2Vec model can infer a vector for a new document.

test_doc = ['My', 'name', 'is', 'Peter']
d2v_model.infer_vector(doc_words=test_doc)

# [4.8494316e-03 -4.3647490e-03  1.1437446e-03 ...]

Analysis

TopicModel

TopicModel is a class for topic modeling based on the gensim LDA model.
It provides a simple way to train an LDA model and assign topics to docs.

Before using LDA topic modeling, the user should install the following package.

pip install pyldavis==2.1.2

TopicModel requires two arguments.

a dict of docs whose keys are the tags
the number of topics for modeling

from connlp.analysis_lda import TopicModel

num_topics = 2
docs = {'doc1': ['I', 'am', 'a', 'boy'],
        'doc2': ['He', 'is', 'a', 'boy'],
        'doc3': ['Cat', 'on', 'the', 'table'],
        'doc4': ['Mike', 'is', 'a', 'boy'],
        'doc5': ['Dog', 'on', 'the', 'table'],
        }

lda_model = TopicModel(docs=docs, num_topics=num_topics)

learn

The user can train the model with the learn method. Unless the user provides parameters, the model is trained with the default parameters.

After learn, TopicModel provides a model attribute that is an instance of 'gensim.models.ldamodel.LdaModel'.

parameters = {
    'iterations': 100,
    'alpha': 0.7,
    'eta': 0.05,
}
lda_model.learn(parameters=parameters)
type(lda_model.model)

# <class 'gensim.models.ldamodel.LdaModel'>

coherence

TopicModel provides coherence value for model evaluation.
The coherence value is automatically calculated right after model training.

print(lda_model.coherence)

# 0.3607990279229385

assign

The user can assign the most probable topic to each doc with the assign method.
After assign, the TopicModel provides the tag2topic and topic2tag attributes for convenience.

lda_model.assign()

print(lda_model.tag2topic)
print(lda_model.topic2tag)

# defaultdict(<class 'int'>, {'doc1': 1, 'doc2': 1, 'doc3': 0, 'doc4': 1, 'doc5': 0})
# defaultdict(<class 'list'>, {1: ['doc1', 'doc2', 'doc4'], 0: ['doc3', 'doc5']})
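
The two mappings can be pictured as an argmax over each document's topic distribution. The sketch below uses hypothetical (topic_id, probability) pairs, invented for illustration, and builds dictionaries of the same shape as the output above. It is not the internals of TopicModel.assign.

```python
from collections import defaultdict

# Hypothetical topic distributions per document tag.
doc_topics = {
    'doc1': [(0, 0.2), (1, 0.8)],
    'doc2': [(0, 0.3), (1, 0.7)],
    'doc3': [(0, 0.9), (1, 0.1)],
    'doc4': [(0, 0.4), (1, 0.6)],
    'doc5': [(0, 0.8), (1, 0.2)],
}

tag2topic = defaultdict(int)
topic2tag = defaultdict(list)
for tag, distribution in doc_topics.items():
    # Pick the topic with the highest probability for this document.
    topic_id, _ = max(distribution, key=lambda pair: pair[1])
    tag2topic[tag] = topic_id
    topic2tag[topic_id].append(tag)

print(dict(tag2topic))
print(dict(topic2tag))
```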

NamedEntityRecognition

Before using the NER modules, the user should install compatible versions of TensorFlow, Keras, and the other pinned packages below.

pip install config==0.4.2 gensim==3.8.1 gpustat==0.6.0 GPUtil==1.4.0 h5py==2.10.0 JPype1==0.7.1 Keras==2.2.4 konlpy==0.5.2 nltk==3.4.5 numpy==1.18.1 pandas==1.0.1 scikit-learn==0.22.1 scipy==1.4.1 silence-tensorflow==1.1.1 soynlp==0.0.493 tensorflow==1.14.0 tensorflow-gpu==1.14.0

The modules might also require keras-contrib.
The user can install it as follows.

git clone https://www.github.com/keras-team/keras-contrib.git 
cd keras-contrib 
python setup.py install

Labels

NER_Model is a class that conducts named entity recognition using Bidirectional Long Short-Term Memory (Bi-LSTM) and Conditional Random Field (CRF).

First, appropriate labels are required.
The labels should be numbered starting from 0.

from connlp.analysis_ner import NER_Labels

label_dict = {'NON': 0,     #None
              'PER': 1,     #PERSON
              'FOD': 2,}    #FOOD

ner_labels = NER_Labels(label_dict=label_dict)

Corpus

Next, the user should prepare data consisting of sentences and labels, where each sentence and its labels are matched by the same tag.
The tokenized sentences and labels are then combined via NER_LabeledSentence.
NER_Corpus is built from the data, the labels, and a proper max_sent_len (i.e., the maximum sentence length for analysis).
Once the corpus is built, every sentence and label sequence is padded to the length of max_sent_len.

from connlp.preprocess import EnglishTokenizer
from connlp.analysis_ner import NER_LabeledSentence, NER_Corpus
tokenizer = EnglishTokenizer()

data_sents = {'sent1': 'Sam likes pizza',
              'sent2': 'Erik eats pizza',
              'sent3': 'Erik and Sam are drinking soda',
              'sent4': 'Flora cooks chicken',
              'sent5': 'Sam ordered a chicken',
              'sent6': 'Flora likes chicken sandwich',
              'sent7': 'Erik likes to drink soda'}
data_labels = {'sent1': [1, 0, 2],
               'sent2': [1, 0, 2],
               'sent3': [1, 0, 1, 0, 0, 2],
               'sent4': [1, 0, 2],
               'sent5': [1, 0, 0, 2],
               'sent6': [1, 0, 2, 2],
               'sent7': [1, 0, 0, 0, 2]}

docs = []
for tag, sent in data_sents.items():
    words = [str(w) for w in tokenizer.tokenize(text=sent)]
    labels = data_labels[tag]
    docs.append(NER_LabeledSentence(tag=tag, words=words, labels=labels))

max_sent_len = 10
ner_corpus = NER_Corpus(docs=docs, ner_labels=ner_labels, max_sent_len=max_sent_len)
type(ner_corpus)

# <class 'connlp.analysis_ner.NER_Corpus'>
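
Padding to max_sent_len can be sketched as follows. The padding values ('__PAD__' for words, 0 for labels), the truncation of longer sentences, and the helper name are assumptions made for illustration. The document does not state what NER_Corpus uses internally.

```python
def pad_sequence(seq, max_len, pad_value):
    """Truncate or right-pad `seq` so that it has exactly `max_len` items."""
    return list(seq[:max_len]) + [pad_value] * max(0, max_len - len(seq))


PAD_WORD = '__PAD__'  # hypothetical padding token
PAD_LABEL = 0         # assumption: pad labels with the 'NON' id

words = ['Sam', 'likes', 'pizza']
labels = [1, 0, 2]
max_sent_len = 10

padded_words = pad_sequence(words, max_sent_len, PAD_WORD)
padded_labels = pad_sequence(labels, max_sent_len, PAD_LABEL)
print(padded_words)
print(padded_labels)
```

Fixed-length sequences let all sentences be stacked into a single array for the Bi-LSTM.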

Word Embedding

Every word in the NER_Corpus should be embedded into a numeric vector space.
The user can conduct the embedding with Word2Vec, which is provided by the Vectorizer of connlp.
Note that the embedding process of NER_Corpus requires only the dictionary of word vectors and the feature size.

from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()

tokenized_sents = [tokenizer.tokenize(sent) for sent in data_sents.values()]
w2v_model = vectorizer.word2vec(docs=tokenized_sents)

word2vector = vectorizer.get_word_vectors(w2v_model)
feature_size = w2v_model.vector_size
ner_corpus.word_embedding(word2vector=word2vector, feature_size=feature_size)
print(ner_corpus.X_embedded)

# [[[-2.40120804e-03  1.74632657e-03  ...]
#   [-3.57543468e-03  2.86567654e-03  ...]
#   ...
#   [ 0.00000000e+00  0.00000000e+00  ...]] ...]
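
The shape of X_embedded can be pictured as one feature_size vector per position, with zero vectors in the padded positions, as the trailing rows of the output above suggest. The sketch below is an illustration only. Mapping unknown words to zero vectors is an assumption, and embed_sentence is a hypothetical helper with toy two-dimensional vectors.

```python
def embed_sentence(words, word2vector, feature_size, max_sent_len):
    """Return max_sent_len rows of length feature_size for one sentence."""
    zero = [0.0] * feature_size
    # Look up each word; unknown words get a zero vector (an assumption).
    rows = [list(word2vector.get(word, zero)) for word in words[:max_sent_len]]
    # Fill the remaining positions with zero vectors.
    rows += [list(zero) for _ in range(max_sent_len - len(rows))]
    return rows


# Toy word vectors, invented for this example only.
word2vector = {'Sam': [0.1, 0.2], 'likes': [0.3, 0.4], 'pizza': [0.5, 0.6]}
embedded = embed_sentence(['Sam', 'likes', 'pizza'], word2vector,
                          feature_size=2, max_sent_len=5)
print(embedded)
```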

Model Initialization

The parameters for the Bi-LSTM and for model training must be provided. They can be combined in a single dictionary.
The user should initialize the NER_Model with the NER_Corpus and the parameters.

from connlp.analysis_ner import NER_Model

parameters = {
    # Parameters for Bi-LSTM.
    'lstm_units': 512,
    'lstm_return_sequences': True,
    'lstm_recurrent_dropout': 0.2,
    'dense_units': 100,
    'dense_activation': 'relu',

    # Parameters for model training.
    'test_size': 0.3,
    'batch_size': 1,
    'epochs': 100,
    'validation_split': 0.1,
}

ner_model = NER_Model()
ner_model.initialize(ner_corpus=ner_corpus, parameters=parameters)
type(ner_model)

# <class 'connlp.analysis_ner.NER_Model'>

Model Training

The user can train the NER_Model with customized parameters.
The model automatically gets the dataset from the NER_Corpus.

ner_model.train(parameters=parameters)

# Train on 3 samples, validate on 1 samples
# Epoch 1/100
# 3/3 [==============================] - 3s 1s/step - loss: 1.4545 - crf_viterbi_accuracy: 0.3000 - val_loss: 1.0767 - val_crf_viterbi_accuracy: 0.8000
# Epoch 2/100
# 3/3 [==============================] - 0s 74ms/step - loss: 0.8602 - crf_viterbi_accuracy: 0.7000 - val_loss: 0.5287 - val_crf_viterbi_accuracy: 0.8000
# ...

Model Evaluation

The model performance is reported as a confusion matrix and F1 scores.

ner_model.evaluate()

# |--------------------------------------------------
# |Confusion Matrix:
# [[ 3  0  3  6]
#  [ 1  3  0  4]
#  [ 0  0  2  2]
#  [ 4  3  5 12]]
# |--------------------------------------------------
# |F1 Score: 0.757
# |--------------------------------------------------
# |    [NON]: 0.600
# |    [PER]: 0.857
# |    [FOD]: 0.571
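
The per-label scores above can be recomputed from the confusion matrix. The last row and column of the printed matrix appear to hold totals (for example, 3 + 0 + 3 = 6), so the sketch below uses only the inner 3x3 block. It is a plain-Python illustration of per-class F1, not the code behind ner_model.evaluate, and it does not try to reproduce the overall F1 score.

```python
labels = ['NON', 'PER', 'FOD']
confusion = [[3, 0, 3],
             [1, 3, 0],
             [0, 0, 2]]


def per_class_f1(matrix):
    """F1 per class: 2*TP / (row total + column total)."""
    scores = []
    for i in range(len(matrix)):
        true_positive = matrix[i][i]
        row_total = sum(matrix[i])
        col_total = sum(row[i] for row in matrix)
        denominator = row_total + col_total
        scores.append(2 * true_positive / denominator if denominator else 0.0)
    return scores


f1_by_label = dict(zip(labels, per_class_f1(confusion)))
for label, score in f1_by_label.items():
    print(f'[{label}]: {score:.3f}')
```

The formula 2*TP / (row total + column total) is equivalent to the harmonic mean of precision and recall, so it gives the same result whichever axis holds the true labels.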

Save

The user can save the NER_Model.
Saving writes the model itself ("<FileName>.pk") and the dataset ("<FileName>-dataset.pk") that was used in model development.
Note that the directory must exist before saving the model.

from connlp.util import makedir

fpath_model = 'test/ner/model.pk'
makedir(fpath=fpath_model)
ner_model.save(fpath_model=fpath_model)

Load

To load an already trained model, create a NER_Model and call load.

fpath_model = 'test/ner/model.pk'
ner_model = NER_Model()
ner_model.load(fpath_model=fpath_model, ner_corpus=ner_corpus, parameters=parameters)

Prediction

NER_Model can conduct NER on a new sentence.
The result is an instance of NER_Result.

from connlp.preprocess import EnglishTokenizer
tokenizer = EnglishTokenizer()

new_sent = 'Tom eats apple'
tokenized_sent = tokenizer.tokenize(new_sent)
ner_result = ner_model.predict(sent=tokenized_sent)
print(ner_result)

# Tom/PER eats/NON apple/FOD
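
The word/LABEL format of the printed result can be produced by inverting the label dictionary. The predicted ids below are hypothetical values chosen to match the output above. The sketch shows the formatting only and says nothing about how NER_Result is implemented.

```python
label_dict = {'NON': 0, 'PER': 1, 'FOD': 2}

# Invert the mapping so that predicted ids can be turned back into names.
id2label = {idx: name for name, idx in label_dict.items()}

words = ['Tom', 'eats', 'apple']
predicted_ids = [1, 0, 2]  # hypothetical model output

formatted = ' '.join(f'{word}/{id2label[idx]}'
                     for word, idx in zip(words, predicted_ids))
print(formatted)
```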

Web Crawling

connlp currently provides web crawling for Naver news articles.

Query

The user should prepare the queries first.
A single text file (.txt) should include all the information for the query, as below.

Date Start
Date End
Keywords

Keyword groups separated by a blank line ('\n\n') are combined in the same query.
Keywords on consecutive lines within a group (separated by '\n') are alternatives, each producing a different query.

For example, with the query file below, the web crawler searches for articles with six queries (two keyword combinations for each of the three dates): "smart+construction+safety at 20210718", "smart+construction+management at 20210718", "smart+construction+safety at 20210719", ...

20210718
20210720

smart

construction

safety
management

The NewsQueryParser parses the queries into appropriate formats.

from connlp.web_crawling import NewsQueryParser
query_parser = NewsQueryParser()

fpath_query = 'FILEPATH_OF_YOUR_QUERY'
query_list, date_list = query_parser.parse(fpath_query=fpath_query)
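
The expansion of a query file into individual searches can be sketched with the standard library. This illustrates the rules described above under stated assumptions: the dates form an inclusive daily range, and keyword groups are joined with '+'. It is not the NewsQueryParser implementation, and expand_queries is a hypothetical helper.

```python
from datetime import datetime, timedelta
from itertools import product


def expand_queries(query_text):
    """Return (query, date) pairs for every date and keyword combination."""
    blocks = [b.strip() for b in query_text.strip().split('\n\n') if b.strip()]
    date_start, date_end = blocks[0].split('\n')
    start = datetime.strptime(date_start, '%Y%m%d')
    end = datetime.strptime(date_end, '%Y%m%d')
    dates = [(start + timedelta(days=i)).strftime('%Y%m%d')
             for i in range((end - start).days + 1)]
    # Each remaining block is a group of alternative keywords.
    keyword_groups = [block.split('\n') for block in blocks[1:]]
    queries = ['+'.join(combo) for combo in product(*keyword_groups)]
    return [(query, date) for date in dates for query in queries]


query_text = """20210718
20210720

smart

construction

safety
management
"""
searches = expand_queries(query_text)
for query, date in searches:
    print(f'{query} at {date}')
```

itertools.product takes one keyword from every group, so adding a second alternative to another group would double the number of queries per date.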

URLs

In the second step, the web crawler parses the web page that shows the list of news articles.
NaverNewsListScraper parses this list page.
Saving the URL lists and loading them later is recommended.

from connlp.web_crawling import NaverNewsListScraper
list_scraper = NaverNewsListScraper()

for date in sorted(date_list, reverse=False):
    for query in query_list:
        url_list = list_scraper.get_url_list(query=query, date=date)

Articles

The last step is to parse the article page and get information from the article.
NaverNewsArticleParser returns an instance of Article for a given article.
Remember to extend the article's query list with the queries that retrieved it.

from connlp.web_crawling import NaverNewsArticleParser
article_parser = NaverNewsArticleParser()

fname_url_list = 'FILENAME_OF_YOUR_URL_LIST' # placeholder: file name of a saved URL list
query_list, _ = query_parser.urlname2query(fname_url_list=fname_url_list)
for url in url_list:
    article = article_parser.parse(url=url)
    article.extend_query(query_list)

Status

NewsStatus provides the status of the crawled corpus for given directories.

from connlp.web_crawling import NewsStatus
news_status = NewsStatus()

fdir_queries = 'DIRECTORY_FOR_QUERIES'
fdir_url_list = 'DIRECTORY_FOR_URLS'
fdir_article = 'DIRECTORY_FOR_ARTICLES'

news_status.queries(fdir_queries=fdir_queries)
news_status.urls(fdir_urls=fdir_url_list)
news_status.articles(fdir_articles=fdir_article)

Visualization

Visualizer

Visualizer includes several simple tools for text visualization.

Install the following packages.

pip install networkx wordcloud

network

The network method provides a word network for tokenized docs.

from connlp.preprocess import EnglishTokenizer
from connlp.visualize import Visualizer
tokenizer = EnglishTokenizer()
visualizer = Visualizer()

docs = ['I am a boy', 'She is a girl']
tokenized_docs = [tokenizer.tokenize(text=doc) for doc in docs]
word_network = visualizer.network(docs=tokenized_docs, show=True)

The word network is a matplotlib.pyplot object.
The user can save the figure with the .savefig() method.

word_network.savefig(FILEPATH)
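
One common way to build a word network is to link words that occur in the same document and to weight each link by how often that happens. The sketch below counts such co-occurrences with the standard library. Whether Visualizer.network builds its edges exactly this way is an assumption, so this is an illustration of the general idea only.

```python
from collections import Counter
from itertools import combinations

tokenized_docs = [['I', 'am', 'a', 'boy'],
                  ['She', 'is', 'a', 'girl']]

# Count how many documents contain each unordered pair of words.
edges = Counter()
for doc in tokenized_docs:
    for pair in combinations(sorted(set(doc)), 2):
        edges[pair] += 1

for (source, target), weight in edges.items():
    print(source, target, weight)
```

The resulting weighted pairs could then be handed to a graph library such as networkx for layout and drawing.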

wordcloud

The wordcloud method provides a word cloud for tokenized docs.

from connlp.preprocess import EnglishTokenizer
from connlp.visualize import Visualizer
tokenizer = EnglishTokenizer()
visualizer = Visualizer()

docs = ['I am a boy', 'She is a girl']
tokenized_docs = [tokenizer.tokenize(text=doc) for doc in docs]
wordcloud = visualizer.wordcloud(docs=tokenized_docs, show=True)

The wordcloud is a matplotlib.pyplot object.
The user can save the figure with the .savefig() method.

wordcloud.savefig(FILEPATH)

Extracting Text

TextConverter

TextConverter includes several methods that extract raw text from various types of files (e.g., PDF, HWP) and/or convert the files into plain text files (e.g., TXT).

hwp2txt

The hwp2txt method converts an HWP file into a plain text file. It depends on the pyhwp package.

Install pyhwp (the pre-release version is required).

pip install --pre pyhwp

Example

from connlp.text_extract import TextConverter
converter = TextConverter()

hwp_fpath = '/data/raw/hwp_file.hwp'
output_fpath = '/data/processed/extracted_text.txt'

converter.hwp2txt(hwp_fpath, output_fpath) # returns 0 if no error occurs

GPU Utils

GPUMonitor

GPUMonitor is a class that monitors and displays the GPU status based on nvidia-smi.
Refer to "https://github.com/anderskm/gputil" and "https://data-newbie.tistory.com/561" for usage.

Install the GPUtil module with pip.

pip install GPUtil

Write your code between the creation of the GPUMonitor and the call to monitor.stop().

from connlp.util import GPUMonitor

monitor = GPUMonitor(delay=3)
# >>>Write your code here<<<
monitor.stop()

# | ID | GPU | MEM |
# ------------------
# |  0 |  0% |  0% |
# |  1 |  1% |  0% |
# |  2 |  0% | 94% |
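
The start-then-stop usage pattern suggests a background thread that reports at a fixed interval until it is told to stop. The sketch below shows that pattern with the standard library only. PeriodicMonitor and its report callback are hypothetical names, and the callback records timestamps instead of querying GPUs. It is not the GPUMonitor implementation.

```python
import threading
import time


class PeriodicMonitor(threading.Thread):
    """Call `report` every `delay` seconds until stop() is called."""

    def __init__(self, delay, report):
        super().__init__(daemon=True)
        self.delay = delay
        self.report = report
        self._stop_event = threading.Event()
        self.start()  # monitoring begins as soon as the object is created

    def run(self):
        while True:
            self.report()
            # wait() returns True once stop() has set the event.
            if self._stop_event.wait(self.delay):
                break

    def stop(self):
        self._stop_event.set()
        self.join()


samples = []
monitor = PeriodicMonitor(delay=0.01, report=lambda: samples.append(time.time()))
time.sleep(0.05)  # stands in for the user's own workload
monitor.stop()
print(f'{len(samples)} reports collected')
```

Waiting on an Event rather than calling time.sleep lets stop() interrupt the delay immediately instead of blocking for up to one full interval.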
