๐Ÿš€ Big News: Socket Acquires Coana to Bring Reachability Analysis to Every Appsec Team.Learn more โ†’
Socket
DemoInstallSign in
Socket

soynlp

Package Overview
Dependencies
Maintainers
1
Alerts
File Explorer

Advanced tools

Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

soynlp

Unsupervised Korean Natural Language Processing Toolkits

0.0.493
PyPI
Maintainers
1

soynlp

ํ•œ๊ตญ์–ด ๋ถ„์„์„ ์œ„ํ•œ pure python code ์ž…๋‹ˆ๋‹ค. ํ•™์Šต๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•˜์ง€ ์•Š์œผ๋ฉด์„œ ๋ฐ์ดํ„ฐ์— ์กด์žฌํ•˜๋Š” ๋‹จ์–ด๋ฅผ ์ฐพ๊ฑฐ๋‚˜, ๋ฌธ์žฅ์„ ๋‹จ์–ด์—ด๋กœ ๋ถ„ํ•ด, ํ˜น์€ ํ’ˆ์‚ฌ ํŒ๋ณ„์„ ํ•  ์ˆ˜ ์žˆ๋Š” ๋น„์ง€๋„ํ•™์Šต ์ ‘๊ทผ๋ฒ•์„ ์ง€ํ–ฅํ•ฉ๋‹ˆ๋‹ค.

Guide

Usage guide

soynlp ์—์„œ ์ œ๊ณตํ•˜๋Š” WordExtractor ๋‚˜ NounExtractor ๋Š” ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๋ฌธ์„œ๋กœ๋ถ€ํ„ฐ ํ•™์Šตํ•œ ํ†ต๊ณ„ ์ •๋ณด๋ฅผ ์ด์šฉํ•˜์—ฌ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. ๋น„์ง€๋„ํ•™์Šต ๊ธฐ๋ฐ˜ ์ ‘๊ทผ๋ฒ•๋“ค์€ ํ†ต๊ณ„์  ํŒจํ„ด์„ ์ด์šฉํ•˜์—ฌ ๋‹จ์–ด๋ฅผ ์ถ”์ถœํ•˜๊ธฐ ๋•Œ๋ฌธ์— ํ•˜๋‚˜์˜ ๋ฌธ์žฅ ํ˜น์€ ๋ฌธ์„œ์—์„œ ๋ณด๋‹ค๋Š” ์–ด๋А ์ •๋„ ๊ทœ๋ชจ๊ฐ€ ์žˆ๋Š” ๋™์ผํ•œ ์ง‘๋‹จ์˜ ๋ฌธ์„œ (homogeneous documents) ์—์„œ ์ž˜ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. ์˜ํ™” ๋Œ“๊ธ€๋“ค์ด๋‚˜ ํ•˜๋ฃจ์˜ ๋‰ด์Šค ๊ธฐ์‚ฌ์ฒ˜๋Ÿผ ๊ฐ™์€ ๋‹จ์–ด๋ฅผ ์ด์šฉํ•˜๋Š” ์ง‘ํ•ฉ์˜ ๋ฌธ์„œ๋งŒ ๋ชจ์•„์„œ Extractors ๋ฅผ ํ•™์Šตํ•˜์‹œ๋ฉด ์ข‹์Šต๋‹ˆ๋‹ค. ์ด์งˆ์ ์ธ ์ง‘๋‹จ์˜ ๋ฌธ์„œ๋“ค์€ ํ•˜๋‚˜๋กœ ๋ชจ์•„ ํ•™์Šตํ•˜๋ฉด ๋‹จ์–ด๊ฐ€ ์ž˜ ์ถ”์ถœ๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

Parameter naming

Up to soynlp 0.0.46, parameters expressing a minimum or maximum value had inconsistent names such as min_score, minimum_score, and l_len_min. This may confuse users who set these parameters directly in code written so far, but the names were changed sooner rather than later to reduce future inconvenience.

From 0.0.47 on, variable names carrying the meaning of minimum or maximum are abbreviated to min and max, followed by the name of the item the threshold applies to. Parameter names are unified under the pattern {min, max}_{noun, word}_{score, threshold}; when the item is obvious, it may be omitted.

soynlp ์—์„œ๋Š” substring counting ์„ ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค. ๋นˆ๋„์ˆ˜์™€ ๊ด€๋ จ๋œ parameter ๋Š” count ๊ฐ€ ์•„๋‹Œ frequency ๋กœ ํ†ต์ผํ•ฉ๋‹ˆ๋‹ค.

index ์™€ idx ๋Š” idx ๋กœ ํ†ต์ผํ•ฉ๋‹ˆ๋‹ค.

์ˆซ์ž๋ฅผ ์˜๋ฏธํ•˜๋Š” num ๊ณผ n ์€ num ์œผ๋กœ ํ†ต์ผํ•ฉ๋‹ˆ๋‹ค.

Setup

$ pip install soynlp

Python version

  • Python 3.5+ is supported. Most development happens on 3.x, so 3.x is recommended.
  • Python 2.x has not been fully tested for every feature.

Requires

  • numpy >= 1.12.1
  • psutil >= 5.0.1
  • scipy >= 1.1.0
  • scikit-learn >= 0.20.0

Noun Extractor

๋ช…์‚ฌ ์ถ”์ถœ์„ ํ•˜๊ธฐ ์œ„ํ•ด ์—ฌ๋Ÿฌ ์‹œ๋„๋ฅผ ํ•œ ๊ฒฐ๊ณผ, v1, news, v2 ์„ธ ๊ฐ€์ง€ ๋ฒ„์ „์ด ๋งŒ๋“ค์–ด์กŒ์Šต๋‹ˆ๋‹ค. ๊ฐ€์žฅ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š” ๊ฒƒ์€ v2 ์ž…๋‹ˆ๋‹ค.

WordExtractor ๋Š” ํ†ต๊ณ„๋ฅผ ์ด์šฉํ•˜์—ฌ ๋‹จ์–ด์˜ ๊ฒฝ๊ณ„ ์ ์ˆ˜๋ฅผ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ผ ๋ฟ, ๊ฐ ๋‹จ์–ด์˜ ํ’ˆ์‚ฌ๋ฅผ ํŒ๋‹จํ•˜์ง€๋Š” ๋ชปํ•ฉ๋‹ˆ๋‹ค. ๋•Œ๋กœ๋Š” ๊ฐ ๋‹จ์–ด์˜ ํ’ˆ์‚ฌ๋ฅผ ์•Œ์•„์•ผ ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ ๋‹ค๋ฅธ ํ’ˆ์‚ฌ๋ณด๋‹ค๋„ ๋ช…์‚ฌ์—์„œ ์ƒˆ๋กœ์šด ๋‹จ์–ด๊ฐ€ ๊ฐ€์žฅ ๋งŽ์ด ๋งŒ๋“ค์–ด์ง‘๋‹ˆ๋‹ค. ๋ช…์‚ฌ์˜ ์˜ค๋ฅธ์ชฝ์—๋Š” -์€, -๋Š”, -๋ผ๋Š”, -ํ•˜๋Š” ์ฒ˜๋Ÿผ ํŠน์ • ๊ธ€์ž๋“ค์ด ์ž์ฃผ ๋“ฑ์žฅํ•ฉ๋‹ˆ๋‹ค. ๋ฌธ์„œ์˜ ์–ด์ ˆ (๋„์–ด์“ฐ๊ธฐ ๊ธฐ์ค€ ์œ ๋‹›)์—์„œ ์™ผ์ชฝ์— ์œ„์น˜ํ•œ substring ์˜ ์˜ค๋ฅธ์ชฝ์— ์–ด๋–ค ๊ธ€์ž๋“ค์ด ๋“ฑ์žฅํ•˜๋Š”์ง€ ๋ถ„ํฌ๋ฅผ ์‚ดํŽด๋ณด๋ฉด ๋ช…์‚ฌ์ธ์ง€ ์•„๋‹Œ์ง€ ํŒ๋‹จํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. soynlp ์—์„œ๋Š” ๋‘ ๊ฐ€์ง€ ์ข…๋ฅ˜์˜ ๋ช…์‚ฌ ์ถ”์ถœ๊ธฐ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๋‘˜ ๋ชจ๋‘ ๊ฐœ๋ฐœ ๋‹จ๊ณ„์ด๊ธฐ ๋•Œ๋ฌธ์— ์–ด๋–ค ๊ฒƒ์ด ๋” ์šฐ์ˆ˜ํ•˜๋‹ค ๋งํ•˜๊ธฐ๋Š” ์–ด๋ ต์Šต๋‹ˆ๋‹ค๋งŒ, NewsNounExtractor ๊ฐ€ ์ข€ ๋” ๋งŽ์€ ๊ธฐ๋Šฅ์„ ํฌํ•จํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ถ”ํ›„, ๋ช…์‚ฌ ์ถ”์ถœ๊ธฐ๋Š” ํ•˜๋‚˜์˜ ํด๋ž˜์Šค๋กœ ์ •๋ฆฌ๋  ์˜ˆ์ •์ž…๋‹ˆ๋‹ค.

Noun Extractor ver 1 & News Noun Extractor

from soynlp.noun import LRNounExtractor
noun_extractor = LRNounExtractor()
nouns = noun_extractor.train_extract(sentences) # sentences: iterable of str

from soynlp.noun import NewsNounExtractor
noun_extractor = NewsNounExtractor()
nouns = noun_extractor.train_extract(sentences) # sentences: iterable of str

Example nouns learned from news articles of 2016-10-20:

๋ด๋งˆํฌ  ์›ƒ๋ˆ  ๋„ˆ๋ฌด๋„ˆ๋ฌด๋„ˆ๋ฌด  ๊ฐ€๋ฝ๋™  ๋งค๋‰ด์–ผ  ์ง€๋„๊ต์ˆ˜
์ „๋ง์น˜  ๊ฐ•๊ตฌ  ์–ธ๋‹ˆ๋“ค  ์‹ ์‚ฐ์—…  ๊ธฐ๋ขฐ์ „  ๋…ธ์Šค
ํ• ๋ฆฌ์šฐ๋“œ  ํ”Œ๋ผ์ž  ๋ถˆ๋ฒ•์กฐ์—…  ์›”์ŠคํŠธ๋ฆฌํŠธ์ €๋„  2022๋…„  ๋ถˆํ—ˆ
๊ณ ์”จ  ์–ดํ”Œ  1987๋…„  ๋ถˆ์”จ  ์ ๊ธฐ  ๋ ˆ์Šค
์Šคํ€˜์–ด  ์ถฉ๋‹น๊ธˆ  ๊ฑด์ถ•๋ฌผ  ๋‰ด์งˆ๋žœ๋“œ  ์‚ฌ๊ฐ  ํ•˜๋‚˜์”ฉ
๊ทผ๋Œ€  ํˆฌ์ž์ฃผ์ฒด๋ณ„  4์œ„  ํƒœ๊ถŒ  ๋„คํŠธ์›์Šค  ๋ชจ๋ฐ”์ผ๊ฒŒ์ž„
์—ฐ๋™  ๋Ÿฐ์นญ  ๋งŒ์„ฑ  ์†์งˆ  ์ œ์ž‘๋ฒ•  ํ˜„์‹คํ™”
์˜คํ•ด์˜  ์‹ฌ์‚ฌ์œ„์›๋“ค  ๋‹จ์   ๋ถ€์žฅ์กฐ๋ฆฌ  ์ฐจ๊ด€๊ธ‰  ๊ฒŒ์‹œ๋ฌผ
์ธํ„ฐํฐ  ์›ํ™”  ๋‹จ๊ธฐ๊ฐ„  ํŽธ๊ณก  ๋ฌด์‚ฐ  ์™ธ๊ตญ์ธ๋“ค
์„ธ๋ฌด์กฐ์‚ฌ  ์„์œ ํ™”ํ•™  ์›Œํ‚น  ์›ํ”ผ์Šค  ์„œ์žฅ  ๊ณต๋ฒ”

๋” ์ž์„ธํ•œ ์„ค๋ช…์€ ํŠœํ† ๋ฆฌ์–ผ์— ์žˆ์Šต๋‹ˆ๋‹ค.

Noun Extractor ver 2

soynlp 0.0.46+ provides noun extractor version 2, which improves noun-extraction accuracy and compound-noun recognition and fixes errors in the information reported by the previous version. Usage is similar to version 1.

from soynlp.utils import DoublespaceLineCorpus
from soynlp.noun import LRNounExtractor_v2

corpus_path = '2016-10-20-news'
sents = DoublespaceLineCorpus(corpus_path, iter_sent=True)

noun_extractor = LRNounExtractor_v2(verbose=True)
nouns = noun_extractor.train_extract(sents)

์ถ”์ถœ๋œ nouns ๋Š” {str:namedtuple} ํ˜•์‹์ž…๋‹ˆ๋‹ค.

print(nouns['๋‰ด์Šค']) # NounScore(frequency=4319, score=1.0)

_compounds_components stores the single nouns composing each compound noun. Words such as '대한민국' and '녹색성장' are morphologically compound, but when they are used as a single noun they are recognized as one.

list(noun_extractor._compounds_components.items())[:5]

# [('์ž ์ˆ˜ํ•จ๋ฐœ์‚ฌํƒ„๋„๋ฏธ์‚ฌ์ผ', ('์ž ์ˆ˜ํ•จ', '๋ฐœ์‚ฌ', 'ํƒ„๋„๋ฏธ์‚ฌ์ผ')),
#  ('๋ฏธ์‚ฌ์ผ๋Œ€์‘๋Šฅ๋ ฅ์œ„์›ํšŒ', ('๋ฏธ์‚ฌ์ผ', '๋Œ€์‘', '๋Šฅ๋ ฅ', '์œ„์›ํšŒ')),
#  ('๊ธ€๋กœ๋ฒŒ๋…น์ƒ‰์„ฑ์žฅ์—ฐ๊ตฌ์†Œ', ('๊ธ€๋กœ๋ฒŒ', '๋…น์ƒ‰์„ฑ์žฅ', '์—ฐ๊ตฌ์†Œ')),
#  ('์‹œ์นด๊ณ ์˜ต์…˜๊ฑฐ๋ž˜์†Œ', ('์‹œ์นด๊ณ ', '์˜ต์…˜', '๊ฑฐ๋ž˜์†Œ')),
#  ('๋Œ€ํ•œ๋ฏผ๊ตญํŠน์ˆ˜์ž„๋ฌด์œ ๊ณต', ('๋Œ€ํ•œ๋ฏผ๊ตญ', 'ํŠน์ˆ˜', '์ž„๋ฌด', '์œ ๊ณต')),

LRGraph ๋Š” ํ•™์Šต๋œ corpus ์— ๋“ฑ์žฅํ•œ ์–ด์ ˆ์˜ L-R ๊ตฌ์กฐ๋ฅผ ์ €์žฅํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. get_r ๊ณผ get_l ์„ ์ด์šฉํ•˜์—ฌ ์ด๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

noun_extractor.lrgraph.get_r('์•„์ด์˜ค์•„์ด')

# [('', 123),
#  ('์˜', 47),
#  ('๋Š”', 40),
#  ('์™€', 18),
#  ('๊ฐ€', 18),
#  ('์—', 7),
#  ('์—๊ฒŒ', 6),
#  ('๊นŒ์ง€', 2),
#  ('๋ž‘', 2),
#  ('๋ถ€ํ„ฐ', 1)]

๋” ์ž์„ธํ•œ ์„ค๋ช…์€ ํŠœํ† ๋ฆฌ์–ผ 2์— ์žˆ์Šต๋‹ˆ๋‹ค.

Word Extraction

2016 ๋…„ 10์›”์˜ ์—ฐ์˜ˆ๊ธฐ์‚ฌ ๋‰ด์Šค์—๋Š” 'ํŠธ์™€์ด์Šค', '์•„์ด์˜ค์•„์ด' ์™€ ๊ฐ™์€ ๋‹จ์–ด๊ฐ€ ์กด์žฌํ•ฉ๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋ง๋ญ‰์น˜๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•™์Šต๋œ ํ’ˆ์‚ฌ ํŒ๋ณ„๊ธฐ / ํ˜•ํƒœ์†Œ ๋ถ„์„๊ธฐ๋Š” ์ด๋Ÿฐ ๋‹จ์–ด๋ฅผ ๋ณธ ์ ์ด ์—†์Šต๋‹ˆ๋‹ค. ๋Š˜ ์ƒˆ๋กœ์šด ๋‹จ์–ด๊ฐ€ ๋งŒ๋“ค์–ด์ง€๊ธฐ ๋•Œ๋ฌธ์— ํ•™์Šตํ•˜์ง€ ๋ชปํ•œ ๋‹จ์–ด๋ฅผ ์ œ๋Œ€๋กœ ์ธ์‹ํ•˜์ง€ ๋ชปํ•˜๋Š” ๋ฏธ๋“ฑ๋ก๋‹จ์–ด ๋ฌธ์ œ (out of vocabulry, OOV) ๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์ด ์‹œ๊ธฐ์— ์ž‘์„ฑ๋œ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ์—ฐ์˜ˆ ๋‰ด์Šค ๊ธฐ์‚ฌ๋ฅผ ์ฝ๋‹ค๋ณด๋ฉด 'ํŠธ์™€์ด์Šค', '์•„์ด์˜ค์•„์ด' ๊ฐ™์€ ๋‹จ์–ด๊ฐ€ ๋“ฑ์žฅํ•จ์„ ์•Œ ์ˆ˜ ์žˆ๊ณ , ์‚ฌ๋žŒ์€ ์ด๋ฅผ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฌธ์„œ์ง‘ํ•ฉ์—์„œ ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” ์—ฐ์†๋œ ๋‹จ์–ด์—ด์„ ๋‹จ์–ด๋ผ ์ •์˜ํ•œ๋‹ค๋ฉด, ์šฐ๋ฆฌ๋Š” ํ†ต๊ณ„๋ฅผ ์ด์šฉํ•˜์—ฌ ์ด๋ฅผ ์ถ”์ถœํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ†ต๊ณ„ ๊ธฐ๋ฐ˜์œผ๋กœ ๋‹จ์–ด(์˜ ๊ฒฝ๊ณ„)๋ฅผ ํ•™์Šตํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ๋‹ค์–‘ํ•ฉ๋‹ˆ๋‹ค. soynlp๋Š” ๊ทธ ์ค‘, Cohesion score, Branching Entropy, Accessor Variety ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

from soynlp.word import WordExtractor

word_extractor = WordExtractor(min_frequency=100,
    min_cohesion_forward=0.05, 
    min_right_branching_entropy=0.0
)
word_extractor.train(sentences) # sentences: iterable of str
words = word_extractor.extract()

words ๋Š” Scores ๋ผ๋Š” namedtuple ์„ value ๋กœ ์ง€๋‹ˆ๋Š” dict ์ž…๋‹ˆ๋‹ค.

words['์•„์ด์˜ค์•„์ด']

Scores(cohesion_forward=0.30063636035733476,
        cohesion_backward=0,
        left_branching_entropy=0,
        right_branching_entropy=0,
        left_accessor_variety=0,
        right_accessor_variety=0,
        leftside_frequency=270,
        rightside_frequency=0
)

An example from 2016-10-26 news articles, sorted by word score (cohesion × branching entropy):

๋‹จ์–ด   (๋นˆ๋„์ˆ˜, cohesion, branching entropy)

์ดฌ์˜     (2222, 1.000, 1.823)
์„œ์šธ     (25507, 0.657, 2.241)
๋“ค์–ด     (3906, 0.534, 2.262)
๋กฏ๋ฐ     (1973, 0.999, 1.542)
ํ•œ๊ตญ     (9904, 0.286, 2.729)
๋ถํ•œ     (4954, 0.766, 1.729)
ํˆฌ์ž     (4549, 0.630, 1.889)
๋–จ์–ด     (1453, 0.817, 1.515)
์ง„ํ–‰     (8123, 0.516, 1.970)
์–˜๊ธฐ     (1157, 0.970, 1.328)
์šด์˜     (4537, 0.592, 1.768)
ํ”„๋กœ๊ทธ๋žจ  (2738, 0.719, 1.527)
ํด๋ฆฐํ„ด   (2361, 0.751, 1.420)
๋›ฐ์–ด     (927, 0.831, 1.298)
๋“œ๋ผ๋งˆ   (2375, 0.609, 1.606)
์šฐ๋ฆฌ     (7458, 0.470, 1.827)
์ค€๋น„     (1736, 0.639, 1.513)
๋ฃจ์ด     (1284, 0.743, 1.354)
ํŠธ๋Ÿผํ”„   (3565, 0.712, 1.355)
์ƒ๊ฐ     (3963, 0.335, 2.024)
ํŒฌ๋“ค     (999, 0.626, 1.341)
์‚ฐ์—…     (2203, 0.403, 1.769)
10      (18164, 0.256, 2.210)
ํ™•์ธ     (3575, 0.306, 2.016)
ํ•„์š”     (3428, 0.635, 1.279)
๋ฌธ์ œ     (4737, 0.364, 1.808)
ํ˜์˜     (2357, 0.962, 0.830)
ํ‰๊ฐ€     (2749, 0.362, 1.787)
20      (59317, 0.667, 1.171)
์Šคํฌ์ธ     (3422, 0.428, 1.604)

์ž์„ธํ•œ ๋‚ด์šฉ์€ word extraction tutorial ์— ์žˆ์Šต๋‹ˆ๋‹ค. ํ˜„์žฌ ๋ฒ„์ „์—์„œ ์ œ๊ณตํ•˜๋Š” ๊ธฐ๋Šฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

Tokenizer

WordExtractor ๋กœ๋ถ€ํ„ฐ ๋‹จ์–ด ์ ์ˆ˜๋ฅผ ํ•™์Šตํ•˜์˜€๋‹ค๋ฉด, ์ด๋ฅผ ์ด์šฉํ•˜์—ฌ ๋‹จ์–ด์˜ ๊ฒฝ๊ณ„๋ฅผ ๋”ฐ๋ผ ๋ฌธ์žฅ์„ ๋‹จ์–ด์—ด๋กœ ๋ถ„ํ•ดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. soynlp ๋Š” ์„ธ ๊ฐ€์ง€ ํ† ํฌ๋‚˜์ด์ €๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๋„์–ด์“ฐ๊ธฐ๊ฐ€ ์ž˜ ๋˜์–ด ์žˆ๋‹ค๋ฉด LTokenizer ๋ฅผ ์ด์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ•œ๊ตญ์–ด ์–ด์ ˆ์˜ ๊ตฌ์กฐ๋ฅผ "๋ช…์‚ฌ + ์กฐ์‚ฌ" ์ฒ˜๋Ÿผ "L + [R]" ๋กœ ์ƒ๊ฐํ•ฉ๋‹ˆ๋‹ค.

LTokenizer

L parts ์—๋Š” ๋ช…์‚ฌ/๋™์‚ฌ/ํ˜•์šฉ์‚ฌ/๋ถ€์‚ฌ๊ฐ€ ์œ„์น˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์–ด์ ˆ์—์„œ L ๋งŒ ์ž˜ ์ธ์‹ํ•œ๋‹ค๋ฉด ๋‚˜๋จธ์ง€ ๋ถ€๋ถ„์ด R parts ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. LTokenizer ์—๋Š” L parts ์˜ ๋‹จ์–ด ์ ์ˆ˜๋ฅผ ์ž…๋ ฅํ•ฉ๋‹ˆ๋‹ค.

from soynlp.tokenizer import LTokenizer

scores = {'๋ฐ์ด':0.5, '๋ฐ์ดํ„ฐ':0.5, '๋ฐ์ดํ„ฐ๋งˆ์ด๋‹':0.5, '๊ณต๋ถ€':0.5, '๊ณต๋ถ€์ค‘':0.45}
tokenizer = LTokenizer(scores=scores)

sent = '๋ฐ์ดํ„ฐ๋งˆ์ด๋‹์„ ๊ณต๋ถ€ํ•œ๋‹ค'

print(tokenizer.tokenize(sent, flatten=False))
#[['๋ฐ์ดํ„ฐ๋งˆ์ด๋‹', '์„'], ['๊ณต๋ถ€', '์ค‘์ด๋‹ค']]

print(tokenizer.tokenize(sent))
# ['๋ฐ์ดํ„ฐ๋งˆ์ด๋‹', '์„', '๊ณต๋ถ€', '์ค‘์ด๋‹ค']

๋งŒ์•ฝ WordExtractor ๋ฅผ ์ด์šฉํ•˜์—ฌ ๋‹จ์–ด ์ ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜์˜€๋‹ค๋ฉด, ๋‹จ์–ด ์ ์ˆ˜ ์ค‘ ํ•˜๋‚˜๋ฅผ ํƒํ•˜์—ฌ scores ๋ฅผ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” Forward cohesion ์˜ ์ ์ˆ˜๋งŒ์„ ์ด์šฉํ•˜๋Š” ๊ฒฝ์šฐ์ž…๋‹ˆ๋‹ค. ๊ทธ ์™ธ์—๋„ ๋‹ค์–‘ํ•˜๊ฒŒ ๋‹จ์–ด ์ ์ˆ˜๋ฅผ ์ •์˜ํ•˜์—ฌ ์ด์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

from soynlp.word import WordExtractor
from soynlp.utils import DoublespaceLineCorpus

file_path = 'your file path'
corpus = DoublespaceLineCorpus(file_path, iter_sent=True)

word_extractor = WordExtractor(
    min_frequency=100, # example
    min_cohesion_forward=0.05,
    min_right_branching_entropy=0.0
)

word_extractor.train(corpus)
words = word_extractor.extract()

cohesion_score = {word:score.cohesion_forward for word, score in words.items()}
tokenizer = LTokenizer(scores=cohesion_score)

๋ช…์‚ฌ ์ถ”์ถœ๊ธฐ์˜ ๋ช…์‚ฌ ์ ์ˆ˜์™€ Cohesion ์„ ํ•จ๊ป˜ ์ด์šฉํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค. ํ•œ ์˜ˆ๋กœ, "Cohesion ์ ์ˆ˜ + ๋ช…์‚ฌ ์ ์ˆ˜"๋ฅผ ๋‹จ์–ด ์ ์ˆ˜๋กœ ์ด์šฉํ•˜๋ ค๋ฉด ์•„๋ž˜์ฒ˜๋Ÿผ ์ž‘์—…ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

from soynlp.noun import LRNounExtractor_v2
noun_extractor = LRNounExtractor_v2()
nouns = noun_extractor.train_extract(corpus) # corpus: iterable of str

noun_scores = {noun:score.score for noun, score in nouns.items()}
combined_scores = {noun:score + cohesion_score.get(noun, 0)
    for noun, score in noun_scores.items()}
combined_scores.update(
    {subword:cohesion for subword, cohesion in cohesion_score.items()
     if subword not in combined_scores}
)

tokenizer = LTokenizer(scores=combined_scores)

MaxScoreTokenizer

๋„์–ด์“ฐ๊ธฐ๊ฐ€ ์ œ๋Œ€๋กœ ์ง€์ผœ์ง€์ง€ ์•Š์€ ๋ฐ์ดํ„ฐ๋ผ๋ฉด, ๋ฌธ์žฅ์˜ ๋„์–ด์“ฐ๊ธฐ ๊ธฐ์ค€์œผ๋กœ ๋‚˜๋‰˜์–ด์ง„ ๋‹จ์œ„๊ฐ€ L + [R] ๊ตฌ์กฐ๋ผ ๊ฐ€์ •ํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์‚ฌ๋žŒ์€ ๋„์–ด์“ฐ๊ธฐ๊ฐ€ ์ง€์ผœ์ง€์ง€ ์•Š์€ ๋ฌธ์žฅ์—์„œ ์ต์ˆ™ํ•œ ๋‹จ์–ด๋ถ€ํ„ฐ ๋ˆˆ์— ๋“ค์–ด์˜ต๋‹ˆ๋‹ค. ์ด ๊ณผ์ •์„ ๋ชจ๋ธ๋กœ ์˜ฎ๊ธด MaxScoreTokenizer ์—ญ์‹œ ๋‹จ์–ด ์ ์ˆ˜๋ฅผ ์ด์šฉํ•ฉ๋‹ˆ๋‹ค.

from soynlp.tokenizer import MaxScoreTokenizer

scores = {'ํŒŒ์Šค': 0.3, 'ํŒŒ์Šคํƒ€': 0.7, '์ข‹์•„์š”': 0.2, '์ข‹์•„':0.5}
tokenizer = MaxScoreTokenizer(scores=scores)

print(tokenizer.tokenize('๋‚œํŒŒ์Šคํƒ€๊ฐ€์ข‹์•„์š”'))
# ['๋‚œ', 'ํŒŒ์Šคํƒ€', '๊ฐ€', '์ข‹์•„', '์š”']

print(tokenizer.tokenize('๋‚œํŒŒ์Šคํƒ€๊ฐ€ ์ข‹์•„์š”'), flatten=False)
# [[('๋‚œ', 0, 1, 0.0, 1), ('ํŒŒ์Šคํƒ€', 1, 4, 0.7, 3),  ('๊ฐ€', 4, 5, 0.0, 1)],
#  [('์ข‹์•„', 0, 2, 0.5, 2), ('์š”', 2, 3, 0.0, 1)]]

MaxScoreTokenizer ์—ญ์‹œ WordExtractor ์˜ ๊ฒฐ๊ณผ๋ฅผ ์ด์šฉํ•˜์‹ค ๋•Œ์—๋Š” ์œ„์˜ ์˜ˆ์‹œ์ฒ˜๋Ÿผ ์ ์ ˆํžˆ scores ๋ฅผ ๋งŒ๋“ค์–ด ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฏธ ์•Œ๋ ค์ง„ ๋‹จ์–ด ์‚ฌ์ „์ด ์žˆ๋‹ค๋ฉด ์ด ๋‹จ์–ด๋“ค์€ ๋‹ค๋ฅธ ์–ด๋–ค ๋‹จ์–ด๋ณด๋‹ค๋„ ๋” ํฐ ์ ์ˆ˜๋ฅผ ๋ถ€์—ฌํ•˜๋ฉด ๊ทธ ๋‹จ์–ด๋Š” ํ† ํฌ๋‚˜์ด์ €๊ฐ€ ํ•˜๋‚˜์˜ ๋‹จ์–ด๋กœ ์ž˜๋ผ๋ƒ…๋‹ˆ๋‹ค.

RegexTokenizer

๊ทœ์น™ ๊ธฐ๋ฐ˜์œผ๋กœ๋„ ๋‹จ์–ด์—ด์„ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์–ธ์–ด๊ฐ€ ๋ฐ”๋€Œ๋Š” ๋ถ€๋ถ„์—์„œ ์šฐ๋ฆฌ๋Š” ๋‹จ์–ด์˜ ๊ฒฝ๊ณ„๋ฅผ ์ธ์‹ํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด "์•„์ด๊ณ ใ…‹ใ…‹ใ…œใ…œ์ง„์งœ?" ๋Š” [์•„์ด๊ณ , ใ…‹ใ…‹, ใ…œใ…œ, ์ง„์งœ, ?]๋กœ ์‰ฝ๊ฒŒ ๋‹จ์–ด์—ด์„ ๋‚˜๋ˆ•๋‹ˆ๋‹ค.

from soynlp.tokenizer import RegexTokenizer

tokenizer = RegexTokenizer()

print(tokenizer.tokenize('์ด๋ ‡๊ฒŒ์—ฐ์†๋œ๋ฌธ์žฅ์€์ž˜๋ฆฌ์ง€์•Š์Šต๋‹ˆ๋‹ค๋งŒ'))
# ['์ด๋ ‡๊ฒŒ์—ฐ์†๋œ๋ฌธ์žฅ์€์ž˜๋ฆฌ์ง€์•Š์Šต๋‹ˆ๋‹ค๋งŒ']

print(tokenizer.tokenize('์ˆซ์ž123์ด์˜์–ดabc์—์„ž์—ฌ์žˆ์œผ๋ฉดใ…‹ใ…‹์ž˜๋ฆฌ๊ฒ ์ฃ '))
# ['์ˆซ์ž', '123', '์ด์˜์–ด', 'abc', '์—์„ž์—ฌ์žˆ์œผ๋ฉด', 'ใ…‹ใ…‹', '์ž˜๋ฆฌ๊ฒ ์ฃ ']

Part of Speech Tagger

๋‹จ์–ด ์‚ฌ์ „์ด ์ž˜ ๊ตฌ์ถ•๋˜์–ด ์žˆ๋‹ค๋ฉด, ์ด๋ฅผ ์ด์šฉํ•˜์—ฌ ์‚ฌ์ „ ๊ธฐ๋ฐ˜ ํ’ˆ์‚ฌ ํŒ๋ณ„๊ธฐ๋ฅผ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋‹จ, ํ˜•ํƒœ์†Œ๋ถ„์„์„ ํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๊ธฐ ๋•Œ๋ฌธ์— 'ํ•˜๋Š”', 'ํ•˜๋‹ค', 'ํ•˜๊ณ '๋Š” ๋ชจ๋‘ ๋™์‚ฌ์— ํ•ด๋‹นํ•ฉ๋‹ˆ๋‹ค. Lemmatizer ๋Š” ํ˜„์žฌ ๊ฐœ๋ฐœ/์ •๋ฆฌ ์ค‘์ž…๋‹ˆ๋‹ค.

pos_dict = {
    'Adverb': {'๋„ˆ๋ฌด', '๋งค์šฐ'}, 
    'Noun': {'๋„ˆ๋ฌด๋„ˆ๋ฌด๋„ˆ๋ฌด', '์•„์ด์˜ค์•„์ด', '์•„์ด', '๋…ธ๋ž˜', '์˜ค', '์ด', '๊ณ ์–‘'},
    'Josa': {'๋Š”', '์˜', '์ด๋‹ค', '์ž…๋‹ˆ๋‹ค', '์ด', '์ด๋Š”', '๋ฅผ', '๋ผ', '๋ผ๋Š”'},
    'Verb': {'ํ•˜๋Š”', 'ํ•˜๋‹ค', 'ํ•˜๊ณ '},
    'Adjective': {'์˜ˆ์œ', '์˜ˆ์˜๋‹ค'},
    'Exclamation': {'์šฐ์™€'}    
}

from soynlp.postagger import Dictionary
from soynlp.postagger import LRTemplateMatcher
from soynlp.postagger import LREvaluator
from soynlp.postagger import SimpleTagger
from soynlp.postagger import UnknowLRPostprocessor

dictionary = Dictionary(pos_dict)
generator = LRTemplateMatcher(dictionary)    
evaluator = LREvaluator()
postprocessor = UnknowLRPostprocessor()
tagger = SimpleTagger(generator, evaluator, postprocessor)

sent = '๋„ˆ๋ฌด๋„ˆ๋ฌด๋„ˆ๋ฌด๋Š”์•„์ด์˜ค์•„์ด์˜๋…ธ๋ž˜์ž…๋‹ˆ๋‹ค!!'
print(tagger.tag(sent))
# [('๋„ˆ๋ฌด๋„ˆ๋ฌด๋„ˆ๋ฌด', 'Noun'),
#  ('๋Š”', 'Josa'),
#  ('์•„์ด์˜ค์•„์ด', 'Noun'),
#  ('์˜', 'Josa'),
#  ('๋…ธ๋ž˜', 'Noun'),
#  ('์ž…๋‹ˆ๋‹ค', 'Josa'),
#  ('!!', None)]

๋” ์ž์„ธํ•œ ์‚ฌ์šฉ๋ฒ•์€ ์‚ฌ์šฉ๋ฒ• ํŠœํ† ๋ฆฌ์–ผ ์— ๊ธฐ์ˆ ๋˜์–ด ์žˆ์œผ๋ฉฐ, ๊ฐœ๋ฐœ๊ณผ์ • ๋…ธํŠธ๋Š” ์—ฌ๊ธฐ์— ๊ธฐ์ˆ ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

Vectorizer

ํ† ํฌ๋‚˜์ด์ €๋ฅผ ํ•™์Šตํ•˜๊ฑฐ๋‚˜, ํ˜น์€ ํ•™์Šต๋œ ํ† ํฌ๋‚˜์ด์ €๋ฅผ ์ด์šฉํ•˜์—ฌ ๋ฌธ์„œ๋ฅผ sparse matrix ๋กœ ๋งŒ๋“ญ๋‹ˆ๋‹ค. minimum / maximum of term frequency / document frequency ๋ฅผ ์กฐ์ ˆํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Verbose mode ์—์„œ๋Š” ํ˜„์žฌ์˜ ๋ฒกํ„ฐ๋ผ์ด์ง• ์ƒํ™ฉ์„ print ํ•ฉ๋‹ˆ๋‹ค.

from soynlp.vectorizer import BaseVectorizer

vectorizer = BaseVectorizer(
    tokenizer=tokenizer,
    min_tf=0,
    max_tf=10000,
    min_df=0,
    max_df=1.0,
    stopwords=None,
    lowercase=True,
    verbose=True
)

corpus.iter_sent = False
x = vectorizer.fit_transform(corpus)

๋ฌธ์„œ์˜ ํฌ๊ธฐ๊ฐ€ ํฌ๊ฑฐ๋‚˜, ๊ณง๋ฐ”๋กœ sparse matrix ๋ฅผ ์ด์šฉํ•  ๊ฒƒ์ด ์•„๋‹ˆ๋ผ๋ฉด ์ด๋ฅผ ๋ฉ”๋ชจ๋ฆฌ์— ์˜ฌ๋ฆฌ์ง€ ์•Š๊ณ  ๊ทธ๋Œ€๋กœ ํŒŒ์ผ๋กœ ์ €์žฅํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. fit_to_file() ํ˜น์€ to_file() ํ•จ์ˆ˜๋Š” ํ•˜๋‚˜์˜ ๋ฌธ์„œ์— ๋Œ€ํ•œ term frequency vector ๋ฅผ ์–ป๋Š”๋Œ€๋กœ ํŒŒ์ผ์— ๊ธฐ๋กํ•ฉ๋‹ˆ๋‹ค. BaseVectorizer ์—์„œ ์ด์šฉํ•  ์ˆ˜ ์žˆ๋Š” parameters ๋Š” ๋™์ผํ•ฉ๋‹ˆ๋‹ค.

vectorizer = BaseVectorizer(min_tf=1, tokenizer=tokenizer)
corpus.iter_sent = False

matrix_path = 'YOURS'
vectorizer.fit_to_file(corpus, matrix_path)

ํ•˜๋‚˜์˜ ๋ฌธ์„œ๋ฅผ sparse matrix ๊ฐ€ ์•„๋‹Œ list of int ๋กœ ์ถœ๋ ฅ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค. ์ด ๋•Œ vectorizer.vocabulary_ ์— ํ•™์Šต๋˜์ง€ ์•Š์€ ๋‹จ์–ด๋Š” encoding ์ด ๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

vectorizer.encode_a_doc_to_bow('์˜ค๋Š˜ ๋‰ด์Šค๋Š” ์ด๊ฒƒ์ด ์ „๋ถ€๋‹ค')
# {3: 1, 258: 1, 428: 1, 1814: 1}

list of int ๋Š” list of str ๋กœ decoding ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

vectorizer.decode_from_bow({3: 1, 258: 1, 428: 1, 1814: 1})
# {'๋‰ด์Šค': 1, '๋Š”': 1, '์˜ค๋Š˜': 1, '์ด๊ฒƒ์ด': 1}

dict ํ˜•์‹์˜ bag of words ๋กœ๋„ encoding ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

vectorizer.encode_a_doc_to_list('์˜ค๋Š˜์˜ ๋‰ด์Šค๋Š” ๋งค์šฐ ์‹ฌ๊ฐํ•ฉ๋‹ˆ๋‹ค')
# [258, 4, 428, 3, 333]

dict ํ˜•์‹์˜ bag of words ๋Š” decoding ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

vectorizer.decode_from_list([258, 4, 428, 3, 333])
['์˜ค๋Š˜', '์˜', '๋‰ด์Šค', '๋Š”', '๋งค์šฐ']

Normalizer

๋Œ€ํ™” ๋ฐ์ดํ„ฐ, ๋Œ“๊ธ€ ๋ฐ์ดํ„ฐ์— ๋“ฑ์žฅํ•˜๋Š” ๋ฐ˜๋ณต๋˜๋Š” ์ด๋ชจํ‹ฐ์ฝ˜์˜ ์ •๋ฆฌ ๋ฐ ํ•œ๊ธ€, ํ˜น์€ ํ…์ŠคํŠธ๋งŒ ๋‚จ๊ธฐ๊ธฐ ์œ„ํ•œ ํ•จ์ˆ˜๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

from soynlp.normalizer import *

emoticon_normalize('ใ…‹ใ…‹ใ…‹ใ…‹ใ…‹ใ…‹ใ…‹ใ…‹ใ…‹ใ…‹ใ…‹ใ…‹ใ…‹์ฟ ใ…œใ…œใ…œใ…œใ…œใ…œ', num_repeats=3)
# 'ใ…‹ใ…‹ใ…‹ใ…œใ…œใ…œ'

repeat_normalize('์™€ํ•˜ํ•˜ํ•˜ํ•˜ํ•˜ํ•˜ํ•˜ํ•˜ํ•˜ํ•ซ', num_repeats=2)
# '์™€ํ•˜ํ•˜ํ•ซ'

only_hangle('๊ฐ€๋‚˜๋‹คใ…ใ…‘ใ…“ใ…‹ใ…‹์ฟ ใ…œใ…œใ…œabcd123!!์•„ํ•ซ')
# '๊ฐ€๋‚˜๋‹คใ…ใ…‘ใ…“ใ…‹ใ…‹์ฟ ใ…œใ…œใ…œ ์•„ํ•ซ'

only_hangle_number('๊ฐ€๋‚˜๋‹คใ…ใ…‘ใ…“ใ…‹ใ…‹์ฟ ใ…œใ…œใ…œabcd123!!์•„ํ•ซ')
# '๊ฐ€๋‚˜๋‹คใ…ใ…‘ใ…“ใ…‹ใ…‹์ฟ ใ…œใ…œใ…œ 123 ์•„ํ•ซ'

only_text('๊ฐ€๋‚˜๋‹คใ…ใ…‘ใ…“ใ…‹ใ…‹์ฟ ใ…œใ…œใ…œabcd123!!์•„ํ•ซ')
# '๊ฐ€๋‚˜๋‹คใ…ใ…‘ใ…“ใ…‹ใ…‹์ฟ ใ…œใ…œใ…œabcd123!!์•„ํ•ซ'

๋” ์ž์„ธํ•œ ์„ค๋ช…์€ ํŠœํ† ๋ฆฌ์–ผ์— ์žˆ์Šต๋‹ˆ๋‹ค.

Point-wise Mutual Information (PMI)

์—ฐ๊ด€์–ด ๋ถ„์„์„ ์œ„ํ•œ co-occurrence matrix ๊ณ„์‚ฐ๊ณผ ์ด๋ฅผ ์ด์šฉํ•œ Point-wise Mutual Information (PMI) ๊ณ„์‚ฐ์„ ์œ„ํ•œ ํ•จ์ˆ˜๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

์•„๋ž˜ sent_to_word_contexts_matrix ํ•จ์ˆ˜๋ฅผ ์ด์šฉํ•˜์—ฌ (word, context words) matrix ๋ฅผ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. x ๋Š” scipy.sparse.csr_matrix ์ด๋ฉฐ, (n_vocabs, n_vocabs) ํฌ๊ธฐ์ž…๋‹ˆ๋‹ค. idx2vocab ์€ x ์˜ ๊ฐ row, column ์— ํ•ด๋‹นํ•˜๋Š” ๋‹จ์–ด๊ฐ€ ํฌํ•จ๋œ list of str ์ž…๋‹ˆ๋‹ค. ๋ฌธ์žฅ์˜ ์•ž/๋’ค windows ๋‹จ์–ด๋ฅผ context ๋กœ ์ธ์‹ํ•˜๋ฉฐ, min_tf ์ด์ƒ์˜ ๋นˆ๋„์ˆ˜๋กœ ๋“ฑ์žฅํ•œ ๋‹จ์–ด์— ๋Œ€ํ•ด์„œ๋งŒ ๊ณ„์‚ฐ์„ ํ•ฉ๋‹ˆ๋‹ค. dynamic_weight ๋Š” context ๊ธธ์ด์— ๋ฐ˜๋น„๋ก€ํ•˜์—ฌ weighting ์„ ํ•ฉ๋‹ˆ๋‹ค. windows ๊ฐ€ 3 ์ผ ๊ฒฝ์šฐ, 1, 2, 3 ์นธ ๋–จ์–ด์ง„ ๋‹จ์–ด์˜ co-occurrence ๋Š” 1, 2/3, 1/3 ์œผ๋กœ ๊ณ„์‚ฐ๋ฉ๋‹ˆ๋‹ค.

from soynlp.vectorizer import sent_to_word_contexts_matrix

x, idx2vocab = sent_to_word_contexts_matrix(
    corpus,
    windows=3,
    min_tf=10,
    tokenizer=tokenizer, # (default) lambda x:x.split(),
    dynamic_weight=False,
    verbose=True
)

Co-occurrence matrix ์ธ x ๋ฅผ pmi ์— ์ž…๋ ฅํ•˜๋ฉด row ์™€ column ์„ ๊ฐ ์ถ•์œผ๋กœ PMI ๊ฐ€ ๊ณ„์‚ฐ๋ฉ๋‹ˆ๋‹ค. pmi_dok ์€ scipy.sparse.dok_matrix ํ˜•์‹์ž…๋‹ˆ๋‹ค. min_pmi ์ด์ƒ์˜ ๊ฐ’๋งŒ ์ €์žฅ๋˜๋ฉฐ, default ๋Š” min_pmi = 0 ์ด๊ธฐ ๋•Œ๋ฌธ์— Positive PMI (PPMI) ์ž…๋‹ˆ๋‹ค. alpha ๋Š” PMI(x,y) = p(x,y) / ( p(x) * ( p(y) + alpha ) ) ์— ์ž…๋ ฅ๋˜๋Š” smoothing parameter ์ž…๋‹ˆ๋‹ค. ๊ณ„์‚ฐ ๊ณผ์ •์ด ์˜ค๋ž˜ ๊ฑธ๋ฆฌ๊ธฐ ๋•Œ๋ฌธ์— verbose = True ๋กœ ์„ค์ •ํ•˜๋ฉด ํ˜„์žฌ์˜ ์ง„ํ–‰ ์ƒํ™ฉ์„ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค.

from soynlp.word import pmi

pmi_dok = pmi(
    x,
    min_pmi=0,
    alpha=0.0001,
    verbose=True
)
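The PPMI computation can be sketched densely with NumPy, taking PMI as the log of the ratio in the formula above (a toy illustration; the library operates on sparse matrices):

```python
import numpy as np

def ppmi(counts, alpha=0.0001, min_pmi=0.0):
    """Positive PMI over a dense co-occurrence matrix (toy sketch)."""
    total = counts.sum()
    px = counts.sum(axis=1, keepdims=True) / total   # row marginals p(x)
    py = counts.sum(axis=0, keepdims=True) / total   # column marginals p(y)
    pxy = counts / total
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log(pxy / (px * (py + alpha)))
    pmi[~np.isfinite(pmi)] = min_pmi                 # zero co-occurrences
    return np.maximum(pmi, min_pmi)

counts = np.array([[2.0, 0.0], [0.0, 2.0]])
print(ppmi(counts))
```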

๋” ์ž์„ธํ•œ ์„ค๋ช…์€ ํŠœํ† ๋ฆฌ์–ผ์— ์žˆ์Šต๋‹ˆ๋‹ค.

Notes

Slides

  • The slide files describe the principles of the algorithms. They were presented at 데이터야놀자.
  • A textmining tutorial is being built: slides explaining the algorithms being implemented in the soynlp project and the machine learning methods used for text mining.

Blogs

  • The github io blog carries text write-ups of the material in the slides. Read it when you want more detail on the slides' content.

ํ•จ๊ป˜ ์ด์šฉํ•˜๋ฉด ์ข‹์€ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋“ค

์„ธ์ข… ๋ง๋ญ‰์น˜ ์ •์ œ๋ฅผ ์œ„ํ•œ utils

์ž์—ฐ์–ด์ฒ˜๋ฆฌ ๋ชจ๋ธ ํ•™์Šต์„ ์œ„ํ•˜์—ฌ ์„ธ์ข… ๋ง๋ญ‰์น˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ •์ œํ•˜๊ธฐ ์œ„ํ•œ ํ•จ์ˆ˜๋“ค์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ํ˜•ํƒœ์†Œ/ํ’ˆ์‚ฌ ํ˜•ํƒœ๋กœ ์ •์ œ๋œ ํ•™์Šต์šฉ ๋ฐ์ดํ„ฐ๋ฅผ ๋งŒ๋“œ๋Š” ํ•จ์ˆ˜, ์šฉ์–ธ์˜ ํ™œ์šฉ ํ˜•ํƒœ๋ฅผ ์ •๋ฆฌํ•˜์—ฌ ํ…Œ์ด๋ธ”๋กœ ๋งŒ๋“œ๋Š” ํ•จ์ˆ˜, ์„ธ์ข… ๋ง๋ญ‰์น˜์˜ ํ’ˆ์‚ฌ ์ฒด๊ณ„๋ฅผ ๋‹จ์ˆœํ™” ์‹œํ‚ค๋Š” ํ•จ์ˆ˜๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

soyspacing

๋„์–ด์“ฐ๊ธฐ ์˜ค๋ฅ˜๊ฐ€ ์žˆ์„ ๊ฒฝ์šฐ ์ด๋ฅผ ์ œ๊ฑฐํ•˜๋ฉด ํ…์ŠคํŠธ ๋ถ„์„์ด ์‰ฌ์›Œ์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ถ„์„ํ•˜๋ ค๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋„์–ด์“ฐ๊ธฐ ์—”์ง„์„ ํ•™์Šตํ•˜๊ณ , ์ด๋ฅผ ์ด์šฉํ•˜์—ฌ ๋„์–ด์“ฐ๊ธฐ ์˜ค๋ฅ˜๋ฅผ ๊ต์ •ํ•ฉ๋‹ˆ๋‹ค.

KR-WordRank

ํ† ํฌ๋‚˜์ด์ €๋‚˜ ๋‹จ์–ด ์ถ”์ถœ๊ธฐ๋ฅผ ํ•™์Šตํ•  ํ•„์š”์—†์ด, HITS algorithm ์„ ์ด์šฉํ•˜์—ฌ substring graph ์—์„œ ํ‚ค์›Œ๋“œ๋ฅผ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค.

soykeyword

ํ‚ค์›Œ๋“œ ์ถ”์ถœ๊ธฐ์ž…๋‹ˆ๋‹ค. Logistic Regression ์„ ์ด์šฉํ•˜๋Š” ๋ชจ๋ธ๊ณผ ํ†ต๊ณ„ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ, ๋‘ ์ข…๋ฅ˜์˜ ํ‚ค์›Œ๋“œ ์ถ”์ถœ๊ธฐ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. scipy.sparse ์˜ sparse matrix ํ˜•์‹๊ณผ ํ…์ŠคํŠธ ํŒŒ์ผ ํ˜•์‹์„ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค.

Keywords

korean-nlp
