FastTextRank

Extract abstracts and keywords from Chinese text

1.4

PyPI

Maintainers: 1

FastTextRank extracts abstracts and keywords from Chinese text. It uses an optimized iterative algorithm to improve running speed, and can optionally use word vectors to improve accuracy.

PageRank

PageRank is Google's web page ranking algorithm.
It was originally used to calculate the importance of web pages. The entire web can be seen as a directed graph in which each node is a web page.
The algorithm calculates the importance of every node from its connections to other nodes.
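
As a rough illustration of the idea, here is a minimal sketch of plain PageRank by repeated iteration. It is not this package's optimized iteration; the function name `pagerank`, the damping factor of 0.85, and the toy graph are assumptions made for the example.

```python
def pagerank(links, damping=0.85, max_iter=100, tol=1e-6):
    """Rank the nodes of a directed graph given as {node: [outgoing targets]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing links is shared by all nodes.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        # Every node passes its rank on in equal parts to the pages it links to.
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:  # stop once the scores have settled
            break
    return rank


# Hypothetical four-page web: three pages link to "c".
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(web)
most_important = max(scores, key=scores.get)
```

Here `max_iter` and `tol` play the same role as the parameters of the same name in the API sections: the loop stops after a fixed number of rounds or when the scores change by less than the tolerance.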

My algorithm changes the iterative algorithm to make it much faster: it takes about 10 ms per article, while TextRank4ZH takes about 80 ms on my data.
My algorithm can also use word2vec to make the abstract more accurate, at the cost of a longer running time: with word2vec it takes about 40 ms per article on the same data.

W2VTextRank4Sentence

Introduction

Split the article into sentences
Calculate the similarity between sentences, either:
- Using the cosine similarity of word vectors
- Using the words that two sentences have in common
Build a graph from the sentence similarities
Calculate the importance of each sentence with the improved iterative algorithm
Get the abstract
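
The steps above can be sketched as follows, using the common-words option. This is an illustration under stated assumptions, not the package's code: the similarity formula is the overlap measure from the original TextRank paper, the iteration is the standard (unoptimized) one, sentences are assumed to be already segmented into words, and all function names are hypothetical.

```python
import math
import re


def split_sentences(text):
    """Split text on common Chinese and Western sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"[。！？!?.]", text) if s.strip()]


def common_word_similarity(words_a, words_b):
    """Shared-word count, normalized by sentence length (assumed formula)."""
    if not words_a or not words_b:
        return 0.0
    denominator = math.log(len(words_a)) + math.log(len(words_b))
    if denominator == 0:
        return 0.0
    return len(set(words_a) & set(words_b)) / denominator


def rank(weights, damping=0.85, max_iter=100, tol=1e-4):
    """Standard TextRank iteration over a symmetric weight matrix."""
    n = len(weights)
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            incoming = sum(
                weights[j][i] * scores[j] / out_sum[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append(1.0 - damping + damping * incoming)
        diff = max(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if diff < tol:
            break
    return scores


def summarize(tokenized_sentences, top_k=1):
    """Return the indices of the top_k sentences, in document order."""
    n = len(tokenized_sentences)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w = common_word_similarity(tokenized_sentences[i], tokenized_sentences[j])
            weights[i][j] = weights[j][i] = w
    scores = rank(weights)
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k]
    return sorted(best)


# Toy input, already segmented into words.
sentences = [
    ["猫", "喜欢", "吃", "鱼"],
    ["狗", "喜欢", "吃", "骨头"],
    ["猫", "和", "狗", "都", "喜欢", "吃", "肉"],
    ["今天", "天气", "很", "好"],
]
top_sentence = summarize(sentences, top_k=1)
```

The third sentence shares the most words with the others, so it ranks highest; the last sentence shares none and keeps only the base score.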

API

use_stopword: boolean, default True
stop_words_file: str, default None. The stop words file you want to use. If it is None, the package's own stop words are used.
use_w2v: boolean, default False. If it is True, you must also pass the dict_path parameter.
dict_path: str, default None. Required when use_w2v is True.
max_iter: maximum number of iterations
tol: maximum tolerated error
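
To make the use_w2v option more concrete, here is a minimal sketch of sentence similarity based on the cosine similarity of word vectors. How the package builds a sentence vector is not stated here, so averaging the word vectors is an assumption, as are the function names and the tiny in-memory `toy_vectors` table (a real run would load vectors from the file given by dict_path).

```python
import math


def sentence_vector(words, vectors):
    """Average the vectors of the words that have one (assumed strategy)."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two vectors; 0.0 if either is missing."""
    if vec_a is None or vec_b is None:
        return 0.0
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm = math.sqrt(sum(x * x for x in vec_a)) * math.sqrt(sum(y * y for y in vec_b))
    return dot / norm if norm else 0.0


# Hypothetical two-dimensional word vectors, for illustration only.
toy_vectors = {"猫": [1.0, 0.0], "狗": [0.9, 0.1], "天气": [0.0, 1.0]}

cat = sentence_vector(["猫"], toy_vectors)
dog = sentence_vector(["狗"], toy_vectors)
weather = sentence_vector(["天气"], toy_vectors)
similar = cosine_similarity(cat, dog)
dissimilar = cosine_similarity(cat, weather)
```

Unlike counting common words, this can give a high similarity to two sentences that share no words at all, which is one way word vectors can make the abstract more accurate.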

W2VTextRank4Word

Introduction

Split the article into words
Calculate the similarity between words: if two words fall within the window distance of each other, the weight of the graph edge between them is increased by 1.0. The window is set by the user.
Build a graph from the word similarities
Calculate the importance of each word with the improved iterative algorithm
Get the keywords
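
The steps above can be sketched as follows. This is an illustration, not the package's code: it assumes two words are within the window when they are fewer than `window` positions apart (so window=2 links adjacent words), it uses the standard TextRank iteration rather than the improved one, and the function names are hypothetical.

```python
from collections import defaultdict


def build_word_graph(words, window=2):
    """Add 1.0 to the edge between two words each time they co-occur in a window."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            other = words[j]
            if other == word:
                continue  # no self-loops
            graph[word][other] += 1.0
            graph[other][word] += 1.0
    return graph


def rank_words(graph, damping=0.85, max_iter=100, tol=1e-4):
    """Standard TextRank iteration over the weighted, undirected word graph."""
    scores = {w: 1.0 for w in graph}
    strength = {w: sum(graph[w].values()) for w in graph}
    for _ in range(max_iter):
        new_scores = {}
        for w in graph:
            incoming = sum(
                weight * scores[neighbor] / strength[neighbor]
                for neighbor, weight in graph[w].items()
            )
            new_scores[w] = 1.0 - damping + damping * incoming
        diff = max(abs(new_scores[w] - scores[w]) for w in graph)
        scores = new_scores
        if diff < tol:
            break
    return scores


def keywords(words, top_k=3, window=2):
    """Return the top_k highest-scoring words."""
    scores = rank_words(build_word_graph(words, window))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy input, already segmented into words and with stop words removed.
words = ["算法", "提取", "关键词", "算法", "计算", "重要性", "算法", "排序"]
top_word = keywords(words, top_k=1)
```

A larger window links words that are further apart, which makes the graph denser and tends to favor words that appear in many different contexts.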

API

use_stopword: boolean, default True
stop_words_file: str, default None. The stop words file you want to use. If it is None, the package's own stop words are used.
max_iter: maximum number of iterations
tol: maximum tolerated error
window: int, default 2. The window used to determine whether two words are related.
