This source code is an implementation of textrank algorithm, under MIT licence.
The minimum requred Go version is 1.8.
MOTIVATION
If there was a program what could rank book size text's words, phrases and sentences continuously on multiple threads and it would be opened to modifing by objects, written in a simple, secure, static language and if it would be very well documented... Now, here it is.
DEMO
The following link Recona is a simple, pre-programmed a.i. what uses this library to ranking raw texts. It visualizes how ranking works and it represents how it could be used for different purposes: Recona.app
FEATURES
- Find the most important phrases.
- Find the most important words.
- Find the most important N sentences.
- Importance by phrase weights.
- Importance by word occurrence.
- Find the first N sentences, start from Xth sentence.
- Find sentences by phrase chains ordered by position in text.
- Access to the whole ranked data.
- Support more languages.
- Algorithm for weighting can be modified by interface implementation.
- Parser can be modified by interface implementation.
- Multi thread support.
INSTALL
You can install TextRank by Go's get:
go get github.com/DavidBelicza/TextRank
TextRank uses the DEP as vendoring tool, so the required dependencies are versioned under the vendor folder. The exact version number defined in the Gopkg.toml. If you want to reinstall the dependencies, use the DEP functions: flush the vendor folder and run:
dep ensure
DOCKER
Using Docker to TextRank isn't necessary, it's just an option.
Build image from the repository's root directory:
docker build -t go_text_rank_image .
Create container from the image:
docker run -dit --name textrank go_text_rank_image:latest
Run the go test -v . code inside the container:
docker exec -i -t textrank go test -v .
Stop, start or remove the container:
docker stop textrank
docker start textrank
docker rm textrank
HOW DOES IT WORK
Too see how does it work, the easiest way is to use the sample text. Sample text can be found in the textrank_test.go file at this line. It's a short size text about Gnome Shell.
- TextRank reads the text,
- parse it,
- remove the unnecessary stop words,
- tokenize it
- and counting the occurrence of the words and phrases
- and then it starts weighting
- by the occurrence of words and phrases and their relations.
- After weights are done, TextRank normalize weights to between 1 and 0.
- Then the different finder methods capable to find the most important words, phrases or sentences.
The most important phrases from the sample text are:
Phrase | Occurrence | Weight |
---|
gnome - shell | 5 | 1 |
extension - gnome | 3 | 0.50859946 |
icons - tray | 3 | 0.49631447 |
gnome - caffeine | 2 | 0.27027023 |
The gnome is the most often used word in this text and shell is also used multiple times. Two of them are used together as a phrase 5 times. This is the highest occurrence in this text, so this is the most important phrase.
The following two important phrases have same occurrence 3, however they are not equal. This is because the extension gnome phrase contains the word gnome, the most popular word in the text, and it increases the phrase's weight. It increases the weight of any word what is related to it, but not too much to overcome other important phrases what don't contain the gnome word.
The exact algorithm can be found in the algorithm.go file at this line.
Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document. Technologies that can make a coherent summary take into account variables such as length, writing style and syntax. Automatic data summarization is part of machine learning and data mining. The main idea of summarization is to find a representative subset of the data, which contains the information of the entire set. Summarization technologies are used in a large number of sectors in industry today. - Wikipedia
EXAMPLES
Find the most important phrases
This is the most basic and simplest usage of textrank.
package main
import (
"fmt"
"github.com/DavidBelicza/TextRank"
)
func main() {
rawText := "Your long raw text, it could be a book. Lorem ipsum..."
tr := textrank.NewTextRank()
rule := textrank.NewDefaultRule()
language := textrank.NewDefaultLanguage()
algorithmDef := textrank.NewDefaultAlgorithm()
tr.Populate(rawText, language, rule)
tr.Ranking(algorithmDef)
rankedPhrases := textrank.FindPhrases(tr)
fmt.Println(rankedPhrases[0])
fmt.Println(rankedPhrases[1])
}
All possible pre-defined finder queries
After ranking, the graph contains a lot of valuable data. There are functions in textrank package what contains logic to retrieve those data from the graph.
package main
import (
"fmt"
"github.com/DavidBelicza/TextRank"
)
func main() {
rawText := "Your long raw text, it could be a book. Lorem ipsum..."
tr := textrank.NewTextRank()
rule := textrank.NewDefaultRule()
language := textrank.NewDefaultLanguage()
algorithmDef := textrank.NewDefaultAlgorithm()
tr.Populate(rawText, language, rule)
tr.Ranking(algorithmDef)
rankedPhrases := textrank.FindPhrases(tr)
fmt.Println(rankedPhrases[0])
words := textrank.FindSingleWords(tr)
fmt.Println(words[0])
sentences := textrank.FindSentencesByRelationWeight(tr, 10)
fmt.Println(sentences)
sentences = textrank.FindSentencesByWordQtyWeight(tr, 10)
fmt.Println(sentences)
sentences = textrank.FindSentencesFrom(tr, 5, 10)
fmt.Println(sentences)
sentencesPh := textrank.FindSentencesByPhraseChain(tr, []string{"gnome", "shell", "extension"})
fmt.Println(sentencesPh[0])
}
Access to everything
After ranking, the graph contains a lot of valuable data. The GetRank function allows access to the graph and every data can be retrieved from this structure.
package main
import (
"fmt"
"github.com/DavidBelicza/TextRank"
)
func main() {
rawText := "Your long raw text, it could be a book. Lorem ipsum..."
tr := textrank.NewTextRank()
rule := textrank.NewDefaultRule()
language := textrank.NewDefaultLanguage()
algorithmDef := textrank.NewDefaultAlgorithm()
tr.Populate(rawText, language, rule)
tr.Ranking(algorithmDef)
rankData := tr.GetRankData()
wordId := rankData.WordValID["gnome"]
fmt.Println(rankData.Words[wordId].Weight)
fmt.Println(rankData.Words[wordId].Qty)
fmt.Println(rankData.Words[wordId].SentenceIDs)
fmt.Println(rankData.Words[wordId].ConnectionLeft)
fmt.Println(rankData.Words[wordId].ConnectionRight)
fmt.Println(rankData.Relation.Node[wordId])
}
Adding text continuously
It is possibe to add more text after another texts already have been added. The Ranking function can merge these multiple texts and it can recalculate the weights and all related data.
package main
import (
"fmt"
"github.com/DavidBelicza/TextRank"
)
func main() {
rawText := "Your long raw text, it could be a book. Lorem ipsum..."
tr := textrank.NewTextRank()
rule := textrank.NewDefaultRule()
language := textrank.NewDefaultLanguage()
algorithmDef := textrank.NewDefaultAlgorithm()
tr.Populate(rawText, language, rule)
tr.Ranking(algorithmDef)
rawText2 := "Another book or article..."
rawText3 := "Third another book or article..."
tr.Populate(rawText2, language, rule)
tr.Populate(rawText3, language, rule)
tr.Ranking(algorithmDef)
rankedPhrases := textrank.FindPhrases(tr)
fmt.Println(rankedPhrases[0])
fmt.Println(rankedPhrases[1])
}
Using different algorithm to ranking text
There are two algorithm has implemented, it is possible to write custom algorithm by Algorithm interface and use it instead of defaults.
package main
import (
"fmt"
"github.com/DavidBelicza/TextRank"
)
func main() {
rawText := "Your long raw text, it could be a book. Lorem ipsum..."
tr := textrank.NewTextRank()
rule := textrank.NewDefaultRule()
language := textrank.NewDefaultLanguage()
algorithmChain := textrank.NewChainAlgorithm()
tr.Populate(rawText, language, rule)
tr.Ranking(algorithmChain)
rankedPhrases := textrank.FindPhrases(tr)
fmt.Println(rankedPhrases[0])
fmt.Println(rankedPhrases[1])
}
Using multiple graphs
Graph ID exists because it is possible run multiple independent text ranking processes.
package main
import (
"fmt"
"github.com/DavidBelicza/TextRank"
)
func main() {
rawText := "Your long raw text, it could be a book. Lorem ipsum..."
tr1 := textrank.NewTextRank()
rule := textrank.NewDefaultRule()
language := textrank.NewDefaultLanguage()
algorithmDef := textrank.NewDefaultAlgorithm()
tr1.Populate(rawText, language, rule)
tr1.Ranking(algorithmDef)
tr2 := textrank.NewTextRank()
algorithmChain := textrank.NewChainAlgorithm()
tr2.Populate(rawText, language, rule)
tr2.Ranking(algorithmChain)
rankedPhrases := textrank.FindPhrases(tr1)
fmt.Println(rankedPhrases[0])
fmt.Println(rankedPhrases[1])
rankedPhrases2 := textrank.FindPhrases(tr2)
fmt.Println(rankedPhrases2[0])
fmt.Println(rankedPhrases2[1])
}
Using different non-English languages
Engish is used by default but it is possible to add any language. To use other languages a stop word list is required what you can find here: https://github.com/stopwords-iso
package main
import (
"fmt"
"github.com/DavidBelicza/TextRank"
)
func main() {
rawText := "Your long raw text, it could be a book. Lorem ipsum..."
tr := textrank.NewTextRank()
rule := textrank.NewDefaultRule()
language := textrank.NewDefaultLanguage()
language.SetWords("es", []string{"uno", "dos", "tres", "yo", "es", "eres"})
language.SetActiveLanguage("es")
algorithmDef := textrank.NewDefaultAlgorithm()
tr.Populate(rawText, language, rule)
tr.Ranking(algorithmDef)
rankedPhrases := textrank.FindPhrases(tr)
fmt.Println(rankedPhrases[0])
fmt.Println(rankedPhrases[1])
}
Asynchronous usage by goroutines
It is thread safe. Independent graphs can receive texts in the same time and can be extended by more text also in the same time.
package main
import (
"fmt"
"time"
"github.com/DavidBelicza/TextRank"
)
func main() {
stopProgram := false
stream := make(chan string)
tr := textrank.NewTextRank()
go func(tr *textrank.TextRank) {
rawTexts := []string{
"Very long text...",
"Another very long text...",
"Second another very long text...",
}
for _, rawText := range rawTexts {
stream <- rawText
}
}(tr)
go func() {
i := 1
for {
rawText := <-stream
rule := textrank.NewDefaultRule()
language := textrank.NewDefaultLanguage()
algorithm := textrank.NewDefaultAlgorithm()
tr.Populate(rawText, language, rule)
tr.Ranking(algorithm)
if i == 3 {
stopProgram = true
}
i++
}
}()
for !stopProgram {
time.Sleep(time.Second * 1)
}
phrases := textrank.FindPhrases(tr)
fmt.Println(phrases[0])
}
A SIMPLE VISUAL REPRESENTATION
The below image is a representation how works the simplest text ranking algorithm. This algorithm can be replaced by an another one by inject different Algorithm interface implementation.