persian-preprocess
Persian (Farsi) text preprocessing: normalization, number locale, punctuation, whitespace, stop words and more.
npm install --save persian-preprocess
const persianPreProcess = require('persian-preprocess');
Parameter | Type | Required | Description |
---|---|---|---|
text | String | Yes | Text to process |
debug | Boolean | Yes | Debug system status |
const text = 'text to process';
const debug = true;
const processedText = persianPreProcess(text, debug)
    .normalize()
    .number()
    .lowercase()
    .punctuation()
    .remove()
    .stopword()
    .emoticon()
    .whitespace();
- The code above only lists the available preprocessing methods in call order. Some of these methods require parameters, so it will not run correctly as written. For a complete working example, see Full Sample.
Normalization process
Parameter | Type | Required | Description |
---|---|---|---|
config | Object | No | Normalization config (table below) |
Configuration | Type | Description | Sample Characters |
---|---|---|---|
persian | boolean | Normalize persian characters | ﭐ ݓ ك ﻱ |
english | boolean | Normalize english characters | ᗩ ℳ Ѡ ⓡ ⒵ |
arabic | boolean | Normalize arabic characters | ﷲ ﷺ |
number | boolean | Normalize number characters | ٥ ⑩ |
math | boolean | Normalize math characters | ¼ ⅞ |
html | boolean | Normalize html characters | < |
punctuation | boolean | Normalize punctuation characters | ʕ ʔ ℅ ٪ |
special | boolean | Normalize special characters | ᴁ lj st |
- Every configuration set defaults to true, so normalization is applied to all of them unless configured otherwise
- Setting a configuration to false skips normalization for that set
// Default: every set is normalized
const processedText = persianPreProcess(text, debug).normalize();

// Same as the default, with every set listed explicitly
const processedText = persianPreProcess(text, debug).normalize({
    persian: true,
    english: true,
    arabic: true,
    number: true,
    math: true,
    html: true,
    punctuation: true,
    special: true
});

// Skip HTML normalization and keep the other sets enabled
const processedText = persianPreProcess(text, debug).normalize({
    html: false
});
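To make the per-set switch concrete, here is a minimal sketch of how one set could be normalized. It is not the library's implementation: `PERSIAN_MAP` and `normalizeSketch` are hypothetical names, and the two-entry map is an assumption that stands in for the much larger tables a real normalizer would need.

```javascript
// Hypothetical two-entry table for illustration only.
const PERSIAN_MAP = new Map([
    ['\u0643', '\u06A9'], // Arabic kaf (ك) -> Persian kaf (ک)
    ['\u064A', '\u06CC'], // Arabic yeh (ي) -> Persian yeh (ی)
]);

// Replace mapped characters unless the `persian` set is switched off.
function normalizeSketch(text, config = {}) {
    const { persian = true } = config; // sets default to true, as in the notes above
    if (persian === false) return text;
    return Array.from(text, (ch) => (PERSIAN_MAP.has(ch) ? PERSIAN_MAP.get(ch) : ch)).join('');
}

console.log(normalizeSketch('\u0643\u062A\u0627\u0628'));                     // kaf is replaced
console.log(normalizeSketch('\u0643\u062A\u0627\u0628', { persian: false })); // text is unchanged
```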
Change numbers locale
Parameter | Type | Required | Description |
---|---|---|---|
language | Enum: 'persian', 'english' | Yes | Numeric characters locale |
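// Convert numeric characters to the English locale
const processedText = persianPreProcess(text, debug).number('english');

// Convert numeric characters to the Persian locale
const processedText = persianPreProcess(text, debug).number('persian');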
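As a rough illustration of what a locale switch for digits involves, the sketch below maps between ASCII digits and Persian digits (U+06F0 to U+06F9). `convertDigitsSketch` is a hypothetical helper, not part of the package, and it is an assumption that only these two digit ranges are involved. Arabic-Indic digits such as ٦ are left alone here, since the sample output suggests they are handled by normalize().

```javascript
// Build the Persian digits from their code points to avoid any ordering ambiguity.
const PERSIAN_DIGITS = Array.from({ length: 10 }, (_, i) => String.fromCharCode(0x06f0 + i)).join('');
const ENGLISH_DIGITS = '0123456789';

// Hypothetical helper: convert digits to the requested locale.
function convertDigitsSketch(text, language) {
    if (language !== 'persian' && language !== 'english') {
        throw new TypeError("language must be 'persian' or 'english'");
    }
    const [from, to] = language === 'persian'
        ? [ENGLISH_DIGITS, PERSIAN_DIGITS]
        : [PERSIAN_DIGITS, ENGLISH_DIGITS];
    return text.replace(/[0-9\u06F0-\u06F9]/g, (digit) => {
        const index = from.indexOf(digit);
        return index === -1 ? digit : to[index]; // digits already in the target locale stay as they are
    });
}

console.log(convertDigitsSketch('a1b2', 'persian'));
console.log(convertDigitsSketch('\u06F1\u06F2\u06F3', 'english'));
```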
Lowercase all characters
const processedText = persianPreProcess(text, debug).lowercase();
Remove punctuation
Parameter | Type | Required | Description |
---|---|---|---|
config | Object | No | Punctuation removal config (table below) |
Configuration | Type | Description | Sample Characters |
---|---|---|---|
basic | boolean or null | Basic punctuations | ' " \ / , ( \| ) |
mark | boolean or null | Special punctuations | \r \n \t \0 |
diacritic | boolean or null | Arabic diacritics | ٌ ٍ ً ّ |
unicode | boolean or null | Unicode punctuations | ZERO WIDTH NON-JOINER |
- Every configuration set defaults to true, in which case its punctuation marks are replaced with a space character
- Setting a value to null removes the punctuation marks without replacing them with any character
- Setting a value to false skips punctuation removal for that set
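// Default: every set is replaced with a space character
const processedText = persianPreProcess(text, debug).punctuation();

// Same as the default, with every set listed explicitly
const processedText = persianPreProcess(text, debug).punctuation({
    basic: true,
    mark: true,
    diacritic: true,
    unicode: true
});

// Leave Unicode punctuation untouched
const processedText = persianPreProcess(text, debug).punctuation({
    unicode: false
});

// Remove basic punctuation without inserting a replacement character
const processedText = persianPreProcess(text, debug).punctuation({
    basic: null
});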
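The three-state setting (true, null, false) described in the notes above can be sketched as a single helper. This is only an illustration of the semantics: `applySetSketch` and the `BASIC` pattern are hypothetical, and the pattern covers just the sample characters listed for the basic set.

```javascript
// Subset of the "basic" sample characters; the real set is assumed to be larger.
const BASIC = /['"\\\/,()|]/g;

// Hypothetical helper applying one character set according to its setting:
//   true  -> replace each match with a space
//   null  -> delete each match
//   false -> leave the text untouched
function applySetSketch(text, pattern, setting = true) {
    if (setting === false) return text;
    return text.replace(pattern, setting === null ? '' : ' ');
}

console.log(applySetSketch('a,b(c)', BASIC, true));  // matches become spaces
console.log(applySetSketch('a,b(c)', BASIC, null));  // matches are dropped
console.log(applySetSketch('a,b(c)', BASIC, false)); // nothing changes
```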
Remove selected characters
Parameter | Type | Required | Description |
---|---|---|---|
config | Object | No | Character removal config (table below) |
Configuration | Type | Description | Sample Characters |
---|---|---|---|
number | boolean or null | Numeric characters | 0 9 ۰ ۹ |
persian | boolean or null | Persian characters | آ ا ی |
english | boolean or null | English characters | A Z a z |
length | number | Words whose length is equal to or less than this value | |
- For the number, persian and english configurations:
  - The default value is false, so no characters are removed by default
  - Setting the value to true removes all characters in the set and replaces them with a space character
  - Setting the value to null removes the characters without replacing them with any character
- Setting the length configuration removes all words whose length is equal to or less than the given value
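// Default: all sets are false, so nothing is removed
const processedText = persianPreProcess(text, debug).remove();

// Replace numeric, Persian and English characters with a space character
const processedText = persianPreProcess(text, debug).remove({
    number: true,
    persian: true,
    english: true
});

// Remove English characters without inserting a replacement character
const processedText = persianPreProcess(text, debug).remove({
    english: null
});

// Remove words of two characters or fewer
const processedText = persianPreProcess(text, debug).remove({
    length: 2
});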
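The length rule can be pictured with a short sketch. `removeShortWordsSketch` is a hypothetical helper, and two details are assumptions rather than documented behavior: words are split on whitespace, and length is counted in code points.

```javascript
// Hypothetical helper: drop every word whose length is equal to or less than `length`.
// Note that joining with a single space also collapses runs of whitespace.
function removeShortWordsSketch(text, length) {
    return text
        .split(/\s+/)
        .filter((word) => Array.from(word).length > length) // empty strings are dropped too
        .join(' ');
}

console.log(removeShortWordsSketch('a bb ccc dddd', 2)); // keeps only "ccc" and "dddd"
```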
Remove stop words
Parameter | Type | Required | Description |
---|---|---|---|
config | Object | No | Stopword removal config (table below) |
Configuration | Type | Description | Sample Words |
---|---|---|---|
persian | boolean | Persian stopwords | در با به |
english | boolean | English stopwords | in at on |
custom | string[] | List of custom stop words | |
// Default configuration
const processedText = persianPreProcess(text, debug).stopword();

// Persian and English stop word lists set explicitly
const processedText = persianPreProcess(text, debug).stopword({
    persian: true,
    english: true
});

// Also remove a custom list of words
const processedText = persianPreProcess(text, debug).stopword({
    custom: ['this', 'a']
});
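For readers unfamiliar with stop word filtering, the custom list can be thought of as a whole-word filter. The sketch below is an assumption about the general technique, not the package's code; `removeStopwordsSketch` is a hypothetical name, and splitting on whitespace is a simplification.

```javascript
// Hypothetical helper: remove whole words that appear in the custom list.
function removeStopwordsSketch(text, custom = []) {
    const stopwords = new Set(custom); // Set gives constant-time lookups
    return text
        .split(/\s+/)
        .filter((word) => word !== '' && !stopwords.has(word))
        .join(' ');
}

console.log(removeStopwordsSketch('this is a test', ['this', 'a'])); // "is" and "test" remain
```

Matching whole words rather than substrings keeps a stop word such as "a" from being stripped out of the middle of longer words.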
Remove emoticons
Parameter | Required | Description |
---|---|---|
replace | No | The only accepted value is null |
- By default (calling the method with no parameter), every emoticon is replaced with a space character
- Setting replace to null removes the emoticons without replacing them with any character
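// Replace each emoticon with a space character (default)
const processedText = persianPreProcess(text, debug).emoticon();

// Remove emoticons without inserting a replacement character
const processedText = persianPreProcess(text, debug).emoticon(null);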
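One way to approximate this behavior in plain JavaScript is with Unicode property escapes. This is a sketch under assumptions: the package may detect emoticons differently, `removeEmojiSketch` is a hypothetical helper, and `Extended_Pictographic` also covers a few symbols (such as ©) that are not usually thought of as emoticons.

```javascript
// Pictographic characters plus skin-tone modifiers, each matched as a separate code point.
const EMOJI = /[\p{Extended_Pictographic}\p{Emoji_Modifier}]/gu;

// Hypothetical helper: no argument replaces with a space, null removes without replacement.
function removeEmojiSketch(text, replace) {
    return text.replace(EMOJI, replace === null ? '' : ' ');
}

console.log(removeEmojiSketch('a\u{1F603}b'));       // the emoji becomes a space
console.log(removeEmojiSketch('a\u{1F603}b', null)); // the emoji is dropped
```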
Remove duplicate whitespaces
const processedText = persianPreProcess(text, debug).whitespace();
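Collapsing duplicate whitespace is commonly done with a single regular expression, as in the sketch below. `collapseWhitespaceSketch` is a hypothetical helper, and trimming the ends is an assumption; the package may or may not do so.

```javascript
// Hypothetical helper: replace every run of whitespace with one space, then trim the ends.
function collapseWhitespaceSketch(text) {
    return text.replace(/\s+/g, ' ').trim();
}

console.log(collapseWhitespaceSketch('  one \n\n two\t\tthree  ')); // single spaces between words
```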
Get processed text
const stringValue = processedText.toString();
Get list of all words
const arrayList = processedText.toArray();
Get list of unique words
const uniqueList = processedText.toUnique();
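The relationship between the word list and the unique word list can be sketched in plain JavaScript. The helper names are hypothetical, and it is an assumption that words are separated by whitespace and that first-occurrence order is kept.

```javascript
// Hypothetical helper: split the text into words, ignoring empty entries.
function toArraySketch(text) {
    return text.split(/\s+/).filter(Boolean);
}

// Hypothetical helper: keep the first occurrence of each word (Set preserves insertion order).
function toUniqueSketch(text) {
    return [...new Set(toArraySketch(text))];
}

console.log(toArraySketch('one two one three'));
console.log(toUniqueSketch('one two one three'));
```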
Get pre process debug data
const debugInfo = processedText.getDebug();
const text = `
استفاده از حرف ك عربی و کاراکتر خاص ﷼ و عدد عربی ٦
انگلیسی: using ß character and ⅜ and < ⒄℅
حط دوم انگلیسی: and special character: NJ
شکلک: 😃 👦🏿 🚩 👱🏽 🍉 🏒 🚍 🥬
انتهای متن
`;
const persianPreProcess = require('persian-preprocess');
const processedText = persianPreProcess(text, true)
    .normalize()
    .number('persian')
    .lowercase()
    .punctuation({
        mark: false
    })
    .remove({
        number: null
    })
    .stopword({
        custom: ['حرف', 'خط']
    })
    .emoticon()
    .whitespace();
const result = processedText.toString();
console.log(processedText.getDebug());
{
TOTAL: { duration: 0.054, change: -96, length: 212 },
normilize: {
duration: 0.018,
change: 4,
length: 216,
match: [
'ك', 'ß', '﷼',
'٦', '⒄', '⅜',
'<', '℅', 'NJ'
]
},
number: { duration: 0.001, change: 0, length: 216, match: [] },
lowercase: { duration: 0, change: 0, length: 216 },
punctuation: {
duration: 0,
change: 0,
length: 216,
match: [ ':', '/', '<', '%' ]
},
remove: {
duration: 0,
change: -5,
length: 211,
match: [ '۶', '۳', '۸', '۱', '۷' ]
},
stopword: {
duration: 0.028,
change: -6,
length: 205,
match: [
'and', ' دوم ',
' از ', ' ک ',
' و ', ' حرف ',
' حط '
]
},
emoticon: {
duration: 0.002,
change: -10,
length: 195,
match: [
'😃', '👦', '🏿',
'🚩', '👱', '🏽',
'🍉', '🏒', '🚍',
'🥬'
]
},
whitespace: { duration: 0.001, change: -79, length: 116 }
}
Name | Description |
---|---|
duration | Process time in milliseconds |
change | Number of characters added to or removed from the text |
length | Text length after the process |
match | List of characters/words matched in the process |
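The debug object is plain data, so it can be summarized with ordinary JavaScript. The sketch below copies the numbers from the sample output above (match lists omitted) and uses hypothetical helpers to show two readings: the per-step changes add up to TOTAL.change, and the step with the largest reduction can be picked out.

```javascript
// Values copied from the sample debug output; key names are kept exactly as printed.
const debugSample = {
    TOTAL: { duration: 0.054, change: -96, length: 212 },
    normilize: { duration: 0.018, change: 4, length: 216 },
    number: { duration: 0.001, change: 0, length: 216 },
    lowercase: { duration: 0, change: 0, length: 216 },
    punctuation: { duration: 0, change: 0, length: 216 },
    remove: { duration: 0, change: -5, length: 211 },
    stopword: { duration: 0.028, change: -6, length: 205 },
    emoticon: { duration: 0.002, change: -10, length: 195 },
    whitespace: { duration: 0.001, change: -79, length: 116 },
};

// Hypothetical helper: per-step entries, skipping the TOTAL summary.
function stepEntries(debug) {
    return Object.entries(debug).filter(([name]) => name !== 'TOTAL');
}

// Sum of the per-step character changes.
function sumChanges(debug) {
    return stepEntries(debug).reduce((sum, [, info]) => sum + info.change, 0);
}

// Name of the step that removed the most characters.
function largestReduction(debug) {
    return stepEntries(debug).reduce((min, cur) => (cur[1].change < min[1].change ? cur : min))[0];
}

console.log(sumChanges(debugSample));       // -96, equal to TOTAL.change
console.log(largestReduction(debugSample)); // whitespace
```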
git clone https://github.com/webilix/persian-preprocess.git
npm install
npm test