nlp-data-py
Create train, test and validation datasets for NLP. Currently, creating these datasets from Wikipedia is supported.
Create train, test and validation datasets for NLP from Wikipedia. Datasets are created from the provided seed wiki pages and by traversing links within those pages that meet the specified match pattern. The idea is to leverage links within wiki pages to create more data: Wikipedia pages already link to additional relevant pages, and those links can be narrowed down through pattern matching.
pip install nlp-data-py
wiki_dataset --seed Brain Human_Brain --match ".*neuro|.*neural"
In short, the above command reads the Brain and Human_Brain Wikipedia pages, tracks links from those pages that match the pattern, reads up to 20 additional linked pages (the default limit), and writes train, val and test datasets split 80%/10%/10% under ./vars/datasets/ by default.
seed (--seed)
Description
: List of initial wiki page names to start with.
Default
: None. If nothing is specified, items from the pickle file will be read. If the pickle file also does not exist, nothing is done and
the code exits.
Example
:
wiki_dataset --seed Brain Human_Brain
match (-m)
Description
: This option serves two purposes: tracking links in wiki pages, and reading additional pages either from those links or from the saved pickle file. Only links that match the pattern are considered for addition to the datasets.
Also see limit.
Default
: "". All links from a wiki page will be considered and tracked.
Example
: In the example below, any links that match neuro or neural will be tracked
and/or read to create datasets. (A small illustration of this filtering follows the command.)
wiki_dataset --seed Brain Human_Brain -m ".*neuro|.*neural"
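To make the pattern concrete, here is a small, hypothetical illustration (not nlp-data-py's own code) of filtering linked page titles with a regex like .*neuro|.*neural. Case-insensitive matching is an assumption here; the library's exact matching semantics may differ.

import re

# Hypothetical illustration: filter linked page titles with the match pattern.
# Case-insensitive matching is an assumption; the library may behave differently.
pattern = re.compile(r".*neuro|.*neural", re.IGNORECASE)

links = ["Neuroscience", "Neural network", "Cerebellum", "Neuroplasticity"]
tracked = [title for title in links if pattern.match(title)]
print(tracked)  # ['Neuroscience', 'Neural network', 'Neuroplasticity']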
recursive (-r)
Description
: If this option is true, additional pages will be read, either based on links or on a previously scanned pickle file. This option is used in conjunction with limit to determine the number of additional pages to read.
Also see limit.
Default
: true
Example
: In the example below, only the Brain and Human_Brain wiki pages will be read.
However, links from these pages that match the match pattern will be tracked
and stored in a pickle file, which may be used later on.
wiki_dataset --seed Brain Human_Brain -m ".*neuro|.*neural" -r false
limit (-l)
Description
: Wikipedia may contain too many links, especially when looking at pages recursively. This option limits the number of additional pages to read. It is only relevant if recursive is set to true.
Default
: 20
Example
: In the example below, along with reading Brain and Human_Brain
and tracking links that match the match pattern, 100 additional pages
are read, either based on links or on the pickle file.
wiki_dataset --seed Brain Human_Brain -m ".*neuro|.*neural" -l 100
pickle (-p)
Description
: Path to the pickle file that tracks which items have been read. This enables reading items incrementally. The pickle file stores a dict, for example:
{
    "item1": 1,
    "item2": 0,
    "item3": -1
}
In the above example, item1 was read previously and hence won't be read again, item2 was not read and will be considered in future reads, and item3 errored out in previous reads and will not be attempted again. (A small sketch for inspecting this file follows the example command below.)
Default
: ./vars/scanned.pkl
Example
:
wiki_dataset --seed Brain Human_Brain -m ".*neuro|.*neural" -p scanned.pkl
In the above example, read status is tracked in scanned.pkl in the current directory instead of the default ./vars/scanned.pkl.
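The tracking file itself can be inspected with plain pickle if needed. The snippet below is a minimal sketch assuming the file holds the simple status dict shown above (1 = read, 0 = pending, -1 = errored) at the default path; adjust the path to your setup.

import pickle

# Minimal sketch: peek at the scanned pickle file (assumed to be a plain dict).
# Status codes, as described above: 1 = read, 0 = pending, -1 = errored.
with open("./vars/scanned.pkl", "rb") as f:
    scanned = pickle.load(f)

pending = [page for page, status in scanned.items() if status == 0]
print(f"{len(pending)} pages still pending: {pending[:10]}")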
output (-o)
Description
: Path where the datasets are written.
Default
: ./vars/datasets/
Example
:
wiki_dataset --seed Brain Human_Brain -m ".*neuro|.*neural" -o ./datasets/
In the above example, the train, val and test datasets will be created in the datasets/ folder. Future re-runs will append to these files.
chunk_splitter (-cs)
Description
: This option, along with chunks_per_page, defines a page. This comes in handy when creating datasets, especially if the data needs to be shuffled.
Default
: '(?<=[.!?]) +'
Example
:
wiki_dataset --seed Brain Human_Brain -m ".*neuro|.*neural" -cs '(?<=[.!?]) +'
In the above example, text from wiki pages is split into sentences (chunks) on ., ! or ?. Note: on Windows, use double quotes, e.g. -cs "(?<=[.!?]) +".
chunks_per_page (-cp)
Description
: This defines a page, i.e. the number of chunks that make up a page. This comes in handy when data needs to be shuffled for creating train, val and test datasets.
Default
: 5
Example
:
wiki_dataset --seed Brain Human_Brain -m ".*neuro|.*neural" -cs '(?<=[.!?]) +' -cp 10
In the above example, the wiki page is split into chunks on ., ? or !, and 10 contiguous chunks form a page. For example, if a wiki page has 100 sentences, groups of 10 sentences form a page, so the wiki page contains 10 pages. (A rough sketch of this chunking follows.)
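The sketch below illustrates the chunking idea only; it is not the library's internal Book implementation. Text is split into chunks with the splitter regex, and a fixed number of consecutive chunks form a page.

import re

# Illustration only: how chunk_splitter and chunks_per_page carve text into pages.
text = "First sentence. Second one! Third? Fourth. Fifth. Sixth."
chunk_splitter = r'(?<=[.!?]) +'
chunks_per_page = 2

chunks = re.split(chunk_splitter, text)   # 6 sentence chunks
pages = [chunks[i:i + chunks_per_page]
         for i in range(0, len(chunks), chunks_per_page)]
print(len(pages))   # 3 pages of 2 chunks each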
split_ratio (-sr)
Description
: Ratios for splitting the train, val and test datasets. The split happens based on the number of pages.
Default
: 80%, 10% and 10% for train, val and test
Example
:
wiki_dataset --seed Brain Human_Brain -m ".*neuro|.*neural" -cs '(?<=[.!?]) +' -cp 10 -sr .8 0.1 0.1
If a wiki page has 10 pages (as defined by chunk_splitter and chunks_per_page), then in the above example train will contain 8 pages, and val and test will contain 1 page each. Which pages end up in each dataset depends on whether shuffle is on: if it is, pages are shuffled first and any 8 pages can make up train, with the remaining 2 going to val and test; if it is off, the first 8 pages form train, the next page is val and the final page is test. (A short sketch of this split follows.)
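The split behaviour described above can be pictured with a few lines of plain Python. This is only a sketch, not nlp-data-py's Splitter code.

import random

# Sketch: split 10 pages with ratios [0.8, 0.1, 0.1], optionally shuffling first.
pages = [f"page_{i}" for i in range(10)]
ratios = [0.8, 0.1, 0.1]
shuffle = True

if shuffle:
    random.shuffle(pages)

n_train = int(len(pages) * ratios[0])   # 8
n_val = int(len(pages) * ratios[1])     # 1
train = pages[:n_train]
val = pages[n_train:n_train + n_val]
test = pages[n_train + n_val:]
print(len(train), len(val), len(test))  # 8 1 1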
dataset_names (-ds)
Description
: Names of the datasets.
Default
: train, val and test
Example
:
wiki_dataset --seed Brain Human_Brain -m ".*neuro|.*neural" -sr 80 20 -ds set1 set2
In the above example, two datasets, set1 and set2, will be created.
shuffle (-sf)
Description
: Shuffle pages (see chunk_splitter and chunks_per_page for what constitutes a page) before creating datasets.
Default
: True
Example
:
wiki_dataset --seed Brain Human_Brain -m ".*neuro|.*neural" -sf false
Since shuffle is false in the above example, pages from the wiki page are taken in order: with the default 80%, 10% and 10% ratio, the first 80% of the pages go to train, the next 10% to val and the final 10% to test. When shuffle is on, pages are shuffled first, so any 80% of the pages can end up in train and any of the remaining 20% in val and test.
Below is a simple example:
from nlp_data_py import WikiDataset
WikiDataset.create_dataset_from_wiki(seeds=['Brain', 'Human_brain'], match=".*neuro")
In the above example, the Brain and Human_brain wiki pages are read, links matching .*neuro are tracked and followed, and train, val and test datasets are created with the default settings described above.
Below is an example where default options are overridden:
from nlp_data_py import WikiDataset
from nlp_data_py import Book, Splitter

scanned_pickle = "./scanned.pkl"
save_dataset_path = "./datasets/"

book_def: Book = Book(chunk_splitter='(?<=[.!?]) +', chunks_per_page=2)
splitter: Splitter = Splitter(split_ratios=[0.5, 0.25, 0.25],
                              dataset_names=['train', 'val', 'test'],
                              shuffle=False)

wiki = WikiDataset.create_dataset_from_wiki(seeds=['Brain', 'Human_brain'],
                                            match=".*neuro",
                                            recursive=True, limit=2,
                                            scanned_pickle=scanned_pickle,
                                            save_dataset_path=save_dataset_path,
                                            book_def=book_def,
                                            splitter=splitter)