nlp-corpus

texts for integration testing of nlp components

Version: 4.4.0 (latest)

Version published: 2 years ago

Weekly downloads: 149; increased by 60.22%

Maintainers: 1

Created: 9 years ago

nlp-corpus

lots of weird english sentences

npm install nlp-corpus

by Spencer Kelly

see french, german, and spanish translations

nlp-corpus is a proud series of weird texts from a delicious smattering of sources - aimed at getting cosmopolitan flavours of english - highbrow, lowbrow and unibrow - dialects, typos, shakespeare, unicode, 19th century, aggressive emoji, and epic nsfw slurs into your training data.

it is 50,000 sentences, or 5mb, split into 50 files of randomized sentences.

its role is mainly to kick the tires a bit, as creatively as possible, for fuzzy linguistic parsing.

suggestive American rock lyrics
campy Friends tv-show transcripts
vulnerable drug-trip reports from Erowid
singaporean SMS messages
State of the union logorrhea
generally-offensive 90's rap
Legal descriptions in NAFTA
20th century romantic fiction
pedantic arguments on reddit
arcane and dense jeopardy questions

Note that some of this text is nsfw, or contains offensive content, badly-formatted unicode, weird indentation, ascii art, antiquated shorthands, etc.
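To get a feel for what the note above means in practice, here is a rough sketch that tags a sentence with a few kinds of oddness. `describeOddities` is a hypothetical helper, not part of nlp-corpus, and the three example strings are hand-written stand-ins rather than real corpus text.

```javascript
// hypothetical helper (not part of nlp-corpus):
// tag a sentence with a few kinds of oddness a parser may trip on
function describeOddities(text) {
  const tags = []
  // any character outside the 7-bit ascii range
  if (/[^\x00-\x7F]/.test(text)) tags.push('non-ascii')
  // weird indentation at the start of the line
  if (/^\s+/.test(text)) tags.push('leading-whitespace')
  // control characters other than tab, newline and carriage return
  if (/[\x00-\x08\x0B\x0C\x0E-\x1F]/.test(text)) tags.push('control-chars')
  // unicode that is not in composed (NFC) form
  if (text !== text.normalize('NFC')) tags.push('not-nfc')
  return tags
}

// hand-written stand-ins; the last one uses a combining accent
const examples = ['plain sentence.', '   indented line', 'cafe\u0301 time']
for (const text of examples) {
  console.log(JSON.stringify(text), describeOddities(text))
}
```

In real use the same function could be mapped over the array returned by `corpus.all()` to see which sentences deserve a closer look.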

These texts were found by just clicking around on the internet. Running them blindly through your parser should be considered fair-use, but please don't commercially republish them, or anything like that.

ok go.

npm install nlp-corpus

running this library server-side loads a subset of the documents - abt 3mb total

import corpus from 'nlp-corpus'

// all 10k sentences, in an array
let arr = corpus.all()

// or load just a few:
arr = corpus.some(400)

// random sentence
let str = corpus.random()
// random 5 sentences (arr was already declared above, so reuse it)
arr = corpus.some(5) // n can only be <= 1,500
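One way to kick the tires with these sentences is to run each one through your parser and collect whatever makes it throw. The sketch below assumes nothing about nlp-corpus beyond the array of strings shown above; `kickTires` and `toyParse` are hypothetical names, and the sample array is a stand-in so the snippet runs on its own.

```javascript
// hypothetical harness: run a parser over every sentence
// and collect the ones that make it throw
function kickTires(sentences, parse) {
  const failures = []
  sentences.forEach((text, index) => {
    try {
      parse(text)
    } catch (err) {
      failures.push({ index, text, message: String(err) })
    }
  })
  return failures
}

// stand-in data; in real use pass corpus.all() or corpus.some(n)
const sample = ['ok go.', '', 'so  much   whitespace', 'naïve café ☕']

// toy parser that refuses empty input, just to show a failure being caught
const toyParse = (text) => {
  if (text.trim() === '') {
    throw new Error('empty input')
  }
  return text.trim().split(/\s+/)
}

const failures = kickTires(sample, toyParse)
console.log(failures.length + ' of ' + sample.length + ' sentences failed')
```

Keeping the index and the original text with each failure makes it easy to copy the offending sentence into a regression test later.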

or on the client-side, there's a one-liner that fetches the docs:

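<script src="http://unpkg.com/nlp-corpus"></script>
<script>
  // await is not allowed at the top level of a classic script, so wrap the calls
  ;(async () => {
    // load a document lazily
    await nlpCorpus.fetch(2) //1 - 20
    // (each doc is abt 150kb)
    let arr = nlpCorpus.random(4) //1 - 1,500
  })()
</script>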

Contents:

Dialog

Music lyrics

Fiction

Speeches

Wikipedia

Internet comments

Questions

Instructions

News Headlines

Reviews

Legal Text

Jokes & puns

Literature

Email text
