dumpster-dip
The data exports from Wikimedia, arguably the world's most important datasets, exist as huge XML files in a notorious markup format.
dumpster-dip can flip this dataset into individual JSON or text files.
The easiest way to get started is to simply run:
npx dumpster-dip
which is a wild, no-install, no-dependency way to get going.
Follow the prompts, and this will download, unzip, and parse any-language Wikipedia into your selected format.
The optional params are:
--lang fr # parse the French wikipedia
--output encyclopedia # put all 'E' pages into ./E/
--text # output plaintext instead of JSON
dumpster-dip is also available as a powerful JavaScript library:
npm install dumpster-dip
import dumpster from 'dumpster-dip' // or require('dumpster-dip')
await dumpster({ input: './enwiki-latest-pages-articles.xml' }) // 😅
This requires downloading and unzipping a dump yourself (instructions below). Depending on the language, it may take a couple of hours.
1. Download a dump
Cruise the Wikipedia dump page and look for ${LANG}wiki-latest-pages-articles.xml.bz2
2. Unzip it
bzip2 -d ./enwiki-latest-pages-articles.xml.bz2
3. Point dumpster-dip at it
import dip from 'dumpster-dip'
const opts = {
input: './enwiki-latest-pages-articles.xml',
parse: function (doc) {
return doc.sentences()[0].text() // return the first sentence of each page
}
}
dip(opts).then(() => {
console.log('done!')
})
The English wikipedia takes about 4 hours on a MacBook. See expected article counts here.
{
input: './enwiki-latest-pages-articles.xml', // path to unzipped dump file, relative to cwd
outputDir: './dip', // directory for all our new file(s)
outputMode: 'nested', // how we should write the results
// define how many concurrent workers to run
workers: cpuCount, // default is cpu count
// interval to log status
heartbeat: 5000, // every 5 seconds
// which wikipedia namespaces to handle (null will do all)
namespace: 0, //(default article namespace)
// parse redirects, too
redirects: false,
// parse disambiguation pages, too
disambiguation: true,
// allow a custom wtf_wikipedia parsing library
libPath: 'wtf_wikipedia',
// should we skip this page or return something?
doPage: function (doc) {
return true
},
// what to return, for every page
//- avoid using an arrow-function
parse: function (doc) {
return doc.json()
}
}
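The cpuCount value in the options above is a placeholder, not something dumpster-dip exports. In Node it can be computed with the built-in os module, for example:

```javascript
import os from 'node:os'

// Number of logical CPUs available to this process.
// availableParallelism() is preferred on Node 18.14+,
// with os.cpus().length as a fallback for older versions.
const cpuCount = os.availableParallelism
  ? os.availableParallelism()
  : os.cpus().length

console.log(cpuCount)
```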
dumpster-dip comes with 4 output formats:
encyclopedia: 'E..' pages in ./e/
encyclopedia-two: 'Ed..' pages in ./ed/
Sometimes operating systems don't like having ~6m files in one folder, so these options allow different nesting structures.
To put files in folders indexed by their first letter, do:
let opts = {
outputDir: './results',
outputMode: 'encyclopedia'
}
Remember, some directories become way larger than others. Also remember that titles are UTF-8.
For two-letter folders, use outputMode: 'encyclopedia-two'.
The default 'nested' format nests each file 2-deep, using the first 4 characters of the filename's hash:
/BE
/EF
/Dennis_Rodman.txt
/Hillary_Clinton.txt
Although these directory names are meaningless, the advantage of this format is that files are distributed evenly, instead of piling up in the 'E' directory.
This is the same scheme that Wikipedia uses internally.
As a helper, this library exposes a function for navigating this directory scheme:
import getPath from 'dumpster-dip/nested-path'
let file = getPath('Dennis Rodman')
// ./BE/EF/Dennis_Rodman.txt
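As a sketch of how such a scheme works, the 2-deep path can be derived by hashing the underscored title and splitting off the first 4 hex characters. The choice of MD5 and the uppercase directory names here are assumptions for illustration, not necessarily what dumpster-dip uses internally:

```javascript
import { createHash } from 'node:crypto'

// Sketch of a 2-deep nested path: hash the underscored title,
// then use the first two hex pairs as directory names
function nestedPath(title) {
  const fileName = title.replace(/ /g, '_')
  const hash = createHash('md5').update(fileName).digest('hex')
  const dir1 = hash.slice(0, 2).toUpperCase()
  const dir2 = hash.slice(2, 4).toUpperCase()
  return `./${dir1}/${dir2}/${fileName}.txt`
}

console.log(nestedPath('Dennis Rodman'))
```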
If you want all files in one flat directory, you can cross your fingers and do:
let opts = {
outputDir: './results',
outputMode: 'flat'
}
You may want all results in one newline-delimited file. Using this format, you can produce TSV or CSV files:
let opts = {
outputDir: './results',
outputMode: 'ndjson',
parse: function (doc) {
return [doc.title(), doc.text().length].join('\t')
}
}
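Since each parse result becomes one line, the output can be consumed later with ordinary string handling. A minimal sketch of reading such TSV rows back (parseTsv is a hypothetical helper, not part of dumpster-dip):

```javascript
// Parse newline-delimited TSV rows like those produced above,
// turning each line back into a { title, length } record
function parseTsv(text) {
  return text
    .split('\n')
    .filter((line) => line.length > 0)
    .map((line) => {
      const [title, length] = line.split('\t')
      return { title, length: Number(length) }
    })
}

const sample = 'Dennis Rodman\t4521\nToronto\t9802'
console.log(parseTsv(sample))
// → [{ title: 'Dennis Rodman', length: 4521 }, { title: 'Toronto', length: 9802 }]
```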
Wikipedia is often a complicated place. Getting specific data may require some investigation, and experimentation:
See runnable examples in ./examples
Process only the 13,000 pages in the category American men's basketball players:
await dip({
input: `./enwiki-latest-pages-articles.xml`,
doPage: function (doc) {
return doc.categories().find((cat) => cat === `American men's basketball players`)
},
parse: function (doc) {
return doc.infobox().get('birth_date')
}
})
Look for pages with the Film infobox and grab some properties:
await dip({
input: `./enwiki-latest-pages-articles.xml`,
outputMode: 'encyclopedia',
doPage: function (doc) {
// look for anything with a 'Film' infobox
return doc.infobox() && doc.infobox().type() === 'film'
},
parse: function (doc) {
let inf = doc.infobox()
// pluck some values from its infobox
return {
title: doc.title(),
budget: inf.get('budget'),
gross: inf.get('gross')
}
}
})
Talk pages are not found in the normal 'latest-pages-articles.xml' dump. Instead, you must download the larger 'latest-pages-meta-current.xml' dump. To process only Talk pages, set 'namespace' to 1.
const opts = {
input: `./enwiki-latest-pages-meta-current.xml`,
namespace: 1, // do talk pages only
parse: function (doc) {
return doc.text() //return their text
}
}
Given the parse callback, you're free to return anything you'd like.
One of the charms of wtf_wikipedia is its plugin system, which allows users to add new features.
Here we apply a custom plugin to our wtf lib, and pass it in so it's available to each worker.
In ./myLib.js:
import wtf from 'wtf_wikipedia'
// add custom analysis as a plugin
wtf.plugin((models, templates) => {
// add a new method
models.Doc.prototype.firstSentence = function () {
return this.sentences()[0].text()
}
// add support for a missing template
templates.pingponggame = function (tmpl, list) {
let arr = tmpl.split('|')
return arr[1] + ' to ' + arr[2]
}
})
export default wtf
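To illustrate how such a plugin hook works in general, here is a toy model (not wtf_wikipedia's actual internals): the library simply hands its model classes and template registry to each registered plugin, which may extend either one.

```javascript
// Toy model of a plugin system: plugins receive the model classes
// and a template registry, and can add methods or template handlers
class Doc {
  constructor(text) {
    this.text = text
  }
}
const templates = {}
const lib = {
  plugin(fn) {
    fn({ Doc }, templates)
  }
}

// a plugin adds a new Doc method and a template handler
lib.plugin((models, templates) => {
  models.Doc.prototype.shout = function () {
    return this.text.toUpperCase()
  }
  templates.pingponggame = (tmpl) => tmpl.split('|')[1]
})

console.log(new Doc('hello').shout()) // → 'HELLO'
```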
Then we can pass this version into dumpster-dip:
import dip from 'dumpster-dip'
dip({
input: '/path/to/dump.xml',
libPath: './myLib.js', // our version (relative to cwd)
parse: function (doc) {
return doc.firstSentence() // use custom method
}
})
See the plugins available, such as the NHL season parser, the nsfw tagger, or a parser for disambiguation pages.
We are committed to making this library a great tool for parsing MediaWiki projects.
PRs welcomed and respected.
MIT