dumpster-dip
The data exports from Wikimedia, arguably the world's most important datasets, exist as huge XML files in a notorious markup format.
dumpster-dip can flip this dataset into individual JSON or text files.
The easiest way to get started is to simply run:
npx dumpster-dip
which is a wild, no-install, no-dependency way to get going.
Follow the prompts, and this will download, unzip, and parse any-language Wikipedia into your selected format.
The optional params are:
--lang fr # parse the French wikipedia
--output encyclopedia # put all 'E' pages into ./E/
--text # output plaintext instead of JSON
dumpster-dip is also available as a powerful JavaScript library:
npm install dumpster-dip
import dumpster from 'dumpster-dip' // or require('dumpster-dip')
await dumpster({ input: './enwiki-latest-pages-articles.xml' }) // 😅
This requires downloading and unzipping a dump yourself (instructions below). Depending on the language, it may take a couple of hours.
1. Download a dump
Cruise the Wikipedia dump page and look for ${LANG}wiki-latest-pages-articles.xml.bz2
2. Unzip it
bzip2 -d ./enwiki-latest-pages-articles.xml.bz2
3. Point dumpster-dip at it
import dip from 'dumpster-dip'
const opts = {
input: './enwiki-latest-pages-articles.xml',
parse: function (doc) {
return doc.sentences()[0].text() // return the first sentence of each page
}
}
dip(opts).then(() => {
console.log('done!')
})
The English wikipedia takes about 4 hours on a MacBook. See expected article counts here.
{
input: './enwiki-latest-pages-articles.xml', // path to unzipped dump file, relative to cwd
outputDir: './dip', // directory for all our new file(s)
outputMode: 'nested', // how we should write the results
// define how many concurrent workers to run
workers: cpuCount, // default is cpu count
// interval to log status
heartbeat: 5000, // every 5 seconds
// which wikipedia namespaces to handle (null will do all)
namespace: 0, //(default article namespace)
// parse redirects, too
redirects: false,
// parse disambiguation pages, too
disambiguation: true,
// allow a custom wtf_wikipedia parsing library
libPath: 'wtf_wikipedia',
// should we skip this page or return something?
doPage: function (doc) {
return true
},
// what to return, for every page
//- avoid using an arrow-function
parse: function (doc) {
return doc.json()
}
}
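The cpuCount value in the options above is a placeholder, not something dumpster-dip exports. In Node it can be computed with the built-in os module, for example:

```javascript
import os from 'node:os'

// Number of logical CPUs available to this process.
// availableParallelism() is preferred on Node 18.14+,
// with os.cpus().length as a fallback for older versions.
const cpuCount = os.availableParallelism
  ? os.availableParallelism()
  : os.cpus().length

console.log(cpuCount)
```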
dumpster-dip comes with 4 output formats:
encyclopedia: 'E..' pages in ./e/
encyclopedia-two: 'Ed..' pages in ./ed/
Sometimes operating systems don't like having ~6m files in one folder, so these options allow different nesting structures.
To put files in folders indexed by their first letter, do:
let opts = {
outputDir: './results',
outputMode: 'encyclopedia'
}
Remember, some directories become way larger than others. Also remember that titles are UTF-8.
For two-letter folders, use outputMode: 'encyclopedia-two'.
The default 'nested' format nests each file 2-deep, using the first 4 characters of the filename's hash:
/BE
/EF
/Dennis_Rodman.txt
/Hillary_Clinton.txt
Although these directory names are meaningless, the advantage of this format is that files are distributed evenly, instead of piling up in the 'E' directory.
This is the same scheme that Wikipedia uses internally.
As a helper, this library exposes a function for navigating this directory scheme:
import getPath from 'dumpster-dip/nested-path'
let file = getPath('Dennis Rodman')
// ./BE/EF/Dennis_Rodman.txt
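As a sketch of how such a scheme works, the 2-deep path can be derived by hashing the underscored title and splitting off the first 4 hex characters. The choice of MD5 and the uppercase directory names here are assumptions for illustration, not necessarily what dumpster-dip uses internally:

```javascript
import { createHash } from 'node:crypto'

// Sketch of a 2-deep nested path: hash the underscored title,
// then use the first two hex pairs as directory names
function nestedPath(title) {
  const fileName = title.replace(/ /g, '_')
  const hash = createHash('md5').update(fileName).digest('hex')
  const dir1 = hash.slice(0, 2).toUpperCase()
  const dir2 = hash.slice(2, 4).toUpperCase()
  return `./${dir1}/${dir2}/${fileName}.txt`
}

console.log(nestedPath('Dennis Rodman'))
```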
If you want all files in one flat directory, you can cross your fingers and do:
let opts = {
outputDir: './results',
outputMode: 'flat'
}
You may want all results in one newline-delimited file. Using this format, you can produce TSV or CSV files:
let opts = {
outputDir: './results',
outputMode: 'ndjson',
parse: function (doc) {
return [doc.title(), doc.text().length].join('\t')
}
}
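Since each parse result becomes one line, the output can be consumed later with ordinary string handling. A minimal sketch of reading such TSV rows back (parseTsv is a hypothetical helper, not part of dumpster-dip):

```javascript
// Parse newline-delimited TSV rows like those produced above,
// turning each line back into a { title, length } record
function parseTsv(text) {
  return text
    .split('\n')
    .filter((line) => line.length > 0)
    .map((line) => {
      const [title, length] = line.split('\t')
      return { title, length: Number(length) }
    })
}

const sample = 'Dennis Rodman\t4521\nToronto\t9802'
console.log(parseTsv(sample))
// → [{ title: 'Dennis Rodman', length: 4521 }, { title: 'Toronto', length: 9802 }]
```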
Wikipedia is often a complicated place. Getting specific data may require some investigation, and experimentation:
See runnable examples in ./examples
Process only the 13,000 pages in the category American men's basketball players:
await dip({
input: `./enwiki-latest-pages-articles.xml`,
doPage: function (doc) {
return doc.categories().find((cat) => cat === `American men's basketball players`)
},
parse: function (doc) {
return doc.infobox().get('birth_date')
}
})
Look for pages with the Film infobox and grab some properties:
await dip({
input: `./enwiki-latest-pages-articles.xml`,
outputMode: 'encyclopedia',
doPage: function (doc) {
// look for anything with a 'Film' infobox
return doc.infobox() && doc.infobox().type() === 'film'
},
parse: function (doc) {
let inf = doc.infobox()
// pluck some values from its infobox
return {
title: doc.title(),
budget: inf.get('budget'),
gross: inf.get('gross')
}
}
})
Talk pages are not found in the normal 'latest-pages-articles.xml' dump. Instead, you must download the larger 'latest-pages-meta-current.xml' dump. To process only Talk pages, set 'namespace' to 1.
const opts = {
input: `./enwiki-latest-pages-meta-current.xml`,
namespace: 1, // do talk pages only
parse: function (doc) {
return doc.text() //return their text
}
}
Given the parse callback, you're free to return anything you'd like.
One of the charms of wtf_wikipedia is its plugin system, which allows users to add new features.
Here we apply a custom plugin to our wtf lib, and pass it in so it's available to each worker.
In ./myLib.js:
import wtf from 'wtf_wikipedia'
// add custom analysis as a plugin
wtf.plugin((models, templates) => {
// add a new method
models.Doc.prototype.firstSentence = function () {
return this.sentences()[0].text()
}
// add support for a missing template
templates.pingponggame = function (tmpl, list) {
let arr = tmpl.split('|')
return arr[1] + ' to ' + arr[2]
}
})
export default wtf
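To illustrate how such a plugin hook works in general, here is a toy model (not wtf_wikipedia's actual internals): the library simply hands its model classes and template registry to each registered plugin, which may extend either one.

```javascript
// Toy model of a plugin system: plugins receive the model classes
// and a template registry, and can add methods or template handlers
class Doc {
  constructor(text) {
    this.text = text
  }
}
const templates = {}
const lib = {
  plugin(fn) {
    fn({ Doc }, templates)
  }
}

// a plugin adds a new Doc method and a template handler
lib.plugin((models, templates) => {
  models.Doc.prototype.shout = function () {
    return this.text.toUpperCase()
  }
  templates.pingponggame = (tmpl) => tmpl.split('|')[1]
})

console.log(new Doc('hello').shout()) // → 'HELLO'
```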
Then we can pass this version into dumpster-dip:
import dip from 'dumpster-dip'
dip({
input: '/path/to/dump.xml',
libPath: './myLib.js', // our version (relative to cwd)
parse: function (doc) {
return doc.firstSentence() // use custom method
}
})
See the plugins available, such as the NHL season parser, the nsfw tagger, or a parser for disambiguation pages.
We are committed to making this library a great tool for parsing MediaWiki projects.
PRs welcomed and respected.
MIT