Note
This repository is automatically generated from the main parser monorepo. Please submit any issues or pull requests there.
docx-to-vfile
Reads a .docx file and stores its components in vfile format to be processed by other tools, like reoff-parse.
Currently the whole archive is held in memory: the output is not streamed.
Reading the input file does happen in streams.
Based on docxtract
Contents
What is this?
This package reads a .docx file and stores its components in vfile format to be processed by other tools, like reoff-parse. This is the first step in a pipeline to convert a .docx file to many other formats using the unified ecosystem.
A .docx document is just a zip file with a bunch of XML and other files (such as images) in it. This package unzips the .docx file, reads the XML files and images, and stores them in a VFile object, which is a virtual file format that can be used by other tools in the unified ecosystem.
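Because a .docx file is a zip archive, its first bytes are the ZIP local file header signature. The sketch below shows a quick sanity check you could run before handing data to this package. `looksLikeZip` is a hypothetical helper written for illustration; it is not part of docx-to-vfile.

```typescript
// ZIP local file header signature: the bytes "PK\x03\x04".
const ZIP_LOCAL_FILE_HEADER = [0x50, 0x4b, 0x03, 0x04] as const

// Hypothetical helper (not part of docx-to-vfile): returns true when the
// bytes begin with the ZIP signature, as a non-empty .docx archive does.
function looksLikeZip(bytes: Uint8Array): boolean {
  if (bytes.length < ZIP_LOCAL_FILE_HEADER.length) return false
  return ZIP_LOCAL_FILE_HEADER.every((byte, i) => bytes[i] === byte)
}
```

This only confirms that the input is a zip archive; it says nothing about whether the archive actually contains a valid Word document.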
When should I use this?
Mainly to read a .docx file to feed into reoff-parse or something similar, or if you want to access the raw data of a .docx file for some reason.
Install
This package is ESM only. In Node.js (version 12.20+, 14.14+, 16.0+, 18.0+), install with pnpm:
pnpm add docx-to-vfile
Use
In Node
import { docxToVFile } from 'docx-to-vfile'
Pass a path to a .docx file
const file = await docxToVFile('path/to/file.docx')
Pass a Blob
const blob = await fetch('https://path/to/file.docx').then((res) => res.blob())
const file = await docxToVFile(blob)
Pass a Buffer
import { readFile } from 'fs/promises'
const buffer = await readFile('path/to/file.docx')
const file = await docxToVFile(buffer)
Pass a ReadStream
import { createReadStream } from 'fs'
const file = await docxToVFile(createReadStream('path/to/file.docx'))
In the browser
import { docxToVFile } from 'docx-to-vfile/browser'
Pass a File
<input type="file" />
document.querySelector('input[type="file"]')?.addEventListener('change', async (e) => {
const file = await docxToVFile(e.target.files[0])
})
Output
Using the default settings, the main value of the VFile will be the content of the main document, and the data will contain the content of the other files in the .docx archive. Media files will be stored in the media property.
const output = {
data: {
'word/footnotes.xml': '<?xml version ...',
'_rels/rels': '<?xml version ...',
relations: {
rId9: 'footnotes.xml',
rId8: 'endnotes.xml',
},
media: {
media/image1.png:
},
},
value:
messages: [],
history: [],
cwd: './',
}
String(output) === output.value
API
docxToVFile()
Takes a docx file as a path string, Blob, ArrayBuffer, File, or Buffer and returns a VFile with the contents of the document.xml file as the root, and the contents of the other xml files as data.
Signature
docxToVFile(file: string | Blob | ArrayBuffer | File | Buffer, userOptions?: Options): Promise<VFile>;
Parameters
Name | Type | Description |
---|---|---|
file | string \| Blob \| ArrayBuffer \| File \| Buffer | The .docx file to read |
userOptions? | Options | - |
Returns
Promise<VFile>
A VFile with the contents of the document.xml file as the root, and the contents of the other xml files as data.
Defined in: lib/docx-to-vfile-unzipit.ts:91
DocxVFileData
The data attribute of the VFile will contain the following:
Indexable
[key: XMLOrRelsString]: string | undefined
Properties
media
object
The media files in the .docx file
Possibly undefined only to be compatible with the VFile interface
Since
0.5.0 - Added media, removed images
Index signature
Type declaration
Defined in: lib/docx-to-vfile-unzipit.ts:53
parsed?
object
The parsed .xml files in the .docx file
Usually added by reoff-parse
Index signature
[key: XMLOrRelsString]: Root | undefined
Type declaration
Defined in: lib/docx-to-vfile-unzipit.ts:72
relations?
object
{
document: {
};
endnotes?: {
};
footnotes?: {
};
}
The relations between the .xml files in the .docx file
Possibly undefined only to be compatible with the VFile interface
Since
0.7.0 - Added relations.footnotes and relations.endnotes. relations.document is now an alias for relations. This now gets added by reoff-parse.
Type declaration
Member | Type |
---|
document | { } |
endnotes ? | { } |
footnotes ? | { } |
Defined in: lib/docx-to-vfile-unzipit.ts:61
Options
Hierarchy
Properties
include?
string[] | RegExp[] | (key: string) => boolean | "all" | "allWithDocumentXML"
Include only the specified files on the data attribute of the VFile.
This may be useful if you want to only do something with a subset of the files in the docx file, and don't intend to use 'reoff-stringify' to turn the VFile back into a docx file.
- If an array of strings or regexps is passed, only files that match one of the values will be included.
- If a function is passed, it will be called for each file and should return true to include the file.
- If the value is 'all', almost all files will be included, except for 'word/document.xml', as that already is the root of the VFile.
- If the value is 'allWithDocumentXML', all files will be included, including word/document.xml, even though that is already the root of the VFile. Useful if you really want to mimic the original docx file.
You should keep it at the default value if you intend to use 'reoff-stringify' to turn the VFile back into a docx file.
Default
'all'
Defined in: lib/docx-to-vfile-unzipit.ts:29
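The rules for include can be read as a single predicate over file paths. The following is a minimal sketch of those rules as described above, not the package's actual implementation. `shouldInclude` and `MAIN_DOCUMENT` are hypothetical names, and the sketch assumes that a string entry must equal the file path exactly, which the documentation does not specify.

```typescript
// Mirrors the documented type of the `include` option.
type Include =
  | string[]
  | RegExp[]
  | ((key: string) => boolean)
  | 'all'
  | 'allWithDocumentXML'

const MAIN_DOCUMENT = 'word/document.xml'

// Hypothetical helper, not the package's actual implementation.
// Assumption: string entries must equal the file path exactly.
function shouldInclude(key: string, include: Include = 'all'): boolean {
  if (include === 'allWithDocumentXML') return true
  // 'all' skips the main document, which is already the VFile's value.
  if (include === 'all') return key !== MAIN_DOCUMENT
  if (typeof include === 'function') return include(key)
  // Remaining case: an array of strings or regular expressions.
  const patterns: ReadonlyArray<string | RegExp> = include
  return patterns.some((pattern) =>
    typeof pattern === 'string' ? pattern === key : pattern.test(key)
  )
}
```

For example, under these assumptions `shouldInclude('word/styles.xml', [/styles/])` keeps the styles part, while the default setting keeps every file except word/document.xml.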
withoutMedia?
boolean
Whether to exclude media from the VFile.
By default, images are included on the data.media attribute of the VFile as an object of Blobs, which are accessible both client-side and server-side.
Default
false
Defined in: lib/docx-to-vfile-unzipit.ts:15
OptionsWithFetchConfig
Hierarchy
Properties
fetchConfig?
RequestInit
The config to pass to fetch, for e.g. authorization headers.
Defined in: lib/docx-to-vfile-unzipit.ts:36
include?
string[] | RegExp[] | (key: string) => boolean | "all" | "allWithDocumentXML"
Include only the specified files on the data attribute of the VFile.
This may be useful if you want to only do something with a subset of the files in the docx file, and don't intend to use 'reoff-stringify' to turn the VFile back into a docx file.
- If an array of strings or regexps is passed, only files that match one of the values will be included.
- If a function is passed, it will be called for each file and should return true to include the file.
- If the value is 'all', almost all files will be included, except for 'word/document.xml', as that already is the root of the VFile.
- If the value is 'allWithDocumentXML', all files will be included, including word/document.xml, even though that is already the root of the VFile. Useful if you really want to mimic the original docx file.
You should keep it at the default value if you intend to use 'reoff-stringify' to turn the VFile back into a docx file.
Default
'all'
Inherited from: Options.include
Defined in: lib/docx-to-vfile-unzipit.ts:29
withoutMedia?
boolean
Whether to exclude media from the VFile.
By default, images are included on the data.media attribute of the VFile as an object of Blobs, which are accessible both client-side and server-side.
Default
false
Inherited from: Options.withoutMedia
Defined in: lib/docx-to-vfile-unzipit.ts:15
XMLOrRelsString
${string}.xml | ${string}.rels
Defined in: lib/docx-to-vfile-unzipit.ts:82
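A template literal type like this pairs naturally with a runtime type guard when you want to separate XML parts from other entries. The sketch below redeclares the type locally and adds a hypothetical guard, `isXMLOrRelsString`, which docx-to-vfile does not export. The sample keys are made up for illustration.

```typescript
// Same template literal type as documented above (requires TypeScript 4.1+).
type XMLOrRelsString = `${string}.xml` | `${string}.rels`

// Hypothetical type guard, not exported by docx-to-vfile.
function isXMLOrRelsString(key: string): key is XMLOrRelsString {
  return key.endsWith('.xml') || key.endsWith('.rels')
}

// Illustrative keys only; real archives will contain different entries.
const keys = ['word/document.xml', 'word/_rels/document.xml.rels', 'media/image1.png']

// The guard narrows the filtered array to XMLOrRelsString[].
const xmlKeys: XMLOrRelsString[] = keys.filter(isXMLOrRelsString)
```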
Compatibility
Security
docx-to-vfile currently does not read macros, so it is not vulnerable to potential security issues with macros.
However, it does not do any other security checks, so it is possible that maliciously crafted docx files could cause problems when e.g. parsed with rehype.
Related
reoff-parse: Parse the output of docx-to-vfile into a VFile with an ooxast tree.
Contribute
License
GPL-3.0-or-later © Thomas F. K. Jorna