A simple tool to extract text from Open XML format (.docx/.xlsx/.pptx). The tool is meant to be used in combination with git to facilitate the version control of Open XML files.
oxmltotxt is a Python package that helps extract text (including VBA!) from Open XML documents (see here for more details on Microsoft's Open XML format). The idea is to provide a flexible, fast and efficient way to extract information from e.g. MS Excel files, so that they can be used together with git to enable version control of Open XML documents.
Just install this package (ideally in an isolated virtual environment) from PyPI:
pip install oxmltotxt
This is enough to use the text extraction feature of this package (if you want to try it out without git, just head over to this section). To use it together with git, however, you will have to make some changes to your local (or global, depending on your use case) git config settings. Don't worry, it should only take a few minutes to set this up.
To use this package together with git, you will have to use git's textconv tool. If you have experience in setting this up, you can skip this section.
This setup is OS and/or environment (shell) specific. The following section describes the setup needed to use the package together with git bash on Windows. If you want to use e.g. PowerShell, you will have to adjust these steps.
Copy the bash shell script oxml2txt contained in this repo (link) to a location on your git bash search path.
To make sure that these changes take effect any time you open a new bash shell, you can e.g. extend your ~/.bashrc by adding the following lines to its end:
export PATH=$PATH:/path/to/oxm2txt_script
export GIT_OXMLTOTXT=/path/to/oxmltotxt/venv/activators
The second export is only needed if you have installed oxmltotxt into a virtual environment (which I recommend). This variable is used to activate the virtual environment when the script oxml2txt is called. It therefore needs to point to the directory where the activation scripts of your virtual env are located. On Windows this is usually /path/to/your/venv/Scripts. There you will find three activation scripts: one each for bash, DOS batch and PowerShell.
If you have installed the package in your global python environment, you can ignore this variable. To test your setup execute:
which oxml2txt
This should return the path to the location where the oxml2txt script is installed.
Create a git repo containing your Open XML document (e.g. an MS Excel file with an extension like *.xlsx).
Edit your git config (located in the repo created in the previous step at .git/config), so that it will contain:
[diff "msoffice"]
textconv=oxml2txt <t1> <t2>
This tells git to use oxml2txt (a tiny bash wrapper script used in the bash context to call oxmltotxt) as "the text converter" for all documents of type msoffice. Optional tags (<t1>, <t2>, ...) can be provided as needed (see here for use cases for <tags>).
Create / edit your .gitattributes within the repo, so that it will contain e.g.:
*.xlsx diff=msoffice
*.xlsm diff=msoffice
This is yet another mapping, telling git to apply the text converter assigned to msoffice documents to all files ending in .xlsx or .xlsm. Add other extensions as needed. Take a look here to find out how to apply these changes system-wide.
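If you want to cover more extensions or switch tag filters often, both config fragments can be generated rather than typed. The following is a minimal sketch under stated assumptions: the helper names gitattributes_lines and diff_section are hypothetical and not part of oxmltotxt, and the functions only produce text in the form shown above. Writing that text into .git/config and .gitattributes is left to you.

```python
def gitattributes_lines(extensions, driver="msoffice"):
    """Return one .gitattributes line per extension, e.g. '*.xlsx diff=msoffice'."""
    lines = []
    for ext in extensions:
        ext = ext.lstrip("*.")  # accept 'xlsx', '.xlsx' and '*.xlsx' alike
        lines.append(f"*.{ext} diff={driver}")
    return lines


def diff_section(tags=(), driver="msoffice"):
    """Return the [diff "..."] section that registers oxml2txt as text converter."""
    command = " ".join(["oxml2txt", *tags])
    return f'[diff "{driver}"]\ntextconv={command}\n'


if __name__ == "__main__":
    print(diff_section(["f"]), end="")
    print("\n".join(gitattributes_lines(["xlsx", "xlsm", "docx", "pptx"])))
```

The tag list is optional; calling diff_section() without arguments yields the unfiltered converter line.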
Once you have installed the package and possibly activated the virtual environment, you can start using the package.
A first, simple use case is to extract content from any Open XML file by issuing:
python -m oxmltotxt.oxmltotxt /path/to/your/OXMLFile
When executing this command, you will get the xml content of e.g. your XL file as nicely formatted text output. This should look similar to:
<!---- Start of file [Content_Types].xml ----!>
<?xml version="1.0" encoding="utf-8"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Default ContentType="application/vnd.openxmlformats-package.relationships+xml" Extension="rels"/>
<Default ContentType="application/xml" Extension="xml"/>
...
Notice the XML comment <!---- Start of file [Content_Types].xml ----!>, which indicates the relevant file within the zip file structure.
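To make that structure concrete: an Open XML document is a zip archive of XML parts. The following is a minimal sketch, not the package's actual implementation. It uses only the standard library (zipfile and xml.dom.minidom), the helper name dump_package is hypothetical, and the tiny in-memory archive merely stands in for a real workbook.

```python
import io
import zipfile
from xml.dom import minidom

# A shortened stand-in for the [Content_Types].xml part of a real document.
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default ContentType="application/xml" Extension="xml"/>'
    "</Types>"
)


def dump_package(data: bytes) -> str:
    """Pretty-print every XML member of an Open XML package held in memory."""
    chunks = []
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        for name in package.namelist():
            if not name.endswith((".xml", ".rels")):
                continue  # skip binary parts such as images or vbaProject.bin
            pretty = minidom.parseString(package.read(name)).toprettyxml(indent=" ")
            chunks.append(f"<!---- Start of file {name} ----!>\n{pretty}")
    return "\n".join(chunks)


# Build a tiny package so the sketch runs without a real workbook.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("[Content_Types].xml", CONTENT_TYPES)

print(dump_package(buffer.getvalue()))
```

To try it on a real file, pass the bytes of e.g. an .xlsx file to dump_package instead of the in-memory buffer.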
You can add tags to this command:
python -m oxmltotxt.oxmltotxt <t1> <t2> ... /path/to/your/OXMLFile
The tags <t1>, <t2>, ... will be used internally by beautifulsoup4 to filter all tags relevant to your use case contained in each of the xml files within the zip file.
Formulas within XL files are tagged with f, as in <f>some formula</f>, so if you try
python -m oxmltotxt.oxmltotxt f /path/to/your/OXMLFile
on an XL workbook containing only one formula, you will get something along the lines of:
<!---- Start of file xl/worksheets/sheet1.xml ----!>
<f>
A4+5
</f>
<!---- End of file xl/worksheets/sheet1.xml ----!>
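The filtering idea can be illustrated without the package. The following is a minimal sketch, assuming a hand-written sheet fragment; it deliberately swaps beautifulsoup4 for the standard library's xml.etree.ElementTree, and the names SHEET_XML, local_name and find_tag_text are hypothetical, so it is not how oxmltotxt itself is implemented.

```python
import xml.etree.ElementTree as ET

# A hand-written worksheet fragment with a single formula in cell A1.
SHEET_XML = """\
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData>
    <row r="1">
      <c r="A1"><f>A4+5</f><v>5</v></c>
    </row>
  </sheetData>
</worksheet>
"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def find_tag_text(xml_text: str, wanted: str) -> list[str]:
    """Return the text of every element whose local name equals `wanted`."""
    root = ET.fromstring(xml_text)
    return [el.text or "" for el in root.iter() if local_name(el.tag) == wanted]


print(find_tag_text(SHEET_XML, "f"))  # ['A4+5']
```

Matching on the local name keeps the filter independent of the namespace prefix a given file happens to use.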
Now you are ready to use git diff on Open XML documents. A simple example is to store two empty XL files, say workbook1.xlsx and workbook2.xlsx. Just make sure the files are not exact byte-for-byte copies (e.g. by opening the copy under its new name and saving it again). Then do the following:
git diff --no-index workbook1.xlsx workbook2.xlsx
which will give an answer similar to this:
diff --git a/workbook1.xlsx b/workbook2.xlsx
index c54f836..017e614 100644
--- a/workbook1.xlsx
+++ b/workbook2.xlsx
@@ -34,7 +34,7 @@ Falling back to utf-8 decoding ...
</mc:Choice>
</mc:AlternateContent>
- <xr:revisionPtr documentId="13_ncr:1_{f876a870-98ec-4e82-91d5-47803a4d11c8}" revIDLastSave="0" xr10:uidLastSave="{00000000-0000-0000-0000-000000000000}" xr6:coauthVersionLast="47" xr6:coauthVersionMax="47"/>
+ <xr:revisionPtr documentId="13_ncr:1_{d0df556a-f8ad-4a22-b63e-7039adbe7d3d}" revIDLastSave="0" xr10:uidLastSave="{00000000-0000-0000-0000-000000000000}" xr6:coauthVersionLast="47" xr6:coauthVersionMax="47"/>
<bookViews>
<workbookView activeTab="1" windowHeight="11040" windowWidth="20730" xWindow="-120" xr2:uid="{2d46327d-59ce-41b3-b533-5578442c95a1}" yWindow="-120"/>
</bookViews>
@@ -568,7 +568,7 @@ Falling back to utf-8 decoding ...
2021-11-04T11:20:51Z
</dcterms:created>
<dcterms:modified xsi:type="dcterms:W3CDTF">
- 2022-07-30T05:34:51Z
+ 2022-07-30T05:34:55Z
</dcterms:modified>
</cp:coreProperties>
<!---- End of file docProps/core.xml ----!>
As you can see, although workbook2.xlsx is just a resaved copy of workbook1.xlsx without any further changes, both files differ, and oxmltotxt together with git will tell you exactly where they differ:
documentId within the tag <xr:revisionPtr ...> is updated with a new UUID.
<dcterms:modified ...> is updated with the date and time of the latest save.
Depending on your use case, the last save date or the color of a cell might be of limited interest to you. It may even break your version control workflow, because you don't consider this update a new version of the file.
In its current version, oxmltotxt allows you to filter by a set of relevant tags to look out for changes (see here).
To use the tag feature together with git, one needs to update a line in the .git/config from
[diff "msoffice"]
textconv=oxml2txt
to e.g.
[diff "msoffice"]
textconv=oxml2txt f
where the f is interpreted as "find all f tags" (anything between <f>..</f>), which is how XL stores formulas.
Taking our workbook1 / workbook2 example from above, a simple formula update with the <f> tag filter enabled would look like this with git diff:
diff --git a/workbook1.xlsx b/workbook2.xlsx
index 5e22d8e..5a4ea5d 100644
--- a/workbook1.xlsx
+++ b/workbook2.xlsx
@@ -13,7 +13,7 @@ Falling back to utf-8 decoding ...
<!---- End of file xl/_rels/workbook.xml.rels ----!>^M
<!---- Start of file xl/worksheets/sheet1.xml ----!>^M
<f>
- A1+1
+ A1+2^M
</f>
^M
<!---- End of file xl/worksheets/sheet1.xml ----!>^M
As you can see, the <tag>-feature might be able to help you focus on changes that are relevant to your use case.
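Conceptually, git runs the text converter on both versions of the file and then compares the resulting text line by line. The following is a minimal sketch of that second step, assuming two hand-written lists that stand in for the filtered converter output; it uses the standard library's difflib, which is not git's own diff implementation.

```python
import difflib

# Stand-ins for the filtered text of both workbook versions.
before = ["<f>", " A1+1", "</f>"]
after = ["<f>", " A1+2", "</f>"]

diff = list(
    difflib.unified_diff(
        before,
        after,
        fromfile="a/workbook1.xlsx",
        tofile="b/workbook2.xlsx",
        lineterm="",
    )
)
print("\n".join(diff))
```

Only the formula line shows up as removed and added; everything the tag filter dropped never reaches the comparison.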
XL workbooks containing VBA code end in .xlsm instead of .xlsx.
Using our example workbooks from the previous section, one has to create / rename two *.xlsm files. Here I only changed the output of a simple VBA Debug.Print statement.
diff --git a/workbook1.xlsm b/workbook2.xlsm
index 5676e81..8bf0f12 100644
--- a/workbook1.xlsm
+++ b/workbook2.xlsm
@@ -27,10 +27,9 @@ Falling back to utf-8 decoding ...
Attribute VB_Name = "Modul1"^M
^M
Public Function hello_world()^M
- Debug.Print "Hello world"^M
+ Debug.Print "Hello world2"^M
End Function^M
^M
- ^M
<!--- END xl/vbaProject.bin Modul1 Modul1 --->^M
^M
<!---- End of file VBACode ----!>^M
The next example shows how one could use oxmltotxt to understand how information is stored within an Open XML document (make sure that no tag filters are active).
If, for example, you only change the color of cell A1 to yellow in workbook2.xlsx and save again, you'll get something along the lines of (excerpt):
diff --git a/workbook1.xlsx b/workbook2.xlsx
index b5a635d..50c7bdd 100644
--- a/workbook1.xlsx
+++ b/workbook2.xlsx
...
<sheetFormatPr baseColWidth="10" defaultRowHeight="15" x14ac:dyDescent="0.25"/>
- <sheetData/>
+ <sheetData>^M
+ <row r="1" spans="1:1" x14ac:dyDescent="0.25">^M
+ <c r="A1" s="1"/>^M
+ </row>^M
+ </sheetData>^M
<pageMargins bottom="0.78740157499999996" footer="0.3" header="0.3" left="0.7" right="0.7" top="0.78740157499999996"/>
</worksheet>^M
<!---- End of file xl/worksheets/sheet1.xml ----!>^M
@@ -383,13 +387,19 @@ Falling back to utf-8 decoding ...
<scheme val="minor"/>
<scheme val="minor"/>
</font>
</fonts>
- <fills count="2">
+ <fills count="3">^M
<fill>
<patternFill patternType="none"/>
</fill>
<fill>
<patternFill patternType="gray125"/>
</fill>
+ <fill>^M
+ <patternFill patternType="solid">^M
+ <fgColor rgb="FFFFFF00"/>^M
+ <bgColor indexed="64"/>^M
+ </patternFill>^M
+ </fill>^M
</fills>
<borders count="1">
<border>
@@ -403,8 +413,9 @@ Falling back to utf-8 decoding ...
<cellStyleXfs count="1">
<xf borderId="0" fillId="0" fontId="0" numFmtId="0"/>
</cellStyleXfs>
- <cellXfs count="1">
+ <cellXfs count="2">^M
<xf borderId="0" fillId="0" fontId="0" numFmtId="0" xfId="0"/>
+ <xf applyFill="1" borderId="0" fillId="2" fontId="0" numFmtId="0" xfId="0"/>^M
</cellXfs>
...
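The three hunks above are linked by indices: the cell's s="1" points to the second entry of cellXfs, whose fillId="2" points to the third entry of fills, which carries the color FFFFFF00 (ARGB notation, i.e. opaque yellow). The following is a minimal sketch of that lookup, assuming a shortened styles fragment assembled from the diff; the names STYLES_XML and fill_color are hypothetical, and only the standard library is used.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

# Shortened styles part, reduced to the elements touched by the diff above.
STYLES_XML = """\
<styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <fills count="3">
    <fill><patternFill patternType="none"/></fill>
    <fill><patternFill patternType="gray125"/></fill>
    <fill>
      <patternFill patternType="solid">
        <fgColor rgb="FFFFFF00"/>
        <bgColor indexed="64"/>
      </patternFill>
    </fill>
  </fills>
  <cellXfs count="2">
    <xf borderId="0" fillId="0" fontId="0" numFmtId="0" xfId="0"/>
    <xf applyFill="1" borderId="0" fillId="2" fontId="0" numFmtId="0" xfId="0"/>
  </cellXfs>
</styleSheet>
"""


def fill_color(styles_xml: str, style_index: int):
    """Follow cell style index -> cellXfs entry -> fills entry -> fgColor."""
    root = ET.fromstring(styles_xml)
    xf = root.findall("m:cellXfs/m:xf", NS)[style_index]
    fill = root.findall("m:fills/m:fill", NS)[int(xf.get("fillId"))]
    color = fill.find("m:patternFill/m:fgColor", NS)
    return None if color is None else color.get("rgb")


print(fill_color(STYLES_XML, 1))  # FFFFFF00
```

Style index 0 resolves to the default fill without a foreground color, so fill_color returns None for it.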
The source code is currently hosted on GitHub at: https://github.com/mdriesch/oxmltotxt
Binary installers for the latest released version are available at the Python Package Index (PyPI).
There are various tools around that try to do something similar. I would like to give credit specifically to odt2txt, which inspired my approach for this package.