
Product
Secure Your AI-Generated Code with Socket MCP
Socket MCP brings real-time security checks to AI-generated code, helping developers catch risky dependencies before they enter the codebase.
Compares two XML documents by diffing their text, ignoring structure, and wraps changed text in <del>/<ins> tags.
Compares the text inside two XML documents and marks up the differences
with <del>
and <ins>
tags.
This is the result of about 7 years of trying to get this right and coded simply. I've used code like this in one form or another to compare bill text on GovTrack.us <https://www.govtrack.us>.
The comparison is completely blind to the structure of the two XML
documents. It does a word-by-word comparison on the text content only,
and then it goes back into the original documents and wraps changed text
in new <del>
and <ins>
wrapper elements.
The documents are then concatenated to form a new document and the new
document is printed on standard output. Or use this as a library and
call compare
yourself with two lxml.etree.Element
nodes (the roots
of your documents).
The script is written in Python 3.
Comparing these two documents:
<html>
Here is <b>some bold</b> text.
</html>
and:
<html>
Here is <i>some italic</i> content that shows how <tt>xml_diff</tt> works.
</html>
Yields:
<documents>
<html>
Here is <b>some <del>bold</del></b><del> text</del>.
</html>
<html>
Here is <i>some <ins>italic</ins></i><ins> content that shows how </ins><tt><ins>xml_diff</ins></tt><ins> works</ins>.
</html>
</documents>
On Ubuntu, get dependencies with:
apt-get install python3-lxml libxml2-dev libxslt1-dev
For really fast comparisons, get Google's Diff Match Patch library <https://code.google.com/p/google-diff-match-patch/>, as re-written and sped-up by @leutloff <https://github.com/leutloff/diff-match-patch-cpp-stl> and then turned into a Python extension module by me <https://github.com/JoshData/diff_match_patch-python>:
pip3 install diff_match_patch_python
Or if you can't install that for any reason, use the pure-Python library:
pip3 install diff-match-patch
This is also at <https://code.google.com/p/google-diff-match-patch/source/browse/trunk/python3/diff_match_patch.py>. xml_diff will use whichever is installed.
Finally, install this module:
pip3 install xml_diff
Then call the module from the command line:
python3 -m xml_diff --tags del,ins doc1.xml doc2.xml > changes.xml
Or use the module from Python:
import lxml.etree
from xml_diff import compare
dom1 = lxml.etree.parse("doc1.xml").getroot()
dom2 = lxml.etree.parse("doc2.xml").getroot()
comparison = compare(dom1, dom2)
The two DOMs are modified in-place.
The compare
function takes other optional keyword arguments:
merge
is a boolean (default false) that indicates whether the
comparison function should perform a merge. If true, dom1
will contain
not just <del>
nodes but also <ins>
nodes and, similarly, dom2
will contain not just <ins>
nodes but also <del>
nodes. Although the
two DOMs will now contain the same semantic information about changes,
and the same text content, each preserves their original structure ---
since the comparison is only over text and not structure. The new
ins
/del
nodes contain content from the other document (including
whole subtrees), and so there's no guarantee that the final documents
will conform to any particular structural schema after this operation.
word_separator_regex
(default r"\s+|[^\s\w]"
) is a regular
expression for how to separate words. The default splits on one or more
spaces in a row and single instances of non-word characters.
differ
is a function that takes two arguments (text1, text2)
and
returns an iterator over difference operations given as tuples of the
form (operation, text_length)
, where operation
is one of "="
(no
change in text), "+"
(text inserted into text2
), or "-"
(text
deleted from text1
). (See xml_diff/__init__.py's default_differ
function for how the default differ works.)
tags
is a two-tuple of tag names to use for deleted and inserted
content. The default is ('del', 'ins')
.
make_tag_func
is a function that takes one argument, which is either
"ins"
or "del"
, and returns a new lxml.etree.Element
to be
inserted into the DOM to wrap changed content. If given, the tags
argument is ignored.
This project is free of any copyright restrictions per the [Unlicense](https://unlicense.org/). (Prior to May 3, 2025, the project was made available under the terms of the [CC0 1.0 Universal public domain dedication](http://creativecommons.org/publicdomain/zero/1.0/).) See [LICENSE](LICENSE) and [CONTRIBUTING.md](CONTRIBUTING.md).
To release:
setup.py
.source venv/bin/activate
pip3 install --upgrade build twine
rm -rf dist
python3 -m build
twine upload -u __token__ dist/*
FAQs
Compares two XML documents by diffing their text, ignoring structure, and wraps changed text in <del>/<ins> tags.
We found that xml-diff demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Product
Socket MCP brings real-time security checks to AI-generated code, helping developers catch risky dependencies before they enter the codebase.
Security News
As vulnerability data bottlenecks grow, the federal government is formally investigating NIST’s handling of the National Vulnerability Database.
Research
Security News
Socket’s Threat Research Team has uncovered 60 npm packages using post-install scripts to silently exfiltrate hostnames, IP addresses, DNS servers, and user directories to a Discord-controlled endpoint.