
Word Document Comparison Guide: Diff .docx Files Without Losing Your Mind


Comparing two Word documents sounds straightforward until you try it. Open two versions of a contract, paste them into a text diff tool, and you get hundreds of lines of noise — XML tags, style IDs, relationship hashes — with the actual text changes buried somewhere in the middle. The Word Document Comparison tool on DevToys handles this by extracting plain text before diffing, giving you clean, readable results. This guide explains why Word comparison is inherently tricky and what approaches actually work.

Why Word Diffs Are Hard

The root problem is that .docx is not a text file. It is a ZIP archive containing a collection of XML files, images, fonts, and metadata. When you change a single word in a paragraph, Word may rewrite multiple XML nodes, reorder attributes, update revision counters, regenerate relationship IDs, and modify the document properties — all for what looks like a one-word edit to the reader.

Running diff or any line-oriented text comparison tool directly against a .docx produces binary garbage, because ZIP files are compressed binary data. Even if you unzip first and diff the XML, you are comparing serialization artifacts as much as content. Non-deterministic XML output — where attribute order or whitespace varies between Word versions or platforms — generates false positives on every save.

What Is Inside a .docx

Rename any .docx file to .zip and unzip it. You will find a directory tree like this:

myfile.docx (unzipped)
├── [Content_Types].xml
├── _rels/
│   └── .rels
├── word/
│   ├── document.xml      the actual body text
│   ├── styles.xml        paragraph and character styles
│   ├── settings.xml      document settings and revision info
│   ├── comments.xml      reviewer comments
│   ├── footnotes.xml
│   ├── endnotes.xml
│   ├── theme/
│   │   └── theme1.xml
│   └── media/
│       └── image1.png    embedded images
└── docProps/
    ├── app.xml           application metadata
    └── core.xml          author, created date, revision count
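The same listing can be produced without renaming anything. The following is a minimal sketch using only Python's standard library; the helper name list_parts is an illustrative choice, not part of any tool mentioned here.

```python
import zipfile

def list_parts(docx_file):
    """Return the names of every part stored in a .docx archive, sorted."""
    # A .docx is an ordinary ZIP container, so zipfile opens it as-is;
    # docx_file may be a path or a binary file-like object.
    with zipfile.ZipFile(docx_file) as archive:
        return sorted(archive.namelist())

# Usage (file name is a placeholder):
#   for name in list_parts("myfile.docx"):
#       print(name)
```

The exact set of parts varies by document: comments.xml, footnotes.xml, and word/media/ only appear when the document actually uses those features.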

The body of the document lives in word/document.xml. Each paragraph is a <w:p> element, runs of text are <w:r> elements, and the actual characters are inside <w:t> tags. Formatting — bold, italic, font size — is stored as child elements of each run, not as surrounding markup. A single bolded word generates a separate <w:r> run with a <w:rPr><w:b/></w:rPr> block, splitting what looks like one sentence into multiple XML nodes.

Most comparison tools normalize this by extracting only the text content from <w:t> nodes before diffing. Styles, comments, and media are typically ignored unless you specifically need to compare them.
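As a rough illustration of that normalization step, the sketch below reads word/document.xml with Python's standard library and rejoins the <w:t> text of each paragraph. The function name paragraph_texts is hypothetical, and the sketch makes simplifying assumptions: it ignores tabs and line breaks stored as separate elements, and a paragraph nested inside a text box would have its text counted under its parent paragraph as well.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace bound to the w: prefix in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def paragraph_texts(docx_file):
    """Return one string per non-empty <w:p>, joining the text of its <w:t> nodes."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # Runs split apart by formatting are rejoined here, so a bolded
        # word no longer breaks the sentence into separate pieces.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs
```

Because root.iter walks the whole tree, paragraphs inside table cells are included in document order.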

Three Levels of Comparison

There are three ways to diff two Word documents, each with different tradeoffs:

| Level | What Is Compared | Result |
| --- | --- | --- |
| Byte-level | Raw binary content of the ZIP | Useless: almost always different, even on identical text |
| Structural XML | Unzipped XML nodes | Noisy: formatting splits, attribute reordering, and revision counters create false positives |
| Semantic text | Extracted plain text from <w:t> nodes | Clean: shows only actual content changes |

Semantic text comparison is what users want in almost every case. The Word Document Comparison tool extracts text from both documents and runs a line-by-line diff, highlighting additions, deletions, and unchanged sections in readable form.
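To make the difference between the three levels concrete, here is a small sketch that checks two files at each level and reports which ones agree. It is an illustration under simplifying assumptions, not how any particular tool works: compare_levels and its helpers are hypothetical names, only word/document.xml is inspected at the XML level, and the text level simply joins <w:t> nodes per paragraph.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def _document_xml(docx_bytes):
    # Pull the raw body XML out of the ZIP container.
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.read("word/document.xml")

def _plain_text(document_xml):
    # One string per paragraph, with run boundaries and formatting discarded.
    root = ET.fromstring(document_xml)
    return [
        "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        for p in root.iter(f"{{{W_NS}}}p")
    ]

def compare_levels(a_bytes, b_bytes):
    """Report whether two .docx files match at each of the three levels."""
    xml_a, xml_b = _document_xml(a_bytes), _document_xml(b_bytes)
    return {
        "byte": a_bytes == b_bytes,                        # raw ZIP bytes
        "xml": xml_a == xml_b,                             # serialized body XML
        "text": _plain_text(xml_a) == _plain_text(xml_b),  # extracted text
    }
```

Two versions that differ only in how a sentence is split into runs would typically report a mismatch at the byte and XML levels while matching at the text level, which is the case the table describes as a false positive.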

Use Cases

Document comparison comes up in several distinct workflows, each with slightly different requirements:

  • Legal contracts and redlines: Lawyers review every revision to a contract to ensure no clause was silently changed. Traditional redlining (showing deletions in red strikethrough and additions in underline) is the standard deliverable. Comparing the versions lets counsel confirm what opposing counsel actually changed between drafts.
  • Technical specifications: Engineering teams maintain spec documents that evolve over months. Comparing versions before a release sign-off catches requirement drift — cases where a requirement was quietly removed or reworded to weaken a commitment.
  • Academic collaboration: Advisors and students exchange manuscript drafts. Comparing versions shows exactly which paragraphs were revised, which citations were added, and where the argument structure changed — without having to rely on Track Changes being enabled.
  • Track Changes audits: When a document has accumulated many rounds of tracked changes from multiple reviewers, it can be faster to accept all changes and compare the clean version against the original than to manually review each tracked change.

Word's Built-In Compare

Microsoft Word has a built-in comparison feature under Review > Compare. It opens both documents and generates a third document showing tracked changes. For everyday use this works well, but it has notable limitations:

  • Table comparison: Word's Compare struggles with structural table changes. Adding or removing rows often produces confusing output rather than clean row-level diffs.
  • Attribution: The comparison document attributes all changes to a single reviewer, losing the original per-author attribution from the source documents.
  • Complex formatting: Heavy use of content controls, form fields, and nested tables can confuse the comparison engine.
  • No scripting: Word's Compare has no command-line option. Scripting it means COM automation on Windows, or LibreOffice macros on other platforms (which run LibreOffice's comparison, not Word's).

The .doc Problem

The legacy .doc format (used by Word 97–2003) is a binary format built on the Compound File Binary (OLE2) container, also known as the Compound Document File Format, with no XML inside. There is no practical way to diff .doc files directly. The only reliable path is to convert both files to .docx or plain text first:

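# Convert .doc to .docx using LibreOffice (headless); this writes old.docx and
# new.docx to the current directory. On some installs the binary is named soffice.
libreoffice --headless --convert-to docx old.doc new.doc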

# Or convert both to plain text with antiword
antiword old.doc > old.txt
antiword new.doc > new.txt
diff old.txt new.txt

LibreOffice conversion is generally reliable for text content but may alter formatting. For critical legal documents, do the conversion in Word itself so the formatting is preserved.
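File extensions are not always trustworthy, so a batch workflow may want to check which container a file really uses before deciding whether conversion is needed. The sketch below inspects the leading bytes; sniff_container is a hypothetical helper, and the signatures only identify the container (ZIP or Compound File Binary), not that the content is specifically a Word document.

```python
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header (.docx, but also .xlsx, .pptx)
CFB_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # Compound File Binary header (.doc, but also .xls, .ppt)

def sniff_container(data):
    """Classify a file by its leading bytes: 'zip', 'cfb', or 'unknown'."""
    if data.startswith(ZIP_MAGIC):
        return "zip"      # modern Office Open XML candidate
    if data.startswith(CFB_MAGIC):
        return "cfb"      # legacy binary format, convert before diffing
    return "unknown"

# Usage (file name is a placeholder):
#   with open("old.doc", "rb") as fh:
#       kind = sniff_container(fh.read(8))
```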

Alternative Workflows

When Word comparison or a purpose-built tool is not available, two common workarounds work well:

Convert to Markdown via pandoc: Pandoc converts .docx to Markdown with good fidelity for most prose documents. Markdown is plain text, so any diff tool works cleanly. This is especially useful when the documents are prose-heavy and formatting details do not matter.

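# --wrap=none stops pandoc from re-wrapping paragraphs, so a one-word edit
# does not reflow the surrounding lines and inflate the diff
pandoc v1.docx --wrap=none -o v1.md
pandoc v2.docx --wrap=none -o v2.md
diff v1.md v2.md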

Convert to PDF and use PDF diff: If visual fidelity matters more than text-level precision, export both documents to PDF from Word, which keeps the layout as Word renders it, and use a PDF comparison tool. This is the preferred approach for design-heavy documents like brochures where pixel-level layout changes matter.

See the Text Diff Guide for more on line-oriented and character-level diffing once you have plain text.

Programmatic Comparison

For batch processing or integration into CI pipelines, you can extract text programmatically and diff it with standard libraries.

Python with python-docx and difflib:

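from docx import Document
import difflib

def extract_text(path: str) -> list[str]:
    """Return the text of each non-empty body paragraph."""
    # doc.paragraphs covers top-level body paragraphs only; text inside
    # tables, headers, and footers is not included.
    doc = Document(path)
    return [para.text for para in doc.paragraphs if para.text.strip()]

v1_lines = extract_text("v1.docx")
v2_lines = extract_text("v2.docx")

# Each paragraph is treated as one "line"; n=2 keeps two paragraphs of context.
diff = difflib.unified_diff(
    v1_lines, v2_lines, fromfile="v1.docx", tofile="v2.docx", lineterm="", n=2
)
print("\n".join(diff))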
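A paragraph-level diff marks an entire paragraph as changed even when only one word differs. One possible refinement, sketched here with difflib.SequenceMatcher, is a second pass that compares a changed paragraph word by word. The function name word_changes is hypothetical, and splitting on whitespace is a simplifying assumption that treats punctuation as part of the adjacent word.

```python
import difflib

def word_changes(old_paragraph, new_paragraph):
    """Return (op, old_words, new_words) for each span that differs."""
    old_words, new_words = old_paragraph.split(), new_paragraph.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        # op is one of 'replace', 'delete', 'insert', or 'equal'.
        if op != "equal":
            changes.append((op, old_words[i1:i2], new_words[j1:j2]))
    return changes

# Usage: word_changes("The fee is 100 dollars", "The fee is 120 dollars")
```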

JavaScript with mammoth.js and diff-match-patch:

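import { readFile } from 'node:fs/promises';
import mammoth from 'mammoth';
import { diff_match_patch } from 'diff-match-patch';

// Extract plain text from a .docx held in a Node.js Buffer.
async function extractText(buffer) {
  const result = await mammoth.extractRawText({ buffer });
  return result.value;
}

// Read both versions from disk (file names are placeholders).
// Top-level await requires running this file as an ES module.
const [v1Buffer, v2Buffer] = await Promise.all([
  readFile('v1.docx'),
  readFile('v2.docx'),
]);

const [text1, text2] = await Promise.all([
  extractText(v1Buffer),
  extractText(v2Buffer),
]);

const dmp = new diff_match_patch();
const diffs = dmp.diff_main(text1, text2);
// Merge tiny coincidental matches so the output reads as whole-word edits.
dmp.diff_cleanupSemantic(diffs);

// diffs is an array of [operation, text] tuples
// operation: -1 = delete, 0 = equal, 1 = insert
for (const [op, text] of diffs) {
  if (op === 1) console.log("+ " + text);
  if (op === -1) console.log("- " + text);
}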

Common Pitfalls

  • Track Changes ghosts: If a document has unaccepted tracked changes, the XML contains both the old and new text simultaneously inside <w:ins> and <w:del> elements. Text extraction libraries differ on which version they surface. Always accept or reject all tracked changes before comparing, unless your tool explicitly handles this case.
  • Non-deterministic XML output: Saving the same document in different versions of Word or on different platforms (Windows vs. macOS) can produce different XML even if the content is identical. Attribute order, namespace declarations, and whitespace inside text nodes can all vary. This is another reason semantic text comparison beats XML-level comparison.
  • Embedded images: Images are stored as binary blobs in the word/media/ folder. Text extraction ignores them entirely. If image changes matter — for example, updated diagrams in a specification — you need to hash and compare the image files separately, or use a visual diff approach.
  • Whitespace normalization: Word sometimes stores a single space as <w:t xml:space="preserve"> </w:t>. Different extraction implementations handle this differently, leading to spurious whitespace differences. A good extraction step trims and normalizes whitespace before diffing.
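Three of these pitfalls can be handled in a small pre-processing layer. The sketch below is one possible approach using only the standard library, with hypothetical helper names: paragraph_views returns both the original and the revised reading of each paragraph when tracked changes are present, normalize_whitespace collapses spacing differences, and media_digests hashes embedded images. It assumes insertions are wrapped in <w:ins> and deleted text is stored in <w:delText>; tracked moves and formatting-only revisions are not handled, and nested paragraphs (for example in text boxes) would be counted twice.

```python
import hashlib
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def _walk(node, inside_ins, original, revised):
    for child in node:
        if child.tag == W + "ins":
            _walk(child, True, original, revised)
        elif child.tag == W + "t":
            # Live text is always in the revised view; it is in the
            # original view only if it was not a tracked insertion.
            revised.append(child.text or "")
            if not inside_ins:
                original.append(child.text or "")
        elif child.tag == W + "delText":
            # Deleted text exists only in the original view, unless it was
            # itself inserted during review, in which case it is in neither.
            if not inside_ins:
                original.append(child.text or "")
        else:
            _walk(child, inside_ins, original, revised)

def paragraph_views(document_xml):
    """Return an (original, revised) text pair for every <w:p>."""
    root = ET.fromstring(document_xml)
    views = []
    for paragraph in root.iter(W + "p"):
        original, revised = [], []
        _walk(paragraph, False, original, revised)
        views.append(("".join(original), "".join(revised)))
    return views

def normalize_whitespace(text):
    """Collapse every run of whitespace to one space and trim the ends."""
    return " ".join(text.split())

def media_digests(docx_file):
    """Map each part under word/media/ to the SHA-256 of its bytes."""
    with zipfile.ZipFile(docx_file) as archive:
        return {
            name: hashlib.sha256(archive.read(name)).hexdigest()
            for name in archive.namelist()
            if name.startswith("word/media/")
        }
```

Word may renumber image files between saves, so comparing the sets of digest values is usually more robust than comparing the name-to-digest mapping directly.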

For a quick browser-based comparison without installing anything, use the Word Document Comparison tool — drop in two .docx files and get a clean semantic diff instantly, with no data leaving your machine.