How to Strip HTML Tags and Get Plain Text

10 min read

Stripping HTML tags to recover clean, readable plain text is one of the most common string processing tasks in web development. You might need it for search indexing, generating text previews, feeding content into language models, or exporting data to plain-text formats. The task sounds trivial — just remove everything between < and > — but every naive approach breaks on real-world HTML in a different way.

This guide walks through the full problem: why regex alone is unreliable, how the browser DOM gives you a safer path, how to decode HTML entities, how to preserve whitespace from block elements, and how to safely discard the content of <script> and <style> blocks rather than just their tags. For quick interactive use, the Strip HTML Tags tool handles all of this without writing a single line of code.

Why Naive Regex Fails

The simplest regex approach looks like this:

const plain = html.replace(/<[^>]*>/g, "");

For tightly controlled, well-formed HTML this works fine. But HTML in the wild breaks this pattern in several ways:

  • Attributes containing >: The pattern [^>]* stops at the first >, so <img alt="a > b" src="x.png"> gets partially stripped, leaving b" src="x.png"> in the output.
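  • HTML comments: A simple comment such as <!-- note --> happens to be removed whole, but a comment that contains a > or markup, for example <!-- <b>old</b> -->, is cut at the first >, so the rest of the comment (here old -->) leaks into the output.
  • CDATA sections: In SVG and MathML embedded in HTML5, CDATA blocks look like <![CDATA[ ... ]]>. The regex treats everything from <![CDATA[ up to the first > as one tag, so the text inside is either deleted along with the markers or, if it contains angle brackets of its own, split at an arbitrary point.
  • Unclosed or malformed tags: Real-world scraped content often contains tags with a missing > (<p class="intro" followed directly by text) or a literal < in running text (a < b and c > d). In the first case the regex either skips the fragment or swallows text up to the next >; in the second it deletes the text between the two brackets as if it were a tag.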
  • Script and style content: Removing <script> and <style> tags without removing their content leaves JavaScript code and CSS rules in the plain-text output.

The conclusion is not that regex is useless — it is useful for targeted substitutions — but that a single-pass tag-stripping regex is not a reliable HTML parser.
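
To make these failure modes concrete, the sketch below runs the naive pattern over a few made-up inputs and prints what survives. The sample strings and the naiveStrip name are illustrative only.

```javascript
// Naive single-pass tag stripper, same pattern as above.
const naiveStrip = (html) => html.replace(/<[^>]*>/g, "");

// Hypothetical sample inputs, one per failure mode.
const samples = {
  attributeWithGt: '<img alt="a > b" src="x.png">',
  commentWithMarkup: "<!-- <b>old</b> -->kept",
  scriptContent: "<script>var x = 1;</script>Hi",
  literalAngleBrackets: "if a < b and c > d then stop",
};

for (const [name, html] of Object.entries(samples)) {
  // JSON.stringify makes leading spaces and quotes visible.
  console.log(name, "=>", JSON.stringify(naiveStrip(html)));
}
// attributeWithGt      => " b\" src=\"x.png\">"
// commentWithMarkup    => "old -->kept"
// scriptContent        => "var x = 1;Hi"
// literalAngleBrackets => "if a  d then stop"
```

Each result is wrong in a different way: leftover attribute text, leaked comment content, leaked source code, and deleted prose.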

The DOMParser Approach (Browser)

In a browser context, the safest way to extract plain text is to let the browser parse the HTML and then read the textContent property of the resulting DOM tree:

function htmlToText(html) {
  const doc = new DOMParser().parseFromString(html, "text/html");

  // Remove script and style elements entirely (content included)
  doc.querySelectorAll("script, style, noscript").forEach((el) => el.remove());

  return doc.body.textContent ?? "";
}

DOMParser uses the browser's own HTML5 parser, so it correctly handles malformed markup, attributes with angle brackets, and CDATA. The textContent property returns the concatenated text of all text nodes, with all tags already removed. Removing script, style, and noscript elements before reading textContent prevents their source code from appearing in the output.

Server-Side: Node.js Without a DOM

On the server (Node.js, Edge Runtime, Deno) there is no DOMParser. You have two practical options:

  • Use a lightweight HTML parser such as node-html-parser or htmlparser2, which provide a DOM-like API without a full browser environment.
  • Use a targeted regex pipeline that strips script/style blocks first, then removes tags, then decodes entities.

The regex pipeline approach is shown below. It is reliable enough for clean or semi-clean HTML when a full parser is not available:

function htmlToTextNode(html) {
  return (
    html
      // 1. Remove script blocks (tags + content)
      .replace(/<script[\s\S]*?<\/script>/gi, "")
      // 2. Remove style blocks (tags + content)
      .replace(/<style[\s\S]*?<\/style>/gi, "")
      // 3. Replace block-level line-break elements with newlines
      .replace(/<br\s*\/?>/gi, "\n")
      .replace(/<\/(p|div|h[1-6]|li|tr|blockquote|pre)>/gi, "\n")
      // 4. Strip all remaining tags
      .replace(/<[^>]*>/g, "")
      // 5. Decode common HTML entities; &amp; goes last so that
      //    "&amp;lt;" becomes "&lt;" and is not decoded a second time
      .replace(/&lt;/g, "<")
      .replace(/&gt;/g, ">")
      .replace(/&quot;/g, '"')
      .replace(/&#39;/g, "'")
      .replace(/&nbsp;/g, " ")
      .replace(/&amp;/g, "&")
      // 6. Collapse excess whitespace
      .replace(/[ \t]+/g, " ")
      .replace(/\n{3,}/g, "\n\n")
      .trim()
  );
}

The order matters: script and style blocks must be removed before the generic tag-stripping pass in step 4, otherwise step 4 strips the opening and closing tags but leaves the raw code in the output.
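
The effect of ordering is easy to check on a small input. The following sketch applies the same two substitutions in both orders; the input string and the helper names (stripBlocks, stripTags, blocksFirst, tagsFirst) are made up for this illustration.

```javascript
const input = '<p>Hi</p><script>track("page");</script><p>Bye</p>';

// Removes <script> blocks including their content.
const stripBlocks = (s) => s.replace(/<script[\s\S]*?<\/script>/gi, "");
// Removes anything that looks like a tag, leaving inner text behind.
const stripTags = (s) => s.replace(/<[^>]*>/g, "");

const blocksFirst = stripTags(stripBlocks(input));
const tagsFirst = stripBlocks(stripTags(input));

console.log(blocksFirst); // HiBye
console.log(tagsFirst);   // Hitrack("page");Bye
```

Once the generic pass has removed the script tags, the block pattern has nothing left to anchor on, so the code stays in the text.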

Decoding HTML Entities

Even after all tags are gone, HTML entities remain as literal character sequences. A headline like AT&amp;T &mdash; Q1 Results becomes AT&T — Q1 Results only after decoding. The regex pipeline above handles only five common named entities plus &#39;. For a complete solution in the browser, covering numeric entities like &#169; and the full named-entity table, use the DOMParser trick on a text fragment:

function decodeHtmlEntities(str) {
  const doc = new DOMParser().parseFromString(str, "text/html");
  return doc.documentElement.textContent;
}

// Examples:
decodeHtmlEntities("AT&amp;T");          // "AT&T"
decodeHtmlEntities("&lt;em&gt;bold&lt;/em&gt;");  // "<em>bold</em>"
decodeHtmlEntities("&#169; 2026");       // "© 2026"
decodeHtmlEntities("&mdash;");           // "—"
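
DOMParser is only available in browsers. Where it is missing, a single-pass decoder avoids the double-decoding trap of chained replace calls (for example, &amp;lt; must become &lt;, not <). The sketch below is a minimal version under stated assumptions: the decodeEntities name and the tiny NAMED table are illustrative, and a real project would more likely rely on a maintained entity library for the full named-entity table.

```javascript
// Small illustrative subset of the named-entity table (not exhaustive).
const NAMED = {
  amp: "&", lt: "<", gt: ">", quot: '"', apos: "'",
  nbsp: "\u00A0", mdash: "\u2014", copy: "\u00A9",
};

function decodeEntities(str) {
  // One pass over the input: each entity is decoded exactly once.
  return str.replace(/&(#x[0-9a-f]+|#\d+|[a-z][a-z0-9]*);/gi, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range code points untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown names are left as-is rather than guessed.
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match;
  });
}

console.log(decodeEntities("AT&amp;T"));    // AT&T
console.log(decodeEntities("&amp;lt;"));    // &lt;
console.log(decodeEntities("&#169; 2026")); // © 2026
```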

For the complete picture of how entities work and when they need decoding, see the HTML Entities guide. If you are converting a full Markdown document to HTML first and then to text, the Markdown ↔ HTML converter gives you the intermediate HTML step.

Preserving Line Breaks from Block Elements

A common complaint with naive tag stripping is that all whitespace collapses: paragraphs run together into a wall of text with no separation. The reason is that block-level HTML elements communicate visual structure through CSS layout, not through whitespace in the source. When you discard the tags you also discard that structure — unless you inject newlines first.

The elements that typically warrant a newline in plain text output are:

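| Element        | Role                   | Replacement |
|----------------|------------------------|-------------|
| <br>           | Explicit line break    | \n          |
| </p>           | End of paragraph       | \n          |
| </div>         | End of block container | \n          |
| </h1> to </h6> | End of heading         | \n          |
| </li>          | End of list item       | \n          |
| </tr>          | End of table row       | \n          |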

Note that you replace the closing tag (or the self-closing <br>), not the opening tag, so the newline appears after the content rather than before it.
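
As a quick illustration of the difference, the sketch below strips the same made-up fragment twice, once directly and once after turning <br> and the closing tags from the table into newlines. The variable names are arbitrary.

```javascript
const html = "<h1>Title</h1><p>First paragraph.</p><p>Second<br>line.</p>";

// Direct strip: every boundary disappears.
const flat = html.replace(/<[^>]*>/g, "");

// Newlines injected at <br> and block-level closing tags first.
const withBreaks = html
  .replace(/<br\s*\/?>/gi, "\n")
  .replace(/<\/(p|div|h[1-6]|li|tr)>/gi, "\n")
  .replace(/<[^>]*>/g, "")
  .trim();

console.log(JSON.stringify(flat));
// "TitleFirst paragraph.Secondline."
console.log(JSON.stringify(withBreaks));
// "Title\nFirst paragraph.\nSecond\nline."
```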

If you need to analyse the word count or reading time of the resulting plain text, pass it through the Text Analyzer tool.

Removing Script and Style Content Safely

This is the step most implementations get wrong. Consider the following HTML:

<p>Introduction</p>
<style>
  body { background: red; }
  p::after { content: "injected text"; }
</style>
<script>
  const secret = "api-key-12345";
  document.write("<p>Hello</p>");
</script>
<p>Conclusion</p>

A naive tag stripper that only removes <...> patterns produces:

// Output of naive strip:
"Introduction\n\n  body { background: red; }\n  p::after { content: \"injected text\"; }\n\n\n  const secret = \"api-key-12345\";\n  document.write(\"Hello\");\n\nConclusion"

The CSS rules and JavaScript source appear verbatim in the plain text. To strip the content as well as the tags, the regex must match from the opening tag to the closing tag across multiple lines. The [\s\S]*? pattern (or (.|\n)*?) matches any character including newlines in a non-greedy way:

// Correct: removes tags AND their inner content
html.replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")

Apply these two substitutions as the very first step before any other processing.
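
These patterns assume the closing tag is written exactly as </script>. HTML also allows whitespace before the final >, and <script without a word boundary would match a longer tag name such as a hypothetical <scripted-list> element. A slightly stricter variant, offered as a sketch rather than a complete solution (the stripCodeBlocks name and the sample input are made up here), could look like this:

```javascript
function stripCodeBlocks(html) {
  // \b stops "<script" from matching "<scripted-list>";
  // \1 requires the closing tag to name the same element;
  // \s* tolerates whitespace before the closing ">".
  return html.replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, "");
}

const sample = [
  "<p>Introduction</p>",
  '<SCRIPT type="module">const secret = "api-key-12345";</script >',
  '<style media="print">p { color: red; }</style>',
  "<scripted-list>kept</scripted-list>",
  "<p>Conclusion</p>",
].join("\n");

const cleaned = stripCodeBlocks(sample);
console.log(cleaned.includes("api-key-12345")); // false
console.log(cleaned.includes("kept"));          // true
```

An attribute value containing > would still defeat the [^>]* part, which is one more reason to prefer a real parser when one is available.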

Security Note: Stripping Is Not Sanitizing

Stripping HTML tags to produce plain text is a data extraction operation, not a security sanitization operation. These are different goals with different requirements:

  • Plain-text extraction removes markup to get readable text. The output is treated as inert data — it will be displayed in a context where HTML is not interpreted (a <textarea>, a plain-text email, a database field, a search index).
  • HTML sanitization removes dangerous markup while preserving safe markup. The output will be re-inserted into an HTML context (e.g. via innerHTML) and must not contain executable content.

If you strip tags and then insert the result back into the DOM via innerHTML or dangerouslySetInnerHTML, you have not sanitized anything; you have just moved the problem. A crafted, entity-encoded input like:

&lt;img src=x onerror="alert(1)"&gt;

becomes the text <img src=x onerror="alert(1)"> once textContent has decoded the entities, which is safe as plain text. But if you write that string back into the DOM as HTML it becomes an executable tag again.
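
If extracted plain text does have to be placed inside HTML markup, it needs to be escaped at that point so that it stays text. A minimal sketch, assuming a hypothetical escapeHtml helper (in a browser, assigning to textContent instead of innerHTML achieves the same effect without any helper):

```javascript
function escapeHtml(text) {
  // Characters that are significant in HTML text and attribute values.
  const map = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  return text.replace(/[&<>"']/g, (ch) => map[ch]);
}

// The decoded string from the example above.
const extracted = '<img src=x onerror="alert(1)">';
console.log(escapeHtml(extracted));
// &lt;img src=x onerror=&quot;alert(1)&quot;&gt;
```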

For sanitization (allowing some HTML through safely), use a dedicated library such as DOMPurify rather than a home-built tag stripper. For plain-text extraction specifically, the Strip HTML Tags tool and the techniques above are the right choice. You can also find targeted extraction utilities in the Text Extractors collection.

Approach Comparison

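| Approach                    | Environment   | Handles malformed HTML | Decodes entities     | Strips script/style content |
|-----------------------------|---------------|------------------------|----------------------|-----------------------------|
| Naive replace(/<[^>]*>/g)   | Any           | No                     | No                   | No                          |
| Regex pipeline (multi-step) | Any           | Partial                | Common entities only | Yes (with correct regex)    |
| DOMParser + textContent     | Browser only  | Yes (browser parser)   | Yes (all entities)   | Yes (with element removal)  |
| HTML parser library         | Node.js / any | Yes                    | Yes                  | Yes                         |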

Putting It All Together

Here is a complete function that combines all the techniques above for browser environments:

/**
 * Extract clean plain text from an HTML string.
 * - Removes <script>, <style>, and <noscript> content entirely
 * - Inserts newlines at block-element boundaries
 * - Decodes all HTML entities (browser DOMParser handles the full table)
 * - Collapses excess whitespace
 */
function extractPlainText(html) {
  const doc = new DOMParser().parseFromString(html, "text/html");

  // Step 1: Remove content-bearing non-visible elements
  doc.querySelectorAll("script, style, noscript").forEach((el) => el.remove());

  // Step 2: Insert newlines at block boundaries before reading text
  const blockTags = ["P", "DIV", "BR", "H1", "H2", "H3", "H4", "H5", "H6",
                     "LI", "TR", "BLOCKQUOTE", "PRE", "HR"];
  doc.querySelectorAll(blockTags.join(",")).forEach((el) => {
    el.after(doc.createTextNode("\n"));
  });

  // Step 3: Read text (entities already decoded by the DOM)
  const raw = doc.body.textContent ?? "";

  // Step 4: Normalize whitespace
  return raw
    .replace(/[ \t]+/g, " ")         // collapse horizontal whitespace
    .replace(/\n[ \t]+/g, "\n")      // trim leading spaces on each line
    .replace(/[ \t]+\n/g, "\n")      // trim trailing spaces on each line
    .replace(/\n{3,}/g, "\n\n")      // collapse 3+ newlines into one blank line
    .trim();
}

Whether you are building a search indexer, a text preview generator, or cleaning up scraped content, understanding the difference between naive tag removal and proper HTML-to-text extraction saves hours of debugging edge cases. For instant results without writing any code, use the Strip HTML Tags tool — it handles entity decoding, script and style removal, and whitespace normalization in one click.