
Text Extractor Guide: URLs, Emails, IPs, and Regex Patterns

Raw text is full of structured data hiding in plain sight: access logs packed with IP addresses, HTML dumps bristling with URLs, CRM exports where every row contains an email address. Extracting that data by hand is error-prone and slow. The Text Extractors tool lets you paste any block of text and pull out URLs, emails, IPs, and more in one click — but understanding the underlying patterns makes you better at both using the tool and building your own extraction pipelines.

Use Cases

Extraction tasks come up constantly in developer workflows:

  • Log parsing: A DDoS incident leaves you with thousands of lines of Nginx or Apache logs. Extracting all unique source IPs lets you build a blocklist or feed data into a rate-limiter without writing a full log parser.
  • OSINT: Security researchers and bug bounty hunters lift URLs and domains from data dumps, HTML source, JavaScript files, and API responses to map an organization's attack surface.
  • Data migration: When moving from one CRM to another, the exported CSV often contains email addresses embedded in free-text notes alongside phone numbers and postal addresses. Regex extraction is faster than manual cleanup.
  • Content cleanup: Markdown documents, scraped web pages, and pasted content from word processors routinely contain raw URLs that need to be converted to proper links, deduplicated, or validated.

URL Extraction

A URL extractor needs to handle multiple schemes. The common ones are http://, https://, ftp://, and mailto:. Less common but valid: ws://, wss://, ssh://, git://.

Domain matching is the tricky part. The TLD list maintained by ICANN now contains over 1,500 entries — from the familiar .com and .org to newer strings like .photography and .amsterdam. A regex that hardcodes a list of two- or three-letter TLDs will miss these. The pragmatic approach is to match any sequence of label characters separated by dots, then validate the result against a library or the Public Suffix List if accuracy matters.

Two edge cases trip up most URL extractors:

  • Relative URLs like /api/v1/users or ../images/logo.png do not have a scheme and will not match a scheme-anchored pattern. Relative URL extraction requires a separate, context-aware approach.
  • Trailing punctuation: A URL followed by a period at end-of-sentence (e.g. Visit https://example.com.) will incorrectly include the period in the match. A good pattern strips trailing .,;:!?) characters from the end of each match.
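The trailing-punctuation fix can also be done in two steps: match permissively, then trim. The following is a minimal Python sketch of that idea; the names CANDIDATE_URL, TRAILING_PUNCTUATION, and extract_urls are illustrative assumptions, not part of any tool or library.

```python
import re

# Permissive candidate: a scheme, then everything up to whitespace,
# a quote character, or an angle bracket.
CANDIDATE_URL = re.compile(r"""https?://[^\s"'<>]+""")

# Characters that are more likely sentence punctuation than part of the URL.
TRAILING_PUNCTUATION = ".,;:!?)"

def extract_urls(text):
    """Return scheme-anchored URLs with trailing punctuation trimmed."""
    urls = []
    for match in CANDIDATE_URL.finditer(text):
        # rstrip removes any run of the listed characters from the right end only.
        urls.append(match.group(0).rstrip(TRAILING_PUNCTUATION))
    return urls

sample = "Visit https://example.com. Docs live at (https://example.com/docs), see /api/v1/users."
print(extract_urls(sample))
# ['https://example.com', 'https://example.com/docs']
```

The relative path /api/v1/users is skipped because it has no scheme. One trade-off of trimming ")" unconditionally is that a URL which legitimately ends in a closing parenthesis loses it, so treat the character list as a starting point.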

Email Extraction Pitfalls

Email addresses look simple but RFC 5322 — the specification that defines them — is famously permissive. The full grammar allows constructs that almost no production system actually emits:

  • Quoted local parts: "first last"@example.com and even "foo@bar"@example.com are technically valid. The quoted string can contain spaces, at-signs, and most special characters.
  • IP address literals: user@[192.168.1.1] and user@[IPv6:2001:db8::1] are valid per the spec. Almost no mail server accepts them in practice.
  • Comments: user(comment)@example.com is valid RFC 5322. Again, rarely seen outside of test suites.

For extraction purposes, the standard pragmatic approach is to target the 99% case: a local part made of alphanumerics, dots, hyphens, plus signs, and underscores, followed by @, followed by a standard domain. This misses quoted local parts but those are vanishingly rare in real data.

IPv4 Extraction

IPv4 addresses are four decimal octets separated by dots: 192.168.1.1. Each octet must be 0–255. The most common mistake in IPv4 regex is allowing invalid values like 000.000.000.000 (leading zeros are ambiguous — some parsers treat them as octal) or 300.1.1.1 (out of range).

A correct octet pattern must match exactly the range 0–255:

  • 25[0-5] — matches 250–255
  • 2[0-4][0-9] — matches 200–249
  • 1[0-9]{2} — matches 100–199
  • [1-9][0-9] — matches 10–99
  • [0-9] — matches 0–9

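For log analysis you often want CIDR-aware extraction: matching 10.0.0.0/8 as a single token rather than extracting 10.0.0.0 and leaving the /8 behind. Append an optional suffix group, (?:/[0-9]{1,2})?, to your pattern. This group also accepts out-of-range prefixes such as /99; IPv4 prefix lengths run from 0 to 32, so check the number after matching if accuracy matters.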
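One way to get both range checks without a long regex is to match loosely and let Python's standard ipaddress module decide what is valid. This is a sketch under that assumption; CANDIDATE and extract_addresses are hypothetical names chosen for illustration.

```python
import ipaddress
import re

# Loose candidate: four dotted numbers with an optional /prefix.
# Range checking is left to the ipaddress module below.
CANDIDATE = re.compile(r"(?<![\d.])\d{1,3}(?:\.\d{1,3}){3}(?:/\d{1,2})?(?![\d.])")

def extract_addresses(text):
    """Return valid IPv4 addresses and address/prefix tokens found in text."""
    results = []
    for match in CANDIDATE.finditer(text):
        token = match.group(0)
        try:
            if "/" in token:
                # ip_interface keeps the address as written and validates the prefix (0-32).
                results.append(str(ipaddress.ip_interface(token)))
            else:
                results.append(str(ipaddress.ip_address(token)))
        except ValueError:
            # Out-of-range octets (300.1.1.1) or prefixes (/33) end up here.
            continue
    return results

sample = "allow 10.0.0.0/8, deny 300.1.1.1 and 192.168.1.7/33, host 203.0.113.42"
print(extract_addresses(sample))
# ['10.0.0.0/8', '203.0.113.42']
```

The regex stays short and readable, and the rules about what counts as a valid octet or prefix live in one well-tested place.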

IPv6 Extraction

IPv6 is significantly harder to match with regex because the format allows multiple abbreviated forms:

  • Full form: 2001:0db8:0000:0000:0000:0000:0000:0001
  • Compressed zeros: 2001:db8::1 — the :: notation collapses one or more consecutive all-zero groups
  • Bracketed form: [2001:db8::1] — used in URLs and log entries to disambiguate the address from a port number
  • Embedded IPv4: ::ffff:192.168.1.1 — IPv4-mapped IPv6 addresses appear in dual-stack environments
  • Loopback: ::1 — the full form is 0000:0000:0000:0000:0000:0000:0000:0001

For many extraction tasks, matching the bracketed form ([...]) is a reasonable starting point, because that is how IPv6 addresses are written in URLs and wherever an address is paired with a port. Logs that record the bare address need a broader pattern. A full RFC-compliant IPv6 pattern is long and difficult to maintain, so a dedicated parsing library is preferable for production code.
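As a sketch of the library-based route, the snippet below collects loose candidates with a deliberately simple regex and lets Python's standard ipaddress module accept or reject each one. The names CANDIDATE and extract_ipv6 are assumptions made for this example.

```python
import ipaddress
import re

# Loose candidate: any run of hex digits, colons, and dots
# (dots allow the embedded IPv4 form). Validation happens afterwards.
CANDIDATE = re.compile(r"[0-9A-Fa-f:.]+")

def extract_ipv6(text):
    """Return ipaddress.IPv6Address objects for every valid candidate in text."""
    found = []
    for token in CANDIDATE.findall(text):
        token = token.rstrip(".")    # drop sentence-ending periods
        if token.count(":") < 2:     # every IPv6 literal contains at least two colons
            continue
        try:
            found.append(ipaddress.IPv6Address(token))
        except ValueError:
            # Timestamps, MAC addresses, and other lookalikes are rejected here.
            continue
    return found

sample = "client [2001:db8::1]:443 -> ::ffff:192.168.1.1 at 10:23:11, loopback ::1."
for address in extract_ipv6(sample):
    print(address)
```

The function returns address objects rather than strings, so the full and compressed spellings of the same address compare equal and deduplicate naturally. Scoped addresses (with a %zone suffix) are not handled by this sketch.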

Sample Regex Patterns

Each entry lists the target, the pattern in JavaScript regex syntax, and notes.

  • URL (http/https)
    Pattern: https?://[^\s"'<>)]+[^\s"'<>.,;:!?)\]]
    Notes: Strips trailing punctuation
  • Email
    Pattern: [a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}
    Notes: Pragmatic 99% case; misses quoted local parts
  • IPv4
    Pattern: (?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]\d|\d)(?:\.(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]\d|\d)){3}
    Notes: Validates 0–255 range per octet
  • IPv4 + CIDR
    Pattern: <IPv4 pattern>(?:/[0-9]{1,2})?
    Notes: Optionally matches subnet mask
  • IPv6 (bracketed)
    Pattern: \[([0-9a-fA-F:]+)\]
    Notes: Matches log-format IPv6 only
  • Phone (E.164)
    Pattern: \+[1-9]\d{7,14}
    Notes: International format only
  • UUID
    Pattern: [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
    Notes: Case-insensitive with the i flag

IDN and Unicode Domains

Internationalized Domain Names (IDN) allow non-ASCII characters in domain names. A domain like café.com is encoded in DNS as xn--caf-dma.com using the Punycode algorithm. Both forms may appear in text you are extracting from:

  • User-facing content (web pages, emails) tends to use the Unicode form: münchen.de
  • System logs, DNS records, and certificate fields tend to use the Punycode form: xn--mnchen-3ya.de

A simple ASCII-only URL regex will miss the Unicode form. If your input may contain non-ASCII domains, use a Unicode-aware pattern or normalize the input to Punycode first, using the URL constructor in JavaScript (new URL(href).hostname returns the Punycode form) or the built-in idna codec in Python (hostname.encode('idna')).
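A minimal Python sketch of that normalization step, using only the standard library, might look like this. The helper names hostname_to_punycode and punycode_to_unicode are made up for illustration.

```python
from urllib.parse import urlsplit

def hostname_to_punycode(url):
    """Return the URL's hostname in its ASCII (Punycode) form, or None if it has no host."""
    host = urlsplit(url).hostname
    if host is None:
        return None
    # The built-in "idna" codec converts each non-ASCII label to its xn-- form.
    return host.encode("idna").decode("ascii")

def punycode_to_unicode(hostname):
    """Convert an ASCII hostname with xn-- labels back to its Unicode form."""
    return hostname.encode("ascii").decode("idna")

print(hostname_to_punycode("https://münchen.de/rathaus"))  # xn--mnchen-3ya.de
print(hostname_to_punycode("https://café.com/menu"))       # xn--caf-dma.com
print(punycode_to_unicode("xn--mnchen-3ya.de"))            # münchen.de
```

Python's built-in codec implements the older IDNA 2003 rules. If a stricter IDNA 2008 conversion is required, a third-party package is the usual route.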

Deduplication and Normalization

Raw extraction produces duplicates. Before presenting results or writing them to a database, apply normalization so that logically identical values deduplicate correctly:

  • Emails: Lowercase the entire address. Most mail systems are case-insensitive for both the local part and the domain, though RFC 5321 technically allows case-sensitive local parts.
  • URLs: Strip fragment identifiers (#section) before deduplication — two URLs that differ only by fragment point to the same resource. Optionally normalize trailing slashes and sort query parameters.
  • IPs: Normalize to a canonical form. 192.168.001.001 and 192.168.1.1 are the same address; parsing as integers before comparison avoids false duplicates.
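The three rules above can be sketched as small normalizer functions fed into one order-preserving deduplication helper. All function names here are illustrative assumptions, and normalize_ipv4 reads each octet as decimal, as the text does.

```python
from urllib.parse import urldefrag

def normalize_email(address):
    """Lowercase the whole address (a pragmatic choice, see the RFC 5321 caveat)."""
    return address.lower()

def normalize_url(url):
    """Drop the #fragment so URLs that differ only by fragment collapse."""
    return urldefrag(url).url

def normalize_ipv4(address):
    """Parse each octet as a decimal integer so 192.168.001.001 becomes 192.168.1.1."""
    return ".".join(str(int(octet)) for octet in address.split("."))

def dedupe(values, normalize):
    """Normalize every value, then deduplicate while keeping first-seen order."""
    return list(dict.fromkeys(normalize(value) for value in values))

print(dedupe(["Alice@Example.COM", "alice@example.com"], normalize_email))
# ['alice@example.com']
print(dedupe(["https://example.com/a#top", "https://example.com/a#end"], normalize_url))
# ['https://example.com/a']
print(dedupe(["192.168.001.001", "192.168.1.1"], normalize_ipv4))
# ['192.168.1.1']
```

Keeping normalization separate from extraction means the raw matches stay available if you later need to report exactly what appeared in the source text.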

Code Examples

JavaScript: URL and email scanner

// Extract and deduplicate URLs from a block of text
function extractUrls(text) {
  const pattern = /https?:\/\/[^\s"'<>)]+[^\s"'<>.,;:!?)\]]/g;
  const matches = text.match(pattern) ?? [];
  // Remove fragment, deduplicate, return sorted
  const normalized = matches.map(u => u.replace(/#.*$/, ''));
  return [...new Set(normalized)].sort();
}

// Extract emails
function extractEmails(text) {
  const pattern = /[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}/g;
  const matches = text.match(pattern) ?? [];
  return [...new Set(matches.map(e => e.toLowerCase()))].sort();
}

// Handle Markdown links before extraction: rewrite [text](url) as "text url"
// so the closing parenthesis of the link never ends up next to the URL
function extractUrlsFromMarkdown(text) {
  // Unwrap Markdown links first, then extract
  const stripped = text.replace(/\[([^\]]+)\]\(([^)]+)\)/g, '$1 $2');
  return extractUrls(stripped);
}

Python: IP extraction with re.finditer

import re

IPV4_OCTET = r'(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]\d|\d)'
IPV4_PATTERN = re.compile(
    rf'(?<![\d.])({IPV4_OCTET}(?:\.{IPV4_OCTET}){{3}})(?:/([0-9]{{1,2}}))?(?![\d.])'
)

def extract_ips(text):
    results = []
    for m in IPV4_PATTERN.finditer(text):
        ip = m.group(1)
        cidr = m.group(2)
        results.append(f'{ip}/{cidr}' if cidr else ip)
    return list(dict.fromkeys(results))  # deduplicate, preserve order

# Example
log_line = '2026-04-20 10:23:11 blocked 203.0.113.42 -> 10.0.0.1/8'
print(extract_ips(log_line))
# ['203.0.113.42', '10.0.0.1/8']

Python: Robust URL domain extraction with tldextract

# pip install tldextract
import tldextract
import re

# Triple-quoted raw string: the pattern contains both ' and " characters
URL_PATTERN = re.compile(r'''https?://[^\s"'<>)]+[^\s"'<>.,;:!?)\]]''')

def extract_domains(text):
    urls = URL_PATTERN.findall(text)
    domains = set()
    for url in urls:
        parts = tldextract.extract(url)
        if parts.domain and parts.suffix:
            # Returns registered domain only, handles new TLDs correctly
            domains.add(f'{parts.domain}.{parts.suffix}')
    return sorted(domains)

# tldextract uses the Public Suffix List — handles .co.uk, .com.au,
# and new gTLDs like .photography correctly

Common Pitfalls

  • Greedy matching past the URL: A greedy .* in a URL pattern runs to the end of the line, swallowing spaces and any text that follows the URL. With the dotAll (s) flag in JavaScript, or re.DOTALL in Python, it also crosses newlines and can match content from multiple lines as a single URL. Always use [^\s] or a narrower character class instead of . for the URL body.
  • Markdown links: In [link text](https://example.com) a naive URL pattern extracts https://example.com) including the closing parenthesis. Add ) to the list of characters that terminate a URL match, or pre-process Markdown before extracting.
  • URLs in query parameters: A URL like https://tracker.example.com/?redirect=https://target.com embeds another URL as a query parameter value. Both will be extracted. If you only want the outer URL, decode the query string and filter after extraction.
  • False positives in version strings: A string like node@18.17.0 or a package specifier like lodash@4.17.21 will match a naive email regex. Use word boundary anchors (\b) and verify the domain TLD length (at least 2 characters, no purely numeric TLD) to reduce false positives.
  • Performance on large inputs: Applying many independent regex patterns to a multi-megabyte log file is slow. Combine patterns into a single alternation with named capture groups, or use re.finditer (Python) rather than re.findall to avoid building a full results list in memory.
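The single-alternation idea from the last point can be sketched in Python as follows. The group names, the PATTERNS dictionary, and the scan function are assumptions for illustration; the sub-patterns are the ones from the table above, with lookarounds added to the IPv4 entry.

```python
import re

OCTET = r"(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]\d|\d)"

# Order matters: the first alternative that matches at a position wins,
# so URLs are tried before emails and bare IPs.
PATTERNS = {
    "url": r"""https?://[^\s"'<>)]+[^\s"'<>.,;:!?)\]]""",
    "email": r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}",
    "ipv4": rf"(?<![\d.]){OCTET}(?:\.{OCTET}){{3}}(?![\d.])",
}

# One compiled regex with a named group per target type.
COMBINED = re.compile("|".join(f"(?P<{name}>{body})" for name, body in PATTERNS.items()))

def scan(text):
    """Yield (kind, value) pairs in a single pass, without building a full list."""
    for match in COMBINED.finditer(text):
        # lastgroup is the name of the alternative that matched.
        yield match.lastgroup, match.group(0)

line = "203.0.113.42 GET https://example.com/login?next=/home by ops@example.com."
print(list(scan(line)))
# [('ipv4', '203.0.113.42'), ('url', 'https://example.com/login?next=/home'), ('email', 'ops@example.com')]
print(list(scan("installed lodash@4.17.21")))
# []
```

Because the email alternative requires an alphabetic TLD of at least two characters, the package specifier in the second call produces no match, which also covers the version-string pitfall above.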

For one-off extraction tasks, the Text Extractors tool handles URLs, emails, IPs, and more directly in the browser — nothing leaves your machine. For related regex tools, see the Regex Cheatsheet, Regex Find and Replace Guide, and Text Tools Guide.