DevToys Web Pro Blog

Sort Lines, Remove Duplicates, and Filter Text: A Practical Guide

7 min read

Line-level text manipulation — sorting a list, stripping duplicates, removing blank lines — shows up constantly in development work. You paste a log excerpt, a CSV column, an import list, or a config fragment and need it cleaned up in seconds. The Line Utilities tool handles all of these operations directly in the browser without sending your data anywhere.

When a Web Tool Beats sort | uniq

The Unix pipeline sort file.txt | uniq is powerful, but reaching for a terminal is not always the right move. Three situations where a browser tool wins:

  • No terminal access. You are working on a locked-down corporate machine, a Chromebook, or inside a web-based IDE. Pasting into a tool takes two seconds; finding a terminal takes ten minutes.
  • Small ad-hoc lists. When the input is 20 lines copied from a spreadsheet or a Slack message, spinning up a shell script is over-engineering. Paste, click, copy.
  • Privacy-sensitive data. Sorting a list of internal usernames or API keys through an online service that logs requests is a risk. A client-side tool processes everything locally — nothing leaves the browser tab.

For large files, scripted pipelines, or automation, CLI tools remain the right choice. The sections below include CLI equivalents for every operation so you can replicate any result in a script.

Sort Strategies

Not all sorting is alphabetical. The right strategy depends on what the lines contain. Here is a comparison of the main options:

| Strategy | How It Works | Best For | CLI Flag |
|---|---|---|---|
| Alphabetical | Byte order (ASCII / UTF-8 code point) | Code identifiers, keys, simple word lists | sort |
| Case-insensitive | Fold uppercase to lowercase before comparing | Mixed-case config keys, hostnames | sort -f |
| Natural (human) | Treats embedded numbers as numeric values, so file10 sorts after file9 | Filenames, version-like strings | sort -V |
| Numeric | Parses the leading number and compares as integer or float | Lines that start with counts, latencies, sizes | sort -n |
| Locale-aware | Uses Intl.Collator with the system locale for accent and diacritic ordering | Human names, words in non-ASCII languages | sort with LC_ALL set to the target locale |
| Reverse | Any of the above, applied in descending order | Newest-first timestamps, highest-first counts | sort -r |
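Natural sort has no single built-in in Python; a common sketch splits each line into digit and non-digit runs and compares the digit runs numerically. The helper name natural_key below is illustrative, and the simple version assumes lines share a letters-then-digits shape (mixing purely numeric and purely alphabetic lines would compare int against str):

```python
import re

def natural_key(line):
    # Split into digit and non-digit runs; compare digit runs as integers
    # so "file10" sorts after "file9" instead of between "file1" and "file2".
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", line)]

files = ["file10", "file2", "file9", "file1"]
print(sorted(files))                   # lexicographic: file10 lands before file2
print(sorted(files, key=natural_key))  # natural: file1, file2, file9, file10
```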

Common Sorting Pitfalls

Choosing the wrong sort type produces silently wrong results. Three traps developers hit regularly:

  • IP addresses sorted as text. Lexicographic order puts 192.168.1.10 before 192.168.1.9 because "1" (the digit starting 10) sorts before "9". Use numeric sort on each octet separately, or sort by the integer representation of the full address.
  • Version strings with sort -n. Numeric sort reads the leading number only, so 2.10.0 and 2.9.0 both become 2 and compare as equal. Use sort -V (version sort) instead.
  • Dates in local format. 04/20/2026 and 20/04/2026 sort completely differently depending on locale. Normalize to ISO 8601 (2026-04-20) before sorting — ISO dates sort correctly as plain strings.
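The IP-address trap is easy to reproduce; a minimal Python sketch sorts correctly by converting each octet to an integer before comparing:

```python
ips = ["192.168.1.10", "192.168.1.9", "192.168.1.100"]

# Lexicographic: "1" sorts before "9", so .10 and .100 land before .9 — wrong.
print(sorted(ips))

# Compare each octet numerically instead.
def ip_key(ip):
    return tuple(int(octet) for octet in ip.split("."))

print(sorted(ips, key=ip_key))  # .9, .10, .100
```

The GNU sort equivalent uses a dot field separator with a numeric key per octet: sort -t. -k1,1n -k2,2n -k3,3n -k4,4n.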

Deduplication Approaches

Removing duplicate lines sounds straightforward, but the definition of "duplicate" matters:

  • Exact match — two lines are duplicates only if they are byte-for-byte identical. This is what sort -u and uniq do after sorting.
  • Case-insensitive — treat Error and error as the same. Useful for deduplicating log keywords or config keys.
  • Trimmed whitespace — leading and trailing spaces are stripped before comparison, so " apple" and "apple" are considered identical. Catches copy-paste artifacts from spreadsheets.
  • Normalized Unicode (NFC) — the same character can be represented by more than one code point sequence. A precomposed è and the two-code-point sequence e followed by a combining grave accent look identical but have different byte sequences. NFC normalization collapses these before comparison. Important for text pasted from macOS (whose filesystem uses NFD) into a Windows-originated file.
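The NFC case can be sketched with Python's standard unicodedata module — the two spellings of é below differ in bytes but collapse under normalization:

```python
import unicodedata

lines = ["caf\u00e9", "cafe\u0301"]  # precomposed é vs e + combining acute
print(lines[0] == lines[1])          # False: different code point sequences

seen = set()
unique = []
for line in lines:
    key = unicodedata.normalize("NFC", line)
    if key not in seen:
        seen.add(key)
        unique.append(line)

print(unique)  # one entry: the two forms are duplicates after NFC
```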

When order matters, use insertion-order deduplication rather than sort-then-uniq. The CLI equivalent is awk '!seen[$0]++', which keeps only the first occurrence of each line without reordering.

# Keep first occurrence, preserve order
awk '!seen[$0]++' input.txt

# Case-insensitive dedup, preserve order
awk '!seen[tolower($0)]++' input.txt

# Sort + dedup (does not preserve original order)
sort -u input.txt

Empty Line Removal and Whitespace Handling

Empty lines come in two varieties that need different handling:

  • Truly empty lines — the line contains only a newline character. Easy to remove with grep -v '^$' or sed '/^$/d'.
  • Whitespace-only lines — the line contains spaces or tabs but no visible content. A naïve empty-line filter misses these. Use grep -v '^\s*$' to catch both.

The related operation is trimming — removing leading and trailing whitespace from each line without deleting the line itself. This is distinct from empty-line removal. After trimming, lines that were whitespace-only become empty and can then be removed in a second pass.

# Remove truly empty lines
grep -v '^$' input.txt

# Remove whitespace-only lines too
grep -v '^\s*$' input.txt

# Trim leading/trailing whitespace per line (GNU sed)
sed 's/^[[:space:]]*//;s/[[:space:]]*$//' input.txt

# Collapse multiple blank lines into one
cat -s input.txt

# Squeeze repeated newlines (note: removes blank lines entirely, not just the extras)
tr -s '\n' < input.txt
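The two-pass idea above — trim first, then drop the lines that became empty — looks like this in Python:

```python
raw = "  apple  \n\t\nbanana\n   \ncherry  "

# Pass 1: trim leading/trailing whitespace from every line.
trimmed = [line.strip() for line in raw.splitlines()]

# Pass 2: whitespace-only lines are now empty strings; drop them.
non_empty = [line for line in trimmed if line]

print(non_empty)  # ['apple', 'banana', 'cherry']
```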

Real-World Use Cases

Line utilities pay off most in these everyday scenarios:

  • Log filtering. Paste a block of log lines, remove duplicates to see distinct error messages, then sort by severity prefix or timestamp. Much faster than grepping when you do not know the exact pattern yet.
  • Config file cleanup. .env files, hosts entries, and nginx allowlists accumulate duplicate entries over time. Sort and deduplicate to spot conflicts and reduce file size.
  • Import list deduplication. When merging two files of Python imports or TypeScript import statements, sort alphabetically and remove duplicates to produce a canonical list.
  • CSV header analysis. Paste the first row of several CSV exports, deduplicate to find the union of column names, then sort to compare coverage across datasets.
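The CSV-header scenario, for instance, reduces to a dedup plus a sort. A minimal sketch (the header strings are made-up examples):

```python
header_a = "id,name,email,created_at"
header_b = "id,email,country,created_at"

# Union of column names across exports, first-seen order preserved.
union = list(dict.fromkeys(header_a.split(",") + header_b.split(",")))
print(union)

# Sorted view for side-by-side comparison across datasets.
print(sorted(union))
```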

CLI Equivalents

For automation and large files, the equivalent shell commands are worth knowing:

# Alphabetical sort
sort input.txt

# Sort and remove duplicates
sort -u input.txt

# Numeric sort (leading number on each line)
sort -n input.txt

# Version / natural sort
sort -V input.txt

# Reverse sort
sort -r input.txt

# Case-insensitive sort
sort -f input.txt

# Preserve order, remove duplicates (awk)
awk '!seen[$0]++' input.txt

# Remove empty and whitespace-only lines
grep -v '^\s*$' input.txt

# Collapse consecutive duplicate lines (input must be sorted)
uniq input.txt

# Trim whitespace from each line
sed 's/^[[:space:]]*//;s/[[:space:]]*$//' input.txt

Code Examples

When you need the same operations in application code rather than on the command line:

// JavaScript — sort and deduplicate, preserve insertion order
const lines = text.split('\n').filter(Boolean);

// Exact dedup, insertion order preserved (ES2015+)
const unique = Array.from(new Set(lines));

// Case-insensitive dedup, keep first occurrence
const seen = new Set();
const uniqueCI = lines.filter(line => {
  const key = line.toLowerCase();
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
});

// Locale-aware sort (respects accents, diacritics)
const collator = new Intl.Collator('en', { sensitivity: 'base' });
const sorted = [...unique].sort((a, b) => collator.compare(a, b));

// Natural (human) sort
const natural = [...lines].sort((a, b) =>
  a.localeCompare(b, undefined, { numeric: true, sensitivity: 'base' })
);

# Python — sort and deduplicate
lines = text.splitlines()

# Exact dedup, insertion order preserved (Python 3.7+ dicts maintain order)
unique = list(dict.fromkeys(lines))

# Case-insensitive dedup, keep first occurrence
seen = set()
unique_ci = []
for line in lines:
    key = line.casefold()
    if key not in seen:
        seen.add(key)
        unique_ci.append(line)

# Sorted
sorted_lines = sorted(unique)

# Locale-aware sort (stdlib locale module; PyICU offers more robust collation)
import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
locale_sorted = sorted(unique, key=locale.strxfrm)

# Remove empty and whitespace-only lines
non_empty = [line for line in lines if line.strip()]

Pitfalls in Code

A few traps worth knowing before you ship line-manipulation code:

  • Set and insertion order. Set has iterated in insertion order since it was introduced in ES2015, but pre-ES2015 environments and some transpilation targets polyfill it without that guarantee. If order matters, use the explicit seen Set pattern shown above rather than relying on Array.from(new Set(...)).
  • Unicode normalization traps. Two strings that look identical may have different byte representations depending on the source (macOS uses NFD, Windows and the web use NFC). Always normalize before comparing: line.normalize('NFC') in JavaScript, unicodedata.normalize('NFC', line) in Python.
  • CRLF vs LF line endings. A file with Windows line endings (\r\n) will leave a trailing \r on each line after splitting on \n. The lines "apple\r" and "apple" are not duplicates under exact comparison. Normalize line endings first: text.replace(/\r\n/g, '\n') or open the file in text mode in Python (the default).
  • sort -u vs sort | uniq. They produce the same result for exact matching, but uniq only removes adjacent duplicates — it requires sorted input. On unsorted input, sort -u is correct and uniq alone is not.
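The CRLF trap is easy to reproduce in Python: splitlines() understands both endings, while a naive split('\n') does not:

```python
text = "apple\r\napple\n"

# Naive split leaves a trailing \r: "apple\r" != "apple", so dedup misses it.
naive = text.split("\n")
print(naive)  # ['apple\r', 'apple', '']

# splitlines() handles \r\n, \n, and \r; the duplicates now match exactly.
lines = text.splitlines()
print(list(dict.fromkeys(lines)))  # ['apple']
```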

All the operations described in this guide are available in Line Utilities — sort, deduplicate, filter, and trim directly in your browser with no data leaving your machine. For related text processing, see the Text Tools Guide and the comparison of List Operations.