DevToys Web Pro iconDevToys Web ProBlog
Isinalin gamit ang LocalePack logoLocalePack
I-rate kami:
Subukan ang browser extension:
← Back to Blog

SQL Formatter Guide: Dialects, CTEs, and CI Linting

11 min read

Unformatted SQL is where bugs hide. A query that spans three screens with no consistent indentation, mixed keyword casing, and implicit JOIN conditions is hard to review, hard to diff, and nearly impossible to refactor safely. Formatted SQL forces structure on a language that otherwise lets you write whatever compiles. Use the SQL Formatter to clean up queries as you follow along.

This guide covers what a formatter actually does under the hood, where dialects diverge, how to structure CTEs and JOINs, what to do with long SELECT lists, and how to gate SQL quality in CI with sqlfluff.

What a Formatter Actually Does

A SQL formatter is not a simple text prettifier. It works in distinct phases:

  1. Tokenize — split the raw string into keywords, identifiers, literals, operators, and whitespace tokens.
  2. Parse — build a parse tree that understands the grammatical structure: which tokens are clause keywords, which are column lists, which are expressions.
  3. Normalize — apply casing rules (uppercase keywords, lowercase identifiers), strip redundant whitespace, and standardize quote characters.
  4. Reflow — walk the parse tree and emit tokens with correct indentation, line breaks, and alignment.

This matters because a formatter that skips the parse step — treating SQL as a string to reflow with regex — will corrupt complex expressions. String literals, nested subqueries, and CASE expressions all require a real parser to reformat safely.

Here is the same query before and after formatting:

-- Before
select u.id,u.name,o.total from users u join orders o on u.id=o.user_id where o.total>100 and u.active=1 order by o.total desc
-- After
SELECT
  u.id,
  u.name,
  o.total
FROM users AS u
JOIN orders AS o
  ON u.id = o.user_id
WHERE
  o.total > 100
  AND u.active = 1
ORDER BY o.total DESC

The formatted version makes every clause visible at a glance. The WHERE filter on active — which could silently exclude rows — is no longer buried at the end of a single line.

Keyword Casing Debates

The SQL standard defines keywords in uppercase. Early terminals displayed everything in uppercase, and the convention stuck. Modern tools accept either case, but the debate continues on every team that writes SQL seriously.

StyleExampleArgument for
UPPERCASE keywordsSELECT id FROM users WHERE active = 1Visual separation between keywords and identifiers; matches SQL standard; most style guides (Google, GitLab, dbt) default to this
lowercase keywordsselect id from users where active = 1Less visual noise; easier to type; popular in some ORM-generated output and PostgreSQL documentation examples

The conclusion that holds across most large codebases: consistency beats preference. Pick one style, configure your formatter to enforce it, and move on. The cost of mixed casing in a shared repository — where half the queries are uppercase and half are not — exceeds the cost of whichever style you chose.

Uppercase keywords won in most SQL style guides (dbt's official style guide, GitLab's SQL style guide, Kickstarter's style guide) because they make it easier to visually scan a query and identify clause boundaries. Formatters like sqlfluff default to uppercase; you can override to lowercase if your team prefers.

Dialect Differences

SQL has an ISO standard, but every major database deviates from it. A formatter must know the target dialect to handle syntax correctly — otherwise it will either reject valid syntax or silently misformat it.

FeatureMySQLPostgreSQLSQLiteSQL ServerBigQuerySnowflake
Quoted identifiers`backtick`"double"Both[bracket]`backtick`"double"
Boolean literals1 / 0TRUE / FALSE1 / 01 / 0TRUE / FALSETRUE / FALSE
String concatCONCAT(a, b)a || ba || ba + bCONCAT(a, b)a || b
Window functions8.0+Full support3.25+Full supportFull supportFull support
LIMIT / TOPLIMIT nLIMIT nLIMIT nTOP n (prefix)LIMIT nLIMIT n
Upsert syntaxON DUPLICATE KEY UPDATEON CONFLICT DO UPDATEON CONFLICT DO UPDATEMERGEMERGEMERGE

When a formatter encounters `backtick` identifiers, it needs to know it is targeting MySQL or BigQuery — not PostgreSQL, which would treat backticks as a syntax error. Always set the dialect in your formatter configuration rather than relying on auto-detection.

CTE Formatting

Common Table Expressions (CTEs) are the most powerful structural tool in SQL for managing query complexity. A CTE chain with ten steps is readable; the equivalent nested subquery is not. Formatting choices for CTEs have an outsized impact on readability.

Leading comma vs trailing comma

Trailing-comma style is more common in application code (JavaScript arrays, Python lists), but leading-comma style has a practical advantage in SQL: it makes it easier to comment out an individual CTE or column without a syntax error on the preceding line.

-- Trailing comma (common default)
WITH
  active_users AS (
    SELECT id, name
    FROM users
    WHERE active = TRUE
  ),
  recent_orders AS (
    SELECT user_id, SUM(total) AS total_spent
    FROM orders
    WHERE created_at > CURRENT_DATE - INTERVAL '30 days'
    GROUP BY user_id
  )
SELECT
  u.name,
  o.total_spent
FROM active_users AS u
JOIN recent_orders AS o
  ON u.id = o.user_id
-- Leading comma (dbt style guide preference)
WITH
  active_users AS (
    SELECT id, name
    FROM users
    WHERE active = TRUE
  )

  , recent_orders AS (
    SELECT user_id, SUM(total) AS total_spent
    FROM orders
    WHERE created_at > CURRENT_DATE - INTERVAL '30 days'
    GROUP BY user_id
  )

SELECT
  u.name
  , o.total_spent
FROM active_users AS u
JOIN recent_orders AS o
  ON u.id = o.user_id

When to extract to a CTE vs inline a subquery

A subquery that is referenced once and fits in three lines can stay inline. Extract to a CTE when:

  • The subquery is referenced more than once in the outer query.
  • The subquery is longer than about five lines and has its own WHERE clause.
  • The logic has a meaningful name — active_users, monthly_revenue — that makes the outer query self-documenting.
  • You want to test the intermediate result independently during development.

JOIN Formatting

Always use explicit JOIN keywords. Implicit joins (comma-separated tables in FROM) still work syntactically but make the join condition easy to lose or omit:

-- Implicit join — avoid
SELECT u.name, o.total
FROM users u, orders o
WHERE u.id = o.user_id
  AND o.total > 100
-- Explicit join — prefer
SELECT u.name, o.total
FROM users AS u
JOIN orders AS o
  ON u.id = o.user_id
WHERE o.total > 100

Place the ON clause on its own line, indented one level under the JOIN keyword. When a join involves multiple conditions, each condition gets its own line:

SELECT
  u.name,
  oi.quantity,
  p.title
FROM users AS u
JOIN orders AS o
  ON u.id = o.user_id
JOIN order_items AS oi
  ON o.id = oi.order_id
  AND oi.cancelled = FALSE
JOIN products AS p
  ON oi.product_id = p.id
WHERE u.active = TRUE

Use table aliases consistently. Short aliases (u, o) work well for two-table queries but become confusing at five tables. Longer abbreviations (usr, ord) or the full table name help in complex queries. Pick one convention per query and stick to it.

Long SELECT Lists

One column per line is the standard recommendation, and for good reason: it makes code review diffs readable. When a column is added or removed, the diff shows exactly which column changed rather than a blob of changed text on a single line.

-- Hard to diff, hard to review
SELECT id, name, email, created_at, updated_at, active, role, last_login_at FROM users
-- One column per line
SELECT
  id,
  name,
  email,
  created_at,
  updated_at,
  active,
  role,
  last_login_at
FROM users

For trailing vs leading commas in SELECT lists, the same argument applies as with CTEs: leading commas make it easier to comment out a single column. But trailing commas are more natural for most developers and are the default in most formatters. Choose one and configure your formatter to enforce it.

Tools Ecosystem

Several tools handle SQL formatting and linting. They occupy different points on the spectrum from "fast and opinionated" to "fully configurable lint engine."

sqlfluff

The most capable open-source SQL linter and fixer. Supports 20+ dialects, hundreds of configurable rules, and can both detect and fix violations. The go-to choice for CI enforcement.

pip install sqlfluff
# Lint a file
sqlfluff lint query.sql --dialect postgres

# Auto-fix violations
sqlfluff fix query.sql --dialect postgres

Prettier SQL plugin

prettier-plugin-sql integrates SQL formatting into the Prettier ecosystem. Useful when your project already runs Prettier for JavaScript and TypeScript and you want a single formatter invocation to cover all file types.

npm install --save-dev prettier prettier-plugin-sql
// .prettierrc
{
  "plugins": ["prettier-plugin-sql"],
  "language": "postgresql",
  "keywordCase": "upper"
}

pgFormatter

A Perl-based formatter specifically for PostgreSQL. Produces clean, standards-aligned output. Available as a CLI (pg_format) and integrated into many editors via plugins.

pg_format -i query.sql   # Format in-place
pg_format query.sql       # Print to stdout

DBeaver built-in formatter

DBeaver (the open-source database GUI) ships a SQL formatter accessible via Ctrl+Shift+F (or Cmd+Shift+F on macOS). Configurable under Preferences → Editors → SQL Editor → Formatting. Useful for interactive query work, but not suitable for automated CI enforcement.

DevToys Pro SQL Formatter

For quick ad-hoc formatting without installing anything, the SQL Formatter in DevToys Pro runs entirely in the browser — no server, no data leaving your machine. Paste a query, select the dialect, get clean output.

CI Integration with sqlfluff

Formatting and linting SQL in CI prevents unformatted queries from entering the codebase. Here is a working sqlfluff configuration and pre-commit setup.

sqlfluff configuration

# .sqlfluff
[sqlfluff]
dialect = postgres
templater = dbt          # or "jinja" for raw SQL with Jinja, or "raw" for plain SQL
max_line_length = 100
indent_unit = space
tab_space_size = 2

[sqlfluff:indentation]
indented_joins = False
indented_ctes = False
indented_using_on = True

[sqlfluff:layout:type:comma]
line_position = trailing  # or "leading" for leading-comma style

[sqlfluff:rules:capitalisation.keywords]
capitalisation_policy = upper

[sqlfluff:rules:capitalisation.identifiers]
capitalisation_policy = lower

[sqlfluff:rules:capitalisation.functions]
capitalisation_policy = upper

[sqlfluff:rules:aliasing.table]
aliasing = explicit       # require AS keyword for table aliases

[sqlfluff:rules:aliasing.column]
aliasing = explicit       # require AS keyword for column aliases

pre-commit configuration

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/sqlfluff/sqlfluff
    rev: 3.0.7           # pin to a specific release
    hooks:
      - id: sqlfluff-lint
        args: [--dialect, postgres]
        # Limit to SQL files only
        types: [sql]
      - id: sqlfluff-fix
        args: [--dialect, postgres]
        types: [sql]
# Install pre-commit and run hooks
pip install pre-commit
pre-commit install
pre-commit run --all-files

Lint rules worth enabling

Not all sqlfluff rules are worth enforcing everywhere. These are the ones with the highest signal-to-noise ratio:

  • ambiguous.column_references — flag unqualified column references in queries with more than one table (a common source of bugs when a column exists in multiple tables).
  • references.from — ensure every table referenced in SELECT is declared in FROM or JOIN.
  • structure.subquery — warn when a subquery could be replaced with a CTE for readability.
  • aliasing.forbid_subquery_in — prevent unaliased subqueries in FROM clauses.
  • layout.select_targets — enforce one column per line in SELECT lists.

Ignoring generated SQL

ORM migrations, query builders, and code generators produce SQL that you should not run through a linter. Exclude them with glob patterns:

# .sqlfluff
[sqlfluff]
exclude_rules = None

# Ignore specific paths
[sqlfluff:templater:jinja]
ignore_templating_errors = True
# .sqlfluffignore
# Generated migration files
migrations/auto_*.sql
# ORM-generated schema dumps
schema_dump.sql
# Vendor SQL
vendor/**

Pitfalls

String literal contents getting reformatted

A formatter that does not correctly parse string boundaries will reformat content inside string literals. This is destructive: a SQL query stored as a string inside another SQL query, or a regex pattern in a SIMILAR TO clause, can be corrupted.

-- The content inside the string must not be reformatted
SELECT *
FROM logs
WHERE message SIMILAR TO '(ERROR|WARN):s+.*connection.*refused'

Always verify your formatter preserves string literal contents exactly. Test with queries that contain escape sequences, single quotes escaped with '', and dollar-quoted strings (PostgreSQL).

Dialect misdetection

Auto-detection of SQL dialect is unreliable. A query using ILIKE (PostgreSQL) or QUALIFY (BigQuery, Snowflake) will confuse a formatter that defaults to ANSI SQL. Always set dialect explicitly:

# Bad: relies on auto-detection
sqlfluff lint query.sql

# Good: explicit dialect
sqlfluff lint query.sql --dialect snowflake

Dynamic SQL

SQL assembled at runtime from string concatenation cannot be statically formatted or linted. Formatters operate on static text. If you build queries dynamically — concatenating WHERE conditions based on user input, or building IN lists programmatically — the formatter will see the template, not the final query.

For Jinja-templated SQL (common in dbt), sqlfluff has a Jinja templater that understands{{ ref('table') }} and {% if condition %} syntax. For everything else, lint the static parts and accept that the dynamic parts cannot be automatically checked.

If you work with multiple languages and want a consistent formatting workflow across your codebase, the Code Formatters Guide covers Prettier, ESLint, Black, rustfmt, and how to integrate them into pre-commit and CI — the same patterns that apply to sqlfluff for SQL.


Format and clean SQL queries directly in your browser with the SQL Formatter — supports multiple dialects, runs locally, and handles edge cases that trip up simpler regex-based tools.