@hat0uma hat0uma commented Jul 13, 2025

Summary

Implements CSV dialect detection: the delimiter, quote character, and header row are now detected automatically.

Key Features

Auto-detect Delimiter

Command-line

" For unknown file formats, let auto-detection work
:CsvViewEnable

Configurations

-- Default configuration with auto-detection
{
  parser = {
    delimiter = {
      ft = {
        csv = ",",        -- Always use comma for .csv files
        tsv = "\t",       -- Always use tab for .tsv files
      },
      fallbacks = {       -- Try these delimiters in order for other files
        ",",              -- Comma (most common)
        "\t",             -- Tab
        ";",              -- Semicolon
        "|",              -- Pipe
        ":",              -- Colon
        " ",              -- Space
      },
    },
  },
}

How Auto-Detection Works

  1. If the file type matches ft rules (e.g., .csv → comma), use that delimiter
  2. Otherwise, test each delimiter in fallbacks order
  3. Score each delimiter based on field consistency across lines
  4. Select the delimiter with the highest score
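The scoring steps above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the function names, the naive splitting that ignores quoting, the `#` comment prefix, and the `mean / (1 + variance)` formula are all assumptions chosen to show the idea of rewarding consistent field counts.

```lua
-- Hypothetical sketch: names and the scoring formula are illustrative
-- assumptions, not csvview.nvim's implementation.

--- Count fields in a line by naive splitting (quoting is ignored here).
local function count_fields(line, delim)
  local count, start = 1, 1
  while true do
    local pos = line:find(delim, start, true) -- plain-text search, no patterns
    if not pos then
      break
    end
    count = count + 1
    start = pos + #delim
  end
  return count
end

--- Higher score when lines split into the same number (> 1) of fields.
local function score_delimiter(lines, delim)
  local counts, sum = {}, 0
  for _, line in ipairs(lines) do
    -- Skip blank lines and comment lines ("#" prefix is an assumption).
    if line ~= "" and line:sub(1, 1) ~= "#" then
      local n = count_fields(line, delim)
      counts[#counts + 1] = n
      sum = sum + n
    end
  end
  if #counts == 0 then
    return 0
  end
  local mean = sum / #counts
  if mean <= 1 then
    return 0 -- the delimiter never appears
  end
  local variance = 0
  for _, n in ipairs(counts) do
    variance = variance + (n - mean) ^ 2
  end
  variance = variance / #counts
  return mean / (1 + variance)
end

--- Test each fallback in order and keep the highest-scoring one.
--- Ties go to the earlier entry, so the fallback order acts as a priority.
local function detect_delimiter(lines, fallbacks)
  local best, best_score = fallbacks[1], -1
  for _, delim in ipairs(fallbacks) do
    local score = score_delimiter(lines, delim)
    if score > best_score then
      best, best_score = delim, score
    end
  end
  return best
end

local sample = { "id;name;price", "1;apple;0.5", "2;pear, green;0.75" }
print(detect_delimiter(sample, { ",", "\t", ";", "|", ":", " " })) --> ;
```

In this sample the comma and the space each appear on only one line, so their field counts vary and they score lower than the semicolon, which splits every line into three fields.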

Auto-detect Header

Command-line Header Options

:CsvViewEnable header_lnum=auto  " Auto-detect header (default)
:CsvViewEnable header_lnum=1     " First line as header
:CsvViewEnable header_lnum=none  " No header line

Configurations

-- Default configuration with auto-detection
{
  view = {
    header_lnum = true,  -- Auto-detect header (default)
    sticky_header = {
      enabled = true,
      separator = "",  -- Separator line character
    },
  },
}

How Header Auto-Detection Works

  1. Find the first non-comment line as header candidate
  2. Analyze each column independently using two heuristics:
    • Type Mismatch: If the first row contains text while data rows are numeric, it's likely a header
    • Length Deviation: If the first row's text length differs significantly from data rows, it's likely a header
  3. Combine evidence from all columns to make the final decision
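A rough sketch of the two heuristics, assuming rows have already been split into fields. Everything here is illustrative: the function names, the one-vote-per-heuristic scoring, and the length threshold (`2 * std + 1`) are assumptions, not the values the plugin uses.

```lua
-- Hypothetical sketch: names, voting scheme, and thresholds are assumptions.

local function is_numeric(s)
  return tonumber(s) ~= nil
end

--- Score one column: +1 for each heuristic that points to a header.
local function column_score(header_cell, data_cells)
  if #data_cells == 0 then
    return 0
  end
  local numeric, total_len = 0, 0
  for _, cell in ipairs(data_cells) do
    if is_numeric(cell) then
      numeric = numeric + 1
    end
    total_len = total_len + #cell
  end

  local score = 0

  -- Type mismatch: every data cell is numeric but the candidate is text.
  if numeric == #data_cells and not is_numeric(header_cell) then
    score = score + 1
  end

  -- Length deviation: candidate length is an outlier versus the data cells.
  local mean_len = total_len / #data_cells
  local variance = 0
  for _, cell in ipairs(data_cells) do
    variance = variance + (#cell - mean_len) ^ 2
  end
  local std = math.sqrt(variance / #data_cells)
  if math.abs(#header_cell - mean_len) > 2 * std + 1 then
    score = score + 1
  end

  return score
end

--- Combine per-column evidence; rows[1] is the header candidate.
local function has_header(rows)
  local candidate, total = rows[1], 0
  for col = 1, #candidate do
    local data = {}
    for r = 2, #rows do
      data[#data + 1] = rows[r][col] or ""
    end
    total = total + column_score(candidate[col], data)
  end
  return total > 0
end

local with_header = {
  { "id", "name", "price" },
  { "1", "apple", "0.5" },
  { "2", "pear", "0.75" },
}
print(has_header(with_header)) --> true
```

Here the `id` and `price` columns each contribute a type-mismatch vote, while the all-text `name` column contributes nothing, which is why evidence is collected per column and then summed.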

hat0uma added 10 commits July 13, 2025 20:37
This commit introduces a new module for sniffing CSV dialect properties
such as delimiter, quote character, and header row.

To enable this, the CsvViewParser has been refactored to be more
flexible. It now uses a pluggable 'source' for line retrieval, allowing
it to parse from either a Neovim buffer or an arbitrary array of
strings (e.g., sample lines). This change also involved making
`async_chunksize`, `comments`, and `max_lookahead` explicit parameters
in the parser's methods, rather than relying solely on internal options.
The `delimiter` option now supports a `fallbacks` array in config.
This array allows the parser to automatically detect the delimiter
if no filetype-specific delimiter is configured. This significantly
improves handling of diverse CSV files without manual setup.

…lation

- Fix parser line advancement to handle multi-line fields properly
- Skip comment lines when calculating field consistency scores
- Use accurate record count instead of total line count for variance
- Add explicit CSV filetype delimiter in config

The `opts.parser.delimiter.default` option has been deprecated.
This change adds backward compatibility by mapping the `default`
value to `opts.parser.delimiter.ft.csv`. Users are advised to
migrate to using `opts.parser.delimiter.fallbacks` or configuring
filetype-specific delimiters via the `ft` table. A deprecation
warning will be displayed when the deprecated option is used.
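
The compatibility mapping described above might look something like the sketch below. The function name, the warning text, and the choice to leave an explicitly configured `ft.csv` untouched are assumptions for illustration; the warning callback is injected so the snippet runs outside Neovim.

```lua
-- Hypothetical sketch of the compatibility shim; the function name, the
-- warning text, and the "explicit ft.csv wins" rule are assumptions.
local function migrate_delimiter_opts(delimiter, warn)
  if delimiter.default ~= nil then
    warn("`parser.delimiter.default` is deprecated; use `ft` or `fallbacks` instead")
    delimiter.ft = delimiter.ft or {}
    if delimiter.ft.csv == nil then
      delimiter.ft.csv = delimiter.default -- map the old value to the csv filetype
    end
    delimiter.default = nil
  end
  return delimiter
end

local migrated = migrate_delimiter_opts({ default = ";" }, print)
print(migrated.ft.csv) --> ;
```

Inside Neovim, the callback could wrap `vim.notify(msg, vim.log.levels.WARN)` instead of `print`.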
The header detection algorithm has been significantly improved to be
more robust and accurate.

The new approach analyzes each column independently, aggregating
evidence from two heuristics:

1.  Type Mismatch: Assesses if the header candidate's type differs from
    the inferred data type of the rest of the column.
2.  Length Deviation: Checks if the header candidate's string length
    is an outlier compared to the column's data.

A scoring system based on these heuristics is used to determine if
the first non-comment row is a header.

The `view.header_lnum` option now defaults to `true` and supports automatic
header detection.

This change refactors the CSV dialect detection logic, including
delimiter, quote character, and header line number. Previously, this
logic was intertwined within the parser. Now, dedicated utility functions
(`util.resolve_delimiter`, `util.resolve_quote_char`, `util.resolve_header_lnum`)
handle the resolution, leveraging the `sniffer` module.

The `CsvViewParser` and `CsvView` instances now receive the resolved dialect
parameters explicitly, leading to a cleaner and more modular design.
The `sniffer` module's public interface was also updated to expose
buffer-level detection functions.

… better docs

- Add auto-detection support (header_lnum = true, now default)
- Enhance documentation with clear value explanations
- Add command-line aliases 'auto' and 'none' for better UX
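
As a rough illustration of how the command-line aliases could map onto `view.header_lnum` values: `auto` corresponds to `true` as stated above, while the function name and the mapping of `none` to `false` are assumptions made for this sketch.

```lua
-- Hypothetical sketch: the function name and the `false` value for "none"
-- are assumptions for illustration.
local function parse_header_lnum(value)
  if value == "auto" then
    return true -- auto-detect the header line
  end
  if value == "none" then
    return false -- no header line (assumed representation)
  end
  local n = tonumber(value)
  if n and n >= 1 and n % 1 == 0 then
    return n -- fixed 1-based header line number
  end
  error(("invalid header_lnum: %q"):format(tostring(value)))
end

print(parse_header_lnum("auto")) --> true
print(parse_header_lnum("1"))    --> 1
```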
@hat0uma hat0uma force-pushed the feat/auto_detect_delimiter branch from 4e5ebfc to 9ebb198 July 20, 2025 19:32
@hat0uma hat0uma force-pushed the feat/auto_detect_delimiter branch from 0a7b21a to 918efa7 July 20, 2025 19:52
Introduces `GUIDE.md` to provide detailed documentation on
csvview.nvim's features, configuration options, and API.
This new guide aims to improve user understanding and ease of use.
Updates `README.md` to remove redundant information and
point to the new guide.
@hat0uma hat0uma changed the title from "feat(parser): auto detect delimiter" to "feat(parser): Add automatic delimiter and header detection" Jul 20, 2025
@hat0uma hat0uma merged commit bfd95ed into main Jul 21, 2025
5 checks passed
@hat0uma hat0uma deleted the feat/auto_detect_delimiter branch July 21, 2025 07:53