Tool

CSV ↔ JSON.

RFC 4180-correct CSV parsing — handles double-quoted fields, "" escapes, embedded commas, and multi-line cells. Type-infers numbers and booleans (toggleable). Local only — pipe sales reports, exports, or analytics dumps without leaking the data.

Direction
CSV → JSON
Rows
3
Columns
4

Delim
Input
Samples
Output
[
  {
    "name": "Ada Lovelace",
    "role": "mathematician",
    "founded": 1815,
    "active": true
  },
  {
    "name": "Babbage, Charles",
    "role": "engineer",
    "founded": 1791,
    "active": true
  },
  {
    "name": "Alan Turing",
    "role": "cryptanalyst",
    "founded": 1912,
    "active": false
  }
]

RFC 4180 vs everyone else.

Anyone who has shipped CSV import code in production has been bitten by the gap between what RFC 4180 specifies and what actually arrives in the wild. The RFC, published in October 2005, codifies comma-separated values with CRLF line endings, double-quote field wrapping, and doubled quotes ("") as the in-field escape. Excel running on a German or French Windows install ships semicolons by default because the locale uses the comma as a decimal separator. Tab-separated values, often given the .tsv extension but sometimes hidden inside .csv, sidestep the comma-vs-decimal collision entirely and remain the lingua franca of bioinformatics, log exports, and clipboard pastes from spreadsheet apps.

Pipe-separated files surface most often in telecom billing systems and older healthcare interchange formats, where the pipe was chosen precisely because it almost never appears in free-text address or name fields. Then there is the fixed-column format — technically not delimited at all but routinely lumped under the CSV umbrella by anyone integrating with regional banks, clearing houses, or government tax authorities. Columns 1 through 10 are an account number, 11 through 18 are a date in YYYYMMDD, 19 through 30 are an amount with leading zeros and an implicit two-decimal scale.

The three dialects you encounter most as moving targets: RFC 4180, Excel-CSV, and the format from Postgres COPY ... TO STDOUT CSV. RFC 4180 uses CRLF and escapes embedded quotes by doubling. Excel-CSV is mostly compatible but accepts LF on read, sometimes emits a UTF-8 BOM on write, and famously reinterprets values during round-trips. Postgres COPY exposes QUOTE, ESCAPE, DELIMITER, and NULL as independent parameters; a Postgres-emitted file may use \N for nulls in the default text format or an empty unquoted field in CSV mode.

DialectDelimiterQuote escapeLine endingNull marker
RFC 4180commadoubled quoteCRLFundefined
Excel (US)commadoubled quoteCRLF or LFempty field
Excel (EU)semicolondoubled quoteCRLFempty field
Postgres COPYconfigurabledoubled or backslashLF\N or configurable
TSV (IANA)tabnone typicalLFempty field
Sniff the delimiter, don't trust the extension

If you accept CSV uploads from end users, do not trust the file extension or MIME type. Sniff the delimiter from the first kilobyte by counting candidate characters and picking the one with the most consistent column count. Python's csv.Sniffer, PapaParse's delimiter: "" autodetect, and R's readr::guess_delim() all implement variants — and all occasionally guess wrong on adversarial input.

When 1 GB won't fit in RAM.

Once a CSV crosses roughly 100 megabytes, the naive read-everything-then-parse strategy stops being viable, and somewhere between 1 and 10 gigabytes it stops being possible at all on commodity hardware. The fix is a streaming parser that emits rows incrementally and never holds the full file in memory. PapaParse, the de facto browser-side CSV library since around 2014, exposes step and chunk callbacks: the former fires per row, the latter per parsed chunk, and both let the caller backpressure by pausing the parser while a downstream sink (IndexedDB write, fetch upload, Web Worker post) catches up.

In Node, csv-parse from Adaltas, paired with csv-stringify for the reverse direction, integrates cleanly with the standard stream.Readable and stream.Transform interfaces. A pipeline can read from a file or S3 GetObject, parse, transform, and write to Postgres COPY FROM STDIN without the full payload ever materializing in heap. Python's standard library csv.reader is a lazy iterator over a file handle — open with newline="", wrap in csv.DictReader, iterate with a generator that yields transformed rows.

For larger workloads, Polars (1.0 in July 2024) and DuckDB (1.0 in mid-2024) both offer streaming CSV scans that spill to disk and parallelize across cores, out-performing hand-rolled Python loops by an order of magnitude on multi-gigabyte files. Pandas read_csv accepts a chunksize argument that returns an iterator of DataFrames — the right escape hatch when you want streaming but already have downstream code written against Pandas.

The pathological case is a single row larger than available RAM. It is rarer than you would think but does happen — a CSV cell containing a base64-encoded image, a serialized JSON blob from a misconfigured ETL job, a free-text comments field. RFC 4180 places no upper bound on field length. The defensive answer: set a hard ceiling on field size at parser level, reject the row with a clear error, and surface it to an operator rather than letting the process OOM. csv-parse exposes max_record_size; Python's csv module has csv.field_size_limit(), defaulting to 131072 bytes — worth raising explicitly.

When 02134 becomes 2134.

CSV is a fundamentally untyped format, and any converter promising to "infer types" is making aesthetic guesses dressed up as engineering. The classic failure mode is leading-zero codes: a US ZIP code column containing 02134 becomes the integer 2134 the moment a parser decides the column is numeric, and the leading zero is gone forever. Phone numbers, employee IDs, ISBNs, and any code system using a check digit suffer the same fate. In 2020 the HUGO Gene Nomenclature Committee renamed several human genes specifically because Excel kept turning MARCH1 and SEPT2 into dates.

Date strings are the next layer of pain. A column of 04/05/2026 is April 5th in the US and May 4th nearly everywhere else, with no signal in the file to disambiguate. ISO 8601 (2026-04-05) is unambiguous and should be preferred for any new format, but legacy CSVs almost never use it. Scientific notation is another quiet trap: a column of order IDs that happens to contain 1E5 will be silently coerced to 100000.

Boolean-looking strings (true/false, yes/no, Y/N, 1/0) and null markers (NA, NULL, n/a, -, empty) compound the problem because conventions vary by source: SAS exports use . for missing numerics, R exports use NA, Postgres COPY uses \N, spreadsheets use empty fields.

Default to strings; opt into types

The defensive default is to treat every column as a string until the user opts into a type for that specific column. Auto-inference is fine as a UI affordance — show what the inferred type would be — but the conversion itself should require explicit confirmation for any column with values the parser cannot prove safe to coerce. This is the discipline dbt seed files, Great Expectations, and Pandas dtype= all push toward.

Three ways to arrange the rows.

Once the CSV is parsed, the question of what JSON to emit has more nuance than first appears. The most common shape is an array of objects, one per row, with column headers as keys. This is what most APIs return, what most JavaScript code expects, and what JSON.parse produces directly into an iterable. It is also the most verbose: every row repeats every key, and on a million-row file the key strings alone can dominate the payload size. Gzip helps a lot — repeated keys compress beautifully — but if the consumer is not gzipping, the overhead is real.

The column-oriented shape, an object of arrays, is what Pandas's to_json(orient="columns") and Apache Arrow's JSON exports emit. Dramatically more compact for wide tables, aligns naturally with how analytics engines store data internally, and is what plotting libraries like Plotly and Vega-Lite often prefer. The trade-off: streaming consumption is harder; you cannot process row N without having read all columns up through row N, defeating the streaming model for large files.

NDJSON, also called JSON Lines or .jsonl, emits one JSON object per line with no enclosing array. It is the format BigQuery, Snowflake, and most log-ingestion pipelines prefer for bulk loads, because each line is independently parseable and a corrupt row does not poison the rest of the file. For ingestion pipelines and append-only event streams, NDJSON is almost always the right answer. For API responses consumed by a browser, an array of objects is friendlier. For analytics handoffs to a notebook, column-oriented JSON or — frankly — Parquet, will outperform any JSON shape.

CSV is the interoperability dialect nobody planned.

CSV predates RFC 4180 (2005) by decades. Spreadsheets, databases, and BI tools all speak some variant. Postgres has COPY ... CSV; MySQL has LOAD DATA INFILE; Snowflake, BigQuery, Redshift all read CSV at staggering throughput because the format streams: parser state is one row, and rows can be split across worker shards. JSON wins the API battle but loses the bulk-load battle — JSON's wrapping array and per-row braces are dead weight at billion-row scale. NDJSON / JSON Lines (one object per line) was the compromise, and most modern tools accept either NDJSON or CSV for ingest.

Excel CSV ≠ standard CSV

When a French or German Excel saves "CSV", it uses semicolons (because the comma is the decimal separator in those locales). It also writes a UTF-8 BOM and CRLF line endings. Strict RFC 4180 parsers gag on the BOM. Most production CSV libraries strip it silently — a fact worth knowing if you're writing the parser yourself.

Found this useful?