- RFC 4180 [1] Section 2.6 says:
"Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes."
If the DMS output isn’t quoting fields that contain commas, that’s technically invalid CSV.
A small normalization step before COPY (or ensuring the writer emits RFC-compliant CSV in the first place) would make the pipeline robust without renaming countries or changing delimiters.
That way, if/when the DMS output is fixed upstream, nothing downstream needs to change.
[1] https://www.rfc-editor.org/rfc/rfc4180.html
- I'm not sure how "commas inside strings in CSVs can cause bugs" becomes newsworthy, but I guess even the vibecoding generation needs to learn the same old lessons.
- I really don't understand why people think it's a good idea to use csv. In english settings, the comma can be used as 1000-delimiter in large numbers, e.g. 1,000,000 for on million, in German, the comma is used as decimal place, e.g. 1,50€ for 1 euro and 50 cents. And of course, commas can be used free text fields. Given all that, it is just logical to use tsv instead!
by franciscop
1 subcomments
- This very clearly seems like a bug either in their DMS script, or in the DMS job that they don't directly control, since CSV clearly allows for escaping commas (by just quoting them). Would love to see a bug report being submitted upstream as well as part of the "fix".
- Unrelated to the fundamental issue (a part of your pipeline generates invalid CSV), I would never store the name of the country like this. The country's name is "The Republic of Moldova" and I would store it like this.
Sure, the most common collation scheme for country names is to sort ignoring certain prefixes like "The Republic of", "The", "People's Democratic...", etc. but this is purely a presentation layer issue (how to order a list of countries to a user) that should be independent of your underlying data.
Sure "hacking" the name of the country like this to make the traditional alphabetical ordering match a particular ordering desired to aid human navigation has a lot of history (encyclopedia indexes, library indexes, record stores, etc.) but that was in the age of paper and physical filing systems.
Store the country name correctly and then provide a custom sort or multiple custom sorts where such functionality belongs - in the presentation layer.
- "Sanitize at the boundary"
Ah, but what _is_ the boundary, asks Transnistria?
- Considering the scope, this could be more easily resolved by just stripping ", Republic of" from that specific string (assuming "Moldova" on its own is sufficient).
by davecahill
1 subcomments
- I was expecting a Markdown-related .md issue. :)
- The majority of countries official names are in this format. We just use the short forms. "Republic of ..." is the most common formal country name: https://en.wikipedia.org/wiki/List_of_sovereign_states
by shalmanese
0 subcomment
- Did you really name your breakaway republic Sealand'); DROP TABLE Countries;--?
- just use TSV instead of CSV by default
- I dont understand people who dont validate their inputs and outputs - a count of expected values would've prevented such a basic mistake
- I personaly would shy away from binary formats whenever possible. For my column based files i use TSV or the pipe char as delimiter. even excel allowes this files if you include a "del=|" as first line
by the_origami_fox
1 subcomments
- That's a cool map.
- Why use Shopify if they can't even get their CSVs working right? Worse bugs have to be lurking.
- Typical vibe coding quality.
It's not the serialisation that is correct, it must be the data.
Lets rename a country because we are not capable of handling commas in data.
by renewiltord
0 subcomment
- Come on man. What are we doing here. This is not even anything interesting like Norway being interpreted as False in YAML. This is just a straightforward escaping issue.
by inevletter
0 subcomment
- Huge skill issue. Nothing to see here.