Seed column & model column mismatch due to normalization #2698

georgesittas · 2024-05-29T19:13:41Z

To reproduce the bug:

Create new SQLMesh project using sqlmesh init postgres --template empty - Postgres is the default dialect because its normalization strategy is LOWERCASE, i.e. it treats quoted identifiers as case-sensitive, and lowercases unquoted ones.
Create the following seed model:

MODEL (
    name sqlmesh_example.demo_seed,
    kind SEED (
        path '../seeds/demo_seed.csv'
    ),
    columns(
        "camelCaseColumn" TEXT,
    ),
);

Create the following seed file under seeds/, named demo_seed.csv:

"camelCaseColumn"
foo

Run sqlmesh plan.

This currently produces the following error:

  File "sqlmesh/core/model/definition.py", line 1182, in render_seed
    df.loc[:, string_columns] = df[string_columns].mask(
  File "sqlmesh/.venv/lib/python3.10/site-packages/pandas/core/frame.py", line 3899, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
  File "sqlmesh/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6115, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "sqlmesh/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6176, in _raise_if_missing
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['camelCaseColumn'], dtype='object')] are in the [columns]"

The reason this happens is because we normalize the seed file's column by lowercasing it, whereas the column property lists it with quotes, making it case-sensitive. We're essentially always treating CSV columns as case-insensitive, since there's no way to quote them and preserve those quotes in the dataframe returned by pd.read_csv.

Some ideas to fix this:

Always treat CSV columns as case-sensitive. This may be problematic because it's a constraint that propagates downstream, thus requiring quotes everywhere. One way a user could overcome this would be to make a new model that acts as a dependency inversion layer for that seed, e.g. SELECT "camelCaseColumn" AS camelCaseColumn FROM sqlmesh_example.demo_seed. Then, the column would not require quotes downstream from that "staging" model.
Continue normalizing the CSV columns, but treat the names enumerated in the columns model property as the source of truth. So for the example above, we'd rename the single dataframe column into camelCaseColumn to match the corresponding column in the seed's columns_to_types.

The text was updated successfully, but these errors were encountered:

georgesittas added the Bug Something isn't working label May 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Seed column & model column mismatch due to normalization #2698

Seed column & model column mismatch due to normalization #2698

georgesittas commented May 29, 2024 •

edited

Loading

Seed column & model column mismatch due to normalization #2698

Seed column & model column mismatch due to normalization #2698

Comments

georgesittas commented May 29, 2024 • edited Loading

georgesittas commented May 29, 2024 •

edited

Loading