Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seed column & model column mismatch due to normalization #2698

Open
georgesittas opened this issue May 29, 2024 · 0 comments
Open

Seed column & model column mismatch due to normalization #2698

georgesittas opened this issue May 29, 2024 · 0 comments
Labels
Bug Something isn't working

Comments

@georgesittas
Copy link
Contributor

georgesittas commented May 29, 2024

To reproduce the bug:

  1. Create new SQLMesh project using sqlmesh init postgres --template empty - Postgres is the default dialect because its normalization strategy is LOWERCASE, i.e. it treats quoted identifiers as case-sensitive, and lowercases unquoted ones.
  2. Create the following seed model:
MODEL (
    name sqlmesh_example.demo_seed,
    kind SEED (
        path '../seeds/demo_seed.csv'
    ),
    columns(
        "camelCaseColumn" TEXT,
    ),
);
  1. Create the following seed file under seeds/, named demo_seed.csv:
"camelCaseColumn"
foo
  1. Run sqlmesh plan.

This currently produces the following error:

  File "sqlmesh/core/model/definition.py", line 1182, in render_seed
    df.loc[:, string_columns] = df[string_columns].mask(
  File "sqlmesh/.venv/lib/python3.10/site-packages/pandas/core/frame.py", line 3899, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
  File "sqlmesh/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6115, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "sqlmesh/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6176, in _raise_if_missing
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['camelCaseColumn'], dtype='object')] are in the [columns]"

The reason this happens is because we normalize the seed file's column by lowercasing it, whereas the column property lists it with quotes, making it case-sensitive. We're essentially always treating CSV columns as case-insensitive, since there's no way to quote them and preserve those quotes in the dataframe returned by pd.read_csv.

Some ideas to fix this:

  1. Always treat CSV columns as case-sensitive. This may be problematic because it's a constraint that propagates downstream, thus requiring quotes everywhere. One way a user could overcome this would be to make a new model that acts as a dependency inversion layer for that seed, e.g. SELECT "camelCaseColumn" AS camelCaseColumn FROM sqlmesh_example.demo_seed. Then, the column would not require quotes downstream from that "staging" model.
  2. Continue normalizing the CSV columns, but treat the names enumerated in the columns model property as the source of truth. So for the example above, we'd rename the single dataframe column into camelCaseColumn to match the corresponding column in the seed's columns_to_types.
@georgesittas georgesittas added the Bug Something isn't working label May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant