You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Create new SQLMesh project using sqlmesh init postgres --template empty - Postgres is the default dialect because its normalization strategy is LOWERCASE, i.e. it treats quoted identifiers as case-sensitive, and lowercases unquoted ones.
Create the following seed model:
MODEL (
name sqlmesh_example.demo_seed,
kind SEED (
path'../seeds/demo_seed.csv'
),
columns(
"camelCaseColumn"TEXT,
),
);
Create the following seed file under seeds/, named demo_seed.csv:
"camelCaseColumn"
foo
Run sqlmesh plan.
This currently produces the following error:
File "sqlmesh/core/model/definition.py", line 1182, in render_seed
df.loc[:, string_columns] = df[string_columns].mask(
File "sqlmesh/.venv/lib/python3.10/site-packages/pandas/core/frame.py", line 3899, in __getitem__
indexer = self.columns._get_indexer_strict(key, "columns")[1]
File "sqlmesh/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6115, in _get_indexer_strict
self._raise_if_missing(keyarr, indexer, axis_name)
File "sqlmesh/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6176, in _raise_if_missing
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['camelCaseColumn'], dtype='object')] are in the [columns]"
The reason this happens is because we normalize the seed file's column by lowercasing it, whereas the column property lists it with quotes, making it case-sensitive. We're essentially always treating CSV columns as case-insensitive, since there's no way to quote them and preserve those quotes in the dataframe returned by pd.read_csv.
Some ideas to fix this:
Always treat CSV columns as case-sensitive. This may be problematic because it's a constraint that propagates downstream, thus requiring quotes everywhere. One way a user could overcome this would be to make a new model that acts as a dependency inversion layer for that seed, e.g. SELECT "camelCaseColumn" AS camelCaseColumn FROM sqlmesh_example.demo_seed. Then, the column would not require quotes downstream from that "staging" model.
Continue normalizing the CSV columns, but treat the names enumerated in the columns model property as the source of truth. So for the example above, we'd rename the single dataframe column into camelCaseColumn to match the corresponding column in the seed's columns_to_types.
The text was updated successfully, but these errors were encountered:
To reproduce the bug:
sqlmesh init postgres --template empty
- Postgres is the default dialect because its normalization strategy isLOWERCASE
, i.e. it treats quoted identifiers as case-sensitive, and lowercases unquoted ones.seeds/
, nameddemo_seed.csv
:sqlmesh plan
.This currently produces the following error:
The reason this happens is because we normalize the seed file's column by lowercasing it, whereas the
column
property lists it with quotes, making it case-sensitive. We're essentially always treating CSV columns as case-insensitive, since there's no way to quote them and preserve those quotes in the dataframe returned bypd.read_csv
.Some ideas to fix this:
SELECT "camelCaseColumn" AS camelCaseColumn FROM sqlmesh_example.demo_seed
. Then, the column would not require quotes downstream from that "staging" model.columns
model property as the source of truth. So for the example above, we'd rename the single dataframe column intocamelCaseColumn
to match the corresponding column in the seed'scolumns_to_types
.The text was updated successfully, but these errors were encountered: