[BUG] read_excel: fixes handling of multi index header and other corner cases #58899

Open: wants to merge 3 commits into main

Conversation

PL-SalvadorFandino

@PL-SalvadorFandino PL-SalvadorFandino commented Jun 2, 2024

The `read_excel` function has several bugs in how it handles
combinations of the `header`, `skiprows` and `index_col` arguments.

The tests here showcase some of them.
@PL-SalvadorFandino PL-SalvadorFandino changed the title from "Holes" to "[BUG] read_excel: fixes handling of multi index header and other corner cases" Jun 2, 2024
@PL-SalvadorFandino
Author

PL-SalvadorFandino commented Jun 2, 2024

Hi, I can't see why check pre-commit.ci - pr is failing!

The logic in that method did not correctly handle all possible
combinations of the `skiprows`, `header` and `index_col` arguments.

Specifically:

- it was not able to correctly handle a multi-index header
  with holes (for instance, `header=[0,2]`).

- multi-index header and skiprows given as lists.

- forward filling index columns and skiprows given as lists.

- inconsistencies in processing one-element list arguments (for instance,
  `header=1` and `header=[1]`, or `index_col=0` and `index_col=[0]`, were
  handled differently).

The logic has been revamped, because it was not possible to fix all
the errors with local changes.

The major challenge was handling `skiprows` given as a list, as it may
remove rows anywhere (before, between or after the header rows, index
names and data). Also, header row indices reference rows **after** the
skipped rows are removed.

To handle that, we use an intermediate mapping `ixmap`, which goes from
the row indices with skiprows removed to the row indices in `data`.
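
The idea can be illustrated with a small pure-Python sketch. The helper
`build_ixmap` and the sample rows are hypothetical and only mirror the
description above; the actual code in the PR may differ.

```python
def build_ixmap(n_rows, skiprows):
    """Map row positions *after* skipping to row positions in the raw data.

    ixmap[k] is the index in the raw data of the k-th surviving row.
    (Hypothetical helper, not part of pandas.)
    """
    skipped = set(skiprows)
    return [i for i in range(n_rows) if i not in skipped]


# Raw sheet contents, one list per row (made-up illustration data).
data = [
    ["junk", "junk"],        # 0: removed by skiprows
    ["a", "b"],              # 1: first header level
    ["note", "note"],        # 2: removed by skiprows
    ["ignored", "ignored"],  # 3: the "hole" between the header rows
    ["x", "y"],              # 4: second header level
    [1, 2],                  # 5: data
]

skiprows = [0, 2]
header = [0, 2]  # positions counted after the skipped rows are removed

ixmap = build_ixmap(len(data), skiprows)
# Translate header positions back to positions in the raw data.
header_rows = [ixmap[h] for h in header]
header_values = [data[i] for i in header_rows]
```

Here `header=[0, 2]` resolves to raw rows 1 and 4, so the row in the hole
(raw row 3) is never treated as a header level.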

Finally, let me add that IMO, most of the functionality of
_parse_sheet should be moved down into TextParser... but that's work
for another day!
@Aloqeely

This comment was marked as resolved.

@Aloqeely
Member

Aloqeely commented Jun 3, 2024

Looks like I don't have the permissions to do it, but you should be able to autofix by commenting `pre-commit.ci autofix`.

@PL-SalvadorFandino
Author

pre-commit.ci autofix

Development

Successfully merging this pull request may close these issues.

BUG: read_excel multiindex head with holes
3 participants