
Large table runs into memory limit #15

Open
dyenzer opened this issue Apr 13, 2017 · 9 comments · May be fixed by #34
Comments

@dyenzer

dyenzer commented Apr 13, 2017

First I'd like to say what a great tool this is. It would have saved a lot of time if it had worked for me out of the gate (I know nothing about SQL/DBA and inherited a rather large database to manage). Now on to the problem...

I have a table that's 105 Million rows large that I'm trying to transfer to a newer database (Oracle 11g to MySQL 5.7). It looks like your tool loads all of the SQL transactions into memory, and with 40GB RAM / 20GB Swap on the target server, it will reach row 60 Million before it simply says 'Killed'. I'm not sure if this is something you can easily fix, like having the option to write to file instead of memory, but I wanted to at least make you aware of it as a limitation.



@seanharr11
Owner

Hi @dyenzer , yeah there is definitely a scalability concern in what you've described above. It's not that the SQL transactions are loaded into memory - these are actually written to disk (to a temporary file in the working directory) before they are loaded into the Target DB via the DB's BULK IMPORT tool (i.e. mysqlimport, COPY FROM, etc...).

The issue is that all data from the source database is loaded into memory, and is then iterated over row-by-row to create the .sql file that is bulk loaded into the database.

It shouldn't be too difficult a fix, but we basically need to use generators to "stream" the data through ETLAlchemy, rather than naively loading all of the data into a list (the current implementation); see the sketch after the list below.

I foresee the need to iterate over each table's rows twice:

  1. First Iteration: Schema Inference and Clean-up (to infer the type of the column, check for empty columns, reduce size of text columns, etc...)
  2. Second Iteration: Data transformation, Bulk-import SQL file generation
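
For concreteness, here is a minimal sketch of the generator-based streaming idea, assuming a SQLAlchemy 1.x engine and a reflected source table; the helper names (stream_rows, write_bulk_import_file) and the DSN are hypothetical, not etlalchemy's actual internals:

    from sqlalchemy import create_engine, MetaData, Table

    def stream_rows(engine, table, chunk_size=10000):
        # Yield rows in chunks so at most `chunk_size` rows sit in memory at once.
        # stream_results asks the driver for a server-side cursor where supported.
        with engine.connect() as conn:
            result = conn.execution_options(stream_results=True).execute(table.select())
            while True:
                rows = result.fetchmany(chunk_size)
                if not rows:
                    break
                for row in rows:
                    yield row

    def write_bulk_import_file(engine, table, path):
        # Second iteration: transform each row and append it to the bulk-import
        # file, instead of building the whole file's contents in memory.
        with open(path, "w") as f:
            for row in stream_rows(engine, table):
                f.write("\t".join("" if v is None else str(v) for v in row) + "\n")

    engine = create_engine("oracle://user:password@host:1521/sid")   # hypothetical DSN
    metadata = MetaData()
    big_table = Table("big_table", metadata, autoload=True, autoload_with=engine)
    write_bulk_import_file(engine, big_table, "big_table.sql")

The schema-inference pass can simply call stream_rows() again for a fresh pass over the data, so memory stays bounded by the chunk size rather than the table size.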

I will leave this open, as I may have some time coming up this summer (June) to implement these changes. In the meantime, feel free to take a crack at it yourself and open up a PR!

@Anmol-Tuple

@seanharr11 Hey, when will you fix this issue?

@Anmol-Tuple

Hi, I am using MSSQL to fetch roughly 25 million records, and I find that this query:

    resultProxy = self.engine.execute(T_src.select())

is using most of my memory. Is there any way to avoid this? I have already altered the surrounding code so that the list iteration happens in a single loop; can you suggest an improvement for the resultProxy fetch itself?
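
One workaround to sketch here (assuming self.engine is a plain SQLAlchemy engine; process() below is a hypothetical stand-in for whatever your loop does with each row) is to request a streaming result and fetch in bounded batches instead of materializing the whole ResultProxy:

    # Instead of: resultProxy = self.engine.execute(T_src.select())
    with self.engine.connect() as conn:
        result = conn.execution_options(stream_results=True).execute(T_src.select())
        while True:
            batch = result.fetchmany(50000)   # only ~50k rows in memory at a time
            if not batch:
                break
            for row in batch:
                process(row)                  # hypothetical per-row handler

Whether the driver actually streams depends on the dialect (pymssql/pyodbc behave differently), but fetchmany() at least keeps the Python-side row list bounded.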

@Anmol-Tuple

@seanharr11 When will you resolve this issue?

@seanharr11
Owner

@Anmol-Tuple this is a pretty big undertaking... It will take considerable time, and without tests in place it will be very difficult to ensure this won't break etlalchemy.

This is, sadly, the by-product of me not knowing a great deal about building maintainable OSS 2 years ago.

That said, if you'd like to take a shot at beginning to implement this on a new branch, feel free to do so while I develop tests. I would encourage this, as I would encourage ANY other maintainer to help in this effort (especially to develop tests!).

@rajrohith

rajrohith commented Aug 26, 2017

@seanharr11
Hi Sean,
I have a quick question. I need to move all databases from MSSQL to a PostgreSQL cloud instance, and each of the many databases has 50-plus tables, so the data volume may be more than a TB per database. How can I achieve this scenario? I want to run it as a service in the cloud using Flask. I need your guidance on how to do this.
Thanks in advance,
Raj

@seanharr11
Owner

Hi @rajrohith you can try the tool initially and see how it works. If the process gets killed by the OS for running out of memory (i.e. it simply prints 'Killed', as described above), then you'll know the tool won't support the size of your tables.

There is definitely a need to add functionality to support large tables (i.e. by writing to disk intermittently and only loading buffers of data). This should be done on a feature-branch, and merged in once tests are in place for master.
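
As one possible reading of "writing to disk intermittently", sketched with hypothetical helpers (stream_rows from the earlier comment, and an import_fn callback standing in for the per-dialect bulk import step such as mysqlimport or COPY FROM): split the dump into chunk files that are imported and deleted one at a time, so neither memory nor scratch disk has to hold the whole table.

    import os

    def migrate_in_chunks(engine, table, import_fn, rows_per_chunk=1000000):
        # Write `rows_per_chunk` rows to a scratch file, bulk-import it via
        # `import_fn(path)`, delete it, then move on to the next chunk.
        def flush(chunk_no, lines):
            path = "{0}.chunk{1}.sql".format(table.name, chunk_no)
            with open(path, "w") as f:
                f.write("\n".join(lines) + "\n")
            import_fn(path)
            os.remove(path)                 # free the scratch space before the next chunk

        chunk, chunk_no = [], 0
        for row in stream_rows(engine, table):      # generator from the sketch above
            chunk.append("\t".join("" if v is None else str(v) for v in row))
            if len(chunk) == rows_per_chunk:
                flush(chunk_no, chunk)
                chunk, chunk_no = [], chunk_no + 1
        if chunk:
            flush(chunk_no, chunk)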

@josteinl

Hi @seanharr11

This is a great project. Thank you for your efforts.

I am trying to transfer an Oracle 11.2 database to MySQL 5.7, but some of my tables are very big and do not fit into memory. I have tried to rewrite the migrate() method into a two-phase analyze-and-write operation.

I am planning to send you a pull request, but there are some issues with my code:

  1. Performance when writing to file.
  2. While writing to file, transform_data() calls transform_rows(). What does transform_rows() do? I have commented it out in my branch, and it probably has to be moved into the dump_data() method, which now dumps only one row at a time.
  3. I have changed some of the self.logger.info() statements from using .format() to '%s'-style formatting; I will revert if you don't like that.
  4. I have changed the maximum line length from 80 to 120 characters in the code. I will revert if you do not like that.
  5. I experience problems transferring BLOBs. It works for the most part, but some data is corrupted. I expect it has to do with some missing escaping of \n or quotes (see the sketch after this list). I will investigate more, and register a new issue if I am not able to solve this.
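
On point 5, a guess at the kind of escaping a LOAD DATA-style bulk import expects; this is a hypothetical helper, not code from the PR, and it assumes MySQL's default field/line terminators (tab-separated fields, newline-terminated rows, backslash escapes):

    def escape_field(value):
        # Backslash, tab, newline and carriage return must be backslash-escaped,
        # otherwise BLOB/text content is corrupted or splits rows on import.
        if value is None:
            return r"\N"                         # LOAD DATA's NULL marker
        if isinstance(value, bytes):
            value = value.decode("latin-1")      # map raw bytes 1:1 onto characters
        elif not isinstance(value, str):
            value = str(value)
        return (value.replace("\\", "\\\\")
                     .replace("\t", "\\t")
                     .replace("\n", "\\n")
                     .replace("\r", "\\r"))

    def dump_row(row, out_file):
        # One row per line, tab-separated, matching mysqlimport's defaults.
        out_file.write("\t".join(escape_field(v) for v in row) + "\n")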

Best regards @josteinl

@josteinl josteinl linked a pull request Apr 30, 2018 that will close this issue
@musashiXXX

musashiXXX commented Oct 31, 2018

I have a table that's 105 Million rows large that I'm trying to transfer to a newer database (Oracle 11g to MySQL 5.7). It looks like your tool loads all of the SQL transactions into memory, and with 40GB RAM / 20GB Swap on the target server, it will reach row 60 Million before it simply says 'Killed'.

I've had the same issue migrating a database with a table that was over 20GB in size. The easiest thing to do is increase your swap space. This can be done dynamically/temporarily by doing the following:

    # create a ~110 GiB file of zeros (110M blocks of 1024 bytes each)
    dd if=/dev/zero of=/tmp/tempswap.img bs=1024 count=110M
    # format the file as swap space
    mkswap /tmp/tempswap.img
    # enable it (swapon may warn unless the file's permissions are 0600)
    sudo swapon /tmp/tempswap.img

In my situation, increasing swap to ~110GB "resolved" this issue and allowed the table to be migrated. You may need to go higher than that. My rule of thumb is:

  required swap = (table size * 6)

And by table size I mean the total size of the table on disk, not the number of rows. Getting that info differs depending on the source database (e.g. dba_segments on Oracle, information_schema.TABLES on MySQL).
