
Large table runs into memory limit #15

Open
dyenzer opened this issue Apr 13, 2017 · 9 comments · May be fixed by #34
Comments

@dyenzer

dyenzer commented Apr 13, 2017

First I'd like to say what a great tool this is. It would have saved a lot of time if it had worked for me out of the gate (I know nothing about SQL/DBA and inherited a rather large database to manage). Now on to the problem...

I have a table that's 105 Million rows large that I'm trying to transfer to a newer database (Oracle 11g to MySQL 5.7). It looks like your tool loads all of the SQL transactions into memory, and with 40GB RAM / 20GB Swap on the target server, it will reach row 60 Million before it simply says 'Killed'. I'm not sure if this is something you can easily fix, like having the option to write to file instead of memory, but I wanted to at least make you aware of it as a limitation.



@seanharr11
Owner

Hi @dyenzer , yeah there is definitely a scalability concern in what you've described above. It's not that the SQL transactions are loaded into memory - these are actually written to disk (to a temporary file in the working directory) before they are loaded into the Target DB via the DB's BULK IMPORT tool (i.e. mysqlimport, COPY FROM, etc...).

The issue is that all data from the source database is loaded into memory, and is then iterated over row-by-row to create the .sql file that is bulk loaded into the database.

It shouldn't be too difficult a fix, but we basically need to use generators to "stream" the data through ETLAlchemy, rather than naively loading all of the data into a list (the current implementation); see the sketch after the list below.

I foresee the need to iterate over each table's rows twice:

  1. First Iteration: Schema Inference and Clean-up (to infer the type of the column, check for empty columns, reduce size of text columns, etc...)
  2. Second Iteration: Data transformation, Bulk-import SQL file generation
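
For concreteness, here is a minimal sketch of the generator-based streaming idea, assuming a SQLAlchemy 1.x engine and a reflected source table; the helper names (stream_rows, write_bulk_import_file) and the DSN are hypothetical, not etlalchemy's actual internals:

    from sqlalchemy import create_engine, MetaData, Table

    def stream_rows(engine, table, chunk_size=10000):
        # Yield rows in chunks so at most `chunk_size` rows sit in memory at once.
        # stream_results asks the driver for a server-side cursor where supported.
        with engine.connect() as conn:
            result = conn.execution_options(stream_results=True).execute(table.select())
            while True:
                rows = result.fetchmany(chunk_size)
                if not rows:
                    break
                for row in rows:
                    yield row

    def write_bulk_import_file(engine, table, path):
        # Second iteration: transform each row and append it to the bulk-import
        # file, instead of building the whole file's contents in memory.
        with open(path, "w") as f:
            for row in stream_rows(engine, table):
                f.write("\t".join("" if v is None else str(v) for v in row) + "\n")

    engine = create_engine("oracle://user:password@host:1521/sid")   # hypothetical DSN
    metadata = MetaData()
    big_table = Table("big_table", metadata, autoload=True, autoload_with=engine)
    write_bulk_import_file(engine, big_table, "big_table.sql")

The schema-inference pass can simply call stream_rows() again for a fresh pass over the data, so memory stays bounded by the chunk size rather than the table size.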

I will leave this open, as I may have some time coming up this summer (June) to implement these changes. In the meantime, feel free to take a crack at it yourself and open up a PR!

@Anmol-Tuple

@seanharr11 Hey, when will you fix this issue?

@Anmol-Tuple

Hi, I am using MSSQL to fetch roughly 25 million records, and I find that this query:

    resultProxy = self.engine.execute(T_src.select())

is using most of my memory. Is there any way to avoid this? I have already altered the surrounding code so that the list iteration happens in a single loop; can you suggest an improvement for the resultProxy fetch itself?
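
One workaround to sketch here (assuming self.engine is a plain SQLAlchemy engine; process() below is a hypothetical stand-in for whatever your loop does with each row) is to request a streaming result and fetch in bounded batches instead of materializing the whole ResultProxy:

    # Instead of: resultProxy = self.engine.execute(T_src.select())
    with self.engine.connect() as conn:
        result = conn.execution_options(stream_results=True).execute(T_src.select())
        while True:
            batch = result.fetchmany(50000)   # only ~50k rows in memory at a time
            if not batch:
                break
            for row in batch:
                process(row)                  # hypothetical per-row handler

Whether the driver actually streams depends on the dialect (pymssql/pyodbc behave differently), but fetchmany() at least keeps the Python-side row list bounded.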

@Anmol-Tuple

@seanharr11 When will you resolve this issue?

@seanharr11
Owner

@Anmol-Tuple this is a pretty big undertaking... It will take considerable time, and without tests in place it will be very difficult to ensure this won't break etlalchemy.

This is, sadly, the by-product of me not knowing a great deal about building maintainable OSS 2 years ago.

That said, if you'd like to take a shot at beginning to implement this on a new branch, feel free to do so while I develop tests. I would encourage this, as I would encourage ANY other maintainer to help in this effort (especially to develop tests!).

@rajrohith

rajrohith commented Aug 26, 2017

@seanharr11
Hi Sean,
I have a quick question. I need to move all databases from MSSQL to a PostgreSQL cloud instance, and each of the many databases has 50-plus tables, so the data volume may be more than a TB per database. How can I achieve this scenario? I want to run it as a service in the cloud using Flask. I need your guidance on how to do this.
Thanks in advance,
Raj

@seanharr11
Owner

Hi @rajrohith you can try the tool initially and see how it works. If the process gets killed by the OS for running out of memory (i.e. it simply prints 'Killed', as described above), then you'll know the tool won't support the size of your tables.

There is definitely a need to add functionality to support large tables (i.e. by writing to disk intermittently and only loading buffers of data). This should be done on a feature-branch, and merged in once tests are in place for master.
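
As one possible reading of "writing to disk intermittently", sketched with hypothetical helpers (stream_rows from the earlier comment, and an import_fn callback standing in for the per-dialect bulk import step such as mysqlimport or COPY FROM): split the dump into chunk files that are imported and deleted one at a time, so neither memory nor scratch disk has to hold the whole table.

    import os

    def migrate_in_chunks(engine, table, import_fn, rows_per_chunk=1000000):
        # Write `rows_per_chunk` rows to a scratch file, bulk-import it via
        # `import_fn(path)`, delete it, then move on to the next chunk.
        def flush(chunk_no, lines):
            path = "{0}.chunk{1}.sql".format(table.name, chunk_no)
            with open(path, "w") as f:
                f.write("\n".join(lines) + "\n")
            import_fn(path)
            os.remove(path)                 # free the scratch space before the next chunk

        chunk, chunk_no = [], 0
        for row in stream_rows(engine, table):      # generator from the sketch above
            chunk.append("\t".join("" if v is None else str(v) for v in row))
            if len(chunk) == rows_per_chunk:
                flush(chunk_no, chunk)
                chunk, chunk_no = [], chunk_no + 1
        if chunk:
            flush(chunk_no, chunk)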

@josteinl

Hi @seanharr11

This is a great project. Thank you for your efforts.

I am trying to transfer an Oracle 11.2 database to MySQL 5.7, but some of my tables are very big and do not fit into memory. I have tried to rewrite the migrate() method into a two-phase analyze-and-write operation.

I am planning to send you a pull request, but there are some issues with my code:

  1. Performance when writing to file.
  2. While writing to file, transform_data() calls transform_rows(). What does transform_rows() do? I have commented it out in my branch, and it probably has to be moved into the dump_data() method, which now dumps only one row at a time.
  3. I have changed some of the self.logger.info() statements from using .format() to '%s'-style formatting; I will revert if you don't like that.
  4. I have changed the maximum line length from 80 to 120 characters in the code. I will revert if you do not like that.
  5. I experience problems transferring BLOBs. It works for the most part, but some data is corrupted. I expect it has to do with some missing escaping of \n or quotes (see the sketch after this list). I will investigate more, and register a new issue if I am not able to solve this.
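
On point 5, a guess at the kind of escaping a LOAD DATA-style bulk import expects; this is a hypothetical helper, not code from the PR, and it assumes MySQL's default field/line terminators (tab-separated fields, newline-terminated rows, backslash escapes):

    def escape_field(value):
        # Backslash, tab, newline and carriage return must be backslash-escaped,
        # otherwise BLOB/text content is corrupted or splits rows on import.
        if value is None:
            return r"\N"                         # LOAD DATA's NULL marker
        if isinstance(value, bytes):
            value = value.decode("latin-1")      # map raw bytes 1:1 onto characters
        elif not isinstance(value, str):
            value = str(value)
        return (value.replace("\\", "\\\\")
                     .replace("\t", "\\t")
                     .replace("\n", "\\n")
                     .replace("\r", "\\r"))

    def dump_row(row, out_file):
        # One row per line, tab-separated, matching mysqlimport's defaults.
        out_file.write("\t".join(escape_field(v) for v in row) + "\n")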

Best regards @josteinl

@josteinl josteinl linked a pull request Apr 30, 2018 that will close this issue
@musashiXXX

musashiXXX commented Oct 31, 2018

I have a table that's 105 Million rows large that I'm trying to transfer to a newer database (Oracle 11g to MySQL 5.7). It looks like your tool loads all of the SQL transactions into memory, and with 40GB RAM / 20GB Swap on the target server, it will reach row 60 Million before it simply says 'Killed'.

I've had the same issue migrating a database with a table that was over 20GB in size. The easiest thing to do is increase your swap space. This can be done dynamically/temporarily by doing the following:

    # create a ~110 GiB file of zeros (110M blocks of 1024 bytes each)
    dd if=/dev/zero of=/tmp/tempswap.img bs=1024 count=110M
    # format the file as swap space
    mkswap /tmp/tempswap.img
    # enable it (swapon may warn unless the file's permissions are 0600)
    sudo swapon /tmp/tempswap.img

In my situation, increasing swap to ~110GB "resolved" this issue and allowed the table to be migrated. You may need to go higher than that. My rule of thumb is:

  required swap = (table size * 6)

And by table size I mean the total size of the table on disk, not the number of rows. Getting that info differs depending on the source database (e.g. dba_segments on Oracle, information_schema.TABLES on MySQL).
