I encountered this issue as well when running in a virtual machine, or in other environments where memory is strictly limited. It has nothing to do with pandas, numpy or csv; it will always happen if you try to use more memory than you are allowed to, and not only in Python.
The only chance you have is what you already tried: chop the big thing into smaller pieces which fit into memory.
If you ever asked yourself what MapReduce is all about, you just found out on your own… MapReduce would distribute the chunks over many machines; you process the chunks on one machine, one after another.
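A minimal sketch of that chunk-by-chunk approach with pandas (the file name, column name and chunk size are just placeholders for your data):

    import pandas as pd

    # Read the CSV in chunks of 100,000 rows instead of all at once.
    chunks = pd.read_csv("large.csv", chunksize=100_000)

    partial_sums = []
    for chunk in chunks:
        # Do the per-chunk work here and keep only the small result,
        # not the raw rows.
        partial_sums.append(chunk["value"].sum())

    total = sum(partial_sums)
    print(total)

The important point is that only the small per-chunk results are kept around, so memory use stays roughly at one chunk at a time.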
What you found out with the concatenation of the chunk files might indeed be an issue; maybe some copies are needed in that operation… In the end this may save you in your current situation, but if your csv gets a little bit larger you might run into that wall again…
It could also be that pandas is smart enough to only load the individual data chunks into memory once you do something with them, like concatenating them into one big df?
Several things you can try:
- Don’t load all the data at once, but split it into pieces (as in the chunked reading sketch above)
- As far as I know, hdf5 can handle these chunks automatically and only loads the part your program currently works on (see the first sketch after this list)
- Check whether the dtypes are okay; a string ‘0.111111’ needs more memory than a float (second sketch below)
- Think about what you actually need: if there is an address stored as a string, you might not need it for numerical analysis…
- A database can help with accessing and loading only the parts you actually need (e.g. only the 1% of active users, see the last sketch below)
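For the hdf5 point, here is a rough sketch of how it can look with pandas (needs the PyTables package installed; "data.h5", "user_id" and the query condition are made-up examples):

    import pandas as pd

    # Convert the CSV once into an HDF5 store, chunk by chunk.
    with pd.HDFStore("data.h5", mode="w") as store:
        for chunk in pd.read_csv("large.csv", chunksize=100_000):
            # data_columns makes "user_id" queryable later on.
            store.append("data", chunk, data_columns=["user_id"])

    # Later: load only the rows you need, not the whole table.
    subset = pd.read_hdf("data.h5", "data", where="user_id < 1000")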
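For the dtype and "what do you actually need" points, read_csv lets you do both in one go (the column names and dtypes here are only an assumption about your file):

    import pandas as pd

    # Only load the columns you really need and give them compact types.
    df = pd.read_csv(
        "large.csv",
        usecols=["user_id", "value"],                     # skip e.g. the address column
        dtype={"user_id": "int32", "value": "float32"},   # instead of object/float64
    )
    print(df.memory_usage(deep=True))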
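And for the database idea, a sqlite file is often enough; a small sketch (table and column names are invented for the example):

    import sqlite3
    import pandas as pd

    # Push the CSV into a SQLite database once, chunk by chunk.
    con = sqlite3.connect("data.db")
    for chunk in pd.read_csv("large.csv", chunksize=100_000):
        chunk.to_sql("users", con, if_exists="append", index=False)

    # Later: pull back only the slice you actually need,
    # e.g. the small fraction of active users.
    active = pd.read_sql_query(
        "SELECT * FROM users WHERE active = 1", con
    )
    con.close()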