Memory error when using pandas read_csv

Asked on October 24, 2018 in Windows.


  • 7 Answer(s)

    In case the csv file is corrupted, like this:

    name, age, birthday
    Raj, 30, 1985-01-01
    Shama, 34, 1981-01-01
    Reenu, 26, 1989-01-01
    Akash, 40+, None-Ur-Bz

    specifying dtype={'age': int} will break the .read_csv() command, because it cannot cast "40+" to int.
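
    A minimal sketch of that failure mode, rebuilding the corrupted example above in memory with io.StringIO (the data and the dtype choice are just the example from this answer):

    import io
    import pandas as pd

    corrupt_csv = io.StringIO(
        "name,age,birthday\n"
        "Raj,30,1985-01-01\n"
        "Shama,34,1981-01-01\n"
        "Reenu,26,1989-01-01\n"
        "Akash,40+,None-Ur-Bz\n"
    )

    try:
        # Forcing the age column to int fails on the row containing "40+".
        df = pd.read_csv(corrupt_csv, dtype={'age': int})
    except ValueError as err:
        print('read_csv raised:', err)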

    Answered on October 24, 2018.

    The memory problem showed up with a simple read of a tab-delimited text file around 1 GB in size (over 5.5 million records), and this solved it:

    import pandas as pd

    df = pd.read_csv(myfile, sep='\t')                    # didn't work, memory error
    df = pd.read_csv(myfile, sep='\t', low_memory=False)  # worked fine, in under 30 seconds

    Spyder 3.2.3, Python 2.7.13, 64-bit

    Answered on October 24, 2018.

    Using Pandas on a Linux box, I faced several memory leaks that only got resolved after upgrading Pandas to the newest version by cloning and building it from GitHub.

    Answered on October 24, 2018.

    I encountered this issue as well when I was running in a virtual machine, or somewhere else where the memory is strictly limited. It has nothing to do with pandas or numpy or csv; it will always happen if you try to use more memory than you are allowed to, and not only in Python.

    The only chance you have is what you already tried: chop the big thing down into smaller pieces which fit into memory.

    If you ever asked yourself what MapReduce is all about, you found out by yourself: MapReduce would distribute the chunks over many machines, while you process the chunks on one machine, one after another.

    What you found out with the concatenation of the chunk files might be an issue indeed; maybe some copies are needed in that operation. In the end this may save you in your current situation, but if your csv gets a little bit larger you might run into that wall again.

    Could it also be that pandas is smart enough to only load the individual data chunks into memory when you actually do something with them, like concatenating them into one big DataFrame?

    Several things you can try:

    • Don't load all the data at once, but split it into pieces (see the sketch after this list)
    • As far as I know, HDF5 can do this chunking automatically and only loads the part your program currently works on
    • Check whether the dtypes are okay; a string '0.111111' needs more memory than a float
    • Ask what you actually need: if the address is stored as a string, you might not need it for numerical analysis
    • A database can help with accessing and loading only the parts you actually need (e.g. only the 1% active users)
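
    A minimal sketch of the chunked approach from the first bullet, assuming a hypothetical large.csv with an age column and per-chunk results small enough to concatenate at the end:

    import pandas as pd

    results = []
    # chunksize makes read_csv return an iterator of DataFrames instead of one big frame.
    for chunk in pd.read_csv('large.csv', chunksize=100000):
        # Shrink each chunk before keeping it (filter, aggregate, drop columns);
        # keeping every raw chunk around would defeat the purpose.
        results.append(chunk[chunk['age'] > 30])

    df = pd.concat(results, ignore_index=True)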
    Answered on January 19, 2019.

    There is no error for Pandas 0.12.0 and NumPy 1.8.0.

    I have managed to create a big DataFrame, save it to a csv file and then successfully read it back. Please see the example here. The size of the file is 554 MB (it even worked for a 1.1 GB file; that took longer, and to generate the 1.1 GB file use a frequency of 30 seconds). I have 4 GB of RAM available, though.

    My suggestion is to try updating Pandas. Another thing that could be useful is running your script from the command line rather than from Visual Studio, just as you are not using Visual Studio for R (this was already suggested in the comments to your question), so that it has more resources available.
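
    A rough sketch of the kind of generate-save-read round trip described above (the linked example is not reproduced here; the row count, the ten random float columns and the 30-second frequency are assumptions):

    import numpy as np
    import pandas as pd

    # One row every 30 seconds, ten float64 columns: a csv of several hundred MB.
    index = pd.date_range('2000-01-01', periods=2000000, freq='30S')
    df = pd.DataFrame(np.random.randn(len(index), 10), index=index)

    df.to_csv('big.csv')                       # write the file
    df2 = pd.read_csv('big.csv', index_col=0)  # and read it back in
    print(df2.shape)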

    Answered on January 19, 2019.
    Add Comment
    Memory error when using pandas read_csv. I am trying to do something fairly simple, reading a large csv file into a pandas dataframe. The code either fails with a MemoryError, or just never finishes. … The file I am trying to read is 366 MB; the code above works if I cut the file down to something short (25 MB).
    Answered on February 17, 2019.

    Windows memory limitation

    Memory errors happen a lot with Python when using the 32-bit version on Windows. This is because 32-bit processes only get 2 GB of memory to play with by default.

    Tricks for lowering memory usage

    If you are not using 32-bit Python on Windows but are looking to improve your memory efficiency while reading csv files, there is a trick.

    The pandas.read_csv function takes an option called dtype. This lets pandas know what types exist inside your csv data.

    How this works

    By default, pandas will try to guess what dtypes your csv file has. This is a very heavy operation because while it is determining the dtype, it has to keep all raw data as objects (strings) in memory.
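
    To see roughly what that string-versus-number overhead looks like, here is a small sketch (the values are made up) comparing the memory footprint of the same column stored as objects and as floats:

    import pandas as pd

    as_strings = pd.Series(['0.111111'] * 1000000)   # kept as Python string objects
    as_floats = pd.to_numeric(as_strings)            # converted to float64

    print(as_strings.memory_usage(deep=True))        # tens of megabytes
    print(as_floats.memory_usage(deep=True))         # about 8 MB, 8 bytes per value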

    Example

    Let’s say your csv looks like this:

    name, age, birthday
    Alice, 30, 1985-01-01
    Bob, 35, 1980-01-01
    Charlie, 25, 1990-01-01

    This example is of course no problem to read into memory, but it’s just an example.

    If pandas were to read the above csv file without any dtype option, the age would be stored as strings in memory until pandas has read enough lines of the csv file to make a qualified guess.

    I think the default in pandas is to read 1,000,000 rows before guessing the dtype.

    Solution

    Specifying dtype={'age': int} as an option to .read_csv() will let pandas know that age should be interpreted as a number. This saves you lots of memory.
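
    A minimal sketch of that option in use, assuming the example data above is saved as people.csv (the file name is hypothetical):

    import pandas as pd

    df = pd.read_csv(
        'people.csv',
        skipinitialspace=True,    # the sample rows have a space after each comma
        dtype={'age': 'int64'},   # parse age straight into integers, no object column needed
    )
    print(df.dtypes)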

    Problem with corrupt data

    However, if your csv file would be corrupted, like this:

    name, age, birthday
    Alice, 30, 1985-01-01
    Bob, 35, 1980-01-01
    Charlie, 25, 1990-01-01
    Dennis, 40+, None-Ur-Bz

    Then specifying dtype={'age':int} will break the .read_csv() command, because it cannot cast "40+" to int. So sanitize your data carefully!
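
    If you cannot sanitize the file up front, one workaround (a sketch, not this answer's own code, reusing the hypothetical people.csv) is to read the column as strings and coerce it afterwards, so unparseable values become NaN instead of raising:

    import pandas as pd

    df = pd.read_csv('people.csv', skipinitialspace=True, dtype={'age': str})
    # errors='coerce' turns values like "40+" into NaN; the column becomes float64.
    df['age'] = pd.to_numeric(df['age'], errors='coerce')
    print(df['age'])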

    Answered on February 21, 2019.

