Big data strategies

Do I have “big data”? Oddly, this is not a straightforward question, for two reasons. By @practicaldatascience.org.

Data usually takes up more space in memory than it does on disk, and many operations make temporary copies, so the fact that you have 16GB of RAM doesn’t mean you can easily work with a 14GB file. As a general rule, you need at least twice as much memory as your file takes up when first loaded.
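That rule of thumb lends itself to a quick sanity check before you even try to load a file. Here is a minimal sketch in Python (mine, not the article’s): it assumes the third-party psutil package is installed, and the file name, function name, and safety_factor are all hypothetical.

```python
import os

import psutil  # third-party: pip install psutil


def probably_fits_in_memory(path, safety_factor=2.0):
    """Rough check of the 'need at least 2x the file size' rule of thumb."""
    file_bytes = os.path.getsize(path)               # size of the file on disk
    free_bytes = psutil.virtual_memory().available   # RAM currently available
    return file_bytes * safety_factor <= free_bytes


# Hypothetical usage: warn before trying to load a large CSV.
if not probably_fits_in_memory("transactions.csv"):
    print("This file may not fit comfortably in RAM; consider chunking.")
```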

If you have big data, you basically have four options:

  • Use chunking to trim and thin out your data so it does fit in memory (this works if the data you were given is huge, but the final analysis dataset you want to work with is small; see the sketch after this list)
  • Buy more memory. Seriously, consider it
  • Minimize the penalties of working off your hard drive (not usually practical for data science)
  • Break your job into pieces and distribute over multiple machines
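
As an illustration of the first option, here is a minimal chunking sketch with pandas (my example, not code from the article); the file name, chunk size, and the `amount` filter are hypothetical stand-ins for whatever trimming your analysis actually needs.

```python
import pandas as pd

# Read the file in manageable pieces, trim each piece, and keep only the
# small filtered result, so the full dataset never sits in memory at once.
kept_pieces = []
for chunk in pd.read_csv("transactions.csv", chunksize=100_000):
    kept_pieces.append(chunk[chunk["amount"] > 1_000])  # hypothetical filter

small_df = pd.concat(kept_pieces, ignore_index=True)
print(f"Rows kept for analysis: {len(small_df)}")
```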

If your program starts using more space than you have main memory, your operating system will usually just start using your hard drive for extra space without telling you (this is called “virtual memory”, and is nice in that it prevents your computer from crashing, though it will slow things down a lot). As a result, you won’t always get an error message if you try to load a file that is much bigger than main memory. The article has more details on how to check whether your data is actually fitting in memory. Interesting read!
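
One quick way to do that kind of check in Python (again just a sketch, not necessarily how the article does it) is to compare what your process and your loaded data are using against the machine’s total RAM, assuming pandas and the third-party psutil package:

```python
import pandas as pd
import psutil

df = pd.read_csv("transactions.csv")  # hypothetical dataset

process_gb = psutil.Process().memory_info().rss / 1e9  # RAM the whole process is using
df_gb = df.memory_usage(deep=True).sum() / 1e9          # RAM the DataFrame itself is using
total_gb = psutil.virtual_memory().total / 1e9          # physical RAM on the machine

print(f"process: {process_gb:.1f} GB | dataframe: {df_gb:.1f} GB | machine RAM: {total_gb:.1f} GB")
# If the process figure is creeping toward (or past) machine RAM, you are
# probably spilling into virtual memory, which explains sudden slowdowns.
```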

[Read More]

Tags analytics big-data app-development management cio how-to