A Data Scientist’s Laptop

What configuration should a data scientist go for?

A KDnuggets poll points to a Windows machine with 3-4 cores and 5-16GB of RAM.

A StackExchange thread recommends a Linux system with 16GB RAM, a 1TB SSD and a GPU.

A Quora thread converges around 16GB RAM.

RAM matters. Our experience is that RAM is the biggest bottleneck with large datasets. Things speed up by an order of magnitude when all your processing is in-memory. 16GB of RAM is an ideal configuration. Do not go below 8GB.
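As a rough way to answer "will it fit in memory?", here is a back-of-the-envelope sketch. The function name, the 8 bytes per value and the 3x headroom for working copies are our assumptions, not measurements. Real footprints depend on the data types and the library you load the data with.

```python
def fits_in_memory(rows, cols, ram_gb, bytes_per_value=8, headroom=3.0):
    """Estimate the raw size of a numeric table and check it against RAM.

    bytes_per_value=8 assumes 64-bit numbers; headroom=3.0 is an assumed
    allowance for intermediate copies made during processing.
    """
    estimated_gb = rows * cols * bytes_per_value / 1024**3
    return estimated_gb, estimated_gb * headroom <= ram_gb


# Hypothetical dataset: 50 million rows, 20 numeric columns, 16GB laptop
size_gb, ok = fits_in_memory(rows=50_000_000, cols=20, ram_gb=16)
print(f"{size_gb:.1f} GB raw, fits in 16GB with headroom: {ok}")
```

If the check fails, that is usually the point where sampling the data or moving the work off the laptop starts to pay off.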

Big drives. The next biggest factor is hard disk speed, but you don’t necessarily need an SSD. If your data fits in memory, most disk access is sequential, where an SSD is only ~2X faster than a regular hard disk but much more expensive. (If you’re running a database, an SSD makes more sense.) Among hard disks, larger ones are also faster due to higher storage density, so prefer the 1TB disks.

The CPU doesn’t matter. Make sure you have more cores than data-intensive processes, but other than that, it’s not an issue.
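One way to keep the number of heavy processes below the core count is to derive it from the machine itself. This is a minimal sketch; `max_workers` and the choice to keep one core free are our assumptions, not a rule.

```python
import os


def max_workers(reserve=1):
    """Number of heavy processes to run while leaving `reserve` cores free."""
    cores = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, cores - reserve)


print(f"{os.cpu_count()} cores detected; run at most {max_workers()} heavy processes")
```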

However, one common theme we find is that heavy data science work happens on the cloud, not on the laptop. That’s what you need to be looking for: a good cloud environment that you can connect to.

For example, this Frontanalytics report recommends a basic laptop with long battery life, the ability to multi-task (i.e. multiple cores), and a backlit keyboard for the night.

Maybe you just need a USB port in your arm.

Damn. Not only did he not install it, he sutured a 'Vista-Ready' sticker onto my arm.

Sharing analysis

Here are three links that you should go through this week:

  1. If you’re looking for ways to share your analysis, RStudio 1.0 is out. The biggest feature is R Notebooks, which are like Jupyter notebooks. At Gramener, we’re using RStudio Server to collaborate. Airbnb’s Knowledge Repo is another option.
  2. If you’re filtering data, be aware of Simpson’s paradox. It explains how Derek Jeter’s overall batting average is higher than David Justice’s even though the latter had the better average in each individual year.
  3. Prepare for data science interviews with this compilation of 109 data science interview questions.
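To see how Simpson’s paradox plays out in the batting example, here is a small sketch. It uses the hits and at-bats commonly quoted for the 1995 and 1996 seasons. Treat those figures, and the helper name `batting_summary`, as illustrative assumptions rather than something taken from the linked article.

```python
# (hits, at_bats) per season; figures as commonly quoted for this example
seasons = {
    "Derek Jeter": {1995: (12, 48), 1996: (183, 582)},
    "David Justice": {1995: (104, 411), 1996: (45, 140)},
}


def batting_summary(years):
    """Return per-year averages and the combined average for one player."""
    per_year = {year: hits / at_bats for year, (hits, at_bats) in years.items()}
    total_hits = sum(hits for hits, _ in years.values())
    total_at_bats = sum(at_bats for _, at_bats in years.values())
    return per_year, total_hits / total_at_bats


summary = {player: batting_summary(years) for player, years in seasons.items()}
for player, (per_year, combined) in summary.items():
    yearly = ", ".join(f"{year}: {avg:.3f}" for year, avg in per_year.items())
    print(f"{player}: {yearly}, combined: {combined:.3f}")
```

Justice comes out ahead in each season, yet Jeter is ahead once the seasons are pooled. The at-bats are split very unevenly across the two years, so the combined averages weight the seasons differently for each player.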

Speaking of Simpson’s paradox, be wary of statistical significance as well:

Awesome public datasets

Here are three links you should go through this week:

  1. A catalogue of open public datasets, grouped by domain. (With this list, you won’t be short of sample data for any domain.)
  2. How to stay aware of the latest in data science? A short collection of newsletters with their frequency and quality.
  3. What mistakes do we make when deciding based on data? The cognitive bias cheat sheet lists dozens of biases we have and condenses them into a poster.

Be warned – all three links suffer from this same problem: