Let your customers talk

Here are three links from the data science world that are worth your time.

  1. Actionable data science in sales. Let your customers talk for 4+ minutes. Don't talk about your company for more than 2 minutes. And more data-driven advice.
  2. How to get into natural language processing. YCombinator has started a series of blog posts titled Paths on getting started with emerging fields. The first is on NLP.
  3. The current state of automated machine learning. An overview of libraries that automatically apply machine learning techniques to datasets.

But remember: amidst all this big data, we have a bigger small-data problem.

Which charting library to use?

Here are three links you should go through this week.

  1. What I learned recreating one chart using 24 tools is an excellent comparison by Lisa of 12 visualisation applications and 12 libraries, with a good summary of which tool to use when.
  2. Can we predict flu deaths with ML and R? Read this R notebook for a step-by-step walk-through of predicting whether a patient will survive or not. (There's also a part 2 that improves on this model.)
  3. One of our colleagues nearly lost a piece of analysis recently. Here's the most boring / valuable advice she can give on how to organise analysis, or any form of work for that matter. Of course, you could always learn git.

    If that doesn’t fix it, git.txt contains the phone number of a friend of mine who understands git. Just wait through a few minutes of ‘It’s really pretty simple, just think of branches as…’ and eventually you’ll learn the commands that will fix everything.

A Data Scientist’s Laptop

What configuration should a data scientist go for?

A KDnuggets poll indicates a 3-4 core, 5-16GB Windows machine.

A StackExchange thread recommends a 16GB RAM, 1TB SSD Linux system with a GPU.

A Quora thread converges around 16GB RAM.

RAM matters. Our experience is that RAM is the biggest bottleneck with large datasets. Things speed up by an order of magnitude when all your processing is in-memory. 16GB of RAM is an ideal configuration. Do not go below 8GB.
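As a rough sanity check before picking a configuration, you can estimate whether a dataset will fit in memory. A minimal sketch (the 8-bytes-per-value figure assumes 64-bit numeric columns, and the example dataset shape is illustrative, not from the polls above):

```python
def estimate_memory_gb(rows, cols, bytes_per_value=8):
    """Rough in-memory footprint in GB: rows * columns * bytes per value.

    Assumes 64-bit numeric values by default; real overhead (indexes,
    object/string columns, copies made during processing) can be 2-3x this.
    """
    return rows * cols * bytes_per_value / 1024**3

# A 100-million-row, 20-column numeric dataset:
print(round(estimate_memory_gb(100_000_000, 20), 1))  # ~14.9 GB
```

By this estimate, such a dataset squeezes into 16GB but not 8GB, and any intermediate copies push it out of memory entirely, which is when the cloud option below starts to look attractive.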

Big drives. The next biggest driver is hard disk speed. But you don't necessarily need an SSD. If your data fits in memory, then most data access is sequential. An SSD is only ~2X faster than a regular hard disk, but much more expensive. (If you're running a database, then an SSD makes more sense.) Larger hard disks are also faster due to higher storage density. So prefer 1TB disks.

The CPU doesn't matter. Make sure you have more cores than data-intensive processes, but other than that, it's not an issue.

However, one common theme we find is that heavy data science work happens on the cloud, not on the laptop. That's what you need to be looking for: a good cloud environment that you can connect to.

For example, this Frontanalytics report recommends a basic laptop with long battery life, the ability to multi-task (i.e. multiple cores), and a backlit keyboard for the night.

Maybe you just need a USB port in your arm.

    Damn. Not only did he not install it, he sutured a 'Vista-Ready' sticker onto my arm.