Let your customers talk

Here are three links from the data science world that are worth your time.

  1. Actionable data science in sales. Let your customers talk for 4+ minutes. Don’t talk about your company for more than 2 minutes. And more data-driven advice.
  2. How to get into natural language processing. YCombinator has started a series of blog posts, titled Paths, on getting started with emerging fields. The first is on NLP.
  3. The current state of automated machine learning. An overview of libraries that automatically apply machine learning techniques to datasets.

But remember: amidst all this big data, we have a bigger small-data problem.

Which charting library to use?

Here are three links you should go through this week.

  1. What I learned recreating one chart using 24 tools is an excellent comparison by Lisa of 12 visualisation applications and 12 libraries, with a good summary of which tool to use when.
  2. Can we predict flu deaths with ML and R? Read this R notebook for a step-by-step walk-through of predicting whether a patient will survive or not. (There’s also a part 2 that improves on this model.)
  3. One of our colleagues nearly lost a piece of analysis recently. Here’s the most boring / valuable advice she can get on how to organise analysis — or any form of work for that matter. Of course, you could always learn git.

    If that doesn’t fix it, git.txt contains the phone number of a friend of mine who understands git. Just wait through a few minutes of ‘It’s really pretty simple, just think of branches as…’ and eventually you’ll learn the commands that will fix everything.

A Data Scientist’s Laptop

What configuration should a data scientist go for?

A KDnuggets poll indicates a 3-4 core 5-16GB Windows machine.

A StackExchange thread recommends a 16GB RAM, 1TB SSD Linux system with a GPU.

A Quora thread converges around 16GB RAM.

RAM matters. Our experience is that RAM is the biggest bottleneck with large datasets. Things speed up by an order of magnitude when all your processing is in-memory. 16GB of RAM is an ideal configuration. Do not go below 8GB.
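As a rough way to apply this, here is a minimal Python sketch that estimates whether a dense numeric table fits in RAM. The function names, the 8 bytes per value, and the 2x headroom for working copies are illustrative assumptions, not measurements.

```python
GIB = 1024 ** 3  # bytes in one GiB


def estimate_table_bytes(n_rows, n_cols, bytes_per_value=8):
    """Rough size of a dense numeric table (assumes 8-byte values by default)."""
    return n_rows * n_cols * bytes_per_value


def fits_in_ram(n_rows, n_cols, ram_gb=16, headroom=2.0, bytes_per_value=8):
    """True if the table, times an assumed headroom factor, fits in ram_gb GiB."""
    needed = estimate_table_bytes(n_rows, n_cols, bytes_per_value) * headroom
    return needed <= ram_gb * GIB


if __name__ == "__main__":
    # Compare a few table sizes against a 16GB machine.
    for rows in (1_000_000, 100_000_000, 1_000_000_000):
        print(rows, "rows x 10 columns fits in 16GB:", fits_in_ram(rows, 10))
```

Real datasets with strings or mixed types will usually take more space than this estimate, so treat the result as a lower bound.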

Big drives. The next biggest factor is hard disk speed. But you don’t necessarily need an SSD. If your data fits in memory, then most data access is sequential, and for sequential reads an SSD is only ~2X faster than a regular hard disk, but much more expensive. (If you’re running a database, then an SSD makes more sense.) Among hard disks, larger ones are also faster due to higher storage density. So prefer the 1 TB disks.

The CPU doesn’t matter. Make sure you have more cores than data-intensive processes, but other than that, it’s not an issue.
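One way to follow the cores rule is to cap the number of worker processes below the core count. A minimal Python sketch, where `pick_worker_count` is a hypothetical helper and keeping one core free is an assumption:

```python
import os


def pick_worker_count(cores=None, reserve=1):
    """Number of data-intensive processes to run, keeping `reserve` cores free."""
    if cores is None:
        # os.cpu_count() can return None when the count is undetermined.
        cores = os.cpu_count() or 1
    # Never go below one worker, even on a single-core machine.
    return max(1, cores - reserve)


if __name__ == "__main__":
    print("Suggested worker processes:", pick_worker_count())
```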

However, one common theme we find is that heavy data science work happens on the cloud, not on the laptop. That’s what you need to be looking for — a good cloud environment that you can connect to.

For example, this Frontanalytics report recommends a basic laptop with long battery life, the ability to multi-task (i.e. multiple cores), and a backlit keyboard for the night.

Maybe you just need a USB port in your arm.

Damn. Not only did he not install it, he sutured a 'Vista-Ready' sticker onto my arm.