
Feeling Sick or Unlucky

Let’s play doctor.

Let’s say you have a patient who shows signs of a disease that’s tricky to diagnose.  In fact, of the people who show these symptoms, only 1 in 100 have the disease.  There is a test, but it detects the disease only 90% of the time.  The test can also fail by giving a “false positive” (i.e., the results show you have the disease when, in fact, you do not) 9% of the time.

How do you feel about them odds?

Since this case study is appearing on this blog, you are correct in thinking there is a trick.  In the real world, physicians are confronted with these types of odds all the time.  To make matters worse, the percentages are even murkier, for example, with overlapping or contradicting studies.  The neuroscientist Daniel Levitin, in his book ‘A Field Guide to Lies’, cites a study that indicates that 90% of physicians make the error of conducting the test.
What’s the error?  The odds are that about 9 in 10 positive results are actually false positives.  If the test shows that your patient has the disease, you are about nine times more likely to be wrong than right.  This test is, therefore, useless, or worse than useless.
One thing I have learned is that maintaining one or two numbers and the relationship between them is relatively easy.  When you have to deal with three numbers, even if the math is easy, things get hard really fast.  To do the above math in your head, you have to do a few things.
  1. Track the “patient does not have the disease” part of the equation.  Using the numbers from above, 99 of 100 people do not have the disease, and a 9% false positive rate means about 9 of them test positive anyway.
  2. Compare that to the “correct positive”: the 1 person who has the disease gets a positive result 90% of the time.  Let’s round up and say it’s one person.

Nine false positives to one correct positive.  Feeling lucky?
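For readers who would rather not do this in their head, here is a minimal sketch of the same arithmetic in Python.  The function and variable names are my own for illustration; the three input numbers are the ones from the case above.  Without the rounding, the ratio comes out closer to ten false positives per correct positive, which only strengthens the point.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Chance that a patient with a positive result actually has the disease."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Numbers from the case study above.
prevalence = 0.01           # 1 in 100 people with these symptoms have the disease
sensitivity = 0.90          # the test detects the disease 90% of the time
false_positive_rate = 0.09  # 9% of healthy patients still test positive

# Expected counts per 100 patients tested.
true_pos_per_100 = 100 * prevalence * sensitivity
false_pos_per_100 = 100 * (1 - prevalence) * false_positive_rate

ppv = positive_predictive_value(prevalence, sensitivity, false_positive_rate)

print(f"Correct positives per 100 patients: {true_pos_per_100:.2f}")
print(f"False positives per 100 patients:   {false_pos_per_100:.2f}")
print(f"Chance a positive result is right:  {ppv:.1%}")
```

Try changing the prevalence to 0.10 or 0.50 to see how strongly the usefulness of the test depends on how rare the disease is.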


Infographic on Commuter Volumes Across Different Cities

I saw this article on the Future of Daily Travel, part of the Good Cities Project.  The slightly-larger-than-thumbnail-sized infographic seemed intriguing.  So I bookmarked the page so I could come back to it later when I had the chance to really dig into it.

So many things going for it!  2.5D isometrics, used as graphic, colors, use of different types of transportation, use of XY space.  I couldn’t wait to get to it.

What a disappointment when I finally had a chance to look into it!

Any graphic has to be relatively easy to interpret.  The best allow you to “start easy” and get some initial “aha!”s, and then dig deeper, like some fine piece of art.  It’s OK to challenge and stretch your audience, but the payoff has to be there.  The opposite of this is to have the audience think, “is that it?”.  Or to make them feel dumb (I discuss this in a separate blog post).

It took me a while to realize that the icons (sprites for boats, cars, London double-decker buses, trains, etc.) had no meaning.  When you have something that takes up that much room and color, it has to mean something.  It turns out that the LENGTH of the transportation does matter… BUT the location in XY space does not matter.  Or if it does, I can’t figure out the meaning.

Does the location of the person with the “Cost per Commuter” matter?  I dunno.  The # is nice, but there’s no way to compare it across the cities.  Maybe there is… I can’t tell.

Also, I got lost in the 2 colored lines per city.  Sure, most of the transport requires two lines in the real world… is that what we’re trying to show?

There’s a lot of info here, and it still draws you in.  But I found it difficult to filter out what was informative and what was just cute.

What’s the “informative to cute” ratio in your graphics?

A Tale of Two Banks and the “Three Friends Go Out to Lunch” Brain Teaser

I recently had two separate conversations with two different people, both in banking.

One is a VP at a Regional Bank in California.  The other is the CEO of a bank based in the Middle East, growing by expansion and acquisitions.  Two different worlds, night and day.  Both deal with loans and try to manage risk while trying to grow.  But the similarities end quickly.

After the conversations, I was reminded of the following “Three Friends go out to Lunch” problem I read years ago in some brain teaser book.  Enjoy!  My response is here.

Three friends go out to a restaurant and order three lunch specials.  The bill comes out to $30 (with tax, no tip).  They decide to leave $10 each.  As they walk out the door, the owner realizes that the bill was incorrectly calculated: it should have been $25 including tax.  He sends the waiter out with five one dollar bills.  The waiter, feeling slighted by not having received a tip, keeps $2 and gives a dollar bill to each of the three friends.

So each friend paid $9 (originally paid $10 each, but each got a dollar back).  The waiter keeps $2.  Three times $9 is $27, plus the two dollars is $29.  Where is the other dollar?

How Much Oil is Leaking in the Gulf?

The oil continues to leak, and we are now starting to see the oil hit the shores.  We have live images from the leak 5,000 feet below the water surface.  But how much oil is actually leaking from the wellhead?

Initial estimates (Day 5) from the US Coast Guard and BP placed the flow at 1,000 barrels/day.  Last week, the “official” government group revised the estimate to 20,000 – 40,000 barrels/day.  Less than a week later, the estimate is now at 35,000 – 60,000 barrels/day.

When the late May estimates came out, there was an interesting quote from Ira Leifer, University of California, Santa Barbara: “It would be irresponsible and unscientific to claim an upper bound, …it’s safe to say that the total amount is significantly larger.”  He wants to make sure the estimate has an asterisk because he wants “to stand up for academic integrity.”  In fact, there’s a whole document available from the university that explains how the scientists came up with their estimate.  (from WSJ article here)

But I suspect that most of us will not care too much for the actual method or the details.  Maybe a chart like below is helpful since it summarizes the “growth” of the estimates over time.  By doing so, I am (on purpose? inadvertently?) suggesting a story and a conclusion.  What do you read from it?

| Report Date | Barrels/Day | Source / Reported by | Method | Link |
|---|---|---|---|---|
| April 24 (Day 5) | 1,000 | USCG, BP, thru cbc.ca | Info from ROV (remote operating vehicles) and surface oil slick | Link |
| April 28 (Day 9) | 5,000 | NOAA | Satellite pictures | Link |
| May 12 (Day 23) | 70,000 | Steven Wereley, Purdue, for an NPR story | particle image velocimetry, based on videotape | Link |
| May 12 (Day 23) | 20,000 – 100,000 | Eugene Chang, UC-Berkeley, for an NPR story | “Pencil and paper” based on pipe diameter (from video) | Link |
| May 27 (Day 34) | 12,000 – 19,000 or 12,000 – 25,000 (depending on source) | Flow Rate Technical Group (NOAA, USCG, Minerals Mgt) | Based on multiple methods (blog entry author’s guess) | Link 1, Link 2 |
| Jun 10 (Day 52) | 20,000 – 40,000 or 25,000 – 30,000 (depending on source) | Plume Modeling Team of The Flow Rate Technical Group | Revised from earlier, based on additional video from BP | Link |
| Jun 15 (Day 57) | 35,000 – 60,000 | Deepwater Horizon Incident Joint Information Center, reported by cnn | “based on updated information and scientific assessments,” | Link |

So what can we learn from this?  We all (think we) want lots of data.  It’s helpful when it’s summarized in a way that seems to make sense.  But when we are confronted with data that we are not used to seeing (how many of us deal in BARRELS of oil, or work with flow rates?) we need some anchor, some comparisons, something that helps us make sense of numbers.

No matter how you count it, this is a lot of oil.  But does it really matter that it’s 15,000 or 60,000 barrels/day?  If you are part of the cleanup or planning for the collection, it may help you with the planning.  But you’re also going to want to know some other info, such as how long it will flow, how the flow has changed over time and the related “total leakage”.  Even with this last bit of info, you’re more interested in the amount that ends up on the shore or the amount that is actually possible to reclaim.
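To get a rough handle on “total leakage”, here is a back-of-the-envelope sketch in Python.  It uses the Jun 15 (Day 57) range of 35,000 – 60,000 barrels/day and the standard 42 US gallons per oil barrel.  The big assumption, and it is almost certainly wrong, is that the flow has been constant at that rate since Day 1; the names in the code are mine, chosen only for illustration.

```python
GALLONS_PER_BARREL = 42   # standard US oil barrel
DAYS_ELAPSED = 57         # Day 57, the date of the latest estimate above

# Latest reported range, in barrels/day.
LOW_RATE = 35_000
HIGH_RATE = 60_000


def total_barrels(rate_per_day, days):
    """Total leaked, ASSUMING a constant flow rate for the whole period."""
    return rate_per_day * days


low_total = total_barrels(LOW_RATE, DAYS_ELAPSED)
high_total = total_barrels(HIGH_RATE, DAYS_ELAPSED)

print(f"Low estimate:  {low_total:,} barrels "
      f"({low_total * GALLONS_PER_BARREL:,} gallons)")
print(f"High estimate: {high_total:,} barrels "
      f"({high_total * GALLONS_PER_BARREL:,} gallons)")
```

The gap between the low and high totals is larger than the low total's distance from a million barrels, which is another way of saying that the range matters more than any single number.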

For most of us, the accuracy of the flow rates does not matter so much.  It’s a lot of oil, and we need some way to get a handle on it.  Most of us will not remember the actual number (or in this case, the changing range of numbers).

Besides, no one will really know the true amount that has spilled.

Just for Fun

Non Sequitur

What is a “good model”?

Chances are, you have worked with models.  You may build and run complex models spanning many years or detailing lots of “steps” or “lines”.  You may have also used some simple models, such as a hand-drawn map that tells you how to get to the park.

A model is simply “some representation of reality”.  You have a real product line and a set of salespeople.  They take real orders from real customers.  If all goes well, you will receive real dollars (or Euros) for the sale.  So you have a spreadsheet for projecting sales revenue associated with all this.  After a hard day at work, you may go running at a nearby park, and a simple hand-drawn map may suffice in getting you there.

Both the “sales projection spreadsheet” and the “hand drawn map to the park” are models.  There are two key features of a model, any model, that can be illustrated through the above examples.

1. A model must be constructed with more or less a specific question in mind.
2. A model is “good” or “bad” in light of this purpose.

So a hand-scribbled model, not drawn to scale, with some streets that are not labeled may indeed suffice for your manoeuvring through the neighborhood and finding the park.  But if you wanted to lay down utility lines and plan some street-ripping construction, you would want a different model, one that shows more specific dimensions and perhaps what kind of surface materials you are dealing with.

In much of the modeling and analytics work I do, I get asked, “how accurate is it?” or “how much data is in it?”.  I believe that the questions are valid, but the first question to ask is: “what are we trying to solve?”.

Before you build a model, think about what you are trying to do.  Who is the audience?  How will the model (or the results) be used? What kind of questions will people ask?  Then we can go about discussing “level of detail” or “what kind of data”.