How to Lie with Statistics by Darrell Huff is an old classic that has been sitting on my reading list for quite a while. I finally bought a copy. Despite my extremely slow reading speed, I could finish this little book within a day, as it was a delight to read. Extremely accessible prose, combined with several cartoon-like illustrations by Irving Geis, makes it suitable even for high school kids. It was originally published in 1954, and so the illustrations date themselves, with everyone, including a baby, shown with a lit cigarette hanging from their mouth! I guess that was considered the normal state of human existence in those days. The introduction starts off with how anecdotal evidence can significantly bias people's perception, such as newspaper headlines filled with crime stories leaving an impression on the general public that the city, or that year, is quite crime-ridden, even though statistics may not show any noticeable increase. This being one of my pet peeves, I was hooked right away.
The first chapter talks about biases built into the sample being analyzed. While there are several examples related to US Presidential elections, an early one Huff points out dates back to 1936. That year, a magazine called Literary Digest polled its subscribers and people drawn from telephone directories, and predicted that the Republican candidate Alfred Landon would win the election handily. In reality, the Democratic candidate Franklin Roosevelt won 46 states while Landon won only 2. The fiasco was attributed to the fact that in 1936, mostly the affluent, who leaned Republican, could afford magazine subscriptions and telephones! While they resoundingly supported Landon, the vast majority of the voting population preferred the alternative. Though this sounds obvious now, we often fail to check whether the sample truly represents the larger population. Thus, while all the classmates who show up at your school/college reunion may appear to be doing better than you, it is probably because the majority who are not doing well simply don't show up to such reunions.
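Since sampling bias is easy to state but just as easy to forget, here is a minimal sketch of my own (not from the book) showing how polling only the people you can reach can flip a prediction. The population split, phone-ownership rates, and sample size are all invented numbers for illustration.

```python
import random

random.seed(1936)

# Hypothetical electorate: 60% favour candidate A, 40% favour candidate B,
# but B's supporters are far more likely to own a telephone.
population = []
for _ in range(100_000):
    prefers_a = random.random() < 0.60
    owns_phone = random.random() < (0.20 if prefers_a else 0.70)
    population.append((prefers_a, owns_phone))

# "Poll" 10,000 people, but only those reachable by phone.
phone_owners = [p for p in population if p[1]]
polled = random.sample(phone_owners, 10_000)
support_a = sum(1 for prefers_a, _ in polled if prefers_a) / len(polled)

print(f"True support for A:    {sum(p[0] for p in population)/len(population):.1%}")
print(f"Support for A in poll: {support_a:.1%}")  # the poll picks the wrong winner
```

With these made-up rates, the poll reports roughly 30% support for A even though 60% of the full population prefers A, which is the Literary Digest failure in miniature.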
The second chapter discusses how reports often talk about the "average" but choose whichever of the mean, median, or mode supports their argument, without explicitly stating the choice. The next one talks about how most reports and advertisements that throw a number at you to make a point leave out several other important numbers needed to set the full context. An example is a toothpaste advertisement saying "XYZ toothpaste users report 23% fewer cavities". On the surface it sounds great and may encourage us to buy XYZ toothpaste, until we ask, "Compared to what?". If XYZ toothpaste users are compared to a population that doesn't use any kind of dental hygiene product, the claim may turn out to be true, but it is not a big deal. Or perhaps users of other toothpastes report a much bigger reduction in cavities, in which case XYZ toothpaste is not the best. Or perhaps it is a comparison of the same XYZ toothpaste users across two successive years, where the reduction is due to fluoride being added to the water. There is also a good bit of discussion about the size of the sample. If you used only two people in your sample, one person not reporting cavities this year can be reported as a 50% reduction in cavities. In addition, experiments can be repeated multiple times until a favorable set of results is obtained, and then one can report just that result, leaving out all the other results that were unfavorable.
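To see how fragile percentages from tiny samples are, here is a quick sketch of my own (again, not from the book): with only two people per trial, a single person who happens to have no cavities swings the "reduction" by 50 or 100 percent, and if you rerun the tiny trial enough times you will eventually land on a result worth advertising. The baseline cavity rate and trial sizes are assumed numbers.

```python
import random

random.seed(42)

def cavity_trial(n_people, p_cavity=0.5):
    """Percent 'reduction in cavities' relative to an expected baseline of p_cavity."""
    cavities = sum(random.random() < p_cavity for _ in range(n_people))
    expected = n_people * p_cavity
    return (expected - cavities) / expected * 100

# With n=2, results jump around wildly: -100%, 0%, +100% ...
print([round(cavity_trial(2)) for _ in range(10)])

# With n=2000, the same no-effect process hovers near 0%.
print([round(cavity_trial(2000)) for _ in range(10)])

# Repeat the tiny trial until a "favourable" result shows up, then report only that.
while (result := cavity_trial(2)) < 50:
    pass
print(f"Headline: users report {result:.0f}% fewer cavities!")
```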
Then comes the analysis of various types of graphs used in reports that zoom in (i.e., the Y axis doesn't start from zero) to make small variations look really big. In addition to actual graphs, pictures can give the wrong impression as well. For example, when grain production doubles in one year, simply showing a bar twice the height of the previous year's bar is legitimate. Instead, if we want to make the increase look much bigger, we can draw bags of grain, with the current year's bag twice the size of last year's. While it is only twice as tall in the picture, since it is drawn to suggest a three-dimensional object, our brains read it as eight times bigger (2x2x2=8).

The author also points out how reporters, intentionally or unintentionally, leave too many decimal places in the figures they quote, giving the impression that the figure is extremely precise. For example, if you take the arithmetic average of the salaries of 147 people and it comes out to $73,425.76321, the five decimal places are an artifact of the division and not a reflection of the precision of the average salary in a population of 147 people. To convey the right degree of precision, it would be better to report it as just $73,426. But that may not appear very precise to the reader! The fact that "correlation doesn't imply causation" is well known in academic circles, and your statistics prof will try hard to beat it into your head. Thus, though there is a clear correlation between drowning deaths and the amount of ice cream being eaten, one doesn't cause the other; both tend to go up in summer months when more people go swimming. But it is easy to miss this point in more complicated datasets, as the author points out. The saying "Lies, damned lies, and statistics" gets clearly etched in our minds as we go through the book.
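A few of these tricks come down to simple arithmetic, so here is a small sketch of my own working them out: the doubled drawing that reads as an eight-fold volume, the spurious precision of a long division, and a correlation driven entirely by a hidden third variable. All the numbers below (salary range, temperatures, scaling factors) are invented purely for illustration.

```python
import random
import statistics  # statistics.correlation needs Python 3.10+

random.seed(7)

# 1. Pictogram trick: double every linear dimension of the drawn "bag" and it
#    occupies 2 x 2 x 2 = 8 times the volume, though production only doubled.
scale = 2
print("Apparent size ratio:", scale ** 3)  # 8

# 2. False precision: the long tail of decimals comes from the division,
#    not from the underlying data.
salaries = [random.randint(40_000, 110_000) for _ in range(147)]
mean_salary = sum(salaries) / len(salaries)
print(f"Raw mean:    {mean_salary:.5f}")
print(f"Honest mean: {round(mean_salary)}")

# 3. Correlation without causation: ice cream sales and drownings both follow
#    summer temperature, so they correlate even though neither causes the other.
temps = [random.uniform(5, 35) for _ in range(365)]
ice_cream = [t * 10 + random.gauss(0, 20) for t in temps]
drownings = [t * 0.3 + random.gauss(0, 2) for t in temps]
print("Correlation:", round(statistics.correlation(ice_cream, drownings), 2))
```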
After a series of these chapters, each one equally enjoyable and accessible, the author finishes off the book with a final chapter that provides suggestions on how to spot these shenanigans. The first question we should always ask is "Who says so?", which will tease out the biases we should be aware of. The second question is "How do they know?" (I am changing "How does he know?" used in the book to make it more contemporary), which will bring out the methods used to gather the data, allowing us to assess its reliability. The third question is "What is missing?", which will tell us whether uncomfortable or context-setting parts of the data are being left out to nudge us in a direction not supported by the whole dataset. "Did somebody change the subject?" is the next question, which may reveal whether, somewhere along the line of analysis, one set of data is incorrectly being used to report something else. Finally, ask "Does it make sense?", as it will help identify and discredit a lot of nonsense reports.
This is a good little book that we should ask all undergrad students to read so that they internalize these notions. The notions will certainly serve them throughout their lives whenever statistics are used to pull the wool over their eyes.
--------------------------------------------------------------------------------
Usually when I combine two book reviews in one email, the books tend to be about the same topic. But in this case, the second book, The Lives of a Cell: Notes of a Biology Watcher by Lewis Thomas, has nothing to do with the contents of the first one. A thin connection could be the fact that this also happens to be a thin volume.
A friend of mine was putting together a short list of the best papers/articles he had read over the years, ones he felt gave non-experts a solid foundation in the fundamentals of the area each article covered. He had listed the eponymous essay on that list. Since I had not read the article, I looked it up, found it to be a book, and ordered it online. When it landed, I realized it is actually a collection of essays generally related to biology. The first essay I read, which gives the book its title, didn't really hook me. The next few also appeared to meander, written in the tone of a 1974 popular-science article. But with my OCD tendencies kicking in, I couldn't toss it aside, and so I read the remaining essays to get closure. Glad I did, as the later essays slowly clued me into the rhythm and outlook of the author. Though I wouldn't say this is one of the best books, I certainly enjoyed several of the essays, each only five or six pages long. The last paragraph of the title essay sums up the tone and takeaway of the collection: "I have been trying to think of the earth as a kind of organism, but it is no go. I cannot think of it this way. It is too big, too complex, with too many working parts lacking visible connections. If not like an organism, what is it like, what is it most like? Then satisfactorily for that moment, it came to me: it is most like a single cell." The author thus tries to make observations related to biology that we might not have made on our own.
This book was first published in 1974. With the advantage of an additional 46 years on my side, some of the observations the author makes, presented with the smugness of someone delivering a brilliant insight the reader would never have arrived at on their own, didn't sound that brilliant to me. I hope I am not sounding too snobbish or smug myself in making this remark. For example, there is one essay titled Your Very Good Health that discusses HMOs (Health Maintenance Organizations) coming into vogue in the 70s. The author says that the US was spending about $90B a year on healthcare (that was in the mid 70s, while the figure is nearing $4T per year now!) and that, despite that much money being spent, HMOs can't deliver all they are expected to deliver while simultaneously reducing costs, because so much of the money goes into treating things that don't require any treatment. Maybe it was a great insight then. Now it doesn't sound like one. Even the 1970s prose that reads, "The great secret, known to internists and learned early in marriage by internists' wives but still hidden from the general public, is that most things get better by themselves," sounds anachronistic to me, since my wife is a physician and I am not one! There is a chapter titled Computers that talks about how computers can never become equal to human beings. The author notes that computers, gaining more and more power, are able to compute more but are still too far away from mimicking a single human brain. He says that perhaps, through a highly improbable development, if one day all the computers in the world could be interconnected to harness their combined power, they might get smart enough to think like a human being. He then argues that equaling what all the human beings on the planet (who already communicate with each other, forming their own network) are able to think and achieve is of an altogether bigger order of complexity that will always remain elusive to computers.
Certainly, not all the essays in the book get caught in such pitfalls of age. Many of the essays in the second half are interesting, as they talk about notions like the extremely complicated cities termites build together without any central control (see this link to read about a very recent discovery), the conversion of basic science discoveries into medicine, and the characteristics of mythological creatures such as the Chimera, Griffon, Sphinx, Ch'i-lin and even Ganesha, which always tend to mingle the biological features of different species but hardly ever sport features never seen on any biological creature. Each essay being independent of the rest and only half a dozen pages long makes the book a fun read once you get into the groove.