I stumbled across the website six.pen.io, which features a series of exercises designed to make the reader think about the number six. The exercises include a series of five simple addition problems, each of which sums to six, and a request that the reader pause for 15 seconds and repeat the number 'six' to themselves as fast as they can. The website then has them scroll down, and implores them (with excessive punctuation) to think of a vegetable. The website claims that 98% of people who follow this procedure come up with "carrot".
I found this claim rather dubious, so I set out to test it. There are two hypotheses to test here:
To test these hypotheses, I set up an internet survey to gather data. The survey page was written in PHP and JavaScript, and results were stored in a MySQL database. JavaScript was used for form validation, and to count down 15 seconds during which subjects were supposed to be saying "six" to themselves. The survey was designed to function even with JavaScript disabled.
For this study, I needed a sample size of at least 30 for both the treatment and control. This is because the Central Limit Theorem states that a sample of at least 30 can usually be safely assumed to be approximately normal. This means that I would need at least 60 people total to respond to the survey. More people would be ideal, because a larger sample size decreases the standard error, which increases confidence and makes it easier to draw conclusions.
The survey is still available here; you can still take it if you wish, though your results will not be recorded.
Randomization
The population of interest for this study was all English-speakers who can do basic addition and know the name of at least one vegetable. The randomization for this study wasn't great, as people were made aware of it primarily through Facebook, the JetCareers forums, and the xkcd forums. Fortunately, I don't have any reason to suspect that people who know me on Facebook, members of the JetCareers forums, and members of the xkcd forums are predisposed to thinking of particular vegetables. So despite less than ideal randomization, I can still continue the study.
The first page of the survey introduced my motives for creating it, without disclosing any information that could potentially bias subjects. The second page recorded demographics, including Country, State/Province, whether English was their first language, and Gender. These were included because I thought I might block by them.
Subjects were asked not to take the test if they had already taken it, and not to take it while someone who hadn't taken it yet was watching. Upon completing the demographics page, subjects were randomly assigned by the PHP script to either the Treatment or Control group.
All the pages on the treatment side of the test had the number six spelled out in large letters at the top of the page. This was done because six.pen.io included a large-font "six" at the start of the page, and in the page title and URL. I tried to choose a font and style for the "six" close to the font used on six.pen.io, and to make it noticeable. The next few pages simply reproduced the test at six.pen.io, asking the same mathematical questions in the same order: 1+5, 2+4, 3+3, 4+2, and 5+1. The subjects were then asked to say "six" to themselves as fast as they could for fifteen seconds. A JavaScript timer counted down, enabling the "next" button after reaching 0.
The next page prompted them to think of a vegetable, using six.pen.io's exact wording: "QUICK!!! THINK OF A VEGETABLE!". The page after that asked them to enter in the first vegetable that had come to their mind, providing a text field for their input.
Subjects in the control group skipped all the questions regarding the number six, and were simply asked to type in the first vegetable that popped into their head.
All subjects from both samples were asked whether they had prior knowledge of the test before taking it. These questions were intended to curb bias and help ensure independence (e.g., a person who had seen another person take the test might be predisposed to answer with the same vegetable as that person).
When I closed the survey, I had received 162 usable responses, not including spam and people who said they had prior knowledge of the test. I decided to keep the responses from people who said they were dishonest, because:
I decided against blocking by demographics for two reasons:
Blocking by language did seem reasonable, but the fact that all of the respondents spoke English as a first language (except for two who were eliminated anyway because their answers were not vegetables) made this a moot point.
The 162 respondents divided evenly: 81 in the Control group, and 81 in the Treatment group. Equal sample sizes weren't necessary for this study, and I didn't try for them. I just got lucky.
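As a rough illustration of how lucky an exact 81/81 split is, the sketch below computes the chance of a perfectly even split. It assumes each of the 162 usable respondents was assigned to a group independently with probability 0.5; the survey's actual PHP assignment logic isn't shown, so that is an assumption, and `prob_even_split` is a hypothetical helper name.

```python
from math import comb


def prob_even_split(n: int) -> float:
    """Chance that exactly n/2 of n subjects land in each group,
    ASSUMING independent fair coin-flip assignment (not confirmed by the source)."""
    if n % 2:
        return 0.0  # an odd number of subjects can never split evenly
    return comb(n, n // 2) / 2 ** n


p_even = prob_even_split(162)
print(f"P(exactly 81 in each group) = {p_even:.4f}")
```

Under that assumption the chance comes out at a little over 6%.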
The results do show a high proportion of carrots, for both the treatment and control. And the proportion of carrots for the treatment is slightly higher, as the website implies it should be. However, the proportion of carrots is nowhere near the 98% the website claims.
Graphically, the results look like this:

And in tabular form:
| Vegetable | Control | Treatment |
|---|---|---|
| asparagus | 1 | 2 |
| beans | 1 | 0 |
| broccoli | 8 | 6 |
| cabbage | 1 | 0 |
| cauliflower | 0 | 1 |
| carrot | 24 | 33 |
| celery | 4 | 2 |
| corn | 5 | 1 |
| cucumber | 2 | 5 |
| eggplant | 1 | 0 |
| green beans | 1 | 1 |
| leeks | 1 | 1 |
| lettuce | 3 | 3 |
| okra | 1 | 0 |
| onion | 1 | 0 |
| potato | 17 | 13 |
| rhubarb | 0 | 1 |
| rutabaga | 0 | 1 |
| squash | 1 | 0 |
| strawberry | 1 | 0 |
| sweet potato | 1 | 0 |
| tomato | 6 | 6 |
| turnip | 0 | 1 |
| zucchini | 1 | 4 |
| Total | 81 | 81 |
The proportions we care about are the carrots:
| | Control | Treatment |
|---|---|---|
| Size | 81 | 81 |
| Carrots | 24 | 33 |
| Proportion | 0.2963 | 0.4074 |
I will first test the website's first claim: Is the proportion of people who took the treatment and thought first of carrots equal to 98%, or less than that?
H0: π = 0.98
Ha: π < 0.98
Where π is the population proportion of people who think first of carrots after engaging in the exercises to think about the number six.
α = 0.05. I'm using a 0.05 significance level; I'm not too concerned about the consequences of a Type II error.
Conditions: Not all conditions were met, so we can't be sure the data is approximately normal. Take results with a grain of salt, so to speak.

We reject the null hypothesis. Because the P-value is less than the significance level (~0<0.05), there is enough evidence to conclude that the proportion of people who think first of carrots after participating in the number-six exercises is less than 98%.
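The calculation behind this result can be sketched as follows. This is a minimal sketch of a standard one-proportion z-test using the treatment counts from the summary table (33 carrots out of 81). The function names are illustrative, and the cutoff of 10 in the condition check is a common rule of thumb assumed here, since the conditions actually checked aren't listed.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(successes: int, n: int, pi0: float) -> tuple[float, float]:
    """Lower-tailed z-test of H0: pi = pi0 against Ha: pi < pi0."""
    p_hat = successes / n
    se = sqrt(pi0 * (1 - pi0) / n)    # standard error under H0
    z = (p_hat - pi0) / se
    p_value = NormalDist().cdf(z)     # area to the left of z
    return z, p_value


def normal_approx_ok(n: int, pi0: float, threshold: float = 10) -> bool:
    """ASSUMED rule of thumb: n*pi0 and n*(1 - pi0) should both reach the threshold."""
    return n * pi0 >= threshold and n * (1 - pi0) >= threshold


z, p_value = one_proportion_z_test(successes=33, n=81, pi0=0.98)
conditions_ok = normal_approx_ok(n=81, pi0=0.98)
print(f"z = {z:.2f}, P-value = {p_value:.3g}, conditions met: {conditions_ok}")
```

The z statistic lands far below zero (around -37), so the lower-tail P-value is effectively 0. The condition check fails because, under H0, 81 × 0.02 gives only about 1.6 expected non-carrot answers.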
So it looks like the website's claim that 98% of people think first of carrots is incorrect. The data in the survey suggests the true proportion of people who think first of carrots is somewhere between 30% and 40%, which is still fairly high. Carrots are the vegetable with the highest proportion in both samples.
That said, we did not meet all conditions for assuming an approximately normal distribution, which makes the results somewhat suspect. Fortunately, we have pie charts that back up the conclusion that the true proportion of people who think first of carrots is less than 98%.
Now to address the website's second claim, which was implicit: Is the proportion of people who thought first of carrots different between those who took the treatment (the six-exercises), and those in the control?
H0: π0 - π1 = 0
Ha: π0 - π1 < 0
Where π0 is the population proportion of people who thought of carrots first in the control group, and π1 is the population proportion of people who thought of carrots first when given the treatment.
α = 0.05. I'm using a 0.05 significance level; I'm not too concerned about the consequences of a Type II error.
Conditions: All conditions are met, so we can proceed as planned without undue concern.
We fail to reject the null hypothesis. Because the P-value is not less than the significance level (0.0693>0.05), we do not have enough evidence to conclude that the proportion of people who thought first of carrots in the treatment group is higher than the proportion who thought first of carrots in the control group.
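For readers who want to check the arithmetic, here is a minimal sketch of a pooled two-proportion z-test on the counts from the summary table (24 of 81 in the control group, 33 of 81 in the treatment group). Pooling the two samples to estimate the standard error is the usual textbook approach and is assumed here; `two_proportion_z_test` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x0: int, n0: int, x1: int, n1: int) -> tuple[float, float]:
    """Lower-tailed test of H0: pi0 - pi1 = 0 against Ha: pi0 - pi1 < 0."""
    p0, p1 = x0 / n0, x1 / n1
    pooled = (x0 + x1) / (n0 + n1)                        # common proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n0 + 1 / n1))  # pooled standard error
    z = (p0 - p1) / se
    return z, NormalDist().cdf(z)


z, p_value = two_proportion_z_test(x0=24, n0=81, x1=33, n1=81)
print(f"z = {z:.3f}, P-value = {p_value:.4f}")
```

With these counts the sketch gives a z statistic of roughly -1.48 and a P-value of roughly 0.069, in line with the value reported above.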
So not only is the proportion of people who think first of carrots less than 98%, the series of exercises designed to make people think about the number six doesn't have a significant impact on the proportion of people who think of carrots. Both the website's claims seem to be wrong.
Unfortunately, a member of the xkcd forums pointed out a glaring flaw in my survey: The survey asked subjects to name a vegetable, and the survey was hosted at abpotato.com. This very likely biased subjects to think of potatoes, which would decrease the proportion of people who thought of carrots.
To compensate for this, I'll take an extreme test case. I'll assume that all of the people who reported thinking first of potatoes would have normally thought first of carrots, if not for the url. In accordance with this, I'll redo the single proportion test using the sum of the proportion of carrots and potatoes.
Where p̂0 is the combined proportion of carrots and potatoes in the control group, and p̂1 is the combined proportion of carrots and potatoes in the treatment group.
Now to reconduct the test on the first hypothesis:
H0: π = 0.98
Ha: π < 0.98
Where π is the population proportion of people who think first of carrots after engaging in the exercises to think about the number six.
α = 0.05. I'm using a 0.05 significance level; I'm not too concerned about the consequences of a Type II error.
Conditions: Not all conditions were met, so we can't be sure the data is approximately normal. Take results with a grain of salt, so to speak.

We reject the null hypothesis. Because the P-value is less than the significance level (~0<0.05), there is enough evidence to conclude that the proportion of people who think first of carrots after participating in the number-six exercises is less than 98%.
So, even when we combine the proportions of carrots with potatoes, we still reject the null hypothesis. Keep in mind that, again, not all the conditions were met, so the test cannot be completely trusted by itself. Again, the pie charts back up the results of the test: While the combined proportion of carrots and potatoes in the treatment group is large (56.8%), it is still significantly smaller than 98%.
The conclusion that the website is incorrect in stating 98% of people think first of carrots stands.
Just to be thorough, I'll also reconduct both the one-proportion z-test and the two-proportion z-test with the potato data thrown out.
Summary data with the potatoes thrown out:
| | Control | Treatment |
|---|---|---|
| Size | 63 | 68 |
| Carrots | 24 | 33 |
| Proportion | 0.3810 | 0.4853 |
The first hypothesis: Is the proportion of people who think first of carrots less than .98?
H0: π = 0.98
Ha: π < 0.98
Where π is the population proportion of people who think first of carrots after engaging in the exercises to think about the number six.
α = 0.05. I'm using a 0.05 significance level; I'm not too concerned about the consequences of a Type II error.
Conditions: Not all conditions were met, so we can't be sure the data is approximately normal. Take results with a grain of salt, so to speak.

We reject the null hypothesis. Because the P-value is less than the significance level (~0<0.05), there is enough evidence to conclude that the proportion of people who think first of carrots after participating in the number-six exercises is less than 98%.
So the website's explicit claim that 98% of people who take the treatment think of carrots is rejected no matter how we test it. Now, as with the previous one-proportion z-tests, we don't meet all the conditions for assuming approximate normality. Fortunately, we can again turn to pie charts.
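The three versions of the one-proportion test can be laid side by side in a few lines. This is a sketch only: the scenario labels are made up for illustration, and the counts are the treatment-group figures quoted in the text (33 carrots, 13 potatoes, 81 respondents).

```python
from math import sqrt
from statistics import NormalDist

PI_0 = 0.98  # the proportion claimed by the website

# (carrot-equivalent answers, sample size) for each way of handling the potatoes
scenarios = {
    "carrots only": (33, 81),
    "carrots + potatoes": (33 + 13, 81),
    "potatoes removed": (33, 81 - 13),
}

results = {}
for name, (successes, n) in scenarios.items():
    p_hat = successes / n
    z = (p_hat - PI_0) / sqrt(PI_0 * (1 - PI_0) / n)  # one-proportion z statistic
    p_value = NormalDist().cdf(z)                     # lower-tail P-value
    results[name] = (p_hat, z, p_value)
    print(f"{name:20s} p_hat = {p_hat:.4f}  z = {z:7.2f}  P-value = {p_value:.3g}")
```

In every scenario the z statistic is more than 20 standard errors below zero, which is why the handling of the potatoes makes no difference to the conclusion.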
Pie charts for potato-less data:

While the proportion of carrots in the potato-less data is quite high - nearly 50% for the treatment - it is still significantly less than 98%. The pie charts again back up the test. It is clear that the true proportion of people who think first of carrots when given the treatment is less than 98%.
Now to address the second hypothesis: Does the treatment have any effect on the proportion of carrots (when the potatoes are discarded)?
H0: π0 - π1 = 0
Ha: π0 - π1 < 0
Where π0 is the population proportion of people who thought of carrots first in the control group, and π1 is the population proportion of people who thought of carrots first when given the treatment.
α = 0.05. I'm using a 0.05 significance level; I'm not too concerned about the consequences of a Type II error.
Conditions: All conditions are met, so we can proceed as planned without undue concern.
We fail to reject the null hypothesis. Because the P-value is not less than the significance level (0.1144>0.05), we do not have enough evidence to conclude that the proportion of people who thought first of carrots in the treatment group is higher than the proportion who thought first of carrots in the control group.
So with the potatoes eliminated, we still fail to reject the null hypothesis. That is to say, we still find no strong evidence to agree with the website's implicit claim that the treatment has any effect on the proportion of people who think of carrots. In fact, as the P-value without the potatoes is even higher than the P-value with them, the absence of potatoes only strengthens our prior conclusion.
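The comparison of the two P-values can be reproduced with a short sketch. It assumes a pooled two-proportion z-test and takes the group sizes straight from the two summary tables (81 and 81 with potatoes, 63 and 68 without); `lower_tail_p` is an illustrative helper name.

```python
from math import sqrt
from statistics import NormalDist


def lower_tail_p(x0: int, n0: int, x1: int, n1: int) -> float:
    """P-value for Ha: pi0 - pi1 < 0 from a pooled two-proportion z-test."""
    pooled = (x0 + x1) / (n0 + n1)
    se = sqrt(pooled * (1 - pooled) * (1 / n0 + 1 / n1))
    z = (x0 / n0 - x1 / n1) / se
    return NormalDist().cdf(z)


# Carrot counts are 24 (control) and 33 (treatment) in both versions;
# only the sample sizes change when the potato answers are dropped.
p_with_potatoes = lower_tail_p(24, 81, 33, 81)
p_without_potatoes = lower_tail_p(24, 63, 33, 68)
print(f"with potatoes:    P-value = {p_with_potatoes:.4f}")
print(f"without potatoes: P-value = {p_without_potatoes:.4f}")
```

Both P-values stay above the 0.05 significance level, and the potato-less one is the larger of the two.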
So it seems like both the website's claims are wrong no matter how we look at them.
Now, we have already found no significant difference in the proportion of carrots between the treatment and control groups. But does the proportion of potatoes change, or that of any other vegetable? To test this, I'll use a χ² (chi-squared) test.
Observed and expected counts for each vegetable:
| Vegetable | Observed (Control) | Observed (Treatment) | Expected (Control) | Expected (Treatment) |
|---|---|---|---|---|
| asparagus | 1 | 2 | 1.509 | 1.491 |
| beans | 1 | 0 | 0.503 | 0.497 |
| broccoli | 8 | 6 | 7.043 | 6.957 |
| cabbage | 1 | 0 | 0.503 | 0.497 |
| cauliflower | 0 | 1 | 1.006 | 0.994 |
| carrot | 24 | 33 | 28.675 | 28.33 |
| celery | 4 | 2 | 3.018 | 2.982 |
| corn | 5 | 1 | 3.018 | 2.982 |
| cucumber | 2 | 5 | 3.522 | 3.479 |
| eggplant | 1 | 0 | 0.503 | 0.497 |
| green beans | 1 | 1 | 1.006 | 0.994 |
| leeks | 1 | 1 | 1.006 | 0.994 |
| lettuce | 3 | 3 | 3.018 | 2.982 |
| okra | 1 | 0 | 0.503 | 0.497 |
| onion | 1 | 0 | 0.503 | 0.497 |
| potato | 17 | 13 | 15.092 | 14.91 |
| rhubarb | 0 | 1 | 0.503 | 0.497 |
| rutabaga | 0 | 1 | 0.503 | 0.497 |
| squash | 1 | 0 | 0.503 | 0.497 |
| strawberry | 1 | 0 | 0.503 | 0.497 |
| sweet potato | 1 | 0 | 0.503 | 0.497 |
| tomato | 6 | 6 | 6.037 | 5.963 |
| turnip | 0 | 1 | 0.503 | 0.497 |
| zucchini | 1 | 4 | 2.519 | 2.485 |
Unfortunately, the χ² test cannot be completed: the usual conditions call for every expected value to be at least 1 and at least 80% of them to be at least 5, but many of the expected values here are less than 1, and fewer than 20% of them are greater than five. The conditions aren't even close to being met, so we can't proceed and still expect reliable results.
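The condition check can be sketched in code. The example below recomputes expected counts (row total × column total ÷ grand total) directly from the observed counts listed above, so its values may differ slightly from the tabulated expected values. The thresholds (no expected count below 1, at least 80% of them at 5 or more) are a common rule of thumb and are assumed here rather than taken from the original page.

```python
# Observed counts in the alphabetical vegetable order used above (asparagus ... zucchini)
control = [1, 1, 8, 1, 0, 24, 4, 5, 2, 1, 1, 1, 3, 1, 1, 17, 0, 0, 1, 1, 1, 6, 0, 1]
treatment = [2, 0, 6, 0, 1, 33, 2, 1, 5, 0, 1, 1, 3, 0, 0, 13, 1, 1, 0, 0, 0, 6, 1, 4]


def expected_counts(rows: list[list[int]]) -> list[list[float]]:
    """Expected cell counts under independence of group and vegetable."""
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    return [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]


expected = expected_counts([control, treatment])
cells = [e for row in expected for e in row]

any_below_1 = any(e < 1 for e in cells)
share_at_least_5 = sum(e >= 5 for e in cells) / len(cells)
conditions_met = (not any_below_1) and share_at_least_5 >= 0.8  # ASSUMED thresholds

print(f"any expected count below 1: {any_below_1}")
print(f"share of expected counts >= 5: {share_at_least_5:.1%}")
print(f"conditions met: {conditions_met}")
```

Only four vegetables (broccoli, carrot, potato, and tomato) have expected counts of 5 or more in each group, which is well short of what the test needs.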
So to tell if the treatment had any effect on any of the vegetables, the best we can do is look at the pie charts:

The treatment pie chart does have a higher proportion of carrots than the control pie chart, but we already determined through the two-proportion z-test that the difference is not significant. There are also variations for other vegetables, but as the differences are even smaller than those for the carrots, it is not unreasonable to attribute them to random variation. There isn't any convincing evidence that the treatment affects what vegetables people think of.
The data strongly suggests that the proportion of people who think of carrots after taking the exercises to think about the number six is significantly less than 98%, even accommodating for the bias due to the survey's URL.
In addition, there is no convincing evidence that the treatment - that is, the exercises to think about the number six - has any influence on what vegetables people think of.
Based on the evidence collected in the survey, six.pen.io appears to be wrong on both counts. That said, the survey did show that the proportion of people who think of carrots when asked to think of a vegetable is higher than that of other vegetables - accommodating for the potato bias, the population proportion is probably somewhere from 30-50%. So six.pen.io did contain a grain of truth, exaggerated though it was.
At the start of the study, I used the Central Limit Theorem to determine that I needed a sample size of at least 30 for each group (treatment and control). This was an error on my part, as that n ≥ 30 rule of thumb applies to sample means, not proportions. I should have calculated the necessary sample size this way:
This would have required a minimum of 1000 people to respond to the survey, and more in practice to allow for unusable responses and uneven distribution between the treatment and control groups (due to randomness). It is possible that a sample size of 1000 would have been attained had I let the survey run longer than I did (it was live for about four days). Any follow-up studies should ideally meet this requirement.
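The original calculation isn't reproduced here, but one way to arrive at a figure of this size is sketched below. It assumes the common rule of thumb that both nπ and n(1 − π) should be at least 10 for the normal approximation, and it assumes the two groups are sized equally; `min_n_for_normal_approx` is a hypothetical helper name.

```python
from fractions import Fraction
from math import ceil


def min_n_for_normal_approx(pi0: str, threshold: int = 10) -> int:
    """Smallest n with n*pi0 >= threshold and n*(1 - pi0) >= threshold.

    The threshold of 10 is an ASSUMED rule of thumb; pi0 is passed as a
    string so the arithmetic stays exact.
    """
    p = Fraction(pi0)
    rarer_outcome = min(p, 1 - p)  # the binding constraint comes from the rarer outcome
    return ceil(threshold / rarer_outcome)


n_per_group = min_n_for_normal_approx("0.98")
n_total = 2 * n_per_group  # ASSUMES treatment and control are the same size
print(f"per group: {n_per_group}, total: {n_total}")
```

With π = 0.98 the binding constraint is n × 0.02 ≥ 10, which gives 500 per group and 1000 in total under these assumptions.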
Fortunately, the disparity in acquired versus ideal sample size hasn't seriously reduced the credibility of this study, because the pie charts make it fairly obvious that the true proportion of people who think of carrots is less than 98%.
Another source of error in this study was the url of the survey. Hosting the survey on my personal website - abpotato.com - almost certainly biased people to think of potatoes. Fortunately, the disparity between the proportion of carrots hypothesized by the website and the true proportion of carrots proved to be extreme enough for this not to matter. Even if all the people who thought of potatoes had instead thought of carrots, six.pen.io's claim would still have been rejected. On top of that, the conclusions still remain the same if we remove all the potatoes from the data. That said, it would be highly advisable for any follow-up studies to host surveys at a vegetable-neutral url.
One bias that I couldn't accommodate for is people reporting a vegetable other than the one they thought of first. It is possible that people did think of "carrots" - or any other vegetable - first, but then decided (for whatever reason) to put something else. A possible motivation for this would be thinking of a vegetable that seems too "normal" - like a potato - and wanting to put something more "unusual" (like leeks).
Future studies could mitigate this bias by providing textboxes for the second vegetable thought of in addition to the first. This way, subjects could feel comfortable entering the vegetable they actually thought of first in the first box, and the "more interesting" vegetable they thought of second in the second box. For good measure, it might be a good idea to include a third textbox as well. It is worth noting that these textboxes would need to be all on the same page, where the subject could see them all at once. Obviously, subjects would have to have the option of leaving subsequent boxes after the first blank, as they might not have thought of a second or tertiary vegetable.
If I did this study again, I wouldn't ask for country, state/province, gender, or first language. I would still record IP addresses; they were useful for spam detection.