Summary of data from each study used in the meta-analyses: boldness and reproductive success (a); boldness and survival (b); exploration and reproductive success (c); exploration and survival (d); aggression and reproductive success (e); aggression and survival (f)

Also, results were only included when it was clear from the publication that statistical tests had been used to examine the relationship.

Our sample of mammals, fish, arthropods, and birds included studies of male, female, and mixed-sex subjects, and all but 3 studies (Wilson et al. 1993; Godin and Davis 1995; Bremner-Harrison et al. 2004) consisted solely of adults.

## Analyses

All conversions and analyses were done using Meta-Analysis Programs 5.3 (Ralf Schwarzer). We chose Pearson’s product–moment correlation coefficient, r, as the measure of effect size for our studies (Rosenthal 1991). The r value is the magnitude of the effect of the measured behavioral type on the direct fitness correlate. When possible, coefficients were obtained from each study in the following order of preference: 1) direct reporting of r, R², or partial correlation; 2) other test statistics (F, U, t, χ²) converted to r (Rosenthal 1991); 3) N and exact 1-tailed P values used to calculate r (reported 2-tailed P values were converted to 1-tailed by dividing by 2). Because Meta-Analysis Programs 5.3 uses 1-tailed P values, probabilities for results in the direction opposite to our prediction were given minus signs. Thus, if a study found that bold individuals survived longer, the P value was given a positive sign; if survival was reduced, the P value was given a negative sign. The only deviation from the above methods was a study by Dingemanse et al. (2004), which fit models using information theory. Some results of this study could not be directly converted to effect sizes, so we calculated values for the data points illustrated in their Figure 2 (p. 850) and fitted linear regression models to obtain effect size estimates.
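The test-statistic conversions in step 2 can be illustrated with a minimal Python sketch. It assumes the textbook formulas for single-degree-of-freedom tests (t, F with 1 numerator df, and χ² with 1 df); the function names are hypothetical, the U-statistic conversion is not covered, and this is not a reproduction of what Meta-Analysis Programs 5.3 computes internally.

```python
import math


def r_from_t(t: float, df: int) -> float:
    """Convert a t statistic to r; the sign of t is carried over to r."""
    magnitude = math.sqrt(t * t / (t * t + df))
    return math.copysign(magnitude, t)


def r_from_f(f: float, df_error: int) -> float:
    """Convert an F statistic to r (assumes 1 numerator degree of freedom).

    F carries no direction, so the sign must be assigned separately.
    """
    return math.sqrt(f / (f + df_error))


def r_from_chi2(chi2: float, n: int) -> float:
    """Convert a chi-square statistic to r (assumes 1 degree of freedom).

    Chi-square carries no direction, so the sign must be assigned separately.
    """
    return math.sqrt(chi2 / n)


def r_from_r_squared(r_squared: float) -> float:
    """Convert a reported R^2 to the magnitude of r."""
    return math.sqrt(r_squared)


if __name__ == "__main__":
    # Illustrative inputs only; these are not values from any study in the sample.
    print(round(r_from_t(2.0, 21), 3))
    print(round(r_from_f(4.0, 21), 3))
    print(round(r_from_chi2(4.0, 25), 3))
```

Because F, χ², and R² do not encode direction, the sign of the resulting r would have to be set from the reported direction of the effect relative to the prediction.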

We attempted to contact authors for additional data when results did not report exact effect sizes or P values (e.g., they stated P < 0.05 or P > 0.05), and we obtained unpublished data for 2 studies (Armitage 1986; Dingemanse et al. 2004), which were entered as r values. For other studies with inexact results, P values were estimated to the nearest tenth or hundredth decimal place of the given value (P < 0.25 was entered as 0.2; P > 0.05 as 0.06; P > 0.1 as 0.2), and results reported as nonsignificant with no P value were given P values of 0.5 (r = 0.0) (Rosenthal 1991). Only results in which a direct comparison was made between a personality dimension and a fitness correlate were included in the analyses. For example, if a paper stated that no relationship was obvious but did not give the P value or test used, the result was not included. A summary of the data from each study used in the meta-analysis is shown in Table 1.
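The link between a 1-tailed P value of 0.5 and r = 0.0 follows from the usual conversion r = Z/√N, where Z is the standard normal deviate for the 1-tailed P. The sketch below assumes that conversion and uses a hypothetical function name; the sign argument mirrors the signed-P convention (positive for results in the predicted direction, negative otherwise) and is not taken from the program's documentation.

```python
import math
from statistics import NormalDist


def one_tailed_from_two_tailed(p_two_tailed: float) -> float:
    """Halve a reported 2-tailed P value to obtain the 1-tailed value."""
    return p_two_tailed / 2.0


def r_from_signed_p(signed_p: float, n: int) -> float:
    """Convert a signed 1-tailed P value and sample size N to r.

    Assumption: r = Z / sqrt(N), with Z the standard normal deviate
    for the 1-tailed P. The sign of `signed_p` encodes direction only;
    its magnitude must lie strictly between 0 and 1.
    """
    p = abs(signed_p)
    if not 0.0 < p < 1.0:
        raise ValueError("P magnitude must be strictly between 0 and 1")
    z = NormalDist().inv_cdf(1.0 - p)
    return math.copysign(1.0, signed_p) * z / math.sqrt(n)


if __name__ == "__main__":
    # A nonsignificant result entered as P = 0.5 yields Z = 0 and hence r = 0.
    print(abs(r_from_signed_p(0.5, 40)) < 1e-12)
    # Illustrative values only: 2-tailed P = 0.05 with N = 100,
    # in and against the predicted direction.
    p1 = one_tailed_from_two_tailed(0.05)
    print(round(r_from_signed_p(p1, 100), 3), round(r_from_signed_p(-p1, 100), 3))
```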

We performed a series of meta-analyses using the Schmidt–Hunter method (Hunter and Schmidt 1990), in which the effect size from each study is weighted by that study’s sample size as a proportion of the total sample size. Most studies reported more than one result when comparing behavioral type to fitness. For these studies, each r value was converted to a Fisher’s z_{r}. Fisher’s z_{r} values were then averaged and converted back to r to give a single, overall r value for each study (Rosenthal 1991). The sample sizes for the results were also averaged to give a single overall N for each study (Schum et al. 2003). This technique is standard in meta-analysis and reduces the risk of treating nonindependent results as independent (Rosenthal 1991). Results of analyses were tested for significance using the Z-test (Rosenthal 1991). To address the “file drawer problem” (see Rosenthal 1991), we calculated the fail-safe number for each weighted mean r. This value indicates the number of studies with effect sizes of 0 that would be needed to reduce the observed effect size to a nonsignificant level (P > 0.05). To test for homogeneity of results within each analysis, we calculated I², which describes the percentage of variation (0–100%) across studies that is due to heterogeneity (Higgins et al. 2003).
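The within-study averaging and sample-size weighting described above can be sketched in Python as follows. The function names are hypothetical, the inputs are illustrative rather than drawn from Table 1, and the I² helper assumes the heterogeneity statistic Q has already been computed, using the definition I² = 100% × (Q − df)/Q from Higgins et al. (2003), truncated at 0. It is a sketch of the general technique, not of the software's internals.

```python
import math
from typing import Sequence


def study_level_r(rs: Sequence[float]) -> float:
    """Average several r values from one study via Fisher's z transform."""
    zs = [math.atanh(r) for r in rs]   # r -> Fisher's z
    return math.tanh(sum(zs) / len(zs))  # mean z -> r


def study_level_n(ns: Sequence[int]) -> float:
    """Average the sample sizes of one study's results."""
    return sum(ns) / len(ns)


def weighted_mean_r(rs: Sequence[float], ns: Sequence[float]) -> float:
    """Sample-size-weighted mean r across studies."""
    total_n = sum(ns)
    return sum(n * r for r, n in zip(rs, ns)) / total_n


def i_squared(q: float, k: int) -> float:
    """I^2 as a percentage, given heterogeneity statistic Q and k studies."""
    df = k - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


if __name__ == "__main__":
    # Illustrative study with three results, then a two-study pooled estimate.
    r_a = study_level_r([0.10, 0.30, 0.50])
    n_a = study_level_n([20, 30, 40])
    print(round(r_a, 3), n_a)
    print(round(weighted_mean_r([r_a, 0.40], [n_a, 10]), 3))
    print(i_squared(20.0, 11))
```

Averaging on the Fisher z scale rather than averaging r directly reduces the bias that arises because the sampling distribution of r is skewed when the underlying correlation is far from 0.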