Faring most poorly was the Psychopathy Checklist (PCL-R). Correlations of scores between two evaluators hired by the same agency were in the low range. On average, psychologists differed by five points on the instrument, which has a score range of of zero to 40. In one case, two evaluators were apart by a whopping 24 points!
Agreement among evaluators was only moderate on the Static-99 and the MnSOST-R, two actuarial risk assessment instruments for which scoring is relatively more straightforward.
The study, published in the respected journal Psychological Assessment, was a collaboration between scholars from the Department of Mental Health Law and Policy at the University of South Florida and researchers with the Florida Department of Children and Families. It utilized archived records culled from the almost 35,000 individuals screened for possible Sexually Violent Predators (SVP) civil commitment in Florida between 1999 and 2009. The researchers located 315 cases in which the same individual was evaluated by separate clinicians who each administered both the PCL-R and at least one of the two actuarial measures within a short enough time frame to enable direct scoring comparisons.
It would be a mistake to lean too heavily on the results of a single isolated study. But the present study adds to a burgeoning body from several groups of independent researchers, all pointing to troubling problems with the accuracy of instruments designed to forecast risk of recidivism among sex offenders.
Collectively, the research has been especially critical of the ability of the highly prejudicial construct of psychopathy to add meaningfully to risk prediction in this high-stakes arena. Indeed, just this week another study has come out indicating that neither psychopathy scores nor sexual deviance measures improve on the accuracy provided by an actuarial instrument alone.
An especially interesting finding of that Canadian study is that reoffense rates were still below 12 percent over a 6-year followup period for even the most high-risk offenders -- those with high risk ratings on the Static-99R plus high levels of psychopathy and sexual deviance (as measured by phallometric testing). This makes it inappropriate to inflate risk estimates over and above those derived from Static-99R scores alone, the authors caution.
Item-level analysis finds varying rates of accuracy
A unique contribution of the Florida study is its analysis of the relative accuracy of every single item in each of the three instruments studied. Handy tables allow a forensic practitioner to see which items have the poorest reliability, meaning they should be viewed skeptically by forensic decision-makers.
For example, take the MnSOST-R, a now-defunct instrument with a score range of –14 to 31 points. The total gap between evaluators was as wide as 19 points; the items with the greatest variability in scoring were those pertaining to offenders' functioning during incarceration, such as participation in treatment.
Meanwhile, the weak performance of the Psychopathy Checklist owes much to the items on its so-called “Factor 1,” which attempt to measure the personality style of the psychopath. As I've discussed before, rating someone as “glib,” “callous” or “shallow” is a highly subjective enterprise that opens the door to a veritable avalanche of personal bias.
Piggy-backing off a recommendation by John Edens and colleagues, the Florida team suggests that the prejudicial deployment of the Psychopathy Checklist may be superfluous, in that scores on Factor 2 alone (the items reflecting a chronic criminal lifestyle) are more predictive of future violence or sexual recidivism.
Next up, we need to identify the causes of the poor interrater reliability for forensic risk prediction instruments in real-world settings. Is it due to inadequate training, differing clinical skills, variable access to collateral data, intentional or unintentional bias on the part of examiners, adversarial allegiance effects (not a factor in the present study, since both evaluators were appointed by the same agency), or some combination?
In the meantime, the fact that two evaluators working on the same side cannot reliably arrive at the same risk rating for any particular individual should certainly raise our skepticism about the validity of risk prediction based on these instruments.