October 18, 2012

Static-99R risk estimates wildly unstable, developers admit

The developers of the widely used Static-99R risk assessment tool for sex offenders have conceded that the instrument cannot provide accurate numerical estimates of sexual recidivism risk for any specific offender.

The startling admission was published in the current issue of Criminal Justice and Behavior.

Examining the data from the 23 separate groups (totaling 8,106 offenders) that cumulatively make up the instrument’s aggregate norms, the researchers found alarmingly large variability in risk estimates depending on the underlying sample. The problem was especially acute for offenders with higher risk scores. A few examples:
  • At a low Static-99R score of "2," an offender’s predicted sexual recidivism rate after 10 years ranged from a low of 3 percent to a high of 20 percent, depending on the sample.
  • A score of "5" led to a five-year recidivism estimate of 10 percent in a large, representative sample of Swedish sex offenders, but an estimate two and a half times as high -- 25 percent -- in one U.S. sample. The absolute differences for more extreme scores were even larger.
  • Conversely, the Static-99R score that would predict a 15 percent likelihood of recidivism after five years ranged from a low-risk score of "2" to a high-risk score of "8," an enormous difference (greater than two standard deviations).
The study’s authors -- Karl Hanson, Leslie Helmus, David Thornton, Andrew Harris and Kelly Babchishin -- concede that such large variability in risk estimates "could lead to meaningfully different conclusions concerning an offender’s likelihood of recidivism."
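
To make this sample-to-sample spread concrete, here is a minimal Python sketch that takes per-sample estimates of the five-year recidivism rate at a single Static-99R score and reports how far apart they are. The two figures for a score of "5" are the Swedish and U.S. examples cited above; the other two entries (and their labels) are hypothetical placeholders, not values from the study.

```python
# Illustrative only: per-sample estimates of the five-year recidivism rate
# (as proportions) at a Static-99R score of 5. The Swedish and U.S. figures
# come from the examples above; "sample_c" and "sample_d" are hypothetical.
ESTIMATES_AT_SCORE_5 = {
    "swedish_sample": 0.10,
    "us_sample": 0.25,
    "sample_c": 0.14,  # hypothetical placeholder
    "sample_d": 0.18,  # hypothetical placeholder
}


def summarize_spread(estimates):
    """Report how much the same score's predicted rate varies across samples."""
    lo, hi = min(estimates.values()), max(estimates.values())
    return {
        "lowest": lo,
        "highest": hi,
        "absolute_spread": hi - lo,
        "ratio_high_to_low": hi / lo,  # 0.25 / 0.10 = 2.5 times as high
    }


if __name__ == "__main__":
    for label, value in summarize_spread(ESTIMATES_AT_SCORE_5).items():
        print(f"{label}: {value:.2f}")
```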

Overall risk lower than previously found

Despite the wide variations in rates of offending, the absolute recidivism rate for the typical sex offender in the combined samples was low overall. The five-year recidivism rate for the typical offender was roughly 7 percent, with estimates across samples ranging from 4 to 12 percent -- lower than had been reported in a previous meta-analysis. The 10-year rate for the typical offender ranged from 6 to 22 percent.

The research team speculates that the risk inflation in earlier analyses may have been an artifact of characteristics of the underlying samples, with data from higher-risk offenders more likely to be preserved and available for study. We know that a sister instrument, the MnSOST-R, produced inflated estimates of risk due to oversampling of high-risk offenders.

Will risk inflation continue?

[Image: M. C. Escher, "Hand with Reflecting Sphere"]
The Static-99R has a very modest ability to discriminate recidivists from non-recidivists. Its so-called "Area Under the Curve" statistic of around .70 means that, if you were to randomly select one known recidivist and one non-recidivist from a group of offenders, there is about a 70 percent probability that the recidivist will have the higher score.
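
That "randomly select one recidivist and one non-recidivist" interpretation can be computed directly. The Python sketch below calculates an AUC as the proportion of recidivist/non-recidivist pairs in which the recidivist has the higher score (ties counted as half); the score lists are invented for illustration and happen to work out to an AUC of about .70, but they are not data from any Static-99R sample.

```python
from itertools import product


def pairwise_auc(recidivist_scores, nonrecidivist_scores):
    """AUC as the probability that a randomly chosen recidivist outscores a
    randomly chosen non-recidivist, counting tied scores as half a 'win'."""
    wins = 0.0
    pairs = 0
    for r, n in product(recidivist_scores, nonrecidivist_scores):
        pairs += 1
        if r > n:
            wins += 1.0
        elif r == n:
            wins += 0.5
    return wins / pairs


# Made-up scores, invented purely for illustration (not Static-99R data).
recidivists = [5, 3, 6, 2, 4, 7, 3, 5]
nonrecidivists = [2, 4, 1, 5, 3, 2, 6, 3, 4, 2]

print(f"AUC = {pairwise_auc(recidivists, nonrecidivists):.2f}")  # 0.70
```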

Such information about a test's relative accuracy may be helpful when choosing which instrument to use for a risk assessment. But there are a number of problems with relying on it when reporting one's assessment of a specific individual.

First of all, even that modest level of accuracy may be illusory in the field: a study currently in progress is finding poor inter-rater agreement on Static-99R scores in routine practice, especially at the higher risk levels.

Second, with base rates of recidivism hovering around 6 to 7 percent, even under optimal conditions it is very difficult to accurately predict who will reoffend. For every person correctly flagged as a recidivist based on a high Static-99R score, at least three non-recidivists will be falsely flagged, according to research by Jay Singh and others, as well as published error-rate calculations by forensic psychologists Gregory DeClue and Terence Campbell.
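
A back-of-the-envelope calculation shows why a low base rate generates so many false alarms. The Python sketch below assumes a 6 percent base rate, in line with the combined-sample figures above; the sensitivity (.50) and specificity (.90) attached to the hypothetical "high risk" cutoff are my own round-number assumptions, not published Static-99R values.

```python
def flagging_outcomes(base_rate, sensitivity, specificity, n=10_000):
    """Expected outcomes when a 'high risk' cutoff is applied to n offenders."""
    recidivists = base_rate * n
    nonrecidivists = (1 - base_rate) * n
    true_positives = sensitivity * recidivists            # recidivists flagged
    false_positives = (1 - specificity) * nonrecidivists  # non-recidivists flagged
    ppv = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, ppv


# The 6 percent base rate follows the combined-sample figures above; the
# sensitivity and specificity are hypothetical round numbers, not published
# Static-99R values.
tp, fp, ppv = flagging_outcomes(base_rate=0.06, sensitivity=0.50, specificity=0.90)

print(f"True positives flagged:  {tp:.0f}")
print(f"False positives flagged: {fp:.0f}")
print(f"False positives per true positive: {fp / tp:.1f}")  # about 3
print(f"Positive predictive value: {ppv:.0%}")              # about 24%
```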

Finally, and perhaps most importantly, telling a judge or jury how an offender compares with other offenders does not provide meaningful information about the offender’s actual risk. Indeed, such testimony can be highly misleading. For example, told that "Mr. Smith scored in the 97th percentile," judges and jurors may understandably believe this to be an estimate of actual risk, when the less frightening reality is that the person's odds of reoffending are far, far lower (probably no greater than 16 percent), even if he scores in the high-risk range. Seeing such statements in reports always flashes me back to a slim little treatise that was required reading in journalism school, How to Lie With Statistics.

Rather, what the trier of fact needs is a well-calibrated test, one whose predicted probabilities of recidivism match up with the recidivism rates actually observed. The newly developed MnSOST-3 is promising in that regard, at least for offenders in Minnesota, where it was developed. In contrast, the popular Static-99 tools have consistently overestimated risk.
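
For readers who want to see what calibration means in practice, here is a minimal Python sketch comparing, within each score band, the recidivism rate a tool predicts against the rate actually observed on follow-up. Every number and score band in it is invented for illustration -- none comes from the Static-99R, the MnSOST-3, or any published sample. On a well-calibrated tool the two columns roughly agree; the deliberately mismatched "high" band shows what over-prediction looks like.

```python
# Hypothetical predicted vs. observed five-year recidivism rates by score band.
# Every number (and the band cutoffs) is invented for illustration only.
score_bands = {
    # band label:     (predicted rate, observed rate)
    "low (0-1)":      (0.04, 0.05),
    "moderate (2-5)": (0.09, 0.08),
    "high (6+)":      (0.20, 0.12),  # predicted rate well above observed rate
}

print(f"{'score band':<16}{'predicted':>10}{'observed':>10}{'gap':>8}")
for band, (predicted, observed) in score_bands.items():
    gap = predicted - observed
    note = "  <- over-predicts" if gap > 0.05 else ""
    print(f"{band:<16}{predicted:>10.0%}{observed:>10.0%}{gap:>8.0%}{note}")
```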

When the Static-99 premiered, it featured a single table of misleadingly precise risk figures. High scorers were predicted to reoffend at a rate of 52 percent after 15 years, which made it easy for government evaluators to testify that an offender with a high score met the legal criterion for civil commitment -- being "likely" to reoffend.

The instrument’s developers now admit that this original risk table "turned out to be a gross simplification."

Indeed, with each of a series of new iterations over the past few years, the Static-99's absolute risk estimates have progressively declined, such that it would be difficult for the instrument to show high enough risk to support civil detention in most cases. However, in 2009 the developers introduced a new method that can artificially inflate risk levels by comparing an offender not to the instrument's aggregate norms, but to a specially created "high risk" subsample (or "reference group") with unusually high recidivism rates.

Some evaluators are using this method on any offender who is referred for possible civil commitment. For example, I was just reviewing the transcript of a government expert's testimony that he uses these special high-risk norms on offenders who are referred for "an administrative or judicial process." In some cases, this amounts to heaping prejudice upon prejudice. Let's suppose that an offender is referred in a biased manner, due to his race or sexual orientation (something that happens far more often than you might think, and will be the topic of a future blog post). Next, based solely on this referral, this individual's risk level is calculated using recidivism rates that are guaranteed to elevate his risk as compared with other, run-of-the-mill offenders. This method has not been peer reviewed or published, and there is no evidence to support its reliability or validity. Thus, it essentially amounts to the claim that the offender in question is at an especially high risk as compared with other offenders, just "because I (or we) say so." 

The admission of poor stability across samples should make it more difficult to claim that this untested procedure -- which assumes some level of commonality between the selected reference group and the individual being assessed -- is sufficiently accurate for use in legal proceedings. Given some of the sketchy practices being employed in court, however, I am skeptical that this practice will be abandoned in the immediate future.

The article is: "Absolute recidivism rates predicted by Static-99R and Static-2002R sex offender risk assessment tools vary across samples: A meta-analysis" by Leslie Helmus, R. Karl Hanson, David Thornton, Kelly M. Babchishin and Andrew J. R. Harris. Copies may be requested from Dr. Hanson.

2 comments:

Roy Aranda said...

I'm aware of the flaws of the promise of a "gold standard" and weaknesses in actuarials that were a "breath of fresh air" geared to render more objectivity in assessment of risk of violence, and in SVP work, risk of sexual recidivism.

The chipping away of the "flavor of the day" version of statistical bins and groupings over the past several years, along with compelling findings that recidivism risk decreases with age (in some cases precipitously), and that recidivism risk has been on a decline, seems to open the door for a new look at clinical judgment (yikes!).

Is it reasonable to argue that because a tool fails to offer much scientific support re recidivism risk and the rates are low that John Q. sex offender can be safely released to society?

In a world in which the false positives are unacceptably high (I suppose it depends somewhat on what side of the fence you sit on), actually identifying true positives behooves us all.

But how do we get there? Heck if I know, but I wonder if it is time to abandon a sinking ship in which the last compartment just got flooded.

How long can you resuscitate what seems to be a terminally ill procedure, if not methodology?

The promise is no longer a promise.

How can clinical wisdom, strengthened by so much that is empirically sound out there, not be the way to go?

It is the band leader who has to coordinate and navigate all instruments and players. The soloist is gravely ill.

Roy Aranda, Psy.D., J.D.
NY and Long Island

Jeffrey C. Singer, PhD. said...

Well said Dr. Aranda!