
February 15, 2020

Flawed science? Two efforts launched to improve scientific validity of psychological test evidence in court

There’s this forensic psychologist, we’ll call him Dr. Harms, who is infamous for his unorthodox approach. He scampers around the country deploying a bizarre admixture of obscure, outdated and unpublished tests that no one else has ever heard of.

Oh, and the Psychopathy Checklist (PCL-R). Dr. Harms never omits that. To him, everyone is a chillingly dangerous psychopath. Even a 30-year-old whose last crime was at age 15.

What’s most bizarre about Dr. Harms’s esoteric method is that he gets away with it. Attorneys may try to challenge him in court, but their protests usually fall flat. Judges rule that any weaknesses in his method should go to the “weight” that jurors give Dr. Harms’s opinions, rather than the admissibility of his tests.

Psychological tests hold a magical allure as objective truth. They retain their luster even while forensic science techniques previously regarded as bulletproof are undergoing unprecedented scrutiny. Based in large part on our briefcases full of tests, courts have granted psychologists unprecedented influence over an ever-increasing array of thorny issues, from future dangerousness to parental fitness to refugee trauma. Behind the scenes, meanwhile, a lucrative test-production industry is laughing all the way to the bank.

In other forensic “science” niches such as bite-mark analysis and similar types of pattern matching that have contributed to wrongful convictions, appellate attorneys have had to wage grueling, decades-long efforts to rein in shoddy practice. (See Radley Balko's The Cadaver King and the Country Dentist for more on this.) But leaders in the field of forensic psychology are grabbing the bull by the horns and inviting us to do better, proposing novel ways for us to self-police.

New report slams “junk science” psychological assessments


In one of two significant developments, a group of researchers today released evidence of systematic problems with the state of psychological test admissibility in court. The researchers' comprehensive survey found that only about two-thirds of the tools used by clinicians in forensic settings were generally accepted in the field, while even fewer -- only about four in ten -- were favorably reviewed in authoritative sources such as the Mental Measurements Yearbook.

Despite this, psychological tests are rarely challenged when they are introduced in court, Tess M.S. Neal and her colleagues found. Even when they are, the challenges fail about two-thirds of the time. Worse yet, there is little relationship between a tool’s psychometric quality and the likelihood of it being challenged.

Slick ad for one of a myriad of new psych tests.
“Some of the weakest tools tend to get a pass from the courts,” write the authors of the newly issued report, “Psychological Assessments in Legal Contexts: Are Courts Keeping 'Junk Science' Out of the Courtroom?”

The report, currently in press in the journal Psychological Science in the Public Interest, proposes that standard batteries be developed for forensic use, based on the consensus of experts in the field as to which tests are the most reliable and valid for assessing a given psycholegal issue. It further cautions against forensic deployment of newly developed tests that are being marketed by for-profit corporations before adequate research or review by independent professionals.

"Life or death" call to halt prejudicial use of psychopathy test


In a parallel development in the field, 13 prominent forensic psychologists have issued a rare public rebuke of improper use of the controversial Psychopathy Checklist (PCL-R) in court. The group is calling for a halt to the use of the PCL-R in the sentencing phase of death-penalty cases as evidence that a convicted killer will be especially dangerous if sentenced to life in prison rather than death.

As I’ve reported previously in a series of posts (here and here, for example), scores on the PCL-R swing wildly in forensic settings based on which side hired the expert. In a phenomenon known as adversarial allegiance, prosecution-retained experts produce scores in the high-psychopathy range in about half of cases, as compared with less than one out of ten cases for defense experts.

Research does not support prosecution experts' testimony in capital trials that PCL-R scores can accurately predict serious violence in institutional settings such as prison, according to the newly formed Group of Concerned Forensic Mental Health Professionals. And once such a claim is made in court, its prejudicial impact on jurors is hard to overcome, potentially leading to a vote for execution.

The "Statement of Concerned Experts," whose authors include prominent professionals who helped to develop and test the PCL-R, is forthcoming from the respected journal Psychology, Public Policy, and Law.

Beware the all-powerful law of unintended consequences


This scrutiny of how psychological instruments are being used in forensic practice is much needed and long overdue. Perhaps it will eventually trickle down to our friend Dr. Harms, although I have a feeling that won't happen before his retirement.

But never underestimate the law of unintended consequences.

The research group that surveyed psychological test use in the courts developed a complex, seemingly objective method to sort tests according to whether they were generally accepted in the field and/or favorably reviewed by independent researchers and test reviewers.

Ironically enough, one of the tests that they categorized as meeting both criteria – general acceptance and favorable review – was the PCL-R, the same test being targeted by the other consortium for its improper deployment and prejudicial impact in court. (Perhaps not so coincidentally, that test is a favorite of the aforementioned Dr. Harms, who likes to score it high.)

The disconnect illustrates the fact that science doesn’t exist in a vacuum. Psychopathy is a value-laden construct that owes its popularity in large part to current cultural values, which favor the individual-pathology model of criminal conduct over notions of rehabilitation and desistance from crime.

It’s certainly understandable why reformers would suggest the development of “standard batteries … based on the best clinical tools available.” The problem comes in deciding what is “best.”

Who will be privileged to make those choices (which will inevitably reify the dominant orthodoxy and its implicit assumptions)?

What alternatives will those choices exclude? And at whose expense?

And will that truly result in fairer and more scientifically defensible practice in the courtroom?

It’s exciting that forensic psychology leaders are drawing attention to the dark underbelly of psychological test deployment in forensic practice. But despite our best efforts, I fear that equitable solutions may remain thorny and elusive.

January 30, 2014

Research roundup

The articles are flooding in at an alarming rate, threatening to bury me under yet another avalanche. Before I am completely submerged, let me share brief synopses of a few of the more informative ones that I have gotten around to reading.


Assessor bias in high-stakes testing: The case of children’s IQ


I’ve blogged quite a bit about bias in forensic assessment, reporting on problems with such widely used tests as the Psychopathy Checklist and the Static-99R. As I’ve reported, some of the bias can be chalked up to adversarial allegiance, or which side the evaluator is working for, whereas some may be due to personality differences among evaluators. Now, researchers are extending this research into other realms -- with alarming findings.


In a study of intelligence testing among several thousand children at 448 schools, the researchers found significant and nontrivial variations in test scoring that had nothing to do with children’s actual intelligence differences. The findings, reported in the journal Psychological Assessment, are especially curious because scoring of the test in question, the Wechsler Intelligence Scale for Children-Fourth Edition (WISC-IV), seems relatively straightforward and objective (at least as compared to inherently subjective tests like the Psychopathy Checklist, for example).


The article is:

  • Whose IQ Is It? Assessor Bias Variance in High-Stakes Psychological Assessment. McDermott, Paul A.; Watkins, Marley W.; Rhoad, Anna M. Psychological Assessment. Published online Nov. 4, 2013. To request a copy from the first author, click HERE.





Beware pseudo-precision in expert opinions


I’ve never forgotten a video I saw a long time ago, in which the filmmakers drove up to random strangers and asked for directions to a nearby landmark. Some of the good Samaritans gave enthusiastic instructions that were completely wrong, while other people gave correct directions but in a more tentative fashion. The trouble is, the more confident someone appears, the more we judge them as knowing what they are talking about.


One way we gauge a presenter’s confidence, in turn, is by their level of precision. In a new study, researchers found that participants were more likely to rely on advice given by people who provided more precise information. For example, they were more likely to trust someone who said that the Mississippi River was 3,992 miles long, rather than 4,000 miles long.


What this means in the forensic realm is that we should not make claims of false precision when our evidence base is weak. For example, we should not claim to know that someone has a 44 percent chance of violent reoffense within three years. Such misleading claims lend an aura of confidence and expertise that is not warranted.
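To see how little a precise-sounding number may actually tell us, consider sampling error alone. Here is a minimal sketch, with entirely hypothetical numbers, of the confidence interval around a "44 percent" recidivism estimate derived from a normative sample of 200 offenders:

    import math

    def ci_95(p_hat, n):
        # Normal-approximation 95% confidence interval for a proportion.
        half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
        return p_hat - half_width, p_hat + half_width

    # Hypothetical: "44 percent reoffended" in a normative sample of n = 200.
    low, high = ci_95(0.44, 200)
    print(f"point estimate 44%, 95% CI roughly {low:.0%} to {high:.0%}")
    # point estimate 44%, 95% CI roughly 37% to 51%

And sampling error is only the beginning; the leap from a normative sample to the individual sitting across the desk adds layers of uncertainty that no decimal point can paper over.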


The article is:




Ethics and the DSM-5


Speaking of avalanches, the volume of critical response to the DSM-5 is lessening now that the tome has been on the bookshelves for eight months. Trying to keep my finger on the pulse because of my training activities on the manual’s forensic implications, I found an interesting summary of the ethical dilemmas of the latest trends in psychiatric diagnosis.


The author, Jennifer Blumenthal-Barby, is an ethics professor at Baylor College of Medicine’s Center for Medical Ethics and Health Policy. In her critique, published in the Journal of Medical Ethics, she focuses on consequence-based concerns about the dramatic expansion of psychiatric diagnoses in the latest edition of the American Psychiatric Association’s influential manual. Concerns include:


  • False positives, or over-diagnosis, in clinical (and I would add forensic) practice
  • Risks associated with pharmacological treatments of new conditions
  • Neglect of larger structural issues and reduction of individual responsibility through medicalization
  • Discrediting of psychiatry through the trivialization of mental disorders
  • Efforts to eradicate conditions that are valuable or even desirable


Although her discussion is fairly general, she does mention a few of the proposed diagnostic changes of forensic relevance that I’ve blogged about. These include the proposed hypersexual disorder and a proposal to eliminate the age qualifier (of 18 and above) for antisocial personality disorder, to make it consistent with all of the other personality disorders.


It’s a good, brief overview suitable for assignment to students and professionals alike.


The article is: 
  • Psychiatry’s new manual (DSM-5): ethical and conceptual dimensions. Journal of Medical Ethics. Published online Dec. 10, 2013. To request a copy, click HERE.




Dual relationships: Are they all bad?


We’ve all seen the memo: Dual relationships are to be avoided.


But is that always true?


Not according to ethics instructor Ofer Zur.


Multiple relationships are situations in which a mental health professional has a professional role with a client and another role with a person closely related to the client. In a new overview, Zur asserts that, not only are some multiple relationships ethical, they may be unavoidable, desirable, or even -- in some cases -- mandated.


In delineating the ethics and legality of 26 different types of multiple relationships, Zur stresses that in forensic settings, most multiple relationships should be avoided.


The article, Not All Multiple Relationships Are Created Equal: Mapping the Maze of 26 Types of Multiple Relationships, is another good teaching tool, and is freely available online at Zur’s continuing education website.

By the way, if you are in California and are looking for more ethics training, Zur and two of my former colleagues from the state psychological association’s Ethics Committee -- Michael Donner, PhD and Pamela Harmell, PhD -- are co-presenting an interactive ethics session at the upcoming California Psychological Association convention. The convention runs April 9-13 in Monterey, and the ethics conversation -- “Ethics are not Rules: Psych in the Real World” -- is on Saturday, April 12.

January 12, 2014

Putting the Cart Before the Horse: The Forensic Application of the SRA-FV

As the developers of actuarial instruments such as the Static-99R acknowledge that their original norms inflated the risk of re-offense for sex offenders, a brand-new method is cropping up to preserve those inflated risk estimates in sexually violent predator civil commitment trials. The method introduces a new instrument, the “SRA-FV,” in order to bootstrap special “high-risk” norms on the Static-99R. Curious about the scientific support for this novel approach, I asked forensic psychologist and statistics expert Brian Abbott to weigh in.

Guest post by Brian Abbott, PhD*

NEWS FLASH: Results from the first peer-reviewed study about the Structured Risk Assessment: Forensic Version (“SRA-FV”), published in Sexual Abuse: Journal of Research and Treatment (“SAJRT”), demonstrate the instrument is not all that it’s cracked up to be.
Promotional material for an SRA-FV training
For the past three years, the SRA-FV developer has promoted the instrument for clinical and forensic use despite the absence of peer-reviewed, published research supporting its validity, reliability, and generalizability. Accordingly, some clinicians who have attended SRA-FV trainings around the country routinely apply the SRA-FV in sexually violent predator risk assessments and testify about its results in court as if the instrument has been proven to measure what it intends to assess, has known error rates, retains validity when applied to other groups of sexual offenders, and produces trustworthy results.

Illustrating this rush to acceptance most starkly, within just three months of its informal release (February 2011), and in the absence of any peer-reviewed research, the state of California made the incredible decision to adopt the SRA-FV as its statewide mandated dynamic risk measure for assessing sexual offenders in the criminal justice system. The decision was rescinded in September 2013, and the SRA-FV was replaced by a similar instrument, the Stable-2007.

The SRA-FV consists of 10 items that purportedly measure “long-term vulnerabilities” associated with sexual recidivism risk. The items are distributed among three risk domains and are assessed using either standardized rating criteria devised by the developer or scores on certain items of the Psychopathy Checklist-Revised (PCL-R). Scores on the SRA-FV range from zero to six. Examples of items from the instrument include sexual interest in children, lack of emotionally intimate relationships with adults, callousness, and internal grievance thinking. Patients from the Massachusetts Treatment Center in Bridgewater, Massachusetts, who were evaluated as sexually dangerous persons between 1959 and 1984, served as members of the SRA-FV construction group (unknown number) and validation sample (N = 418). The instrument was released for use in December 2010, during a training held in Atascadero, California, by its developer, Dr. David Thornton, research director at the SVP treatment program in Wisconsin and a co-developer of the Static-99R and Static-2002R. Since then, Dr. Thornton has held similar trainings around the nation, where he asserts that the SRA-FV is valid for predicting sexual recidivism risk, achieves incremental validity over the Static-99R, and can be used to choose among Static-99R reference groups.

A primary focus of the trainings is a novel system in which the total score on the SRA-FV is used to select one Static-99R “reference group” from among three available options. The developer describes the statistical modeling underlying this procedure, which he claims increases predictive validity and power over using the Static-99R alone. However, no reliability data are offered to support this claim. In the December 2010 training, several colleagues and I asked for the inter-rater agreement rate, but Dr. Thornton refused to provide it.

I was astounded but not surprised when some government evaluators in California started to apply the SRA-FV in sexually violent predator risk assessments within 30 days of the December 2010 training. This trend blossomed in other jurisdictions with sexually violent predator civil confinement laws. Typically, government evaluators applied the SRA-FV to select Static-99R reference groups, invariably choosing the “High Risk High Needs” sample, the group with the highest re-offense rates. A minority of clinicians stated in reports and court testimony that the SRA-FV increased predictive accuracy over the Static-99R alone, but they were unable to quantify this effect. The same clinicians argued that the pending publication of the Thornton and Knight study was sufficient to justify the instrument's use in civil confinement risk assessments for sexually violent predators. They appeared to imply that the mere fact that a construction and validation study had been accepted for publication was an imprimatur that the instrument was reliable and valid for its intended purposes. Now that the research has been peer-reviewed and published, the results show that these government evaluators put the proverbial cart before the horse.

David Thornton and Raymond Knight penned an article that documents the construction and validation of the SRA-FV. The publication is a step in the right direction, but by no means do the results justify widespread application of the SRA-FV in sexual offender risk assessment in general or sexually violent predator proceedings in particular. Rather, the results of the study only apply to the group upon which the research was conducted and do not generalize to other groups of sexual offenders. Before discussing the limitations of the research, I would like to point out some encouraging results.

The SRA-FV did, as its developer claimed, account for more sources of sexual recidivism risk than the Static-99R alone. However, it remains unknown which of the SRA-FV’s ten items contribute to risk prediction. The study also found that the combination of the Static-99R and SRA-FV increased predictive power. This improved predictive accuracy, however, must be replicated to determine whether the combination of the two instruments will perform similarly in other groups of sexual offenders. This is especially important when considering that the SRA-FV was constructed and validated on individuals from the Bridgewater sample from Massachusetts who are not representative of contemporary groups of sexual offenders. Thornton and Knight concede this point when discussing how the management of sexual offenders through all levels of the criminal justice system in Massachusetts between 1959 and 1984 was remarkably lenient compared to contemporary times. Such historical artifacts likely compromise any reliable generalization from patients at Bridgewater to present-day sexual offenders.

Training materials presented four months before the State of California rescinded use of the SRA-FV

Probably the most crucial finding from the study is the SRA-FV’s poor inter-rater reliability. The authors categorize the 64 percent rate of agreement as “fair.” It is well known that inter-rater agreement in research studies is typically higher than in real-world applications, a point addressed previously in this blog in regard to the PCL-R. A field reliability study of the SRA-FV among 19 government psychologists rating 69 sexually violent predators in Wisconsin (Sachsenmaier, Thornton, & Olson, 2011) found an inter-rater agreement rate of only 55 percent for the SRA-FV total score, which is considered poor reliability. These data suggest that 36 to 45 percent of an SRA-FV score constitutes error, raising serious concerns about the trustworthiness of the instrument. To their credit, Thornton and Knight acknowledge this as an issue and note that steps should be taken to increase reliable scoring. Nonetheless, the current inter-rater reliability falls far short of the 80 percent floor recommended for forensic practice (Heilbrun, 1992). Unless steps are taken to dramatically improve reliability, the claims that the SRA-FV increases predictive accuracy either alone or in combination with the Static-99R, and that it should be used to select Static-99R reference groups, are moot.
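For readers who want to see how such figures are computed: percent agreement simply counts identical scores, while the more standard reliability index in this literature is an intraclass correlation coefficient (ICC). Here is a minimal sketch of both, using Shrout and Fleiss's ICC(2,1) on invented two-rater scores -- none of these numbers come from the SRA-FV studies:

    import numpy as np

    def icc_2_1(x):
        # Shrout & Fleiss ICC(2,1) for an (n targets x k raters) score matrix,
        # built from the two-way ANOVA decomposition.
        n, k = x.shape
        grand = x.mean()
        row_means, col_means = x.mean(axis=1), x.mean(axis=0)
        msr = k * ((row_means - grand) ** 2).sum() / (n - 1)   # between cases
        msc = n * ((col_means - grand) ** 2).sum() / (k - 1)   # between raters
        resid = x - row_means[:, None] - col_means[None, :] + grand
        mse = (resid ** 2).sum() / ((n - 1) * (k - 1))         # error
        return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

    # Invented SRA-FV-style total scores (0-6 scale) from two raters on ten cases.
    scores = np.array([[2, 3], [4, 4], [1, 2], [5, 3], [0, 1],
                       [3, 3], [6, 4], [2, 2], [4, 5], [1, 1]], dtype=float)

    exact = (scores[:, 0] == scores[:, 1]).mean()
    print(f"exact agreement: {exact:.0%}, ICC(2,1): {icc_2_1(scores):.2f}")

Note that the two indices answer different questions: raters can correlate reasonably well while rarely agreeing on the exact score, which matters when a point or two determines which reference group gets selected.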

It is also important to note that, although Thornton and Knight confuse the terms validation and cross-validation in their article, this study represents a validation methodology. Cross-validation is a process by which the statistical properties found in a validation sample (such as reliability, validity, and item correlations) are tested in a separate group to see whether they hold up. In contrast, Thornton and Knight first considered the available research data from a small number of individuals from the Bridgewater group to determine which items would be included in the SRA-FV. This group is referred to as the construction sample. The statistical properties of the newly conceived measure were then studied on 418 Bridgewater patients, who constitute the validation sample. The psychometric properties found in the validation group have not been tested on other, contemporary sexual offender groups. Absent such cross-validation studies, we simply have no confidence that the SRA-FV works as designed for groups other than the sample upon which it was validated. To their credit, Thornton and Knight acknowledge this limitation and warn readers not to generalize the validation research to contemporary groups of sexual offenders.
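The practical consequence of skipping cross-validation is shrinkage: item weights chosen because they looked best in the validation sample tend to lose accuracy in any fresh group. The simulation below is a toy illustration of that general phenomenon, not a reconstruction of the Thornton and Knight procedure; every number in it is invented.

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate(n, true_w, rng):
        # Fabricated "items" and a binary recidivism outcome.
        X = rng.normal(size=(n, len(true_w)))
        p = 1 / (1 + np.exp(-(X @ true_w - 1.0)))
        return X, rng.binomial(1, p)

    def auc(scores, y):
        # Rank-based AUC: probability a recidivist outscores a non-recidivist.
        pos, neg = scores[y == 1], scores[y == 0]
        return (pos[:, None] > neg[None, :]).mean()

    true_w = np.array([0.4, 0.4] + [0.0] * 8)    # only 2 of 10 items carry signal
    X_val, y_val = simulate(418, true_w, rng)    # "validation" sample
    X_new, y_new = simulate(418, true_w, rng)    # independent contemporary group

    # "Validate": weight each item by its observed correlation with the outcome.
    w = np.array([np.corrcoef(X_val[:, j], y_val)[0, 1] for j in range(10)])

    print(f"AUC in the validation sample itself: {auc(X_val @ w, y_val):.2f}")
    print(f"AUC in the independent sample:       {auc(X_new @ w, y_new):.2f}")  # typically lower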

The data on incremental predictive validity, while interesting, have little practical value at this point for two reasons. One, it is unknown whether the results will replicate in contemporary groups of sexual offenders. Two, no data are provided to quantify the increased predictive power. The study does not provide an experience table of probability estimates at each score on the Static-99R after taking into account the effect of the SRA-FV scores. It seems disingenuous, if not misleading, to inform the trier of fact that the combined measures increase predictive power but to fail to quantify the result and the associated error rate.

In my practice, I have seen the SRA-FV used most often to select among three Static-99R reference groups. Invariably, government evaluators in sexually violent predator risk assessments assign SRA-FV total scores consistent with the selection of the Static-99R High Risk High Needs reference group. Only the risk estimates associated with the highest Static-99R scores in this reference group are sufficient to support an opinion that an individual meets the statutory level of sexual dangerousness necessary to justify civil confinement. Government evaluators who have used the SRA-FV for this purpose cannot cite research demonstrating that the procedure works as intended or that it produces a reliable match to the group representing the individual being assessed. Unfortunately, Thornton and Knight are silent on this application of the SRA-FV.

In a recently published article, I tested the use of the SRA-FV for selecting Static-99R reference groups. In brief, Dr. Thornton devised the selection method using statistical modeling based solely on data from the Bridgewater sample. The method was not based on the actual scores of members of each of the three reference groups. Rather, it was hypothetical, presuming that members of a given Static-99R reference group would exhibit a range of SRA-FV scores that does not overlap with those of the other two reference groups. To the contrary, I found that the hypothetical SRA-FV reference group system did not work as designed: the SRA-FV scores between reference groups overlapped by wide margins. In other words, an SRA-FV total score would likely be consistent with selecting two if not all three Static-99R reference groups. In light of these findings, it is incumbent upon the developer to provide research using actual subjects to prove that the SRA-FV total score is a valid method by which to select a single Static-99R reference group and that the procedure can be applied reliably. At this point, credible support does not exist for using the SRA-FV to select Static-99R reference groups.

The design, development, validation, and replication of psychological instruments are guided by the Standards for Educational and Psychological Testing (“SEPT”; American Educational Research Association et al., 1999). When comparing the Thornton and Knight study to the framework provided by SEPT, it is apparent that the SRA-FV is in the infancy stage of development. At best, the SRA-FV is a work in progress that needs substantially more research to improve its psychometric properties. Aside from its low reliability and the inability to generalize the validation research to other groups of sexual offenders, other important statistical properties await examination, including but not limited to:

  1. the standard error of measurement (illustrated in the sketch below)
  2. factor analysis of whether the items within each of the three risk domains load significantly on their respective domains
  3. the extent of the correlation between each SRA-FV item and sexual recidivism
  4. which SRA-FV items add incremental validity beyond the Static-99R, and which may be redundant with it
  5. whether each item has construct validity.
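The first of these, the standard error of measurement, follows directly from reliability figures of the kind already published, which is why its absence is conspicuous. A back-of-the-envelope sketch (the score standard deviation here is hypothetical, since the study does not report one):

    import math

    def sem(sd, reliability):
        # Classical test theory: SEM = SD * sqrt(1 - r_xx).
        return sd * math.sqrt(1 - reliability)

    # Hypothetical SRA-FV total-score SD of 1.5 (not reported in the study),
    # paired with the two reliability figures quoted above.
    for r in (0.64, 0.55):
        s = sem(1.5, r)
        print(f"reliability {r:.2f}: SEM = {s:.2f} points, "
              f"95% band = +/- {1.96 * s:.2f} on a 0-6 scale")

A 95 percent band approaching plus-or-minus two points on a zero-to-six scale would swallow much of the instrument's range, underscoring the reference-group overlap problem described above.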

It is reasonable to conclude that, at its current stage of development, use of the SRA-FV in forensic proceedings is premature and scientifically indefensible. In closing: in their eagerness to improve the accuracy of their risk assessments, clinicians relied upon Dr. Thornton’s claims in the absence of peer-reviewed research demonstrating that the SRA-FV achieves generally accepted levels of reliability and validity. The history of forensic evaluators deploying the SRA-FV before the publication of the construction and validation study raises significant ethical and legal questions:

  • Should clinicians be accountable to vet the research presented in trainings by an instrument’s developer before applying a tool in forensic practice? 

  • What responsibility do clinicians have to rectify testimony where they presented the SRA-FV as if the results were reliable and valid?

  •  How many individuals have been civilly committed as sexually violent predators based on testimony that the findings from the SRA-FV were consistent with individuals meeting the legal threshold for sexual dangerousness, when the published data does not support this conclusion?

Answers to these questions and others go beyond the scope of this blog. However, in a recent appellate decision, a Washington appeals court questioned the admissibility of the SRA-FV in the civil confinement trial of Steven Ritter. The appellate court determined that the application of the SRA-FV was critical to the government evaluator’s opinion that Mr. Ritter met the statutory threshold for sexual dangerousness. Since the SRA-FV is considered a novel scientific procedure, the appeals court reasoned that the trial court erred by not holding a defense-requested evidentiary hearing to decide whether the SRA-FV was admissible evidence for the jury to hear. The appeals court remanded the issue to the trial court for a Kelly-Frye hearing on the SRA-FV. Stay tuned!

References

Abbott, B.R. (2013). The Utility of Assessing “External Risk Factors” When Selecting Static-99R Reference Groups. Open Access Journal of Forensic Psychology, 5, 89-118.

American Educational Research Association, American Psychological Association and National Council on Measurement in Education. (1999). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.

Heilbrun, K. (1992). The role of psychological testing in forensic assessment. Law and Human Behavior, 16, 257-272. doi: 10.1007/BF01044769.

In Re the Detention of Steven Ritter. (2013, November). In the Appeals Court of the State of Washington, Division III. 

Sachsenmaier, S., Thornton, D., & Olson, G. (2011, November). Structured risk assessment forensic version (SRA-FV): Score distribution, inter-rater reliability, and margin of error in an SVP population. Presentation at the 30th Annual Research and Treatment Conference of the Association for the Treatment of Sexual Abusers, Toronto, Canada.

Thornton, D. & Knight, R.A. (2013). Construction and validation of the SRA-FV Need Assessment. Sexual Abuse: A Journal of Research and Treatment. Published online December 30, 2013. doi: 10.1177/1079063213511120.
* * *


*Brian R. Abbott is a licensed psychologist in California and Washington who has evaluated and treated sexual offenders for more than 35 years. Among his areas of forensic expertise, Dr. Abbott has worked with sexually violent predators in various jurisdictions within the United States, where he performs psychological examinations, trains professionals, consults on psychological and legal issues, offers expert testimony, and publishes papers and peer-reviewed articles.



(c) Copyright 2013 - All rights reserved

January 5, 2014

New evidence of psychopathy test's poor accuracy in court

Use of a controversial psychopathy test is skyrocketing in court, even as mounting evidence suggests that the prejudicial instrument is highly inaccurate in adversarial settings.

The latest study, published by six respected researchers in the influential journal Law and Human Behavior, explored the accuracy of the Psychopathy Checklist, or PCL-R, in Sexually Violent Predator cases around the United States.

The findings of poor reliability echo those of other recent studies in the United States, Canada and Europe, potentially heralding more admissibility challenges in court. 

Although the PCL-R is used in capital cases, parole hearings and juvenile sentencing, by far its most widespread forensic use in the United States is in Sexually Violent Predator (SVP) cases, where it is primarily invoked by prosecution experts to argue that a person is at high risk for re-offense. Building on previous research, David DeMatteo of Drexel University and colleagues surveyed U.S. case law from 2005-2011 and located 214 cases from 19 states -- with California, Texas and Minnesota accounting for more than half of the total -- that documented use of the PCL-R in such proceedings.

To determine the reliability of the instrument, the researchers examined a subset of 29 cases in which the scores of multiple evaluators were reported. On average, scores reported by prosecution experts were about five points higher than those reported by defense-retained experts. This is a large and statistically significant difference that cannot be explained by chance. 

Prosecution experts were far more likely to give scores of 30 or above, the cutoff for presumed psychopathy, reporting such scores in almost half of the cases, whereas defense witnesses did so in less than 10 percent.

Looking at interrater reliability another way, the researchers applied a classification scheme from the PCL-R manual in which scores are divided into five discrete categories, from “very low” (0-8) to “very high” (33-40). In almost half of the cases, the scores given by two evaluators fell into different categories; in about one out of five cases the scores were an astonishing two or more categories apart (e.g., “very high” versus “moderate” psychopathy).
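For the curious, the banding arithmetic is easy to reproduce. The post quotes only the endpoints ("very low" 0-8 and "very high" 33-40); the intermediate cut points below are my even-width interpolation, and the paired scores are invented, not case data:

    import bisect

    LABELS = ["very low", "low", "moderate", "high", "very high"]
    CUTS = [8, 16, 24, 32]  # upper bound of each band below "very high" (assumed)

    def band(score):
        # bisect_left puts a score of 8 in "very low", 9 in "low", and so on.
        return LABELS[bisect.bisect_left(CUTS, score)]

    # Invented prosecution/defense score pairs for the same individual.
    for p, d in [(33, 24), (31, 27), (28, 18), (35, 22)]:
        gap = abs(bisect.bisect_left(CUTS, p) - bisect.bisect_left(CUTS, d))
        print(f"{p} ({band(p)}) vs {d} ({band(d)}): {gap} band(s) apart")

As the first invented pair shows, scores only nine points apart can land in "very high" versus "moderate" -- the very scenario the researchers documented.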

Surprisingly, interrater agreement was even worse among evaluators retained by the same side than among opposing experts, suggesting that the instrument’s inaccuracy is not solely due to what has been dubbed adversarial (or partisan) allegiance.

Despite its poor accuracy, the PCL-R is extremely influential in legal decision-making. The concept of psychopathy is superficially compelling in our current era of mass incarceration, and the instrument's popularity shows no sign of waning. 

Earlier this year, forensic psychologist Laura Guy and colleagues reported on its power in parole decision-making in California. The state now requires government evaluators to use the PCL-R in parole fitness evaluations for “lifers,” or prisoners sentenced to indeterminate terms of up to life in prison. Surveying several thousand cases, the researchers found that PCL-R scores were a strong predictor of release decisions by the Parole Board, with those granted parole scoring an average of about five points lower than those denied parole. Having just conducted one such evaluation, I was struck by the frightening fact -- alluded to by DeMatteo and colleagues -- that the chance assignment of an evaluator who typically gives high scores on the PCL-R “might quite literally mean the difference between an offender remaining in prison versus being released back into the community.”

Previous research has established that Factor 1 of the two-factor instrument – the factor measuring characterological traits such as manipulativeness, glibness and superficial charm – is especially prone to error in forensic settings. This is not surprising, as traits such as “glibness” are somewhat in the eye of the beholder and not objectively measurable. Yet, the authors assert, “it is exactly these traits that seem to have the most impact” on judges and juries.

Apart from the issue of poor reliability, the authors questioned the widespread use of the PCL-R as evidence of impaired volitional control, an element required for civil commitment in SVP cases. They labeled as “ironic, if not downright contradictory” the fact that psychopathy is often touted in traditional criminal responsibility (or insanity) cases as evidence of badness as opposed to mental illness, yet in SVP cases it magically transforms into evidence of a major mental disorder that interferes with self-control. 

The evidence is in: The Psychopathy Checklist-Revised is too inaccurate in applied settings to be relied upon in legal decision-making. With consistent findings of abysmal interrater reliability, its prejudicial impact clearly outweighs any probative value. However, the gatekeepers are not guarding the gates. So long as judges and attorneys ignore this growing body of empirical research, prejudicial opinions will continue to be cloaked in a false veneer of science, contributing to unjust outcomes.

* * * * *
The study is: 

The Role and Reliability of the Psychopathy Checklist-Revised in U.S. Sexually Violent Predator Evaluations: A Case Law Survey by DeMatteo, D., Edens, J. F., Galloway, M., Cox, J., Toney Smith, S. and Formon, D. (2013). Law and Human Behavior

Copies may be requested from the first author (HERE).

The same research team has just published a parallel study in Psychology, Public Policy and Law:

“Investigating the Role of the Psychopathy Checklist-Revised in United States Case Law” by DeMatteo, David; Edens, John F.; Galloway, Meghann; Cox, Jennifer; Smith, Shannon Toney; Koller, Julie Present; Bersoff, Benjamin

My related essays and blog posts (I especially recommend the three marked with asterisks):



(c) Copyright Karen Franklin 2013 - All rights reserved

February 26, 2013

Tipping points: Of life, death and psychological data

Forensic psychologists and the machinery of execution

Andre Thomas, Texas
When Andre Thomas killed his wife and children, he was careful to use three different knives so that "the blood from each body would not cross-contaminate, thereby ensuring that the demons inside each of them would die," as Marc Bookman explained it in an eloquent Mother Jones report. Then, he cut out their hearts and went to the police station to confess. While awaiting trial, he cut out one of his eyes. Later, he cut out the other, eating it in order to keep the government from using it to spy on his mind.

In response to changing social mores and international condemnation (only a handful of countries remain in the business of killing their wayward citizens), the U.S. Supreme Court in 2002 exempted the mentally retarded from execution, following up three years later by exempting juveniles. With this narrowing of the contours of capital punishment, the question of how mentally impaired one must be to avoid execution is increasingly in the forefront. That makes severe mental illness "the next frontier" of capital jurisprudence, in the words of psychology-law scholar Bruce Winick.

How insane?

Executing the floridly insane constitutes cruel and unusual punishment, barred under the Eighth Amendment of the U.S. Constitution. However, the "Ford standard" for competency to be executed is very low; a condemned person need merely understand the link between his crime and his punishment. In Thomas's case, the government insists that he is not insane enough to be spared, despite chronic auditory hallucinations, delusions, and treatment for paranoid schizophrenia. 

Making this case especially ironic is that Thomas has become a poster child for the need for new laws allowing preemptive detention of people whose mental illness makes them dangerous. "At least twice in the three weeks before the crime, Thomas had sought mental health treatment," reports the Texas Tribune in a series on mental health and the criminal justice system. "On two occasions, staff members at the medical facilities were so worried that his psychosis made him a threat to himself or others that they sought emergency detention warrants for him. Despite talk of suicide and bizarre biblical delusions, he was not detained for treatment."

John Errol Ferguson, Florida
With the U.S. Supreme Court declining to draw a bright line, the question of exactly how rational a condemned prisoner's understanding must be in order for an execution to proceed has become central to legal appeals by psychotic prisoners like Thomas. Another current example is the case of John Errol Ferguson, a mass killer in Florida whose October execution was stayed due to concerns about his mental state. Ferguson's long history of paranoid schizophrenia is undisputed; the question is whether his grandiose and religious delusions interfere with his understanding that the state is going to kill him for his crimes, and that when he dies he will be, well, dead.

Ferguson's lawyers have argued that the killer lacks rational understanding, because he believes he is "the Prince of God" and will be returned to Earth post-execution to save the world from a communist plot. The state of Florida counters that all that is required to be competent for execution is that a prisoner have an "awareness" that he is set to be executed for crimes he committed. To resolve the dispute, Florida's governor appointed a panel of experts to collectively evaluate Ferguson; a lower court also heard extensive testimony from prison personnel and other mental health experts, including malingering expert Richard Rogers, who administered a large battery of malingering tests and opined that Ferguson was not faking mental illness. Ultimately, the circuit court found little to distinguish Ferguson's belief system from typical religious ideation:
"There is no evidence in the record that Ferguson’s belief as to his role in the world and what may happen to him in the afterlife is so significantly different from beliefs other Christians may hold so as to consider it a sign of insanity."

How intellectually impaired?

Meanwhile, with the categorical exemption of prisoners with mental retardation from the death row rosters, courts around the nation are seeing pitched battles over intelligence scores that can make the difference between life and death. On each side of the IQ Wars in so-called Atkins hearings (named for the 2002 U.S. Supreme Court decision barring execution of the developmentally disabled) are neuropsychologists whose testimony delves into the technicalities of margins of error, practice effects, and the now-familiar Flynn Effect. This latter phenomenon of IQ inflation, in which scores on any given IQ test rise by about three points per decade, creates a situation in which a person on the cusp of mental retardation might score over 70 -- making him eligible for execution -- on an older IQ test but not on a newer one.
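The arithmetic of a Flynn correction is straightforward. A sketch with hypothetical numbers, using the commonly cited adjustment of roughly 0.3 points per year since the test was normed:

    def flynn_adjusted(score, norm_year, test_year, rate=0.3):
        # Deduct ~0.3 IQ points for each year between the test's norming and
        # its administration, since norms grow stale as population scores rise.
        return score - rate * (test_year - norm_year)

    # Hypothetical: a score of 73 obtained in 2005 on a test normed in 1985.
    print(flynn_adjusted(73, norm_year=1985, test_year=2005))  # 67.0 -- below 70

Two decades of stale norms can thus move a borderline defendant from death-eligible to exempt, which is precisely why the effect is litigated so fiercely.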

Ronell Wilson, New York
Take the case of Ronell Wilson in New York, who murdered two undercover police officers. His nine-day Atkins hearing earlier this winter featured seven experts dissecting nine IQ scores obtained over a 13-year period. In its 55-page opinion, the U.S. District Court spent many pages explaining why a 95 percent confidence interval (a range of two standard errors of measurement on either side of a score, something commonly reported in clinical practice) was inappropriate in Atkins claims, because it could place people into the range of mental retardation even if they score well above 70 on IQ tests. The court instead opted for a 66 percent confidence level. Either way, it was all much ado about nothing: "Even after taking into account the possibility of measurement error, the Flynn Effect, and (to a limited extent) the practice effect," Wilson's IQ scores ranging from 70 to 84 were "simply too high to qualify him under the definition of significantly subaverage intellectual functioning."
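To make the dispute concrete, here is what the competing intervals look like in classical test theory terms; the observed score of 73 and the 2.5-point SEM are hypothetical illustrations, not Wilson's actual data:

    def score_interval(observed, sem=2.5, z=1.96):
        # Observed score +/- z * SEM; z = 1.96 gives ~95%, z ~= 0.95 gives ~66%.
        return observed - z * sem, observed + z * sem

    print(score_interval(73, z=1.96))  # ~95%: (68.1, 77.9) -- dips below 70
    print(score_interval(73, z=0.95))  # ~66%: (70.6, 75.4) -- stays above 70

In other words, the court's choice of interval width can itself determine which side of the 70 line a borderline score falls on.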

As Peter Aldhous reports in the New Scientist, the outcomes of these IQ battles vary widely by jurisdiction (and quality of lawyering, I would imagine). Overall, 38 percent of Atkins claims are successful, according to a study at Cornell Law School, but the success rate is 81 percent in North Carolina compared with only 12 percent in Alabama. A convicted killer named Earl Davis with IQ scores of 75, 76, 65 and 70 was spared execution on the basis of the Flynn effect. But that same effect was not persuasive in the case of Kevin Green of Virginia, whose mean IQ score was actually three points lower than Davis's (71, 55, 74 and 74); Green was executed in 2008.

Texas, meanwhile, which has carried out more than one-third of all executions in the United States since capital punishment was reinstated, has come up with its own unique standard of mental retardation, based on the character Lennie from John Steinbeck's Of Mice and Men. Wrote the Texas Court of Criminal Appeals in a 2004 explication of the level of mental retardation necessary to avoid the death penalty: 
"Texas citizens might agree that Steinbeck's Lennie should, by virtue of his lack of reasoning ability and adaptive skills, be exempt. But, does a consensus of Texas citizens agree that all persons who might legitimately qualify for assistance under the social services definition of mental retardation be exempt from an otherwise constitutional penalty?"

A technical spectacle

Whereas in the real world intelligence and insanity are continuous variables, the law chooses to treat them as dichotomous. Psychologists assist in promoting this legal fiction, helping to sort the condemned into discrete categories of sane or insane, mentally retarded or able-minded. Although the tests we use are supposedly objective, data in this highly polarized area can be skewed to favor one outcome or the other. Neuropsychology experts hired by the defense may focus on the Flynn Effect and argue for large confidence bands around IQ scores. Meanwhile, at least one "go-to" psychologist for prosecutors in Texas took a decidedly different approach, systematically skewing data so that more marginally functioning men were made eligible for execution.

Denkowski's Atkins cases, Texas Observer
George Denkowski developed his own method of evaluating Atkins claims, based on his idea that individuals on Death Row may do poorly on traditional tests because of cultural and social factors rather than lack of intellectual ability. So he discounted evidence that defendants, for example, could not count money or take care of their basic hygiene, reasoning that maybe they just were not taught those skills. With an inmate named Daniel Plata, for example, Denkowski bumped up his IQ score from 70 to 77 and his score on a test of adaptive functioning from 61 to 71. He even published an article in the American Journal of Forensic Psychology in 2008 in which he explained this system of clinical overrides. Complaints by fellow psychologists that his technique had no scientific basis eventually led the Texas State Board of Examiners of Psychologists to issue a reprimand and to bar him from conducting future intellectual disability evaluations in criminal cases. He admitted no legal wrongdoing but agreed to a $5,500 fine -- a pretty lightweight penalty considering that two of the 29 condemned men he evaluated were executed.

Unethical as his method was, it did give attention to the issues of race and class, which may hide in plain sight when appeals revolve around the technical interpretations of psychological test data. It is Constitutionally impermissible for race to be considered in capital cases. But it stretches credulity to believe race played no role, for example, in the case of eye-plucking Andre Thomas: Thomas is African American, his late wife was white, all of the jurors were white, and four jurors had acknowledged opposition to interracial marriages. In the very last sentence of his closing argument for the death penalty, reported Bookman in the Mother Jones piece, the prosecutor asked jurors whether they would be willing to risk Thomas "asking your daughter out, or your granddaughter out?" This in the town of Sherman, which burned its entire Black district to the ground in 1930 during a race riot triggered by -- what else -- rumors that a Black man had raped a white woman.

Trauma as common denominator

Setting aside the technical criteria for insanity and mental retardation, if one could boil capital cases down to one common denominator, it would be trauma. In my experience working in the capital trenches, I have found that most Death Row denizens survived horrific childhoods dominated by physical, sexual and emotional torture and neglect, combined with multi-generational patterns of mental illness and violence, all overlaid with hard-core substance abuse.

As forensic psychiatrist Pablo Stuart described this phenomenon in an interview with reporter Scott Johnson at Oakland Effect, a journalism project focusing on violence in Oakland, California, “the fact that there is such consistency on these cases is significant. Some of these people, they just never had a chance.”

* * * * *
Related resources:

The Mother Jones report on Andre Thomas is HERE; the audio podcast, read by M*A*S*H star Mike Farrell, can be downloaded or listened to HERE.
My 2009 posts on the Andre Thomas case are HERE and HERE.
 
My prior posts on the Ford standard of competency and the U.S. Supreme Court's decision in the case of Leon Panetti (with links to court rulings and lots of related resources) are HERE, HERE and HERE. The U.S. Supreme Court's 2007 opinion in Panetti v. Quarterman is HERE. A 28-minute educational video, "Executing the Insane: The Case of Scott Panetti," is available HERE.

My 2010 post on the Denkowski case is HERE.

Psychologist Kevin McGrew's master archive on the Flynn Effect is HERE.

Related books include Michael Perlin's Mental Disability and the Death Penalty: The Shame of the States (the first chapter of which can be previewed HERE) and Daniel Murrie and David DeMatteo's Forensic Mental Health Assessments in Death Penalty Cases.