Anyone who has taken the Uniform CPA Examination, prepared for it, or been involved in the CPA licensure process knows that the passing score is 75. But very few understand what that 75 means.
In January, the exam structure had its biggest overhaul since the exam switched from paper and pencil to computer-based testing in 2004.
New candidates are curious about how the changes will affect their scores, and many CPAs who sat for the exam before 2004 want to know what’s different from the pre-computer-scoring days. This article provides an overview of the scoring process and answers some frequently asked questions.
COMPUTER VS. PENCIL
In the paper-and-pencil days, scoring was done by hand, which took several weeks, according to John Mattar, the AICPA Examinations Team’s director–Psychometrics & Research. Now, the examinations team writes software that evaluates answers based on an answer key a committee has agreed on, he said.
Essays likewise were scored by a room full of CPAs who were trained to score them. Now, the Examinations Team uses software to score them.
“If the gold standard is what a trained human scorer would score, you gather a relatively large sample—around 1,000 to 1,200 responses scored by people—then you use a program to build a mathematical model that will take elements of those papers and predict human scores and validate that model using data from real candidates and show the software is scoring the way the humans would score it,” Mattar said. “Now you have an approved scoring model and can run responses electronically through that software almost instantly and get scores.” However, even with the automated scoring, a sample of responses is also scored by people as a continuing quality-control check, he said.
The software looks for elements a human would score on, such as organization, development, and usage of language.
Because essays were newly introduced into the Business Environment and Concepts (BEC) section this year, those essays will initially still have to be scored by humans, Mattar said. The Examinations Team will build computer scoring models once it has received enough sample responses.
If a test taker’s total score is close to the passing score, the candidate’s written responses will be automatically regraded by human graders. When there is more than one grader for a response, the average of the scores is used as the final grade, he said.
How questions appear on tests is also different. In the past, there were different forms of the exam, but all Form A’s, for example, contained the same questions. Today’s system—multi-stage testing (MST)—allows the Examinations Team to target the exam to the ability of the candidate to get a more precise estimate of his or her proficiency, Mattar said.
When are easier or more difficult questions given?
Candidates take three multiple-choice testlets (groups of multiple-choice questions) per exam section. The first testlet is always a medium testlet. Those who perform well get a more difficult second testlet, while those who do not perform well receive a second medium difficulty testlet. Similarly, the third testlet can be a medium or a more difficult one and is based on performance on the first two testlets. Task-based simulation (TBS) questions are pre-assigned and are not chosen based on performance on the multiple-choice testlets. Exhibit 1 illustrates the process.
If you do poorly on the first testlet, you can still pass the exam, but you will need to do better on the second and third testlets.
You can get all medium testlets and still pass, but for this to happen, you would have to have good, but not excellent, performance on the first two testlets, and then excellent performance on the last testlet.
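The routing described above can be sketched in Python. This is only an illustration under stated assumptions: the function name `route_testlets`, the 0-to-1 performance values, the cutoff of 0.7, and the use of a simple average for "performance on the first two testlets" are all hypothetical, since the article does not say how performance is measured or where the thresholds lie.

```python
from typing import List

MEDIUM = "medium"
DIFFICULT = "difficult"


def route_testlets(perf_first: float, perf_second: float, cutoff: float = 0.7) -> List[str]:
    """Return the difficulty of the three multiple-choice testlets.

    perf_first and perf_second are hypothetical performance measures in [0, 1];
    cutoff is an assumed threshold for "performing well".
    """
    testlets = [MEDIUM]  # the first testlet is always medium

    # The second testlet depends on performance on the first.
    testlets.append(DIFFICULT if perf_first >= cutoff else MEDIUM)

    # The third testlet depends on performance on the first two
    # (assumed here to be their simple average).
    combined = (perf_first + perf_second) / 2
    testlets.append(DIFFICULT if combined >= cutoff else MEDIUM)
    return testlets


print(route_testlets(0.9, 0.8))  # strong performance throughout
print(route_testlets(0.5, 0.6))  # the all-medium path
```

Task-based simulations are left out of the sketch because they are pre-assigned rather than routed.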
How do you decide which questions are difficult and which are medium?
The difficulty levels of the test questions (and other statistics that are used to describe each test question) are determined through statistical analysis of candidate responses. At the question level, difficulty is not quantified as a category (for example, moderate or difficult), but as a numeric value along a scale. Testlets are classified as either medium or difficult based on the average difficulty of the questions within that testlet.
Does that mean difficult testlets can have easier questions and medium testlets can have difficult questions?
Yes. All testlets have questions ranging in difficulty. Questions in difficult testlets just have a higher average level of difficulty than those in medium testlets.
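A small Python sketch can make this concrete. The difficulty values, the cutoff of 0.5, and the function name `classify_testlet` are invented for illustration; the article states only that question difficulty is a numeric value and that testlets are classified by the average difficulty of their questions.

```python
from statistics import mean
from typing import List


def classify_testlet(question_difficulties: List[float], cutoff: float = 0.5) -> str:
    """Label a testlet by the average difficulty of its questions.

    The cutoff separating "medium" from "difficult" is an assumption.
    """
    return "difficult" if mean(question_difficulties) > cutoff else "medium"


# Hypothetical difficulty values on a numeric scale (higher = harder).
testlet_a = [-1.0, 0.0, 0.2, 1.5]   # includes one hard question
testlet_b = [-0.5, 0.8, 1.2, 2.0]   # includes one easy question

print(classify_testlet(testlet_a))
print(classify_testlet(testlet_b))
```

Here `testlet_a` averages 0.175 and `testlet_b` averages 0.875, so the first is labeled medium even though it contains a hard question, and the second is labeled difficult even though it contains an easy one.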
The testing process involves looking at the “statistical characteristics” of questions. What does that mean?
Three statistics describe the questions: difficulty—whether the question is generally easier or more difficult for candidates; discrimination—how well the question differentiates between more able and less able candidates; and guessing—the chances of candidates answering the question correctly just by guessing. The statistics are generated when the questions are administered as pretest questions and used in the scoring when the questions are operational. (For more explanation on this, see the sidebar, “Pretest vs. Operational Questions,” below.) The formulas for generating the statistics and scoring the exam come from a scoring approach commonly referred to as item response theory (IRT). IRT is used or has recently been adopted by nearly all of the large licensing examination programs in the United States and also by many of the moderate-size and smaller examination programs.
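One widely used IRT model with exactly these three parameters is the three-parameter logistic (3PL) model. The article does not say which IRT model the exam uses, so treating it as 3PL is an assumption, and the function and parameter names below are illustrative. Under that assumption, the probability that a candidate answers a question correctly could be sketched as:

```python
import math


def prob_correct(ability: float, difficulty: float,
                 discrimination: float, guessing: float) -> float:
    """Probability of a correct answer under the standard 3PL model.

    ability        -- the candidate's proficiency on the IRT scale
    difficulty     -- ability level at which the logistic term equals 0.5
    discrimination -- how sharply the probability rises around that level
    guessing       -- lower bound: chance of answering correctly by guessing
    """
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
    return guessing + (1.0 - guessing) * logistic


# A candidate whose ability equals the question's difficulty,
# on a four-option question with a guessing parameter of 0.25.
print(prob_correct(ability=0.0, difficulty=0.0, discrimination=1.0, guessing=0.25))
```

In this model a candidate of very low ability still answers correctly with probability close to the guessing value, while the discrimination value controls how quickly the probability climbs as ability passes the question's difficulty.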
Can I compute my score from the number of questions I answered correctly?
No. The total reported score is a scaled value that takes into account both the response to and the statistical characteristics of each question administered.
How was the passing score set for each section?
Volunteers who are licensed CPAs with recent experience supervising entry-level CPAs participated in a passing score study. They reviewed test questions and how candidates performed on those questions in order to judge what test performance is required to ensure protection of the public interest. The Board of Examiners (BOE) used the results as a guide when it established the passing scores for each section. These passing scores were then mapped to a score of 75 on the scale used to report scores. This process, known as standard setting, is common practice in high-stakes testing.
The BOE established new passing scores in March, following the close of the first testing window for the new exam. The reported passing score for all sections will remain 75.
How do I find out my scores for each content area of the exam?
The AICPA does not release subscores by content area but does report categories of performance. Use caution in interpreting your content area performance, however. The subscores are calculated on fewer items and, therefore, are not as reliable as the final score. The performance comparisons of weaker, comparable and stronger are provided to candidates as a general indicator of performance.
Can I pass the BEC section just by doing well on the multiple-choice questions?
Yes. However, it would be very difficult to do so as you would have to perform exceedingly well. It is advisable to be prepared for both the multiple-choice and the written communication questions.
Can I pass the Auditing and Attestation (AUD), Financial Accounting and Reporting (FAR), or Regulation (REG) sections just by doing well on the multiple-choice questions?
No. The proportion of the total score contributed by task-based simulations in those sections is too large. You would need to get some of the task-based simulation questions correct to pass.
In general terms, what are the steps taken to produce the reported score?
Initially, for purposes of score reporting, each component (multiple-choice questions (MCQs), task-based simulations and written communication) is treated separately (see Exhibit 2). For the multiple-choice and task-based simulation components, IRT is used to obtain the scaled score for each type of question. (IRT is a class of mathematical models used for exam development and analysis, making it easier and more efficient to compare candidate scores when they are based on exams that have different questions.)
The multiple-choice score is then mapped to a scale of 0 to 100. Similarly, the task-based simulation score and total written communication raw score are mapped to a scale of 0 to 100. The scores are then combined with the policy weights (60% multiple-choice and 40% simulations for the AUD, FAR and REG sections; 85% multiple-choice and 15% written communications for the BEC section). The final step involves mapping the aggregate score to the 0-to-99 scale used for score reporting.
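The weighting step can be sketched in Python using the policy weights quoted above. The component scores in the example are invented, and the names `POLICY_WEIGHTS` and `aggregate_score` are illustrative. The sketch stops at the aggregate score because the article does not describe how that aggregate is mapped to the reporting scale.

```python
from typing import Dict

# Policy weights as stated in the article.
POLICY_WEIGHTS: Dict[str, Dict[str, float]] = {
    "AUD": {"mcq": 0.60, "tbs": 0.40},
    "FAR": {"mcq": 0.60, "tbs": 0.40},
    "REG": {"mcq": 0.60, "tbs": 0.40},
    "BEC": {"mcq": 0.85, "written": 0.15},
}


def aggregate_score(section: str, component_scores: Dict[str, float]) -> float:
    """Combine 0-to-100 component scores using the section's policy weights."""
    weights = POLICY_WEIGHTS[section]
    if set(component_scores) != set(weights):
        raise ValueError(f"{section} expects components {sorted(weights)}")
    return sum(weights[name] * component_scores[name] for name in weights)


# Hypothetical component scores, each already mapped to the 0-to-100 scale.
print(round(aggregate_score("FAR", {"mcq": 80.0, "tbs": 70.0}), 2))
print(round(aggregate_score("BEC", {"mcq": 80.0, "written": 60.0}), 2))
```

With these invented inputs, the FAR aggregate is 0.60 × 80 + 0.40 × 70 = 76, and the BEC aggregate is 0.85 × 80 + 0.15 × 60 = 77.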
Several technical reports related to the exam’s psychometric characteristics can be found on the psychometrics section of the exam website, aicpa.org/exam. The exam website also contains information about the changes that took place this year, a sample test and a white paper detailing how the exam is scored, including additional FAQs not listed in this article.
Pretest vs. Operational Questions
All four sections of the Uniform CPA Examination contain multiple-choice questions. The AUD, FAR and REG sections have an additional portion for task-based simulation (TBS) questions; the BEC section has a portion for written communication questions, but no TBS questions.
Each test contains operational and pretest questions. Operational questions are scored; pretest questions are not scored. Instead, a candidate’s response to a pretest question is used to evaluate the question’s statistical performance. It is important to note that most questions are operational questions; however, pretest questions are mixed into the exam and are not identified as pretest questions. From the candidate’s perspective, pretest questions are indistinguishable from operational questions. Pretest questions that meet certain statistical criteria are used as operational questions on future exams. This strategy for pretesting questions is common practice in high-stakes testing. Before appearing on the CPA exam, all operational and pretest questions go through several extensive and rigorous subject matter reviews to ensure that they are technically correct, have a single best or correct answer, are current, and measure entry-level content as specified in the content specification outlines (CSOs). Operational questions have also been statistically evaluated to ensure they meet the psychometric requirements of the CPA exam.
This article was prepared by AICPA staff, led by the Examinations Team. It is based on a white paper available at tinyurl.com/4qzlvrl.
This article, based on an AICPA white paper, provides an overview of the scoring process of the Uniform CPA Examination and answers some frequently asked questions.
All candidates start with a medium testlet (group of multiple-choice questions), but how they perform on that first group determines how difficult the next two testlets will be. Essentially, the better you do in one group, the more difficult the subsequent group.
Three statistics are used to describe the questions: difficulty—whether the question is easier or more difficult for candidates; discrimination—how well the question differentiates between more able and less able candidates; and guessing—the chances of candidates answering the question correctly just by guessing.
Software evaluates written communication answers by predicting human scores based on elements such as organization, development and language usage.
To comment on this article or to suggest an idea for another article, contact Alexandra DeFelice, senior editor, at firstname.lastname@example.org or 212-596-6122.
“CPA Exam to Undergo an Evolution,” May 2010, page 54