After conquering the Wharton MBA, the bar exam, 13 of 15 AP courses and the GRE verbal test, ChatGPT finally met its Waterloo in the form of an accounting class. The AI chatbot did not merely do poorly either — with a score of 47.4%, it utterly bombed, not even getting a D grade.
These results came from a recent study from Brigham Young University, in which 327 co-authors from 186 educational institutions in 14 countries participated, contributing 25,181 classroom accounting exam questions. The researchers also recruited undergraduate BYU students to feed another 2,268 textbook test bank questions to ChatGPT. The questions covered accounting information systems (AIS), auditing, financial accounting, managerial accounting and tax, and varied in difficulty and type (true/false, multiple choice, short answer, etc.).
Human students, while not exactly acing the questions, did much better, averaging 76.7%. The AI did outperform students on 11.3% of the questions, mainly in AIS and auditing, but did worse than humans on tax, financial and managerial assessments, possibly because ChatGPT is built for language rather than math. Illustrating this, during testing ChatGPT did not always recognize that it was performing mathematical operations and made nonsensical errors, such as adding two numbers in a subtraction problem or dividing numbers incorrectly.
Other observations include:
- ChatGPT often provides explanations for its answers, even when those answers are incorrect. Other times, its explanations are accurate, but it then proceeds to select the wrong multiple-choice answer.
- ChatGPT sometimes makes up facts. For example, when asked for a reference, it generates a plausible-looking citation that is completely fabricated; the work, and sometimes the authors, do not even exist.
- ChatGPT’s answers to the same question sometimes varied when the question was entered multiple times, and its responses did not always progress from incorrect to correct.
- The bot’s response to questions that depend on the interpretation of images, such as business process diagrams (BPDs) or tabulated data in picture format, varied. ChatGPT sometimes recognized that it lacked the image and declined to answer, sometimes recognized the missing image but answered anyway (sometimes correctly, sometimes not), and sometimes did not recognize the missing image and answered anyway (sometimes correctly, sometimes not).
- ChatGPT could generate code and find errors in previously written code. For example, given a database schema or flat file, ChatGPT could write correct SQL and normalize the data (a sketch of this kind of task follows this list).
- ChatGPT struggled to handle long, written questions with multiple parts, even when allowing for “carry over” mistakes.
- In a case study context, ChatGPT was able to respond to questions that assessed the firm's past strategic actions. However, when questions required working with data, it could not answer beyond providing formulas.
- ChatGPT performed even worse where students were required to apply knowledge. This highlights that ChatGPT is a general-purpose tool rather than an accounting-specific one. It is not surprising, therefore, that students do better on accounting-specific questions, since the technology has not yet been trained to answer them.
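To give a concrete sense of the task described in the code-related bullet above, here is a minimal sketch, in Python with SQLite, of normalizing a flat order file into separate customer and order tables. The sample rows, table names and columns are hypothetical illustrations and are not taken from the study's question bank.

```python
import sqlite3

# Hypothetical flat file: one denormalized row per order,
# with the customer's details repeated on every line.
flat_rows = [
    ("O-1001", "Acme Corp", "Provo, UT", 250.00),
    ("O-1002", "Acme Corp", "Provo, UT", 75.50),
    ("O-1003", "Birch LLC", "Salt Lake City, UT", 410.25),
]

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized schema: customers (one row per customer) and
# orders (one row per order, referencing its customer).
cur.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT UNIQUE NOT NULL,
        city        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_no    TEXT PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    );
""")

for order_no, name, city, amount in flat_rows:
    # Insert the customer only once; repeated names are ignored.
    cur.execute(
        "INSERT OR IGNORE INTO customers (name, city) VALUES (?, ?)",
        (name, city),
    )
    # Link each order to its customer by looking up the customer_id.
    cur.execute(
        "INSERT INTO orders (order_no, customer_id, amount) "
        "SELECT ?, customer_id, ? FROM customers WHERE name = ?",
        (order_no, amount, name),
    )

# Verify the split: total sales per customer from the normalized tables.
for row in cur.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.name
"""):
    print(row)

conn.close()
```

Pulling the repeated customer details into their own table removes the redundancy that normalization exercises of this kind typically target.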
“When this technology first came out, everyone was worried that students could now use it to cheat,” said lead study author David Wood, a BYU professor of accounting. “But opportunities to cheat have always existed. So for us, we’re trying to focus on what we can do with this technology now that we couldn’t do before to improve the teaching process for faculty and the learning process for students. Testing it out was eye-opening.”
If ChatGPT could not pass an accounting class, it might be safe to assume it cannot pass the CPA exam either. Accounting Today is currently exploring this very question and will release its own findings shortly.