Two recent studies have raised concerns about the large language models behind OpenAI's ChatGPT. While these models have gained popularity for text generation that is often indistinguishable from human writing, there is evidence that their accuracy has declined over time. What's particularly worrying is that researchers are struggling to pinpoint the cause of the deterioration.
ChatGPT "Drift"
Researchers from Stanford and UC Berkeley published a joint study documenting noticeable changes in ChatGPT's behavior over time, and they are somewhat perplexed about the reasons behind the decline. To assess the consistency of ChatGPT's underlying GPT-3.5 and GPT-4 models, the team investigated their propensity to "drift," meaning that they produced answers of varying quality and accuracy, as well as their ability to follow given instructions. The researchers asked both GPT-3.5 and GPT-4 to solve mathematical problems, answer sensitive or risky questions, perform visual reasoning from prompts, and generate computer code.
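To make the evaluation concrete, a drift probe can be as simple as sending the same question to two dated model snapshots and scoring the answers against ground truth. The following is a minimal sketch of such a probe for the prime-identification task, assuming the official OpenAI Python client (v1 or later), an OPENAI_API_KEY in the environment, and that the dated March and June GPT-4 snapshots ("gpt-4-0314" and "gpt-4-0613") are still being served; the test numbers are illustrative, not the study's actual benchmark.

```python
# Minimal sketch of a drift probe: ask two dated GPT-4 snapshots the same
# prime-identification questions and compare their accuracy.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_prime(n: int) -> bool:
    """Ground truth via trial division; fine for small test numbers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def ask(model: str, n: int) -> str:
    """Ask one snapshot whether n is prime; expect a bare yes/no."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Is {n} a prime number? Answer only 'yes' or 'no'.",
        }],
        temperature=0,  # minimize sampling noise for a fairer comparison
    )
    return resp.choices[0].message.content.strip().lower()

def accuracy(model: str, numbers: list[int]) -> float:
    """Fraction of test numbers the model classifies correctly."""
    correct = sum(
        ask(model, n).startswith("yes") == is_prime(n) for n in numbers
    )
    return correct / len(numbers)

# Illustrative mix of primes and composites (not the study's dataset).
test_numbers = [10007, 10009, 10011, 17077, 17079]
for snapshot in ("gpt-4-0314", "gpt-4-0613"):  # March vs. June snapshots
    print(snapshot, accuracy(snapshot, test_numbers))
```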
Losing Previous Capabilities
Their findings revealed that the behavior of the same LLM service can change significantly in a relatively short time, underscoring the need for continuous monitoring of LLM quality. In March 2023, for example, GPT-4 identified prime numbers with nearly 98% accuracy, but by June its accuracy on the same task had dropped below 3%. GPT-3.5, by contrast, improved at identifying prime numbers between March and June 2023. In code generation, both versions declined over the same period.
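For the code-generation task, one simple pass/fail signal, in the spirit of the study's "directly executable" measure, is whether a model's raw answer runs unmodified. The sketch below is hypothetical (the helper names are mine, not the study's released harness): it can optionally strip markdown fences, since decoration as small as triple backticks around an otherwise correct snippet is enough to fail the strict check.

```python
# Hypothetical "directly executable" check for code-generation drift:
# write the model's answer to a file and see whether it runs unmodified.
import re
import subprocess
import sys
import tempfile

def extract_code(response: str) -> str:
    """Pull the first fenced code block out of a response, if present."""
    match = re.search(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    return match.group(1) if match else response

def is_directly_executable(response: str, strip_fences: bool = False,
                           timeout: float = 10.0) -> bool:
    """Return True if the answer runs to completion as a Python program.

    With strip_fences=False this is the strict sense of "directly
    executable": markdown fences around the code alone make it fail.
    """
    code = extract_code(response) if strip_fences else response
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # hangs and infinite loops count as failures
```

Note that a real harness would sandbox this step; executing model-generated code directly on the host is unsafe.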
These discrepancies could soon have real-world consequences. A recent paper by NYU researchers in the journal JMIR Medical Education found that ChatGPT's responses to healthcare-related queries are nearly indistinguishable from those of human medical professionals in tone and phrasing: participants in the study had limited ability to tell responses from human healthcare providers apart from those generated by OpenAI's large language model. This comes at a time of rising concern about AI's handling of medical data privacy and its tendency to provide inaccurate information.
Declining Performance
Academics are not the only ones noticing ChatGPT's declining performance. Business Insider reported ongoing discussions on OpenAI's developer forum about the model's progress, or lack thereof. Some users expressed disappointment and called for an official response from OpenAI.
Being Closed Source
OpenAI's research and development work on its large language models is notoriously shielded from external review, a practice that has drawn criticism from industry experts and users alike. Matei Zaharia, a co-author of the drift study and an associate professor of computer science at UC Berkeley, suggested on Twitter that the decline in quality might stem from limits of reinforcement learning from human feedback (RLHF) and fine-tuning, or that it might simply be caused by bugs in the system.
Conclusion
While ChatGPT may perform well on basic Turing Test benchmarks, its inconsistent quality raises significant concerns for the public. Those concerns will persist as these models continue to proliferate and integrate into everyday life, with few barriers to their widespread adoption. Most likely this is a symptom of a technology still in its infancy, one that is continuing to learn.
It is also important to remember that user expectations have grown tremendously: the model is now expected to deliver the most helpful and relevant responses even to far-from-perfect prompts. OpenAI will no doubt continue to learn and adapt its products to a changing market.