
News Feature: How good is ChatGPT?

Updated: Aug 15

January 24, 2023, #27

ChatGPT has made headlines in the press, but as yet only a small number of papers offer a systematic analysis of its capabilities. Gary Marcus, Professor Emeritus at New York University and a well-known critic of AI developments, takes up the question in his post “Scientists, please don’t let your chatbots grow up to be co-authors: Five reasons why including ChatGPT in your list of authors is a bad idea”¹. He writes: “The worst thing about ChatGPT’s close-but-no-cigar answer is not that it’s wrong. It’s that it seems so convincing.”¹ He goes on to argue: “ChatGPT has proven itself to be both unreliable and untruthful. It makes boneheaded arithmetical errors, invents fake biographical details, bungles word problems, defines non-existent scientific [phenomena, stumbles] over arithmetic conversion, and on and on.”¹

A more neutral analysis can be found in a recent paper on the arXiv, “How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection”², by a Chinese team: Biyang Guo, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan Ding, Jianwei Yue, and Yupeng Wu. The authors “collected tens of thousands of comparison responses from both human experts and ChatGPT, with questions ranging from open-domain, financial, medical, legal, and psychological areas.”² They are more cautious in their opinions, being particularly interested in building a data set to examine ChatGPT’s effectiveness: “The human evaluations and linguistics analysis provide us insights into the implicit differences between humans and ChatGPT, which motivate our thoughts on LLMs’ [Large Language Models] future directions.”² Nonetheless they conclude: “On the English datasets, the F1-scores for human answers are slightly higher than those for ChatGPT without any exceptions…”² The situation differs by data source: “On the Chinese datasets, the F1-scores of humans and ChatGPT are comparable with no significant difference. This suggests that the difficulty in detecting ChatGPT depends on the data source.”² None of this is conclusive, of course. Anyone with a free account can experiment for themselves. What is ultimately interesting about ChatGPT is not whether it is so good that it can replace human authors, but rather how successful it is at making the question plausible.
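The F1-scores the paper reports measure how well a detector can tell ChatGPT’s answers apart from human ones. As a minimal sketch of the metric itself, here is how an F1-score is computed from a set of true and predicted labels; the tiny label lists below are invented for illustration and are not the paper’s data.

```python
# Illustrative only: computing the F1-score, the metric Guo et al. use to
# compare how detectable ChatGPT text is. F1 is the harmonic mean of
# precision and recall. The labels below are made up for demonstration.

def f1_score(true_labels, predicted_labels, positive="chatgpt"):
    """F1-score for one class: 2 * precision * recall / (precision + recall)."""
    tp = sum(1 for t, p in zip(true_labels, predicted_labels)
             if t == positive and p == positive)   # true positives
    fp = sum(1 for t, p in zip(true_labels, predicted_labels)
             if t != positive and p == positive)   # false positives
    fn = sum(1 for t, p in zip(true_labels, predicted_labels)
             if t == positive and p != positive)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Toy example: four answers; the detector misses one ChatGPT answer.
truth = ["chatgpt", "human", "chatgpt", "human"]
guess = ["chatgpt", "human", "human", "human"]
print(round(f1_score(truth, guess), 3))  # prints 0.667
```

A score near 1.0 means the detector separates the two sources almost perfectly; comparable scores across classes, as on the paper’s Chinese datasets, mean neither source is notably easier to detect.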


1: Marcus, Gary. 2023. ‘Scientists, Please Don’t Let Your Chatbots Grow up to Be Co-Authors’. Substack newsletter. The Road to AI We Can Trust (blog). 14 January 2023.
2: Guo, Biyang, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan Ding, Jianwei Yue, and Yupeng Wu. 2023. ‘How Close Is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection’. arXiv.
