ChatGPT: an incentive to improve the way we assess our students
Jose Antonio Rodríguez Serrano, Senior Lecturer in Machine Learning at Esade, has been using this artificial intelligence to solve the exercises he sets his students. After some surprising results, these are his thoughts on how this tool will affect the future of education.
ChatGPT has become the latest viral phenomenon in artificial intelligence. Since its launch last November, millions of users have tried it, and numerous press articles and experts have highlighted its capabilities.
When given a question or request (known as a "prompt"), this tool writes answers automatically, producing language that is not only grammatically correct but also creates the illusion that it understands the question and can explain the answer.
Technically, ChatGPT is a language model designed to replicate human language. It was fed large volumes of text available on the Internet (books, forums, websites) and then fine-tuned with a great many questions and answers written by humans.
ChatGPT generates the illusion of understanding the question and reasoning the answer
ChatGPT is not a new model. According to its creators, it is an incremental evolution of previous models in the GPT-3 family, which surprised us in the past with their capacity to generate press articles, for example. The main new development here is that this is the first time an open demo has been offered for evaluation by the whole community.
Users have had the opportunity to test it in a variety of tasks (searches, programming or math questions, logical puzzles), with wide-ranging results – including tests in which it passes Amazon Web Services certification exams.
It is no surprise that these landmark developments have sparked debate, which is ongoing. Will ChatGPT replace search technologies and will it spell the end of Google? Are content creation professions at risk?
“Eating your own cookies”
As a lecturer in Machine Learning, I was very curious to test ChatGPT, starting in our own backyard: would it be capable of completing the exercises I use in class?
I cannot deny that my first reaction on seeing how ChatGPT responded was one of amazement. Although the question posed is a simple one for a student, this tool exceeded the expectations I had for a text generation system:
Other tests yielded results that were incorrect or, as we will see later, superficial. But the debate is justified: if a technology like ChatGPT is capable of passing a machine learning quiz (at least one with short, conceptual or multiple choice answers), is this technology going to be disruptive in the education sector? Will it change interaction with students or the work of faculty?
ChatGPT is here to stay in the education sector
In the same way that a professional or a student uses autocorrect tools or tools like Grammarly to perfect a text, or Google Translate to suggest a translation, it is conceivable that models like ChatGPT and its successors will be embedded in products to help users to write better and more quickly, or for consultation purposes.
A first question that has arisen in the education community is: what if students use it to automatically generate answers to exercises or assignments?
Last year, Mike Sharples, Emeritus Professor of Educational Technology, conducted an experiment in which he asked a language model to write an essay. The result was reasonable, perhaps close to what a student might write, but very superficial and with incorrect references.
Something similar occurs when the answers ChatGPT gives to the machine learning exercises mentioned earlier are analyzed in depth. As this example shows, with a more open question the tool once again responds reasonably well at a generic level, but without references or specific figures to support the answer (in other tests in which references were requested, the system cited non-existent sources or produced numbers and success stories that could not be checked and were probably incorrect).
This has two main implications:
- The first implication is that competent students will surely realize they cannot produce a complete essay simply by using these tools; if they do use them, they will have to spend time checking, verifying and looking for references, which, on a cognitive and pedagogical level, still meets the original learning objectives.
In other words, just as Google Translate cannot translate an entire document without some human supervision, these tools will surely end up being used with considerable manual intervention, for consultation purposes, or to produce a first idea that can subsequently be worked on.
- The second implication is that in the education sector we must continue seeking to ensure that assessment reflects the learning process, independently of the tools at the students' disposal.
This may mean placing greater emphasis on elements such as "constructive feedback" or "learning by doing". As Sharples argues: "If AI systems have a lasting influence on education, maybe that will come from educators and policy makers having to rethink how to assess students."
We should not forget that the tool can also be used to our advantage: ChatGPT could be asked to give an answer to a question, and then the class could debate whether or not the answer is correct, or try to verify it.
Beyond essay generation
It is interesting to imagine “positive” applications that this type of system may have in the field of education beyond the debate about essays.
For example, it may help students to learn programming (“give me an example of code to make X”). Although it remains imperfect, the field of code generation is also evolving very rapidly, and despite having caused some controversy, ChatGPT is moving in a promising direction.
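For illustration, a prompt such as "give me an example of code to compute the accuracy of a classifier" might yield something along these lines. This is a hypothetical example written for this article, not actual ChatGPT output, and it shows why the student still needs to read, test and understand what comes back:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Three of four predictions match the true labels.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

Even when generated code like this runs, verifying that it handles edge cases (empty inputs, mismatched lengths) is exactly the kind of checking that remains the student's job.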
Moreover, many possibilities are emerging for generating short explanations or answering queries. For example, if in the future one of these models could be fine-tuned with the content of a subject, we might imagine assistants that answer questions under faculty supervision, such as "what book would you recommend for exploring this concept from the previous class in greater depth?", or questions that require no supervision, such as "what outstanding tasks do I have, and when are they due?"
When the answers are wrong
Many cases have been documented in which ChatGPT's answers are incorrect, ranging from simple puzzles to highly specialized questions or enquiries about recent events.
It is important to understand that these models are optimized to generate text: a highly complex model is fed with millions of documents in order to calculate what words have a strong likelihood of subsequently appearing in a text. These models may be seen as a kind of highly sophisticated “autocomplete” tool.
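This "autocomplete" intuition can be made concrete with a toy next-word model: count which word tends to follow which in a corpus, then predict the most frequent continuation. This is a drastic simplification for illustration only; ChatGPT uses a neural network over tokens, not raw word counts:

```python
from collections import Counter, defaultdict

# Toy corpus; real models are trained on billions of words.
corpus = (
    "the model predicts the next word . "
    "the model generates the next word . "
    "the model predicts text ."
).split()

# Count how often each word follows each other word (a bigram model).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def autocomplete(word):
    """Return the most frequently observed word after `word`."""
    return following[word].most_common(1)[0][0]

print(autocomplete("model"))  # "predicts" (seen twice, vs "generates" once)
```

Nothing in this procedure checks whether the continuation is *true*; it only reflects what was frequent in the training text, which is why fluency and factual accuracy come apart.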
It's a mistake to be relying on it for anything important right now
As a result, it may respond well to basic questions about math, but it has not been specifically optimized to perform mathematical calculations. OpenAI does not hide these limitations: it describes this tool as a research preview, and the CEO has stated that “ChatGPT is incredibly limited, but good enough at some things to create a misleading impression of greatness. It's a mistake to be relying on it for anything important right now”.
Indeed, before ChatGPT appeared, Gary Marcus had tried to explain why products like Alexa or Siri are still not capable of conversing despite language models being so advanced. A distinction has to be made between the model and the product: the application of the model in a closed environment conceived for a specific purpose.
However, we also tend to underestimate technological progress in the long term. ChatGPT is not a product, it offers a taste of what can be done with this type of model. For example, although at present it is only capable of answering basic questions, perhaps in a not so distant future a similar system could be adapted for a specific field, such as mathematics, or for machine learning classes.
ChatGPT is not a product, it is a foretaste of what can be done with this type of model
We should also bear in mind that, very probably, ChatGPT and advanced language models will not end up being used as we see them in this demo; rather, future products will be built on these technologies. Moreover, those products will interact in ways that minimize the impact of errors, just as we tolerate errors in Google searches, translations, weather forecasts or marketing campaigns.
What other problems are anticipated?
Another issue, common to any machine learning system, is that the answers propagate the biases present in the data sources. Researchers and other public figures have shared examples on social media in which the answers show gender or racial bias. This is a recurrent problem in the field of machine learning, and the solution involves curating the training data to make it as neutral as possible.
Another very delicate limitation is that the ability to write does not imply the ability to understand facts or factual information. It is feared that these tools may generate false answers which, intentionally or not, spread misinformation.
It will make it very easy for a student to generate content, but checking its veracity will remain costly
The authors of the book Prediction Machines argued that the way in which machine learning changes the economy is by cutting the cost of the capacity to make predictions. In the case of systems like ChatGPT, it is going to be very cheap for a student or a user to generate content. But the challenge lies in the fact that checking the veracity of this content or correcting it remains just as costly as before.
And the human touch?
In spite of all these promising advances, there is consensus in the scientific community that these artificial intelligence systems are far from being sentient or capable of complex reasoning (a much-cited article by Emily Bender and other authors has dubbed them "stochastic parrots", arguing that they work by repeating what they have read, without understanding it).
Therefore, we are a long way from the point at which these language models will be able to imitate human qualities, such as generating constructive feedback or feeling empathy towards the students.
Another challenge facing these systems is personalization: the capacity to offer each person the specific training they need, explained in the way that best suits their needs and prior knowledge. This is in considerable demand in the sector, but it remains very difficult for autonomous systems to achieve.
Surely, as in the case of other tools that have astonished us in the past, such as Google or Wikipedia, we will get the most out of these tools when their capabilities are combined with human intervention.
Senior lecturer, Department of Operations, Innovation and Data Sciences at Esade