Demystifying GPT Self-Repair for Code Generation. (arXiv:2306.09896v3 [cs.CL] UPDATED)

By: Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, Armando Solar-Lezama. Posted: June 23, 2023

Large Language Models (LLMs) have shown remarkable aptitude in code
generation but still struggle on challenging programming tasks. Self-repair —
in which the model debugs and fixes mistakes in its own code — has recently
become a popular way to boost performance in these settings. However, the
literature contains only limited study of how and when self-repair works
effectively, and one might wonder to what extent a model can really provide
accurate feedback on why code is wrong when that code was generated by the
same model. In this paper, we analyze GPT-3.5 and GPT-4’s
ability to perform self-repair on APPS, a challenging dataset consisting of
diverse coding challenges. To do so, we first establish a new evaluation
strategy dubbed pass@t that measures the task pass rate against the total
number of tokens sampled from the model, enabling a fair comparison to
purely sampling-based approaches. With this evaluation strategy, we find
that self-repair is effective only with GPT-4. We also observe that
self-repair is bottlenecked by the feedback stage; using GPT-4 to give feedback
on the programs generated by GPT-3.5 and using expert human programmers to give
feedback on the programs generated by GPT-4, we unlock significant performance
gains.
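
For intuition, the pass@t idea can be sketched as follows. The snippet below is a hedged illustration, not the paper's exact definition: the per-task budget convention, the `attempts` data structure, and the example task names are assumptions made here, but it captures the core trade-off of pass rate against tokens sampled, whether those tokens go to fresh samples or to feedback-and-repair turns.

```python
from typing import Dict, List, Tuple

def pass_at_t(attempts: Dict[str, List[Tuple[int, bool]]], token_budget: int) -> float:
    """Fraction of tasks solved within a per-task token budget (illustrative sketch).

    attempts[task_id] lists (tokens_used_by_attempt, passed_unit_tests) in the
    order the samples were drawn from the model; self-repair simply contributes
    extra (feedback + repair) attempts to the same list.
    """
    if not attempts:
        return 0.0
    solved = 0
    for task_attempts in attempts.values():
        spent = 0
        for tokens, passed in task_attempts:
            spent += tokens
            if spent > token_budget:
                break            # budget exhausted for this task
            if passed:
                solved += 1      # task counts as solved within the budget
                break
    return solved / len(attempts)


# Hypothetical usage: two tasks, one solved by its second (repaired) sample.
example = {
    "task_a": [(120, False), (150, True)],
    "task_b": [(200, False), (180, False)],
}
print(pass_at_t(example, token_budget=300))  # -> 0.5
```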
