Artificial intelligence still not ready for long-term code maintenance
Alibaba tested AI agents on 100 real codebases over 233 days, and the results were disappointing. While the agents easily write code and pass test cases, they fail to keep that code error-free over roughly eight months of maintenance, causing significant breakage.
In the experiment, 75% of the AI models broke previously working code during the maintenance phase. With each new iteration, the agents accumulated technical debt and produced “fragile” code, sacrificing quality for speed. This indicates that AI is not yet capable of replacing humans in long-term software maintenance.
Further development of AI models must focus on their ability to keep code stable and high-quality over time, especially in complex real-world projects.