LLM-Guided Evolution: An Autonomous Model Optimization for Object Detection

YiMing Yu YiMing.Yu@gtri.gatech.edu 0000-0002-3467-122X Georgia Tech Research InstituteAtlantaGeorgiaUSA and Jason Zutty Jason.Zutty@gtri.gatech.edu 0000-0001-7977-1454 Georgia Tech Research InstituteAtlantaGeorgiaUSA

(2025)

Abstract.

In machine learning, Neural Architecture Search (NAS) requires domain knowledge of model design and a large amount of trial-and-error to achieve promising performance. Meanwhile, evolutionary algorithms have traditionally relied on fixed rules and pre-defined building blocks. The Large Language Model (LLM)-Guided Evolution (GE) framework transformed this approach by incorporating LLMs to directly modify model source code for image classification algorithms on CIFAR data and intelligently guide mutations and crossovers. A key element of LLM-GE is the ”Evolution of Thought” (EoT) technique, which establishes feedback loops, allowing LLMs to refine their decisions iteratively based on how previous operations performed. In this study, we perform NAS for object detection by improving LLM-GE to modify the architecture of You Only Look Once (YOLO) models to enhance performance on the KITTI dataset. Our approach intelligently adjusts the design and settings of YOLO to find the optimal algorithms against objective such as detection accuracy and speed. We show that LLM-GE produced variants with significant performance improvements, such as an increase in Mean Average Precision from 92.5% to 94.5%. This result highlights the flexibility and effectiveness of LLM-GE on real-world challenges, offering a novel paradigm for automated machine learning that combines LLM-driven reasoning with evolutionary strategies.

Computer aided/automated design, Automated Machine Learning, Large Language Models, Neuroevolution

^†^†copyright: acmlicensed^†^†journalyear: 2025^†^†doi: XXXXXXX.XXXXXXX^†^†conference: The Genetic and Evolutionary Computation Conference 2025; July 14–18, 2025; Málaga, Spain^†^†ccs: Computing methodologies Genetic programming^†^†ccs: Computing methodologies Object detection^†^†ccs: Computing methodologies Neural networks

1. Introduction

Neural Architecture Search (NAS) and Object Detection are two interrelated challenges in machine learning, each with significant hurdles. Object detection requires both precise localization and accurate classification, making model design inherently complex. Traditional handcrafted architectures often struggle with generalization across datasets, hardware constraints, and real-time processing needs. NAS addresses this by automating the search for optimal architectures, efficiently exploring vast design spaces to tailor models for specific detection tasks. However, NAS itself is computationally expensive, requiring advanced search strategies like reinforcement learning, evolutionary algorithms, or gradient-based methods to balance accuracy and efficiency. The integration of LLMs into NAS for object detection is crucial for developing scalable, high-performance models capable of handling challenges such as scale variation, occlusion, and deployment on resource-constrained devices. Solving these challenges will enable more robust, efficient, and widely applicable object detection systems, from autonomous vehicles to edge computing.

In 2024 Morris et al. (Morris et al., 2024) introduced a novel technique they coined Large Language Model (LLM)- Guided Evolution (GE). They demonstrated its utility by pairing the LLM Mixtral (Jiang et al., 2024) with the ExquisiteNetV2 (Zhou and Su, 2021) classifier for the CIFAR-10 dataset and evolved image classifiers with higher accuracy and fewer parameters.

This paper builds upon the research thus far utilizing LLM-GE on the problem of object detection for the KITTI dataset (Geiger et al., 2013). This is the first known application of the integration of an LLM with an evolutionary algorithm for the problem of object detection. This paper introduces a novel approach of using an LLM to intelligently modify neural network architectures through their module, layer, and hyperparameter descriptions in YAML configuration files, which is a major step forward in automated machine learning and NAS. This paper demonstrates the powerful capabilities of LLM-GE with its Character Role Play (CRP) and Evolution of Thought (EoT) to outperform human state-of-the-art algorithms in the You Only Look Once (YOLO) (Redmon, 2016) family on key objectives of mAP@50, mAP@50-95, precision, and recall.

2. Background

2.1. Automated Machine Learning

The development of machine learning traditionally requires expertise in multiple areas, from data preprocessing to feature engineering, model selection, hyperparameter tuning, and deployment. As the demand for AI-driven solutions grows across industries and the complexity of machine learning tasks increases, Automated Machine Learning (AutoML) has emerged as a solution to address these challenges. AutoML automates the end-to-end process of developing machine learning models, enabling non-experts to build high-performance models efficiently while reducing the need for manual intervention.

AutoML consists of several key components that automate different stages of the machine learning pipeline. Data preprocessing and feature engineering involve data cleaning, transformation, and feature extraction, ensuring that datasets are prepared efficiently without manual intervention (Hutter et al., 2019). Model selection enables AutoML to explore various algorithms to identify the model with the best performance. Hyperparameter tuning alters model parameters and training configurations using techniques such as grid search and genetic algorithms to enhance accuracy and efficiency (Bergstra et al., 2011). For deep learning architecture design, NAS automates the discovery of optimal network architectures. Finally, model evaluation and deployment ensure that selected models are validated using cross-validation, allowing seamless integration into real-world applications (Olson and Moore, 2016). These components work together to streamline machine learning development, making AI more accessible and scalable across various domains (Ledell and Poirier, 2020).

Frequent in the development of AutoML solutions are Evolutionary Algorithm (EA)-based approaches (Olson and Moore, 2016; Zutty et al., 2015), which utilize a genetic encoding of machine learning pipeline choices and their hyperparameters.

AutoML has demonstrated its ability to surpass human-designed architectures in diverse domains, from image recognition (NASNet (Zoph and Le, 2017), EfficientNet (Tan and Le, 2019)) to structured data analysis (AlphaD3M (Drori et al., 2021)).

The overall goal of research in autoML is to discover higher-performing algorithms with fewer evaluations of candidate solutions or less total computational time. This paper considers an emerging new subfield of autoML which involves the usage of LLMs, described in Section 2.2.

2.2. Evolutionary Computation integrated with LLMs

The recent advancements in LLM capabilities have jump-started a new intersection between evolutionary computation and the employment of LLMs for automated machine learning (Wu et al., 2024). LLMs excel in code production due to the structure and rules followed in each language.

While traditional AutoML and evolutionary approaches rely on a representation or encoding of a genome for each algorithm, an evolutionary algorithm paired with an LLM is able to operate directly on code (Hemberg et al., 2024). Such an EA still retains its bio-inspired metaphor, where evaluation, selection, crossover, and mutation are in play. In this new form, evaluation and selection are handled using standard EA approaches, but the operators of crossover and mutation utilize LLMs, where each operation involves prompting an LLM with the code that is the individual.

Recently, this type of approach has been demonstrated several times on common benchmark problems such as image classification with the CIFAR-10 dataset (Morris et al., 2024; Nasir et al., 2024) or MNIST (Chen et al., [n. d.]) and the traveling salesman problem (Liu et al., 2023). An open question this paper seeks to answer is does this approach work well with less frequently published datasets and problems, where the LLM may not have been trained with as much information?

Across open literature, a variety of LLMs have been used for integration with EAs including: CodeGen-6B, Mixtral 8x7B, GPT-3.5 Turbo, Llama-2-70B-Instruct, and PaLM-62B. However, few papers compare the performance of LLMs within optimizations or blend their results.

2.3. Neural Architecture Search for Object Detection

Handcrafted neutral architecture requires domain knowledge of model design and large amounts of trail-and-error experiments to achieve promising performance for classification, detection, and segmentation tasking. As NAS demonstrated its efficacy in designing SoTA image classification models (Zoph and Le, 2017; Tan and Le, 2019), the NAS emerged to design detection backbones for object detection initially (Chen et al., 2019).

A state-of-the-art object detection system typically consists of a backbone, a feature fusion neck, a region proposal network (RPN), and an RCNN head. In one-stage detection models, such as YOLO, the detection pipeline consists of only a backbone and a head. NAS has emerged as a powerful tool to automate the design of detection architectures under given constraints, often guided by principles such as maximum entropy. The MAE-DET model was proposed to automatically design detection backbones using the Maximum Entropy Principle, eliminating the need for network training while still achieving state-of-the-art performance (Sun et al., 2022). Similarly, DAMO-YOLO was developed to optimize the backbone, neck, and head, resulting in a new model that surpasses previous YOLO series models on the COCO dataset (Xu et al., 2022).

In addition, EAutoDet was introduced to design both the backbone and Feature Pyramid Network (FPN) by constructing a supernet that jointly optimizes these modules using a differentiable NAS method (Wang et al., 2022). Another approach, Structural-to-Modular NAS (Yao et al., 2020), applies a multi-objective search algorithm to explore module combinations for backbone optimization, aiming to find efficient and effective architectures for object detection (Yao et al., 2020).

While none of the existing NAS methods for object detection have utilized Large Language Models, in this paper, we introduce a novel LLM-driven NAS approach. Specifically, we use a LLM-GE to intelligently modify neural network architectures through their module, layer, and hyperparameter descriptions in YAML configuration files, effectively enhancing YOLO’s performance on the KITTI dataset.

3. LLM-Guided Evolution Framework

The LLM-Guided Evolution framework (Morris et al., 2024) introduces Guided Evolution, which combines LLMs with evolutionary algorithms to automate and enhance NAS. GE leverages the Evolution of Thought (EoT) feedback loop and Character Role Play (CRP) to intelligently guide both mutations and crossovers in code, making NAS more efficient and creative. In their study, the authors utilized Mixtral AI’s 8x7B Mixture of Experts Model (Jiang et al., 2024), known as Mixtral. The mixtral employs a Mixture of Experts architecture with eight experts, each containing 7 billion parameters. During inference, only a subset of experts is active, which leads to reduced computational overhead and results in efficient processing for tasks requiring rapid inference.

Instead of modifying Python code, our study uses a YAML configuration file for YOLO as the seed model. To effectively modify a YAML file for YOLO, an LLM must possess the following core abilities:

(1)

Code Understanding: Familiarity with YAML syntax and YOLO-specific configurations.
(2)

Domain Knowledge: Understanding of YOLO architecture and relevant machine learning concepts.
(3)

Contextual Adaptability: The ability to adapt configurations to specific use cases.

Additionally, the LLMs must have the following reasoning capabilities:

(1)

Logical Reasoning: To create a valid YAML file that adheres to YOLO’s structure and to understand how changes affect the model’s performance.
(2)

Problem-Solving: To suggest changes that optimize performance. Identifying and resolving invalid configurations
(3)

Creativity: To propose innovative modifications, such as adding or removing layers, adjusting scaling factors, or implementing new layers to balance performance and efficiency.
(4)

Generalization: To apply changes based on abstract user requirements.

This combination of core abilities and reasoning capabilities make it a challenge for the LLMs to effectively modify YOLO YAML configurations and achieving optimal performance.

4. Application to YOLO Optimization on KITTI data

The KITTI benchmark suit is a widely used real-world computer vision benchmark, providing data for stereo, optical flow, visual odometry, 2D / 3D object detection, and 2D / 3D tracking. For autonomous driving tasks, it provides a comprehensive set of images and corresponding annotations for object detection, scene understanding, and visual odometry. The dataset includes images captured in real-world urban, rural, and highway environments, making it a challenging and representative benchmark for evaluating object detection models. The KIITI object detection dataset contains 7,481 training images and 7,518 test images, with a total of 51,865 annotations for 9 object classes, which are: Car, Pedestrian, Van, Cyclist, Truck, Misc, Tram, Person Sitting, and ”Don’t care”. These objects are labeled with bounding boxes in various sizes. (Geiger et al., 2013)

In this work, we focus on optimizing the YOLO object detection model to improve its performance on the KITTI object detection dataset. The YOLO is a popular real-time detection algorithm due to its single-stage approach and its iterative progression. Since YOLOv1’s inception in 2016 (Redmon, 2016) it has gone through 11 subsequent iterations to reach YOLOv12 (Tian et al., 2025). It is well known for its real-time inference capabilities and the balance between its performance and speed. The model predicts bounding boxes and class probabilities simultaneously with low latency, making it well-suited for tasks in autonomous driving.

The performance of a YOLO model can be affected by several factors. In this paper, we will consider the following factors:

Network architecture:

The network architecture has a significant impact to performance. Various versions of YOLO (e.g., YOLOv3, YOLOv4, YOLOv5 …) impact performance due to significant architecture innovations in their backbone networks and head networks. These innovations include the introduction of a Feature Pyramid Networks (FPN) in YOLOv3, the introduction of a CSPDarknet backbone in YOLOv5, and the introduction of spatial channel decoupled downsampling and large kernel convolutions in YOLOv10.(Hussain, 2024b; Redmon and Farhadi, 2018; Hussain, 2024a)

Training Strategies

In YOLO’s training framework, data augmentation stands out as a dynamic and practical mechanism. The data augmentation technique includes scale, translation, rotation, shear, random scaling, random erasing, random cropping, and Mosaic transformations. The Mosaic was introduced in YOLOv4 training to enhance robustness. In addition, hyperparameter tuning for learning rate, weight decay, momentum, and optimizer selection can significantly impact training and generalization.

Designing a network architecture and optimizing its parameters for training manually is a time-intensive process, requiring both domain knowledge and trial-and-error experimentation. By leveraging the LLM-Guided Evolution, we aim to automate this optimization process. During the evolution, the LLM intelligently refines YOLO’s configuration and hyperparameters through guided mutations and crossovers. This process enables the model to adapt to the specific characteristics of the KITTI dataset for the desired objectives.

This study integrates The LLM-Guided Evolution was with the Ultralytics YOLO (Jocher et al., 2023) repository. Ultralytics YOLO is a state-of-the-art open-source framework, representing the latest advancement in the renowned YOLO series for real-time object detection and image segmentation. It includes all previous versions of YOLO and enhances performance, flexibility, and efficiency by integrating advanced features and continuous improvements from the latest YOLO architecture developments and publications. In the Ultralytics YOLO repository, the models are defined using model configuration files represented with YAML, where each YAML file defines a unique model architecture. In general, the YAML file contains essential details to construct a model, such as the number of layers, types of layers, the upper connection of the layers, activation function, input argument of the given layers, and other settings. It includes the following components:

•

Backbone: Defines the feature extraction network, such as Darknet53 for YOLOv3 or other backbone architectures.
•

Head: Defines additional layers for handling specific task, such as objective detection
•

Other parameters: Defines model parameters, such as input size or multiples for depth and width. It can also contain hyperparameters for training such as learning rate and weight decay.

⬇

# –Block–

# Parameters

nc: 80 # number of classes

depth_multiple: 1.0 # model depth multiple

width_multiple: 1.0 # layer channel multiple

\par# –Block–

# backbone

backbone:

# [from, number, module, args]

- [-1, 1, Conv, [32, 3, 1]] # 0

- [-1, 1, Conv, [64, 3, 2]] # 1-P1/2

- [-1, 1, Bottleneck, [64]]

- [-1, 1, Conv, [128, 3, 2]] # 3-P2/4

- [-1, 2, Bottleneck, [128]]

- [-1, 1, Conv, [256, 3, 2]] # 5-P3/8

- [-1, 8, Bottleneck, [256]]

- [-1, 1, Conv, [512, 3, 2]] # 7-P4/16

- [-1, 8, Bottleneck, [512]]

- [-1, 1, Conv, [1024, 3, 2]] # 9-P5/32

- [-1, 4, Bottleneck, [1024]] # 10

\par# –Block–

# head

head:

- [-1, 1, Bottleneck, [1024, False]]

- [-1, 1, Conv, [512, 1, 1]]

- [-1, 1, Conv, [1024, 3, 1]]

- [-1, 1, Conv, [512, 1, 1]]

- [-1, 1, Conv, [1024, 3, 1]] # 15 (P5/32-large)

\par- [-2, 1, Conv, [256, 1, 1]]

- [-1, 1, nn.Upsample, [None, 2, ”nearest”]]

- [[-1, 8], 1, Concat, [1]] # cat backbone P4

- [-1, 1, Bottleneck, [512, False]]

- [-1, 1, Conv, [256, 1, 1]]

- [-1, 1, Conv, [512, 3, 1]] # 22 (P4/16-medium)

\par- [-2, 1, Conv, [128, 1, 1]]

- [-1, 1, nn.Upsample, [None, 2, ”nearest”]]

- [[-1, 6], 1, Concat, [1]] # cat backbone P3

- [-1, 1, Bottleneck, [256, False]]

- [-1, 2, Bottleneck, [256, False]] # 27 (P3/8-small)

- [[27, 22, 15], 1, Detect, [nc]] # Detect(P3, P4, P5)

Figure 1. YAML configuration of seed model consists of parameters block, backbone block, and head block.

In our study, the YAML file is divided into blocks according to the backbone, head, and parameters. These are three natural segmentation points for the configuration of a YOLO architecture. Listing 1 shows the YAML file for YOLOv3, which comprises a parameter block, a backbone block, and a head block. These blocks function analogously to genetic segments in a genome, which can be modified by the LLM-Guided Evolution via mutations and crossovers.

Each layer in Listing 1 follows the format [from, number, module, args], where:

•

from indicates the index of the previous layer used as input,
•

number specifies the number of times the module is repeated,
•

module defines the type of layer (e.g., Conv),
•

args contains parameters specific to that module, such as kernel size, stride, and number of channels.

An ordered set of these individual layers forms either a Head or a Backbone. All of these elements form a search space that the LLM-GE explores to construct new model architectures.

We selected YOLOv3 as our seed model for Llama-3.3-GE1 due to its high efficiency in real-time applications, making it an ideal candidate for evaluating whether LLMs possess knowledge of advanced layers introduced in later YOLO versions.

In our study, we use mAP@50, mAP@50-95, precision, recall, total number of parameters, and inference speed as our objectives for GE. The Mean Average Precision (mAP), is a primary metric for object detection, and is computed as the mean of average precision (AP) across all classes. For mAP@50, the AP are computed at an Intersection over Union (IoU) threshold of 0.5. For mAP@50-90, the mean AP averaged over multiple IoU thresholds ranging from 0.5 to 0.95 with increments of 0.5. Precision measures how many of the detected objects are actually correct, while recall measures how many of the actual objects are correctly detected. The total number of parameters determines the storage and memory requirements of the model, impacting its efficiency and deployment feasibility. Inference speed is measured as the average time to process a single image, which is crucial for real-time applications.

Our study demonstrates LLM-Guided Evolution can explore the search space of configurations of YOLO, generating YOLO variants with significant improvements in detection performance.

5. Results

In this investigation, our goal is to use LLM-Guided Evolution to generate YOLO variants that optimize performance for object detection on KITTI data. We focus on objectives relating to algorithm task performance, as well as computational efficiency of the model.

5.1. Evaluation of LLM-GEs

To enable meaningful evolution, LLMs must possess the core abilities and reasoning capabilities outlined in Section 3 to propose valid YAML configurations that can be successfully evaluated within the training framework, ultimately leading to performance improvements. For this study, we ran the evolution in two manners of operation. In the first, which we call Llama-3.3-GE1, we seeded our LLM-GE with the YAML configuration file split into the three blocks shown in Listing 1. In this manner of operation we seeded the evolution with only the configuration for YOLOv3. In the second manner of operation, which we call Llama-3.3-GE2, we made the following changes:

(1)

We modified the representation of the configuration file to be a single block comprising the head, backbone, and hyperparameters.
(2)

We modified the prompts to the LLM to instruct it to modify a specific part of the YAML file, such as the the head, backbone, or parameters.
(3)

We included additional seeds into the starting population of the LLM-GE including YOLOv3-tiny, YOLOv9s, YOLOv10-tiny, and YOLOv11, in addition to the YOLOv3 seed used in the first manner of operation

Table 1 summarizes the framework’s performance over 50 generations for Llama-3.3-GE1 and 59 generations for Llama-3.3-GE2. Note that GE1 produced a 40.7% failure rate, where the YAML files during the evaluation process, while GE2 produced a 51.9% failure rate. A possible contribution to the increase could be the added variety of seeds, or the more open prompting style to the LLM.

Figure 2 illustrates the number of individuals on the Pareto frontier over time for each GE. The figure shows that the number of individuals on the Pareto frontiers for both of the GEs grows steadily throughout the generations. Note that the two experiments ran for different numbers of generations: LLM-GE1 stops after 50 generations, while LLM-GE2 stops after 59 generations. The GEs exhibit upward trends, suggesting continued search of the tradeoff space between objectives. Both trends can be characterized by a linear behavior and the trend of Llama-3.3-GE1 has a steeper slope than that of the Llama-3.3-GE2, indicating a faster improvement for the number of individuals in the Pareto Fronts.

Table 1. Performance of LLM-GEs.

	Llama-3.3-GE1	Llama-3.3-GE2
Total runtime	38 days	35 days
Total generation	50	59
No. of variants	1891	1930
No. of invalid variants	769	1001
No. of Pareto frontier	135	128

Refer to caption — Figure 2. Number of Individuals in the Pareto frontier per generation.

In addition, to compare the overall performance of LLM-Guided Evolution with difference in seedings and content included in prompts, we evaluate the quality of generated solutions using hypervolume under the Pareto frontiers (for minimization of objectives). A smaller hypervolume indicates broader exploration of the Pareto-optimal space. The optimization objectives include the number of parameters, inference speed, precision, and recall, where GE minimizes the number of parameters and inference speed while maximizing precision and recall.

To compute the hypervolume shown in Figure 3, the number of parameters and inference speed are normalized between 0 and 1. Additionally, 1 - precision and 1 - recall are used to align all objectives with a minimization trend. As a result, the scores are contained within a hypercube of unit hypervolume.

Figure 3 shows the hypervolumes for Llama-3.3-GE1 and Llama-3.3-GE2. The blue connected squares denote the hypervolume for Llama-3.3-GE1 at each generation and the orange connected triangles denote the hypervolume for Llama-3.1-GE2 at each generation. The figure shows that Both Llama-3.3-GE1 and Llama-3.3-GE2 improve over time, and that Llama-3.3-GE1 achieves better hypervolume than Llama-3.3-GE2. It is interesting to note that the Llama-3.3-GE1 outperforms the other Llama-GE2 after 25 generations and there was a significant improvement from generation 26 to generation 27. Meanwhile, the hypervolume of Llama-3.3-GE2 reaches a plateau after 33 generations.

The figure also shows that at 0th generation, the hypervolumes of the two GEs are not the same. This could be caused by the randomness of creating the initial populations or by the difference in the set of seeds. The Llama-3.3-GE1 was only seeded with YOLOv3 while Llama-3.3-GE2 was seeded with additional seeds of YOLOv3-tiny, YOLOv9s, YOLOv10-tiny, and YOLOv11. We note a further investigation on the stochastic nature of this process with significantly more trials in our proposed future work.

In addition, Llama-3.3-GE1 discovered 135 Pareto-optimal YOLO variants in its final generation, while Llama-3.3-GE2 generated 128 Pareto-optimal YOLO variants in its final generation. By inspecting the Pareto front individuals, we observed common changes such as modifications to the number of layers, the width and depth multipliers, and the arguments of the layers (see Listing 1), especially in Llama-3.3-GE1. These observations suggest that Llama-3.3 relies solely on the seed model, YOLOv3, and lacks awareness of other layer types. In contrast, Llama-3.3-GE2, seeded with multiple state-of-the-art YOLO models, incorporated a wider variety of layers (See Listings 4-6).

Note that even though Figure 3 shows that the hypervolume for Llama-3.3-GE2 appears to stagnate around generation 34, Figure 2 shows that during this time, the evolution is still discovering a significant number of new individuals along the Pareto frontier, suggesting an increase in fidelity along the objectives in the trade-off space. It is a possibility that the Llama-3.3-GE2 is in a state of punctuated equilibrium.

In some cases, the LLM attempted to create new layers as Python code for the configuration. However, our current framework does not yet utilize this information. Future work will focus on integrating these generated layers into the model to further enhance YOLO’s architecture.

⬇

# –PROMPT LOG–

# Ultralytics YOLO, AGPL-3.0 license

# YOLO object detection model. For details see https://docs.ultralytics.com/models/yolov3

\par# –OPTION–

# Parameters

nc: 80 # number of classes

depth_multiple: 1.5 # increased model depth multiple

width_multiple: 1.5 # increased layer channel multiple

\par# backbone

backbone:

# [from, number, module, args]

- [-1, 1, Conv, [40, 3, 1]] # 0, increased channels

- [-1, 1, Conv, [80, 3, 2]] # 1-P1/2, increased channels

- [-1, 1, Bottleneck, [80]] # increased channels

- [-1, 1, Conv, [160, 3, 2]] # 3-P2/4, increased channels

- [-1, 2, Bottleneck, [160]] # increased channels

- [-1, 1, Conv, [320, 3, 2]] # 5-P3/8, increased channels

- [-1, 8, Bottleneck, [320]] # increased channels

- [-1, 1, Conv, [640, 3, 2]] # 7-P4/16, increased channels

- [-1, 8, Bottleneck, [640]] # increased channels

- [-1, 1, Conv, [1280, 3, 2]] # 9-P5/32, increased channels

- [-1, 4, Bottleneck, [1280]] # 10, increased channels

- [-1, 1, Bottleneck, [1280]] # Additional bottleneck layer

\par# head

head:

- [-1, 1, Bottleneck, [1280, False]]

- [-1, 1, SPP, [640, [5, 9, 13]]]

- [-1, 1, Conv, [1280, 3, 1]]

- [-1, 1, Conv, [640, 1, 1]]

- [-1, 1, Conv, [1280, 3, 1]] # 15 (P5/32-large)

- [-1, 1, Conv, [1280, 1, 1]] # Additional convolutional layer

\par- [-2, 1, Conv, [320, 1, 1]]

- [-1, 1, nn.Upsample, [None, 2, ”nearest”]]

- [[-1, 8], 1, Concat, [1]] # cat backbone P4

- [-1, 1, Bottleneck, [640, False]]

- [-1, 1, Conv, [320, 1, 1]]

- [-1, 1, Conv, [640, 3, 1]] # 22 (P4/16-medium)

\par- [-2, 1, Conv, [160, 1, 1]]

- [-1, 1, nn.Upsample, [None, 2, ”nearest”]]

- [[-1, 6], 1, Concat, [1]] # cat backbone P3

- [-1, 1, Bottleneck, [320, False]]

- [-1, 2, Bottleneck, [320, False]] # 27 (P3/8-small)

\par- [[27, 22, 15], 1, Detect, [nc]] # Detect(P3, P4, P5)

\par

Figure 4. Example 1 generated by Llama 3.3-GE2 with mAP@50 of 94.5%.

⬇

\par# Parameters

nc: 80 # number of classes

scales: # model compound scaling constants, i.e.’model=yolov10n.yaml’ will call yolov10.yaml with scale ’n’

# [depth, width, max_channels]

s: [0.6, 0.85, 1280] # Increased depth, width, and max channels for enhanced feature extraction

\parbackbone:

# [from, repeats, module, args]

- [-1, 1, Conv, [160, 3, 2]] # 0-P1/2, increased channels for better initial feature extraction

- [-1, 1, Conv, [320, 3, 2]] # 1-P2/4, increased channels for deeper feature extraction

- [-1, 5, C2f, [320, True]] # Increased repeats for deeper feature extraction

- [-1, 1, Conv, [640, 3, 2]] # 3-P3/8

- [-1, 5, C2f, [640, True]] # Increased repeats for deeper feature extraction

- [-1, 1, SCDown, [1280, 3, 2]] # 5-P4/16

- [-1, 5, C2f, [1280, True]] # Increased repeats for deeper feature extraction

- [-1, 1, SCDown, [1280, 3, 2]] # 7-P5/32, increased channels

- [-1, 1, SPPF, [1280, 5]] # 8, increased channels

- [-1, 1, PSA, [1280]] # 9, increased channels

\parhead:

- [-1, 1, nn.Upsample, [None, 2, ”nearest”]]

- [[-1, 6], 1, Concat, [1]] # cat backbone P4

- [-1, 5, C2f, [640]] # Increased repeats for enhanced feature fusion

\par- [-1, 1, nn.Upsample, [None, 2, ”nearest”]]

- [[-1, 4], 1, Concat, [1]] # cat backbone P3

- [-1, 5, C2f, [320]] # Increased repeats for enhanced feature fusion

\par- [-1, 1, Conv, [640, 3, 2]]

- [[-1, 12], 1, Concat, [1]] # cat head P4

- [-1, 5, C2f, [1280]] # Increased repeats for enhanced feature fusion

\par- [[12, 15, 18], 1, v10Detect, [nc]] # Detect(P3, P4, P5)

\par\par

Figure 5. Example 2 generated by Llama 3.3-GE2 .

⬇

# Parameters

nc: 80 # number of classes

depth_multiple: 0.5 # model depth multiple

width_multiple: 0.5 # layer channel multiple

\par# backbone

backbone:

# [from, number, module, args]

- [-1, 1, Conv, [32, 3, 1]] # 0

- [-1, 1, Conv, [64, 3, 2]] # 1-P1/2

- [-1, 1, Bottleneck, [64]]

- [-1, 1, Conv, [128, 3, 2]] # 3-P2/4

- [-1, 1, Bottleneck, [128]]

- [-1, 1, Conv, [256, 3, 2]] # 5-P3/8

- [-1, 2, Bottleneck, [256]]

- [-1, 1, Conv, [512, 3, 2]] # 7-P4/16

- [-1, 2, Bottleneck, [512]]

- [-1, 1, SPP, [256, [5, 9, 13]]] # Added SPP module

- [-1, 1, Conv, [512, 1, 1]] # Added Conv module

\par# head

head:

- [-1, 1, Bottleneck, [512, False]]

- [-1, 1, Conv, [512, 3, 1]]

- [-1, 1, Conv, [256, 1, 1]]

- [-1, 1, Conv, [512, 3, 1]] # 10 (P5/32-large)

\par- [-2, 1, Conv, [128, 1, 1]]

- [-1, 1, nn.Upsample, [None, 2, ”nearest”]]

- [[-1, 6], 1, Concat, [1]] # cat backbone P4

- [-1, 1, Bottleneck, [256, False]]

- [-1, 1, Conv, [128, 1, 1]]

- [-1, 1, Conv, [256, 3, 1]] # 16 (P4/16-medium)

\par- [-2, 1, Conv, [64, 1, 1]]

- [-1, 1, nn.Upsample, [None, 2, ”nearest”]]

- [[-1, 3], 1, Concat, [1]] # cat backbone P3

- [-1, 1, Bottleneck, [128, False]]

- [-1, 1, Conv, [64, 1, 1]] # 20 (P3/8-small)

\par- [[20, 16, 10], 1, Detect, [nc]] # Detect(P3, P4, P5)

\par

Figure 6. Example 3 generated by Llama 3.3-GE2 .

Listing 4 shows a modified YAML configuration file of the seed model. It was one of the Pareto front individuals proposed by Llama-3.3-GE2. The modifications made by the LLM-guided mutations and crossovers include changing the inputs for the Conv and Bottleneck layers and changing the number of layers.

5.2. Performance Improvements

The application of LLM-Guided Evolution to YOLO optimization generates promising YOLO variants with performance improvements across several key metrics, demonstrating the efficacy of this approach in optimizing the architecture and associated hyperparameters when seeded with a state-of-the-art object detection model for a challenging real-world dataset like KITTI. These improvements can be attributed to the framework’s ability to intelligently explore the configuration space through LLM-Guided mutations and crossovers. This autonomous framework successfully generated many YOLO variants, outperforming the original network’s performance of 92.5% mAP@50 on holdout data. A notable example generated by GE with Llama 3.3 achieves 94.5% mAP@50 on holdout data.

To understand the variety of individuals with unique contributions discovered across both GE1 and GE2, Figure 7 presents a parallel coordinate plot of the individuals on the Pareto frontier evaluated on the validation dataset, which the LLM-Guided Evolution framework uses to assess the performance of newly generated individuals. We combined the 135 Pareto-optimal individuals from GE1 with the 128 Pareto-optimal individuals from GE2, and kept only the individuals that remained Pareto-optimal. These individuals are co-dominant with respect to the objectives of model parameters, inference speed, precision, and recall, while also highlighting two key object detection metrics: mAP@50 and mAP@50-95. In the plot, blue diamonds connected by lines represent the performance of the remaining 11 individuals generated by Llama-3-3-GE1 and orange stars connected by lines indicate the performance of the remaining 63 individuals generated by Llama-3.3 GE2. The figure also shows the performance of YOLOv3, YOLOv3 SPP, YOLOv3 tiny, and other State-of-the-Art YOLO model for reference. We retrained these models on the same KITTI data with the default settings for training. Optimization of the training setting is not included in this study, but it remains a topic for future exploration.

Similar to Figure 7, Figure 8 presents the parallel coordinate plot of the Pareto front individuals evaluated on the holdout dataset. We reduced the set of individuals using the same procedure as applied to the validation set. The figure illustrates that some individuals outperform the reference models, including YOLOv3, YOLOv3 SPP, YOLOv3 tiny, YOLOv9s, YOLOv10tiny and YOLOv11. However, these improved variants come at the cost of higher computational complexity, making them more resource-intensive. In the plot, blue diamonds connected by lines represent the performance of 6 individuals generated by Llama-3.3-GE1 and orange stars connected by lines indicate the performance of 45 individuals generated by Llama-3.3-GE2. Listings 4-6 are examples 1-3 highlighted in Figure 8. They demonstrate that the LLM-GE is able to explore the configuration space by modifying numbers of layers, adding layers ( such as SPP layer and Conv layer), and changing the inputs of layers.

These visualizations effectively illustrate that the LLM-GE successfully evolved YOLO variants with trade-off between mAP and speed. This result highlights the flexibility and effectiveness of GE in real-world computer vision challenges, offering a novel paradigm for autonomous model optimization that combines LLM-driven reasoning with evolutionary strategies.

6. Conclusions and Future Work

We introduced a novel approach of using a LLM-GE to intelligently modify neural network architectures through their module, layer, and hyperparameter descriptions in YAML configuration files to optimize the YOLO object detection model on real-world KITTI data. Our approach focuses on modifying the YAML configuration file that defines YOLO architectures within the Ultralytics codebase, enabling structured and interpretable optimization. This method allows for systematic exploration of hyperparameter spaces and architecture variations in a manner that is both efficient and scalable. Our results are promising, as LLM-GE successfully evolved multiple YOLO variants with significant improvements in performance over seeded individuals and state-of-the-art implementations. By allowing LLMs to drive the evolutionary process, we observed notable enhancements in object detection accuracy.

This study serves as a proof-of-concept for evolving YAML configuration files using LLM-GE. While our approach has demonstrated clear potential, there remains significant opportunity for further research. Future work should explore more sophisticated evolutionary techniques, including island migration with multiple simultaneous LLMs, co-evolutionary improvements using prompts specifically tailored to LLMs and tasks at hand, and fine-grained parameter tuning. Additionally, expanding the scope of LLM-GE beyond YAML configurations to incorporate automated code generation for implementing new layers in adherence with the YAML settings could further enhance YOLO’s adaptability. Increasing the number of repeated trials for rigorous statistical analysis and continuing integration with SoTA open-source LLMs, such as DeepSeek-R1, Codestral, and Mixtral 8x22B, will be crucial for further advancements. It is also worth exploring the impact of seeds on the evolutionary process.

Other areas of future growth involve the manner of integration with ongoing research in the utilization of LLMs, including concepts such as Retrieval Augmented Generation (RAG) or fine-tuning to improve the performance of the mating and mutation operations that the LLMs are responsible for.

Ultimately, our goal is to continue to build a framework that adapts easily to new problem domains, objectives, and state-of-the-art models. We will continue to automate the evolution of YOLO variants to outperform existing SoTA models across all key metrics. As LLMs continue to advance, they will play an increasingly pivotal role in automating and refining neural network design, paving the way for more efficient, adaptive, and high-performing AI models.

Acknowledgements.

The authors wish to acknowledge Clint Morris, the original first-author of LLM-GE, and whose flexible codebase enabled the research performed in this paper.

References

(1)
Bergstra et al. (2011) James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems (NeurIPS).
Chen et al. ([n. d.]) Angelica Chen, David M Dohan, and DR So. [n. d.]. Evoprompting: Language models for code-level neural architecture search, CoRR abs/2302.14838 (2023). doi: 10.48550. arXiv preprint arXiv.2302.14838 20 ([n. d.]), 145.
Chen et al. (2019) Yuhang Chen, Tong Yang, Xiangyu Zhang, Gaofeng Meng, Changxin Pan, and Jian Sun. 2019. DetNAS: Backbone Search for Object Detection. (2019).
Drori et al. (2021) Iddo Drori, Yamuna Krishnamurthy, Rémi Rampin, Rui Lourenço, Edwin Wang, Kyunghyun Cho, Claudio T. Silva, and Juliana Freire. 2021. AlphaD3M: Machine Learning Pipeline Synthesis. arXiv preprint arXiv:2111.02508 (2021).
Geiger et al. (2013) A Geiger, P Lenz, C Stiller, and R Urtasun. 2013. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research 32, 11 (2013), 1231–1237. doi:10.1177/0278364913491297 arXiv:https://doi.org/10.1177/0278364913491297
Hemberg et al. (2024) Erik Hemberg, Stephen Moskal, and Una-May O’Reilly. 2024. Evolving code with a large language model. Genetic Programming and Evolvable Machines 25, 2 (2024), 21.
Hussain (2024a) Muhammad Hussain. 2024a. YOLOv1 to v8: Unveiling Each Variant–A Comprehensive Review of YOLO. IEEE Access 12 (2024), 42816–42833. doi:10.1109/ACCESS.2024.3378568
Hussain (2024b) Muhammad Hussain. 2024b. Yolov5, yolov8 and yolov10: The go-to detectors for real-time vision. arXiv preprint arXiv:2407.02988 (2024).
Hutter et al. (2019) Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. 2019. Automated Machine Learning: Methods, Systems, Challenges. Springer.
Jiang et al. (2024) Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024).
Jocher et al. (2023) Glenn Jocher, Jing Qiu, and Ayush Chaurasia. 2023. Ultralytics YOLO. Ultralytics. https://ultralytics.com If you use this software, please cite it using the metadata from this file..
Ledell and Poirier (2020) Erin Ledell and Sebastien Poirier. 2020. H2O AutoML: Scalable Automatic Machine Learning. 7th ICML Workshop on Automated Machine Learning (2020) (2020).
Liu et al. (2023) Fei Liu, Xialiang Tong, Mingxuan Yuan, and Qingfu Zhang. 2023. Algorithm evolution using large language model. arXiv preprint arXiv:2311.15249 (2023).
Morris et al. (2024) Clint Morris, Michael Jurado, and Jason Zutty. 2024. Llm guided evolution-the automation of models advancing models. In Proceedings of the Genetic and Evolutionary Computation Conference. 377–384.
Nasir et al. (2024) Muhammad Umair Nasir, Sam Earle, Julian Togelius, Steven James, and Christopher Cleghorn. 2024. LLMatic: Neural Architecture Search Via Large Language Models And Quality Diversity Optimization. In Proceedings of the Genetic and Evolutionary Computation Conference (Melbourne, VIC, Australia) (GECCO ’24). Association for Computing Machinery, New York, NY, USA, 1110–1118. doi:10.1145/3638529.3654017
Olson and Moore (2016) Randal S Olson and Jason H Moore. 2016. TPOT: A tree-based pipeline optimization tool for automating machine learning. In Workshop on automatic machine learning. PMLR, 66–74.
Redmon (2016) J Redmon. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Redmon and Farhadi (2018) Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An Incremental Improvement. CoRR abs/1804.02767 (2018). arXiv:1804.02767 http://arxiv.org/abs/1804.02767
Sun et al. (2022) Zhenhong Sun, Ming Lin, Xiuyu Sun, Zhiyu Tan, Hao Li, and Rong Jin. 2022. MAE-DET: Revisiting Maximum Entropy Principle in Zero-Shot NAS for Efficient Object Detection. arXiv:2111.13336 [cs.CV] https://arxiv.org/abs/2111.13336
Tan and Le (2019) Mingxing Tan and Quoc V. Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946 (2019).
Tian et al. (2025) Yunjie Tian, Qixiang Ye, and David Doermann. 2025. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv:2502.12524 [cs.CV] https://arxiv.org/abs/2502.12524
Wang et al. (2022) Xiaoxing Wang, Jiale Lin, Juanping Zhao, Xiaokang Yang, and Junchi Yan. 2022. Eautodet: Efficient architecture search for object detection. In European Conference on Computer Vision. Springer, 668–684.
Wu et al. (2024) Xingyu Wu, Sheng-hao Wu, Jibin Wu, Liang Feng, and Kay Chen Tan. 2024. Evolutionary computation in the era of large language model: Survey and roadmap. arXiv preprint arXiv:2401.10034 (2024).
Xu et al. (2022) Xianzhe Xu, Yiqi Jiang, Weihua Chen, Yilun Huang, Yuan Zhang, and Xiuyu Sun. 2022. Damo-yolo: A report on real-time object detection design. arXiv preprint arXiv:2211.15444 (2022).
Yao et al. (2020) Lewei Yao, Hang Xu, Wei Zhang, Xiaodan Liang, and Zhenguo Li. 2020. SM-NAS: Structural-to-modular neural architecture search for object detection. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 12661–12668.
Zhou and Su (2021) Shi-Yao Zhou and Chung-Yen Su. 2021. A novel lightweight convolutional neural network, ExquisiteNetV2. arXiv preprint arXiv:2105.09008 (2021).
Zoph and Le (2017) Barret Zoph and Quoc V. Le. 2017. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 (2017).
Zutty et al. (2015) Jason Zutty, Daniel Long, Heyward Adams, Gisele Bennett, and Christina Baxter. 2015. Multiple objective vector-based genetic programming using human-derived primitives. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation. 1127–1134.