Suppose you work for a nationwide used vehicle retailer and want to develop a machine learning (ML) model that accurately estimates the market value of used cars. The first step is to gather a comprehensive dataset of transactions from several thousand used cars; for this project, assume you can compile detailed information on close to 10,000 vehicles. Each transaction in your dataset should include the output variable, namely the sales price, as well as an extensive range of input features for each car. These features may include:

- **Make and Model**: The manufacturer and specific design of the vehicle, which directly influence its resale value.
- **Launch Year**: The year the car was originally manufactured; older models tend to lose value over time, while newer models may still command a high market price.
- **Mileage**: The total distance the vehicle has been driven, typically measured in miles or kilometres; lower mileage often correlates with higher value.
- **Color**: The exterior color of the vehicle can affect desirability; certain colors are more sought after in the used car market.
- **Secondary Options**: Features such as the type of seats (e.g., leather vs. cloth), advanced entertainment systems, safety features, and technological enhancements like navigation systems.
- **Exterior and Interior Condition**: The physical state of the car, including scratches, dents, upholstery wear, and cleanliness, all of which can significantly affect value.
- **Repair History**: Documentation of maintenance and repairs performed on the vehicle; a complete history can enhance perceived reliability.
- **Accident History**: Information on any past accidents, which tend to lower resale value.

Once you have gathered this data, it is crucial to clean and preprocess it.
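As a concrete sketch of that cleaning and preprocessing step, the pass below uses pandas with hypothetical column names (`sale_price`, `mileage`, `mileage_unit`, and so on); a real dataset would have its own schema.

```python
import pandas as pd

def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Basic cleaning pass: duplicates, missing values, unit standardisation.

    Column names here are illustrative assumptions, not a fixed schema.
    """
    df = df.drop_duplicates()
    # Rows missing the target price cannot be used for supervised training.
    df = df.dropna(subset=["sale_price"])
    # Standardise mileage to a single unit (miles), assuming a
    # 'mileage_unit' column marks rows recorded in kilometres.
    km_mask = df["mileage_unit"] == "km"
    df.loc[km_mask, "mileage"] = df.loc[km_mask, "mileage"] * 0.621371
    df.loc[km_mask, "mileage_unit"] = "mi"
    # Give missing categorical fields an explicit 'unknown' level rather
    # than dropping the whole row.
    for col in ["color", "repair_history", "accident_history"]:
        df[col] = df[col].fillna("unknown")
    return df
```

A pass like this keeps every usable row while guaranteeing that prices and mileages are comparable across the whole dataset.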
This involves handling missing values, removing duplicates, and standardising formats (e.g., ensuring all prices are in the same currency and all mileages use the same unit). With a clean dataset, you have a clear foundation for supervised learning, since the dataset contains both input features and the corresponding sales prices.

It is advisable to explore several supervised learning algorithms to determine which performs best on your data. Start with a selection of pertinent algorithms, such as linear regression, decision trees, random forests, and more advanced models like gradient boosting machines or neural networks. For each algorithm you will then need to make several configuration choices; strictly speaking, some of these (such as the loss function and the data split) are design decisions rather than hyperparameters, but they are set and revised in the same iterative fashion:

- **Loss Function**: Quantifies how close the predicted values are to the actual sales prices. Common options are Mean Absolute Error (MAE) and Mean Squared Error (MSE).
- **Data Split**: Divide your dataset into three distinct sets: a training set (typically 70%), a validation set (15%), and a test set (15%). The training set is used to fit the model, the validation set to tune the model's hyperparameters, and the test set for the final performance evaluation.
- **Number of Epochs**: The number of times the entire training set is used to update the model parameters; you may start with a higher number and adjust based on performance.
- **Batch Size**: The subset of data processed before the model's internal parameters are updated; smaller batches give more frequent gradient updates but can make training take longer.
- **Hidden Layers**: When employing neural networks, the number of hidden layers greatly affects the model's complexity; more hidden layers allow the model to learn more intricate feature interrelationships.

After defining these settings, each algorithm will undergo a training phase.
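The split-and-train loop described above might be sketched as follows with scikit-learn. This is a minimal example: the candidate models, their settings, and the 70/15/15 split are assumptions taken from the text, and a production pipeline would also encode categorical features and tune hyperparameters per model.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

def split_and_validate(X, y, random_state=0):
    """70/15/15 split, then fit candidate models and score validation MAE."""
    # First carve off 30%, then split that half-and-half into val and test.
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.30, random_state=random_state)
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.50, random_state=random_state)

    candidates = {
        "linear": LinearRegression(),
        "random_forest": RandomForestRegressor(
            n_estimators=100, random_state=random_state),
        "gbm": GradientBoostingRegressor(random_state=random_state),
    }
    val_mae = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        val_mae[name] = mean_absolute_error(y_val, model.predict(X_val))
    # The test split is held back untouched for the final comparison.
    return candidates, val_mae, (X_test, y_test)
```

The validation MAE values drive the iterative tuning loop: adjust a model's settings, refit, and re-score until no further improvement is found.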
After each training session, use the validation data to assess model accuracy. Based on the validation results, iteratively adjust the hyperparameters, retrain the model, and re-evaluate performance until you reach the best possible model for each algorithm.

Once the models are trained, the next step is to compare the optimised models on the test set. This comparison identifies which algorithm produced the top-performing model in terms of accuracy and reliability.

In the concluding phase, evaluate whether the best ML model's performance surpasses human judgment in valuing used cars. If the model demonstrates superior accuracy, it can become the primary tool for valuation. If it is nearly as accurate but not better, you may consider using the model to support human decision-making. If it lags significantly behind human judgment, it may be prudent to set the model aside until better data quality or more advanced algorithms become available. This thorough assessment ensures that the model adds tangible value to your operations, enhancing the accuracy of pricing in the competitive used vehicle market.
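The final comparison on the held-out test set can be sketched as below. The `models` dictionary of already-fitted candidates is a hypothetical carry-over from the validation step, and the human benchmark figure would have to come from your own appraisal records.

```python
from sklearn.metrics import mean_absolute_error

def pick_best_model(models, X_test, y_test):
    """Score each already-fitted model on the test set; lowest MAE wins."""
    scores = {name: mean_absolute_error(y_test, m.predict(X_test))
              for name, m in models.items()}
    best = min(scores, key=scores.get)
    return best, scores

def adoption_decision(model_mae, human_mae, tolerance=0.05):
    """Translate the test score into the deployment decision from the text.

    'tolerance' (here 5% of human error) defines 'nearly as accurate';
    that threshold is an illustrative assumption, not a standard value.
    """
    if model_mae < human_mae:
        return "primary valuation tool"
    if model_mae <= human_mae * (1 + tolerance):
        return "decision support for human appraisers"
    return "shelve until better data or algorithms are available"
```

Keeping the test set out of all tuning decisions is what makes this final comparison an honest estimate of how the chosen model will perform on future valuations.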