Data is to AI what fuel is to fire; the cleaner, richer, and larger the dataset upon which the AI model is trained, the more accurate and reliable the outcomes will be. Below are five crucial steps for building a robust dataset that enhances the performance of AI models.
First, focus on achieving maximum consistency in the labelling and formatting of each entry in your dataset. Inconsistent labelling is like background noise in a crowded room: it drowns out the signal you need. Consider VideaHealth, a startup that uses AI to help dentists diagnose dental X-rays. As highlighted in a Harvard case study, Videa sourced several million X-ray images from various dental service organisations but encountered significant challenges because image formats and clinical labelling conventions varied from practice to practice. To address this, Videa's developers built software to standardise the image formats and labelling conventions, a step that proved essential for improving the accuracy of their AI models by providing uniform data inputs.
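The standardisation step can be sketched in a few lines. This is a minimal illustration, not VideaHealth's actual pipeline: the label vocabulary, field names, and mapping rules below are all invented for the example.

```python
# Hypothetical map from each practice's label vocabulary to one canonical label.
LABEL_MAP = {
    "cavity": "caries",
    "decay": "caries",
    "karies": "caries",
    "peri": "periapical_lesion",
    "pa_lesion": "periapical_lesion",
}

def standardise_record(record: dict) -> dict:
    """Normalise a raw record into a single canonical schema."""
    raw_label = record["label"].strip().lower()
    return {
        "image_id": record["id"].strip().lower(),
        "label": LABEL_MAP.get(raw_label, raw_label),  # fall back to the cleaned label
        "format": record.get("format", "dicom").lower(),
    }

raw = {"id": "IMG_001 ", "label": "Decay", "format": "DICOM"}
print(standardise_record(raw))  # → {'image_id': 'img_001', 'label': 'caries', 'format': 'dicom'}
```

Running every incoming record through one such function guarantees the model sees a uniform vocabulary, whichever practice the data came from.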
Second, evaluate whether your dataset is sufficiently rich in features—those variables associated with each case. A richer dataset with relevant features significantly improves model accuracy. For example, if you aim to develop an AI model to recommend jackets based on a female customer's potential purchase of trousers on your online store, consider factors like her purchase history, age, ethnicity, profession, and even geographical details such as whether she resides in a bustling metropolis or a tranquil small town. If you lack information on any of these features, your AI model might miss critical insights, rendering its recommendations less accurate.
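To make the jacket-recommendation example concrete, here is a sketch of assembling a feature vector for one customer. The field names and values are assumptions for illustration, not a real schema.

```python
def build_features(customer: dict, basket: dict) -> dict:
    """Combine customer attributes and the current basket into model features."""
    return {
        "past_purchases": len(customer.get("purchase_history", [])),
        "age": customer.get("age"),
        "city_size": customer.get("city_size"),      # e.g. "metro" or "small_town"
        "trouser_colour": basket.get("colour"),
        "trouser_price": basket.get("price"),
    }

customer = {"purchase_history": ["coat", "scarf"], "age": 34, "city_size": "metro"}
basket = {"colour": "navy", "price": 79.0}
features = build_features(customer, basket)
print(features)
```

If a feature such as `city_size` is missing, `build_features` returns `None` for it, and the model loses that signal entirely; richer, more complete inputs give the model more to learn from.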
Third, identify and address any missing data within your dataset. One approach is manually collecting the missing information, although this can be time-consuming and costly, especially with large datasets. An alternative method is to employ statistical techniques, such as interpolation, to estimate missing values based on existing data. Alternatively, you might choose to train the AI model only on features with complete data or on those deemed most important to avoid compromising the model's overall performance.
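The interpolation approach can be sketched in pure Python. This assumes numeric values ordered meaningfully (for example, by time) and gaps only in the interior of the series; boundary gaps would need a different rule, such as forward- or back-filling.

```python
def interpolate_missing(values):
    """Fill interior None gaps by linear interpolation between known neighbours."""
    filled = list(values)
    for i, v in enumerate(filled):
        if v is None:
            # nearest known neighbours on each side (assumes they exist)
            lo = next(j for j in range(i - 1, -1, -1) if filled[j] is not None)
            hi = next(j for j in range(i + 1, len(filled)) if filled[j] is not None)
            frac = (i - lo) / (hi - lo)
            filled[i] = filled[lo] + frac * (filled[hi] - filled[lo])
    return filled

prices = [10.0, None, 14.0, None, None, 20.0]
print(interpolate_missing(prices))  # → [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
```

The third option, dropping incomplete features altogether, is simply a column filter before training; the trade-off is losing whatever signal the dropped features carried.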
Fourth, scrutinise the dataset for unacceptable biases that may be embedded within it. For instance, if you are an HR manager aiming to use AI to screen job applicants, it's vital to examine whether historical ethnic or gender biases present in the dataset could adversely affect future hiring predictions. Addressing this issue involves implementing strategies to mitigate these biases during model training, a topic that we will explore in detail later in this course.
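One simple way to surface such biases is to compare positive-outcome rates across groups in the historical data, a check related to demographic parity. The records below are invented for illustration; a real audit would use actual hiring histories and several fairness metrics.

```python
from collections import defaultdict

def selection_rates(records):
    """Positive-outcome rate per group; records are (group, outcome) pairs."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for group, outcome in records:
        totals[group] += 1
        positives[group] += outcome
    return {g: positives[g] / totals[g] for g in totals}

history = [("A", 1), ("A", 1), ("A", 0), ("B", 1), ("B", 0), ("B", 0)]
rates = selection_rates(history)
print(rates)  # → {'A': 0.666..., 'B': 0.333...}
```

A large gap between groups does not by itself prove unfair treatment, but it flags a pattern the model would otherwise learn and reproduce, and so warrants mitigation during training.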
Fifth, ensure that the size of your dataset is adequate. A few thousand entries might suffice for simple input-output relationships, such as training a model to recognise car brands from rear-view images, but more complex scenarios, like estimating the value of a used car, which involves many variables and intricate relationships among them, require significantly larger datasets. Organisations often fail to centralise data collection across business units, missing the chance to build a more extensive data repository. To counteract this, standardise, automate, and centralise data collection for every transaction, and establish proactive operational protocols that clearly define the what, why, how, and who of data collection.
Now, consider two specific opportunities within your organisation where you could train and deploy AI models. Analyse how you would construct a robust data pipeline tailored to each of these contexts, ensuring thorough preparation for effective data handling and model training.