Data Leakage - Nate's Notes

# Data Leakage

## Vectorization and Data Leakage

You would usually use the same techniques to vectorize data whether you are visualizing it or feeding it to a model. There is an important distinction, however. When you vectorize data to feed it to a model, you should vectorize your training data and save the parameters you used to obtain the training vectors. You should then use the same parameters for your validation and test sets.

When normalizing data, for example, you should compute summary statistics such as the mean and standard deviation only on your training set, then reuse those same values to normalize your validation data and any data seen during inference in production.

Using both your validation and training data for normalization, or to decide which categories to keep in your one-hot encoding, would cause data leakage, as you would be leveraging information from outside your training set to create training features. This would artificially inflate your model’s measured performance during validation while making it perform worse in production. We will cover this in more detail in “Data leakage”.

---

Date: 20220106
Links to: [Machine Learning in Startups](Machine%20Learning%20in%20Startups.md)
Tags:
References:
* []()
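
The fit-on-training, reuse-everywhere rule from the Vectorization and Data Leakage section can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the helper names (`fit_normalizer`, `normalize`, `fit_categories`, `one_hot`) and the toy data are hypothetical, and it assumes a single numeric feature and a single categorical feature.

```python
from statistics import mean, pstdev


def fit_normalizer(train_values):
    """Compute normalization parameters from the training set only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    # Guard against a constant feature to avoid dividing by zero.
    return {"mean": mu, "std": sigma if sigma > 0 else 1.0}


def normalize(values, params):
    """Apply previously saved parameters; never recompute them here."""
    return [(v - params["mean"]) / params["std"] for v in values]


def fit_categories(train_labels):
    """Decide which one-hot categories to keep using training labels only."""
    return sorted(set(train_labels))


def one_hot(labels, categories):
    """Encode labels; categories unseen during training become all zeros."""
    return [[1 if label == c else 0 for c in categories] for label in labels]


# Hypothetical toy data, for illustration only.
train_x = [1.0, 2.0, 3.0, 4.0]
val_x = [5.0, 6.0]

# Fit on the training split, then reuse the saved parameters elsewhere.
params = fit_normalizer(train_x)
train_scaled = normalize(train_x, params)
val_scaled = normalize(val_x, params)

train_colors = ["red", "blue", "red"]
val_colors = ["blue", "green"]  # "green" never appears in training

categories = fit_categories(train_colors)
val_encoded = one_hot(val_colors, categories)
```

The leaky variant would call `fit_normalizer(train_x + val_x)` or `fit_categories(train_colors + val_colors)`, letting validation data shape the training features. In the sketch, the unseen category `"green"` is encoded as all zeros, which is one possible convention and not the only one.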