Experiment tracking for XGBoost projects

XGBoost is a machine learning library based on the gradient boosting framework. It can run in distributed environments and can handle billions of data samples. Unlike LightGBM, which grows trees leaf-wise, XGBoost uses depth-wise tree growth by default. XGBoost also generates feature importance scores to help you identify the most critical features. This article will look at tracking various aspects of your XGBoost project, including logging the feature importance graph and monitoring training using custom XGBoost callbacks.

Let's get started.

Install Layer

Layer provides an open-source SDK for tracking all your machine learning project metadata. It provides a hub for logging, displaying, comparing, and sharing datasets, models, and project documentation.

GitHub - layerai/sdk: Metadata store for Production ML

Furthermore, Layer provides seamless local and cloud integration. For example, you can perform simple computer vision experiments locally and quickly move the execution to the cloud by adding a few lines of code to your project. This saves you the time you'd spend setting up and configuring servers. What's more, Layer gives you up to 30 hours per week of free GPU. Layer will still track your project's metadata whether you use your local resources to run your project or run it on Layer Cloud.

Install Layer to start tracking your XGBoost project's metadata.

pip install -U layer

Connect your script to Layer

Layer stores your ML metadata under your account. Let's import Layer and set up an account.

Click the link generated to set up your account.

You can sign up with either Google or GitHub. Click Continue on the next screen to set up your account.

On the next page, you'll get a chance to set a unique username for your account.

After choosing a username and clicking continue, you will be signed in to your account, and a code to authenticate the account will be generated.  

Copy the code, paste it into the textbox in your notebook, and press Enter to authenticate your account on a notebook.

If you are not working on a notebook environment, you can obtain an API key on the developer settings of your Layer account.  

Use that API key to log in to your Layer account. The key gives you full access to your account. You should, therefore, keep it private.

Layer encapsulates all ML metadata in a project. You can create multiple projects, and each project can have numerous experiments. Therefore, the first step is to initialize a project.

Once the project is created, the SDK will output a link that you can use to access your project.

All your project metadata will now be stored and visible on this page. Any project you create is private by default; to make it public, use the settings menu. Later in the article, we'll discuss how to add project documentation to this page.

Version datasets

Layer allows you to save and version your datasets. Versioning occurs automatically as you make changes to the data. Versioning the data helps you quickly pick a specific data version and use it to train machine learning models. You can also link models to their corresponding dataset version, which makes your machine learning models easier to reproduce. Storing dataset versions also ensures that you don't repeat expensive data preprocessing steps.

Add datasets | Layer documentation

Let's illustrate how to add and version datasets with Layer using the simple heart disease dataset from the UCI machine learning repository. We need to sync our local and Layer remote environments to upload the data to Layer. The @resources decorator is responsible for this. Syncing is done by passing the files or folders we would like to sync to the decorator.  

Next, define a function that returns a DataFrame. The function should be wrapped with the @dataset decorator. The preferred dataset name is passed to this decorator. The function is executed on Layer infrastructure by passing it to the layer.run function.

Layer will execute the function remotely and return a link to the data. Notice that it also contains the dataset version. Layer increments this number as you change the data. Click the link to view the dataset on Layer.

On the dataset page, you will see:

  • A sample of the dataset.
  • Dataset summary charts and statistics.
  • The fabric that was used to build the dataset.  
  • Different dataset versions if you have executed the dataset build function more than once.
  • The individual that executed each build.
  • The dataset owner.
  • How long it took to build the dataset.
  • Execution logs and any logged data.
  • A Use from Layer code snippet that you can copy to start using the dataset in your project.

Log model parameters

Logging model parameters makes it possible to reproduce different experiments. It also lets you compare different experiments to see how the parameters affect the model's accuracy. Layer provides a log function that you can use to log anything to Layer.

Logged parameters are visible on the Layer web UI.

Version models

Versioning models enables you to try out different algorithms and parameters. Layer versions XGBoost models automatically. Layer creates the first version when you run the model training function, and this version is incremented in subsequent runs. Each model version therefore has its own metadata, which makes it possible to compare different model runs.

Apart from the model version, you will also see the following information on the model's page:

  • The person who executed each model build.
  • The algorithm used for that training run.
  • The environment where the model was executed, either local or Layer remote fabrics.
  • The Use from Layer snippet, which provides code that you can copy and paste to start using the model immediately.
  • When the training ran and how long it took.
  • Execution logs that you can use to debug when training using Layer fabrics.

Log test metrics

The next step after training the XGBoost model is to calculate metrics on the test data. You can use Layer to log the metrics for each run and then compare them across experiments. We use layer.log to log the test metrics.

Log sample predictions

Layer supports logging of Pandas DataFrames. You can use this feature to log sample datasets or sample model predictions. In the code snippet below, we use it to log some predictions from our XGBoost model.

Log using custom XGBoost callbacks

XGBoost provides callbacks for tracking machine learning models. An example of a built-in callback is the EarlyStopping callback. Layer provides a built-in XGBoost callback for tracking your XGBoost projects.

The callback is passed as a parameter when initializing the model.

Logging charts

You can also log charts generated during XGBoost modeling to Layer. Each logged item is attached to a specific model run, making it easy to compare different experiments. In the example below, we log a confusion matrix using Matplotlib.

Use a Layer-trained model to make predictions

Models trained on Layer can be fetched and used to make predictions immediately. Copy the code snippet from Use from Layer to fetch the model.

We now use the XGBoost model to make predictions on new data.

Compare different experiments

You may want to compare different experiments after logging the XGBoost model metadata. To compare, tick the experiments you want from the left panel. Layer will then display a comparison of all logged metadata from those experiments.

Document XGBoost projects using Layer

Creating project reports with Layer is very easy. Layer provides an empty page for every project you create. You can populate this page by creating a README.md file. Layer will read the content of this file when you run layer.init and render it on that page.

Layer project reports can also be dynamic. You can paste links to Layer entities, such as datasets and models, into the report, and they will be rendered on the page.

The project below shows an example of a Layer project using a dynamic project report.

Final thoughts

This article has covered how you can perform experiment tracking for XGBoost projects. We have also seen how you can store the various metadata generated when creating XGBoost models. Specifically, we have covered:

  • Versioning datasets used for XGBoost models.
  • Versioning XGBoost model parameters.
  • Logging model test metrics.
  • Logging model sample predictions.
  • Versioning charts and images.
  • Logging with XGBoost custom callbacks.
  • Fetching the XGBoost model for predictions.
  • Comparing different XGBoost experiments with Layer.
  • Creating dynamic reports for XGBoost projects using Layer.

Try Live XGBoost Notebook

Interested in discussing a use case for your organization? Book a slot below, and we'll show you how to integrate Layer into your existing ML code without breaking a sweat.

Book a demo

For more machine learning news, tutorials, code, and discussions join us on Slack, Twitter, LinkedIn, and GitHub. Also, subscribe to this blog so you don't miss a post.
