
Real-time Serving for XGBoost, Scikit-Learn RandomForest, LightGBM, and More

This post dives into how the NVIDIA Triton Inference Server offers highly optimized real-time serving of forest models by using the Forest Inference Library backend.

The success of deep neural networks in multiple areas has prompted a great deal of thought and effort on how to deploy these models for use in real-world applications efficiently. However, efforts to accelerate the deployment of tree-based models (including random forest and gradient-boosted models) have received less attention, despite their continued dominance in tabular data analysis and their importance for use-cases where interpretability is essential.

As organizations like DoorDash and CapitalOne turn to tree-based models for the analysis of massive volumes of mission-critical data, it has become increasingly important to provide tools to help make deploying such models easy, efficient, and performant.

NVIDIA Triton Inference Server offers a complete solution for deploying deep learning models on both CPUs and GPUs with support for a wide variety of frameworks and model execution backends, including PyTorch, TensorFlow, ONNX, TensorRT, and more. Starting in version 21.06.1, to complement NVIDIA Triton Inference Server’s existing deep learning capabilities, the new Forest Inference Library (FIL) backend provides support for tree models, such as XGBoost, LightGBM, Scikit-Learn RandomForest, RAPIDS cuML RandomForest, and any other model supported by Treelite.

Based on the RAPIDS Forest Inference Library (FIL), the NVIDIA Triton Inference Server FIL backend lets users deploy tree-based models on the same system as their deep learning models, using the same NVIDIA Triton Inference Server features they already rely on to achieve optimal throughput and latency.

In this post, we’ll provide a brief overview of the NVIDIA Triton Inference Server itself, then dive into an example of how to deploy an XGBoost model using the FIL backend. Using NVIDIA GPUs, we will see that we do not always have to choose between deploying a more accurate model and keeping latency manageable.

In the example notebook, by taking advantage of the FIL backend’s GPU-accelerated inference on an NVIDIA DGX-1 server with eight V100 GPUs, we’ll be able to deploy a much more sophisticated fraud detection model than would be feasible on CPU, while keeping p99 latency under 2 ms and still delivering over 400K inferences per second (630 MB/s), about 20x higher throughput than on CPU.

NVIDIA Triton Inference Server

NVIDIA Triton Inference Server offers a complete open source solution for real-time serving of machine learning models. Designed to make the process of performant model deployment as simple as possible, NVIDIA Triton Inference Server provides solutions to many of the most common problems encountered when attempting to deploy ML algorithms in real-world applications, including:

  • Multi-Framework Support: Supports all of the most common deep learning frameworks and serialization formats, including PyTorch, TensorFlow, ONNX, TensorRT, OpenVINO, and more. With the introduction of the FIL backend, NVIDIA Triton Inference Server also provides support for XGBoost, LightGBM, Scikit-Learn/cuML RandomForest, and Treelite-serialized models from any framework.
  • Dynamic Batching: Allows users to specify a batching window and collate any requests received in that window into a larger batch for optimized throughput.
  • Multiple Query Types: Optimizes inference for multiple query types: real time, batch, streaming, and also supports model ensembles. 
  • Pipelines and Ensembles: Models deployed with NVIDIA Triton Inference Server can be connected in sophisticated pipelines or ensembles to avoid unnecessary data transfers between client and server or even host and device.
  • CPU Model Execution: While most users will want to take advantage of the substantial performance gains offered by GPU execution, NVIDIA Triton Inference Server allows you to run models on either CPU or GPU to meet your specific deployment needs and resource availability.
  • Customization: If NVIDIA Triton Inference Server does not provide support for part of your pipeline, or if you need specialized logic to link together various models, you can add precisely the logic you need with a custom Python or C++ backend.
  • Run anywhere: On scaled-out cloud or data center, enterprise edge, and even on embedded devices.  It supports both bare metal and virtualized environments (e.g. VMware vSphere) for AI inference.
  • Kubernetes and AI platform support:
    • Available as a Docker container and integrates easily with Kubernetes platforms like AWS EKS, Google GKE, Azure AKS, Alibaba ACK, Tencent TKE or Red Hat OpenShift.
    • Available in managed cloud AI workflow platforms like Amazon SageMaker, Azure ML, Google Vertex AI, Alibaba Platform for AI Elastic Algorithm Service, and Tencent TI-EMS.
  • Enterprise support: NVIDIA AI Enterprise software suite includes full support of NVIDIA Triton Inference Server, such as access to NVIDIA AI experts for deployment and management guidance, prioritized notification of security fixes and maintenance releases, long term support (LTS) options and a designated support agent.
Figure 1: NVIDIA Triton Inference Server Architecture Diagram.
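
To make the dynamic batching feature from the list above more concrete, here is a toy simulation of the idea. It is only a sketch: the `Request` class, the `collate` function, and the window and size limits are hypothetical names chosen for illustration, and the logic is not NVIDIA Triton Inference Server's actual scheduler. It simply groups requests that arrive within a batching window into one larger batch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    arrival_us: int  # arrival time in microseconds
    rows: int        # number of samples carried by this request


def collate(requests: List[Request], window_us: int, max_batch_rows: int) -> List[List[Request]]:
    """Group requests into batches.

    A batch is closed when the window opened by its first request has
    expired, or when adding the next request would exceed max_batch_rows.
    """
    batches: List[List[Request]] = []
    current: List[Request] = []
    current_rows = 0
    for req in sorted(requests, key=lambda r: r.arrival_us):
        window_expired = bool(current) and req.arrival_us - current[0].arrival_us > window_us
        too_large = bool(current) and current_rows + req.rows > max_batch_rows
        if window_expired or too_large:
            batches.append(current)
            current, current_rows = [], 0
        current.append(req)
        current_rows += req.rows
    if current:
        batches.append(current)
    return batches


requests = [Request(0, 1), Request(40, 2), Request(90, 1), Request(250, 4), Request(260, 4)]
batches = collate(requests, window_us=100, max_batch_rows=6)
batch_sizes = [sum(r.rows for r in b) for b in batches]
print(batch_sizes)
```

The first three requests arrive within the same 100-microsecond window and are served as one batch, while the last two are split because together they would exceed the row limit.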

To get a better sense of how we can take advantage of some of these features with the FIL backend for deploying tree models, let’s look at a specific use case.

Example: Fraud Detection with the FIL Backend

In order to deploy a model in NVIDIA Triton Inference Server, we need a configuration file specifying some details about deployment options and the serialized model itself. Models can currently be serialized in any of the following formats:

  • XGBoost binary format
  • XGBoost JSON
  • LightGBM text format
  • Treelite binary checkpoint files
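
As a rough illustration of what such a configuration file might contain, the sketch below renders one as text. The field names (`backend: "fil"`, `input__0`, `output__0`, `model_type`, `predict_proba`, `output_class`, `dynamic_batching`) and the `model_type` strings reflect our reading of the FIL backend documentation and should be verified against the version you deploy. The helper `render_fil_config`, the model name, and the feature count are hypothetical.

```python
# Assumed mapping from the serialization formats listed above to the
# FIL backend's model_type strings; verify against the backend docs.
MODEL_TYPES = {
    "XGBoost binary format": "xgboost",
    "XGBoost JSON": "xgboost_json",
    "LightGBM text format": "lightgbm",
    "Treelite binary checkpoint files": "treelite_checkpoint",
}


def render_fil_config(name, num_features, serialization, num_classes=2,
                      predict_proba=True, use_gpu=True, max_batch_size=32768,
                      max_queue_delay_us=100):
    """Return the text of a config file for a hypothetical FIL deployment."""
    model_type = MODEL_TYPES[serialization]
    # With class probabilities there is one output per class, else one value.
    output_dim = num_classes if predict_proba else 1
    kind = "KIND_GPU" if use_gpu else "KIND_CPU"

    def param(key, value):
        return f'  {{ key: "{key}" value: {{ string_value: "{value}" }} }}'

    params = ",\n".join([
        param("model_type", model_type),
        param("predict_proba", str(predict_proba).lower()),
        param("output_class", "true"),
    ])
    lines = [
        f'name: "{name}"',
        'backend: "fil"',
        f"max_batch_size: {max_batch_size}",
        f'input [ {{ name: "input__0" data_type: TYPE_FP32 dims: [ {num_features} ] }} ]',
        f'output [ {{ name: "output__0" data_type: TYPE_FP32 dims: [ {output_dim} ] }} ]',
        f"instance_group [ {{ kind: {kind} }} ]",
        "parameters [",
        params,
        "]",
        f"dynamic_batching {{ max_queue_delay_microseconds: {max_queue_delay_us} }}",
    ]
    return "\n".join(lines) + "\n"


config_text = render_fil_config("fraud_model", num_features=32, serialization="XGBoost JSON")
print(config_text)
```

Switching `use_gpu` to `False` changes only the instance group, which is one way to compare CPU and GPU deployments of the same serialized model.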

In the following notebook, we will walk through every step of the process for deploying a fraud detection model, from training the model to writing the configuration file and optimizing the deployment parameters. Along the way, we’ll demonstrate how GPU deployments can dramatically increase throughput while keeping latency to a minimum. Furthermore, since FIL can easily scale up to very large and sophisticated models without substantially increasing latency, we’ll see that it is possible to deploy a much more complex and accurate model on GPU than on CPU for any given latency budget.


As we can see in this notebook, the FIL backend for NVIDIA Triton Inference Server allows us to easily serve tree models with just the serialized model file and a simple configuration file. Without NVIDIA Triton Inference Server, those wishing to serve XGBoost, LightGBM, or Random Forest models from other frameworks have often resorted to hand-rolled Flask servers with poor throughput-latency performance and no support for multiple frameworks. NVIDIA Triton Inference Server’s dynamic batching and concurrent model execution automatically maximize throughput, and Model Analyzer helps in choosing the optimal deployment configuration; selecting one manually can mean trying hundreds of combinations and can delay the model rollout. With the FIL backend, we can serve models from all of these frameworks alongside each other with no custom code and highly optimized performance.


With the FIL backend, the NVIDIA Triton Inference Server now offers highly optimized real-time serving of forest models, either on their own or alongside deep learning models. While both CPU and GPU execution are supported, we can take advantage of GPU acceleration to keep latency low and throughput high even for complex models. As we saw in the example notebook, this means that there is no need to compromise model accuracy by falling back to a simpler model, even with tight latency budgets.

If you would like to try deploying your own XGBoost, LightGBM, Sklearn, or cuML forest model for real-time inference, you can easily pull the NVIDIA Triton Inference Server Docker container from NGC, NVIDIA’s catalog of GPU-optimized AI software. You can find everything you need to get started in the FIL backend documentation. NVIDIA Triton also offers example Helm charts if you’re ready to deploy to a Kubernetes cluster. For enterprises looking to trial Triton Inference Server with real-world workloads, the NVIDIA LaunchPad program offers a set of curated labs using Triton in the NVIDIA AI Enterprise suite.

If you run into any issues or would like to see additional features added, please do let us know through the FIL backend issue tracker. You can also contact the RAPIDS team through Slack, Google Groups, or Twitter.

Lastly, NVIDIA is hosting its GPU Technology Conference (GTC) this March. The event is virtual, developer-focused, and free to attend. There are numerous technical sessions and a training lab discussing the NVIDIA Triton Inference Server FIL backend use cases and best practices. Register today!


Can Robots Follow Instructions for New Tasks?

People can flexibly maneuver objects in their physical surroundings to accomplish various goals. One of the grand challenges in robotics is to successfully train robots to do the same, i.e., to develop a general-purpose robot capable of performing a multitude of tasks based on arbitrary user commands. Robots that are faced with the real world will also inevitably encounter new user instructions and situations that were not seen during training. Therefore, it is imperative for robots to be trained to perform multiple tasks in a variety of situations and, more importantly, to be capable of solving new tasks as requested by human users, even if the robot was not explicitly trained on those tasks.

Existing robotics research has made strides towards allowing robots to generalize to new objects, task descriptions, and goals. However, enabling robots to complete instructions that describe entirely new tasks has largely remained out-of-reach. This problem is remarkably difficult since it requires robots to both decipher the novel instructions and identify how to complete the task without any training data for that task. This goal becomes even more difficult when a robot needs to simultaneously handle other axes of generalization, such as variability in the scene and positions of objects. So, we ask the question: How can we confer noteworthy generalization capabilities onto real robots capable of performing complex manipulation tasks from raw pixels? Furthermore, can the generalization capabilities of language models help support better generalization in other domains, such as visuomotor control of a real robot?

In “BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning”, published at CoRL 2021, we present new research that studies how robots can generalize to new tasks that they were not trained to do. The system, called BC-Z, comprises two key components: (i) the collection of a large-scale demonstration dataset covering 100 different tasks and (ii) a neural network policy conditioned on a language or video instruction of the task. The resulting system can perform at least 24 novel tasks, including ones that require interaction with pairs of objects that were not previously seen together. We are also excited to release the robot demonstration dataset used to train our policies, along with pre-computed task embeddings.

The BC-Z system allows a robot to complete instructions for new tasks that the robot was not explicitly trained to do. It does so by training the policy to take as input a description of the task along with the robot’s camera image and to predict the correct action.

Collecting Data for 100 Tasks
Generalizing to a new task altogether is substantially harder than generalizing to held-out variations in training tasks. Simply put, we want robots to have more generalization all around, which requires that we train them on large amounts of diverse data.

We collect data by teleoperating the robot with a virtual reality headset. This data collection follows a scheme similar to how one might teach an autonomous car to drive. First, the human operator records complete demonstrations of each task. Then, once the robot has learned an initial policy, this policy is deployed under close supervision where, if the robot starts to make a mistake or gets stuck, the operator intervenes and demonstrates a correction before allowing the robot to resume.

This mixture of demonstrations and interventions has been shown to significantly improve performance by mitigating compounding errors. In our experiments, we see a 2x improvement in performance when using this data collection strategy compared to only using human demonstrations.
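
The two-stage collection scheme described above can be sketched with a toy one-dimensional problem. This is an illustrative assumption, not the BC-Z data pipeline: `expert_action`, `initial_policy`, the tolerance, and the 1-D state are all hypothetical stand-ins for the human operator, the learned policy, and the robot's state.

```python
def expert_action(state: float) -> float:
    """Hypothetical expert: step halfway toward the goal at 0."""
    return -0.5 * state


def collect_with_interventions(policy, start: float, steps: int, tolerance: float):
    """Roll out `policy` under supervision.

    Whenever the policy's action strays from the expert's by more than
    `tolerance`, the expert takes over for that step and the correction
    is recorded as a new training example.
    """
    corrections = []
    state = start
    for _ in range(steps):
        proposed = policy(state)
        target = expert_action(state)
        if abs(proposed - target) > tolerance:
            corrections.append((state, target))  # operator intervenes
            action = target
        else:
            action = proposed                    # robot continues on its own
        state = state + action
    return corrections


# Stage 1: the operator records complete demonstrations.
dataset = []
state = 4.0
for _ in range(5):
    action = expert_action(state)
    dataset.append((state, action))
    state += action


# Stage 2: an imperfect initial policy is deployed under supervision.
def initial_policy(state: float) -> float:
    return -0.5 * state if state > 1.0 else 0.0  # gets stuck near the goal


corrections = collect_with_interventions(initial_policy, start=4.0, steps=6, tolerance=0.1)
dataset.extend(corrections)
print(len(dataset), len(corrections))
```

The corrections concentrate on the states where the initial policy fails, which is exactly the data that pure demonstrations tend to under-represent.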

Example demonstrations collected for 12 out of the 100 training tasks, visualized from the perspective of the robot and shown at 2x speed.

Training a General-Purpose Policy
For all 100 tasks, we use this data to train a neural network policy to map from camera images to the position and orientation of the robot’s gripper and arm. Crucially, to allow this policy the potential to solve new tasks beyond the 100 training tasks, we also input a description of the task, either in the form of a language command (e.g., “place grapes in red bowl”) or a video of a person doing the task.

To accomplish a variety of tasks, the BC-Z system takes as input either a language command describing the task or a video of a person doing the task, as shown here.

By training the policy on 100 tasks and conditioning the policy on such a description, we unlock the possibility that the neural network will be able to interpret and complete instructions for new tasks. This is a challenge, however, because the neural network needs to correctly interpret the instruction, visually identify relevant objects for that instruction while ignoring other clutter in the scene, and translate the interpreted instruction and perception into the robot’s action space.
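
One simple way to picture task conditioning is a policy that receives both the image features and a task embedding as input. The sketch below uses plain concatenation followed by a linear map. That choice, along with the sizes and weights, is an assumption made for illustration; the post does not say how BC-Z injects the task description into the network.

```python
from typing import List, Sequence


def linear(weights: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Apply a weight matrix (list of rows) to an input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def conditioned_policy(image_features, task_embedding, weights):
    """Concatenate perception and task description, then map to an action."""
    return linear(weights, list(image_features) + list(task_embedding))


# Hypothetical sizes: 3 image features, 2-d task embedding, 2-d action.
weights = [
    [0.1, 0.0, 0.2, 1.0, 0.0],
    [0.0, 0.3, 0.0, 0.0, 1.0],
]
image = [1.0, 2.0, 3.0]
action_a = conditioned_policy(image, [1.0, 0.0], weights)  # embedding of task A
action_b = conditioned_policy(image, [0.0, 1.0], weights)  # embedding of task B
print(action_a, action_b)
```

The same camera image yields different actions depending on the task embedding, which is what lets a single set of weights serve many tasks.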

Experimental Results
In language models, it is well known that sentence embeddings generalize to compositions of concepts encountered in training data. For instance, if you train a translation model on sentences like “pick up a cup” and “push a bowl”, the model should also translate “push a cup” correctly.

We study the question of whether the compositional generalization capabilities found in language encoders can be transferred to real robots, i.e., being able to compose unseen object-object and task-object pairs.

We test this method by pre-selecting a set of 28 tasks, none of which were among the 100 training tasks. For example, one of these new test tasks is to pick up the grapes and place them into a ceramic bowl, but the training tasks involve doing other things with the grapes and placing other items into the ceramic bowl. The grapes and the ceramic bowl never appeared in the same scene during training.
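
The held-out criterion in the grapes example can be written down as a small check: every object in the test task appeared somewhere in training, but the objects never appeared together in one scene. The scene contents and the function name below are hypothetical and only illustrate the idea.

```python
from itertools import combinations


def is_novel_composition(test_objects, training_scenes):
    """True if all objects were seen in training but never in the same scene."""
    seen_objects = set().union(*training_scenes)
    seen_pairs = {
        frozenset(pair)
        for scene in training_scenes
        for pair in combinations(sorted(scene), 2)
    }
    all_known = all(obj in seen_objects for obj in test_objects)
    pairs_unseen = all(
        frozenset(pair) not in seen_pairs
        for pair in combinations(sorted(test_objects), 2)
    )
    return all_known and pairs_unseen


training_scenes = [
    {"grapes", "red bowl"},
    {"apple", "ceramic bowl"},
    {"grapes", "tray"},
]
novel = is_novel_composition({"grapes", "ceramic bowl"}, training_scenes)
seen = is_novel_composition({"grapes", "red bowl"}, training_scenes)
print(novel, seen)
```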

In our experiments, we see that the robot can complete many tasks that were not included in the training set. Below are a few examples of the robot’s learned policy.

The robot completes three instructions of tasks that were not in its training data, shown at 2x speed.

Quantitatively, we see that the robot can succeed to some degree on a total of 24 out of the 28 held-out tasks, indicating a promising capacity for generalization. Further, we see a notably small gap between the performance on the training tasks and performance on the test tasks. These results indicate that simply improving multi-task visuomotor control could considerably improve performance.

The BC-Z performance on held-out tasks, i.e., tasks that the robot was not trained to perform. The system correctly interprets the language command and translates that into action to complete many of the tasks in our evaluation.

The results of this research show that simple imitation learning approaches can be scaled in a way that enables zero-shot generalization to new tasks. That is, it shows one of the first indications of robots being able to successfully carry out behaviors that were not in the training data. Interestingly, language embeddings pre-trained on ungrounded language corpora make for excellent task conditioners. We demonstrated that natural language models can not only provide a flexible input interface to robots, but that pretrained language representations actually confer new generalization capabilities to the downstream policy, such as composing unseen object pairs together.

In the course of building this system, we confirmed that periodic human interventions are a simple but important technique for achieving good performance. While there is a substantial amount of work to be done in the future, we believe that the zero-shot generalization capabilities of BC-Z are an important advancement towards increasing the generality of robotic learning systems and allowing people to command robots. We have released the teleoperated demonstrations used to train the policy in this paper, which we hope will provide researchers with a valuable resource for future multi-task robotic learning research.

We would like to thank the co-authors of this research: Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, and Sergey Levine. This project was a collaboration between Google Research and the Everyday Robot Project. We would like to give special thanks to Noah Brown, Omar Cortes, Armando Fuentes, Kyle Jeffrey, Linda Luu, Sphurti Kirit More, Jornell Quiambao, Jarek Rettinghouse, Diego Reyes, Rosario Jauregui Ruano, and Clayton Tan for overseeing robot operations and collecting human videos of the tasks, as well as Jeffrey Bingham, Jonathan Weisz, and Kanishka Rao for valuable discussions. We would also like to thank Tom Small for creating animations in this post and Paul Mooney for helping with dataset open-sourcing.


Rethinking Remote Console for Edge Computing

For IT administrators, remote access to all of their machines reduces the time and effort spent on urgent fixes and provides peace-of-mind in the case of emergencies or major issues.

The nature of edge deployments means that they are always on, sometimes running 24/7 in a different time zone than the IT administrators. So when a system experiences a bug or major issue, IT is required to travel to the edge site and debug the system, sometimes in the middle of the night. Even when teams have the foresight to put remote access tools in place, those tools are often costly to develop, difficult to use, and present significant security vulnerabilities.

While versions of remote access tooling can be found in other products, many lack critical features that are becoming increasingly important for organizations deploying and scaling AI at the edge. One of them is security.

To solve this, Fleet Command includes on-demand remote console functionality available to all administrators. Recently, we added several new features to remote console to increase security and to allow users to concurrently access multiple edge nodes in an organization.

Remote console is a mechanism that allows the Fleet Command administrator to securely access NVIDIA-Certified systems at edge locations, with proper authorization and authentication mechanisms in place. It provides remote access so that administrators can streamline troubleshooting and emergency management from the comfort of their office.

Secure access at the edge

The most difficult part of building remote console functionality is ensuring the right security protocols are in place to prevent vulnerabilities and to restrict malicious behavior. Major security breaches have occurred through perpetually open VPN instances that were created so IT administrators could remotely access systems at other locations.

Today, IT teams are left with the difficult choice of spending resources to meticulously develop their own security protocols or to operate remote console sessions with little or no protection in place. With Fleet Command, IT teams no longer need to make that choice.

Remote console on Fleet Command is secure by default. It gives the Fleet Command administrator an on-demand browser-based shell to edge nodes using an ephemeral reverse websocket tunnel. Additionally, remote console uses the standard TLS port 443 and does not require customers to open custom ingress firewall rules in their network, reducing the complexity of getting started.

Fleet Command administrators use the NGC authenticated secure shell terminal to pass through the HTTPS tunnel to remotely access systems at edge locations.
Figure 1. Secure remote console workflow on Fleet Command.

Because SSH access is granted through a broker-and-remove model, with temporary elevation and a 1-hour default timeout rather than standing access, remote console through Fleet Command provides just-in-time (JIT) access. JIT access helps minimize the risk of standing privileges that can be exploited by malicious actors.

Without JIT security, users are effectively given unlimited, privileged access to the systems at edge locations and the data and resources on them. By limiting the window of access with remote console on Fleet Command, organizations significantly reduce the risk of exposure.
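
The difference between standing access and time-limited access can be sketched in a few lines. This is a conceptual illustration only, not Fleet Command's implementation: the `AccessGrant` class, the `broker_access` function, and the admin and node names are hypothetical, and only the 1-hour default comes from the description above.

```python
from dataclasses import dataclass

DEFAULT_TIMEOUT_S = 60 * 60  # the 1-hour default timeout


@dataclass(frozen=True)
class AccessGrant:
    admin: str
    node: str
    granted_at: float  # seconds since some reference time
    timeout_s: int = DEFAULT_TIMEOUT_S

    def is_active(self, now: float) -> bool:
        """Access is valid only inside the window [granted_at, granted_at + timeout)."""
        return self.granted_at <= now < self.granted_at + self.timeout_s


def broker_access(admin: str, node: str, now: float,
                  timeout_s: int = DEFAULT_TIMEOUT_S) -> AccessGrant:
    """Issue a temporary grant on demand instead of a standing privilege."""
    return AccessGrant(admin, node, granted_at=now, timeout_s=timeout_s)


grant = broker_access("alice", "edge-node-7", now=1_000.0)
active_after_30_min = grant.is_active(now=1_000.0 + 30 * 60)
active_after_61_min = grant.is_active(now=1_000.0 + 61 * 60)
print(active_after_30_min, active_after_61_min)
```

Once the window closes, the grant is useless to anyone who obtains it, which is the property that limits exposure.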

Simultaneous fleet wide access

Another unique aspect of remote console on Fleet Command is usability. Because Fleet Command is a turnkey solution, all tools and features are designed to be as straightforward and easy to use as possible. One of those features, multiple remote console, allows for concurrent access to multiple edge nodes in an organization. With access to multiple locations at the same time, administrators can troubleshoot simultaneously across their entire edge fleet. Multiple remote console allows access to one or many locations, and even lets administrators collaborate with one another.

To ensure the highest security across nodes, Fleet Command infrastructure isolates each of the open nodes on multiple remote console and ensures that any issues on one of the systems do not affect other sessions.

Using remote console on Fleet Command

Using remote console on Fleet Command takes only a few clicks and is available through the Fleet Command user interface.

With no coding required, the Fleet Command UI makes it easy for administrators of any skill level to access remote console.
Figure 2: Access remote console on Fleet Command in just a few clicks.

New features are constantly added to Fleet Command to give organizations access to immediate innovation. The philosophy behind building a turnkey solution means that every new feature added is one feature that does not need to be built or designed by your developers. This accelerates the time it takes for you to go from zero to AI. Purpose-built for AI, Fleet Command includes dozens of these features and functionality that make new opportunities and use cases possible for virtually every industry.

Get started on Fleet Command

Getting started on Fleet Command is easy and trials are available on NVIDIA LaunchPad for any organization to evaluate the value of quickly deploying AI at scale. The NVIDIA LaunchPad program enables quick testing and prototyping on the same complete stack you can purchase and deploy.

With Fleet Command on LaunchPad, organizations get access to:

  • a turnkey cloud service to easily deploy and monitor real applications on real servers
  • a catalog of models and applications to explore the benefits of AI
  • an accelerated edge computing infrastructure from Equinix to seamlessly provision and run applications

View Fleet Command features in action.

Sign up for Edge AI News to stay up to date with the latest trends, customer use cases, and technical walkthroughs.


How Audio Analytic Is Teaching Machines to Listen

From active noise cancellation to digital assistants that are always listening for your commands, audio is perhaps one of the most important but often overlooked aspects of modern technology in our daily lives. Audio Analytic has been using machine learning to enable a vast array of devices to make sense of the world of sound.

The post How Audio Analytic Is Teaching Machines to Listen appeared first on The Official NVIDIA Blog.


Using a custom python Keras DataGenerator in R?

Hey guys,

I’m kinda new to the ML community and I’ve been trying to make a custom data generator for both Python and R. I’ve managed to make the Python one based on this fantastic resource, but I’m struggling with calling it from an R script… I’ve tried the reticulate library but I can’t seem to make it all work well together. Does anyone know how to do it, or can anyone point me toward a good resource or code example?

Thanks and have a nice day

submitted by /u/Limiv0rous


Help with digit recognition

I want to use TensorFlow to make a program that can detect digits from an image. These are printed digits, not handwritten digits. I’ve attached an example photo of what I’m referring to.

What would be the best way to start on something like this? I saw there was an official tutorial for handwritten digits but I didn’t know if it applied to printed digits. What kind of algorithm should I use and what would be the best way to train a classifier?

submitted by /u/_ROFLWaffles


Combined accuracy of multioutput Model keras

Hi, I have a model of the following structure. It has 6 outputs. Given an image, the model predicts classes of 6 different components from the image.

Multi output keras model

The metrics I used are:

Metrics used

As you can see, it outputs an overall combined loss and separate losses for the different outputs, but there is no combined accuracy score. What I want is a combined accuracy score (one that considers a sample correct only if all of its output labels are correct). How can I calculate the overall combined accuracy for my multi-output model?

submitted by /u/Aditya_Larma


Develop the Future of Virtual Worlds with Omniverse Code App

Dive into the Omniverse Code app—an integrated development environment for users to easily build their own Omniverse extensions, apps, or microservices.

It’s now even easier for developers to build advanced tools for 3D design and simulation with Omniverse Code,  a new NVIDIA Omniverse app that serves as an integrated development environment (IDE) for developers and power users.

Using Omniverse Code, now in beta, developers can quickly become familiar with the platform while building their Omniverse extensions, apps, or microservices. Omniverse Code includes the Omniverse Kit SDK runtime and provides the foundational tools, templates, and documentation. In a simple-to-navigate interface, developers can easily experience the powerful capabilities of Omniverse Kit SDK when working on their own Omniverse-based projects.

Get started with Extension Manager

When using Omniverse Code, there’s no need to build from scratch. Developers have access to hundreds of Omniverse Extensions to edit, modify, or integrate into their own extensions or applications.

The platform is extremely modular, easily extensible, and flexible. Users can tease apart extensions, use them as templates, or build feature sets on top of the existing extensions.

Extension Manager is one of the most valuable resources, housing over 200 NVIDIA-developed extensions, all part of the Omniverse Kit SDK.

Screenshot of the Omniverse Code App Extension Manager workflow.
Figure 1. Premade templates within the Code app help you speed up development.

Learn more about using Extension Manager in Omniverse Code.

Experience interactive documentation

Developers can leverage the fully interactive Omni.ui documentation. The new feature is integrated directly into the user interface of Omniverse Code, with fully functioning buttons, sliders, and other features within the documentation.

It also exposes documentation code directly so users can copy and paste it as a whole, or modify it as needed. With Omniverse Code, interactive integration is extended across other areas of the platform, so developers can get started faster than ever.

Screenshot of the Omniverse Code App Interactive Documentation premade templates.
Figure 2. With interactive documentation you can take premade templates from the Code app rather than building from scratch.

One of the new frameworks in this release of Omniverse Kit is Omni.ui.scene, a new manipulator and scene overlay system that enables users to construct interactive manipulators and control objects within a 3D environment. Developers can get started with a provided collection of standard manipulators or build their own by writing very little Python code.

Discover the new 3D viewport

With the release of Omniverse Kit 103 and Omniverse Code, a new, fully customizable viewport menu serves as a one-click portal into various tools available to developers.

The viewport manipulator is available and programmable in Python, so users can inspect, tweak, modify, or rebuild their own. Developers can also configure multiple viewports individually with unique cameras and renderers, unlocking the ability to preconfigure different vantages instantaneously.

Get more information about these features in this short introductory video. 

Get the latest Omniverse news

Join the Omniverse Developer Day at GTC to learn more about the new Code app, and interact directly with the development team.

Watch the GTC keynote, presented by NVIDIA CEO Jensen Huang, on March 22 at 8 am PT, to see the latest technologies driving the future of AI and graphics.

Learn more about Omniverse Code in the upcoming Twitch stream on Wednesday, Feb. 2 at 11 am PT / 8 pm CET. 



Applying Differential Privacy to Large Scale Image Classification

Machine learning (ML) models are becoming increasingly valuable for improved performance across a variety of consumer products, from recommendations to automatic image classification. However, despite aggregating large amounts of data, in theory it is possible for models to encode characteristics of individual entries from the training set. For example, experiments in controlled settings have shown that language models trained using email datasets may sometimes encode sensitive information included in the training data and may have the potential to reveal the presence of a particular user’s data in the training set. As such, it is important to prevent the encoding of such characteristics from individual training entries. To these ends, researchers are increasingly employing federated learning approaches.

Differential privacy (DP) provides a rigorous mathematical framework that allows researchers to quantify and understand the privacy guarantees of a system or an algorithm. Within the DP framework, privacy guarantees of a system are usually characterized by a positive parameter ε, called the privacy loss bound, with smaller ε corresponding to better privacy. One usually trains a model with DP guarantees using DP-SGD, a specialized training algorithm that provides DP guarantees for the trained model.
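
For readers unfamiliar with DP-SGD, the core of one update in its standard formulation can be sketched as follows: clip each example's gradient to a maximum norm, sum the clipped gradients, add Gaussian noise scaled to that norm, and average. This is a minimal pure-Python illustration of the general technique, not the JAX implementation released with the paper. The function names and the values of `max_norm` and `noise_multiplier` are assumptions, and the sketch does not perform the privacy accounting needed to compute ε.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Return the noisy average gradient used for one DP-SGD update."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm  # noise is calibrated to the clip norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # two per-example gradients
no_noise = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
print(no_noise, noisy)
```

The first gradient has norm 5 and is scaled down to norm 1, while the second is already within the bound and passes through unchanged. Because independent noise is added to every coordinate, the norm of the noise vector grows roughly with the square root of the number of parameters.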

However, training with DP-SGD typically has two major drawbacks. First, most existing implementations of DP-SGD are inefficient and slow, which makes them hard to use on large datasets. Second, DP-SGD training often significantly impacts utility (such as model accuracy) to the point that models trained with DP-SGD may become unusable in practice. As a result, most DP research papers evaluate DP algorithms on very small datasets (MNIST, CIFAR-10, or UCI) and don’t even try to perform evaluation on larger datasets, such as ImageNet.

In “Toward Training at ImageNet Scale with Differential Privacy”, we share initial results from our ongoing effort to train a large image classification model on ImageNet using DP while maintaining high accuracy and minimizing computational cost. We show that the combination of various training techniques, such as careful choice of the model and hyperparameters, large batch training, and transfer learning from other datasets, can significantly boost accuracy of an ImageNet model trained with DP. To substantiate these discoveries and encourage follow-up research, we are also releasing the associated source code.

Testing Differential Privacy on ImageNet
We choose ImageNet classification as a demonstration of the practicality and efficacy of DP because: (1) it is an ambitious task for DP, for which no prior work shows sufficient progress; and (2) it is a public dataset on which other researchers can operate, so it represents an opportunity to collectively improve the utility of real-life DP training. Classification on ImageNet is challenging for DP because it requires large networks with many parameters. This translates into a significant amount of noise added into the computation, because the noise added scales with the size of the model.
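One way to see why the noise grows with model size is the following back-of-the-envelope calculation. It assumes the usual Gaussian-mechanism form of DP-SGD, in which each per-example gradient is clipped to norm at most C and isotropic Gaussian noise with standard deviation σC is added to the sum; the symbols σ, C, and d (the number of model parameters) are notation assumed here, not taken from the post.

```latex
% Noise added to the summed, clipped gradients in one DP-SGD step
% (sigma, C, d are assumed notation for this sketch).
\[
  z \sim \mathcal{N}\!\bigl(0,\; \sigma^{2} C^{2} I_{d}\bigr)
  \quad\Longrightarrow\quad
  \mathbb{E}\,\lVert z \rVert_2^{2} = \sigma^{2} C^{2} d ,
  \qquad
  \sqrt{\mathbb{E}\,\lVert z \rVert_2^{2}} = \sigma C \sqrt{d} .
\]
```

Each clipped per-example gradient has norm at most C, while the typical noise norm grows like √d, so for a fixed batch size the signal from the data becomes harder to distinguish from the noise as the parameter count increases.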

Scaling Differential Privacy with JAX
Exploring multiple architectures and training configurations to research what works for DP can be debilitatingly slow. To streamline our efforts, we used JAX, a high-performance computational library based on XLA that can do efficient auto-vectorization and just-in-time compilation of the mathematical computations. Using these JAX features was previously recommended as a good way to speed up DP-SGD in the context of smaller datasets such as CIFAR-10.

We created our own implementation of DP-SGD on JAX and benchmarked it against the large ImageNet dataset (the code is included in our release). The implementation in JAX was relatively simple and resulted in noticeable performance gains simply because of using the XLA compiler. Compared to other implementations of DP-SGD, such as that in TensorFlow Privacy, the JAX implementation is consistently several times faster. It is typically even faster than the custom-built and optimized PyTorch Opacus.

Each step of our DP-SGD implementation takes approximately two forward-backward passes through the network. While this is slower than non-private training, which requires only a single forward-backward pass, it is still the most efficient known approach to train with the per-example gradients necessary for DP-SGD. The graph below shows training runtimes for two models on ImageNet with DP-SGD vs. non-private SGD, each on JAX. Overall, we find DP-SGD on JAX sufficiently fast to run large experiments just by slightly reducing the number of training runs used to find optimal hyperparameters compared to non-private training. This is significantly better than alternatives, such as TensorFlow Privacy, which we found to be ~5x–10x slower on our CIFAR-10 and MNIST benchmarks.
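To make the role of per-example gradients concrete, here is a minimal sketch of one DP-SGD step in JAX. It is not the released implementation: it uses the common `jax.vmap(jax.grad(...))` pattern rather than the two-pass scheme described above, and the toy linear model, the function names (`loss_fn`, `clip_by_global_norm`, `dp_sgd_step`), and the hyperparameter defaults are all assumptions made for illustration.

```python
import jax
import jax.numpy as jnp


def loss_fn(params, x, y):
    # Toy linear model with squared error; a stand-in for a real network.
    pred = jnp.dot(x, params["w"]) + params["b"]
    return (pred - y) ** 2


def clip_by_global_norm(grads, clip_norm):
    # Rescale ONE example's gradient so its global L2 norm is at most clip_norm.
    leaves = jax.tree_util.tree_leaves(grads)
    norm = jnp.sqrt(sum(jnp.sum(g ** 2) for g in leaves))
    scale = jnp.minimum(1.0, clip_norm / (norm + 1e-12))
    return jax.tree_util.tree_map(lambda g: g * scale, grads)


@jax.jit
def dp_sgd_step(params, xs, ys, key, lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    batch_size = xs.shape[0]
    # Per-example gradients: vectorize the single-example gradient over the batch.
    per_example = jax.vmap(jax.grad(loss_fn), in_axes=(None, 0, 0))(params, xs, ys)
    # Clip each example's gradient independently, then sum over the batch.
    clipped = jax.vmap(clip_by_global_norm, in_axes=(0, None))(per_example, clip_norm)
    summed = jax.tree_util.tree_map(lambda g: jnp.sum(g, axis=0), clipped)
    # Add Gaussian noise with std noise_multiplier * clip_norm to every leaf.
    leaves, treedef = jax.tree_util.tree_flatten(summed)
    keys = jax.random.split(key, len(leaves))
    noisy_leaves = [
        g + noise_multiplier * clip_norm * jax.random.normal(keys[i], g.shape, g.dtype)
        for i, g in enumerate(leaves)
    ]
    noisy = jax.tree_util.tree_unflatten(treedef, noisy_leaves)
    # Plain SGD update using the averaged noisy gradient.
    return jax.tree_util.tree_map(lambda p, g: p - lr * g / batch_size, params, noisy)


# Usage on random toy data.
params = {"w": jnp.zeros(3), "b": jnp.array(0.0)}
xs = jax.random.normal(jax.random.PRNGKey(1), (16, 3))
ys = jnp.ones(16)
params = dp_sgd_step(params, xs, ys, jax.random.PRNGKey(0))
```

Clipping bounds how much any single example can move the parameters, and the noise is calibrated to that bound; computing the actual ε for a given noise multiplier requires a separate privacy accountant, which this sketch does not include.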

Time in seconds per training epoch on ImageNet using a ResNet-18 or ResNet-50 architecture with 8 V100 GPUs.

Combining Techniques for Improved Accuracy
It is possible that future training algorithms may improve DP’s privacy-utility tradeoff. However, with current algorithms, such as DP-SGD, our experience points to an engineering “bag-of-tricks” approach to make DP more practical on challenging tasks like ImageNet.

Because we can train models faster with JAX, we can iterate quickly and explore multiple configurations to find what works well for DP. We report the following combination of techniques as useful to achieve non-trivial accuracy and privacy on ImageNet:

  • Full-batch training

    Theoretically, it is known that larger minibatch sizes improve the utility of DP-SGD, with full-batch training (i.e., where a full dataset is one batch) giving the best utility [1, 2], and empirical results are emerging to support this theory. Indeed, our experiments demonstrate that increasing the batch size along with the number of training epochs leads to a decrease in ε while still maintaining accuracy. However, training with extremely large batches is non-trivial as the batch cannot fit into GPU/TPU memory. So, we employed virtual large-batch training by accumulating gradients for multiple steps before updating the weights instead of applying gradient updates on each training step.

    Batch size              1024        4 × 1024    16 × 1024   64 × 1024
    Number of epochs        10          40          160         640
    Accuracy                56%         57.5%       57.9%       57.2%
    Privacy loss bound ε    9.8 × 10⁸   6.1 × 10⁷   3.5 × 10⁶   6.7 × 10⁴

  • Transfer learning from public data

    Pre-training on public data followed by DP fine-tuning on private data has previously been shown to improve accuracy on other benchmarks [3, 4]. A question that remains is what public data to use for a given task to optimize transfer learning. In this work we simulate a private/public data split by using ImageNet as “private” data and using Places365, another image classification dataset, as a proxy for “public” data. We pre-trained our models on Places365 before fine-tuning them with DP-SGD on ImageNet. Places365 only has images of landscapes and buildings, not of animals as in ImageNet, so it is quite different, making it a good candidate to demonstrate the ability of the model to transfer to a different but related domain.

    We found that transfer learning from Places365 gave us 47.5% accuracy on ImageNet with a reasonable level of privacy (ε = 10). This is low compared to the 70% accuracy of a similar non-private model, but compared to naïve DP training on ImageNet, which yields either very low accuracy (2–5%) or no privacy (ε = 10⁹), this is quite good.
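The virtual large-batch training described in the first bullet above can be sketched as follows: clipped per-example gradients are accumulated over several microbatches that each fit in memory, and noise is added once per virtual batch before the weights are updated. This is an illustrative sketch under stated assumptions, not the released code; the toy linear model, the function names (`clipped_grad_sum`, `virtual_batch_step`), and the `micro_size` parameter are hypothetical.

```python
import jax
import jax.numpy as jnp


def loss_fn(w, x, y):
    # Toy linear model with squared error; a stand-in for a real network.
    return (jnp.dot(x, w) - y) ** 2


def clipped_grad_sum(w, xs, ys, clip_norm):
    # Sum of per-example gradients for one microbatch, each clipped to clip_norm.
    grads = jax.vmap(jax.grad(loss_fn), in_axes=(None, 0, 0))(w, xs, ys)
    norms = jnp.linalg.norm(grads, axis=1, keepdims=True)
    scale = jnp.minimum(1.0, clip_norm / (norms + 1e-12))
    return jnp.sum(grads * scale, axis=0)


def virtual_batch_step(w, xs, ys, key, micro_size,
                       lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    n = xs.shape[0]
    acc = jnp.zeros_like(w)
    # Accumulate clipped gradient sums; the weights stay fixed during accumulation.
    for start in range(0, n, micro_size):
        stop = start + micro_size
        acc = acc + clipped_grad_sum(w, xs[start:stop], ys[start:stop], clip_norm)
    # Noise is added once per virtual batch, not once per microbatch.
    noise = noise_multiplier * clip_norm * jax.random.normal(key, w.shape, w.dtype)
    return w - lr * (acc + noise) / n


# Usage: the update is the same whether the batch is processed in one piece or in chunks.
key = jax.random.PRNGKey(0)
xs = jax.random.normal(jax.random.PRNGKey(1), (8, 3))
ys = jnp.ones(8)
w = jnp.ones(3)
w_chunked = virtual_batch_step(w, xs, ys, key, micro_size=2)
w_single = virtual_batch_step(w, xs, ys, key, micro_size=8)
```

Because the weights are held fixed while gradients accumulate, the result matches a single step on the full virtual batch up to floating-point rounding, so peak memory is set by `micro_size` rather than by the virtual batch size.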

Privacy-accuracy tradeoff for ResNet-18 on ImageNet using large-batch training with transfer learning from Places365.

Next Steps
We hope these early results and source code provide an impetus for other researchers to work on improving DP for ambitious tasks such as ImageNet as a proxy for challenging production-scale tasks. With the much faster DP-SGD on JAX, we urge DP and ML researchers to explore diverse training regimes, model architectures, and algorithms to make DP more practical. To continue advancing the state of the field, we recommend researchers start with a baseline that incorporates full-batch training plus transfer learning.

This work was carried out with the support of the Google Visiting Researcher Program while Prof. Geambasu, an Associate Professor with Columbia University, was on sabbatical with Google Research. This work received substantial contributions from Steve Chien, Shuang Song, Andreas Terzis and Abhradeep Guha Thakurta.


Running a neural network model on ARM64 that was trained on an x64 system

I have a neural network that was trained on an x64 system with a 3080. I am trying to run it on a Jetson Nano, which is based on the ARM64 architecture.

The neural network runs on the original machine that it was trained on, but trying to get it running on the Jetson Nano gives bad marshal code, which I assume is a problem with the architecture.

submitted by /u/rock2171