Google Research: Looking Back at 2020, and Forward to 2021

When I joined Google over 20 years ago, we were just figuring out how to really start on the journey of making a high quality and comprehensive search service for information on the web, using lots of curiously wired computers. Fast forward to today, and while we’re taking on a much broader array of technical challenges, it’s still with the same overarching goal of organizing the world’s information and making it universally accessible and useful. In 2020, as the world has been reshaped by COVID-19, we saw the ways research-developed technologies could help billions of people better communicate, understand the world, and get things done. I’m proud of what we’ve accomplished, and excited about new possibilities on the horizon.

The goal of Google Research is to work on long-term, ambitious problems across a wide range of important topics — from predicting the spread of COVID-19, to designing algorithms, to learning to translate more and more languages automatically, to mitigating bias in ML models. In the spirit of our annual reviews for 2019, 2018, and more narrowly focused reviews of some work in 2017 and 2016, this post covers key Google Research highlights from this unusual year. This is a long post, but grouped into many different sections. Hopefully, there’s something interesting in here for everyone! For a more comprehensive look, please see our >750 research publications in 2020.

COVID-19 and Health
As the impact of COVID-19 took a tremendous toll on people’s lives, researchers and developers around the world rallied together to develop tools and technologies to help public health officials and policymakers understand and respond to the pandemic. Apple and Google partnered in 2020 to develop the Exposure Notifications System (ENS), a Bluetooth-enabled privacy-preserving technology that allows people to be notified if they have been exposed to others who have tested positive for COVID-19. ENS supplements traditional contact tracing efforts and has been deployed by public health authorities in more than 50 countries, states and regions to help curb the spread of infection.

In the early days of the pandemic, public health officials signalled their need for more comprehensive data to combat the virus’ rapid spread. Our Community Mobility Reports, which provide anonymized insights into movement trends, are helping researchers not only understand the impact of policies like stay-at-home directives and social distancing, but also conduct economic forecasting.

Community Mobility Reports: Navigate and download a report for regions of interest.

Our own researchers have also explored using this anonymized data to forecast COVID-19 spread using graph neural networks instead of traditional time series-based models.

Although the research community initially knew little about this disease and its secondary effects, we’re learning more every day. Our COVID-19 Search Trends symptoms dataset allows researchers to explore temporal or symptomatic associations, such as anosmia (the loss of smell that is sometimes a symptom of the virus). To further support the broader research community, we launched the Google Health Studies app to provide the public ways to participate in research studies.

Our COVID-19 Search Trends are helping researchers study the link between the disease’s spread and symptom-related searches.

Teams across Google are contributing tools and resources to the broader scientific community, which is working to address the health and economic impacts of the virus.

A spatio-temporal graph for modelling COVID-19 Spread.

Accurate information is critical in dealing with public health threats. We collaborated with many product teams at Google in order to improve information quality about COVID-19 in Google News and Search through supporting fact checking efforts, as well as similar efforts in YouTube.

We helped multilingual communities get equal access to critical COVID-19 information by sponsoring localization of Nextstrain.org’s weekly Situation Reports and developing a COVID-19 open source parallel dataset in collaboration with Translators Without Borders.

Modelling a complex global event is particularly challenging and requires more comprehensive epidemiological datasets, along with the development of novel interpretable models and agent-based simulators to inform the public health response. Machine learning techniques have also helped in other ways: deploying natural language understanding to help researchers quickly navigate the mountains of COVID-19 scientific literature, applying anonymization technology to protect privacy while making useful datasets available, and exploring whether public health can conduct faster screening with fewer tests via Bayesian group testing.

These are only a sample of the many pieces of work that happened across Google to help users and public health authorities respond to COVID-19. For more, see using technology to help take on COVID-19.

Research in Machine Learning for Medical Diagnostics
We continue to make headway helping clinicians harness the power of ML to deliver better care for more patients. This year we have described notable advances in applying computer vision to aid doctors in the diagnosis and management of cancer, including helping to make sure that doctors don’t miss potentially cancerous polyps during colonoscopies, showing that an ML system can achieve higher accuracy in Gleason grading of prostate tissue than pathologists who have not had specialist training in prostate cancer, and enabling radiologists to achieve significant reductions in both false negative and false positive results when examining X-rays for signs of breast cancer.

To determine the aggressiveness of prostate cancers, pathologists examine a biopsy and assign it a Gleason grade. In published research, our system was able to grade with higher accuracy than a cohort of pathologists who have not had specialist training in prostate cancer. The first stage of the deep learning system assigns a Gleason grade to every region in a biopsy. In this biopsy, green indicates Gleason pattern 3, while yellow indicates Gleason pattern 4.

We’ve also been working on systems to help identify skin disease, help detect age-related macular degeneration (the leading cause of blindness in the U.S. and U.K., and the third-largest cause of blindness worldwide), and on potential novel non-invasive diagnostics (e.g., being able to detect signs of anemia from retinal images).

Our study examines how a deep learning model can quantify hemoglobin levels — a measure doctors use to detect anemia — from retinal images.

This year has also brought exciting demonstrations of how these same technologies can peer into the human genome. Google’s open-source tool, DeepVariant, identifies genomic variants in sequencing data using a convolutional neural network, and this year won the FDA Challenge for best accuracy in 3 out of 4 categories. Using this same tool, a study led by the Dana-Farber Cancer Institute improved diagnostic yield by 14% for genetic variants that lead to prostate cancer and melanoma in a cohort of 2,367 cancer patients.

Research doesn’t end at measurement of experimental accuracy. Ultimately, truly helping patients receive better care requires understanding how ML tools will affect people in the real world. This year we began work with Mayo Clinic to develop a machine learning system to assist in radiotherapy planning and to better understand how this technology could be deployed into clinical practice. With our partners in Thailand, we’ve used diabetic eye disease screening as a test case in how we can build systems with people at the center, and recognize the fundamental role of diversity, equity, and inclusion in building tools for a healthier world.

Weather, Environment and Climate Change
Machine learning can help us better understand the environment and make useful predictions to help people both in their everyday lives and in disaster situations. For weather and precipitation forecasting, computationally intensive physics-based models like NOAA’s HRRR have long reigned supreme. We have been able to show, though, that ML-based forecasting systems can predict current precipitation with much better spatial resolution (“Is it raining in my local park in Seattle?” and not just “Is it raining in Seattle?”) and can produce short-term forecasts of up to eight hours that are considerably more accurate than HRRR. They can also compute the forecast more quickly, and at higher temporal and spatial resolution.

A visualization of predictions made over the course of roughly one day. Left: The 1-hour HRRR prediction made at the top of each hour, the limit to how often HRRR provides predictions. Center: The ground truth, i.e., what we are trying to predict. Right: The predictions made by our model. Our predictions are every 2 minutes (displayed here every 15 minutes) at roughly 10 times the spatial resolution made by HRRR. Notice that we capture the general motion and general shape of the storm.

We’ve also developed an improved technique called HydroNets, which uses a network of neural networks to model the world’s actual river systems and to capture more accurately how upstream water levels affect downstream inundation, resulting in more accurate water-level predictions and flood forecasting. Using these techniques, we’ve expanded our coverage of flood alerts by 20x in India and Bangladesh, helping to better protect more than 200 million people in 250,000 square kilometers.

An illustration of the HydroNets architecture.

Better analysis of satellite imagery data can also give Google users a better understanding of the impact and extent of wildfires (which caused devastating effects in California and Australia this year). We showed that automated analysis of satellite imagery can help with rapid assessment of damage after natural disasters even with limited prior satellite imagery. It can also aid urban tree-planting efforts by helping cities assess their current tree canopy coverage and where they should focus on planting new trees. We’ve also shown how machine learning techniques that leverage temporal context can help improve ecological and wildlife monitoring.

Based on this work, we’re excited to partner with NOAA on using AI and ML to amplify NOAA’s environmental monitoring, weather forecasting and climate research using Google Cloud’s infrastructure.

Accessibility
Machine learning continues to provide amazing opportunities for improving accessibility, because it can learn to transfer one kind of sensory input into others. As one example, we released Lookout, an Android application that can help visually impaired users by identifying packaged foods, both in a grocery store and also in their kitchen cupboard at home. The machine learning system behind Lookout demonstrates that a powerful-but-compact machine learning model can accomplish this in real-time on a phone for nearly 2 million products.

Similarly, people who communicate with sign language find it difficult to use video conferencing systems because even if they are signing, they are not detected as actively speaking by audio-based speaker detection systems. Developing Real-Time, Automatic Sign Language Detection for Video Conferencing presents a real-time sign language detection model and demonstrates how it can be used to provide video conferencing systems with a mechanism to identify the person signing as the active speaker.

We also enabled useful Android accessibility capabilities such as Voice Access and Sound Notifications for important household sounds.

Live Caption was expanded on the Pixel phone to caption phone calls and video calls. This came out of the Live Relay research project, which enables deaf and hard of hearing people to make calls without assistance.

Applications of ML to Other Fields
Machine learning continues to prove vital in helping us make progress across many fields of science. In 2020, in collaboration with the FlyEM team at HHMI Janelia Research Campus, we released the Drosophila hemibrain connectome, the largest synapse-resolution map of brain connectivity, reconstructed using large-scale machine learning models applied to high-resolution electron microscope imaging of brain tissue. This connectome information will aid neuroscientists in a wide variety of inquiries, helping us all better understand how brains function. Be sure to check out the very fly interactive 3-D UI!

The application of ML to problems in systems biology is also on the rise. Our Google Accelerated Science team, in collaboration with our colleagues at Calico, has been applying machine learning to yeast, to get a better understanding of how genes work together as a whole system. We’ve also been exploring how to use model-based reinforcement learning in order to design biological sequences like DNA or proteins that have desirable properties for medical or industrial uses. Model-based RL is used to improve sample efficiency. At each round of experimentation, the policy is trained offline using a simulator fit on functional measurements from prior rounds. On various tasks like designing DNA transcription factor binding sites, designing antimicrobial proteins, and optimizing the energy of Ising models based on protein structures, we find that model-based RL is an attractive alternative to existing methods.

In partnership with X-Chem Pharmaceuticals and ZebiAI, we have also been developing ML techniques to do “virtual screening” of promising molecular compounds computationally. Previous work in this area has tended to focus on relatively small sets of related compounds, but in this work, we are trying to use DNA-encoded small molecule libraries in order to be able to generalize to find “hits” across a wide swath of chemical space, reducing the need for slower, physical-based lab work in order to progress from idea to working pharmaceutical.

We’ve also seen success applying machine learning to core computer science and computer systems problems, a growing trend that is spawning entire new conferences like MLSys. In Learning-based Memory Allocation for C++ Server Workloads, a neural network-based language model predicts context-sensitive per-allocation site object lifetime information, and then uses this to organize the heap so as to reduce fragmentation. It is able to reduce fragmentation by up to 78% while only using huge pages (which are better for TLB behavior). End-to-End, Transferable Deep RL for Graph Optimization described an end-to-end transferable deep reinforcement learning method for computational graph optimization that shows 33%-60% speedup on three graph optimization tasks compared to TensorFlow default optimization, with 15x faster convergence over prior computation graph optimization methods.

Overview of GO: An end-to-end graph policy network that combines graph embedding and sequential attention.

As described in Chip Design with Deep Reinforcement Learning, we have also been applying reinforcement learning to the problem of place-and-route in computer chip design. This is normally a very time-consuming, labor-intensive process, and is a major reason that going from an idea for a chip to actually having a fully designed and fabricated chip takes so long. Unlike prior methods, our approach has the ability to learn from past experience and improve over time. In particular, as we train over a greater number of chip blocks, our method becomes better at rapidly generating optimized placements for previously unseen chip blocks. The system is able to generate placements that usually outperform those of human chip design experts, and we have been using this system (running on TPUs) to do placement and layout for major portions of future generations of TPUs. Menger is a recent infrastructure we’ve built for large-scale distributed reinforcement learning that is yielding promising performance for difficult RL tasks such as chip design.

Macro placements of Ariane, an open-source RISC-V processor, as training progresses. On the left, the policy is being trained from scratch, and on the right, a pre-trained policy is being fine-tuned for this chip. Each rectangle represents an individual macro placement. Notice how the cavity that is occupied by non-macro logic cells that is discovered by the from-scratch policy is already present from the outset in the pre-trained policy’s placement.

Responsible AI
The Google AI Principles guide our development of advanced technologies. We continue to invest in responsible AI research and tools, update our recommended technical practices in this area, and share regular updates — including a 2020 blog post and report — on our progress in implementation.

To help better understand the behavior of language models, we developed the Language Interpretability Tool (LIT), a toolkit that enables interactive exploration and analysis of their decisions. We developed techniques for measuring gendered correlations in pre-trained language models and scalable techniques for reducing gender bias in Google Translate. We used the kernel trick to propose a simple method to estimate the influence of a training data example on an individual prediction. To help non-specialists interpret machine learning results, we extended the TCAV technique introduced in 2018 to now provide a complete and sufficient set of concepts. With the original TCAV work, we were able to say that ‘fur’ and ‘long ears’ are important concepts for ‘rabbit’ prediction. With this work, we can also say that these two concepts are enough to fully explain the prediction; you don’t need any other concepts. Concept bottleneck models are a technique to make models more interpretable by training them so that one of the layers is aligned with pre-defined expert concepts (e.g., “bone spurs present”, or “wing color”, as shown below) before making a final prediction for a task, so that we can not only interpret these concepts but also turn them on or off on the fly.

Aligning predictions to pre-identified concepts can make models more interpretable, as described in Concept Bottleneck Models.
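
To make the idea of a concept bottleneck a little more concrete, here is a minimal sketch in Python. It is not the architecture from the paper: the feature names, the concepts (borrowed from the rabbit example above), and all of the weights are hypothetical, hand-set stand-ins for trained layers. What it tries to show is the structure, in which the final prediction sees only the concept values, so a person can override a concept and watch the prediction change.

```python
import math
from typing import Dict, Optional


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical, hand-set weights standing in for trained layers.
CONCEPT_WEIGHTS = {
    "fur": {"texture": 4.0, "ear_length": 0.0},
    "long_ears": {"texture": 0.0, "ear_length": 4.0},
}
CONCEPT_BIAS = -2.0
LABEL_WEIGHTS = {"fur": 3.0, "long_ears": 3.0}
LABEL_BIAS = -4.5


def predict_concepts(features: Dict[str, float]) -> Dict[str, float]:
    """First stage: map raw input features to human-readable concept scores."""
    return {
        concept: sigmoid(sum(w * features[name] for name, w in weights.items()) + CONCEPT_BIAS)
        for concept, weights in CONCEPT_WEIGHTS.items()
    }


def predict_rabbit(features: Dict[str, float],
                   interventions: Optional[Dict[str, float]] = None) -> float:
    """Second stage: the label is computed from the concept scores only.

    `interventions` lets a person overwrite selected concepts before the
    final prediction is made.
    """
    concepts = predict_concepts(features)
    if interventions:
        concepts.update(interventions)
    score = sum(LABEL_WEIGHTS[c] * concepts[c] for c in LABEL_WEIGHTS) + LABEL_BIAS
    return sigmoid(score)


example = {"texture": 1.0, "ear_length": 1.0}
baseline = predict_rabbit(example)
without_ears = predict_rabbit(example, interventions={"long_ears": 0.0})
```

In this toy setup, switching off the “long_ears” concept is enough to flip the prediction, which is the kind of on-the-fly intervention the bottleneck layer is meant to allow.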

In collaboration with many other institutions, we also looked into memorization effects of language models, showing that training data extraction attacks are realistic threats on state-of-the-art large language models. This finding, along with a result that embedding models can leak information, can have significant privacy implications (especially for models trained on private data). In Thieves of Sesame Street: Model Extraction on BERT-based APIs, we demonstrated that attackers with only API access to a language model could create models whose outputs had very high correlation with the original model, even with relatively few API queries to the original model. Subsequent work demonstrated that attackers can extract smaller models with arbitrary accuracy. On the AI Principle of safety, we demonstrated that thirteen published defenses to adversarial examples can be circumvented, even though their authors attempted to perform evaluations using adaptive attacks. Our work focuses on laying out the methodology and the approach necessary to perform an adaptive attack, and thus will allow the community to make further progress in building more robust models.

Examining the way in which machine learning systems themselves are examined is also an important area of exploration. In collaboration with the Partnership on AI, we defined a framework for how to audit the use of machine learning in software product settings, drawing on lessons from the aerospace, medical devices, and finance industries and their best practices. In joint work with the University of Toronto and MIT, we identified several ethical concerns that can arise when auditing the performance of facial recognition systems. In joint work with the University of Washington, we identified some important considerations related to diversity and inclusion when choosing subsets for evaluating algorithmic fairness. As an initial step in making responsible AI work for the next billion users and to help understand if notions of fairness were consistent in different parts of the world, we analyzed and created a framework for algorithmic fairness in India, accounting for datasets, fairness optimizations, infrastructures, and ecosystems.

The Model Cards work that was introduced in collaboration with the University of Toronto in 2019 has been growing in influence. Indeed, many well-known models like OpenAI’s GPT-2 and GPT-3, many of Google’s MediaPipe models and various Google Cloud APIs have all adopted Model Cards as a way of giving users of a machine learning model more information about the model’s development and the observed behavior of the model under different conditions. To make this easier for others to adopt for their own machine learning models, we also introduced the Model Card Toolkit for easier model transparency reporting. In order to increase transparency in ML development practices, we demonstrate the applicability of a range of best practices throughout the dataset development lifecycle, including data requirements specification and data acceptance testing.

In collaboration with the U.S. National Science Foundation (NSF), we announced and helped to fund a National AI Research Institute for Human-AI Interaction and Collaboration. We also released the MinDiff framework, a new regularization technique available in the TF Model Remediation library for effectively and efficiently mitigating unfair biases when training ML models, along with ML-fairness gym for building simple simulations that explore potential long-run impacts of deploying machine learning-based decision systems in social environments.

In addition to developing frameworks for fairness, we developed approaches for identifying and improving the health and quality of experiences with Recommender Systems, including using reinforcement learning to introduce safer trajectories. We also continue to work on improving the reliability of our machine learning systems, where we’ve seen that approaches such as generating adversarial examples can improve robustness and that robustness approaches can improve fairness.

Differential privacy is a way to formally quantify privacy protections and requires a rethinking of the most basic algorithms to operate in a way that they do not leak information about any particular individual. In particular, differential privacy can help in addressing memorization effects and information leakage of the kinds mentioned above. In 2020 there were a number of exciting developments, from more efficient ways of computing private empirical risk minimizers to private clustering methods with tight approximation guarantees and private sketching algorithms. We also open sourced the differential privacy libraries that lie at the core of our internal tools, taking extra care to protect against leakage caused by the floating point representation of real numbers. These are the exact same tools that we use to produce differentially private COVID-19 mobility reports that have been a valuable source of anonymous data for researchers and policymakers.
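
As a small illustration of what “formally quantifying” privacy can look like, here is a sketch of the textbook Laplace mechanism applied to a counting query. This is not the code from our open-source libraries; the function names and the example data are made up for illustration. In particular, this naive version does nothing to guard against the floating-point leakage mentioned above, which is one reason production libraries are more involved.

```python
import random
from typing import Callable, Iterable


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_count(values: Iterable[float],
                  predicate: Callable[[float], bool],
                  epsilon: float,
                  rng: random.Random) -> float:
    """Return a noisy count of the items that satisfy `predicate`.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon gives epsilon-differential privacy
    in the idealized real-number model.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data: trips per person; how many people made more than 4 trips?
rng = random.Random(0)
trips = [1, 5, 8, 12, 20, 3]
noisy_answer = private_count(trips, lambda t: t > 4, epsilon=1.0, rng=rng)
```

A smaller `epsilon` means more noise and stronger protection; a larger one means answers closer to the true count.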

To help developers assess the privacy properties of their classification models we released an ML privacy testing library in TensorFlow. We hope this library will be the starting point of a robust privacy testing suite that can be used by any machine learning developer around the world.

Membership inference attack on models for CIFAR10. The x-axis is the test accuracy of the model, and y-axis is vulnerability score (lower means more private). Vulnerability grows while test accuracy remains the same — better generalization could prevent privacy leakage.

In addition to pushing the state of the art in developing private algorithms, I am excited about the advances we made in weaving privacy into the fabric of our products. One of the best examples is Chrome’s Privacy Sandbox, which changes the underpinnings of the advertising ecosystem and helps systematically protect individuals’ privacy. As part of the project, we proposed and evaluated a number of different APIs, including federated learning of cohorts (FLoC) for interest-based targeting, and aggregate APIs for differentially private measurement.

Launched in 2017, federated learning is now a complete research field unto itself, with over 3000 publications on the topic appearing in 2020 alone. Our cross-institutional Advances and Open Problems in Federated Learning survey paper published in 2019 has been cited 367 times in the past year, and an updated version will soon be published in the Foundations & Trends in Machine Learning series. In July, we hosted a Workshop on Federated Learning and Analytics, and made all research talks and a TensorFlow Federated tutorial publicly available.

The lifecycle of an FL-trained model and the various actors in a federated learning system.

We continue to push the state of the art in federated learning, including the development of new federated optimization algorithms (such as adaptive learning algorithms, posterior averaging algorithms, and techniques for mimicking centralized algorithms in federated settings), substantial improvements in complementary cryptographic protocols, and more. We announced and deployed federated analytics, enabling data science over raw data that is stored locally on users’ devices. New uses of federated learning in Google products include contextual emoji suggestions in Gboard, and pioneering privacy-preserving medical research with Google Health Studies. Furthermore, in Privacy Amplification via Random Check-Ins we presented the first privacy accounting mechanism for Federated Learning.
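
For readers new to the area, the basic pattern behind federated learning can be sketched in a few lines. The example below is plain federated averaging on a toy one-parameter regression problem, not any of the newer algorithms mentioned above and not TensorFlow Federated code; the client datasets and hyperparameters are invented for illustration. Each simulated device trains on its own data, and only the resulting model weight is sent back to be averaged.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (x, y) pairs that stay on one device


def local_update(w: float, data: Dataset, lr: float = 0.05, epochs: int = 5) -> float:
    """Run a few steps of gradient descent for y ~ w * x on one client's data."""
    for _ in range(epochs):
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_averaging(clients: List[Dataset], rounds: int = 50) -> float:
    """Server loop: broadcast the model, collect local models, average them."""
    w = 0.0
    total_examples = sum(len(data) for data in clients)
    for _ in range(rounds):
        updates = [(local_update(w, data), len(data)) for data in clients]
        # Weight each client's model by how many examples it trained on.
        w = sum(w_k * n_k for w_k, n_k in updates) / total_examples
    return w


# Three hypothetical devices whose private data all follow y = 3x.
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(0.5, 1.5), (1.5, 4.5), (3.0, 9.0)],
    [(2.5, 7.5)],
]
learned_w = federated_averaging(clients)
```

The server in this sketch never touches an `(x, y)` pair directly, only the locally trained weights, which is the property that the cryptographic protocols and privacy accounting work above builds on.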

Security for our users is also an area of considerable interest for us. In 2020, we continued to improve protections for Gmail users by deploying a new ML-based document scanner that provides protection against malicious documents, which increased malicious Office document detection by 10% on a daily basis. Thanks to its ability to generalize, this tool has been very effective at blocking some adversarial malware campaigns that elude other detection mechanisms, and has increased our detection rate by 150% in some cases.

On the account protection side, we released a fully open-source security key firmware to help advance the state of the art in the two-factor authentication space, staying focused on security keys as the best way to protect accounts against phishing.

Natural Language Understanding
Better understanding of language is an area where we saw considerable progress this year. Much of the work in this space from Google and elsewhere now relies on Transformers, a particular style of neural network model originally developed for language problems (but with a growing body of evidence that they are also useful for images, videos, speech, protein folding, and a wide variety of other domains).

One area of excitement is in dialog systems that can chat with a user about something of interest, often encompassing multiple turns of interaction. While successful work in this area to date has involved creating systems that are specialized around particular topics (e.g., Duplex), these systems cannot carry on general conversations. In pursuit of the general research goal of creating systems capable of much more open-ended dialog, in 2020 we described Meena, a learned conversational agent that aspirationally can chat about anything. Meena achieves high scores on a dialog system metric called SSA, which measures both sensibility and specificity of responses. We’ve seen that as we scale up the model size of Meena, it is able to achieve lower perplexity and, as shown in the paper, lower perplexity correlates extremely closely with improved SSA.

A chat between Meena (left) and a person (right).
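
Since perplexity comes up in the discussion of Meena above, here is a minimal sketch of how it is usually computed: it is the exponential of the average negative log-probability a model assigned to the tokens that actually occurred. The probabilities below are invented for illustration and are not Meena outputs.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """exp(mean negative log-probability) over the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Hypothetical per-token probabilities from two models on the same reply.
unsure_model = [0.25, 0.25, 0.25, 0.25]   # as uncertain as a 4-way coin flip
confident_model = [0.9, 0.8, 0.95, 0.85]  # puts most mass on the right tokens

unsure_ppl = perplexity(unsure_model)
confident_ppl = perplexity(confident_model)
```

A model that is as uncertain as a uniform choice among four tokens has a perplexity of four, and a model that concentrates probability on the tokens that actually follow scores lower.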

One well-known issue with generative language models and dialog systems is that when discussing factual data, the model’s capacity may not be large enough to remember every specific detail about a topic, so it generates language that is plausible but incorrect. (This is not unique to machines; people can commit these errors too.) To address this in dialog systems, we are exploring ways to augment a conversational agent by giving it access to external information sources (e.g., a large corpus of documents or a search engine API), and developing learning techniques to use this as an additional resource in order to generate language that is consistent with the retrieved text. Work in this area includes integrating retrieval into language representation models. A key underlying technology for this to work well is something like ScaNN, an efficient vector similarity search, which matches the desired information to information in the corpus of text. Once appropriate content is found, it can be better understood with approaches like using neural networks to find answers in tables and extracting structured data from templatic documents. Our work on PEGASUS, a state-of-the-art model for abstractive text summarization, can also help to create automatic summaries from any piece of text, a general technique useful in conversations, retrieval systems, and many other places.
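
The retrieval step described above boils down to finding the stored vectors that best match a query vector. The sketch below does this by exhaustive inner-product scoring, which is the exact baseline that approximate libraries such as ScaNN are designed to speed up; it does not use the ScaNN API, and the document names and three-dimensional “embeddings” are made up for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(query: Sequence[float],
          corpus: Dict[str, Sequence[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Score every document against the query and keep the k best matches."""
    scored = [(doc_id, dot(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical document embeddings; real ones would come from a trained encoder.
corpus = {
    "flu_symptoms": [0.9, 0.1, 0.0],
    "train_times": [0.0, 0.2, 0.9],
    "cold_remedies": [0.7, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]  # stands in for an embedded user question
matches = top_k(query, corpus, k=2)
```

Exhaustive scoring like this costs one dot product per document for every query, which is why approximate search matters once the corpus grows to millions or billions of entries.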

Efficiency of NLP models has also been a significant focus for our work in 2020. Techniques like transfer learning and multi-task learning can dramatically help with making general NLP models usable for new tasks with modest amounts of computation. Work in this vein includes transfer learning explorations in T5, sparse activation of models (as in our GShard work mentioned below), and more efficient model pre-training with ELECTRA. Several threads of work also look to improve on the basic Transformer architecture, including Reformer, which uses locality-sensitive hashing and reversible computation to more efficiently support much larger attention windows; Performers, which use an approach for attention that scales linearly rather than quadratically (and whose paper discusses its use in the context of protein modeling); and ETC and BigBird, which utilize global and sparse random connections to enable linear scaling for larger and structured sequences. We also explored techniques for creating very lightweight NLP models that are 100x smaller than a larger BERT model, but perform nearly as well for some tasks, making them very suitable for on-device NLP. In Encode, Tag and Realize, we also explored new approaches for generative text models that use edit operations rather than fully general text generation, which can reduce the computation required for generation, give more control over the generated text, and require less training data.
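
To give a feel for how attention can scale linearly rather than quadratically, here is a small sketch of kernelized attention. It is not the Performer algorithm: Performers approximate softmax attention with random features, whereas this example assumes a simple positive feature map (elu(x) + 1) purely for illustration. The point it makes is the shared one: once similarity is a dot product of feature vectors, the keys and values can be summarized once and reused for every query, so the n x n score matrix is never built.

```python
import math
from typing import List

Matrix = List[List[float]]


def phi(x: List[float]) -> List[float]:
    """Positive feature map elu(x) + 1 (an assumed stand-in, not Performer features)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def quadratic_attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Scores every query against every key: cost grows as n^2."""
    out = []
    for q in Q:
        fq = phi(q)
        weights = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        z = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / z
                    for j in range(len(V[0]))])
    return out


def linear_attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Same result, but keys and values are summarized once: cost grows as n."""
    d, dv = len(K[0]), len(V[0])
    kv = [[0.0] * dv for _ in range(d)]  # running sum of phi(k_i) * v_i^T
    ksum = [0.0] * d                     # running sum of phi(k_i)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            ksum[a] += fk[a]
            for j in range(dv):
                kv[a][j] += fk[a] * v[j]
    out = []
    for q in Q:
        fq = phi(q)
        z = sum(a * b for a, b in zip(fq, ksum))
        out.append([sum(fq[a] * kv[a][j] for a in range(d)) / z for j in range(dv)])
    return out


# Tiny made-up inputs: 3 positions, 2-dimensional queries, keys, and values.
Q = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
K = [[1.0, 0.0], [-0.5, 2.0], [0.3, -0.7]]
V = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
slow = quadratic_attention(Q, K, V)
fast = linear_attention(Q, K, V)
```

Both functions compute the same weighted averages of the values; they differ only in the order of the multiplications, which is where the savings for long sequences come from.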

Language Translation
Effective language translation helps bring the world closer together by enabling us all to communicate, despite speaking different languages. To date, over a billion people around the world use Google Translate, and last year we added support for five new languages (Kinyarwanda, Odia, Tatar, Turkmen and Uyghur, collectively spoken by 75 million people). Translation quality continues to improve, showing an average +5 BLEU point gain across more than 100 languages from May 2019 to May 2020, through a wide variety of techniques like improved model architectures and training, better handling of noise in datasets, multilingual transfer and multi-task learning, and better use of monolingual data to improve low-resource languages (those without much written public content on the web). This is directly in line with our goals of improving the fairness of machine learning systems to provide benefits to the greatest number of people possible.

We strongly believe that continued scaling of multilingual translation models will bring further quality improvements, especially to the billions of speakers of low-resource languages around the world. In GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, Google researchers showed that training sparsely-activated multilingual translation models of up to 600 billion parameters leads to major improvements in translation quality for 100 languages, as measured by BLEU score improvement over a baseline of a separate 400M parameter monolingual model for each language. Three trends stood out in this work, illustrated by Figure 6 in the paper, reproduced below (see the paper for complete discussion):

  • The BLEU score improvements from multilingual training are high for all languages but are even higher for low-resource languages (right hand side of graph is higher than the left) whose speakers represent billions of people in some of the world’s most marginalized communities. Each rectangle on the figure represents languages with 1B speakers.
  • The larger and deeper the model, the larger the BLEU score improvements were across all languages (the lines hardly ever cross).
  • Large, sparse models also show a ~10x to 100x improvement in computational efficiency for model training over training a large, dense model, while simultaneously matching or significantly exceeding the BLEU scores of the large, dense model (computational efficiency discussed in paper).
An illustration of the significant gains in translation quality across 100 languages for large, sparsely-activated language models described in GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding.

We’re actively working on bringing the benefits demonstrated in this GShard research work to Google Translate, as well as training single models that cover 1000 languages, including languages like Dhivehi and Sudanese Arabic (while sharing some challenges that needed solving along the way).

We also developed techniques to create language-agnostic representations of sentences for BERT models, which can help with developing better translation models. To more effectively evaluate translation quality, we introduced BLEURT, a new metric for evaluating language generation for tasks like translation that considers the semantics of the generated text, rather than just the amount of word overlap with ground-truth data, illustrated in the table below.

Machine Learning Algorithms
We continue to develop new machine learning algorithms and approaches for training that enable systems to learn more quickly and from less supervised data. By replaying intermediate results during training of neural networks, we find that we can fill idle time on ML accelerators and therefore can train neural networks faster. By changing the connectivity of neurons dynamically during training, we can find better solutions compared with statically-connected neural networks. We also developed SimCLR, a new self-supervised and semi-supervised learning technique that simultaneously maximizes agreement between differently transformed views of the same image and minimizes agreement between transformed views of different images. This approach significantly improves on the best self-supervised learning techniques.

ImageNet top-1 accuracy of linear classifiers trained on representations learned with different self-supervised methods (pretrained on ImageNet). Gray cross indicates supervised ResNet-50.
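The contrastive objective described above can be sketched in a few lines. This is a minimal, unoptimized illustration of a normalized temperature-scaled cross-entropy loss over paired views, assuming cosine similarity and a temperature of 0.5; the function names and the default temperature are our own choices for the example, and none of the augmentation or encoder machinery is shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_loss(views_a, views_b, temperature=0.5):
    """views_a[i] and views_b[i] are embeddings of two transformed views of image i."""
    z = views_a + views_b
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same image
        logits = {k: cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i}
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        # Pull the positive pair together, push all other views away.
        total += log_denominator - logits[positive]
    return total / (2 * n)
```

The loss is low when the two views of each image land close together in embedding space and far from views of other images, and it grows when views of different images are more similar than views of the same image.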

We also extended the idea of contrastive learning to the supervised regime, resulting in a loss function that significantly improves over cross-entropy for supervised classification problems.

Reinforcement Learning
Reinforcement learning (RL), which learns to make good long-term decisions from limited experience, has been an important focus area for us. An important challenge in RL is to learn to make decisions from few data points, and we’ve improved RL algorithm efficiency through learning from fixed datasets, learning from other agents, and improving exploration.

A major focus area this year has been around offline RL, which relies solely on fixed, previously collected datasets (for example, from previous experiments or human demonstrations), extending RL to applications that can’t collect training data on the fly. We’ve introduced a duality approach to RL and developed improved algorithms for off-policy evaluation, estimating confidence intervals, and offline policy optimization. In addition, we’re collaborating with the broader community to tackle these problems by releasing open-source benchmark datasets and a DQN dataset for Atari.

Offline RL on Atari games using the DQN Replay Dataset.
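As a toy illustration of the offline setting (and not any of the algorithms referenced above), the sketch below runs tabular Q-learning over a fixed list of logged transitions without ever interacting with an environment. The transition format `(state, action, reward, next_state, done)` and all hyperparameters are assumptions made for the example.

```python
def offline_q_learning(dataset, n_states, n_actions, gamma=0.9, lr=0.5, epochs=200):
    """Sweep repeatedly over logged transitions; no new data is ever collected."""
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(epochs):
        for state, action, reward, next_state, done in dataset:
            # Bootstrapped target uses only what the fixed dataset contains.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += lr * (target - q[state][action])
    return q

def greedy_policy(q):
    """Pick the highest-valued action in each state (first one on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]

# A tiny logged dataset from a 3-state chain: action 1 moves right, action 0 moves left.
logged = [
    (0, 1, 0.0, 1, False),
    (1, 1, 1.0, 2, True),   # reaching state 2 ends the episode with reward 1
    (1, 0, 0.0, 0, False),
    (0, 0, 0.0, 0, False),
]
q_values = offline_q_learning(logged, n_states=3, n_actions=2)
policy = greedy_policy(q_values)
```

Note that state-action pairs missing from the dataset keep their initial value here, which hints at why offline RL needs more careful treatment than this simple sketch provides.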

Another line of research improved sample efficiency by learning from other agents through apprenticeship learning. We developed methods for learning from informed agents, matching other agents’ distributions, and learning from adversarial examples. To improve exploration in RL, we explored bonus-based exploration methods, including imitation techniques able to mimic structured exploration arising in agents having prior knowledge about their environment.

We’ve also made significant advances in the mathematical theory of reinforcement learning. One of our main areas of research was studying reinforcement learning as an optimization process. We found connections to the Frank-Wolfe algorithm, momentum methods, KL divergence regularization, operator theory, and convergence analysis; some of these insights led to an algorithm that achieves state-of-the-art performance in challenging RL benchmarks and to the discovery that polynomial transfer functions avoid convergence problems associated with softmax, both in RL and supervised learning. We’ve made some exciting progress on the topic of safe reinforcement learning, where one seeks to discover optimal control rules while respecting important experimental constraints. This includes a framework for safe policy optimization. We studied efficient RL-based algorithms for solving a class of problems known as mean field games, which model systems with a large number of decision-makers, from mobile networks to electric grids.

We’ve made breakthroughs toward generalization to new tasks and environments, an important challenge for scaling up RL to complex real-world problems. A 2020 focus area was population-based learning-to-learn methods, where another RL or evolutionary agent trained a population of RL agents to create a curriculum of emergent complexity, and discover new state-of-the-art RL algorithms. Learning to estimate the importance of data points in the training set and parts of visual input with selective attention resulted in significantly more skillful RL agents.

Overview of our method and illustration of data processing flow in AttentionAgent. Top: Input transformation — A sliding window segments an input image into smaller patches, and then “flattens” them for future processing. Middle: Patch election — The modified self-attention module holds votes between patches to generate a patch importance vector. Bottom: Action generation — AttentionAgent picks the patches of the highest importance, extracts corresponding features and makes decisions based on them.
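The three stages in the caption above can be sketched roughly as follows. This is a simplified illustration, not the AttentionAgent code: it assumes a single-channel image, uses the raw patch pixels as both queries and keys (the real module learns projections), and tallies votes by summing how much attention each patch receives.

```python
import math

def extract_patches(image, size, stride):
    """Slide a size x size window over a 2D image and flatten each patch."""
    patches, positions = [], []
    for r in range(0, len(image) - size + 1, stride):
        for c in range(0, len(image[0]) - size + 1, stride):
            patches.append([image[r + i][c + j]
                            for i in range(size) for j in range(size)])
            positions.append((r, c))
    return patches, positions

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def patch_importance(patches):
    """Each patch 'votes' for the others through one row of attention weights."""
    d = len(patches[0])
    attention = [softmax([sum(a * b for a, b in zip(p, q)) / math.sqrt(d)
                          for q in patches]) for p in patches]
    # Importance of patch j = total attention it receives from all patches.
    return [sum(row[j] for row in attention) for j in range(len(patches))]

def top_k_patches(importance, positions, k):
    """Return the top-left corners of the k most important patches."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return [positions[i] for i in ranked[:k]]

image = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
patches, positions = extract_patches(image, size=2, stride=2)
importance = patch_importance(patches)
selected = top_k_patches(importance, positions, k=1)
```

Only the selected patch locations would then be passed on to the controller that chooses actions.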

Further, we made progress in model-based RL by showing that learning predictive behavior models accelerates RL, enables decentralized cooperative multi-agent tasks in diverse teams, and supports learning long-term behavior models. Observing that skills bring predictable changes in the environment, we discover skills without supervision. Better representations stabilize RL, while hierarchical latent spaces and value-improvement paths yield better performance.

We shared open source tools for scaling up and productionizing RL. To expand the scope and problems tackled by users, we’ve introduced SEED, a massively parallel RL agent, and released a library for measuring RL algorithm reliability and a new version of TF-Agents that includes distributed RL, TPU support, and a full set of bandit algorithms. In addition, we performed a large empirical study of RL algorithms to improve hyperparameter selection and algorithm design.

Finally, in collaboration with Loon, we trained and deployed RL to more efficiently control stratospheric balloons, improving both power usage and their ability to navigate.

AutoML
Using learning algorithms to develop new machine learning techniques and solutions, or meta-learning, is a very active and exciting area of research. In much of our previous work in this area, we’ve created search spaces that look at how to find ways to combine sophisticated hand-designed components together in interesting ways. In AutoML-Zero: Evolving Code that Learns, we took a different approach, by giving an evolutionary algorithm a search space consisting of very primitive operations (like addition, subtraction, variable assignment, and matrix multiplication) in order to see if it was possible to evolve modern ML algorithms from scratch. The presence of useful learning algorithms in this space is incredibly sparse, so it is remarkable that the system was able to progressively evolve more and more sophisticated ML algorithms. As shown in the figure below, the system reinvents many of the most important ML discoveries over the past 30 years, such as linear models, gradient descent, rectified linear units, effective learning rate settings and weight initializations, and gradient normalization.

We also used meta-learning to discover a variety of new efficient architectures for object detection in both still images and videos. Last year’s work on EfficientNet for efficient image classification architectures showed significant accuracy improvements and computational cost reductions for image classification. In follow-on work this year, EfficientDet: Towards Scalable and Efficient Object Detection builds on top of the EfficientNet work to derive new efficient architectures for object detection and localization, showing remarkable improvements in highest absolute accuracy, as well as computational cost reductions of 13-42x over previous approaches to achieve a given level of accuracy.

EfficientDet achieves state-of-the-art 52.2 mAP, up 1.5 points from the prior state of the art (not shown since it is at 3045B FLOPs) on COCO test-dev under the same setting. Under the same accuracy constraint, EfficientDet models are 4x-9x smaller and use 13x-42x less computation than previous detectors.

Our work on SpineNet describes a meta-learned architecture that can retain spatial information more effectively, allowing detection to be done at finer resolution. We also focused on learning effective architectures for a variety of video classification problems. AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures, AssembleNet++: Assembling Modality Representations via Attention Connections, and AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification demonstrate how to use evolutionary algorithms to create novel state-of-the-art video processing machine learning architectures.

This approach can also be used to develop effective model architectures for time series forecasting. Using AutoML for Time Series Forecasting describes the system that discovers new forecasting models via an automated search over a search space involving many interesting kinds of low-level building blocks, and its effectiveness was demonstrated in the Kaggle M5 Forecasting Competition, by generating an algorithm and system that placed 138th out of 5558 participants (top 2.5%). While many of the competitive forecasting models required months of manual effort to create, our AutoML solution found the model in a short time with only a moderate compute cost (500 CPUs for 2 hours) and no human intervention.

Better Understanding of ML Algorithms and Models
Deeper understanding of machine learning algorithms and models is crucial for designing and training more effective models, as well as understanding when models may fail. Last year, we focused on fundamental questions around representation power, optimization, model generalization, and label noise, among others. As mentioned earlier in this post, Transformer networks have had a huge impact on modeling language, speech and vision problems, but what is the class of functions represented by these models? Recently we showed that transformers are universal approximators for sequence-to-sequence functions. Furthermore, sparse transformers also remain universal approximators even when they use just a linear number of interactions among the tokens. We have been developing new optimization techniques based on layerwise adaptive learning rates to improve the convergence speed of transformers, e.g., Large batch optimization for deep learning (LAMB): Training BERT in 76 minutes.
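To make the idea of a layerwise adaptive learning rate concrete, here is a minimal sketch of a LAMB-style update for the parameters of a single layer, written as plain lists. It follows the general shape of the method (an Adam-like direction rescaled by the ratio of the weight norm to the update norm); the function name, default hyperparameters, and the fallback ratio of 1.0 for zero norms are assumptions made for illustration.

```python
import math

def l2_norm(xs):
    return math.sqrt(sum(x * x for x in xs))

def lamb_step(weights, grads, m, v, step, lr=0.01,
              beta1=0.9, beta2=0.999, eps=1e-6, weight_decay=0.01):
    """One update for one layer; call separately for each layer's parameters."""
    new_m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grads)]
    new_v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grads)]
    # Bias-corrected moment estimates, as in Adam (step counts from 1).
    m_hat = [mi / (1 - beta1 ** step) for mi in new_m]
    v_hat = [vi / (1 - beta2 ** step) for vi in new_v]
    update = [mh / (math.sqrt(vh) + eps) + weight_decay * w
              for mh, vh, w in zip(m_hat, v_hat, weights)]
    # Layerwise trust ratio: scale the step to the size of this layer's weights.
    w_norm, u_norm = l2_norm(weights), l2_norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    new_weights = [w - lr * trust_ratio * u for w, u in zip(weights, update)]
    return new_weights, new_m, new_v
```

Because of the trust ratio, the length of each layer's step is `lr` times the norm of that layer's weights, regardless of how large or small the raw gradients are, which is what makes very large batch sizes easier to train with.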

As neural networks are made wider and deeper, they often train faster and generalize better. This is a core mystery in deep learning since classical learning theory suggests that large networks should overfit more. We are working to understand neural networks in this overparameterized regime. In the limit of infinite width, neural networks take on a surprisingly simple form, and are described by a Neural Network Gaussian Process (NNGP) or Neural Tangent Kernel (NTK). We studied this phenomenon theoretically and experimentally, and released Neural Tangents, an open-source software library written in JAX that allows researchers to build and train infinite-width neural networks.

Left: A schematic showing how deep neural networks induce simple input / output maps as they become infinitely wide. Right: As the width of a neural network increases, we see that the distribution of outputs over different random instantiations of the network becomes Gaussian.
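As a small worked example of the "surprisingly simple form" mentioned above, the sketch below computes the NNGP kernel between two inputs for a fully connected ReLU network by iterating the standard closed-form layer-to-layer recursion. It does not use the Neural Tangents library; the parameterization (weight variance `sigma_w2`, bias variance `sigma_b2`, input kernel scaled by the input dimension) is an assumption chosen for the example, and inputs are assumed to be non-zero.

```python
import math

def nngp_relu_kernel(x1, x2, depth, sigma_w2=2.0, sigma_b2=0.0):
    """Covariance of the outputs for inputs x1 and x2 in the infinite-width limit."""
    d = len(x1)

    def input_kernel(a, b):
        return sigma_b2 + sigma_w2 * sum(p * q for p, q in zip(a, b)) / d

    k11, k22, k12 = input_kernel(x1, x1), input_kernel(x2, x2), input_kernel(x1, x2)
    for _ in range(depth):
        norm = math.sqrt(k11 * k22)
        cos_theta = max(-1.0, min(1.0, k12 / norm))  # clamp against rounding error
        theta = math.acos(cos_theta)
        # Closed form for E[relu(u) * relu(v)] with (u, v) jointly Gaussian.
        k12 = sigma_b2 + sigma_w2 / (2 * math.pi) * norm * (
            math.sin(theta) + (math.pi - theta) * math.cos(theta))
        # On the diagonal, E[relu(u)^2] is half the variance of u.
        k11 = sigma_b2 + sigma_w2 * k11 / 2
        k22 = sigma_b2 + sigma_w2 * k22 / 2
    return k12
```

With `sigma_w2=2.0` and no bias, the diagonal of the kernel is preserved from layer to layer, while the correlation between distinct inputs drifts toward 1 as depth grows.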

As finite width networks are made larger, they also demonstrate peculiar double descent phenomena — where they generalize better, then worse, then better again with increasing width. We have shown that this phenomenon can be explained by a novel bias-variance decomposition, and further that it can sometimes manifest as triple descent.

Lastly, in real-world problems, one often needs to deal with significant label noise. For instance, in large scale learning scenarios, weakly labeled data is available in abundance with large label noise. We have developed new techniques for distilling effective supervision from severe label noise leading to state-of-the-art results. We have further analyzed the effects of training neural networks with random labels, and shown that it leads to alignment between network parameters and input data, enabling faster downstream training than initializing from scratch. We have also explored questions such as whether label smoothing or gradient clipping can mitigate label noise, leading to new insights for developing robust training techniques with noisy labels.
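For readers unfamiliar with label smoothing, the following minimal sketch shows the transformation itself: a small share `alpha` of the probability mass is moved from the labeled class and spread uniformly over all classes. The function names and the default `alpha=0.1` are illustrative assumptions, not settings taken from the work above.

```python
import math

def smooth_labels(one_hot, alpha=0.1):
    """Mix a one-hot target with the uniform distribution over classes."""
    k = len(one_hot)
    return [(1 - alpha) * y + alpha / k for y in one_hot]

def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

hard = [0.0, 0.0, 1.0, 0.0]
soft = smooth_labels(hard)  # roughly [0.025, 0.025, 0.925, 0.025]
```

Intuitively, a smoothed target limits how strongly a single, possibly wrong, label can pull the model's predictions.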

Algorithmic Foundations and Theory
2020 was a productive year for our work in algorithmic foundations and theory, with several impactful research publications and notable results. On the optimization front, our paper on edge-weighted online bipartite matching develops a new technique for online competitive algorithms and solves a thirty-year-old open problem for the edge-weighted variant, with applications in efficient online ad allocation. Along with this work in online allocation, we developed dual mirror descent techniques that generalize to a variety of models with additional diversity and fairness constraints, and published a sequence of papers on the topic of online optimization with ML advice in online scheduling, online learning and online linear optimization. Another research result gave the first improvement in 50 years on the classic bipartite matching problem in dense graphs. Finally, another paper solves a long-standing open problem about chasing convex bodies online, using an algorithm from The Book, no less.

We also continued our work in scalable graph mining and graph-based learning and hosted the Graph Mining & Learning at Scale Workshop at NeurIPS’20, which covered work on scalable graph algorithms including graph clustering, graph embedding, causal inference, and graph neural networks. As part of the workshop, we showed how to solve several fundamental graph problems faster, both in theory and practice, by augmenting standard synchronous computation frameworks like MapReduce with a distributed hash-table similar to a BigTable. Our extensive empirical study validates the practical relevance of the AMPC model inspired by our use of distributed hash tables in massive parallel algorithms for hierarchical clustering and connected components, and our theoretical results show how to solve many of these problems in constant distributed rounds, greatly improving upon our previous results. We also achieved exponential speedup for computing PageRank and random walks. On the graph-based learning side, we presented Grale, our framework for designing graphs for use in machine learning. Furthermore, we presented our work on more scalable graph neural network models, where we show that PageRank can be used to greatly accelerate inference in GNNs.
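For reference, here is the textbook power-iteration form of PageRank that the faster methods above improve upon. It is a plain single-machine sketch, not the distributed algorithm from our papers; the damping factor of 0.85, the tolerance, and the adjacency-dictionary input format are assumptions made for the example.

```python
def pagerank(adjacency, damping=0.85, tol=1e-10, max_iter=1000):
    """adjacency maps each node to the list of nodes it links to."""
    nodes = list(adjacency)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Rank held by nodes with no out-links is spread evenly over all nodes.
        dangling = sum(rank[u] for u in nodes if not adjacency[u])
        new_rank = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            for v in adjacency[u]:
                new_rank[v] += damping * rank[u] / len(adjacency[u])
        delta = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank

graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
```

Each iteration touches every edge once, so the cost is dominated by the number of iterations needed to converge, which is the quantity that faster algorithms aim to reduce.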

In market algorithms, an area at the intersection of computer science and economics, we continued our research in designing improved online marketplaces, such as measuring incentive properties of ad auctions, two-sided markets, and optimizing order statistics in ad selection. In the area of repeated auctions, we developed frameworks to make dynamic mechanisms robust against lack of forecasting or estimation errors of the current market and/or the future market, leading to provably tight low-regret dynamic mechanisms. Later, we characterized when it is possible to achieve the asymptotically optimal objective through geometry-based criteria. We also compared the equilibrium outcome of a range of budget management strategies used in practice, showed their impact on the tradeoff between revenue and buyers’ utility and shed light on their incentive properties. Additionally, we continued our research in learning optimal auction parameters, and settled the complexity of batch-learning with revenue loss. We designed the optimal regret and studied combinatorial optimization for contextual auction pricing, and developed a new active learning framework for auctions and improved the approximation for posted-price auctions. Finally, motivated by the importance of incentives in ad auctions, and in the hope of helping advertisers study the impact of incentives in auctions, we introduced a data-driven metric to quantify how much a mechanism deviates from incentive compatibility.

Machine Perception
Perceiving the world around us — understanding, modeling and acting on visual, auditory and multimodal input — continues to be a research area with tremendous potential to be beneficial in our everyday lives.

In 2020, deep learning powered new approaches that bring 3D computer vision and computer graphics closer together. CvxNet, deep implicit functions for 3D shapes, neural voxel rendering and CoReNet are a few examples of this direction. Furthermore, our research on representing scenes as neural radiance fields (aka NeRF, see also this blog post) is a good example of how Google Research’s academic collaborations stimulate rapid progress in the area of neural volume rendering.
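As a rough illustration of what neural volume rendering computes, the sketch below composites density and color samples along a single camera ray into one pixel color, using the standard front-to-back quadrature. In NeRF the densities and colors would come from a learned network; here they are simply passed in as lists, and the function and argument names are our own.

```python
import math

def composite_ray(densities, colors, deltas):
    """densities[i], colors[i] (RGB), deltas[i]: sample i and its segment length."""
    transmittance = 1.0          # fraction of light that has not yet been blocked
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha
        weights.append(weight)
        for ch in range(3):
            pixel[ch] += weight * color[ch]
        transmittance *= 1.0 - alpha
    return pixel, weights

# Empty space, then an opaque red surface, then something hidden behind it.
pixel, weights = composite_ray(
    densities=[0.0, 1e9, 5.0],
    colors=[[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    deltas=[0.1, 0.1, 0.1],
)
```

Because every step is differentiable, the same computation can be used to train the underlying scene representation from photographs.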

In Learning to Factorize and Relight a City, a collaboration with UC Berkeley, we proposed a learning-based framework for disentangling outdoor scenes into temporally-varying illumination and permanent scene factors. This gives the ability to change lighting effects and scene geometry for any Street View panorama, or even turn it into a full-day timelapse video.

Our work on generative human shape and articulated pose models introduces a statistical, articulated 3D human shape modeling pipeline, within a fully trainable, modular, deep learning framework. Such models enable 3D human pose and shape reconstruction of people from a single photo to better understand the scene.

Overview of end-to-end statistical 3D articulated human shape model construction in GHUM & GHUML: Generative 3D Human Shape and Articulated Pose Models.

The growing area of media compression using neural networks continued to make strong progress in 2020, not only on learned image compression, but also in deep approaches to video compression, volume compression and nice results in deep distortion-agnostic image watermarking.

Samples of encoded and cover images for Distortion Agnostic Deep Watermarking. First row: Cover image with no embedded message. Second row: Encoded image from HiDDeN combined distortion model. Third row: Encoded images from our model. Fourth row: Normalized difference of the encoded image and cover image for the HiDDeN combined model. Fifth row: Normalized difference for our model.

Additional important themes in perceptual research included:

Engaging with the broader research community through open sourcing of solutions and datasets is another important aspect of furthering perceptual research. In 2020, we open sourced multiple new perceptual inference capabilities and solutions in MediaPipe, such as on-device face, hand and pose prediction, real-time body pose tracking, real-time iris tracking and depth estimation, and real-time 3D object detection.

We continued to make strides to improve experiences and promote helpfulness on mobile devices through ML-based solutions. Our ability to run sophisticated natural language processing on-device, enabling more natural conversational features, continues to improve. In 2020, we expanded Call Screen and launched Hold for Me to allow users to save time when performing mundane tasks, and we also launched language-based actions and language navigability of our Recorder app to aid productivity.

We have used Google’s Duplex technology to make calls to businesses and confirm things like temporary closures. This has enabled us to make 3 million updates to business information globally, which have been seen over 20 billion times on Maps and Search. We also used text to speech technology for easier access to web pages, by enabling Google Assistant to read them aloud, supporting 42 languages.

We also continued to make meaningful improvements to imaging applications. We made it easier to capture precious moments on Pixel with innovative controls and new ways to relight, edit, enhance and relive them again in Google Photos. For the Pixel camera, beginning with Pixel 4 and 4a, we added Live HDR+, which uses machine learning to approximate the vibrance and balanced exposure and appearance of HDR+ burst photography in real time in the viewfinder. We also created dual exposure controls, which allow the brightness of shadows and highlights in a scene to be adjusted independently — live in the viewfinder.

More recently, we introduced Portrait Light, a new post-capture feature for the Pixel Camera and Google Photos apps that adds a simulated directional light source to portraits. This feature is again one that is powered by machine learning, having been trained on 70 different people, photographed one light at a time, in our pretty cool 331-LED Light Stage computational illumination system.

In the past year, Google researchers were excited to contribute to many new (and timely) ways of using Google products. Here are a few examples

Robotics
In the area of robotics research, we’ve made tremendous progress in our ability to learn more and more complex, safe and robust robot behaviors with less and less data, using many of the RL techniques described earlier in the post.

Transporter Networks are a novel approach to learning how to represent robotic tasks as spatial displacements. Representing relations between objects and the robot end-effectors, as opposed to absolute positions in the environment, makes learning robust transformations of the workspace very efficient.

In Grounding Language in Play, we demonstrated how a robot can be taught to follow natural language instructions (in many languages!). This required a scalable approach to collecting paired data of natural language instructions and robot behaviors. One key insight is that this can be accomplished by asking robot operators to simply play with the robot, and label after-the-fact what instructions would have led to the robot accomplishing the same task.

We also explored doing away with robots altogether (by having humans use a camera-equipped grasping stick) for even more scalable data collection, and how to efficiently transfer visual representations across robotic tasks.

We investigated how to learn very agile strategies for robot locomotion, by taking inspiration from nature, using evolutionary meta-learning strategies, human demonstrations, and various approaches to training data-efficient controllers using deep reinforcement learning.

One increased emphasis this year has been on safety: how do we deploy safe delivery drones in the real world? How do we explore the world in a way that always allows the robot to recover from its mistakes? How do we certify the stability of learned behaviors? This is a critical area of research on which we expect to see increased focus in the future.

Quantum Computing
Our Quantum AI team continued its work to establish practical uses of quantum computing. We ran experimental algorithms on our Sycamore processors to simulate systems relevant to chemistry and physics. These simulations are approaching a scale at which they cannot be performed on classical computers anymore, making good on Feynman’s original idea of using quantum computers as an efficient means to simulate systems in which quantum effects are important. We published new quantum algorithms, for instance to perform precise processor calibration, to show an advantage for quantum machine learning, or to test quantum enhanced optimization. We also worked on programming models to make it easier to express quantum algorithms. We released qsim, an efficient simulation tool to develop and test quantum algorithms with up to 40 qubits on Google Cloud.

We continued to follow our roadmap towards building a universal error-corrected quantum computer. Our next milestone is the demonstration that quantum error correction can work in practice. To achieve this, we will show that a larger grid of qubits can hold logical information exponentially longer than a smaller grid, even though individual components such as qubits, couplers or I/O devices have imperfections. We are also particularly excited that we now have our own cleanroom which should significantly increase the speed and quality of our processor fabrication.

Supporting the Broader Developer and Researcher Community
This year marked TensorFlow’s 5th birthday, passing 160M downloads. The TensorFlow community continued its impressive growth with new special interest groups, TensorFlow User Groups, TensorFlow Certificates, AI Service partners, and inspiring demos #TFCommunitySpotlight. We significantly improved TF 2.x with seamless TPU support, out of the box performance (and best-in-class performance on MLPerf 0.7), data preprocessing, distribution strategy, and a new NumPy API.

We also added many more capabilities to the TensorFlow Ecosystem to help developers and researchers in their workflows: Sounds of India demonstrated going from research to production in under 90 days, using TFX for training and TF.js for deployment in the browser. With Mesh TensorFlow, we pushed the boundaries of model parallelism to provide ultra-high-resolution image analysis. We open-sourced the new TF runtime, TF Profiler for model performance debugging, and tools for Responsible AI, such as the Model Card Toolkit for model transparency and a privacy testing library. With TensorBoard.dev we made it possible to easily host, track, and share your ML experiments for free.

In addition, we redoubled our investment in JAX, an open-source, research-focused ML system that has been actively developed over the past two years. Researchers at Google and beyond are now using JAX in a wide range of fields, including differential privacy, neural rendering, physics-informed networks, fast attention, molecular dynamics, tensor networks, neural tangent kernels, and neural ODEs. JAX accelerates research at DeepMind, powering a growing ecosystem of libraries and work on GANs, meta-gradients, reinforcement learning, and more. We also used JAX and the Flax neural network library to build record-setting MLPerf benchmark submissions, which we demonstrated live at NeurIPS on a large TPU Pod slice with a next-generation Cloud TPU user experience (slides, video, sign-up form). Finally, we’re ensuring that JAX works seamlessly with TF ecosystem tooling, from TF.data for data preprocessing and TensorBoard for experiment visualization to the TF Profiler for performance debugging, with more to come in 2021.

Many recent research breakthroughs have been enabled by increased computing power, and we make more than 500 petaflops of Cloud TPU computing power available for free to researchers around the world via the TFRC program to help broaden access to the machine learning research frontier. More than 120 TFRC-supported papers have been published to date, many of which would not have been possible without the computing resources that the program provides. For example, TFRC researchers have recently developed simulations of wildfire spread, helped analyze COVID-19 content and vaccine sentiment changes on social media networks, and advanced our collective understanding of the lottery ticket hypothesis and neural network pruning. Members of the TFRC community have also published experiments with Persian poetry, won a Kaggle contest on fine-grained fashion image segmentation, and shared tutorials and open-source tools as starting points for others. In 2021, we will change the name of the TFRC program to the TPU Research Cloud program to be more inclusive now that Cloud TPUs support JAX and PyTorch in addition to TensorFlow.

Finally, this was a huge year for Colab. Usage doubled, and we launched productivity features to help people do their work more efficiently, including improved Drive integration and access to the Colab VM via the terminal. And we launched Colab Pro to enable users to access faster GPUs, longer runtimes and more memory.

Open Datasets and Dataset Search
Open datasets with clear and measurable goals are often very helpful in driving forward the field of machine learning. To help the research community find interesting datasets, we continue to index a wide variety of open datasets sourced from many different organizations with Google Dataset Search. We also think it’s important to create new datasets for the community to explore and to develop new techniques, while ensuring that we share open data responsibly. This year, in addition to open datasets to help address the COVID crisis, we released a number of open datasets across many different areas:

Research Community Interaction
We are proud to enthusiastically support and participate in the broader research community. In 2020, Google researchers presented over 500 papers at leading research conferences, additionally serving on program committees, organizing workshops, tutorials and numerous other activities aimed at collectively progressing the state of the art in the field. To learn more about our contributions to some of the larger research conferences this year, please see our blog posts for ICLR 2020, CVPR 2020, ACL 2020, ICML 2020, ECCV 2020 and NeurIPS 2020.

In 2020 we supported external research with $37M in funding, including $8.5M in COVID research, $8M in research inclusion and equity, and $2M in responsible AI research. In February, we announced the 2019 Google Faculty Research Award Recipients, funding research proposals from 150 faculty members throughout the world. Among this group, 27% self-identified as members of historically underrepresented groups within technology. We also announced a new Research Scholar Program to support early-career professors who are pursuing research in fields relevant to Google via unrestricted gifts. As we have for more than a decade, we selected a group of incredibly talented PhD student researchers to receive Google PhD Fellowships, which provide funding for graduate studies, as well as mentorship as they pursue their research, and opportunities to interact with other Google PhD Fellows.

We are also expanding the ways that we support inclusion and bring new voices into the field of computer science. In 2020, we created a new Award for Inclusion Research program that supports academic research in computing and technology addressing the needs of underrepresented populations. In the inaugural set of awards, we selected 16 proposals for funding with 25 principal investigators, focused on topics around diversity and inclusion, algorithmic bias, education innovation, health tools, accessibility, gender bias, AI for social good, security, and social justice. We additionally partnered with the Computing Alliance of Hispanic-Serving Institutions (CAHSI) and the CMD-IT Diversifying Future Leadership in the Professoriate Alliance (FLIP) to create an award program for doctoral students from traditionally underrepresented backgrounds to support the last year of the completion of the dissertation requirements.

In 2019, Google’s CS Research Mentorship Program (CSRMP) helped provide mentoring to 37 undergraduate students to introduce them to conducting computer science research. Based on the success of the program in 2019/2020, we’re excited to greatly expand this program in 2020/2021 and will have hundreds of Google researchers mentoring hundreds of undergraduate students in order to encourage more people from underrepresented backgrounds to pursue computer science research careers. Finally, in October we provided exploreCSR awards to 50 institutions around the world for the 2020 academic year. These awards fund faculty to host workshops for undergraduates from underrepresented groups in order to encourage them to pursue CS research.

Looking Forward to 2021 and Beyond
I’m excited about what’s to come, from our technical work on next-generation AI models, to the very human work of growing our community of researchers.

We’ll keep ensuring our research is done responsibly and has a positive impact, using our AI Principles as a guiding framework and applying particular scrutiny to topics that can have broad societal impact. This post covers just a few of the many papers on responsible AI that Google published in the past year. While pursuing our research, we’ll focus on:

  • Promoting research integrity: We’ll make sure Google keeps conducting a wide range of research in an appropriate manner, and provides comprehensive, scientific views on a variety of challenging, interesting topics.
  • Responsible AI development: Tackling tough topics will remain core to our work, and Google will continue creating new ML algorithms to make machine learning more efficient and accessible, developing approaches to combat unfair bias in language models, devising new techniques for ensuring privacy in learning systems, and much more. And importantly, beyond looking at AI development with a suitably critical eye, we’re eager to see what techniques we and others in the community can develop to mitigate risks and make sure new technologies have equitable, positive impacts on society.
  • Advancing diversity, equity, and inclusion: We care deeply that the people who are building influential products and computing systems better reflect the people using these products all around the world. Our efforts here are both within Google Research, as well as within the wider research and academic communities — we’ll be calling upon the academic and industry partners we work with to advance these efforts together. On a personal level, I am deeply committed to improving representation in computer science, having spent hundreds of hours working towards these goals over the last few years, as well as supporting universities like Berkeley, CMU, Cornell, Georgia Tech, Howard, UW, and numerous other organizations that work to advance inclusiveness. This is important to me, to Google, and to the broader computer science community.

Finally, looking ahead to the year, I’m particularly enthusiastic about the possibilities of building more general-purpose machine learning models that can handle a variety of modalities and that can automatically learn to accomplish new tasks with very few training examples. Advances in this area will empower people with dramatically more capable products, bringing better translation, speech recognition, language understanding and creative tools to billions of people all around the world. This kind of exploration and impact is what keeps us excited about our work!

Acknowledgements
Thanks to Martin Abadi, Marc Bellemare, Elie Bursztein, Zhifeng Chen, Ed Chi, Charina Chou, Katherine Chou, Eli Collins, Greg Corrado, Corinna Cortes, Tiffany Deng, Tulsee Doshi, Robin Dua, Kemal El Moujahid, Aleksandra Faust, Orhan Firat, Jen Gennai, Till Hennig, Ben Hutchinson, Alex Ingerman, Tomáš Ižo, Matthew Johnson, Been Kim, Sanjiv Kumar, Yul Kwon, Steve Langdon, James Laudon, Quoc Le, Yossi Matias, Brendan McMahan, Aranyak Mehta, Vahab Mirrokni, Meg Mitchell, Hartmut Neven, Mohammad Norouzi, Timothy Novikoff, Michael Piatek, Florence Poirel, David Salesin, Nithya Sambasivan, Navin Sarma, Tom Small, Jascha Sohl-Dickstein, Zak Stone, Rahul Sukthankar, Mukund Sundararajan, Andreas Terzis, Sergei Vassilvitskii, Vincent Vanhoucke, and Leslie Yeh and others for helpful feedback and for drafting portions of this post, and to the entire Research and Health communities at Google for everyone’s contributions towards this work.


Long-Term Stock Forecasting


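2020 Retrospective

This year I want to write my retrospective on time. Masks suddenly became part of daily life, working from home followed naturally, my project changed, and in many ways it was a year full of change.

Work

Because I wrote my 2019 retrospective late, it ended up covering some of the first half of 2020 as well. So for work, I will focus on the second half of the year.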

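The AiMD project

I recently took charge of a new project called AiMD. An MD (merchandiser) is the person who does 'product planning' while taking into account everything shown in the figure below. The goal of the project is to automate the work an MD does.

Figure: What an MD does

I had not been on the project long, but a service happened to be in planning at the time, so we were able to ship to production quickly. As shown below, it is the related-product recommendation in the MR. Trend Magazine service, which recommends products matched to the trend magazine for a given product.

Figure: Related-product recommendation in MR. Trend Magazine

This was my first encounter with recommendation and with e-commerce as a domain, and my main impression was that engineering matters especially here. In NLP, most inputs are text, so I could focus on the text alone; in recommendation, the inputs become far more varied. Instead of a curated dataset you start from logging data, and you typically have to consider various kinds of behavioral data (clicks, purchases, and so on) and images as well as text. And because you are working with logs, you also need to be comfortable with databases.

Because the scope is so broad, the ability to handle data matters more than anything else. Models are converging on the Transformer these days, and there are countless libraries and related code. So rather than the model itself, I think the skills that matter are understanding what the data means, deciding how to feed it to the model and what the model should learn from it, and running experiments quickly based on hypotheses. Of course, I still have a lot to learn and plenty of shortcomings!

On top of that, the sheer number of products makes the problem much harder. Handling large volumes of data, from tens of thousands of items up to hundreds of millions or even trillions, is the norm. This is also why the engineering organizations of e-commerce companies have data platform teams.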

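Remote work

Because of COVID-19, this was also a year in which I worked from home for long stretches. I had little experience with remote work and was pushed into it by circumstances, without any practice, so there were a number of problems. Early on in particular, the commute disappeared and home became a place where rest and work coexisted, so I often found it hard to regulate my working hours. Having full control over even my mealtimes probably made it worse. Naturally, I had more time in which I could keep working in one stretch if I wanted to, and I often stayed focused on work until late in the evening.

The problem was that, as this continued in the early months, I fell into a cycle of working late and getting up late. At the time I thought I was being efficient because I was focused, but looking back, work done late at night usually had to be redone the next day, or occasionally turned out to contain bugs. Some days I was short on sleep and could not concentrate well.

So as remote work dragged on, I made an effort to mark a clear start and end to my workday. Because I paid conscious attention to it, my routine became more regular than when I was commuting. It may only be a feeling, but with a regular routine I felt I was using my working hours more efficiently than before.
Admittedly this is largely a qualitative impression and not a comparison of actual output, so it is imprecise. Still, I want to use this opportunity to practice finishing quality work within a fixed amount of time.

Work is closer to a marathon that you keep running. To go further, the best strategy is to move forward while pacing yourself. Even when the contact-free era ends and I go back to commuting, I want to keep this regular rhythm!

Above all, what remote work affected most was communication. Communication shifted from real-time syncs to one-off messages, and the nonverbal cues you get for free when everyone gathers in a meeting room are unavailable when people leave their cameras off in video calls. Perhaps for that reason, we sometimes came out of meetings with different understandings. As communication changed in this way, much of it came to be managed through issues, so it now happens mostly in writing, not speech. I have come to appreciate how hard it is to convey just the necessary information clearly, in writing alone, without causing misunderstanding.

Apart from work communication, I also learned a lot about small talk. On the occasions when something came up and I went into the office, having coffee with colleagues made me realize anew how precious that time had been. With remote work, the threshold for setting aside time to chat in the middle of work has become far too high. That is why I feel more strongly the importance of communication that exists simply for fun, like Slack's #random channel, which was made for small talk. You can see a team's cohesion just from how its members communicate, and that cohesion strongly influences what the team produces.

The pattern emerged quickly. The most successful projects were dominated by 'clusters of high communicators'. Their chemistry and cohesion resembled the interactions between Larry Page and Jeff Dean. … What actually determined cohesion among the members lay elsewhere: the distance between their desks. … “Simple acts like making eye contact at close range and sharing traces of one another matter far more than you would think. Just seeing another person's belongings or space while you work reminds you of their presence, and that has an enormous effect.”

  • From “The correlation between desk distance and performance”, ≪최고의 팀은 무엇이 다른가≫ (the Korean edition of The Culture Code)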

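Quantified Self

Task

Figure 1 – Stacked bar chart of monthly working hours in 2020

As I mentioned in the Work section, in the second half of the year I started the AiMD project and am once again putting time into research and development. You can also see the overtime in October, when we were releasing the MR. service. Although most of my time goes to development, management and meetings always account for about half. We were short-staffed, so I had to do modeling work while also keeping an eye on management, doing both at once. That is manageable while the project is small, but I expect the nature of the work to change again as the team and the project grow.

Regardless of how much work I was given, I think I was able to produce results because I had many blocks of time in which I could work with sustained focus.

Spending so much time at home, this was also a year in which I focused a bit more on building the good habits I had been putting off.

Figure 2 – Stacked bar chart of time spent on second-quadrant activities (Blog, Book, Exercise, MOOC, Review, Seminar)

Since August 2020, Blog has been taking up a little time and is gradually settling in as a habit. Reading time (Book) has grown, and since I started meditating, Review time has grown a lot as well. Even after COVID-19 ends and I go back to commuting, I want to keep building and maintaining habits in a better direction!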

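Book

These days my company provides a content benefit of 150,000 won per month, and I spend most of it on books. As a result, I buy books much faster than I read them. More than 20 books I bought in advance are piled up, so I need to keep reading steadily.

Figure 3 – Bar chart of time spent on the books I read in 2020

I read 23 books this year. Many are about managing and teams, and quite a few come from the podcast 이동진의 빨간책방, which I am working through backwards in time. The time spent on each book reflects its page count and the difficulty of its content. ≪불확실한 상황에서의 판단≫ (Judgment Under Uncertainty), the book I invested the most time in, is both long and difficult. It collects many papers that have since become classics, and although it was hard, I found it interesting to read.

Image source: 알라딘 https://www.aladin.co.kr/shop/wproduct.aspx?ItemId=45834844

If I had to pick my personal 'book of the year', it would be Ed Catmull's ≪창의성을 지휘하라≫ (Creativity, Inc.). I think the book is deeply infused with the countless questions and real experiences that went into protecting Pixar's identity and maximizing creativity. Every situation is different and new, so there is no method you can apply as-is, but I believe the roots of the problems are not so different. The book shows an effort to identify root causes instead of the symptoms an organization experiences, along with the concrete actions taken in response. I was especially impressed by the efforts to understand and protect Pixar's identity, and by the processes for sustaining its culture.

Ultimately, ≪Toy Story 2≫ left Pixar with the following lesson: we must always pay attention to shifting dynamics. Pixar's future depends on it. Originally planned as a direct-to-video animation, the project proved not only that Pixar's people cannot tolerate settling for B-grade work, but also that every film Pixar produces is excellent. … The project impressed on everyone that each employee is an owner who contributes to Pixar's greatest asset: the quality of its work.

  • From Chapter 4, “Establishing Pixar's Identity”

The book is built around Pixar, a company that makes animation, but I think it applies to any company that makes products. I recall the book saying as much. I would like to write a separate post about it later.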

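Blog

I decided to take blogging seriously because I wanted to write posts that carry more of my own thinking. There was something said in the final episode of 지대넓얕, a podcast I used to enjoy. (I could not find the exact passage, so I am substituting a quote from an interview.)

Actually, the idea of 'filling the cup' comes from Nietzsche. Nietzsche said that once the cup is full, you need a time to let it go under. You go out into the world, tell your own story, and empty the cup. Building something up inside yourself is called 'the time when the cup is filled', and as you said, reading books or watching films and plays are good ways to do that, but the essence is 'solitary time'. Time spent facing yourself. – 채사장

Working at my company, running side projects on my own, and studying various things, I felt that this 'cup' had filled up without my noticing. So I started wanting to tell my own story, little by little. Instead of simply summarizing content, I am trying to write posts that contain a lot of my own thinking on the topic.

Since the second half of this year I have written six posts. In restarting the blog in earnest, I wanted to be clear about what kind of posts I write: going a little deeper into topics I am personally interested in. That led naturally to the 'Quantified Self' series and, most recently, the 'Code Reading' series.

Writing this way, I find that I learn a lot myself. The post I spent the most time on was 'Quantified Self Part 6 – A quantitative expression of a productive day and four years of data'. It took about 15 hours! While writing, new directions for the data analysis occurred to me, so I drew new visualizations and kept asking which chart would be easier to understand. Even with all that thought, some parts still look complicated, and the data analysis part, which was the hardest, still feels hard to me. But if I had not written it down, my thinking would have been far less organized. Writing naturally involves putting the flow in order with logic.

Below are the posts I wrote this year, ranked by number of visitors.

  1. Quantified Self Part 6 – A quantitative expression of a productive day and four years of data
  2. Clean Architecture: from beautiful code to architecture
  3. A very late 2019 retrospective
  4. Inspired, a textbook for product managers
  5. Quantified Self Part 5 – Data visualization and dashboards with KPIs
  6. CodeReading – 1. PyTorch

Of these, 'Quantified Self Part 6 – A quantitative expression of a productive day and four years of data', the only post I promoted and the one I spent the most time writing, had the most page views, as expected. Many readers arrived after I shared it myself in the 생활코딩 group on Facebook. Next was 'Clean Architecture: from beautiful code to architecture'. Since it reviews a widely known book, readers found it without any promotion.

It does seem that for unfamiliar topics, attracting readers at all is hard. For now, I think what matters is to keep writing steadily, improve my writing, and become a blog that people keep reading.

As an aside, looking at these metrics in Google Analytics, I naturally start wondering how I should write and promote posts to raise CTR and CVR. (An occupational hazard, I suspect.)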

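In closing

I imagine everyone experienced a lot of loss and hardship this year because of COVID-19. I did too. Even so, I think this period gave me time to focus a bit more on myself. In 2021 I want to do many things that are both meaningful and fun, and I would like to give some talks too!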


The medical test paradox: Can redesigning Bayes rule help?


End-to-End, Transferable Deep RL for Graph Optimization

An increasing number of applications are driven by large and complex neural networks trained on diverse sets of accelerators. This process is facilitated by ML compilers that map high-level computational graphs to low-level, device-specific executables. In doing so, ML compilers need to solve many optimization problems, including graph rewriting, assignment of operations on devices, operation fusion, layout and tiling of tensors, and scheduling. For example, in a device placement problem, the compiler needs to determine the mapping between operations in the computational graph to the target physical devices so that an objective function, such as training step time, can be minimized. The placement performance is determined by a mixture of intricate factors, including inter-device network bandwidth, peak device memory, co-location constraints, etc., making it challenging for heuristics or search-based algorithms, which typically settle for fast, but sub-optimal, solutions. Furthermore, heuristics are hard to develop and maintain, especially as newer model architectures emerge.

Recent attempts at using learning-based approaches have demonstrated promising results, but they have a number of limitations that make them infeasible to deploy in practice. First, these approaches do not easily generalize to unseen graphs, especially those arising from newer model architectures. Second, they have poor sample efficiency, leading to high resource consumption during training. Finally, they are only able to solve a single optimization task and, consequently, do not capture the dependencies across the tightly coupled optimization problems in the compilation stack.

In “Transferable Graph Optimizers for ML Compilers”, recently published as an oral paper at NeurIPS 2020, we propose an end-to-end, transferable deep reinforcement learning method for computational graph optimization (GO) that overcomes all of the above limitations. We demonstrate 33%-60% speedup on three graph optimization tasks compared to TensorFlow default optimization. On a diverse set of representative graphs consisting of up to 80,000 nodes, including Inception-v3, Transformer-XL, and WaveNet, GO achieves an average 21% improvement over expert optimization and an 18% improvement over the prior state of the art with 15x faster convergence.

Graph Optimization Problems in ML Compilers
There are three coupled optimization tasks that frequently arise in ML compilers, which we formulate as decision problems that can be solved using a learned policy. The decision problems for each of the tasks can be reframed as making a decision for each node in the computational graph.

The first optimization task is device placement, where the goal is to determine how best to assign the nodes of the graph to the physical devices on which it runs such that the end-to-end run time is minimized.

The second optimization task is operation scheduling. An operation in a computational graph is ready to run when its incoming tensors are present in the device memory. A frequently used scheduling strategy is to maintain a ready queue of operations for each device and schedule operations in first-in-first-out order. However, this scheduling strategy does not take into account the downstream operations placed on other devices that might be blocked by an operation, and often leads to schedules with underutilized devices. To find schedules that can keep track of such cross-device dependencies, our approach uses a priority-based scheduling algorithm that schedules operations in the ready queue based on the priority of each. Similar to device placement, operation scheduling can then be formulated as the problem of learning a policy that assigns a priority for each node in the graph to maximize a reward based on run time.
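To make the difference between the two strategies concrete, here is a minimal Python sketch of a ready queue that can run either first-in-first-out or by priority. It is an illustration only, not GO's implementation: the function name `schedule`, the four-operation graph, and the hand-picked priorities are all hypothetical, a single ready queue stands in for the per-device queues described above, and in GO the priorities would come from the learned policy.

```python
import heapq
from collections import deque


def schedule(deps, priority=None):
    """Return an execution order for the ops in `deps` (op -> list of prerequisites).

    With priority=None, ready ops run first-in-first-out. Otherwise the ready
    op with the highest priority value runs first.
    """
    remaining = {op: len(pre) for op, pre in deps.items()}
    consumers = {op: [] for op in deps}
    for op, pre in deps.items():
        for p in pre:
            consumers[p].append(op)

    ready_ops = [op for op in deps if remaining[op] == 0]
    if priority is None:
        ready = deque(ready_ops)
        push, pop = ready.append, ready.popleft
    else:
        # heapq is a min-heap, so store negated priorities.
        ready = [(-priority[op], op) for op in ready_ops]
        heapq.heapify(ready)
        push = lambda op: heapq.heappush(ready, (-priority[op], op))
        pop = lambda: heapq.heappop(ready)[1]

    order = []
    while ready:
        op = pop()
        order.append(op)
        for nxt in consumers[op]:
            remaining[nxt] -= 1
            if remaining[nxt] == 0:  # all inputs available: op becomes ready
                push(nxt)
    return order


# Hypothetical graph: "c" waits on "b", and "d" waits on both "a" and "c".
deps = {"a": [], "b": [], "c": ["b"], "d": ["a", "c"]}
fifo_order = schedule(deps)
prio_order = schedule(deps, priority={"a": 1, "b": 3, "c": 2, "d": 0})
```

In this toy graph the FIFO queue runs `a` before `b`, even though `b` is what unblocks `c`; giving `b` a higher priority lets the chain `b`, `c` start first.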

The third optimization task is operation fusion. For brevity we omit a detailed discussion of this problem here, and instead just note that similar to priority-based scheduling, operation fusion can also use a priority-based algorithm to decide which nodes to fuse. The goal of the policy network in this case is again to assign a priority for each node in the graph.

Finally, it is important to recognize that the decisions taken in each of the three optimization problems can affect the optimal decision for the other problems. For example, placing two nodes on two different devices effectively disables fusion and introduces a communication delay that can influence scheduling.

RL Policy Network Architecture
Our research presents GO, a deep RL framework that can be adapted to solve each of the aforementioned optimization problems — both individually as well as jointly. There are three key aspects of the proposed architecture:

First, we use graph neural networks (specifically GraphSAGE) to capture the topological information encoded in the computational graph. The inductive network of GraphSAGE leverages node attribute information to generalize to previously unseen graphs, which enables decision making for unseen data without incurring significant cost on training.

Second, computational graphs for many models often contain more than 10k nodes. Solving the optimization problems effectively over such large scales requires that the network is able to capture long-range dependencies between nodes. GO’s architecture includes a scalable attention network that uses segment-level recurrence to capture such long-range node dependencies.

Third, ML compilers need to solve optimization problems over a wide variety of graphs from different application domains. A naive strategy of training a shared policy network with heterogeneous graphs is unlikely to capture the idiosyncrasies of a particular class of graphs. To overcome this, GO uses a feature modulation mechanism that allows the network to specialize for specific graph types without increasing the number of parameters.

Overview of GO: An end-to-end graph policy network that combines graph embedding and sequential attention.

To jointly solve multiple dependent optimization tasks, GO has the ability to add additional recurrent attention layers for each task with parameters shared across different tasks. The recurrent attention layers with residual connections of actions enables tracking inter-task dependencies.

Multi-task policy network that extends GO’s policy network with additional recurrent attention layers for each task and residual connections. GE: Graph Embedding, FC: Fully-Connected Layer, Nxf: fusion action dimension, Fxd: placement action dimension, Nxs: scheduling action dimension.

Results
Next, we present evaluation results on a single-task speedup on a device placement task based on real-hardware measurements, generalization to unseen graphs with different GO variants, and multi-task performance jointly optimizing operations fusion, device placement, and scheduling.

Speedup:
To evaluate the performance of this architecture, we apply GO to a device placement problem based on real-hardware evaluation, where we first train the model separately on each of our workloads. This approach, called GO-one, consistently outperforms expert manual placement (HP), TensorFlow METIS placement, and Hierarchical Device Placement (HDP) — the current state-of-the-art reinforcement learning-based device placement. Importantly, with the efficient end-to-end single-shot placement, GO-one has a 15x speedup in convergence time of the placement network over HDP.

Because GO is designed to scale to extremely large graphs of over 80,000 nodes, such as an 8-layer Google Neural Machine Translation (GNMT) model, it outperforms previous approaches, including HDP, REGAL, and Placeto. GO achieves optimized graph runtimes for large graphs like GNMT that are 21.7% and 36.5% faster than HP and HDP, respectively. Overall, GO-one achieves on average 20.5% and 18.2% run time reduction across a diverse set of 14 graphs, compared to HP and HDP respectively.

Generalization:
GO generalizes to unseen graphs using offline pre-training followed by fine-tuning on the unseen graphs. During pre-training, we train GO on heterogeneous subsets of graphs from the training set. We train GO for 1000 steps on each such batch of graphs before switching to the next. This pretrained model is then fine-tuned (GO-generalization+finetune) on hold-out graphs for fewer than 50 steps, which typically takes less than one minute. GO-generalization+finetune for hold-out graphs outperforms both expert placement and HDP consistently on all datasets, and on average matches GO-one.

We also run inference directly on just the pre-trained model without any fine-tuning for the target hold-out graphs, and name this GO-generalization-zeroshot. The performance of this untuned model is only marginally worse than GO-generalization+finetune, while being slightly better than expert placement and HDP. This indicates that both graph embedding and the learned policies transfer efficiently, allowing the model to generalize to the unseen data.

Generalization across heterogeneous workload graphs. The figure shows a comparison of two different generalization strategies for GO when trained with graphs from 5 (except the held-out one) of the 6 workloads (Inception-v3, AmoebaNet, recurrent neural network language model (RNNLM), Google Neural Machine Translation (GNMT), Transformer-XL (TRFXL), WaveNet), and evaluated on the held-out workload (x-axis).

Co-optimizing placement, scheduling, and fusion (pl+sch+fu):
Optimizing simultaneously for placement, scheduling and fusion provides 30%-73% speedup compared to the single-GPU unoptimized case and 33%-60% speedup compared to TensorFlow default placement, scheduling, and fusion. Compared to optimizing each task individually, multi-task GO (pl+sch+fu) outperforms single-task GO (pl | sch | fu), which optimizes all tasks one at a time, by an average of 7.8%. Furthermore, for all workloads, co-optimizing all three tasks offers faster run time than optimizing any two of them and using the default policy for the third.

Run time for various workloads on multi-task optimizations. TF-default: TF GPU default placement, fusion, and scheduling. hp-only: human placement only with default scheduling and fusion. pl-only: GO placement only with default scheduling and fusion. pl | sch: GO optimizes placement and scheduling individually with default fusion. pl+sch: multi-task GO co-optimizes placement and scheduling with default fusion. sch+fu: multi-task GO co-optimizes scheduling and fusion with human placement. pl | sch | fu: GO optimizes placement, scheduling, and fusion separately. pl+sch+fu: multi-task GO co-optimizes placement, scheduling, and fusion.

Conclusion
The increasing complexity and diversity of hardware accelerators has made the development of robust and adaptable ML frameworks onerous and time-consuming, often requiring multiple years of effort from hundreds of engineers. In this article, we demonstrated that many of the optimization problems in such frameworks can be solved efficiently and optimally using a carefully designed learned approach.

Acknowledgements
This is joint work with Daniel Wong, Amirali Abdolrashidi, Peter Ma, Qiumin Xu, Hanxiao Liu, Mangpo Phitchaya Phothilimtha, Shen Wang, Anna Goldie, Azalia Mirhoseini, and James Laudon.


Convolutional LSTM for spatial forecasting

This post is the first in a loose series exploring forecasting
of spatially-determined data over time. By spatially-determined I
mean that whatever the quantities we’re trying to predict – be
they univariate or multivariate time series, of spatial
dimensionality or not – the input data are given on a spatial
grid.

For example, the input could be atmospheric measurements, such
as sea surface temperature or pressure, given at some set of
latitudes and longitudes. The target to be predicted could then
span that same (or another) grid. Alternatively, it could be a
univariate time series, like a meteorological index.

But wait a second, you may be thinking. For time-series
prediction, we have that time-honored set of recurrent
architectures (e.g., LSTM, GRU), right? Right. We do; but, once we
feed spatial data to an RNN, treating different locations as
different input features, we lose an essential structural
relationship. Importantly, we need to operate in both space and
time. We want both: recurrence relations and convolutional filters.
Enter convolutional RNNs.

What to expect from this post

Today, we won’t jump into real-world applications just yet.
Instead, we’ll take our time to build a convolutional LSTM
(henceforth: convLSTM) in torch. For one, we have to – there is
no official PyTorch implementation.

Keras, on the other hand, has one. If you’re interested in
quickly playing around with a Keras convLSTM, check out this
nice example.

What’s more, this post can serve as an introduction to
building your own modules. This is something you may be familiar
with from Keras or not – depending on whether you’ve used
custom models or rather, preferred the declarative define ->
compile -> fit style. (Yes, I’m implying there’s some
transfer going on if one comes to torch from Keras custom training.
Syntactic and semantic details may be different, but both share the
object-oriented style that allows for great flexibility and
control.)

Last but not least, we’ll also use this as a hands-on
experience with RNN architectures (the LSTM, specifically). While
the general concept of recurrence may be easy to grasp, it is not
necessarily self-evident how those architectures should, or could,
be coded. Personally, I find that independent of the framework
used, RNN-related documentation leaves me confused. What exactly is
being returned from calling an LSTM, or a GRU? (In Keras this
depends on how you’ve defined the layer in question.) I suspect
that once we’ve decided what we want to return, the actual code
won’t be that complicated. Consequently, we’ll take a detour
clarifying what it is that torch and Keras are giving us.
Implementing our convLSTM will be a lot more straightforward
thereafter.

A torch convLSTM

The code discussed here may be found on GitHub. (Depending on
when you’re reading this, the code in that repository may have
evolved though.)

My starting point was one of the PyTorch implementations found
on the net, namely, this one. If you search for “PyTorch convGRU”
or “PyTorch
convLSTM”, you will find stunning discrepancies in how these are
realized – discrepancies not just in syntax and/or engineering
ambition, but on the semantic level, right at the center of what
the architectures may be expected to do. As they say, let the buyer
beware. (Regarding the implementation I ended up porting, I am
confident that while numerous optimizations will be possible, the
basic mechanism matches my expectations.)

What do I expect? Let’s approach this task in a top-down
way.

Input and output

The convLSTM’s input will be a time series of spatial data,
each observation being of size (time steps, channels, height,
width).

Compare this with the usual RNN input format, be it in torch or
Keras. In both frameworks, RNNs expect tensors of size (timesteps,
input_dim)¹. input_dim is 1 for univariate time series and greater
than 1 for
multivariate ones. Conceptually, we may match this to convLSTM’s
channels dimension: There could be a single channel, for
temperature, say – or there could be several, such as for
pressure, temperature, and humidity. The two additional dimensions
found in convLSTM, height and width, are spatial indexes into the
data.

In sum, we want to be able to pass data that:

  • consist of one or more features,

  • evolve in time, and

  • are indexed in two spatial dimensions.
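As a framework-neutral illustration of these shapes, the following plain-Python sketch builds one dummy observation for each format using nested lists. The helper `shape` and the concrete sizes are made up for this example; a real implementation would use torch tensors instead.

```python
def shape(x):
    """Return the dimensions of a regularly nested list."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


# Hypothetical sizes, chosen only for illustration.
timesteps, channels, height, width = 4, 3, 5, 6

# Usual RNN input: one feature vector per time step.
rnn_obs = [[0.0] * channels for _ in range(timesteps)]

# convLSTM input: one (channels, height, width) grid per time step.
convlstm_obs = [
    [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for _ in range(timesteps)
]

rnn_shape = shape(rnn_obs)            # (timesteps, input_dim)
convlstm_shape = shape(convlstm_obs)  # (timesteps, channels, height, width)
```

The channels dimension plays the role of input_dim; height and width are the two extra spatial indexes.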

How about the output? We want to be able to return forecasts for
as many time steps as we have in the input sequence. This is
something that torch RNNs do by default, while Keras equivalents do
not. (You have to pass return_sequences = TRUE to obtain that
effect.) If we’re interested in predictions for just a single
point in time, we can always pick the last time step in the output
tensor.

However, with RNNs, it is not all about outputs. RNN
architectures also carry through hidden states.

What are hidden states? I carefully phrased that sentence to be
as general as possible – deliberately circling around the
confusion that, in my view, often arises at this point. We’ll
attempt to clear up some of that confusion in a second, but let’s
first finish our high-level requirements specification.

We want our convLSTM to be usable in different contexts and
applications. Various architectures exist that make use of hidden
states, most prominently perhaps, encoder-decoder architectures.
Thus, we want our convLSTM to return those as well. Again, this is
something a torch LSTM does by default, while in Keras it is
achieved using return_state = TRUE.

Now though, it really is time for that interlude. We’ll sort
out the ways things are called by both torch and Keras, and inspect
what you get back from their respective GRUs and LSTMs.

Interlude: Outputs, states, hidden values … what’s what?

For this to remain an interlude, I summarize findings on a high
level. The code snippets in the appendix show how to arrive at
these results. Heavily commented, they probe return values from
both Keras and torch GRUs and LSTMs. Running these will make the
upcoming summaries seem a lot less abstract.

First, let’s look at the ways you create an LSTM in both
frameworks. (I will generally use LSTM as the “prototypical RNN
example”, and just mention GRUs when there are differences
significant in the context in question.)

In Keras, to create an LSTM you may write something like
this:

lstm <- layer_lstm(units = 1)

The torch equivalent would be:

lstm <- nn_lstm(
  input_size = 2, # number of input features
  hidden_size = 1 # number of hidden (and output!) features
)

Don’t focus on torch’s input_size parameter for this
discussion. (It’s the number of features in the input tensor.)
The parallel occurs between Keras’ units and torch’s
hidden_size. If you’ve been using Keras, you’re probably
thinking of units as the thing that determines output size
(equivalently, the number of features in the output). So when torch
lets us arrive at the same result using hidden_size, what does that
mean? It means that somehow we’re specifying the same thing,
using different terminology. And it does make sense, since at every
time step current input and previous hidden state are added²:

\[ \mathbf{h}_t = \mathbf{W}_{x}\mathbf{x}_t + \mathbf{W}_{h}\mathbf{h}_{t-1} \]

Now, about those hidden states.

When a Keras LSTM is defined with return_state = TRUE, its
return value is a structure of three entities called output, memory
state, and carry state. In torch, the same entities are referred to
as output, hidden state, and cell state. (In torch, we always get
all of them.)

So are we dealing with three different types of entities? We are
not.

The cell, or carry, state is that special thing that sets LSTMs
apart from GRUs, deemed responsible for the “long” in “long
short-term memory”. Technically, it could be reported to the user
at all points in time; as we’ll see shortly though, it is
not.

What about outputs and hidden, or memory states? Confusingly,
these really are the same thing. Recall that for each input item in
the input sequence, we’re combining it with the previous state,
resulting in a new state, to be made use of in the next
step³:

\[ \mathbf{h}_t = \mathbf{W}_{x}\mathbf{x}_t + \mathbf{W}_{h}\mathbf{h}_{t-1} \]

Now, say that we’re interested in looking at just the final
time step – that is, the default output of a Keras LSTM. From
that point of view, we can consider those intermediate computations
as “hidden”. Seen like that, output and hidden states feel
different.

However, we can also request to see the outputs for every time
step. If we do so, there is no difference – the
outputs (plural) equal the hidden states. This can
be verified using the code in the appendix.
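To make the point concrete outside of either framework, here is a minimal sketch of the simplified recurrence above, written in Python purely for illustration (no gates, no nonlinearity). The weights, the input sequence, and the helper names matvec and run_rnn are made up for this example; the only thing it is meant to show is that hidden_size fixes the size of the state and of every per-step output, and that the list of outputs is the list of hidden states.

```python
# Simplified recurrence h_t = W_x x_t + W_h h_{t-1}, pure Python.
# All numbers below are illustrative assumptions, not values from torch or Keras.

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def run_rnn(W_x, W_h, xs):
    """Return (outputs for every time step, final hidden state)."""
    hidden_size = len(W_h)
    h = [0.0] * hidden_size          # initial hidden state
    outputs = []
    for x in xs:
        h = [a + b for a, b in zip(matvec(W_x, x), matvec(W_h, h))]
        outputs.append(h)            # the per-step output *is* the hidden state
    return outputs, h

# input_size = 2, hidden_size = 1, mirroring the nn_lstm() call earlier in the post
W_x = [[0.5, -1.0]]                  # shape (hidden_size, input_size)
W_h = [[0.5]]                        # shape (hidden_size, hidden_size)
xs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

outputs, h_last = run_rnn(W_x, W_h, xs)
print(outputs)   # [[0.5], [-0.75], [-1.375]]
print(h_last)    # [-1.375], identical to outputs[-1]
```

Asking only for the final time step returns h_last; asking for all time steps returns outputs. Neither view contains anything the other does not.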

Thus, of the three things returned by an LSTM, two are really
the same. How about the GRU, then? As there is no “cell state”,
we really have just one type of thing left over – call it outputs
or hidden states.

Let’s summarize this in a table.

Table 1: RNN terminology. Comparing torch-speak and
Keras-speak. In row 1, the terms are parameter names. In rows 2 and
3, they are pulled from current documentation.

Referring to this entity: | torch says: | Keras says:
Number of features in the output. This determines both how many output features there are and the dimensionality of the hidden states. | hidden_size | units
Per-time-step output; latent state; intermediate state … This could be named “public state” in the sense that we, the users, are able to obtain all values. | hidden state | memory state
Cell state; inner state … (LSTM only). This could be named “private state” in that we are able to obtain a value only for the last time step. More on that in a second. | cell state | carry state

Now, about that public vs. private distinction. In both
frameworks, we can obtain outputs (hidden states) for every time
step. The cell state, however, we can access only for the very last
time step. This is purely an implementation decision. As we’ll
see when building our own recurrent module, there are no obstacles
inherent in keeping track of cell states and passing them back to
the user.

If you dislike the pragmatism of this distinction, you can
always go with the math. When a new cell state has been computed
(based on prior cell state, input, forget, and cell gates – the
specifics of which we are not going to get into here), it is
transformed to the hidden (a.k.a. output) state making use of yet
another, namely, the output gate:

\[ h_t = o_t \odot \tanh(c_t) \]

Definitely, then, hidden state (output, resp.) builds on cell
state, adding additional modeling power.
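As a rough illustration of how the hidden state is derived from the cell state, here is a single-unit LSTM update in plain Python. This is a sketch under simplifying assumptions: the four gate pre-activations are passed in directly as hypothetical numbers, whereas in a real LSTM they would be computed from the current input and the previous hidden state. The function name lstm_step is made up for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(pre_i, pre_f, pre_o, pre_g, c_prev):
    """One LSTM update for a single unit, given gate pre-activations."""
    i = sigmoid(pre_i)      # input gate
    f = sigmoid(pre_f)      # forget gate
    o = sigmoid(pre_o)      # output gate
    g = math.tanh(pre_g)    # candidate values (the "cell gate")
    c_next = f * c_prev + i * g        # new cell state
    h_next = o * math.tanh(c_next)     # hidden state: a gated view of the cell state
    return h_next, c_next

# With all pre-activations at zero, every sigmoid gate is 0.5 and g is 0,
# so the cell state is simply halved: c_next = 0.5 * 2.0 = 1.0
h, c = lstm_step(pre_i=0.0, pre_f=0.0, pre_o=0.0, pre_g=0.0, c_prev=2.0)
print(c)   # 1.0
print(h)   # 0.5 * tanh(1.0)
```

Because the output gate lies between 0 and 1, the hidden state can never exceed tanh(c_t) in magnitude; it exposes a filtered version of what the cell state holds.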

Now it is time to get back to our original goal and build that
convLSTM. First though, let’s summarize the return values
obtainable from torch and Keras.

Table 2: Contrasting ways of obtaining various return values
in torch vs. Keras. Cf. the appendix for complete examples.

To achieve this goal: | in torch do: | in Keras do:
access all intermediate outputs (= per-time-step outputs) | ret[[1]] | return_sequences = TRUE
access both “hidden state” (output) and “cell state” from final time step (only!) | ret[[2]] | return_state = TRUE
access all intermediate outputs and the final “cell state” | both of the above | return_sequences = TRUE, return_state = TRUE
access all intermediate outputs and “cell states” from all time steps | no way | no way

convLSTM, the plan

In both torch and Keras RNN architectures, single time steps are
processed by corresponding Cell classes: There is an LSTM Cell
matching the LSTM, a GRU Cell matching the GRU, and so on. We do
the same for ConvLSTM. In convlstm_cell(), we first define what
should happen to a single observation; then in convlstm(), we build
up the recurrence logic.

Once we’re done, we create a dummy dataset, as
reduced-to-the-essentials as can be. With more complex datasets,
even artificial ones, chances are that if we don’t see any
training progress, there are hundreds of possible explanations. We
want a sanity check that, if failed, leaves no excuses. Realistic
applications are left to future posts.

A single step: convlstm_cell

Our convlstm_cell’s constructor takes arguments input_dim,
hidden_dim, and bias, just like a torch LSTM Cell.

But we’re processing two-dimensional input data. Instead of
the usual affine combination of new input and previous state, we
use a convolution of kernel size kernel_size. Inside convlstm_cell,
it is self$conv that takes care of this.

Note how the channels dimension, which in the original input
data would correspond to different variables, is creatively used to
consolidate four convolutions into one: Each channel output will be
passed to just one of the four cell gates. Once in possession of
the convolution output, forward() applies the gate logic, resulting
in the two types of states it needs to send back to the caller.

library(torch)
library(zeallot)

convlstm_cell <- nn_module(

  initialize = function(input_dim, hidden_dim, kernel_size, bias) {

    self$hidden_dim <- hidden_dim

    padding <- kernel_size %/% 2

    self$conv <- nn_conv2d(
      in_channels = input_dim + self$hidden_dim,
      # for each of input, forget, output, and cell gates
      out_channels = 4 * self$hidden_dim,
      kernel_size = kernel_size,
      padding = padding,
      bias = bias
    )
  },

  forward = function(x, prev_states) {

    c(h_prev, c_prev) %<-% prev_states

    combined <- torch_cat(list(x, h_prev), dim = 2)  # concatenate along channel axis
    combined_conv <- self$conv(combined)
    c(cc_i, cc_f, cc_o, cc_g) %<-% torch_split(combined_conv, self$hidden_dim, dim = 2)

    # input, forget, output, and cell gates (corresponding to torch's LSTM)
    i <- torch_sigmoid(cc_i)
    f <- torch_sigmoid(cc_f)
    o <- torch_sigmoid(cc_o)
    g <- torch_tanh(cc_g)

    # cell state
    c_next <- f * c_prev + i * g
    # hidden state
    h_next <- o * torch_tanh(c_next)

    list(h_next, c_next)
  },

  init_hidden = function(batch_size, height, width) {
    list(
      torch_zeros(batch_size, self$hidden_dim, height, width, device = self$conv$weight$device),
      torch_zeros(batch_size, self$hidden_dim, height, width, device = self$conv$weight$device))
  }
)
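One detail worth spelling out is the padding of kernel_size %/% 2. The hidden state is concatenated with the next input along the channel axis, so the convolution has to preserve height and width. The Python sketch below applies the standard output-size formula for a convolution (stride 1, dilation 1 assumed); the helper name conv_output_size is made up for this illustration.

```python
def conv_output_size(size, kernel_size, padding, stride=1):
    """Spatial output size of a convolution along one dimension (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1

# For odd kernel sizes, padding = kernel_size // 2 keeps the size unchanged
for k in (1, 3, 5, 7):
    padding = k // 2                       # same rule as kernel_size %/% 2 in the R code
    assert conv_output_size(16, k, padding) == 16

# With an even kernel size the same rule no longer preserves the size:
print(conv_output_size(16, 4, 4 // 2))     # 17
```

With an even kernel size, the hidden state would come out one pixel larger than the input and the concatenation in the next time step would fail, so odd kernel sizes are the safe choice here.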

Now convlstm_cell has to be called for every time step. This is
done by convlstm.

Iteration over time steps: convlstm

A convlstm may consist of several layers, just like a torch
LSTM. For each layer, we are able to specify hidden and kernel
sizes individually.

During initialization, each layer gets its own convlstm_cell. On
call, convlstm executes two loops. The outer one iterates over
layers. At the end of each iteration, we store the final pair
(hidden state, cell state) for later reporting. The inner loop runs
over input sequences, calling convlstm_cell at each time step.

We also keep track of intermediate outputs, so we’ll be able
to return the complete list of hidden_states seen during the
process. Unlike a torch LSTM, we do this for every layer.

convlstm <- nn_module(

  # hidden_dims and kernel_sizes are vectors, with one element for each layer in n_layers
  initialize = function(input_dim, hidden_dims, kernel_sizes, n_layers, bias = TRUE) {

    self$n_layers <- n_layers

    self$cell_list <- nn_module_list()

    for (i in 1:n_layers) {
      cur_input_dim <- if (i == 1) input_dim else hidden_dims[i - 1]
      self$cell_list$append(convlstm_cell(cur_input_dim, hidden_dims[i], kernel_sizes[i], bias))
    }
  },

  # we always assume batch-first
  forward = function(x) {

    c(batch_size, seq_len, num_channels, height, width) %<-% x$size()

    # initialize hidden states
    init_hidden <- vector(mode = "list", length = self$n_layers)
    for (i in 1:self$n_layers) {
      init_hidden[[i]] <- self$cell_list[[i]]$init_hidden(batch_size, height, width)
    }

    # list containing the outputs, of length seq_len, for each layer
    # this is the same as h, at each step in the sequence
    layer_output_list <- vector(mode = "list", length = self$n_layers)

    # list containing the last states (h, c) for each layer
    layer_state_list <- vector(mode = "list", length = self$n_layers)

    cur_layer_input <- x
    hidden_states <- init_hidden

    # loop over layers
    for (i in 1:self$n_layers) {

      # every layer's hidden state starts from 0 (non-stateful)
      c(h, c) %<-% hidden_states[[i]]
      # outputs, of length seq_len, for this layer
      # equivalently, list of h states for each time step
      output_sequence <- vector(mode = "list", length = seq_len)

      # loop over time steps
      for (t in 1:seq_len) {
        c(h, c) %<-% self$cell_list[[i]](cur_layer_input[ , t, , , ], list(h, c))
        # keep track of output (h) for every time step
        # h has dim (batch_size, hidden_size, height, width)
        output_sequence[[t]] <- h
      }

      # stack hs for all time steps over seq_len dimension
      # stacked_outputs has dim (batch_size, seq_len, hidden_size, height, width)
      # same as input to forward (x)
      stacked_outputs <- torch_stack(output_sequence, dim = 2)

      # pass the list of outputs (hs) to next layer
      cur_layer_input <- stacked_outputs

      # keep track of list of outputs for this layer
      layer_output_list[[i]] <- stacked_outputs
      # keep track of last state for this layer
      layer_state_list[[i]] <- list(h, c)
    }

    list(layer_output_list, layer_state_list)
  }
)

Calling the convlstm

Let’s see the input format expected by convlstm, and how to
access its different outputs.

Here is a suitable input tensor.

# batch_size, seq_len, channels, height, width
x <- torch_rand(c(2, 4, 3, 16, 16))

First we make use of a single layer.

model <- convlstm(input_dim = 3, hidden_dims = 5, kernel_sizes = 3, n_layers = 1)

c(layer_outputs, layer_last_states) %<-% model(x)

We get back a list of length two, which we immediately split up
into the two types of output returned: intermediate outputs from
all layers, and final states (of both types) for each
layer.

With just a single layer, layer_outputs[[1]] holds all of the
layer’s intermediate outputs, stacked on dimension two.

dim(layer_outputs[[1]])
# [1]  2  4  5 16 16

layer_last_states[[1]] is a list of tensors, the first of which
holds the single layer’s final hidden state, and the second, its
final cell state.

dim(layer_last_states[[1]][[1]])
# [1]  2  5 16 16
dim(layer_last_states[[1]][[2]])
# [1]  2  5 16 16

For comparison, this is how return values look for a multi-layer
architecture.

model <- convlstm(input_dim = 3, hidden_dims = c(5, 5, 1), kernel_sizes = rep(3, 3), n_layers = 3)
c(layer_outputs, layer_last_states) %<-% model(x)

# for each layer, tensor of size (batch_size, seq_len, hidden_size, height, width)
dim(layer_outputs[[1]])
# 2 4 5 16 16
dim(layer_outputs[[3]])
# 2 4 1 16 16

# list of 2 tensors for each layer
str(layer_last_states)
# List of 3
#  $ :List of 2
#   ..$ :Float [1:2, 1:5, 1:16, 1:16]
#   ..$ :Float [1:2, 1:5, 1:16, 1:16]
#  $ :List of 2
#   ..$ :Float [1:2, 1:5, 1:16, 1:16]
#   ..$ :Float [1:2, 1:5, 1:16, 1:16]
#  $ :List of 2
#   ..$ :Float [1:2, 1:1, 1:16, 1:16]
#   ..$ :Float [1:2, 1:1, 1:16, 1:16]

# h, of size (batch_size, hidden_size, height, width)
dim(layer_last_states[[3]][[1]])
# 2 1 16 16

# c, of size (batch_size, hidden_size, height, width)
dim(layer_last_states[[3]][[2]])
# 2 1 16 16

Now we want to sanity-check this module with the
simplest-possible dummy data.

Sanity-checking the convlstm

We generate black-and-white “movies” of diagonal beams
successively translated in space.

Each sequence consists of six time steps, and each beam of six
pixels. Just a single sequence is created manually. To create that
one sequence, we start from a single beam:

library(torchvision)

beams <- vector(mode = "list", length = 6)
beam <- torch_eye(6) %>% nnf_pad(c(6, 12, 12, 6)) # left, right, top, bottom
beams[[1]] <- beam

Using torch_roll(), we create a pattern where this beam moves
up diagonally, and stack the individual tensors along the timesteps
dimension.

for (i in 2:6) {
  beams[[i]] <- torch_roll(beam, c(-(i-1),i-1), c(1, 2))
}

init_sequence <- torch_stack(beams, dim = 1)
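To see what the shifts do to the beam, here is a small pure-Python re-creation of the same idea. It is a sketch rather than a port: roll2d is a hand-written circular shift standing in for torch_roll(), indices are zero-based, and the frame is built directly instead of via padding.

```python
def roll2d(grid, shift_rows, shift_cols):
    """Circularly shift a 2D list: positive shifts move content down / right."""
    n_rows, n_cols = len(grid), len(grid[0])
    out = [[0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            out[(r + shift_rows) % n_rows][(c + shift_cols) % n_cols] = grid[r][c]
    return out

# 24 x 24 frame holding a 6-pixel diagonal beam,
# placed as if padded by 6 (left), 12 (right), 12 (top), 6 (bottom)
size = 24
beam = [[0] * size for _ in range(size)]
for k in range(6):
    beam[12 + k][6 + k] = 1

# time steps 1..6: shift up by (t - 1) rows and right by (t - 1) columns
frames = [roll2d(beam, -(t - 1), t - 1) for t in range(1, 7)]

def lit_pixels(frame):
    """Coordinates (row, col) of all pixels set to 1, in row order."""
    return [(r, c) for r, row in enumerate(frame) for c, v in enumerate(row) if v]

print(lit_pixels(frames[0])[0])   # (12, 6)
print(lit_pixels(frames[5])[0])   # (7, 11)
```

From the first to the sixth frame, the top end of the beam moves five rows up and five columns to the right, which is the upward-diagonal motion the model is later asked to predict.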

That’s a single sequence. Thanks to
torchvision::transform_random_affine(), we almost effortlessly
produce a dataset of a hundred sequences. Moving beams start at
random points in the spatial frame, but they all share that
upward-diagonal motion.

sequences <- vector(mode = "list", length = 100)
sequences[[1]] <- init_sequence

for (i in 2:100) {
  sequences[[i]] <- transform_random_affine(init_sequence, degrees = 0, translate = c(0.5, 0.5))
}

input <- torch_stack(sequences, dim = 1)

# add channels dimension
input <- input$unsqueeze(3)
dim(input)
# [1] 100   6   1  24  24

That’s it for the raw data. Now we still need a dataset and a
dataloader. Of the six time steps, we use the first five as input
and try to predict the last one.

dummy_ds <- dataset(

  initialize = function(data) {
    self$data <- data
  },

  .getitem = function(i) {
    list(x = self$data[i, 1:5, ..], y = self$data[i, 6, ..])
  },

  .length = function() {
    nrow(self$data)
  }
)

ds <- dummy_ds(input)
dl <- dataloader(ds, batch_size = 100)

Here is a tiny-ish convLSTM, trained for motion prediction:

model <- convlstm(input_dim = 1, hidden_dims = c(64, 1), kernel_sizes = c(3, 3), n_layers = 2)

optimizer <- optim_adam(model$parameters)

num_epochs <- 100

for (epoch in 1:num_epochs) {

  model$train()
  batch_losses <- c()

  for (b in enumerate(dl)) {

    optimizer$zero_grad()

    # last-time-step output from last layer
    preds <- model(b$x)[[2]][[2]][[1]]

    loss <- nnf_mse_loss(preds, b$y)
    batch_losses <- c(batch_losses, loss$item())

    loss$backward()
    optimizer$step()
  }

  if (epoch %% 10 == 0)
    cat(sprintf("\nEpoch %d, training loss:%3f\n", epoch, mean(batch_losses)))
}
Epoch 10, training loss:0.008522
Epoch 20, training loss:0.008079
Epoch 30, training loss:0.006187
Epoch 40, training loss:0.003828
Epoch 50, training loss:0.002322
Epoch 60, training loss:0.001594
Epoch 70, training loss:0.001376
Epoch 80, training loss:0.001258
Epoch 90, training loss:0.001218
Epoch 100, training loss:0.001171

Loss decreases, but that in itself is not a guarantee the model
has learned anything. Has it? Let’s inspect its forecast for the
very first sequence and see.

For printing, I’m zooming in on the relevant region in the
24×24-pixel frame. Here is the ground truth for time step six:

b$y[1, 1, 6:15, 10:19]
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0

And here is the forecast. This does not look bad at all, given
there was neither experimentation nor tuning involved.

round(as.matrix(preds[1, 1, 6:15, 10:19]), 2)
       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
 [1,]  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00     0
 [2,] -0.02  0.36  0.01  0.06  0.00  0.00  0.00  0.00  0.00     0
 [3,]  0.00 -0.01  0.71  0.01  0.06  0.00  0.00  0.00  0.00     0
 [4,] -0.01  0.04  0.00  0.75  0.01  0.06  0.00  0.00  0.00     0
 [5,]  0.00 -0.01 -0.01 -0.01  0.75  0.01  0.06  0.00  0.00     0
 [6,]  0.00  0.01  0.00 -0.07 -0.01  0.75  0.01  0.06  0.00     0
 [7,]  0.00  0.01 -0.01 -0.01 -0.07 -0.01  0.75  0.01  0.06     0
 [8,]  0.00  0.00  0.01  0.00  0.00 -0.01  0.00  0.71  0.00     0
 [9,]  0.00  0.00  0.00  0.01  0.01  0.00  0.03 -0.01  0.37     0
[10,]  0.00  0.00  0.00  0.00  0.00  0.00 -0.01 -0.01 -0.01     0

This should suffice for a sanity check. If you made it till the
end, thanks for your patience! In the best case, you’ll be able
to apply this architecture (or a..


torch 0.2.0 – Initial JIT support and many bug fixes

We are happy to announce that version 0.2.0 of torch has just
landed on CRAN.

This release includes many bug fixes and some nice new features
that we will present in this blog post. You can see the full
changelog in the NEWS.md
file.

The features that we will discuss in detail are:

  • Initial support for JIT tracing
  • Multi-worker dataloaders
  • Print methods for nn_modules

Multi-worker dataloaders

dataloaders now respond to the num_workers argument and will run
the pre-processing in parallel workers.

For example, say we have the following dummy dataset that does a
long computation:

library(torch)

dat <- dataset(
  "mydataset",
  initialize = function(time, len = 10) {
    self$time <- time
    self$len <- len
  },
  .getitem = function(i) {
    Sys.sleep(self$time)
    torch_randn(1)
  },
  .length = function() {
    self$len
  }
)

ds <- dat(1)
system.time(ds[1])
   user  system elapsed
  0.029   0.005   1.027

We will now create two dataloaders, one that executes
sequentially and another executing in parallel.

seq_dl <- dataloader(ds, batch_size = 5)
par_dl <- dataloader(ds, batch_size = 5, num_workers = 2)

We can now compare the time it takes to process two batches
sequentially to the time it takes in parallel:

seq_it <- dataloader_make_iter(seq_dl)
par_it <- dataloader_make_iter(par_dl)

two_batches <- function(it) {
  dataloader_next(it)
  dataloader_next(it)
  "ok"
}

system.time(two_batches(seq_it))
system.time(two_batches(par_it))
   user  system elapsed
  0.098   0.032  10.086
   user  system elapsed
  0.065   0.008   5.134

Note that it is batches that are obtained in parallel, not
individual observations. This way, we will be able to support
datasets with variable batch sizes in the future.

Using multiple workers is not necessarily
faster than serial execution because there’s a considerable
overhead when passing tensors from a worker to the main session as
well as when initializing the workers.
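The timings above can be approximated with a back-of-the-envelope model. The Python sketch below is an idealization, not a description of how torch schedules work: it assumes every item costs the same, that whole batches are handed to workers, and that overhead is zero. The function name estimated_seconds is made up for this example.

```python
import math

def estimated_seconds(n_batches, batch_size, seconds_per_item, num_workers=0):
    """Idealized wall-clock time to fetch n_batches; ignores all overhead."""
    per_batch = batch_size * seconds_per_item
    if num_workers <= 1:
        return n_batches * per_batch
    # batches are handed out whole, so workers proceed in "rounds"
    rounds = math.ceil(n_batches / num_workers)
    return rounds * per_batch

print(estimated_seconds(2, 5, 1.0))                 # 10.0
print(estimated_seconds(2, 5, 1.0, num_workers=2))  # 5.0
print(estimated_seconds(3, 5, 1.0, num_workers=2))  # 10.0
```

The first two estimates are close to the measured 10.086 and 5.134 seconds; the small gap is the overhead mentioned above. The third line shows that adding workers helps only when there are enough batches to keep them busy.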

This feature is enabled by the powerful callr package and works on all
operating systems supported by torch. callr lets us create
persistent R sessions, and thus we pay the overhead of
transferring potentially large dataset objects to workers only once.

In the process of implementing this feature we have made
dataloaders behave like coro iterators. This means that
you can now use coro’s syntax for looping
through the dataloaders:

coro::loop(for(batch in par_dl) {
  print(batch$shape)
})
[1] 5 1
[1] 5 1

This is the first torch release including the multi-worker
dataloaders feature, and you might run into edge cases when using
it. Do let us know if you find any problems.

Initial JIT support

Programs that make use of the torch package are inevitably R
programs and thus, they always need an R installation in order to
execute.

As of version 0.2.0, torch allows users to JIT trace torch R
functions into TorchScript. JIT (just-in-time) tracing will invoke
an R function with example inputs, record all operations that
occurred when the function was run, and return a script_function
object containing the TorchScript representation.

The nice thing about this is that TorchScript programs are
easily serializable, optimizable, and they can be loaded by another
program written in PyTorch or LibTorch without requiring any R
dependency.

Suppose you have the following R function that takes a tensor,
and does a matrix multiplication with a fixed weight matrix and
then adds a bias term:

w <- torch_randn(10, 1)
b <- torch_randn(1)
fn <- function(x) {
  a <- torch_mm(x, w)
  a + b
}

This function can be JIT-traced into TorchScript with jit_trace
by passing the function and example inputs:

x <- torch_ones(2, 10)
tr_fn <- jit_trace(fn, x)
tr_fn(x)
torch_tensor
-0.6880
-0.6880
[ CPUFloatType{2,1} ]

Now all torch operations that happened when computing the result
of this function were traced and transformed into a graph:

tr_fn$graph
graph(%0 : Float(2:10, 10:1, requires_grad=0, device=cpu)):
  %1 : Float(10:1, 1:1, requires_grad=0, device=cpu) = prim::Constant[value=-0.3532  0.6490 -0.9255  0.9452 -1.2844  0.3011  0.4590 -0.2026 -1.2983  1.5800 [ CPUFloatType{10,1} ]]()
  %2 : Float(2:1, 1:1, requires_grad=0, device=cpu) = aten::mm(%0, %1)
  %3 : Float(1:1, requires_grad=0, device=cpu) = prim::Constant[value={-0.558343}]()
  %4 : int = prim::Constant[value=1]()
  %5 : Float(2:1, 1:1, requires_grad=0, device=cpu) = aten::add(%2, %3, %4)
  return (%5)
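The graph can be read as a small program, and replaying it by hand is a useful check that the weights and bias really were baked in as constants. The following Python sketch does this with plain lists; the constants are copied from the printed graph, so they carry only the digits shown there.

```python
# Constants copied from the printed graph (rounded to the digits shown there)
w = [-0.3532, 0.6490, -0.9255, 0.9452, -1.2844,
     0.3011, 0.4590, -0.2026, -1.2983, 1.5800]   # %1, shape (10, 1)
b = -0.558343                                     # %3
alpha = 1                                         # %4, scaling factor of aten::add

x = [[1.0] * 10, [1.0] * 10]                      # %0, same values as torch_ones(2, 10)

mm = [sum(xi * wi for xi, wi in zip(row, w)) for row in x]   # %2 = aten::mm(%0, %1)
out = [m + alpha * b for m in mm]                            # %5 = aten::add(%2, %3, %4)

print([round(v, 4) for v in out])   # [-0.688, -0.688]
```

This reproduces the -0.6880 returned by tr_fn(x). Note that w and b appear as prim::Constant nodes: the trace captured their values at tracing time, so changing w or b in R afterwards would not affect the traced function.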

The traced function can be serialized with jit_save:

jit_save(tr_fn, "linear.pt")

It can be reloaded in R with jit_load, but it can also be
reloaded in Python with torch.jit.load:

import torch
fn = torch.jit.load("linear.pt")
fn(torch.ones(2, 10))
tensor([[-0.6880],
        [-0.6880]])

How cool is that?!

This is just the initial support for JIT in R. We will continue
developing this. Specifically, in the next version of torch we plan
to support tracing nn_modules directly. Currently, you need to
detach all parameters before tracing them; see an example
here. This will also allow you to take advantage of TorchScript
to make your models run faster!

Also note that tracing has some limitations, especially when
your code has loops or control flow statements that depend on
tensor data. See ?jit_trace to learn more.

New print method for nn_modules

In this release we have also improved the nn_module printing
methods in order to make it easier to understand what’s
inside.

For example, if you create an instance of an nn_linear module
you will see:

nn_linear(10, 1)
An `nn_module` containing 11 parameters.

── Parameters ──────────────────────────────────────────────────────────────────
● weight: Float [1:1, 1:10]
● bias: Float [1:1]

You immediately see the total number of parameters in the module
as well as their names and shapes.

This also works for custom modules (possibly including
sub-modules). For example:

my_module <- nn_module(
  initialize = function() {
    self$linear <- nn_linear(10, 1)
    self$param <- nn_parameter(torch_randn(5,1))
    self$buff <- nn_buffer(torch_randn(5))
  }
)
my_module()
An `nn_module` containing 16 parameters.

── Modules ─────────────────────────────────────────────────────────────────────
● linear: <nn_linear> #11 parameters

── Parameters ──────────────────────────────────────────────────────────────────
● param: Float [1:5, 1:1]

── Buffers ─────────────────────────────────────────────────────────────────────
● buff: Float [1:5]
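The totals in both printouts follow directly from the listed shapes. As a quick arithmetic check, here is a Python sketch that multiplies out each shape and sums the results; count_parameters is a made-up helper, and the shapes are read off the output above.

```python
from math import prod

def count_parameters(shapes):
    """Total number of scalar values across a list of tensor shapes."""
    return sum(prod(shape) for shape in shapes)

linear = [(1, 10), (1,)]          # weight and bias of nn_linear(10, 1)
own_param = [(5, 1)]              # self$param
buffers = [(5,)]                  # self$buff: listed in the printout, but not a parameter

print(count_parameters(linear))               # 11
print(count_parameters(linear + own_param))   # 16
```

The buffer of length 5 is shown in its own section but does not enter the count, which is why the custom module reports 16 rather than 21.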

We hope this makes it easier to understand nn_module objects. We
have also improved autocomplete support for nn_modules and we will
now show all sub-modules, parameters and buffers while you
type.

torchaudio

torchaudio is an extension for torch developed by Athos Damiani (@athospd), providing audio
loading, transformations, common architectures for signal
processing, pre-trained weights and access to commonly used
datasets. It is an almost literal translation of PyTorch’s Torchaudio
library to R.

torchaudio is not yet on CRAN, but you can already try the
development version available here.

You can also visit the pkgdown website for examples
and reference documentation.

Other features and bug fixes

Thanks to community contributions we have found and fixed many
bugs in torch. We have also added new features including:

You can see the full list of changes in the NEWS.md
file.

Thanks very much for reading this blog post, and feel free to
reach out on GitHub for help or discussions!

The photo used in this post preview is by Oleg Illarionov on Unsplash


Privacy Considerations in Large Language Models

Machine learning-based language models trained to predict the next word in a sentence have become increasingly capable, common, and useful, leading to groundbreaking improvements in applications like question-answering, translation, and more. But as language models continue to advance, new and unexpected risks can be exposed, requiring the research community to proactively work to develop new ways to mitigate potential problems.

One such risk is the potential for models to leak details from the data on which they’re trained. While this may be a concern for all large language models, additional issues may arise if a model trained on private data were to be made publicly available. Because these datasets can be large (hundreds of gigabytes) and pull from a range of sources, they can sometimes contain sensitive data, including personally identifiable information (PII) — names, phone numbers, addresses, etc., even if trained on public data. This raises the possibility that a model trained using such data could reflect some of these private details in its output. It is therefore important to identify and minimize the risks of such leaks, and to develop strategies to address the issue for future models.

If one prompts the GPT-2 language model with the prefix “East Stroudsburg Stroudsburg…”, it will autocomplete a long block of text that contains the full name, phone number, email address, and physical address of a particular person whose information was included in GPT-2’s training data.

In “Extracting Training Data from Large Language Models”, a collaboration with OpenAI, Apple, Stanford, Berkeley, and Northeastern University, we demonstrate that, given only the ability to query a pre-trained language model, it is possible to extract specific pieces of training data that the model has memorized. As such, training data extraction attacks are realistic threats on state-of-the-art large language models. This research represents an early, critical step intended to inform researchers about this class of vulnerabilities, so that they may take steps to mitigate these weaknesses.

Ethics of Language Model Attacks
A training data extraction attack has the greatest potential for harm when applied to a model that is available to the public, but for which the dataset used in training is not. However, since conducting this research on such a dataset could have harmful consequences, we instead mount a proof of concept training data extraction attack on GPT-2, a large, publicly available language model developed by OpenAI, that was trained using only public data. While this work focuses on GPT-2 specifically, the results apply to understanding what privacy threats are possible on large language models generally.

As with other privacy- and security-related research, it is important to consider the ethics of such attacks before actually performing them. To minimize the potential risk of this work, the training data extraction attack in this work was developed using publicly available data. Furthermore, the GPT-2 model itself was made public by OpenAI in 2019, and the training data used to train GPT-2 was collected from the public internet, and is available for download by anyone who follows the data collection process documented in the GPT-2 paper.

Additionally, in accordance with responsible computer security disclosure norms, we followed up with individuals whose PII was extracted, and secured their permission before including references to this data in publication. Further, in all publications of this work, we have redacted any personally identifying information that may identify individuals. We have also worked closely with OpenAI in the analysis of GPT-2.

The Training Data Extraction Attack
By design, language models make it very easy to generate a large amount of output data. By seeding the model with random short phrases, the model can generate millions of continuations, i.e., probable phrases that complete the sentence. Most of the time, these continuations will be benign strings of sensible text. For example, when asked to predict the continuation of the string “Mary had a little…”, a language model will have high confidence that the next token is the word “lamb”. However, if one particular training document happened to repeat the string “Mary had a little wombat” many times, the model might predict that phrase instead.

The goal of a training data extraction attack is then to sift through the millions of output sequences from the language model and predict which text is memorized. To accomplish this, our approach leverages the fact that models tend to be more confident on results captured directly from their training data. These membership inference attacks enable us to predict if a result was used in the training data by checking the confidence of the model on a particular sequence.

The main technical contribution of this work is the development of a method for inferring membership with high accuracy along with techniques for sampling from models in a way that encourages the output of memorized content. We tested a number of different sampling strategies, the most successful of which generates text conditioned on a wide variety of input phrases. We then compare the output of two different language models. When one model has high confidence in a sequence, but the other (equally accurate) model has low confidence in a sequence, it’s likely that the first model has memorized the data.
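The two-model comparison can be illustrated with a toy example. The Python sketch below is not the method from the paper: it uses tiny character-bigram models with add-one smoothing as stand-ins for language models, a made-up "private" string, and a simple difference of average log-probabilities as the score. It only shows the principle that a sequence the target model finds much more likely than a reference model does is a candidate for memorization.

```python
import math
from collections import Counter

def train_bigram(text):
    """Character bigram counts plus context counts."""
    pairs = Counter(zip(text, text[1:]))
    contexts = Counter(text[:-1])
    return pairs, contexts

def avg_log_prob(model, text, vocab_size):
    """Average per-character log-probability with add-one smoothing."""
    pairs, contexts = model
    total = 0.0
    for a, b in zip(text, text[1:]):
        total += math.log((pairs[(a, b)] + 1) / (contexts[a] + vocab_size))
    return total / (len(text) - 1)

common = "mary had a little lamb its fleece was white as snow "
secret = "call 555 0199 now "                       # made-up "private" string
corpus_target = common * 20 + secret * 5            # target model saw the secret
corpus_reference = common * 20                      # reference model did not

vocab_size = len(set(corpus_target))
target = train_bigram(corpus_target)
reference = train_bigram(corpus_reference)

def memorization_score(text):
    """How much more confident the target model is than the reference model."""
    return (avg_log_prob(target, text, vocab_size)
            - avg_log_prob(reference, text, vocab_size))

candidates = ["mary had a little lamb ", secret]
ranked = sorted(candidates, key=memorization_score, reverse=True)
print(ranked[0] == secret)   # True
```

Both models are about equally confident on the nursery rhyme, so its score stays near zero, while the string seen only by the target model stands out.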

Results
Out of 1800 candidate sequences from the GPT-2 language model, we extracted over 600 that were memorized from the public training data, with the total number limited by the need for manual verification. The memorized examples cover a wide range of content, including news headlines, log messages, JavaScript code, PII, and more. Many of these examples are memorized even though they appear infrequently in the training dataset. For example, many samples of PII we extract are found in only a single document in the dataset. However, in most of these cases, the originating document contains multiple instances of the PII, and as a result, the model still learns it as high likelihood text.

Finally, we also find that the larger the language model, the more easily it memorizes training data. For example, in one experiment we find that the 1.5 billion parameter GPT-2 XL model memorizes 10 times more information than the 124 million parameter GPT-2 Small model. Given that the research community has already trained models 10 to 100 times larger, this means that as time goes by, more work will be required to monitor and mitigate this problem in increasingly large language models.

Lessons
While we demonstrate these attacks on GPT-2 specifically, they show potential flaws in all large generative language models. The fact that these attacks are possible has important consequences for the future of machine learning research using these types of models.

Fortunately, there are several ways to mitigate this issue. The most straightforward solution is to ensure that models do not train on any potentially problematic data. But this can be difficult to do in practice.

The use of differential privacy, which allows training on a dataset without revealing any details of individual training examples, is one of the most principled techniques to train machine learning models with privacy. In TensorFlow, this can be achieved with the use of the tensorflow/privacy module (or similar for PyTorch or JAX) that is a drop-in replacement for existing optimizers. Even this can have limitations and won’t prevent memorization of content that is repeated often enough. If this is not possible, we recommend at least measuring how much memorization occurs so appropriate action can be taken.
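To give a feel for what a differentially private optimizer does differently, here is a sketch of the gradient step commonly described for differentially private SGD: clip each example's gradient, sum, add Gaussian noise, then average. This is plain Python written for illustration, not the tensorflow/privacy API; the function names and the example gradients are made up, and the sketch does not track the resulting privacy budget.

```python
import math
import random

def clip(grad, max_norm):
    """Scale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    clipped = [clip(g, max_norm) for g in per_example_grads]
    summed = [sum(g[d] for g in clipped) for d in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]

grads = [[3.0, 4.0], [0.3, 0.4], [-30.0, 40.0]]   # illustrative per-example gradients
rng = random.Random(0)
print(private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any single example can move the model, and the noise masks what remains. Content repeated across many examples still adds up after clipping, which matches the limitation noted above.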

Language models continue to demonstrate great utility and flexibility—yet, like all innovations, they can also pose risks. Developing them responsibly means proactively identifying those risks and developing ways to mitigate them. We hope that this effort to highlight current weaknesses in large language modeling will raise awareness of this challenge in the broader machine learning community and motivate researchers to continue to develop effective techniques to train models with reduced memorization.

Acknowledgements
This work was performed jointly with Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel.


Personal Assistant Kino Part 4 – Smart Feed That Automatically Saves the Articles You Read Often

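The Kino project is about getting to know yourself through QS (Quantified Self), automating unnecessary chores, and improving your quality of life. In this part, I'd like to cover Smart Feed, which automatically saves the articles I read often.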

Image source: http://quantifiedself.com/

The series so far

Github: https://github.com/DongjunLee/quantified-self

In the previous part, we looked at Kino's T3: its function as a Task Master, which automatically records tasks and also reports on them. In this part, I'd like to cover another feature I use all the time: Feed & Pocket.

RSS Feed

RSS Feed refers to getting a notification when a new article is published, using the RSS that many websites provide. Let's take a moment here to look at what RSS is.

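RSS (Rich Site Summary) is a content format used mainly by news and blog sites. Website administrators present their site's content in RSS format. Recipients of this information can use it in other formats. RSS readers come in web-based and installed varieties. Web-based readers have the advantage that they can be used anywhere after a simple account registration. – Wikipedia, RSS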

Many websites provide RSS by default, and there are plenty of services that make use of it. One of them is Feedly. If you register the sites you visit often, you can conveniently read their new posts. I was happily using this service, but it did not support every feature I wanted.

Pocket

Another service I use all the time is Pocket. What this service does is very simple.

When you find something you want to view later, put it in Pocket.

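When you come across an article you want to read later, you put it in Pocket and read it whenever it suits you. I tend to save the articles I want to read carefully to Pocket. And if an article turns out to be really good as I read it, I move it to Favorites.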

Smart Feed

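I wanted to automate this pattern of mine: skim new articles, save the interesting ones to Pocket, and move the ones that turn out to be good to Favorite. The feature I came up with and built for this is Smart Feed.

The first thing this feature needs is a set of RSS addresses, so that it can read the RSS and, when a new article appears, either save it or send a notification. That is why I created the awesome-feeds repository. Managing the RSS addresses of the websites I read often with Git seemed convenient, and I also wanted to turn it into an awesome-style list that collects good RSS addresses.

Now that the RSS addresses are ready, all that's left is to send a notification when a new article is posted!
I used feedparser for this.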

import time

import arrow
import feedparser

f = feedparser.parse(feed_url)

# Sort entries newest first; entries without a timestamp sort last.
epoch = time.gmtime(0)
f.entries = sorted(
    f.entries, key=lambda x: x.get("updated_parsed") or epoch, reverse=True
)

# Collect the entries updated since the last check.
noti_list = []
if feed_url in cache_data:
    previous_update_date = arrow.get(cache_data[feed_url])
    for e in f.entries:
        e_updated_date = arrow.get(e.get("updated_parsed") or epoch)
        if e_updated_date > previous_update_date:
            noti_list.append(self.__make_entry_tuple(category, e, feed_name))

The schedule can be set up as described in Part 2, Skill & Scheduler. Checking the feeds every minute creates too much load, and in my testing an interval of around 20 minutes felt sufficient.

def __excute_feed_schedule(self, interval):
    schedule.every(interval).minutes.do(
        self.__run_threaded,
        self.function_runner,
        {
            "repeat": True,
            "func_name": "feed_notify",
            "params": {},
            "day_of_week": [0],
            "not_holiday": False,
        },
    )

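Kino now tells me about the latest RSS feed entries right away. That is already useful, but there was a feature I wanted to build on top of it: for websites I already trust, whose articles I always save to Pocket anyway, save them automatically right away!

This, too, can be made smart with a Pocket integration and a simple classification algorithm. The most important thing in machine learning is data, and this kind of data can be built easily from logs. First, every article the Feed feature notifies me about can be treated as the full dataset. If only the articles saved to Pocket are given a label of 1, the dataset naturally splits into articles I'm interested in and articles I'm not. Add the website's name as a feature and you can build a simple Decision Tree.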
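The labeling step described above can be sketched roughly as follows. This is only an illustration: the record fields (`site`, `link`) and the `build_dataset` helper are hypothetical and do not reflect Kino's actual log format.

```python
def build_dataset(notified, saved_links):
    """Turn feed logs into (X, y) training data.

    notified:    list of dicts, one per article Kino sent a notification for
    saved_links: set of links that were saved to Pocket
    """
    site_ids = {}  # website name -> integer ID, assigned in order of appearance
    X, y = [], []
    for entry in notified:
        site_id = site_ids.setdefault(entry["site"], len(site_ids))
        X.append([site_id])                                   # one feature: the website
        y.append(1 if entry["link"] in saved_links else 0)    # 1 = saved to Pocket
    return X, y, site_ids


notified = [
    {"site": "Google AI Blog", "link": "https://example.com/a"},
    {"site": "Some Other Blog", "link": "https://example.com/b"},
    {"site": "Google AI Blog", "link": "https://example.com/c"},
]
X, y, site_ids = build_dataset(notified, saved_links={"https://example.com/a"})
```

Encoding the website as an integer ID is a simplification; with many sites, a one-hot encoding would usually suit a decision tree better.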

Image source: Wikipedia

For example, when a new article is posted on the Google AI Blog website, if I have seen 5 articles from it so far and saved 4 of them to Pocket, the new article is considered likely to interest me as well.
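The 4-out-of-5 reasoning amounts to a per-site save ratio. Here is a small sketch of that calculation, assuming a cutoff of 0.5; the threshold and the helper names are assumptions made for illustration, not values taken from Kino.

```python
from collections import defaultdict


def save_ratios(site_labels):
    """site_labels: iterable of (site, label) pairs, where label is 1 if saved to Pocket."""
    counts = defaultdict(lambda: [0, 0])  # site -> [saved, total]
    for site, label in site_labels:
        counts[site][0] += label
        counts[site][1] += 1
    return {site: saved / total for site, (saved, total) in counts.items()}


def should_auto_save(site, ratios, threshold=0.5):
    """Auto-save only when the site's past save ratio exceeds the threshold."""
    return ratios.get(site, 0.0) > threshold


history = [("Google AI Blog", 1)] * 4 + [("Google AI Blog", 0)]
ratios = save_ratios(history)
print(ratios["Google AI Blog"])                    # 4 saved out of 5 seen
print(should_auto_save("Google AI Blog", ratios))
```

A site that has never been seen gets a ratio of 0.0 here, so nothing is auto-saved until some history exists.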

A Decision Tree is very simple to use with scikit-learn.

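from sklearn import tree


class FeedClassifier:
    def __init__(self):
        data = FeedData()  # project class that builds the training set

        model = tree.DecisionTreeClassifier()
        model.fit(data.train_X, data.train_y)  # Training
        self.clf = model

    def predict(self, link, category):
        # category_id: the numeric ID for `category` (lookup omitted in this excerpt).
        # scikit-learn expects a 2D array of shape (n_samples, n_features);
        # this assumes category_id is a single integer feature.
        result = self.clf.predict([[category_id]])[0]
        if result == FeedDataLoader.TRUE_LABEL:
            ...
        else:
            ...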

Online Learning

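The next important piece is online learning. The feeds I put into Pocket change from moment to moment, so the model also has to pick up on these changes and make its decisions using the latest information. The method used for this is online learning.

An approach in which new data is continuously applied to the model so that the model always stays up to date

Through this approach, Kino's Smart Feed keeps getting smarter. Online learning becomes possible by setting up a single cycle.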


  1. Logging: everything about the feeds I get notified about, plus which of them I saved to Pocket.
  2. Data Processing: parse the logs into fields such as category, title, date, and link, and add a label (0: not added to Pocket / 1: added to Pocket).
  3. Model: fit the model to the prepared data (training).
  4. Predict: based on the trained model, look at each new feed entry and decide whether to save it to Pocket. When the model gets a decision wrong, provide feedback so that the correct label is stored.

If learning in real time is too much of a burden, retraining once a day is another option.
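The cycle can be sketched with a toy model that is updated after every piece of feedback. This is a simplified stand-in, not Kino's implementation: `OnlineCounter`, `run_cycle`, and the 0.5 threshold are hypothetical, and a count-based rule takes the place of the decision tree so the example stays self-contained.

```python
class OnlineCounter:
    """Toy per-site model that can be updated one example at a time."""

    def __init__(self, threshold=0.5):
        self.saved = {}   # site -> number of articles saved to Pocket
        self.total = {}   # site -> number of articles seen
        self.threshold = threshold

    def predict(self, site):
        total = self.total.get(site, 0)
        if total == 0:
            return 0  # no history yet: do not auto-save
        return 1 if self.saved.get(site, 0) / total > self.threshold else 0

    def update(self, site, label):
        self.total[site] = self.total.get(site, 0) + 1
        self.saved[site] = self.saved.get(site, 0) + label


def run_cycle(model, stream):
    """stream yields (site, true_label) pairs; returns the prediction made for each entry."""
    predictions = []
    for site, true_label in stream:
        predictions.append(model.predict(site))  # step 4: predict on the new entry
        model.update(site, true_label)           # steps 1-3: log the feedback and refit
    return predictions


stream = [("A", 1), ("A", 1), ("A", 0), ("B", 0), ("A", 1)]
print(run_cycle(OnlineCounter(), stream))
```

For the once-a-day alternative, the same `update` calls could simply be batched and applied in a nightly job.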

Conclusion

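This time we looked at Smart Feed, a very simple but really useful feature. Right now it is based simply on counts, so it can't make more refined predictions. Later I plan to treat this as a text classification problem and predict from the title or summary whether an article is likely to interest me. And approached as a text summarization problem, it could even pick out just the key points for me when I'm busy. I think Smart Feed has plenty of room to grow. I should collect lots of data and swap in a deep learning model soon!

All of the code can be found in the GitHub repository linked above.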


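Personal Assistant Kino Part 3 – Manage Tasks as Simply as Possible, and Record the Data

The Kino project uses the Quantified Self (QS) approach to help me understand myself, automate unnecessary chores, and improve my quality of life. This installment is about the simple task management feature, which connects the various apps I use.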

Image source: http://quantifiedself.com/

The series so far

Github: https://github.com/DongjunLee/quantified-self

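The previous installment, Part 2 - Skill & Scheduler, was about getting Kino ready to run the way I want. Now I'd like to get to the goal of this project, Quantified Self: the process of easily collecting data about myself, giving myself feedback by looking at charts, and improving my quality of life as a result.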

T3 (Todoist + Toggl + Trello)

Today's story is about tasks. I'm a heavy user of Todoist, to the point that I upgraded to Premium. There are apps for both mobile and desktop, so I could manage my to-do list easily anytime, anywhere.

Beyond that, I wanted to know how long my tasks take, how well I focused on each one, and how I spent the hours of my day.

Todoist alone wasn't enough to give me what I wanted, so I went looking for the services I needed. I concluded that Toggl was the best-built service for time tracking, and that the easiest way to handle starting (Doing) and finishing (Done) a task was a Trello kanban board with Task, Doing, and Done lists.

The result is T3.

That is, T3 = Todoist + Toggl + Trello, a feature for simple task management.
Below are the Skills that T3 uses.

Todoist

  • 🌆 today_briefing : Briefs me on the tasks registered in Todoist.
  • 📃 todoist_remain : Reports the tasks that remain.

Toggl

  • ⌚️ toggl_timer : Starts or stops the Toggl timer.
  • 🔔 toggl_checker : Checks the time every 30 minutes (recommends a break after 150 minutes or more of work).
  • 📊 toggl_report : Toggl task report.
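The break recommendation in toggl_checker boils down to a simple elapsed-time check. A minimal sketch, assuming the start time of the running timer is already known; the `needs_break` helper is hypothetical and does not call the Toggl API.

```python
from datetime import datetime, timedelta


def needs_break(started_at, now, limit_minutes=150):
    """Return True once the current task has been running for limit_minutes or more."""
    return now - started_at >= timedelta(minutes=limit_minutes)


started = datetime(2018, 1, 1, 9, 0)
print(needs_break(started, datetime(2018, 1, 1, 11, 29)))  # 149 minutes in
print(needs_break(started, datetime(2018, 1, 1, 11, 30)))  # 150 minutes in
```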

Trello

  • 📋 kanban_init : Initializes the Trello board.
  • 📋 kanban_sync : Syncs the Trello board with the tasks in Todoist.

Question

  • ✍️ attention_question : Asks for my focus level after a task (out of 100).
  • ✍️ attention_report : Focus report.

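Task management scenario

The skills connected to the Kino Slack bot work together so that tasks flow smoothly, because one of the main things this project focuses on is minimizing the effort of recording data by hand.

Here is a more detailed walkthrough based on the skills above.
In the morning, the tasks registered in Todoist are automatically added to the Trello board. (kanban_init)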


When I move a task from Tasks to Doing, Toggl's time tracking starts.


Once I finish the task, I simply move it to Done; Toggl's time tracking stops as well, and Kino asks me how focused I was.


If I keep working this way, moving cards as I go, Kino generates a report at night.

Image: toggl_report Skill

That is my scenario for managing a day's tasks.
All I do is move cards in Trello. Simple, isn't it?

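The most important things in Quantified Self are that data must be easy to collect and that charts you can take in at a glance make it easy to give yourself feedback. In that sense, T3 is a very simple and useful feature. And because the task data accumulates in Toggl, I can pull it at any time and run more complex analyses.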

Trello Webhook

Custom skills are easy to build using packages that wrap each service's API in Python. Prepare the TOKEN, ACCESS_KEY, and so on that each service requires, connect them, and you're done.

The other part that needs work is the webhook. IFTTT can also connect to Trello, but IFTTT is fundamentally not real-time. Managing tasks with Trello, however, requires reacting in real time: having the Toggl timer start a full 10 minutes after I move a card to Doing would be far too inconvenient.

Trello supports adding webhooks. Handling a webhook requires a simple server to process the callback, but renting a server just for that would be a waste. In cases like this, the Serverless Framework makes it simple: after a brief AWS setup, define a function for the callback and deploy, and AWS API Gateway + Lambda are configured automatically.

import json


def kanban_webhook(event, context):
    input_body = json.loads(event['body'])
    print(event['body'])  # log the raw payload for debugging

    action = input_body["action"]
    action_type = action["type"]

    # Default to None so unrelated action types fall through harmlessly.
    list_name, card_name = None, None
    if action_type == "createCard":
        list_name, card_name = get_create_card(action["data"])
    elif action_type == "updateCard":
        list_name, card_name = get_update_card(action["data"])

    # Only cards landing in these lists are forwarded to Kino.
    kanban_list = ["DOING", "BREAK", "DONE"]
    if list_name in kanban_list:
        payload = make_payload(action=list_name, msg=card_name)
        r = send_to_kino({"text": payload})
    ...

kanban_webhook is implemented in kino-webhook.
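The post does not show the get_create_card and get_update_card helpers that the handler calls. Here is a rough sketch of what they might look like, assuming the webhook's action data carries the card under `card`, the list under `list` for a newly created card, and the destination list under `listAfter` for a card that was moved. Both the field names and the function bodies are assumptions, so check them against a real payload before relying on them.

```python
def get_create_card(data):
    """Return (list_name, card_name) for a newly created card."""
    return data["list"]["name"], data["card"]["name"]


def get_update_card(data):
    """Return (list_name, card_name) for an updated card.

    Only a move between lists is assumed to carry a destination list;
    for any other kind of update, list_name is None.
    """
    list_after = data.get("listAfter")
    list_name = list_after["name"] if list_after else None
    return list_name, data["card"]["name"]


# Made-up sample of the data for a card moved from Tasks to DOING.
moved = {
    "card": {"name": "Write blog post"},
    "listBefore": {"name": "Tasks"},
    "listAfter": {"name": "DOING"},
}
print(get_update_card(moved))
```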

Now that the webhook is defined, all that remains is for Kino to handle it! Below is how card moves are wired to the skills.

    def KANBAN_handle(self, event):
        toggl_manager = TogglManager()

        action = event['action']
        description = event['msg']
        if action.endswith("DOING"):
            toggl_manager.timer(
                description=description,
                doing=True,
                done=False)
        elif action.endswith("BREAK"):
            toggl_manager.timer(doing=False, done=False)
        elif action.endswith("DONE"):
            toggl_manager.timer(doing=False, done=True)  # also linked to Todoist

  • Doing : Start the Toggl timer
  • Done : Stop the Toggl timer & complete the Todoist task
  • Break : Stop the Toggl timer

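Monotasking: focusing on one thing at a time

There is one more reason, besides data collection, that I built the task management feature called T3.
Managing tasks on a kanban board naturally enforces something: not multitasking.
We tend to think that multitasking, doing several things at once, gets work done faster, but apparently that is not the case.

According to Daniel J. Levitin, author of The Organized Mind:

We think we are multitasking, but this is a powerful and diabolical illusion. Earl Miller, a neuroscientist at MIT and a world authority on divided attention, says that our brains are not built to be well suited to multitasking. People think they are multitasking, but in reality they are just switching from one task to another very quickly.
(…)
Multitasking increases the production of the stress hormone cortisol as well as the fight-or-flight hormone adrenaline. It also overstimulates the brain and scrambles thinking. Multitasking creates a dopamine-addiction feedback loop, and through this reward mechanism makes the brain lose focus and constantly seek out external stimulation. To make matters worse, the prefrontal cortex has a novelty bias, meaning that its attention is easily captured whenever something new appears. – from page 154

Of course, there are said to be people who really can multitask, but they are a very small minority; for most people, focusing on one thing at a time is more efficient. I'm not good at doing several things at once either, so I'm trying to enforce it with a system that keeps me focused on one thing at a time. If I've moved a development task to Doing, I work on just that task.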

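In closing

This time we looked at T3, which connects several services to Kino to make task management simple and provide charts. By collecting task data day after day like this, I'll be able to analyze things such as which kinds of work I focus on well and at what times of day I focus best. I'm still collecting data myself, so I definitely plan to analyze it later. 😀

All of the code can be found in the GitHub repository linked above.

Next time, we'll look at Smart Feed, which notifies me of the latest feeds right away and even classifies them on its own using automatically collected data.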