First Insight: Modulus, WellSaid, Applied AI
Welcome to the first technology newsletter from the AI2 incubator team. Every month starting July 2021, we will share technology/AI/ML updates from AI2 and around the Web. We will pay special attention to developments that are relevant to startups (and hopefully address the question of why the world needs yet another AI/tech newsletter). Please let us know in the comments if you have input or suggestions.
First, a couple of quick updates on the incubator. This month we welcomed Michael Carlon to the incubator team. Michael is a full-stack engineer extraordinaire and a top Kaggle competitor with a soft spot for deep learning and computer vision. Also, two of our companies announced funding: Modulus Therapeutics' seed round and WellSaid's Series A.
This month, we are spotlighting a few areas: modeling with limited labeled data, neural networks gaining ground on tree-based methods, and ML tools and infrastructure.
Modeling with limited labeled data
As scrappy startups, we often face a familiar foe: the lack of labeled data. We thus always keep an eye on the latest techniques that help us train better models with less data. As Facebook AI’s Chief Scientist, Yann LeCun, has recently noted, the future of AI research is in building more intelligent generalist models that can acquire new skills across different tasks, domains and languages without massive amounts of labeled data. In NLP, we have a working document where we share best practices in building NLP models with limited training data.
Noteworthy updates this month are:
- Facebook releases its latest research and data sets for conversational AI, focusing on training NLU models with as little labeled data as possible. The blog post claims “10x less training data to create state-of-the-art conversational AI systems that can perform unfamiliar, complex tasks”.
- Google’s AI blog discusses semi-supervised distillation, a technique that combines ideas from noisy student training and knowledge distillation. It shows gains in query understanding and search relevance for Google’s search, which is no small feat given the strong existing baseline.
- Google published a paper on How to Train Vision Transformer (ViT). TL;DR: ViT has been shown to be competitive with CNNs when there’s plenty of training data. If training data is small/medium, then special attention is needed for data augmentation and regularization, which could be worth as much as a 10x increase in the amount of training data.
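To make the distillation idea in the list above more concrete, here is a minimal sketch of the soft-label loss that knowledge distillation is usually built on: a teacher model scores unlabeled examples, and a student is trained to match the teacher's softened predictions. This is not Google's implementation; the function names, the temperature value, and the toy logits are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw logits into probabilities, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(teacher_logits_batch, temperature=2.0):
    """Turn teacher logits on unlabeled examples into soft labels."""
    return [softmax(logits, temperature) for logits in teacher_logits_batch]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the teacher's soft labels and the student's
    predictions for a single example (hypothetical helper)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


# Toy usage: the student is penalized more when it disagrees with the teacher.
teacher = [2.0, 0.0, -1.0]
agreeing_student = [2.0, 0.0, -1.0]
disagreeing_student = [-1.0, 0.0, 2.0]
print(distillation_loss(teacher, agreeing_student))
print(distillation_loss(teacher, disagreeing_student))
```

In a real pipeline the student would minimize this loss over a large pool of unlabeled data, typically mixed with the ordinary supervised loss on whatever labeled data is available.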
Neural networks vs tree-based methods
Given the dizzying pace of deep learning/neural network research and development, it’s easy to forget that non-neural network techniques are still relevant in applied settings. Tree-based methods such as XGBoost and LambdaMART are still often the preferred choices (e.g. in Kaggle competitions) for tabular data and learning-to-rank (LTR) problems. This month, we came across a couple of blog posts that hint at DL/NN’s relentless progress in these strongholds of tree methods as well. We wonder if it’s only a question of time until DL/NN achieves total domination.
First, the July newsletter from the website Papers with Code gave a summary of two recent papers on deep learning for tabular data.
- In the paper Revisiting Deep Learning Models for Tabular Data, the main findings are 1) there’s no clear winner between deep learning and tree-based models - a lot depends on the data set at hand, 2) a ResNet-like architecture provides a very effective baseline, and 3) an adaptation of the Transformer architecture helped reduce the gap against gradient boosted tree methods on datasets where they dominate.
- In the paper SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training, the authors explore extensive use of attention and contrastive self-supervised pre-training (which is effective when the labels are scarce). SAINT consistently improves performance over previous deep learning methods, and it even outperforms gradient boosting methods, including XGBoost, CatBoost, and LightGBM, on average over a variety of benchmark tasks.
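For readers unfamiliar with contrastive pre-training, the sketch below shows an InfoNCE-style loss, one common form of contrastive objective. It is not SAINT's actual training code. It assumes each row has two embedded "views" (for example, an original and a corrupted copy), and the function names, temperature, and toy vectors are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    anchors[i] and positives[i] are two views of the same row; every other
    positives[j] acts as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        sims = [cosine(anchor, p) / temperature for p in positives]
        peak = max(sims)  # log-sum-exp with max subtraction for stability
        log_denominator = peak + math.log(sum(math.exp(s - peak) for s in sims))
        losses.append(log_denominator - sims[i])
    return sum(losses) / len(losses)


# Toy usage: matching views give a low loss, mismatched views a high one.
views_a = [[1.0, 0.0], [0.0, 1.0]]
views_b = [[0.9, 0.1], [0.1, 0.9]]
print(info_nce(views_a, views_b))
print(info_nce(views_a, list(reversed(views_b))))
```

No labels appear anywhere in the loss, which is why this kind of pre-training is attractive when labeled rows are scarce.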
Second, Google releases updates to TF-Ranking, which is an open-source TensorFlow library for training LTR models. While search and recommendation systems are the most common applications of LTR models, Google reported seeing the use of LTR models in diverse domains from smart city planning to SAT solvers. The latest update added support for the Keras API, a new architecture that incorporates BERT, and a neural ranking generalized additive model (GAM) for interpretability. Finally, the Google AI team spent significant time trying to improve the performance of neural LTR models in comparison to tree-based models, and reported achieving parity with, and in some cases superior performance over, strong LambdaMART baselines on open LTR datasets. They used a combination of techniques, which include data augmentation, neural feature transformation, self-attention for modeling document interactions, listwise ranking loss, and model ensembling similar to boosting in XGBoost.
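As an illustration of what a listwise ranking loss looks like, here is a minimal sketch of a softmax cross-entropy loss computed over all documents for one query. It does not use the TF-Ranking API and is not claimed to be the exact loss the Google team used; the function name and the choice to normalize relevance labels into a target distribution are assumptions.

```python
import math


def softmax_listwise_loss(scores, relevance):
    """Softmax cross-entropy over one query's list of documents.

    scores: model scores, one per document.
    relevance: non-negative relevance labels, one per document.
    """
    total_relevance = sum(relevance)
    if total_relevance == 0:
        return 0.0  # no relevant documents, so nothing to learn from this list
    peak = max(scores)  # log-sum-exp with max subtraction for stability
    log_normalizer = peak + math.log(sum(math.exp(s - peak) for s in scores))
    # The target distribution is the normalized relevance; the prediction is
    # softmax(scores). The loss is the cross-entropy between the two.
    return -sum(
        (r / total_relevance) * (s - log_normalizer)
        for s, r in zip(scores, relevance)
    )


# Toy usage: only the first document is relevant.
relevance = [1, 0, 0]
print(softmax_listwise_loss([3.0, 1.0, 0.0], relevance))  # relevant doc on top
print(softmax_listwise_loss([0.0, 1.0, 3.0], relevance))  # relevant doc last
```

Unlike a pointwise loss, the score of each document only matters relative to the other documents in the same list, which matches how ranked results are actually consumed.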
Tools and infrastructure
- AI2 researchers published Neural Extractive Search (prototype and demo video). In information extraction problems, extraction based on syntactic structure has high precision but low recall. The key idea is to improve recall with neural retrieval and alignment. The work was done by the AI2 Israel team, which has been building Spike, a tool for information extraction and knowledge base construction.
- Bubble raised a $100M A round: “Building tech is slow and expensive. Bubble is the most powerful no-code platform for creating digital products. Build better and faster.” Relevant to any startup that wants to prototype rapidly without a 10x-engineer co-founder. Even for teams with strong technical co-founders, it’s worth exploring how/where Bubble can help with building an MVP.
- OpenAI releases Triton 1.0, an open-source, Python-like programming language for writing efficient GPU code. OpenAI researchers with no GPU programming experience have used Triton to produce kernels that are up to 2x faster than their PyTorch equivalents.
- Hugging Face continues with interesting updates. We can now easily share spaCy pipelines to Hugging Face’s Hub, showcase ML demo apps on Hugging Face’s Spaces (with support for Streamlit and Gradio), and deploy Transformers in SageMaker. If you are building a demo on top of HF’s libraries and tools, HF Spaces is a good way to build it fast.
- Summvis is an open-source interactive visualization tool for text summarization. It may be useful if you encounter significant hallucination in summarization models.
- Deep Genomics (AI discovery platform for 'Programmable' RNA therapeutics) raised a $180M C round.
- Jeremy Howard (of fastai fame) shares his thoughts on GitHub Copilot. TL;DR: it’s impressive, but too early to be truly useful.
- Andrej Karpathy’s CVPR 2021 talk on autonomous vehicles, describing the switch to a pure vision-based (no radar) approach with semi-auto labeling.