Databricks gobbles up MosaicML - spicing up the battle for "AI moats"
Will a $1.3B bet on Generative AI play out well in the long-term for a darling private company? Let's explore.
Author’s Note
In this post, I delve into the potential consequences and implications of Databricks' acquisition of MosaicML for data cloud companies in the emerging era of Generative AI and Large Language Models (LLMs). While there have been numerous discussions on this topic, I aim to contribute to the discourse by sharing insights from my experience as a Data Scientist at a Fortune 500 company.
Databricks - Background
Databricks, a collaborative big data and machine learning analytics platform, was established by the creators of Apache Spark, a powerful open-source cluster computing framework. Databricks offers a cloud-based environment that seamlessly integrates with various data sources and tools, facilitating efficient collaboration among data engineers, data scientists, and data analysts.
Apache Spark originated in 2009 at UC Berkeley as a fast, multi-purpose cluster computing system intended to improve on MapReduce. Spark is exceptionally fast, processing data up to 100 times faster in memory and 10 times faster on disk than Hadoop MapReduce. Spark's API supports multiple programming languages, including Java, Scala, Python, and R, making it straightforward to develop applications.
The primary abstraction in Spark is Resilient Distributed Datasets (RDDs), which are fault-tolerant collections of objects partitioned across a cluster. RDDs enable data manipulation with operations such as map, filter, and reduce, and they allow for data to be cached in memory for efficient reuse across multiple operations.
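The map/filter/reduce operations above can be sketched with plain Python built-ins. This is a conceptual illustration only: in PySpark, the same chain (e.g. `sc.parallelize(data).map(...).filter(...).reduce(...)`) would run distributed across cluster partitions rather than on a single machine.

```python
from functools import reduce

# Conceptual sketch of RDD-style transformations using plain Python.
# In PySpark the equivalent chain would execute across cluster partitions.
data = [1, 2, 3, 4, 5, 6]

squared = map(lambda x: x * x, data)           # map: transform each element
evens = filter(lambda x: x % 2 == 0, squared)  # filter: keep matching elements
total = reduce(lambda a, b: a + b, evens)      # reduce: aggregate to one value

print(total)  # 4 + 16 + 36 = 56
```

The lazy chaining here loosely mirrors how Spark builds up a lineage of transformations and only computes when an action (like `reduce`) is called.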
Databricks was founded to make Spark more accessible to data professionals by creating a cloud-based platform on top of it. This approach proved successful, and Databricks quickly grew to support thousands of customers.
Databricks - The Product & The Capabilities
1. Data Lakehouses: A data lakehouse is a modern data architecture that combines the best of data lakes and data warehouses. It provides:
Flexibility: Like data lakes, a lakehouse allows you to store all your raw and structured data in one place. You can ingest data in any format.
Governance: Similar to data warehouses, a lakehouse enforces data quality rules, security policies, and ACID transactions on your data. This ensures trust and reproducibility.
Speed: You can query and analyze your data in near real time using tools like SQL, notebooks, and BI reports, with minimal ETL overhead.
Scale: The lakehouse architecture, based on Apache Spark and Delta Lake, can handle petabytes of data and thousands of users.
2. Delta Lake: Delta Lake is an open-source storage layer that sits at the foundation of Databricks' Lakehouse Platform. Databricks initially created Delta Lake and continues to contribute to the open-source project, and many optimizations and products on the Lakehouse Platform rely on Spark and Delta Lake features. It extends the Apache Parquet file format to provide features like:
ACID transactions: Through a transaction log, Delta Lake enables inserting, updating, and deleting data in the lakehouse without rewriting entire files.
Scalable metadata: It handles metadata about schema, location, and transaction timeline for tables in a scalable and fault-tolerant way.
Spark API compatibility: Delta Lake is fully compatible with Apache Spark APIs, which makes it easy to use with existing Spark pipelines.
Streaming support: It integrates tightly with Spark Structured Streaming to allow processing both batch and streaming data in the same data copy. This provides faster, incremental processing at scale.
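The transaction-log idea behind those ACID guarantees can be sketched in a few lines. This is a deliberately simplified stand-in, not the real Delta protocol: actual Delta Lake stores data as Parquet files and records each commit as a JSON file under a `_delta_log/` directory, with far richer metadata. File names here are hypothetical.

```python
# Simplified sketch of a Delta-style transaction log: each commit appends
# an entry describing which data files were added or removed, so an update
# can logically "remove" a file without rewriting the whole table.
log = []  # in-memory stand-in for the _delta_log directory

def commit(adds, removes):
    log.append({"version": len(log), "add": adds, "remove": removes})

def live_files():
    files = set()
    for entry in log:  # replay the log to reconstruct the current snapshot
        files -= set(entry["remove"])
        files |= set(entry["add"])
    return files

commit(["part-000.parquet", "part-001.parquet"], [])  # initial write
commit(["part-002.parquet"], ["part-000.parquet"])    # update: swap one file

print(sorted(live_files()))  # ['part-001.parquet', 'part-002.parquet']
```

Because readers replay the log up to a committed version, they always see a consistent snapshot, which is the essence of how Delta layers transactions on top of immutable files.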
3. MLflow: MLflow is an open-source platform for the entire machine learning lifecycle. It helps you build better models faster by tracking runs, versioning models, and deploying them into production. The core components are:
Tracking: Record the parameters and results of your ML experiments. Compare multiple runs to find the best model.
Models: Package trained models in a standard format. Deploy models to different frameworks and serving systems.
Projects: Bundle ML code to share with others and reproduce experiments exactly.
Registry: Store, annotate, and manage multiple model versions in one place. Transition models from development to production.
Model Serving: Host trained models as REST API endpoints for real-time predictions.
Databricks - Core Market & User Persona
Databricks' core market is in the big data and cloud computing industries. Its platform is designed to help organizations manage and process large volumes of data in a scalable and efficient manner. At the time of writing this, Databricks' official website stated that the company has a global clientele of over 9,000 organizations. The company boasts an impressive customer list, which includes Comcast, Condé Nast, and over half of the Fortune 500 companies.
These companies rely on Databricks for their machine learning and advanced analytics needs. For instance:
AT&T uses it for their predictive analytics capabilities.
Credit Suisse now serves various stakeholders, including sales teams that need real-time product recommendations for their clients, as well as business users.
Burberry unlocked collaboration among its data engineers, data analysts, and business users using it.
It’s interesting to note that most of their major customer success stories are on Microsoft Azure. We’ll come back to why this is interesting later in this post.
The typical user of the Databricks platform includes data engineers, data scientists, and analysts who use the platform to build and deploy machine learning models, perform advanced analytics, and manage data pipelines. The platform is well-suited for organizations that require real-time processing and analysis of large volumes of data and who prefer to leverage cloud-based infrastructure for their big data needs and strategy.
Competitive Landscape
Databricks, a popular data analytics and artificial intelligence platform, has several competitors in the market.
Data Warehousing: On the data warehousing front, Databricks competes with Amazon Redshift, Azure Synapse, Google BigQuery, and Snowflake. The primary persona all of them serve is the data analyst.
Data Engineering: When it comes to data engineering, the common alternatives for customers include Hadoop, Apache Airflow, Amazon EMR, and Apache Spark. The persona in focus here is the data engineer.
Streaming: Handling streaming data requires different data engineering capabilities. The common options for users/customers include Apache Kafka, Confluent, Apache Flink and Amazon Kinesis. The personas being served here are data engineers as well.
Data Science and ML: For Data Science and ML workflows revolving around Python and R, customers have options like Amazon SageMaker and Azure ML Studio. The personas working on these workflows are Data Scientists and ML engineers inside companies.
While all of these competitors offer similar functionalities around big data processing, machine learning, and data warehousing, Databricks focuses specifically on Apache Spark. Their main differentiator is a cloud-optimized version of Spark with tools to make it easier to use for data engineering, data science, and analytics.
Databricks' main competition comes from Snowflake, the big cloud providers (AWS, Google, and Microsoft), and enterprise data platform vendors (Cloudera, which merged with Hortonworks in 2019, and IBM) that offer a broader suite of big data and analytics tools. But Databricks' focus on Spark optimization and ease of use sets them apart to some degree.
Potential Disruption - the soft underbelly
Databricks has seen tremendous success with its lakehouse platform, which offers a blend of data lake flexibility and warehouse speed. Still, some concerns loom over Databricks. To begin with, the platform has a steep learning curve, particularly for those who are new to distributed computing. Large data transfer volumes can introduce latency and costs, which is another challenge. It is also worth noting that Databricks primarily supports Python, R, and Scala, which limits the language options and therefore the developer audience it can serve.
The rise of Large Language Models (LLMs) and other Generative AI technologies poses some additional challenges for the company:
1. Portfolio of generative AI models: Developing its own portfolio of LLMs and Generative AI models could help Databricks win enterprise customers. Corporations often prefer the added security of building and deploying AI models in-house rather than using external services. As enterprises demand more secure, customized, and governed solutions for Generative AI, there is an opportunity for Databricks to develop in-house models that meet these needs. While challenging, doing so could help Databricks better serve the enterprise market and differentiate its platform for the lakehouse era. Dolly 2.0 showed initial promise; Databricks may now need to scale that effort to develop a full portfolio of LLMs and AI pipelines.
2. Complex data requirements: Generative AI models require large amounts of training data, necessitating improved data management and preparation tools.
3. Model interpretability: As Generative AI models become more like "black boxes", tools for interpreting, validating, and debugging them will become important.
4. Model monitoring: Dedicated monitoring of Generative AI models will likely be needed to detect bias, drift, degradation, and other issues.
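On the monitoring point, one of the simplest drift checks is flagging a feature whose live mean has shifted away from its training-time distribution. The sketch below shows that single check with stdlib Python; the threshold and the sample values are arbitrary illustrative choices, and production monitoring would use far more robust statistics.

```python
from statistics import mean, stdev

# Minimal drift check: flag a feature when the live mean moves more than
# k standard errors from the reference (training-time) mean.
# The threshold k = 2 is an arbitrary illustrative choice.
def mean_drift(reference, live, k=2.0):
    mu, sigma = mean(reference), stdev(reference)
    return abs(mean(live) - mu) > k * sigma / len(live) ** 0.5

reference = [0.50, 0.52, 0.48, 0.51, 0.49, 0.50, 0.53, 0.47]
stable    = [0.49, 0.51, 0.50, 0.52]   # looks like the training data
drifted   = [0.80, 0.82, 0.79, 0.81]   # clearly shifted upward

print(mean_drift(reference, stable))   # False
print(mean_drift(reference, drifted))  # True
```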
Interestingly, Databricks announced a major foray into the world of Generative AI via their $1.3B acquisition of MosaicML. This move is interesting for several reasons, which we will explore in the next section along with its potential implications.
Potential implications for Databricks post MosaicML acquisition
Databricks already houses a lot of enterprise data (including that of Fortune 500 companies) via their data lakehouse and warehouse offerings. To date, they have built a strong suite of offerings for machine learning workflows. However, the wave of Generative AI is a new platform shift that they’ll need to navigate.
Generative AI still requires companies to work with data effectively. This calls for a rethink of the current data tooling stack, which is largely centered around structured data, because Generative AI models allow enterprises to work with and reason over large amounts of unstructured data. This is uncharted territory for Databricks, and it is where their inorganic product expansion via the MosaicML acquisition could have interesting implications.
Some key features that could benefit Databricks users include:
Model management: MosaicML streamlines the machine learning workflow from model development to deployment and monitoring. This helps make Generative AI more scalable and sustainable.
Open-source models: MosaicML recently released two large open-source models that are easy to deploy, even with limited GPU resources. These models can be commercialized by Databricks for its customers.
Custom model training: MosaicML's training feature allows organizations to train their own models while maintaining full control over their sensitive data. This fits well with Databricks' data governance and security capabilities and something that their current customer base will value.
MosaicML’s MPT-30B offers flexible solutions to make generative AI more efficient and affordable for organizations:
Training: MosaicML Training allows you to customize MosaicML's generative models like MPT-30B through fine-tuning, domain-specific pretraining, or training from scratch on your own data.
MosaicML never stores your private data; you receive only the final trained model weights.
The cost is based on GPU minutes consumed during training.
Inference: The Starter Edition allows you to call APIs for their pre-trained generative AI models like MPT-30B-Instruct and MPT-7B-Instruct from their hosted endpoints.
The cost is based on the number of generated tokens.
The Enterprise Edition lets you deploy and optimize custom-trained models in your own infrastructure or VPC.
The cost of Enterprise Edition is based on GPU minutes consumed during inference.
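The two billing models above reduce to simple arithmetic: GPU-minutes consumed for training (and Enterprise inference), versus generated tokens for Starter inference. The sketch below illustrates that arithmetic only; all rates are hypothetical, not MosaicML's actual pricing.

```python
# Back-of-envelope comparison of the two billing models described above.
# Both rates are hypothetical, chosen only to illustrate the arithmetic.
GPU_MINUTE_RATE = 0.05   # $/GPU-minute (hypothetical)
PER_1K_TOKENS   = 0.002  # $/1,000 generated tokens (hypothetical)

def gpu_minutes_cost(gpus, minutes):
    """Training / Enterprise inference: billed on GPU-minutes consumed."""
    return gpus * minutes * GPU_MINUTE_RATE

def starter_inference_cost(generated_tokens):
    """Starter Edition inference: billed on generated tokens."""
    return generated_tokens / 1000 * PER_1K_TOKENS

# e.g. fine-tuning on 8 GPUs for 12 hours vs. serving 10M generated tokens
print(gpu_minutes_cost(gpus=8, minutes=12 * 60))  # 8 * 720 * 0.05 = 288.0
print(starter_inference_cost(10_000_000))         # 10,000 * 0.002 = 20.0
```

The practical point for buyers is that the training-side cost scales with compute time while the serving-side cost scales with usage, so the cheaper option depends entirely on the workload shape.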
Databricks is betting that enterprises will want to build their own Large Language Models (LLMs) for security and control. By integrating MosaicML's capabilities into its platform, Databricks’ customers now have the ability to fine-tune and efficiently scale state-of-the-art natural language processing models in a secure, proprietary environment.
With access to MosaicML's advanced models as a starting point, organizations can customize them for their specific needs. The integration with Mosaic Inference also brings pay-as-you-go pricing that is substantially cheaper than other options for production deployment of large language models.
Overall, the acquisition demonstrates Databricks' recognition of the rising enterprise demand for performant and cost-effective customized LLMs. By leveraging MosaicML's innovative model management platform, Databricks can now offer its users cutting-edge Generative AI capabilities in a secure, optimized, and scalable environment tailored for enterprise customers.
Key Takeaways
1. Model, but which one? Databricks has acquired MosaicML, a company that builds large language models (LLMs). This acquisition has sparked questions about the effectiveness of Databricks' own LLM, Dolly 2.0, and whether Databricks needs MosaicML to complement its own portfolio of models or to maintain a competitive edge in the Generative AI space.
2. Microsoft Azure x Databricks vs. OpenAI x Microsoft: With Microsoft Azure and Databricks seamlessly integrated and targeting the enterprise market, along with OpenAI offering a new API, it remains to be seen whether Databricks and MosaicML will become the go-to options or if OpenAI's offering will prevail. However, Microsoft's investment in both companies makes it a potential winner in this space.
3. Snowflake x Nvidia: The Snowflake x Nvidia partnership allows enterprises to securely build their own LLMs using their own data on the Snowflake data cloud. NVIDIA NeMo, a cloud-native platform for building, customizing, and deploying Generative AI models, will be hosted on the Snowflake data cloud, enabling customers to build, customize, and deploy custom LLMs for Generative AI applications.
4. Snowflake x Reka: Snowflake has also partnered with Reka, a builder of purpose-built generative models for enterprises. Reka's first product, Yasa, is a multimodal AI assistant that understands images, videos, tabular data, words, and phrases. It can generate ideas, answer basic questions, and derive insights from a company's internal data.
5. Data Cloud and LLM companies a perfect match?: Data cloud companies are increasingly acquiring or partnering with companies that build LLMs to offer end-to-end solutions for enterprises. By providing a trusted cloud platform that includes both data engineering and model-interfacing capabilities, these companies can create an end-to-end toolchain that targets enterprise customers.
6. Pricing wars are brewing: Competition in AI is intensifying, which will likely benefit end users through lower pricing and better solutions. Recent announcements from companies like MosaicML suggesting their costs are significantly lower than OpenAI indicate pricing wars may be brewing. While initially this may create challenges for some companies, the longer-term effect will likely be reduced costs and more options for users.
In summary, Databricks faces challenges from its primary GTM partner, Microsoft, independent LLM API providers like OpenAI, and integrated solutions from data cloud and LLM partnerships like Snowflake + Nvidia and Snowflake + Reka. It remains to be seen who stays ahead of the curve and which bets play out in the long run.
Disclaimer: This article reflects my personal views as a practitioner and does not represent the opinions of my employer.