SAN FRANCISCO -- Databricks open sourced its Databricks Delta Lake, a tool for structuring data in data lakes, just over a year after first officially introducing it.
Code for the open source product is available on GitHub and is free to run on premises, on laptops or in the cloud under the Apache License v2.0. Delta Lake data is stored in Apache Parquet, an open source columnar storage format that originated in the Hadoop ecosystem.
Databricks unveiled the open source data lake tool at the Spark + AI Summit 2019 here April 24. For Databricks and Spark users, it might not have come as a surprise.
"Databricks is open sourcing Delta because it aligns with the Spark open source model," said Tony Baer, principal analyst at Ovum.
"It also aligns with their strategy to monetize the integrated runtime, taking much of the effort to trigger frameworks like TensorFlow or MLflow off the shoulders of the data scientist," Baer continued, referring to the widely used machine learning library and Databricks machine learning tool.
Databricks Delta Lake integrates with Spark, as well as with MLflow, another open source, Spark-based tool developed primarily by Databricks. The integration, according to Ali Ghodsi, CEO and co-founder of Databricks, enables users to more easily perform machine learning tasks on data in their data lakes.
Delta Lake sits on top of data lakes, Ghodsi explained in an interview after his conference keynote, to "ensure you have high-quality data."
Ease of use was one of the major factors that played into Databricks' decision to make the tool open source, Ghodsi said.
"People have data problems in many different environments, and Databricks only exists in the cloud," he said. The original Delta product, introduced in late 2017, ran only in the cloud, Ghodsi noted.
Now, with Delta Lake, users can run the tool in more environments, and can get more value out of Delta now that they have access to the source code, he said.
"We want this data revolution to succeed," Ghodsi said. "It's in the best interest of everybody that these projects are successful," he continued. "By open sourcing it, you can have a much bigger impact."
Databricks Delta Lake can handle data and metadata at scale, enabling users to work with petabyte-scale tables. It also features what the Databricks team calls "time travel," a type of data versioning that enables users to take snapshots of data as they work on it. Users can recall earlier snapshots and revert to them as necessary.
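To make the snapshot idea concrete, here is a minimal Python sketch of versioned data. It is a toy illustration only, not Delta Lake's code or API: the `VersionedTable` class and its `commit`, `read` and `restore` methods are hypothetical names invented for this example, and everything is held in memory.

```python
from copy import deepcopy


class VersionedTable:
    """Toy in-memory table that keeps every committed snapshot.

    Hypothetical illustration of snapshot-style versioning; this is
    not how Delta Lake is implemented.
    """

    def __init__(self):
        # Version 0 is the empty table.
        self._snapshots = [[]]

    @property
    def version(self):
        """Number of the most recent version."""
        return len(self._snapshots) - 1

    def commit(self, rows):
        """Append rows and record the result as a new version."""
        new_snapshot = deepcopy(self._snapshots[-1]) + deepcopy(rows)
        self._snapshots.append(new_snapshot)
        return self.version

    def read(self, version=None):
        """Return the rows as of a version (the latest by default)."""
        if version is None:
            version = self.version
        return deepcopy(self._snapshots[version])

    def restore(self, version):
        """Revert by committing an old snapshot as the newest version."""
        self._snapshots.append(deepcopy(self._snapshots[version]))
        return self.version


table = VersionedTable()
v1 = table.commit([{"id": 1, "status": "raw"}])
v2 = table.commit([{"id": 2, "status": "raw"}])
rows_at_v1 = table.read(v1)   # recall the earlier snapshot
v3 = table.restore(v1)        # revert: the latest version matches v1 again
```

In this sketch a revert never deletes history. It records the old snapshot as a new version, so the state before the revert remains readable.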
The system also enables ACID transactions on users' data lakes, providing the data with more security and longevity, Databricks said.
For machine learning data platform vendor Splice Machine, a Databricks partner, Delta Lake aligns with Splice Machine's "overarching theme of operationalizing AI and making machine learning easier to put into production," as well as its interest in ACID compliance, said Monte Zweben, co-founder and CEO of Splice Machine, in an interview at the conference.
However, Zweben said, Databricks Delta Lake "hasn't gone all the way."
"The way Delta works is that it is keeping track of Delta files. It's keeping track of all of these Parquet files that are taking place," Zweben said. It tracks changes to data over time, but, "it's at a very large level of granularity," as opposed to the finer granularity he said Splice Machine's technology can track.
"Delta is on the right path to be able to track changes to data over time," Zweben said. It's not quite there yet, he said.
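Zweben's point about granularity can be illustrated with a short Python sketch of a commit log that tracks whole files rather than individual rows. This is a simplified model based only on his description, not Delta Lake's actual log format: the `FileLog` class, its methods and the file names are all hypothetical.

```python
class FileLog:
    """Toy commit log that records which data files make up a table.

    Hypothetical model of file-level change tracking; the real
    Delta Lake log format is not reproduced here.
    """

    def __init__(self):
        # Each commit lists the files it adds and the files it removes.
        self._commits = []

    def commit(self, add=(), remove=()):
        """Record one commit and return its version number."""
        self._commits.append({"add": list(add), "remove": list(remove)})
        return len(self._commits) - 1

    def files_at(self, version):
        """Replay commits 0..version to find the files that are live."""
        live = set()
        for entry in self._commits[: version + 1]:
            live -= set(entry["remove"])
            live |= set(entry["add"])
        return sorted(live)

    def changed_between(self, old, new):
        """Describe the change between two versions, per file only."""
        before = set(self.files_at(old))
        after = set(self.files_at(new))
        return {
            "added": sorted(after - before),
            "removed": sorted(before - after),
        }


log = FileLog()
# Initial load: two (hypothetical) Parquet files.
first = log.commit(add=["part-0.parquet", "part-1.parquet"])
# Updating a single row in part-1 means writing a replacement file.
second = log.commit(add=["part-2.parquet"], remove=["part-1.parquet"])
diff = log.changed_between(first, second)
```

Here `diff` shows only that `part-1.parquet` was swapped for `part-2.parquet`. A log of this kind cannot say which row inside the file changed, which is the coarse granularity Zweben described.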
More from Databricks
Also at the Spark + AI Summit 2019, Databricks unveiled a new open source big data tool: Koalas, now available on GitHub. The software enables users to more directly import their Pandas code into Spark environments, without having to change much, if any, of the original code.
Many data scientists are trained on Pandas, a Python library that is effective for small to medium-sized data sets but doesn't scale to enterprise-level ones. Koalas is meant to ease the transition from Pandas to Spark, enabling users to write code for Spark environments without necessarily learning a new API.
Databricks also previewed the long-expected Spark 3.0. Equipped with a set of new capabilities, including Kubernetes as a native mode, Spark 3.0 will likely be released later in 2019, according to Databricks.