Access Delta Lake Tables
In Wherobots Cloud, you can read your existing Databricks Delta Lake tables and leverage them for advanced geospatial analysis within a Wherobots notebook.
This means that you can perform powerful geospatial operations like raster inference and map matching without moving or duplicating your datasets.
This guide provides a detailed example of how to prepare your data for analysis, focusing on a map matching use case.
Use Wherobots for spatial ETL and analytics on your Databricks Delta Lake tables
This workflow demonstrates how Wherobots can directly read and process data from your existing
Delta Lake tables, allowing you to leverage powerful spatial analytics without a separate ingestion step.
While reading is fully supported, writing to Delta Lake tables is not.
Benefits¶
Reading Databricks' Delta Lake tables with Wherobots Cloud allows you to:
- Avoid data migration: Access and process your Delta Lake tables directly in Databricks without needing to move them.
- Leverage existing data infrastructure: Integrate your Databricks-managed data with Wherobots' geospatial capabilities.
- Enable advanced geospatial analysis: Perform operations like Map Matching and Raster Inference on data stored in Delta Lake tables, which may not have native geospatial support.
- Handle large-scale datasets: Read and process large, partitioned Parquet files (even those without explicit geometry columns) stored within Delta Lake tables.
Before you start¶
Before using this feature, ensure that you have the following required resources:
- An Account within a Community, Professional, or Enterprise Edition Organization. For more information, see Create a Wherobots Account.
- A pre-existing Delta Lake table: This table should be set up and managed within the Databricks platform.
- A Databricks Personal Access Token (PAT): For more information, see Databricks personal access tokens for workspace users or Databricks personal access tokens for service principals in the official Databricks Documentation.
- A pre-existing Unity Catalog: In Databricks, Unity Catalog is the governance layer for your data.

  Understand the Unity Catalog namespace format

  You interact with the Unity Catalog through a three-level namespace: `CATALOG.SCHEMA.TABLE`.
Set your Databricks permissions¶
This table outlines the specific Unity Catalog privileges required for a user or service principal to read data from Databricks tables and volumes.
Databricks permissions are hierarchical
To access a table, you must have privileges for the catalog and schema that contain it. For more information, consult the official Databricks Documentation or your Workspace Admin.
Ensure that you have established the following permissions within Databricks:
| Access Privilege/Permission | Applies To (Object) | Purpose |
|---|---|---|
| `USE CATALOG` | Catalog | Allows the user or service principal to see and access the specified catalog. This is the top-level requirement. |
| `USE SCHEMA` | Schema | Grants the ability to see and access the schema (database) that contains the desired table or function. |
| `SELECT` | Table / View | Grants read-only permission to query data from a specific table or view within the schema. |
| `EXECUTE` | Function / Notebook | Grants the ability to run a user-defined function (UDF) or execute a notebook. This does not grant read access to tables. |
| `READ VOLUME` | External Volume | Grants read-only permission to access files stored within a specific Unity Catalog external volume. |
| `EXTERNAL USE SCHEMA` | Schema | Grants an external system the ability to read data from tables within a schema. For more information, see Enable external data access to Unity Catalog. |
More information on EXTERNAL USE SCHEMA
The user or service principal authenticating the external request must be explicitly granted the EXTERNAL USE SCHEMA privilege on the schema containing the target table. This is a critical security measure to prevent accidental data exfiltration and is not included in broader privileges like ALL PRIVILEGES.
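If these grants are not already in place, one way to issue them is from the Databricks side, for example in a Databricks notebook or SQL editor, by a user who is allowed to grant them (granting `EXTERNAL USE SCHEMA` typically requires the catalog owner or a metastore admin). The sketch below is a minimal illustration only; the catalog, schema, table, and principal names are placeholders you must replace with your own.

```python
# Run on the Databricks side (not in Wherobots) as a user who can grant these
# privileges. All object and principal names below are placeholders.
principal = "`wherobots-reader@example.com`"  # user or service principal

grant_statements = [
    f"GRANT USE CATALOG ON CATALOG your_catalog TO {principal}",
    f"GRANT USE SCHEMA ON SCHEMA your_catalog.your_schema TO {principal}",
    f"GRANT SELECT ON TABLE your_catalog.your_schema.your_table TO {principal}",
    f"GRANT EXTERNAL USE SCHEMA ON SCHEMA your_catalog.your_schema TO {principal}",
]

for statement in grant_statements:
    spark.sql(statement)
```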
Reading a Delta Lake table in a Wherobots Notebook¶
This section outlines how a user can access and read data from an existing Delta Lake table.
- Log in to Wherobots Cloud.
- Start a runtime.

  Which runtime should I choose?

  For reading large tables, consider using General Purpose-Small or General Purpose-Medium.

- Once the runtime has loaded, open your Wherobots Notebook.
- In your Wherobots Notebook, complete the following steps.

Define Connection Parameters and Credentials.
Configuring the Sedona Context

```python
from sedona.spark import *
from pyspark.sql.functions import *
import os
import getpass
from datetime import date

# Define your Unity Catalog name and URI
UNITY_CATALOG_URI = "https://your-databricks-workspace.cloud.databricks.com"
UNITY_CATALOG_NAME = "YOUR_CATALOG_NAME"
SCHEMA_NAME = "YOUR_DATABRICKS_SCHEMA_NAME"
TABLE_NAME = "YOUR_DATABRICKS_TABLE_NAME"

# The Databricks PAT will be hidden as you type or paste.
try:
    databricks_pat = getpass.getpass("Enter your Databricks PAT: ")
except Exception as e:
    print(f"Could not read the token: {e}")
    databricks_pat = None

# `getpass` only obscures the token during input.
# The 'databricks_pat' variable now holds the secret in plain text.
# Accidentally printing or logging this variable (e.g., `print(databricks_pat)`)
# will expose the secret in your console or notebook output.

# Checks if the token was entered
if not databricks_pat:
    raise ValueError("A Databricks PAT was not entered. Halting script.")

print("Databricks PAT received successfully.")
```
Finding Your Catalog, Schema, and Table Names
To locate the names for your catalog, schema, and table, see the official Databricks guide on how to explore database objects using the Catalog Explorer.
- Replace `https://your-databricks-workspace.cloud.databricks.com` with your Databricks workspace URI.
- Replace `YOUR_CATALOG_NAME` with your Unity Catalog name.
- Replace `YOUR_DATABRICKS_SCHEMA_NAME` with your Databricks schema name.
- Replace `YOUR_DATABRICKS_TABLE_NAME` with your Databricks table name.
Configure the Sedona Context to access the specific Delta Lake table.

```python
# Spark Session Initialization
spark_builder = (
    SedonaContext.builder()
    .config(
        "spark.jars.packages",
        "io.delta:delta-spark_2.12:3.3.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.0",
    )
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "io.unitycatalog.spark.UCSingleCatalog")
    .config(f"spark.sql.catalog.{UNITY_CATALOG_NAME}", "io.unitycatalog.spark.UCSingleCatalog")
    .config(f"spark.sql.catalog.{UNITY_CATALOG_NAME}.uri", UNITY_CATALOG_URI)
    # The databricks_pat variable from getpass is used here
    .config(f"spark.sql.catalog.{UNITY_CATALOG_NAME}.token", databricks_pat)
)
sedona = SedonaContext.create(spark_builder.getOrCreate())
```
Read the Delta Lake table using its catalog name.

Reading the Delta table

```python
df = sedona.table(f"{UNITY_CATALOG_NAME}.{SCHEMA_NAME}.{TABLE_NAME}")
```
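At this point, `df` is a standard PySpark DataFrame. If you want to confirm the read succeeded before moving on, a quick check such as the following works (the output depends on your table):

```python
# Optional sanity check on the loaded Delta Lake table.
df.printSchema()            # column names and types from the Delta table
df.show(5, truncate=False)  # preview the first few rows
```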
Performing geospatial analysis on data from a Delta Lake table¶
Once the Delta Lake table is loaded into a Wherobots Notebook, you can prepare your data for advanced geospatial operations like Map Matching and Raster Inference, as you would with any data stored directly within Wherobots Cloud.
To use spatial functions, the DataFrame needs a column with the actual `geometry` type, not just a `String` of WKT or a `Binary` representation of a shape. If your data is currently stored in one of these formats, you must first convert it into a true `geometry` object.
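For example, if your table happened to store shapes as WKT strings, Sedona's `ST_GeomFromWKT` performs that conversion. The sketch below is illustrative only: it assumes the catalog variables defined earlier and a hypothetical `wkt_geometry` column, so adjust the names to match your table.

```python
from pyspark.sql.functions import expr

# Hypothetical example: the source table stores shapes as WKT strings in a
# column named 'wkt_geometry'. ST_GeomFromWKT parses them into geometry objects.
df_with_geometry = (
    sedona
    .table(f"{UNITY_CATALOG_NAME}.{SCHEMA_NAME}.{TABLE_NAME}")
    .withColumn("geometry", expr("ST_GeomFromWKT(wkt_geometry)"))
)

df_with_geometry.printSchema()  # the new 'geometry' column now has a spatial type
```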
Preprocessing for Spatial Analysis¶
The method of preprocessing your data for spatial analysis depends on the initial state of your data. The following examples demonstrate this by transforming raw GPS records from a Delta Lake table into a DataFrame with a specialized `geometry` column.

While `ST_Point` creates geometry from a numerical data type (like `FLOAT`), `ST_GeomFromEWKB` activates the geometry that was stored in a dormant, non-spatial format like `Binary`. This activation still requires creating a new column to hold the finished, active object.
- Load Data: Follow steps from Reading a Delta Lake table in a Wherobots Notebook to load your data.
- Prepare Data: Create a `geometry` column using the coordinate fields, depending on the initial data type.

Consider a DataFrame loaded from a Delta Lake table that contains separate numerical columns for longitude (x-coordinate) and latitude (y-coordinate):

| trip_id | start_lon | start_lat |
|---|---|---|
| 101 | -122.4194 | 37.7749 |
| 102 | -122.4080 | 37.7850 |

- Input: A PySpark DataFrame (read from a Delta table) where the location data is stored in separate, non-spatial columns.
- Data Types: The `start_lon` and `start_lat` columns are standard numerical types, such as `Double`, `Decimal`, or `Float`, not spatial geometry.
- Spatial Awareness: At this stage, the DataFrame is not spatially aware. The database sees `start_lon` and `start_lat` as two independent numbers; it has no intrinsic understanding that they represent a single geographic location.
Use `ST_Point(start_lon, start_lat)` to create a new `geometry` column from these two columns.

ST_Point Example

```python
from pyspark.sql.functions import expr

# Assume UNITY_CATALOG_NAME, SCHEMA_NAME, and TABLE_NAME are predefined variables
# that point to a table containing trip start/end coordinates.
# The source table has columns 'start_lon' and 'start_lat'.
# We'll create a new column named 'geometry' using ST_Point.
trip_data = (
    sedona
    .table(f"{UNITY_CATALOG_NAME}.{SCHEMA_NAME}.{TABLE_NAME}")
    .withColumn(
        "geometry",
        expr("ST_Point(start_lon, start_lat)")
    )
)

print("DataFrame with a 2D 'geometry' column created from x and y coordinates:")
trip_data.select("trip_id", "geometry").show(truncate=False)
```
Output

```
# trip_data.select("trip_id", "geometry").show(truncate=False)
+-------+--------------------------+
|trip_id|geometry                  |
+-------+--------------------------+
|101    |POINT (-122.4194 37.7749) |
|102    |POINT (-122.4080 37.7850) |
+-------+--------------------------+
```
- Output: A new PySpark DataFrame that now includes a spatially-aware `geometry` column.
- Structure: A table containing all the original columns plus the new `geometry` column.
- Data Types: The data type of the new `geometry` column is a specialized `GeometryUDT` (User-Defined Type). This is a complex object that Apache Sedona understands as a spatial object. When you display the column using the `.show()` command, this `GeometryUDT` object is presented in a human-readable string format called Well-Known Text (WKT).
- Spatial Awareness: The output DataFrame is now spatially aware. You can run spatial functions (e.g., `ST_Distance`, `ST_Contains`) on the `geometry` column.
The `ST_Point` function is essential for spatial analysis because it transforms simple numerical coordinates into a powerful geometry object, as this table illustrates:

| Feature | Before `ST_Point` | After `ST_Point` |
|---|---|---|
| Data Format | `FLOAT` | `geometry` |
| Spatial Awareness | No. Seen as independent numerical columns. | Yes. Acknowledged as a spatial object. |
| Functionality | Cannot use spatial functions. | Can use functions like `ST_Distance`, `ST_Area`, etc. |

Consider a DataFrame loaded from a Delta Lake table that contains an `ewkb_geometry` column:

| trip_object_id | ewkb_geometry |
|---|---|
| 950 | 0101000020E61000004A8F16945A485EC04E1A2B43706F4240... |
| 951 | 0101000020E6100000D7A3703D40485EC0C3F5285C6F6F4240... |

- Input: A PySpark DataFrame where the geometric data is stored in a single column in a dormant, encoded format.
- Data Types: The `ewkb_geometry` column is a binary format. It contains a long sequence of characters that represents a complete geometry (Point, Line, or Polygon).
- Spatial Awareness: At this stage, the DataFrame is not spatially aware. Although the `ewkb_geometry` column contains all the necessary geometric information, it essentially functions much like text or simple numerical data within the database because you cannot use this column directly for spatial functions.
ST_GeomFromEWKB example

```python
from pyspark.sql.functions import expr

# Assume the source table contains a column 'ewkb_geometry' with data
# in Extended Well-Known Binary format.
# ST_GeomFromEWKB parses this binary data into a usable geometry object.
trip_data = (
    sedona
    .table(f"{UNITY_CATALOG_NAME}.{SCHEMA_NAME}.{TABLE_NAME}")
    .withColumn(
        "geometry",
        expr("ST_GeomFromEWKB(ewkb_geometry)")
    )
)

print("DataFrame with 'geometry' column parsed from an EWKB source:")
trip_data.select("trip_object_id", "geometry").show(truncate=False)
```
Output

```
# trip_data.select("trip_object_id", "geometry").show(truncate=False)
+--------------+--------------------------+
|trip_object_id|geometry                  |
+--------------+--------------------------+
|950           |POINT (-122.4194 37.7749) |
|951           |POINT (-122.4080 37.7850) |
+--------------+--------------------------+
```
`ST_GeomFromEWKB` parses, interprets, and constructs the usable geometry object from its given input. This function assumes you have a single column that contains geometric data encoded in the Extended Well-Known Binary (EWKB) format. EWKB is a representation of a geometry (like a point, line, or polygon) that often includes the Spatial Reference System Identifier (SRID). It looks like a long string of hexadecimal characters in a database.
- Output: The primary output is a new PySpark DataFrame where the encoded geometry has been parsed into a live, spatially-aware `geometry` column.
- Structure: A table containing all the original columns plus the new `geometry` column.
- Data Types: The underlying data type of the new `geometry` column is the specialized `GeometryUDT` (User-Defined Type). This is the spatial object that Apache Sedona can use for analysis.
- Spatial Awareness: The output DataFrame is now spatially aware. You can run any applicable spatial function on the `geometry` column.
Here's a summary of how the `ST_GeomFromEWKB` function transforms opaque hexadecimal data into a useful, spatially-aware object:

| Feature | Before `ST_GeomFromEWKB` | After `ST_GeomFromEWKB` |
|---|---|---|
| Data Format | A long hexadecimal string (EWKB format). | `geometry` |
| Spatial Awareness | No. Seen as dormant text or bytes. | Yes. Acknowledged as a spatial object. |
| Functionality | Cannot use spatial functions. | Can use functions like `ST_Distance`, `ST_Area`, etc. |
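Because EWKB often carries an SRID (the sample values above include SRID 4326), you can optionally check which SRID the parsed geometries report. This sketch assumes the `trip_data` DataFrame from the `ST_GeomFromEWKB` example above.

```python
from pyspark.sql.functions import expr

# Optional check: inspect the SRID reported for the parsed geometries.
trip_data.select(
    "trip_object_id",
    expr("ST_SRID(geometry)").alias("srid")
).show()
```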
Next Steps¶
Now that you have created a spatially-aware `geometry` column, you can leverage it to perform powerful geospatial analysis.
Analyze Your Data with Spatial SQL Functions¶
After you create a `geometry` column, you can use it with different kinds of spatial functions. You can now analyze the relationships between your geometries using powerful and highly-optimized functions like `ST_Distance`, `ST_Contains`, and `ST_Intersects`.
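The sketch below illustrates this on the `trip_data` DataFrame created in the examples above; the reference point and bounding box are arbitrary values chosen purely for demonstration.

```python
from pyspark.sql.functions import expr

# Illustrative sketch: assumes the 'trip_data' DataFrame with a 'geometry'
# column created above. The reference point and bounding box are arbitrary.
analyzed = (
    trip_data
    .withColumn(
        "degrees_to_ref",
        # Planar distance in degrees, since the coordinates are lon/lat.
        expr("ST_Distance(geometry, ST_Point(-122.4, 37.78))")
    )
    .withColumn(
        "in_study_area",
        expr("ST_Contains(ST_PolygonFromEnvelope(-122.5, 37.7, -122.3, 37.8), geometry)")
    )
)

analyzed.select("trip_id", "degrees_to_ref", "in_study_area").show()
```

Note that with longitude/latitude data, `ST_Distance` returns a planar distance in degrees; for distances in meters, Sedona also provides functions such as `ST_DistanceSphere`.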
Perform Map Matching on a Real-World Road Network¶
Wherobots offers powerful map matching capabilities to take your prepared GPS data and link it to actual roads.
For a full, hands-on example, open a Wherobots Notebook and navigate to the tutorial at `examples/Analyzing_Data/GPS_Map_Matching.ipynb`.

You can also review the static version on GitHub: GPS_Map_Matching.ipynb
Usage considerations¶
- Read-Only Access to Delta tables: Wherobots Cloud currently supports reading from Delta Lake tables. Writing directly to Delta Lake tables from Wherobots is not supported.
- Large File Handling: This process is designed to handle large, partitioned Parquet files commonly found in Delta Lake Tables, including those without pre-existing geometry columns.
- Permissions Management: All permissions and access controls for the Delta Lake tables themselves must be managed on the Databricks side.