Access Delta Lake Tables
In Wherobots Cloud, you can read your existing Databricks Delta Lake tables and leverage them for advanced geospatial analysis within a Wherobots notebook.
This means that you can perform powerful geospatial operations like raster inference and map matching without moving or duplicating your datasets.
This guide provides a detailed example of how to prepare your data for analysis, focusing on a map matching use case.
Use Wherobots for spatial ETL and analytics on your Databricks Delta Lake tables
This workflow demonstrates how Wherobots can directly read and process data from your existing
Delta Lake tables, allowing you to leverage powerful spatial analytics without a separate ingestion step.
While reading is fully supported, writing to Delta Lake tables is not.
Benefits¶
Reading Databricks' Delta Lake tables with Wherobots Cloud allows you to:
- Avoid data migration: Access and process your Delta Lake tables directly in Databricks without needing to move them.
- Leverage existing data infrastructure: Integrate your Databricks-managed data with Wherobots' geospatial capabilities.
- Enable advanced geospatial analysis: Perform operations like Map Matching and Raster Inference on data stored in Delta Lake tables, which may not have native geospatial support.
- Handle large-scale datasets: Read and process large, partitioned Parquet files (even those without explicit geometry columns) stored within Delta Lake tables.
Before you start¶
Before using this feature, ensure that you have the following required resources:
- An Account within a Community, Professional, or Enterprise Edition Organization. For more information, see Create a Wherobots Account.
- A pre-existing Delta Lake table: This table should be set up and managed within the Databricks platform.
- A Databricks Personal Access Token (PAT): For more information, see Databricks personal access tokens for workspace users or Databricks personal access tokens for service principals in the official Databricks Documentation.
- A pre-existing Unity Catalog: In Databricks, Unity Catalog is the governance layer for your data.

  Understand the Unity Catalog namespace format

  You interact with the Unity Catalog through a three-level namespace: `CATALOG.SCHEMA.TABLE`.
Set your Databricks permissions¶
This table outlines the specific Unity Catalog privileges required for a user or service principal to read data from Databricks tables and volumes.
Databricks permissions are hierarchical
To access a table, you must have privileges for the catalog and schema that contain it. For more information, consult the official Databricks Documentation or your Workspace Admin.
Ensure that you have established the following permissions within Databricks:
| Access Privilege/Permission | Applies To (Object) | Purpose |
|---|---|---|
| `USE CATALOG` | Catalog | Allows the user or service principal to see and access the specified catalog. This is the top-level requirement. |
| `USE SCHEMA` | Schema | Grants the ability to see and access the schema (database) that contains the desired table or function. |
| `SELECT` | Table / View | Grants read-only permission to query data from a specific table or view within the schema. |
| `EXECUTE` | Function / Notebook | Grants the ability to run a user-defined function (UDF) or execute a notebook. This does not grant read access to tables. |
| `READ VOLUME` | External Volume | Grants read-only permission to access files stored within a specific Unity Catalog external volume. |
| `EXTERNAL USE SCHEMA` | Schema | Grants an external system the ability to read data from tables within a schema. For more information, see Enable external data access to Unity Catalog. |
More information on EXTERNAL USE SCHEMA
The user or service principal authenticating the external request must be explicitly granted the EXTERNAL USE SCHEMA privilege on the schema containing the target table. This is a critical security measure to prevent accidental data exfiltration and is not included in broader privileges like ALL PRIVILEGES.
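If these grants are not already in place, one way to issue them is from the Databricks side, for example in a Databricks notebook or SQL editor, by a user who is allowed to grant them (granting `EXTERNAL USE SCHEMA` typically requires the catalog owner or a metastore admin). The sketch below is a minimal illustration only; the catalog, schema, table, and principal names are placeholders you must replace with your own.

```python
# Run on the Databricks side (not in Wherobots) as a user who can grant these
# privileges. All object and principal names below are placeholders.
principal = "`wherobots-reader@example.com`"  # user or service principal

grant_statements = [
    f"GRANT USE CATALOG ON CATALOG your_catalog TO {principal}",
    f"GRANT USE SCHEMA ON SCHEMA your_catalog.your_schema TO {principal}",
    f"GRANT SELECT ON TABLE your_catalog.your_schema.your_table TO {principal}",
    f"GRANT EXTERNAL USE SCHEMA ON SCHEMA your_catalog.your_schema TO {principal}",
]

for statement in grant_statements:
    spark.sql(statement)
```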
Reading a Delta Lake table in a Wherobots Notebook¶
This section outlines how a user can access and read data from an existing Delta Lake table.
- Log in to Wherobots Cloud.
- Start a runtime.

  Which runtime should I choose?

  For reading large tables, consider using General Purpose-Small or General Purpose-Medium.

- Once the runtime has loaded, open your Wherobots Notebook.
- In your Wherobots Notebook, complete the following steps.

Define Connection Parameters and Credentials.
Configuring the Sedona Context

```python
from sedona.spark import *
from pyspark.sql.functions import *
import os
import getpass
from datetime import date

# Define your Unity Catalog name and URI
UNITY_CATALOG_URI = "https://your-databricks-workspace.cloud.databricks.com"
UNITY_CATALOG_NAME = "YOUR_CATALOG_NAME"
SCHEMA_NAME = "YOUR_DATABRICKS_SCHEMA_NAME"
TABLE_NAME = "YOUR_DATABRICKS_TABLE_NAME"

# The Databricks PAT will be hidden as you type or paste.
try:
    databricks_pat = getpass.getpass("Enter your Databricks PAT: ")
except Exception as e:
    print(f"Could not read the token: {e}")
    databricks_pat = None

# `getpass` only obscures the token during input.
# The 'databricks_pat' variable now holds the secret in plain text.
# Accidentally printing or logging this variable (e.g., `print(databricks_pat)`)
# will expose the secret in your console or notebook output.

# Checks if the token was entered
if not databricks_pat:
    raise ValueError("A Databricks PAT was not entered. Halting script.")

print("Databricks PAT received successfully.")
```
Finding Your Catalog, Schema, and Table Names
To locate the names for your catalog, schema, and table, see the official Databricks guide on how to explore database objects using the Catalog Explorer.
- Replace `https://your-databricks-workspace.cloud.databricks.com` with your Databricks workspace URI.
- Replace `YOUR_CATALOG_NAME` with your Unity Catalog name.
- Replace `YOUR_DATABRICKS_SCHEMA_NAME` with your Databricks schema name.
- Replace `YOUR_DATABRICKS_TABLE_NAME` with your Databricks table name.
Configure the Sedona Context to access the specific Delta Lake table.

```python
# Spark Session Initialization
spark_builder = (
    SedonaContext.builder()
    .config(
        "spark.jars.packages",
        "io.delta:delta-spark_2.12:3.3.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.0",
    )
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "io.unitycatalog.spark.UCSingleCatalog")
    .config(f"spark.sql.catalog.{UNITY_CATALOG_NAME}", "io.unitycatalog.spark.UCSingleCatalog")
    .config(f"spark.sql.catalog.{UNITY_CATALOG_NAME}.uri", UNITY_CATALOG_URI)
    # The databricks_pat variable from getpass is used here
    .config(f"spark.sql.catalog.{UNITY_CATALOG_NAME}.token", databricks_pat)
)
sedona = SedonaContext.create(spark_builder.getOrCreate())
```
Read the Delta Lake table using its catalog name.

Reading the Delta table

```python
df = sedona.table(f"{UNITY_CATALOG_NAME}.{SCHEMA_NAME}.{TABLE_NAME}")
```
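At this point, `df` is a standard PySpark DataFrame. If you want to confirm the read succeeded before moving on, a quick check such as the following works (the output depends on your table):

```python
# Optional sanity check on the loaded Delta Lake table.
df.printSchema()            # column names and types from the Delta table
df.show(5, truncate=False)  # preview the first few rows
```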
Performing geospatial analysis on data from a Delta Lake table¶
Once the Delta Lake table is loaded into a Wherobots Notebook, you can prepare your data for advanced geospatial operations like Map Matching and Raster Inference, as you would with any data stored directly within Wherobots Cloud.
To use spatial functions, the DataFrame needs a column with the actual `geometry` type, not just a `String` of WKT or a `Binary` representation of a shape. If your data is currently stored in one of these formats, you must first convert it into a true `geometry` object.
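For example, if your table happened to store shapes as WKT strings, Sedona's `ST_GeomFromWKT` performs that conversion. The sketch below is illustrative only: it assumes the catalog variables defined earlier and a hypothetical `wkt_geometry` column, so adjust the names to match your table.

```python
from pyspark.sql.functions import expr

# Hypothetical example: the source table stores shapes as WKT strings in a
# column named 'wkt_geometry'. ST_GeomFromWKT parses them into geometry objects.
df_with_geometry = (
    sedona
    .table(f"{UNITY_CATALOG_NAME}.{SCHEMA_NAME}.{TABLE_NAME}")
    .withColumn("geometry", expr("ST_GeomFromWKT(wkt_geometry)"))
)

df_with_geometry.printSchema()  # the new 'geometry' column now has a spatial type
```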
Preprocessing for Spatial Analysis¶
The method of preprocessing your data for spatial analysis depends on the initial state of your data. The following examples demonstrate this by transforming raw GPS records from a Delta Lake table into a DataFrame with a specialized `geometry` column.

While `ST_Point` creates geometry from a numerical data type (like `FLOAT`), `ST_GeomFromEWKB` activates the geometry that was stored in a dormant, non-spatial format like `Binary`. This activation still requires creating a new column to hold the finished, active object.
- Load Data: Follow steps from Reading a Delta Lake table in a Wherobots Notebook to load your data.
- Prepare Data: Create a `geometry` column using the coordinate fields, depending on the initial data type.

Consider a DataFrame loaded from a Delta Lake table that contains separate numerical columns for longitude (x-coordinate) and latitude (y-coordinate):

| trip_id | start_lon | start_lat |
|---|---|---|
| 101 | -122.4194 | 37.7749 |
| 102 | -122.4080 | 37.7850 |

- Input: A PySpark DataFrame (read from a Delta table) where the location data is stored in separate, non-spatial columns.
- Data Types: The `start_lon` and `start_lat` columns are standard numerical types, such as `Double`, `Decimal`, or `Float`, not spatial geometry.
- Spatial Awareness: At this stage, the DataFrame is not spatially aware. The database sees `start_lon` and `start_lat` as two independent numbers; it has no intrinsic understanding that they represent a single geographic location.
Use `ST_Point(start_lon, start_lat)` to create a new `geometry` column from these two columns.

ST_Point Example

```python
from pyspark.sql.functions import expr

# Assume UNITY_CATALOG_NAME, SCHEMA_NAME, and TABLE_NAME are predefined variables
# that point to a table containing trip start/end coordinates.
# The source table has columns 'start_lon' and 'start_lat'.
# We'll create a new column named 'geometry' using ST_Point.
trip_data = (
    sedona
    .table(f"{UNITY_CATALOG_NAME}.{SCHEMA_NAME}.{TABLE_NAME}")
    .withColumn(
        "geometry",
        expr("ST_Point(start_lon, start_lat)")
    )
)

print("DataFrame with a 2D 'geometry' column created from x and y coordinates:")
trip_data.select("trip_id", "geometry").show(truncate=False)
```
Output

```
# trip_data.select("trip_id", "geometry").show(truncate=False)
+-------+--------------------------+
|trip_id|geometry                  |
+-------+--------------------------+
|101    |POINT (-122.4194 37.7749) |
|102    |POINT (-122.4080 37.7850) |
+-------+--------------------------+
```
- Output: A new PySpark DataFrame that now includes a spatially-aware `geometry` column.
- Structure: A table containing all the original columns plus the new `geometry` column.
- Data Types: The data type of the new `geometry` column is a specialized `GeometryUDT` (User-Defined Type). This is a complex object that Apache Sedona understands as a spatial object. When you display the column using the `.show()` command, this `GeometryUDT` object is presented in a human-readable string format called Well-Known Text (WKT).
- Spatial Awareness: The output DataFrame is now spatially aware. You can run spatial functions (e.g., `ST_Distance`, `ST_Contains`) on the `geometry` column.
The `ST_Point` function is essential for spatial analysis because it transforms simple numerical coordinates into a powerful geometry object, as this table illustrates:

| Feature | Before `ST_Point` | After `ST_Point` |
|---|---|---|
| Data Format | `FLOAT` | `geometry` |
| Spatial Awareness | No. Seen as independent numerical columns. | Yes. Acknowledged as a spatial object. |
| Functionality | Cannot use spatial functions. | Can use functions like `ST_Distance`, `ST_Area`, etc. |

Consider a DataFrame loaded from a Delta Lake table that contains an `ewkb_geometry` column:

| trip_object_id | ewkb_geometry |
|---|---|
| 950 | 0101000020E61000004A8F16945A485EC04E1A2B43706F4240... |
| 951 | 0101000020E6100000D7A3703D40485EC0C3F5285C6F6F4240... |

- Input: A PySpark DataFrame where the geometric data is stored in a single column in a dormant, encoded format.
- Data Types: The `ewkb_geometry` column is a binary format. It contains a long sequence of characters that represents a complete geometry (Point, Line, or Polygon).
- Spatial Awareness: At this stage, the DataFrame is not spatially aware. Although the `ewkb_geometry` column contains all the necessary geometric information, it essentially functions much like text or simple numerical data within the database because you cannot use this column directly for spatial functions.
ST_GeomFromEWKB example

```python
from pyspark.sql.functions import expr

# Assume the source table contains a column 'ewkb_geometry' with data
# in Extended Well-Known Binary format.
# ST_GeomFromEWKB parses this binary data into a usable geometry object.
trip_data = (
    sedona
    .table(f"{UNITY_CATALOG_NAME}.{SCHEMA_NAME}.{TABLE_NAME}")
    .withColumn(
        "geometry",
        expr("ST_GeomFromEWKB(ewkb_geometry)")
    )
)

print("DataFrame with 'geometry' column parsed from an EWKB source:")
trip_data.select("trip_object_id", "geometry").show(truncate=False)
```
Output

```
# trip_data.select("trip_object_id", "geometry").show(truncate=False)
+--------------+--------------------------+
|trip_object_id|geometry                  |
+--------------+--------------------------+
|950           |POINT (-122.4194 37.7749) |
|951           |POINT (-122.4080 37.7850) |
+--------------+--------------------------+
```
`ST_GeomFromEWKB` parses, interprets, and constructs the usable geometry object from its given input. This function assumes you have a single column that contains geometric data encoded in the Extended Well-Known Binary (EWKB) format. EWKB is a representation of a geometry (like a point, line, or polygon) that often includes the Spatial Reference System Identifier (SRID). It looks like a long string of hexadecimal characters in a database.
- Output: The primary output is a new PySpark DataFrame where the encoded geometry has been parsed into a live, spatially-aware `geometry` column.
- Structure: A table containing all the original columns plus the new `geometry` column.
- Data Types: The underlying data type of the new `geometry` column is the specialized `GeometryUDT` (User-Defined Type). This is the spatial object that Apache Sedona can use for analysis.
- Spatial Awareness: The output DataFrame is now spatially aware. You can run any applicable spatial function on the `geometry` column.
Here's a summary of how the `ST_GeomFromEWKB` function transforms opaque hexadecimal data into a useful, spatially-aware object:

| Feature | Before `ST_GeomFromEWKB` | After `ST_GeomFromEWKB` |
|---|---|---|
| Data Format | A long hexadecimal string (EWKB format). | `geometry` |
| Spatial Awareness | No. Seen as dormant text or bytes. | Yes. Acknowledged as a spatial object. |
| Functionality | Cannot use spatial functions. | Can use functions like `ST_Distance`, `ST_Area`, etc. |
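Because EWKB often carries an SRID (the sample values above include SRID 4326), you can optionally check which SRID the parsed geometries report. This sketch assumes the `trip_data` DataFrame from the `ST_GeomFromEWKB` example above.

```python
from pyspark.sql.functions import expr

# Optional check: inspect the SRID reported for the parsed geometries.
trip_data.select(
    "trip_object_id",
    expr("ST_SRID(geometry)").alias("srid")
).show()
```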
Next Steps¶
Now that you have created a spatially-aware `geometry` column, you can leverage it to perform powerful geospatial analysis.
Analyze Your Data with Spatial SQL Functions¶
After you create a `geometry` column, you can use it with different kinds of spatial functions. You can now analyze the relationships between your geometries using powerful and highly-optimized functions like `ST_Distance`, `ST_Contains`, and `ST_Intersects`.
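The sketch below illustrates this on the `trip_data` DataFrame created in the examples above; the reference point and bounding box are arbitrary values chosen purely for demonstration.

```python
from pyspark.sql.functions import expr

# Illustrative sketch: assumes the 'trip_data' DataFrame with a 'geometry'
# column created above. The reference point and bounding box are arbitrary.
analyzed = (
    trip_data
    .withColumn(
        "degrees_to_ref",
        # Planar distance in degrees, since the coordinates are lon/lat.
        expr("ST_Distance(geometry, ST_Point(-122.4, 37.78))")
    )
    .withColumn(
        "in_study_area",
        expr("ST_Contains(ST_PolygonFromEnvelope(-122.5, 37.7, -122.3, 37.8), geometry)")
    )
)

analyzed.select("trip_id", "degrees_to_ref", "in_study_area").show()
```

Note that with longitude/latitude data, `ST_Distance` returns a planar distance in degrees; for distances in meters, Sedona also provides functions such as `ST_DistanceSphere`.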
Perform Map Matching on a Real-World Road Network¶
Wherobots offers powerful map matching capabilities to take your prepared GPS data and link it to actual roads.
For a full, hands-on example, open a Wherobots Notebook and navigate to the tutorial at `examples/Analyzing_Data/GPS_Map_Matching.ipynb`.

You can also review the static version on GitHub: GPS_Map_Matching.ipynb
Usage considerations¶
- Read-Only Access to Delta tables: Wherobots Cloud currently supports reading from Delta Lake tables. Writing directly to Delta Lake tables from Wherobots is not supported.
- Large File Handling: This process is designed to handle large, partitioned Parquet files commonly found in Delta Lake Tables, including those without pre-existing geometry columns.
- Permissions Management: All permissions and access controls for the Delta Lake tables themselves must be managed on the Databricks side.