How do I connect to a Delta table stored on Linode S3 using Python (PySpark) and convert it to a DataFrame?

I'm using the following Python code:
storage_path = f"s3a://{os.environ.get('LINODE_BUCKET_NAME')}@{os.environ.get('LINODE_REGION_NAME')}.linodeobjects.com"

    # Create a SparkSession with the necessary configuration
    spark = SparkSession.builder.appName("Delta Table Example").config(
        "spark.master",
        "local").config(
        "com.amazonaws.services.s3.enableV4",
        "true").config(
        "spark.hadoop.fs.s3a.impl",
        "org.apache.hadoop.fs.s3a.S3AFileSystem").config(
        "spark.jars",
        f"{os.environ.get('HADOOP_AWS_JAR')},{os.environ.get('AWS_JAVA_SDK')},{os.environ.get('AWS_JAVA_SDK_CORE')}").config(
        f"fs.s3a.bucket.{os.environ.get('LINODE_BUCKET_NAME')}.access.key",
        aws_access_key).config(
        f"fs.s3a.bucket.{os.environ.get('LINODE_BUCKET_NAME')}.secret.key",
        aws_secret_key).config(
        f"fs.s3a.bucket.{os.environ.get('LINODE_BUCKET_NAME')}.endpoint",
        f"{os.environ.get('LINODE_REGION_NAME')}.linodeobjects.com").config(
        'spark.jars.packages',
        'io.delta:delta-core_2.12:2.0.0').config(
        "spark.sql.extensions",
        "io.delta.sql.DeltaSparkSessionExtension").config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog").config(
        "spark.sql.session.timeZone",
        "UTC")
    spark = configure_spark_with_delta_pip(spark).getOrCreate()

    DeltaTable.forPath(
        spark, f"s3a://{os.environ.get('LINODE_BUCKET_NAME')}/{path_to_delta}").toDF()

I'm getting the following error:

    {Py4JJavaError} An error occurred while calling z:io.delta.tables.DeltaTable.forPath.
    : java.lang.NoClassDefFoundError: org/apache/hadoop/fs/impl/prefetch/PrefetchingStatistics

1 Reply

> An error occurred while calling z:io.delta.tables.DeltaTable.forPath.
> : java.lang.NoClassDefFoundError: org/apache/hadoop/fs/impl/prefetch/PrefetchingStatistics

It looks like a class is missing from the Hadoop filesystem layer. Essentially, Spark is trying to load the PrefetchingStatistics class, which isn't present in your current Hadoop setup.
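One way to narrow this down is to compare the Hadoop version bundled with your Spark installation against the hadoop-aws jar you pass in via spark.jars, since that prefetch class only exists in newer Hadoop releases. Here's a minimal sketch, assuming you can start a bare local session without the extra jars; note that `_jvm` is a non-public PySpark attribute, but it's a common way to reach Hadoop's VersionInfo:

    import os

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local").appName("version-check").getOrCreate()

    # Hadoop version that ships with this Spark installation
    hadoop_version = spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
    print("Bundled Hadoop version:", hadoop_version)

    # The hadoop-aws jar passed via spark.jars should match that version
    print("hadoop-aws jar on spark.jars:", os.environ.get('HADOOP_AWS_JAR'))

    spark.stop()

If the hadoop-aws jar is newer than the hadoop-common that ships with your Spark build, you can hit exactly this kind of NoClassDefFoundError.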

This may be a version-compatibility issue between those jars. I also threw this error into Google and found a similar error in this GitHub thread:

https://github.com/delta-io/delta/issues/895

That user found a few missing Delta jars to be the root cause in their case.
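If hand-matching individual jar files keeps causing mismatches, one alternative is to let Spark resolve the S3A dependencies as packages, so hadoop-aws (which pulls in a compatible AWS SDK bundle) comes in at a version you pick to match your bundled Hadoop. A rough sketch, assuming your delta-spark version supports the extra_packages argument of configure_spark_with_delta_pip, reusing aws_access_key, aws_secret_key and path_to_delta from the original snippet, and passing the S3A options with the spark.hadoop. prefix so they're copied into the Hadoop configuration. The 3.3.2 version string is a placeholder; replace it with whatever Hadoop version the check above printed:

    import os

    from pyspark.sql import SparkSession
    from delta import configure_spark_with_delta_pip
    from delta.tables import DeltaTable

    bucket = os.environ.get('LINODE_BUCKET_NAME')
    endpoint = f"{os.environ.get('LINODE_REGION_NAME')}.linodeobjects.com"

    builder = (
        SparkSession.builder.appName("Delta Table Example")
        .config("spark.master", "local")
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .config(f"spark.hadoop.fs.s3a.bucket.{bucket}.access.key", aws_access_key)
        .config(f"spark.hadoop.fs.s3a.bucket.{bucket}.secret.key", aws_secret_key)
        .config(f"spark.hadoop.fs.s3a.bucket.{bucket}.endpoint", endpoint)
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )

    # Let configure_spark_with_delta_pip add the matching delta-core package and
    # pull hadoop-aws from Maven instead of pointing spark.jars at local files.
    spark = configure_spark_with_delta_pip(
        builder,
        extra_packages=["org.apache.hadoop:hadoop-aws:3.3.2"],  # match your bundled Hadoop
    ).getOrCreate()

    df = DeltaTable.forPath(spark, f"s3a://{bucket}/{path_to_delta}").toDF()
    df.show()

The upside of the packages route is that the dependency resolver brings in a consistent set of transitive dependencies, so you don't have to line up hadoop-aws, the AWS SDK jars and hadoop-common by hand.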

Hope that helps! If anyone else has a bit more working experience with these specific packages, please feel free to chime in! :)
