How do I connect to a Delta table stored on Linode S3 using Python (PySpark) and convert it to a DataFrame?
I'm using the following Python code:
import os

from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# aws_access_key, aws_secret_key and path_to_delta are defined elsewhere.
storage_path = f"s3a://{os.environ.get('LINODE_BUCKET_NAME')}@{os.environ.get('LINODE_REGION_NAME')}.linodeobjects.com"

# Create a SparkSession builder with the necessary configuration
builder = (
    SparkSession.builder.appName("Delta Table Example")
    .config("spark.master", "local")
    .config("com.amazonaws.services.s3.enableV4", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config(
        "spark.jars",
        f"{os.environ.get('HADOOP_AWS_JAR')},{os.environ.get('AWS_JAVA_SDK')},{os.environ.get('AWS_JAVA_SDK_CORE')}")
    .config(f"fs.s3a.bucket.{os.environ.get('LINODE_BUCKET_NAME')}.access.key", aws_access_key)
    .config(f"fs.s3a.bucket.{os.environ.get('LINODE_BUCKET_NAME')}.secret.key", aws_secret_key)
    .config(
        f"fs.s3a.bucket.{os.environ.get('LINODE_BUCKET_NAME')}.endpoint",
        f"{os.environ.get('LINODE_REGION_NAME')}.linodeobjects.com")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.sql.session.timeZone", "UTC")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Load the Delta table and convert it to a DataFrame
DeltaTable.forPath(
    spark, f"s3a://{os.environ.get('LINODE_BUCKET_NAME')}/{path_to_delta}").toDF()
I'm getting the following error:
{Py4JJavaError}An error occurred while calling z:io.delta.tables.DeltaTable.forPath.
: java.lang.NoClassDefFoundError: org/apache/hadoop/fs/impl/prefetch/PrefetchingStatistics
1 Reply
An error occurred while calling z:io.delta.tables.DeltaTable.forPath.
: java.lang.NoClassDefFoundError: org/apache/hadoop/fs/impl/prefetch/PrefetchingStatistics
It looks like a class is missing from the Hadoop filesystem libraries. Essentially, Spark is trying to load a class related to PrefetchingStatistics that isn't present in your current Hadoop setup.
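This may be a version compatibility issue, for example a hadoop-aws jar built for a different Hadoop release than the one your Spark install runs on. Otherwise, I threw this error into Google and found a similar error in this GitHub thread: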
https://github.com/delta-io/delta/issues/895
That user specifically found a few missing Delta jars to be the issue.
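If you want to test the version-compatibility theory, here is a rough sketch that compares the version in a hadoop-aws jar file name against the Hadoop version Spark is running on (hadoop-aws is meant to be the same release as the rest of the Hadoop libraries). The helper names `jar_version` and `check_hadoop_aws`, the example path, and the example version numbers are all made up for illustration; they are not taken from your environment.

```python
import re
from pathlib import Path


def jar_version(jar_path):
    """Return the version embedded in a jar file name, or None if there is none.

    Assumes the usual Maven naming scheme, e.g. hadoop-aws-3.3.4.jar.
    """
    match = re.search(r"-(\d+(?:\.\d+)+)\.jar$", Path(jar_path).name)
    return match.group(1) if match else None


def check_hadoop_aws(hadoop_version, hadoop_aws_jar):
    """Report whether the hadoop-aws jar matches the given Hadoop version."""
    found = jar_version(hadoop_aws_jar)
    if found == hadoop_version:
        return f"OK: hadoop-aws {found} matches Hadoop {hadoop_version}"
    # Suggest the Maven coordinate that would line up with the runtime.
    return (
        f"MISMATCH: hadoop-aws {found} vs Hadoop {hadoop_version}; "
        f"try org.apache.hadoop:hadoop-aws:{hadoop_version}"
    )


# Hypothetical values, only to show the call.
print(check_hadoop_aws("3.3.2", "/opt/jars/hadoop-aws-3.3.5.jar"))
```

To get the Hadoop version of a running session, `spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()` should work, though it goes through PySpark's private `_jvm` attribute. If the versions differ, swapping the jar for the matching release (or listing the suggested coordinate in `spark.jars.packages` instead of `spark.jars`) would be the first thing I'd try.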
Hope that helps! If anyone else has a bit more working experience with these specific packages, please feel free to chime in! :)