
The most common way to create a DataFrame in Spark is to read directly from S3 in JSON or Parquet format. However, Redshift does not support Parquet, and we may not want to go through an unload step from Redshift to S3. So I will explain how to read from Redshift with Spark jobs. Before launching your spark-shell or your job, you will need to add the Redshift JDBC driver to the spark-defaults.conf file. Download it and put it inside $SPARK_HOME/lib.
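For example, a minimal spark-defaults.conf entry might look like the sketch below. The path and jar filename are placeholders; use the absolute path to your $SPARK_HOME/lib directory and the driver version you actually downloaded:

```
spark.driver.extraClassPath    /usr/local/spark/lib/RedshiftJDBC41-<version>.jar
spark.executor.extraClassPath  /usr/local/spark/lib/RedshiftJDBC41-<version>.jar
```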

Redshift with Spark:
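Here is a minimal sketch of the read itself, assuming a Spark 1.x SQLContext in spark-shell; the cluster endpoint, database, credentials, and table name are placeholders you will need to replace:

```scala
import org.apache.spark.sql.SQLContext

// sc is the SparkContext already available in spark-shell
val sqlContext = new SQLContext(sc)

// Placeholder Redshift endpoint, port, and database name
val jdbcUrl = "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/mydb"

// Read a Redshift table into a DataFrame through the JDBC data source
val events = sqlContext.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("driver", "com.amazon.redshift.jdbc41.Driver") // class name for the JDBC 4.1 driver
  .option("dbtable", "events")        // placeholder table name
  .option("user", "my_user")          // placeholder credentials
  .option("password", "my_password")
  .load()

events.printSchema()
events.show(5)
```

If you only need a subset of the data, the dbtable option can also take a subquery in parentheses instead of a plain table name, which pushes the filtering down to Redshift.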

Note that the same connection setup works with PostgreSQL databases too, since Redshift is accessed through a standard JDBC connection.