BigData - Hadoop


Latest articles

Spark logs – Be quiet please!

Spark logs Spark can be very useful, but one of its downsides is its noisy logging. If you are digging into your data with spark-shell or the pyspark shell, or trying to follow your app logs, Spark logs can be a nightmare. The proper way to quiet them is to set the akka and hadoop classes to the ERROR log level, so that only critical and error messages are shown. By default /hadoop/spark/conf/log4j.properties silences some logs, but not all, so you will need to append the following lines to your log4j.properties file.
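As a minimal sketch, the appended lines could look like this (the exact logger names for Spark, Akka and Hadoop may vary slightly between versions):

```properties
# Silence Spark, Akka and Hadoop internals; keep errors visible
log4j.logger.org.apache.spark=ERROR
log4j.logger.akka=ERROR
log4j.logger.org.apache.hadoop=ERROR
```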

And here is what my complete log4j.properties file looks like.

Read More →

Posted in BigData, Spark | 1 Comment

Pysolr – Import items to Solr with Python

Pysolr – Python client for Solr Solr is an open source search platform that provides full-text search, highlighting, faceted search and real-time indexing. I love to use it in my own projects (e.g. efimeres.com) where I need to search entities by text, categories or the facets that define them. It's quite easy to set up an initial cluster and start querying. Read More →
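A minimal Python sketch of importing items with pysolr might look like the following (the core URL `http://localhost:8983/solr/items` and the field names are assumptions for illustration, not from the original post):

```python
def make_docs(items):
    """Convert (id, title, category) tuples into Solr document dicts."""
    return [{"id": id_, "title": title, "category": category}
            for id_, title, category in items]

def index_items(items, solr_url="http://localhost:8983/solr/items"):
    """Index the given items into a running Solr core (URL is a placeholder)."""
    import pysolr  # requires: pip install pysolr
    solr = pysolr.Solr(solr_url, always_commit=True, timeout=10)
    solr.add(make_docs(items))  # sends the documents and commits
    return solr
```

Once indexed, a query such as `solr.search("title:spark")` returns the matching documents.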

Posted in Solr, Tech Staff | Leave a comment

Redshift with Spark

The most common way to create a DataFrame in Spark is to read directly from S3 in JSON or Parquet format. However, Redshift does not support Parquet, and we may not want to perform the unload step from Redshift to S3, so I will explain how to read from Redshift with Spark jobs. Before launching your spark-shell or your job, you will need to add the Redshift JDBC driver to your spark-defaults.conf file. Download it and put it inside $SPARK_HOME/lib.
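As an illustration, the classpath entries in spark-defaults.conf might look like this (the path and jar name are placeholders; use the location and driver version you actually downloaded):

```properties
# Put the Redshift JDBC driver on both the driver and executor classpaths
spark.driver.extraClassPath      /hadoop/spark/lib/RedshiftJDBC.jar
spark.executor.extraClassPath    /hadoop/spark/lib/RedshiftJDBC.jar
```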

Redshift with Spark:
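A hedged PySpark sketch of such a read might look like the following (the host, database, and table names are placeholders; `spark` is an active SparkSession and the driver must already be on the classpath):

```python
def redshift_jdbc_url(host, database, port=5439, user=None, password=None):
    """Build a Redshift JDBC URL (5439 is Redshift's default port)."""
    url = f"jdbc:redshift://{host}:{port}/{database}"
    if user and password:
        url += f"?user={user}&password={password}"
    return url

def read_redshift_table(spark, table, url):
    """Read a Redshift table into a DataFrame over plain JDBC."""
    return (spark.read
            .format("jdbc")
            .option("url", url)
            .option("dbtable", table)
            .load())
```

For example, `read_redshift_table(spark, "events", redshift_jdbc_url("myhost", "mydb", user="u", password="p"))` would pull the whole `events` table into a DataFrame.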

Note that this connection can be used with PostgreSQL databases too, since it is also a JDBC connection. Read More →

Posted in Spark | Leave a comment