Tech Stuff, BigData & more


Blog

Analyzing Redshift Spectrum ‘Patronus’

By: Albert Franzi
On: 19th November 2017
In: AWS

Redshift Spectrum ‘Patronus’ This post aims to analyze how Redshift Spectrum works and how we can take advantage of it. I will try to load data from S3, such as Sessions (Parquet) & Raw Data (JSON). First of all, we will follow the Getting started using Spectrum guide. To use Redshift Spectrum, the cluster needs to be at version 1.0.1294 or later. We can validate that by executing select version();

dwh_sch=# select version();
                                                         version
--------------------------------------------------------------------------------------------------------------------------
PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.1499

To keep in mind – Redshift Pricing. Price per Redshift Spectrum query: with Redshift Spectrum, you are billed at $5 per terabyte of data scanned, rounded up to the next megabyte, with a 10 megabyte minimum per query. For example, if you scan 10… Read More →
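As a rough sanity check of that pricing, here is a tiny back-of-the-envelope calculation; the 20 GB scan size below is just an illustrative assumption:

Shell
# $5 per TB scanned: a query scanning 20 GB costs about 20 * $5 / 1024.
echo "scale=4; 20 * 5 / 1024" | bc
# => .0976  (dollars, i.e. roughly ten cents)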


Creating a Redshift Sandbox for our Analysts

By: Albert Franzi
On: 4th April 2017
In: AWS, BigData, Uncategorised

Redshift Sandbox It’s quite common that Analysts and Data Scientists want to analyze and process the latest data from production. Besides, they want and need to create their own schemas and tables to feed their models and dashboards, or just for testing and discovery. That means we need to provide an environment that handles all these needs without littering our production one. A first approach could be providing a Sandbox with sample data; however, in our case that’s not enough. We have to provide a full copy of Production, otherwise they would not be able to fit their needs and studies properly. Providing a full… Read More →
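As a taste of what the post covers, here is a minimal sketch of carving out such a sandbox schema; the cluster endpoint, database and group names are placeholders:

Shell
# Create a dedicated schema and let the analysts group create tables in it,
# keeping production schemas untouched.
psql -h my-cluster.eu-west-1.redshift.amazonaws.com -U admin -d dwh <<'SQL'
CREATE SCHEMA IF NOT EXISTS sandbox;
GRANT USAGE, CREATE ON SCHEMA sandbox TO GROUP analysts;
SQL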


Learning Session – AWS Athena

By: Albert Franzi
On: 22nd March 2017
In: AWS, BigData

AWS – Athena What is Athena? Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service using standard SQL. With a few actions in the AWS Management Console, customers can point Athena at their data stored in Amazon S3 and begin using standard SQL to run ad-hoc queries and get results in seconds. Athena is serverless, so there is no infrastructure to set up or manage, and customers pay only for the queries they run. Athena scales automatically, executing queries in parallel, so results are fast, even with large datasets and complex queries. When should I use… Read More →
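To give a flavour of those ad-hoc queries, a hedged example using the AWS CLI; the database, table and bucket names are made up:

Shell
# Run standard SQL against data in S3; Athena writes results to the given bucket.
aws athena start-query-execution \
  --query-string "SELECT count(*) FROM clickstream.events WHERE dt = '2017-03-01'" \
  --result-configuration OutputLocation=s3://my-athena-results/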


Checking your Redshift users

By: Albert Franzi
On: 24th November 2016
In: AWS, BigData

Redshift usage Are your users experiencing high latencies in their queries? Are production jobs getting stuck and taking longer to execute than usual? Ours are. And they even have the guts to complain!!  😮 Our working hours in Barcelona run, more or less, from 9:00 to 19:00. That means that Product Analysts, Data Scientists & Data Engineers are launching queries, testing, discovering & digging into our data over that period. As a consequence, they are overloading the Redshift queue, raising the execution time. Sometimes they launch a bunch of queries or write queries without LIMIT X. Even INSERT queries ONE… Read More →
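A quick, hedged way to see who is overloading the cluster right now; the endpoint and credentials are placeholders, and stv_recents reports duration in microseconds:

Shell
# List currently running queries, longest first.
psql -h my-cluster.eu-west-1.redshift.amazonaws.com -U admin -d dwh <<'SQL'
SELECT user_name, duration / 1000000 AS seconds, substring(query, 1, 60) AS query_start
FROM stv_recents
WHERE status = 'Running'
ORDER BY duration DESC;
SQL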

Tableau fed by Presto with S3 Parquets

By: Albert Franzi
On: 5th October 2016
In: AWS, BigData

Tableau + Presto & S3 Parquets Clickstream data is increasing fast, day by day, as more sites & features are added. Our team provides new aggregated entities on top of that, meaning more data is produced. From Clickstream events we are able to produce Sessions, Visitors & Page Views objects that are stored in S3 in Parquet format. In order to stay up to date with new technologies, we wanted to do a PoC with Presto, since some companies like Airbnb are using it instead of Redshift, our current solution. In this post, I’m going to explain my experience with Presto. First of all, we need… Read More →
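As a flavour of the PoC, querying those Parquet objects through the Presto CLI could look roughly like this; the server, catalog, schema and table names are assumptions:

Shell
# Query the Parquet-backed sessions data through a Hive catalog.
./presto-cli --server presto-coordinator:8080 --catalog hive --schema default \
  --execute "SELECT count(*) FROM sessions WHERE dt = '2016-10-01'"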


Uploading documentation with the Confluence REST API

By: Albert Franzi
On: 27th July 2016
In: Project Management, Tech Staff

Confluence REST API In this post I will talk about how to upload documentation using the Confluence REST API. Projects are in constant growth and documentation in constant deprecation. To avoid that, sometimes it’s necessary to automate documentation procedures so the documentation reflects the latest changes. In our case, we want to publish which data entities our Data Warehouse provides, along with basic properties that clients will be grateful to know, like distribution, data types or column purpose descriptions. Since we are using the Okta integration, we have to ask for a Confluence Service User with Creation rights in our desired space. We don’t want Delete rights, to avoid problems removing wrong pages. Once… Read More →
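A minimal sketch of what such an upload can look like against Confluence’s content endpoint; the base URL, space key, page title and credentials are placeholders:

Shell
# Create a page in the target space with generated documentation as its body.
curl -u service.user:secret -X POST \
  -H 'Content-Type: application/json' \
  https://confluence.example.com/rest/api/content \
  -d '{"type": "page",
       "title": "Data Warehouse Entities",
       "space": {"key": "DWH"},
       "body": {"storage": {"value": "<p>Generated docs</p>",
                            "representation": "storage"}}}'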


Tableau – Web Connector ElasticSearch

By: Albert Franzi
On: 12th June 2016
In: AWS, BigData, Tech Staff

Tableau – Web Connector Elasticsearch Recently we managed to add a WebConnector to connect Tableau to our AWS Elasticsearch instances. Tableau provides a Web Data Connector SDK to play with & test different connectors from the community. WebConnector: A Tableau web data connector gives you a way to connect to data that doesn’t already have a connector. Using a web data connector, you can create and use a connection to almost any data that is accessible over HTTP. This can include internal web services, JSON data, XML data, REST APIs, and many other sources. Because you control how the data is fetched, you can even combine data from multiple… Read More →


Taming your project management tools

By: Albert Franzi
On: 22nd May 2016
In: Project Management, Tech Staff

Recently I asked myself how to improve the way I manage my projects & daily tasks. Currently I’m using JIRA to manage our Sprints, issues, features and ideas to evolve our projects. Besides, I’m used to writing my own tasks, notes and crazy occurrences with pen and paper. I have my own notebook where I draw everything, but it became too unstructured. So I moved to two notebooks, one for keeping the actions I have to take and meeting minutes, and another one for drawing algorithms and architecture ideas. I prefer to capture my fleeting ideas on paper; it’s easier & more flexible to get them out than drawing boxes and… Read More →


MapReduce – WordCount with Maven & S3

By: Albert Franzi
On: 1st April 2016
In: BigData, MapReduce

WordCount with Maven & S3 In this post I will explain how to start a basic WordCount example, using Maven to generate our jar and S3 for input/output. If you use IntelliJ you can start a new Maven project. In a basic project that has no external dependencies, we are not required to use the jar-with-dependencies option that Maven provides.

pom.xml
XML
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
 
    <groupId>com.efimeres</groupId>
    <artifactId>pocs</artifactId>
    <name>Staff</name>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <hadoop.version>2.7.1</hadoop.version>
        <!-- Maven Plugins Available https://maven.apache.org/plugins/ -->
        <maven-compiler-plugin.version>3.5.1</maven-compiler-plugin.version>
        <maven-jar-plugin.version>2.6</maven-jar-plugin.version>
    </properties>
 
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>${hadoop.version}</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>
 
    <build>
        <finalName>WordCount</finalName>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>${maven-compiler-plugin.version}</version>
                <configuration>
                    <!-- JAVA 7 -->
                    <source>1.7</source>
                    <target>1.7</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <version>${maven-jar-plugin.version}</version>
                <configuration>
                    <archive>
                        <manifest>
                            <!-- Entry Point -->
                            <mainClass>com.efimeres.wordcount.WordCount</mainClass>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

finalName – indicates the name of the generated jar. mainClass – tells Hadoop which class is the entry point; in this case it will be com.efimeres.wordcount.WordCount. Read More →
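Putting it together, building and launching the job could look like this; the bucket paths are hypothetical, and the S3 scheme (s3n:// vs s3a://) depends on your Hadoop setup:

Shell
# Build WordCount.jar, then run it with S3 input/output paths. The jar's
# manifest already names the main class, so we don't pass it on the command line.
mvn clean package
hadoop jar target/WordCount.jar s3n://my-bucket/input/ s3n://my-bucket/output/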

Setup Spark Standalone

By: Albert Franzi
On: 17th March 2016
In: BigData, Spark

Spark Standalone I will explain how to properly set up Spark in Standalone mode. Sometimes we want to develop PoCs or try small changes, and we don’t want to try them in production or wait until they are released. For those cases, and for learning purposes, I prefer to try Spark stuff on my laptop. First of all, download the latest/desired stable version from Spark – Downloads. Here I will use the pre-built version for Hadoop. Once we download it, we can proceed to set up the environment.

Setup Spark Commands
Shell
#- Create a Hadoop Folder in your $HOME to group all tools that you setup.
mkdir $HOME/hadoop
 
#- Copy from Download folder to Hadoop folder
cp $HOME/Downloads/spark-1.6.1-bin-hadoop2.6.tgz $HOME/hadoop
 
#- Uncompress tgz
tar -zxf $HOME/hadoop/spark-1.6.1-bin-hadoop2.6.tgz -C $HOME/hadoop
 
#- Create a link pointing to the current Spark folder, so we don't need to know which version is running to execute commands.
ln -s $HOME/hadoop/spark-1.6.1-bin-hadoop2.6/ $HOME/hadoop/spark
 
#- Configure logs
cp $HOME/hadoop/spark/conf/log4j.properties.template $HOME/hadoop/spark/conf/log4j.properties
 
#- Configure properties, by default spark-defaults.conf comes empty.
cat <<EOT >> $HOME/hadoop/spark/conf/spark-defaults.conf
spark.master            spark://${HOSTNAME}:7077
spark.eventLog.enabled  true
spark.eventLog.dir      $HOME/hadoop/spark/logs/events
EOT
 
#- Create logs folder
mkdir -p $HOME/hadoop/spark/logs/events

Read More →
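With the configuration in place, starting a local master and worker is roughly this (a sketch using Spark 1.6’s standalone scripts and the folder layout above):

Shell
# Start a standalone master, then attach a worker to it.
$HOME/hadoop/spark/sbin/start-master.sh
$HOME/hadoop/spark/sbin/start-slave.sh spark://${HOSTNAME}:7077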


© 2015 - 2019 Efimeres