BigData - Hadoop



Tech Staff

Latest articles

Analyzing Redshift Spectrum ‘Patronus’

Redshift Spectrum ‘Patronus’ This post aims to analyze how Redshift Spectrum works and how we can take advantage of it. I will try to load data from S3, such as Sessions (Parquet) & Raw Data (JSON). First of all, we will follow the Getting Started Using Spectrum guide. To use Redshift Spectrum, the cluster needs to be at version 1.0.1294 or later, which we can verify by executing select version();

Keep Redshift Spectrum pricing in mind: you are billed at $5 per terabyte of data scanned, rounded up to the next megabyte, with a 10 megabyte minimum per query. For example, if you scan 10… Read More →
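The pricing rule above can be sketched as a back-of-the-envelope calculator. The function name and constants are mine, not part of any AWS SDK; it only encodes the billing rule quoted in the post.

```python
from math import ceil

PRICE_PER_TB = 5.0          # $5 per terabyte scanned
MB_PER_TB = 1024 * 1024     # megabytes in a terabyte
MIN_MB = 10                 # 10 MB minimum billed per query

def spectrum_query_cost(bytes_scanned: int) -> float:
    """Estimated Redshift Spectrum cost in dollars for a single query."""
    # Round up to the next megabyte, then apply the 10 MB minimum.
    mb = max(ceil(bytes_scanned / (1024 * 1024)), MIN_MB)
    return mb * PRICE_PER_TB / MB_PER_TB

# Scanning a full terabyte costs exactly $5.
print(round(spectrum_query_cost(1024 ** 4), 2))  # 5.0
```

Note that the minimum matters for small, well-partitioned queries: scanning a single byte is billed the same as scanning 10 MB.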

Posted in AWS

Creating a Redshift Sandbox for our Analysts

Redshift Sandbox It’s quite common that Analysts and Data Scientists want to analyze and process the latest data from production. Besides, they want and need to create their own schemas and tables to feed their models and dashboards, and also just for testing and exploring. That means we need to provide an environment that handles all these needs without cluttering our production one. The first approach could be providing a Sandbox with sample data; however, in our case that’s not enough: we have to provide a full copy of Production, otherwise they would not be able to fit their needs and studies properly. Providing a full… Read More →

Posted in AWS, BigData, Uncategorised

Learning Session – AWS Athena

AWS – Athena What is Athena? Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service using standard SQL. With a few actions in the AWS Management Console, customers can point Athena at their data stored in Amazon S3 and begin using standard SQL to run ad-hoc queries and get results in seconds. Athena is serverless, so there is no infrastructure to set up or manage, and customers pay only for the queries they run. Athena scales automatically—executing queries in parallel—so results are fast, even with large datasets and complex queries. When should I use… Read More →

Posted in AWS, BigData

Checking your Redshift users

Redshift usage Are your users experiencing high latencies in their queries? Are production jobs getting stuck and taking longer to execute than usual? Our users & jobs are. And they even have the guts to complain!!  😮 Our working hours in Barcelona run, more or less, from 9:00 to 19:00. That means Product Analysts, Data Scientists & Data Engineers are launching queries, testing, discovering & digging into our data over that period. As a consequence, they are overloading the Redshift queue and raising execution times. Sometimes they launch a bunch of queries, or write queries without LIMIT X. Even INSERT queries ONE… Read More →
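The kind of check the post describes can be sketched in a few lines. The rows below are made-up examples of what Redshift's stv_recents system table returns for in-flight queries; the flagging rules (runtime threshold, missing LIMIT) are mine, chosen to match the symptoms described above.

```python
from datetime import datetime, timedelta

# Hypothetical rows shaped like stv_recents output:
# (user_name, status, start time, query text).
running = [
    ("analyst1", "Running", datetime(2018, 5, 4, 10, 0), "select * from sessions"),
    ("etl_user", "Running", datetime(2018, 5, 4, 10, 28), "insert into visitors select ..."),
]

def suspicious(rows, now, max_minutes=15):
    """Flag users whose queries run too long or SELECT without a LIMIT."""
    flagged = []
    for user, status, started, sql in rows:
        too_long = now - started > timedelta(minutes=max_minutes)
        no_limit = sql.lstrip().lower().startswith("select") and "limit" not in sql.lower()
        if status == "Running" and (too_long or no_limit):
            flagged.append(user)
    return flagged

print(suspicious(running, datetime(2018, 5, 4, 10, 30)))  # ['analyst1']
```

In practice the rows would come from querying stv_recents over a database connection; the point here is only the filtering logic.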

Posted in AWS, BigData

Tableau fed by Presto with S3 Parquets

Tableau + Presto & S3 Parquets Clickstream data is growing fast, day by day, as more sites & features are added. Our team provides new aggregated entities on top of that, meaning more data is produced. From Clickstream events we are able to produce Sessions, Visitors & Page Views objects, which are stored in S3 in Parquet format. In order to stay up to date with new technologies, we wanted to do a PoC with Presto, since some companies like Airbnb are using it instead of Redshift, our current solution. In this post, I’m going to explain my experience with Presto. First of all, we need… Read More →

Posted in AWS, BigData

Uploading documentation with the Confluence REST API

Confluence REST API In this post I will talk about how to upload documentation using the Confluence REST API. Projects are in constant growth and documentation in constant deprecation. To avoid that, it’s sometimes necessary to automate documentation procedures so the documentation reflects the latest changes. In our case we want to publish which data entities our Data Warehouse provides, along with basic properties that clients will be grateful to know, like distribution, data types or column purpose descriptions. Since we are using Okta integration, we have to ask for a Confluence Service User with Create rights in our desired space. We don’t want Delete rights, to avoid problems from removing wrong pages. Once… Read More →
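Creating a page through that API boils down to one POST with a JSON body. A minimal sketch, assuming the standard Confluence `POST /rest/api/content` endpoint with the "storage" (XHTML) body representation; the space key, title, body, and base URL are made-up examples.

```python
def build_page_payload(space_key: str, title: str, html_body: str) -> dict:
    """JSON body for POST /rest/api/content (Confluence 'storage' representation)."""
    return {
        "type": "page",
        "title": title,
        "space": {"key": space_key},
        "body": {"storage": {"value": html_body, "representation": "storage"}},
    }

# Example payload for a Data Warehouse entities page.
payload = build_page_payload(
    "DWH", "Data Warehouse entities", "<p>Sessions, Visitors, Page Views</p>"
)

# With the Service User's credentials it would be posted along these lines:
#   requests.post("https://confluence.example.com/rest/api/content",
#                 json=payload, auth=("svc-docs-user", password))
```

Updating an existing page uses `PUT /rest/api/content/{id}` and additionally requires bumping the page's version number in the payload.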

Posted in Project Management, Tech Staff

Tableau – Web Connector for Elasticsearch

Tableau – Web Connector for Elasticsearch Recently we managed to add a Web Data Connector to connect Tableau to our AWS Elasticsearch instances. Tableau provides a Web Data Connector SDK to play with & test different connectors from the community. WebConnector A Tableau web data connector gives you a way to connect to data that doesn’t already have a connector. Using a web data connector, you can create and use a connection to almost any data that is accessible over HTTP. This can include internal web services, JSON data, XML data, REST APIs, and many other sources. Because you control how the data is fetched, you can even combine data from multiple… Read More →

Posted in AWS, BigData, Tech Staff

Taming your project management tools

Recently I asked myself how to improve the way I manage my projects & daily tasks. Currently I’m using JIRA to manage our Sprints, issues, features and ideas for evolving our projects. Besides, I’m used to writing my own tasks, notes and stray ideas down with pen and paper. I had my own notebook where I drew everything, but it became too unstructured. So I moved to two notebooks: one for keeping my action items and meeting minutes, and another for sketching algorithms and architecture ideas. I prefer to capture my fleeting ideas on paper; it’s easier & more flexible to get them out than drawing boxes and… Read More →

Posted in Project Management, Tech Staff

MapReduce – WordCount with Maven & S3

WordCount with Maven & S3 In this post I will explain how to start a basic WordCount example, using Maven to generate our jar and S3 for input/output. If you use IntelliJ you can start a new Maven project. In a basic project that has no external dependencies, there is no need for the jar-with-dependencies option that Maven provides.

finalName – indicates the name of the generated jar. mainClass – tells Hadoop which class is the main one; in this case it will be com.efimeres.wordcount.WordCount. … Read More →
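The two elements mentioned above live in the pom.xml roughly like this. This is a sketch (plugin version omitted) using the standard maven-jar-plugin manifest entry; the mainClass is the one named in the post.

```xml
<build>
  <!-- finalName: name of the generated jar (wordcount.jar). -->
  <finalName>wordcount</finalName>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-jar-plugin</artifactId>
      <configuration>
        <archive>
          <manifest>
            <!-- mainClass: entry point Hadoop will invoke. -->
            <mainClass>com.efimeres.wordcount.WordCount</mainClass>
          </manifest>
        </archive>
      </configuration>
    </plugin>
  </plugins>
</build>
```

With that in place, `mvn package` produces target/wordcount.jar, which can be submitted with `hadoop jar wordcount.jar <input> <output>`.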

Posted in BigData, MapReduce

Setup Spark Standalone

Spark Standalone I will explain how to properly set up Spark in Standalone mode. Sometimes we want to develop PoCs or try small changes, and we don’t want to test them in production or wait until they are released. For those cases, and for learning purposes, I prefer to try Spark stuff on my laptop. First of all, download the latest/desired stable version from Spark – Downloads. Here I will use the pre-built version for Hadoop. Once we download it, we can proceed to set up the environment.
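The setup steps amount to unpacking the tarball and using the scripts Spark ships under sbin/. A sketch, assuming a pre-built-for-Hadoop download (the version number and hostname below are examples; substitute the ones you fetched and the master URL your web UI reports):

```shell
# Unpack the pre-built distribution downloaded from Spark – Downloads.
tar xzf spark-2.4.0-bin-hadoop2.7.tgz
cd spark-2.4.0-bin-hadoop2.7

# Start a master; its web UI comes up on port 8080 and shows the spark:// URL.
./sbin/start-master.sh

# Attach a local worker to that master.
./sbin/start-slave.sh spark://localhost:7077

# Verify by opening an interactive shell against the standalone master.
./bin/spark-shell --master spark://localhost:7077
```

Both processes log under the distribution's logs/ directory, which is the first place to look if the worker does not show up in the master's UI.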

Read More →

Posted in BigData, Spark