Learn all the joins — inner, outer, cross, and semi joins with DataFrames.jl

What is a join? Why would we do it? And how would we do it using DataFrames.jl? In this post, I’ll show some simple, practical examples of how to join DataFrames.


Simple Joins

Last time, we figured out how to index, sort, and aggregate our data using DataFrames.jl. Joining is another very common and important operation in the world of tabular data. A join across two DataFrames is the action of combining the two datasets based on shared column values that exist across the two tables. We call this column (or columns) the key. So, each record from the…
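
To make the idea of a key concrete, here is a minimal sketch using two made-up tables (the data and column names are illustrative, not from the article). The key is the :id column shared by both:

```julia
using DataFrames

# Two small tables sharing the :id key column
people = DataFrame(id = [1, 2, 3], name = ["Ann", "Ben", "Cleo"])
salaries = DataFrame(id = [2, 3, 4], salary = [50_000, 62_000, 48_000])

# Inner join: keep only rows whose :id appears in both tables
innerjoin(people, salaries, on = :id)   # 2 rows: ids 2 and 3

# Left join: keep every row of `people`; unmatched salaries become `missing`
leftjoin(people, salaries, on = :id)
```

Swapping innerjoin for outerjoin, semijoin, or crossjoin changes which records survive the match, but the on = :id mechanics stay the same.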

Tutorial for common data analytics using DataFrames.jl


Diving deeper into DataFrames.jl, we’ll explore how to do boolean indexing on DataFrames, learn how to sort our data by column values, and aggregate the tables to our hearts’ content. In the final section, we’ll also introduce a super-powerful analytics method called split-apply-combine.
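
The three operations above can be sketched in a few lines (toy data for illustration):

```julia
using DataFrames, Statistics

df = DataFrame(class = ["a", "a", "b", "b"], x = [1, 4, 2, 8])

# Boolean indexing: keep rows where x is greater than 1
df[df.x .> 1, :]

# Sorting by a column, largest first
sort(df, :x, rev = true)

# Split-apply-combine: split by :class, apply mean to :x, combine the results
combine(groupby(df, :class), :x => mean => :mean_x)
```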

If you need a refresher on DataFrames.jl check out these articles first:

Getting some data

First, we need to pick a dataset from the RDatasets package. This will save us the trouble of downloading and reading in a file. If you want to know how to read CSVs, check out my earlier post on CSV.jl and data importing — link…
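
Grabbing a dataset from RDatasets takes one call — for example, the classic iris data:

```julia
using RDatasets

# Load the iris dataset shipped with RDatasets
iris = dataset("datasets", "iris")

first(iris, 5)   # peek at the first five rows
```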

Poke at your data with DataFrames.jl


Let’s explore some of the basic functionalities of DataFrames.jl in Julia. If you’ve had some experience with R’s DataFrames or Python’s Pandas then this should be smooth sailing for you. If you have no previous dataframes experience, don’t worry, this is the most basic intro you can imagine! 🌈

Looking for something more advanced? Check out my other articles on Julia:

Custom formatting for currencies, booleans, etc.


Previously, I’ve shown how to read basic delimited files — that is, files where values are separated by common characters such as commas, semicolons, or tabs. Now it’s time to up our game and handle some more exotic edge cases using CSV.jl.

We’ll focus on understanding how we can parse data types correctly so that our DataFrames are as clean as possible from the start.
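
As a minimal sketch of what "parsing types correctly" looks like (the column names and data here are made up for illustration), CSV.jl lets you declare the delimiter, date format, and column types up front:

```julia
using CSV, DataFrames, Dates

csv_data = """
name;joined;active
Alice;2020-01-15;true
Bob;2021-07-03;false
"""

# Declare the delimiter, date format and column types up front,
# so the resulting DataFrame is clean from the start
df = CSV.read(IOBuffer(csv_data), DataFrame;
              delim = ';',
              dateformat = "yyyy-mm-dd",
              types = Dict(:joined => Date, :active => Bool))
```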

This is part 2 of the Reading CSV with Julia articles, so if you’re new here, check out part 1:

As before, we start by importing packages and simulating some dummy data. …

Learn how to use CSV.jl to read all kinds of comma-separated files


Have you ever received a .csv file with semicolons (;) as separators? Or a file without headers? Or maybe you have some colleagues in Europe who use , instead of . to indicate decimals? Oh, the joys of working with CSV files…

Continue reading to learn how you can read in a variety of delimiter separated file formats in Julia using CSV.jl
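
All three of the headaches above — semicolon separators, missing headers, and decimal commas — map onto keyword arguments of CSV.jl. A small sketch with invented data:

```julia
using CSV, DataFrames

# A European-style file: semicolon separators, comma decimals, no header row
raw = IOBuffer("1,5;2,5\n3,0;4,25\n")

df = CSV.read(raw, DataFrame;
              delim = ';',         # semicolon-separated values
              decimal = ',',       # comma marks the decimal point
              header = ["x", "y"]) # supply column names ourselves
```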

Generating Data

We will generate all the examples ourselves, so you can easily download the code and play around with the results in your own environment. Let’s get started! First of all, we need to load the packages that we…

Control flow basics with Julia


Let’s continue our exploration of Julia basics. Previously I talked about for loops and vectorization. Here, we will talk about how to use control flow operators inside Julia.

What are control flow operators?

As the name suggests, control flow operators help us shape the flow of a program: you can return from a function, break out of a loop, or skip an iteration of a loop with continue.
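
Here is a minimal sketch of all three in action (toy functions, not from the article):

```julia
# `continue` and `return`: find the first even number in a collection
function first_even(xs)
    for x in xs
        x % 2 != 0 && continue   # skip odd numbers
        return x                 # leave the function at the first even one
    end
    return nothing               # no even number found
end

# `break`: count elements until the first negative one
function count_until_negative(xs)
    n = 0
    for x in xs
        x < 0 && break   # stop the loop entirely
        n += 1
    end
    return n
end

first_even([1, 3, 4, 5])             # 4
count_until_negative([2, 7, -1, 9])  # 2
```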

A simple task

To understand these concepts, we’ll attempt to solve a problem. Nothing better than some hands-on experience, right? Our challenge is as follows:

Given 2 integers (a, b) print the smallest (up to) 5 integers between…

Say goodbye to for loops and broadcast all the things

Do you ever feel like for loops are taking over your life and there’s no escape from them? Do you feel trapped by all those loops? Well, fear not! There’s a way out! I’ll show you how to do the FizzBuzz challenge without any for loops at all.


The task of FizzBuzz is to print every number up to 100, but replace numbers divisible by 3 with “Fizz”, numbers divisible by 5 with “Buzz”, and numbers divisible by both 3 and 5 with “FizzBuzz”.
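
As a taste of the loop-free approach, the whole task can be sketched with Julia’s dot-broadcasting, which applies a function to every element of the range 1:100:

```julia
# Map each number to its FizzBuzz label
fizzbuzz(n) = n % 15 == 0 ? "FizzBuzz" :
              n % 3  == 0 ? "Fizz"     :
              n % 5  == 0 ? "Buzz"     : string(n)

# The dots broadcast fizzbuzz and println over the whole range — no for loop
println.(fizzbuzz.(1:100));
```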

Solving FizzBuzz with for…


How to do functions, for loops and conditionals — using FizzBuzz

In this post, we’ll create a function to solve the ever-popular FizzBuzz programming challenge.

By the end of this article, you will know:

  • How to create a function in Julia
  • How to do a for loop
  • How to create if-else blocks
  • What 1:5 means
  • How to calculate the remainder of a number when divided
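
Everything on that list fits into one small function — a sketch of where we are heading:

```julia
# A function definition, a for loop over a range, if-else blocks,
# and the remainder operator %
function fizzbuzz(n)
    for i in 1:n              # 1:n is a range: the integers 1, 2, …, n
        if i % 15 == 0        # remainder after dividing by 15 is zero
            println("FizzBuzz")
        elseif i % 3 == 0
            println("Fizz")
        elseif i % 5 == 0
            println("Buzz")
        else
            println(i)
        end
    end
end

fizzbuzz(15)   # prints 1, 2, Fizz, 4, Buzz, …, FizzBuzz
```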

This post is meant for beginner programmers or for those who have never heard of Julia before. Don’t expect to run earth-shattering, massively parallel scientific workloads after reading this. Consider this your humble beginnings in the awesome world of Julia.

What is FizzBuzz?

If you’ve never heard of…

Learn how to use BigQuery scripting to calculate Fibonacci numbers with for loops

Ever wanted to loop around stuff in SQL? Well, you can with scripting. Let’s see how we can calculate Fibonacci numbers in BigQuery with loops.


I previously showed how to do Fibonacci in BigQuery with JavaScript UDFs (user-defined functions) and also talked about arrays in BigQuery so if you’re new to BigQuery, check those out first.

Basics of Scripting

Before we begin calculating Fibonacci numbers, let’s talk about the building blocks of SQL scripting. …
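
To give a flavour of those building blocks, here is a minimal BigQuery scripting sketch (variable names are my own, not from the article): DECLARE introduces variables, SET assigns them, and WHILE … END WHILE loops — enough to build up Fibonacci numbers:

```sql
-- Collect the first ten Fibonacci numbers into an array
DECLARE a INT64 DEFAULT 0;
DECLARE b INT64 DEFAULT 1;
DECLARE i INT64 DEFAULT 0;
DECLARE fib ARRAY<INT64> DEFAULT [];

WHILE i < 10 DO
  SET fib = ARRAY_CONCAT(fib, [a]);
  SET (a, b) = (b, a + b);  -- step the sequence forward
  SET i = i + 1;
END WHILE;

SELECT fib;  -- [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```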

Benchmarking CSV, GZIP, AVRO and PARQUET file types for ingestion

Google Cloud Platform’s BigQuery is a managed large-scale data warehouse for analytics. It supports JSON, CSV, PARQUET, ORC and AVRO file formats for importing tables. Each of these file types has its pros and cons, and I already talked about why I prefer PARQUET for Data Science workflows here. But one question remains:

Which file extension gives us the quickest load times into BigQuery?


The Experiment

GCP recommends AVRO for fast ingestion times due to the way it compresses data, but who believes documentation these days? I want to see for myself which file format will come out on top.


Bence Komarniczky

Lead data scientist building machine learning products with an awesome team. Follow me for tutorials on data science, machine learning and cloud computing.
