What is a join? Why would we do it? And how would we do it using DataFrames.jl? In this post, I’ll show some practical but simple examples on how to join DataFrames.
Last time, we figured out how to index, sort, and aggregate our data using DataFrames.jl. Joins are another very common and important operation in the world of tabular data. A join across two DataFrames combines the two datasets based on column values that the two tables share. We call this column (or these columns) the key. So, each record from the…
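To make the idea of a key concrete, here is a minimal sketch of an inner join. The two toy tables and their column names are made up for illustration, and the post itself may use a different dataset or join type. It assumes the DataFrames.jl package is installed.

```julia
using DataFrames

# Hypothetical toy tables; names and values are invented for this sketch.
people = DataFrame(id = [1, 2, 3], name = ["Ann", "Bob", "Cid"])
orders = DataFrame(id = [1, 1, 3], amount = [10, 25, 7])

# Inner join on the key column `id`: only keys present in both tables survive,
# so the person with id 2 (who has no orders) is dropped.
joined = innerjoin(people, orders, on = :id)

println(joined)
```

Because id 1 appears twice in `orders`, that person shows up in two rows of the result: a join matches every pair of records that share the key.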
Diving deeper into DataFrames.jl, we’ll explore how to do boolean indexing on DataFrames, learn how to sort our data by column values, and aggregate the tables to our hearts’ content. In the final section, we’ll also introduce a super-powerful analytics method called split-apply-combine.
If you need a refresher on DataFrames.jl check out these articles first:
First, we need to pick a dataset from the RDatasets package. This will save us the trouble of downloading and reading in a file. If you want to know how to read CSVs, check out my earlier post on CSV.jl and data importing: link…
Let’s explore some of the basic functionalities of DataFrames.jl in Julia. If you’ve had some experience with R’s DataFrames or Python’s Pandas then this should be smooth sailing for you. If you have no previous dataframes experience, don’t worry, this is the most basic intro you can imagine! 🌈
Looking for something more advanced? Check out my other articles on Julia:
Previously, I’ve shown how to read basic delimited files, that is, files where values are separated by common characters such as commas, semicolons, or tabs. Now it’s time to up our game and handle some more exotic edge cases using
We’ll focus on understanding how we can parse data types correctly so that our DataFrames are as clean as possible from the start.
This is part 2 of the Reading CSV with Julia articles, so if you’re new here, check out part 1:
As before, we start by importing packages and simulating some dummy data. …
Have you ever received a .csv file with semicolons (;) as separators? Or a file without headers? Or maybe you have some colleagues in Europe who use “,” instead of “.” to indicate decimals? Oh, the joys of working with CSV files…
Continue reading to learn how you can read in a variety of delimiter separated file formats in Julia using
We will generate all the examples ourselves, so you can easily download the code and play around with the results in your own environment. Let’s get started! First of all, we need to load the packages that we…
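As a rough sketch of the cases mentioned above, the snippet below reads semicolon-separated data that uses a decimal comma. It assumes the CSV.jl and DataFrames.jl packages; the sample data is invented for illustration, and the post's own examples may differ.

```julia
using CSV, DataFrames

# Hypothetical sample data: semicolons as separators, commas as decimal marks.
data = """
name;price
apple;1,5
pear;2,25
"""

# `delim` sets the column separator, `decimal` the decimal mark.
# For a file without a header row, `header = false` could be passed as well.
df = CSV.read(IOBuffer(data), DataFrame; delim = ';', decimal = ',')

println(df)
```

Reading from an `IOBuffer` keeps the example self-contained; a file path can be passed in its place.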
As the name suggests, control flow operators help us shape the flow of the program. You can return from a function, break out of a loop, or skip an iteration of a loop with continue.
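A small sketch of the three keywords working together. The helper `first_evens` is a hypothetical function made up for this illustration; it is not the challenge from the post.

```julia
# Hypothetical helper: collect the first `limit` even numbers found in `xs`.
function first_evens(xs, limit)
    found = Int[]
    for x in xs
        if isodd(x)
            continue              # skip odd numbers and move to the next iteration
        end
        push!(found, x)
        if length(found) == limit
            break                 # enough numbers collected, leave the loop early
        end
    end
    return found                  # hand the result back to the caller
end

println(first_evens(1:20, 3))
```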
To understand these concepts, we’ll attempt to solve a problem. Nothing better than some hands-on experience, right? Our challenge is as follows:
Given 2 integers (a, b) print the smallest (up to) 5 integers between…
Do you ever feel like for loops are taking over your life and there’s no escape from them? Do you feel trapped by all those loops? Well, fear not! There’s a way out! I’ll show you how to do the FizzBuzz challenge without any for loops at all.
The task of FizzBuzz is to print every number up to 100, but replace numbers divisible by 3 with “Fizz”, numbers divisible by 5 with “Buzz”, and numbers divisible by both 3 and 5 with “FizzBuzz”.
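The rules above can be sketched without writing an explicit for loop by using Julia's broadcasting (dot) syntax. This is one possible approach and not necessarily the one the post takes; the function name `fizzbuzz` is chosen here for illustration.

```julia
# Return the FizzBuzz label for a single number.
function fizzbuzz(n::Integer)
    if n % 15 == 0          # divisible by both 3 and 5
        return "FizzBuzz"
    elseif n % 3 == 0
        return "Fizz"
    elseif n % 5 == 0
        return "Buzz"
    else
        return string(n)
    end
end

# The dot applies `fizzbuzz` to every element of the range.
labels = fizzbuzz.(1:100)

println(join(labels, "\n"))
```

The divisibility-by-15 check comes first because those numbers would otherwise be caught by the “Fizz” branch.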
Solving FizzBuzz with for…
In this post, we’ll create a function to solve the overly popular FizzBuzz programming challenge.
By the end of this article, you will know:
This post is meant for beginner programmers or for those who have never heard of Julia before. Don’t expect to do earth-shattering, massively parallel scientific workloads after reading this. Consider this your humble beginning in the awesome world of Julia.
If you’ve never heard of…
Ever wanted to loop around stuff in SQL? Well, you can with scripting. Let’s see how we can calculate Fibonacci numbers in BigQuery with loops.
Before we begin calculating Fibonacci numbers, let’s talk about the building blocks of SQL scripting. …
Google Cloud Platform’s BigQuery is a managed, large-scale data warehouse for analytics. It supports the JSON, CSV, Parquet, ORC, and Avro file formats for importing tables. Each of these file types has its pros and cons, and I already talked about why I prefer Parquet for data science workflows here. But one question remains:
Which file extension gives us the quickest load times into BigQuery?
GCP recommends AVRO for fast ingestion times due to the way it compresses data, but who believes documentation these days? I want to see for myself which file format will come out on top.
Lead data scientist building machine learning products with an awesome team. Follow me for tutorials on data science, machine learning and cloud computing.