Simple ETL Using Luigi

Photo by Florian Wächter on Unsplash

What is Luigi?

Luigi is a Python (2.7, 3.6, 3.7 tested) package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.

Why Luigi?

  • It is built for complex pipelines of batch jobs.
  • Dependency resolution, workflow management, visualization, failure handling, and command-line integration come built in.
  • It is written in Python, so pipelines are ordinary Python code.
  • It is a widely used open-source tool for managing ETL workflows.

Installation

Luigi is a Python module. To install it, you just need to type:

pip install luigi

Building Blocks

There are two fundamental building blocks of Luigi:

  • The Target class corresponds to a file on disk, a file on HDFS, or some kind of checkpoint, such as an entry in a database. The only method a Target has to implement is exists(), which returns True if and only if the Target exists.
  • The Task class is conceptually more interesting, because this is where the computation is done. You can implement a few methods to alter its behavior, most notably run(), output(), and requires().

Task’s methods

  • requires() declares the task's dependencies and is the first method Luigi consults, if it is defined. The task(s) it returns must be completed before this task can run.
  • output() returns the Target(s) the task produces, and input() returns the Targets produced by the tasks listed in requires(). In other words, a task consumes Targets created by other tasks and usually produces Targets of its own.
  • run() contains the actions you want to execute.

Practice Case

Let’s write an ETL story.

Imagine you want to create a report. The data comes from websites, so you need to extract it before you can build the report. Here is the workflow diagram:

Luigi workflow diagram example

In this example, the steps to generate the report are:

  • Take the websites as input and extract their data by scraping them
  • Save the scraped data to a CSV file
  • Use that CSV file to build the report and save it as an Excel file

Code

A code example for the story above would be:

import luigi
import pandas as pd

class ScrapeData(luigi.Task):
    def output(self):
        return luigi.LocalTarget('scrape_table.csv')

    def run(self):
        # write a scraping script to extract data from websites
        # and save the result to self.output().path
        raise NotImplementedError('add the scraping logic here')

class GenerateReport(luigi.Task):
    def requires(self):
        return ScrapeData()

    def output(self):
        return luigi.LocalTarget('report.xlsx')

    def run(self):
        # read extracted data in csv
        data = pd.read_csv(self.input().path)
        # do some data manipulation and save it as an Excel file (needs openpyxl)
        data.to_excel(self.output().path, index=False)

Rio Anggara Sufilin
