Rio Anggara Sufilin

Feb 20, 2021

Simple ETL Using Luigi

Photo by Florian Wächter on Unsplash

What is Luigi?

Luigi is a Python package (tested on 2.7, 3.6, and 3.7) that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, failure handling, command line integration, and much more.

Why Luigi?

Installation

Luigi is a Python module. To install it, you just need to type:

pip install luigi

Building Blocks

There are two fundamental building blocks of Luigi: Task and Target.

Task’s methods

Practice Case

Let’s write an ETL story.

Imagine you want to create a report. The data you need comes from websites, so you have to extract it before you can build the report. Here is the workflow diagram:

Luigi workflow diagram example

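In this example, the report is generated in two steps: first scrape the data from the websites into a CSV file, then read that file and produce the report.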

Code

A code example for the story above would be:

import luigi
import pandas as pd


class ScrapeData(luigi.Task):
    def output(self):
        return luigi.LocalTarget('scrape_table.csv')

    def run(self):
        # write a scraping script to extract data from websites
        # and save the result to self.output().path
        raise NotImplementedError('scraping logic goes here')


class GenerateReport(luigi.Task):
    def requires(self):
        return ScrapeData()

    def output(self):
        return luigi.LocalTarget('report.xlsx')

    def run(self):
        # read extracted data in csv
        data = pd.read_csv(self.input().path)
        # do some data manipulation and save it as an Excel file (needs openpyxl)
        data.to_excel(self.output().path, index=False)