A union of curiosity and data science

Knowledgebase and brain dump of a database engineer


Set Up and Install Apache Airflow on an Ubuntu 18.04 GCP (Google Cloud) VM

 

First we log in to GCP.

Next, create a VM within "Compute Engine".

I create a small VM named Airflow for this demo.

I choose Ubuntu 18.04 LTS Minimal as the boot image, then create the VM.

Connect to the VM using the browser SSH client.
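
For anyone who prefers a terminal to the console, the same two steps can be sketched with the gcloud CLI. This is only a sketch under assumptions: the helper names create_airflow_vm and connect_airflow_vm are hypothetical, the zone and machine type are whatever you pass in, and the image family shown is the one published for Ubuntu 18.04 Minimal, which may no longer be offered.

```shell
# Hypothetical helper: create a small Ubuntu 18.04 Minimal VM.
# Assumes the gcloud CLI is installed and authenticated against a project.
# The image family/project are assumptions and may no longer be available.
create_airflow_vm() {
  vm_name="$1"        # e.g. airflow
  zone="$2"           # e.g. us-central1-a (example value)
  machine_type="$3"   # e.g. g1-small (example value)

  gcloud compute instances create "$vm_name" \
    --zone="$zone" \
    --machine-type="$machine_type" \
    --image-family=ubuntu-minimal-1804-lts \
    --image-project=ubuntu-os-cloud
}

# Hypothetical helper: open an SSH session to the VM from a local terminal
# instead of the browser SSH client.
connect_airflow_vm() {
  gcloud compute ssh "$1" --zone="$2"
}
```

Example usage, with made-up values: `create_airflow_vm airflow us-central1-a g1-small`, then `connect_airflow_vm airflow us-central1-a`.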

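sudo su
# Refresh the package lists.
apt-get update
# Python and pip from the Ubuntu 18.04 repositories.
apt-get install -y python
apt-get install -y software-properties-common
apt-get install -y python-pip
# Set before installing Airflow to avoid the unidecode dependency error.
export SLUGIFY_USES_TEXT_UNIDECODE=yes
pip install apache-airflow
# Swap in an earlier marshmallow-sqlalchemy to avoid the initdb error (October 2019).
pip uninstall -y marshmallow-sqlalchemy
pip install marshmallow-sqlalchemy==0.17.1
# Initialize the default SQLite database, then start the web server.
airflow initdb
airflow webserver -p 8080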

 

The first thing I'll do when connected is elevate my user to root.

Next I'll refresh the OS package lists.

Next, install Python.

Next we'll install software-properties-common. This helps manage the repositories that we install software from.

Next let's install pip.

 

 

We also want to export the SLUGIFY_USES_TEXT_UNIDECODE environment variable before installing. This prevents an install error about one of Airflow's dependencies pulling in the unidecode package.

You can read more on this here: https://stackoverflow.com/questions/52203441/error-while-install-airflow-by-default-one-of-airflows-dependencies-installs-a

Now install Apache Airflow using pip.

As of October 2019, you'll get a marshmallow-sqlalchemy error if you attempt to initialize the default SQLite database.

To prevent this error, install an earlier version of marshmallow-sqlalchemy (0.17.1 here).

Initialize the database.

Run the web server on port 8080.

Open the GCP firewall to allow traffic to the Airflow server on port 8080.
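
If you'd rather script the firewall step than click through the console, it could look roughly like this with the gcloud CLI. Treat it as a sketch: the helper name open_airflow_port, the rule name, and the source range are hypothetical, and it assumes the VM sits on the "default" network.

```shell
# Hypothetical helper: allow inbound TCP 8080 from a single source range.
# Assumes the gcloud CLI is installed and authenticated, and that the VM
# uses the "default" VPC network.
open_airflow_port() {
  rule_name="$1"      # e.g. allow-airflow-8080 (made-up name)
  source_range="$2"   # e.g. your own IP as a /32 CIDR

  gcloud compute firewall-rules create "$rule_name" \
    --network=default \
    --direction=INGRESS \
    --allow=tcp:8080 \
    --source-ranges="$source_range"
}
```

Example usage, with a documentation-only address: `open_airflow_port allow-airflow-8080 203.0.113.7/32`. Passing your own address instead of opening the port to everyone limits who can reach the web server.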

 

At this point you may be wondering why there is a warning at the top of the page related to the scheduler. The scheduler has not been started yet, and the "max_threads" setting in the Airflow config is greater than 1. With SQLite as the DB, this setting needs to be set to 1, and then the scheduler needs to be started.

 

OK, I'm going to log back into the console and use the browser to SSH into my instance.
Once I'm in, I'll switch users and open the Airflow config file. Once the config file is open, scroll down until you see "max_threads". If you're using SQLite, change this value to 1. Save the file.
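
The same edit can be made without opening an editor. The sketch below is one way to do it, assuming the config lives at the default location, airflow.cfg inside AIRFLOW_HOME (which falls back to ~/airflow). The helper name set_max_threads_one is hypothetical.

```shell
# Hypothetical helper: set max_threads to 1 in an Airflow config file.
# With no argument it targets airflow.cfg under AIRFLOW_HOME, falling back
# to ~/airflow when AIRFLOW_HOME is not set.
set_max_threads_one() {
  cfg="${1:-${AIRFLOW_HOME:-$HOME/airflow}/airflow.cfg}"

  # Keep a backup next to the original, then rewrite the setting from it.
  cp "$cfg" "$cfg.bak"
  sed 's/^max_threads[[:space:]]*=.*/max_threads = 1/' "$cfg.bak" > "$cfg"

  # Print the resulting line so the change can be checked.
  grep '^max_threads' "$cfg"
}
```

Run `set_max_threads_one` as the same user that ran `airflow initdb`, so that the default path resolves to the right home directory. The original file is kept alongside as airflow.cfg.bak.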

Now we can start the scheduler. 

 

 

 

 

Airflow docs: https://airflow.apache.org/start.html

 

 

 

 

 

 
