A union of curiosity and data science

Knowledgebase and brain dump of a database engineer


Presto Cheat Sheet

Presto : https://prestodb.io/overview.html

 

Quick "Presto" docker setup from FB 2019 Demo: 
https://github.com/prestodb/f8-2019-demo

git clone https://github.com/prestodb/f8-2019-demo
cd f8-2019-demo
docker-compose up
open localhost:8888

 

Presto is a distributed system that runs on a cluster of machines. 
A full installation includes a coordinator and multiple workers.
Queries are submitted from a client such 
as the Presto CLI, to the coordinator. 
The coordinator parses, analyzes and plans the 
query execution, then distributes the 
processing to the workers.



installing presto : 
https://prestodb.io/docs/current/installation/deployment.html

installing the client (cli):
https://prestodb.io/docs/current/installation/cli.html

High level commands for the server:

bin/launcher run   # runs the server in the foreground. 
bin/launcher start # runs the server in the background
bin/launcher status #gives you the current spid and status
bin/launcher restart #restarts the service. 

When connecting to a JDBC SQL Server Database. 

Your connection URL should contain the database name you'd like to connect to. 

connector.name=sqlserver
connection-url=jdbc:sqlserver://localhost:port;databaseName=AdventureWorks;
connection-user=<username>
connection-password=<pass>

Presto schemas do not include sql databases only sql owners/schemas. 

 Launching the presto client:
https://prestodb.io/docs/current/installation/cli.html

bin/presto --server <IP>:<port> --catalog <name of config file catalog> --schema <default schema>

 

Obtain a damn good starter "etc" directory from sundog because presto doesn't come with one but it's needed:

wget http://media.sundog-soft.com/hadoop/presto-hdp-config.tgz

prerequisites:

Unix/Linux/Mac
python >2.4
java 8 "yum install java-1.8.0-openjdk" 


good SQL Server reference for Presto: https://docs.starburstdata.com/latest/connector/sqlserver.html

 

Web Interface info:
https://prestodb.io/docs/current/admin/web-interface.html

 

 

 

Python Environments Setup.

#PYTHON  - Windows
python -m venv env
env\Scripts\activate or env\Scripts\activate.bat
cd YOUR_PROJECT
pip install -r requirements.txt 


#Unix
$ python -m venv mysite/env
$ source mysite/env/bin/activate



requirements file sample format:requirements.txt
boto==2.49.0
boto3==1.9.74
botocore==1.12.74
click==6.7
docutils==0.14
dominate==2.3.1
Flask==0.12.2
Flask-Bootstrap==3.3.7.1
Flask-Login==0.4.1
Flask-Migrate==2.3.1
Flask-SQLAlchemy==2.3.2
Flask-WTF==0.14.2
itsdangerous==0.24
Jinja2==2.10
jmespath==0.9.3
Mako==1.0.7
MarkupSafe==1.0
psycopg2==2.7.6.1
psycopg2-binary==2.7.6.1


#ANACONDA
conda create -n <environment name>
conda activate <environment name>
conda install pip
pip install -r requirements.txt


 

Setup and Install Apache Airflow on a Ubuntu 18 GCP (Google Cloud) VM

 

 First we log into GCP. 

Next create a VM within "Compute Engine". 

I create a small VM named Airflow for this demo.  

I choose Ubuntu 18.04 LTS Minimal. Create the VM

Connect to the VM using the browser SSH client.

sudo su
apt-get update
apt install python
apt-get install software-properties-common
apt-get install python-pip
export SLUGIFY_USES_TEXT_UNIDECODE=yes
pip install apache-airflow
pip uninstall marshmallow-sqlalchemy
pip install marshmallow-sqlalchemy==0.17.1
airflow initdb
airflow webserver -p 8080

 

The first thing I'll do when connected is elevate my user. 

Next I'll update the OS. 

Next Install Python. 

Next we'll install software-properties-common. This will help manage the repo's that we install software from. 

Next let's install Pip

 

 

We also want to export an environment variable for UNIDECODE to prevent errors. 

You can read more on this here : https://stackoverflow.com/questions/52203441/error-while-install-airflow-by-default-one-of-airflows-dependencies-installs-a

Now install apache airflow using pip

Currently in October 2019, you'll get a Marshmallow-SQLalchemy error if you attempt to initialize the default SQLite Database.

To prevent this error install an earlier version of Marshmallow-SQLalchemy.

Initialize the database

Run the web server on port 8080

Open the GCP Firewall to allow traffic to the airflow server. 

 

At this point you may be wondering ,  why is there an warning at the top of the page related to the scheduler. This is due to a "Max Threads" setting in the airflow config being greater than 1. With Sqlite as the DB , this setting will need to be set to 1 and the scheduler will need to be started. 

 

Ok, I'm going to log back into the console and use the browser to SSH into my instance. 
Once I'm in , I'll switch users and open the airflow config file. Once the config file is open, scroll down until you see  "max_threads". If you're using SQLite change this value to 1. Save the file.

Now we can start the scheduler. 

 

 

 

 

Airflow docs: https://airflow.apache.org/start.html