Databricks integration with Amazon EC2 Cassandra server

© Mariusz Rafało

This document describes basic Apache Cassandra configuration and integration with the Databricks platform. It does not cover setting up an AWS machine or AWS VPC network configuration. As a result, there are several prerequisites before you can begin working with Cassandra.

Prerequisites

Ensure that all of the tasks listed below have been completed.

  1. You have an AWS EC2 machine up and running on Amazon Linux. An instance of at least t2.medium size is required. Select the Amazon Linux 2 AMI 2.0.20181024 x86_64 HVM gp2 image.
  2. You have SSH access to your machine (e.g. via a private key).
  3. You have an elastic IP assigned to your machine. You can find information about elastic IPs here.
  4. You have installed and started Apache Cassandra, for example according to this tutorial.
  5. Your EC2 machine is accessible from the Internet via security groups. You need to configure the network to open port 9042 (Cassandra's default port for CQL). You can find instructions here.
  6. Cassandra is properly installed and the Cassandra service is running on the default ports. Here you can find detailed port descriptions. The key assumption is that your Cassandra node communicates on port 9042.
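Before moving on, it can help to confirm that port 9042 is actually reachable from outside the EC2 machine. The helper below is an illustrative sketch (the function name is ours, not part of any Cassandra API) that simply attempts a TCP connection:

```python
import socket

def cassandra_port_open(host, port=9042, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    A quick sanity check that the security group and local firewall
    actually expose Cassandra's CQL port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: replace with your elastic IP
# print(cassandra_port_open("xx.xx.xx.xx"))
```

If this returns False, revisit the security group rules and the firewall step below before debugging anything on the Databricks side.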

Cassandra configuration

Firewall

Open port 9042 on the local firewall:

sudo iptables -A INPUT -i eth0 -p tcp --dport 9042 -j ACCEPT

Network

Modify cassandra.yaml to set up remote access to the Cassandra cluster:

listen_address: localhost
rpc_address: localhost
broadcast_rpc_address: xx.xx.xx.xx

where xx.xx.xx.xx is the elastic IP assigned to your EC2 machine. More information is available in this Stack Overflow post.
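The edit above can also be scripted. The sketch below is a pure-text helper (the function name is illustrative, not part of any tool) that rewrites the three address keys, so you can apply it to a copy of cassandra.yaml and review the result before replacing the file:

```python
import re

def set_cassandra_addresses(conf_text, elastic_ip):
    """Rewrite the three address keys in cassandra.yaml text.

    Takes the file contents as a string and returns the updated
    string. Assumes the keys are present and uncommented.
    """
    updates = {
        "listen_address": "localhost",
        "rpc_address": "localhost",
        "broadcast_rpc_address": elastic_ip,
    }
    for key, value in updates.items():
        conf_text = re.sub(
            r"(?m)^%s:.*$" % key,
            "%s: %s" % (key, value),
            conf_text,
        )
    return conf_text
```

Because the keys anchor at the start of a line, replacing rpc_address does not accidentally touch broadcast_rpc_address.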

Restart Cassandra

sudo service cassandra restart

Databricks configuration

Cassandra driver for pure Python

Navigate to the Shared folder in your Databricks workspace and add a new library. Select PyPI as the library source and enter cassandra-driver as the package name.

Create the library. It will be installed on your cluster automatically; you can check the progress and status of the installation on the library panel.
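Once the library is installed, the connection can be checked directly from a notebook cell. This is a minimal sketch: it assumes a keyspace test containing a table trans (as used in the Test connection section of this document), and xx.xx.xx.xx is your elastic IP.

```python
# Minimal connectivity check using the cassandra-driver package.
# Assumes the library is installed on the cluster and that keyspace
# "test" with table "trans" exists; xx.xx.xx.xx is your elastic IP.
from cassandra.cluster import Cluster

cluster = Cluster(["xx.xx.xx.xx"], port=9042)
session = cluster.connect("test")
for row in session.execute("SELECT * FROM trans LIMIT 5"):
    print(row)
cluster.shutdown()
```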

Cassandra connector for PySpark

Download the Spark Cassandra connector: the file spark-cassandra-connector_2.11-2.3.1.jar (this exact version). You can download the jar from here.

Install the spark-cassandra-connector_2.11-2.3.1.jar library manually on Databricks.

Test connection

The Python code should be run on a Databricks Spark cluster with the following specification:

Databricks Runtime Version: 4.3 (includes Apache Spark 2.3.1, Scala 2.11). Python version 2.

Note that xx.xx.xx.xx is the elastic IP assigned to your EC2 machine.

d = (spark.read
     .format("org.apache.spark.sql.cassandra")
     .options(host="xx.xx.xx.xx", table="trans", keyspace="test")
     .load())