This document specifies basic Apache Cassandra configuration and integration with the Databricks platform. It does not cover setting up an AWS machine or AWS VPC network configuration, so there are several prerequisites before you begin work with Cassandra.
Ensure that all of the tasks below have been completed.
Open port 9042 (the default Cassandra port for CQL). You can find instructions here.
Open port 9042 on the local firewall:
iptables -A INPUT -i eth0 -p tcp --dport 9042 -j ACCEPT
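Once the security group and firewall rules are in place, it is worth confirming that the CQL port is actually reachable from outside. The sketch below (the host string is a placeholder for your EC2 elastic IP, and `is_port_open` is a hypothetical helper, not part of any library) does a plain TCP connect check:

```python
import socket

def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        s = socket.create_connection((host, port), timeout=timeout)
        s.close()
        return True
    except (socket.error, OSError):
        return False

# Example: once the rules above are applied, this should return True:
# is_port_open("xx.xx.xx.xx", 9042)
```

A successful TCP connect only proves the port is open; it does not validate the CQL handshake itself, but it is usually enough to rule out firewall problems.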
Edit cassandra.yaml to set up remote access to the Cassandra cluster:
listen_address: localhost
rpc_address: localhost
broadcast_rpc_address: xx.xx.xx.xx
where xx.xx.xx.xx is the elastic IP assigned to the EC2 machine. More information is available in this Stack Overflow article.
Restart Cassandra to apply the changes:
sudo service cassandra restart
Open the Shared folder in your Databricks workspace and add a new library. Select PyPI as the library source and enter cassandra-driver as the package name. Create the library; it will then be installed on your cluster automatically. You can verify the progress and status of the installation on the panel.
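After the cassandra-driver library is installed, a quick smoke test confirms both the install and the remote-access configuration. The sketch below is an illustration, not part of the original document: the host is a placeholder for your elastic IP, and `build_select` / `smoke_test` are hypothetical helpers (the query string is kept Python 2/3 compatible since the cluster runs Python 2):

```python
def build_select(keyspace, table, limit=10):
    """Compose a simple CQL SELECT statement."""
    return "SELECT * FROM {0}.{1} LIMIT {2}".format(keyspace, table, limit)

def smoke_test(host):
    """Connect via cassandra-driver and read the server's release version."""
    # Imported inside the function so build_select stays usable even
    # where the driver is not installed.
    from cassandra.cluster import Cluster
    cluster = Cluster([host], port=9042)
    session = cluster.connect()
    row = session.execute("SELECT release_version FROM system.local").one()
    cluster.shutdown()
    return row.release_version

if __name__ == "__main__":
    print(smoke_test("xx.xx.xx.xx"))  # elastic IP of the EC2 machine
```

If this prints a version string, the driver, the port rules, and the cassandra.yaml changes are all working together.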
Download the Cassandra Spark connector: the file spark-cassandra-connector_2.11-2.3.1.jar (this exact version). You can download the jar from here, then upload the spark-cassandra-connector_2.11-2.3.1.jar library manually to Databricks.
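The exact jar can also be located programmatically on Maven Central. The sketch below assumes the connector's published group `com.datastax.spark` and the standard Maven repository layout; `connector_jar_url` is a hypothetical helper, so verify the resulting URL before relying on it:

```python
MAVEN_ROOT = "https://repo1.maven.org/maven2"

def connector_jar_url(scala_version="2.11", connector_version="2.3.1"):
    """Build the Maven Central URL for the exact connector jar version."""
    artifact = "spark-cassandra-connector_{0}".format(scala_version)
    return "{0}/com/datastax/spark/{1}/{2}/{1}-{2}.jar".format(
        MAVEN_ROOT, artifact, connector_version)
```

Pinning the Scala version (2.11) and connector version (2.3.1) matters: they must match the cluster's Databricks Runtime (Spark 2.3.1, Scala 2.11) described below.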
The Python code should be run on a Databricks Spark cluster with the following specification:
Databricks Runtime Version 4.3 (includes Apache Spark 2.3.1, Scala 2.11), Python version 2.
In the snippet below, xx.xx.xx.xx is the elastic IP assigned to the EC2 machine:
d = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(table="trans", keyspace="test") \
    .option("spark.cassandra.connection.host", "xx.xx.xx.xx") \
    .load()