Homemade BigData. Part 1. Spark Streaming practice on an AWS cluster
Hello.
There are many services on the Internet that provide cloud services. With their help, you can learn the technology of BigData.
In this article, we will install Apache Kafka, Apache Spark, Zookeeper, Spark-shell on the EC2 AWS (Amazon Web Services) platform at home and learn how to use it all.

Introducing Amazon Web Services
Follow the link aws.amazon.com/console and register. Enter a name and remember the password.
Configure node instances for Zookeeper and Kafka services.
- Select "Services -> EC2" from the menu. Next, choose the operating system image for the virtual machine: Ubuntu Server 16.04 LTS (HVM), SSD Volume Type, and click "Select". Then configure the server instance: type "t3.medium" (2 vCPU, 4 GB memory, General Purpose) and click "Next: Configure Instance Details".
- Set the number of instances to 1 and click "Next: Add Storage".
- Accept the default disk size of 8 GB and change the type to Magnetic (in production, choose the size based on your data volume and use a high-performance SSD).
- In the "Tag Instances" section, enter the node instance name "Home1" for "Name" (where 1 is just a serial number) and click "Next: ...".
- In the "Configure Security Groups" section, select the "Use existing security group" option, choose the security group name ("Spark_Kafka_Zoo_Project"), and set the rules for incoming traffic. Click "Next: ...".
- Scroll through the Review screen to verify your entries and click Launch (an AWS CLI equivalent of these console steps is sketched after this list).
- To connect to the cluster nodes, you must create (or, in our case, reuse an existing) key pair for identification and authorization. To do this, select the "Use existing pair" option in the list.
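The same instance can also be launched from the command line with the AWS CLI; this is only a sketch, and the AMI ID, key pair name, and security group ID below are placeholders to substitute with your own values.
# Sketch only: ami-xxxxxxxx and sg-xxxxxxxx stand in for the Ubuntu Server 16.04 LTS AMI
# of your region and the ID of the Spark_Kafka_Zoo_Project security group
aws ec2 run-instances \
  --image-id ami-xxxxxxxx \
  --instance-type t3.medium \
  --count 1 \
  --key-name HadoopUser01 \
  --security-group-ids sg-xxxxxxxx \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=Home1}]'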
Key Creation
- Download PuTTY (https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html) as the client, or use an SSH connection from the terminal.
- The .pem key file is in the older format; for convenience, we convert it to the .ppk format used by PuTTY. To do this, run the PuTTYgen utility and load the key in the old .pem format into it. Convert the key and save it (Save private key) in the home folder with the .ppk extension for later use.
Cluster launch
For convenience, rename the cluster nodes using the Node01-04 notation. To connect to the cluster nodes from the local computer via SSH, you need to determine each node's IP address and its public/private DNS name. Select each of the cluster nodes in turn and, for the selected instance, write down its public/private DNS name (used for connecting via SSH and for installing software) in the text file HadoopAdm01.txt.
Example: ec2-35-162-169-76.us-west-2.compute.amazonaws.com
Install Apache Kafka in SingleNode Mode on an AWS Cluster Node
To install the software, select our node (copy its Public DNS) and connect to it via SSH. We use the saved name of the first node to configure the SSH connection with the private/public key pair "HadoopUser01.ppk" created in the Key Creation step above. In PuTTY, go to the Connection / Auth section and, via the Browse button, locate the folder where we previously saved the "HadoopUserXX.ppk" file.
Save the connection configuration in the settings.
Connect to the node using the login: ubuntu.
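If you work from a terminal instead of PuTTY, the connection can be made with the original .pem key; a minimal example, assuming the key file is named HadoopUser01.pem and using the public DNS name from the example above:
chmod 400 HadoopUser01.pem
ssh -i HadoopUser01.pem ubuntu@ec2-35-162-169-76.us-west-2.compute.amazonaws.com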
- Using root privileges we update packages and install additional packages required for further installation and configuration of the cluster.
sudo apt-get update
sudo apt-get -y install wget net-tools netcat tar
- Install Java 8 jdk and check the version of Java.
sudo apt-get -y install openjdk-8-jdk
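The version can then be checked with the standard command:
java -version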
- For normal cluster node performance, you need to adjust the memory swap settings. vm.swappiness is set to 60 by default; the higher the value, the more actively the system swaps data from RAM to disk. Depending on the Linux version, the vm.swappiness parameter can be set to 0 or 1:
sudo sysctl vm.swappiness=1
- To save the settings during reboot, add a line to the configuration file.
echo 'vm.swappiness=1' | sudo tee --append /etc/sysctl.conf
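To verify that the value is active (and will be re-applied from /etc/sysctl.conf after a reboot), you can read it back:
sysctl vm.swappiness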
- We edit the entries in the /etc/hosts file so that the names of the Kafka and Zookeeper cluster nodes conveniently resolve to the private IP addresses of the assigned cluster nodes.
echo "172.31.26.162 host01" | sudo tee --append /etc/hosts
We check that the names resolve correctly by pinging any of the entries.
- Download the current version (http://kafka.apache.org/downloads) of the Kafka distribution (built for the desired Scala version) and prepare the directory with the installation files.
wget http://mirror.linux-ia64.org/apache/kafka/2.1.0/kafka_2.12-2.1.0.tgz
tar -xvzf kafka_2.12-2.1.0.tgz
ln -s kafka_2.12-2.1.0 kafka
- Delete the tgz archive file; we will no longer need it.
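For the archive downloaded above, that is simply:
rm kafka_2.12-2.1.0.tgz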
- Let's try to start the Zookeeper service, for this:
~/kafka/bin/zookeeper-server-start.sh -daemon ~/kafka/config/zookeeper.properties
Zookeeper starts with default startup options. You can check the log:
tail -n 5 ~/kafka/logs/zookeeper.out
To ensure that the Zookeeper daemon starts after a reboot, we need to run Zookeeper as a background service:
bin/zookeeper-server-start.sh -daemon config/zookeeper.properties
To check that Zookeeper has started:
netcat -vz localhost 2181
We configure the Zookeeper and Kafka services for work. First, edit/create the file /etc/systemd/system/zookeeper.service (file contents below).
[Unit]
Description=Apache Zookeeper server
Documentation=http://zookeeper.apache.org
Requires=network.target remote-fs.target
After=network.target remote-fs.target

[Service]
Type=simple
ExecStart=/home/ubuntu/kafka/bin/zookeeper-server-start.sh /home/ubuntu/kafka/config/zookeeper.properties
ExecStop=/home/ubuntu/kafka/bin/zookeeper-server-stop.sh

[Install]
WantedBy=multi-user.target
Next, for Kafka, edit/create the file /etc/systemd/system/kafka.service (file contents below).
[Unit]
Description=Apache Kafka server (broker)
Documentation=http://kafka.apache.org/documentation.html
Requires=zookeeper.service

[Service]
Type=simple
ExecStart=/home/ubuntu/kafka/bin/kafka-server-start.sh /home/ubuntu/kafka/config/server.properties
ExecStop=/home/ubuntu/kafka/bin/kafka-server-stop.sh

[Install]
WantedBy=multi-user.target
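After creating or changing unit files, systemd has to re-read its configuration before the new services can be enabled:
sudo systemctl daemon-reload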
- We activate systemd scripts for Kafka and Zookeeper services.
sudo systemctl enable zookeeper
sudo systemctl enable kafka
- Check the operation of systemd scripts.
sudo systemctl start zookeeper
sudo systemctl start kafka
sudo systemctl status zookeeper
sudo systemctl status kafka
sudo systemctl stop zookeeper
sudo systemctl stop kafka
- We’ll check the functionality of the Kafka and Zookeeper services.
netcat -vz localhost 2181
netcat -vz localhost 9092
- Check the zookeeper log file.
cat logs/zookeeper.out
First joy
We create our first topic on the assembled kafka server.
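For a single-broker node, a minimal create command would look like this (one partition and replication factor 1 are assumptions matching the single-node setup):
bin/kafka-topics.sh --zookeeper host01:2181 --create --topic first_topic --partitions 1 --replication-factor 1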
- It is important to use the connection string "host01:2181", as specified in the server.properties configuration file.
- We write some data in the topic.
kafka-console-producer.sh --broker-list host01:9092 --topic first_topic
Hello
How was your weekend
Ctrl-C exits the producer console.
- Now try to read the data from the topic.
kafka-console-consumer.sh --bootstrap-server host01:9092 --topic first_topic --from-beginning
- Let's look at the list of kafka topics.
bin/kafka-topics.sh --zookeeper host01:2181 --list
- Edit the Kafka server parameters to tune them for a single-node cluster setup:
# it is necessary to change the min.insync.replicas (ISR) parameter to 1
bin/kafka-topics.sh --zookeeper host01:2181 --config min.insync.replicas=1 --topic __consumer_offsets --alter
- We restart the Kafka server and try to connect with the consumer again.
- Let's look at the topic list.
bin/kafka-topics.sh --zookeeper host01:2181 --list
Configure Apache Spark on a single-node cluster
We have prepared a node instance on AWS with the Zookeeper and Kafka services installed; now we need to install Apache Spark. To do this:
Download the latest version of the Apache Spark distribution.
wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.6.tgz
- Unpack the distribution, create a symbolic link for spark, and delete the unneeded archive file.
tar -xvf spark-2.4.0-bin-hadoop2.6.tgz
ln -s spark-2.4.0-bin-hadoop2.6 spark
rm spark*.tgz
- Go to the sbin directory and run the Spark master.
./start-master.sh
- We connect using a web browser to the Spark server on port 8080.
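The UI is reached in the browser via the node's public DNS name, provided port 8080 is open in the security group; from the node itself you can quickly confirm that the master web UI responds (wget was installed in an earlier step):
wget -qO- http://localhost:8080 | head -n 5   # prints the first lines of the master UI page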
- Run a Spark worker (slave) on the same node:
./start-slave.sh spark://host01:7077
- Run the spark shell with the master on host01.
./spark-shell --master spark://host01:7077
- If the launch does not work, add the path to Spark in bash.
vi ~/.bashrc
# add these lines to the end of the file
SPARK_HOME=/home/ubuntu/spark
export PATH=$SPARK_HOME/bin:$PATH
source ~/.bashrc
- Run the spark shell again with the master on host01.
./spark-shell --master spark://host01:7077
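If you want to read the Kafka topic from this shell, the Kafka connector is not bundled with the Spark distribution; a sketch of how to pull it in, assuming the Spark 2.4.0 / Scala 2.11 build downloaded above:
./spark-shell --master spark://host01:7077 --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0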
A single-node cluster with Kafka, Zookeeper and Spark works. Hurrah!
A bit of creativity
Download the Scala-IDE editor (from scala-ide.org). Start it and begin writing code. I will not repeat myself here, as there is already a good article on Habr about this.
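Once you package your code (for example with sbt), it can be submitted to this single-node cluster with spark-submit; the class and jar names below are hypothetical placeholders, and Spark's bin directory is assumed to be on the PATH as configured above:
# example.KafkaStreamingApp and the jar path are placeholders for your own sbt project output
spark-submit \
  --master spark://host01:7077 \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 \
  --class example.KafkaStreamingApp \
  target/scala-2.11/kafka-streaming-app_2.11-0.1.jar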
Useful literature and courses to help:
courses.hadoopinrealworld.com/courses/enrolled/319237
data-flair.training/blogs/kafka-consumer
www.udemy.com/apache-spark-with-scala-hands-on-with-big-data