HDP Repo with Nginx

Environments that run HDP without a connection to the internet need a dedicated HDP repository that all nodes can reach. The setup can differ slightly depending on whether the nodes have temporary or no internet access, but in either case they need a file service holding a copy of the HDP repo. Most enterprises already have dedicated infrastructure in place based on Aptly or Satellite. This post describes the setup of an Nginx host serving as an HDP repository host.

Downloading HDP Repo

The public HDP repo is hosted under http://public-repo-1.hortonworks.com, where releases for various operating systems are published. The repository can be explored to find the release that suits your environment. For example, a copy of the recent release HDP-2.3.4 can be downloaded like this:

Cluster OS / HDP Repository Tarballs

RHEL/CentOS/Oracle Linux 6.x:
wget http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.3.4.0/HDP-2.3.4.0-centos6-rpm.tar.gz
wget http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/centos6/HDP-UTILS-1.1.0.20-centos6.tar.gz

RHEL/CentOS/Oracle Linux 7.x:
wget http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.3.4.0/HDP-2.3.4.0-centos7-rpm.tar.gz
wget http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/centos7/HDP-UTILS-1.1.0.20-centos7.tar.gz

SLES 11 SP3:
wget http://public-repo-1.hortonworks.com/HDP/suse11sp3/2.x/updates/2.3.4.0/HDP-2.3.4.0-suse11sp3-rpm.tar.gz
wget http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/suse11sp3/HDP-UTILS-1.1.0.20-suse11sp3.tar.gz

Ubuntu 12.04:
wget http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.x/updates/2.3.4.0/HDP-2.3.4.0-ubuntu12-deb.tar.gz
wget http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/ubuntu12/HDP-UTILS-1.1.0.20-ubuntu12.tar.gz

Ubuntu 14:
wget http://public-repo-1.hortonworks.com/HDP/ubuntu14/2.x/updates/2.3.4.0/HDP-2.3.4.0-ubuntu14-deb.tar.gz
wget http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/ubuntu14/HDP-UTILS-1.1.0.20-ubuntu14.tar.gz

Debian 6:
wget http://public-repo-1.hortonworks.com/HDP/debian6/2.x/updates/2.3.4.0/HDP-2.3.4.0-debian6-deb.tar.gz
wget http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/debian6/HDP-UTILS-1.1.0.20-debian6.tar.gz

Debian 7:
wget http://public-repo-1.hortonworks.com/HDP/debian7/2.x/updates/2.3.4.0/HDP-2.3.4.0-debian7-deb.tar.gz
wget http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/debian7/HDP-UTILS-1.1.0.20-debian7.tar.gz

A release is typically around 3-4 GB in size.

Following a simple naming pattern, you can also use a helper script to download the release you want. The script takes the OS and the HDP version as arguments and downloads the corresponding repository.
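The naming pattern can be captured in a few shell functions. This is a minimal sketch and not the helper script mentioned above: the function names are hypothetical, the 2.x path segment and the HDP-UTILS version 1.1.0.20 are taken from the URLs listed earlier, and the mapping from OS identifier to rpm or deb is inferred from those same URLs.

```shell
#!/bin/sh
# Hypothetical helper functions; names and structure are assumptions.
BASE_URL="http://public-repo-1.hortonworks.com"
UTILS_VERSION="1.1.0.20"   # HDP-UTILS version from the listing above

# Print the package type (rpm or deb) for an OS identifier such as centos7.
pkg_type() {
    case "$1" in
        centos*|suse*)   echo "rpm" ;;
        ubuntu*|debian*) echo "deb" ;;
        *) echo "unsupported OS: $1" >&2; return 1 ;;
    esac
}

# Print the HDP tarball URL for an OS and HDP version, e.g. centos7 2.3.4.0.
hdp_repo_url() {
    os="$1"; version="$2"
    pkg="$(pkg_type "$os")" || return 1
    echo "$BASE_URL/HDP/$os/2.x/updates/$version/HDP-$version-$os-$pkg.tar.gz"
}

# Print the HDP-UTILS tarball URL for an OS.
hdp_utils_url() {
    os="$1"
    pkg_type "$os" >/dev/null || return 1
    echo "$BASE_URL/HDP-UTILS-$UTILS_VERSION/repos/$os/HDP-UTILS-$UTILS_VERSION-$os.tar.gz"
}

# Download both tarballs for an OS and HDP version.
download_hdp_repo() {
    repo_url="$(hdp_repo_url "$1" "$2")" || return 1
    utils_url="$(hdp_utils_url "$1")" || return 1
    wget "$repo_url" && wget "$utils_url"
}
```

With these functions sourced into the shell, download_hdp_repo centos7 2.3.4.0 would fetch the two CentOS 7 tarballs from the listing above.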

Install Nginx

Installing Nginx is straightforward using the available package. For CentOS 7 the EPEL repository is required.

$ yum install -y epel-release
$ yum install -y nginx

Starting Nginx:
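On CentOS 7 services are managed by systemd, so starting Nginx and enabling it at boot would typically look like this. The sketch assumes the nginx unit installed by the EPEL package and root privileges:

```
$ systemctl start nginx
$ systemctl enable nginx
```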

Configure Nginx as a File Service

Nginx’s default configuration lives in /etc/nginx/nginx.conf. The config is designed to declare servers. We will adjust the default so that it functions as the repository service for our cluster.

It is important to watch the number of worker processes you assign as well as the number of connections each worker can hold. Nginx will function as a file service that has to satisfy many concurrent connections in a large cluster. We want to achieve high individual throughput with support for concurrent connections, following the recommendations below.
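As a minimal sketch of the worker settings mentioned here, the top of nginx.conf might look like the following. The values are assumptions for illustration and should be sized to the cluster; the upper bound on simultaneous connections is roughly worker_processes multiplied by worker_connections.

```nginx
# Illustrative values, not taken from the original post.
worker_processes auto;        # one worker per CPU core

events {
    worker_connections 1024;  # connections each worker may hold open
}
```

The directives that affect file-serving throughput are covered next.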

Enabling sendfile

By default, NGINX handles file transmission itself and copies the file into the buffer before sending it. Enabling the sendfile directive eliminates the step of copying the data into the buffer and enables direct copying of data from one file descriptor to another. Additionally, to prevent one fast connection from entirely occupying the worker process, you can limit the amount of data transferred in a single sendfile() call by defining the sendfile_max_chunk directive.
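A minimal sketch of both directives, assuming they are set on the location that serves the repository files. The location path and the chunk size of 1m are illustrative choices:

```nginx
location / {
    sendfile           on;   # kernel copies file data directly to the socket
    sendfile_max_chunk 1m;   # cap the data sent per sendfile() call
}
```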

Enabling tcp_nopush

Use the tcp_nopush option together with sendfile on;. The option enables NGINX to send HTTP response headers in one packet right after the chunk of data has been obtained by sendfile.
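In configuration terms this could look like the following sketch, where the location path is again an assumption:

```nginx
location / {
    sendfile   on;
    tcp_nopush on;   # only takes effect when sendfile is enabled
}
```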

Enabling tcp_nodelay

The tcp_nodelay option allows overriding Nagle’s algorithm, originally designed to solve problems with small packets in slow networks. The algorithm consolidates a number of small packets into a larger one and sends the packet with a 200 ms delay. Nowadays, when serving large static files, the data can be sent immediately regardless of the packet size. The delay would also affect online applications (ssh, online games, online trading). By default, the tcp_nodelay directive is set to on, which means that Nagle’s algorithm is disabled. The option is used only for keepalive connections.
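Since the option only applies to keepalive connections, it is usually paired with a keepalive timeout. A minimal sketch, with the location path and the 65 second timeout as illustrative assumptions:

```nginx
location / {
    tcp_nodelay       on;   # already the default, stated here for clarity
    keepalive_timeout 65;   # seconds an idle keepalive connection stays open
}
```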

System Optimizations

Some kernel optimizations are needed to support a high number of concurrent connections. These should already be part of the system configuration done on the HDP nodes, so they are only required on the repository node if they have not been applied there yet.
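The concrete settings are not preserved in this text. As a sketch under that caveat, these are kernel parameters that are commonly raised for hosts serving many concurrent connections. The parameter names are standard Linux sysctl keys, while the values are assumptions for illustration only, and the commands require root privileges:

```
$ sysctl -w fs.file-max=2097152
$ sysctl -w net.core.somaxconn=1024
$ sysctl -w net.ipv4.tcp_max_syn_backlog=4096
$ sysctl -w net.ipv4.ip_local_port_range="1024 65535"
```

Settings applied with sysctl -w take effect immediately but are lost on reboot.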

Making it permanent:
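Kernel parameters survive a reboot when they are listed in /etc/sysctl.conf. The entries below are a sketch with illustrative values, not the settings of the original post:

```
# /etc/sysctl.conf (values are assumptions for illustration)
fs.file-max = 2097152
net.core.somaxconn = 1024
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.ip_local_port_range = 1024 65535
```

Running sysctl -p as root reloads the file without a reboot.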

For this final sample configuration, the contents of the downloaded repository tar files are placed under the /repo folder.
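The following nginx.conf is a sketch that combines the directives discussed in this post and serves /repo on port 80. It is not the original sample: the worker and timeout values, the log and pid paths (typical for the CentOS 7 package), and the use of autoindex for browsable directory listings are all assumptions.

```nginx
user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /run/nginx.pid;

events {
    worker_connections 1024;
}

http {
    include      /etc/nginx/mime.types;
    default_type application/octet-stream;

    # File-serving optimizations described above
    sendfile          on;
    tcp_nopush        on;
    tcp_nodelay       on;
    keepalive_timeout 65;

    server {
        listen      80 default_server;
        server_name _;
        root        /repo;      # extracted HDP and HDP-UTILS tarballs

        location / {
            autoindex on;       # assumption: allow browsing the repo tree
        }
    }
}
```

After editing the file, nginx -t checks the syntax and systemctl reload nginx applies the change.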
