HDP Repo with Nginx

Environments set up for an HDP install without a connection to the internet require a dedicated HDP repository that all nodes can access. Such a setup differs slightly depending on whether the nodes have temporary or no internet access, but in either case they need a file service holding a copy of the HDP repo. Most enterprises already have dedicated infrastructure in place, based for example on Aptly or Satellite. This post describes how to set up an Nginx host that serves as an HDP repository host.

Downloading HDP Repo

The public HDP repo is hosted under http://public-repo-1.hortonworks.com, where different releases for various operating systems are published. The repository can be explored to find the release that suits your environment. For example, a copy of the recent release HDP-2.3.4 can be downloaded like this:

HDP repository tarballs by cluster OS:

RHEL/CentOS/Oracle Linux 6.x
$ wget http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.3.4.0/HDP-2.3.4.0-centos6-rpm.tar.gz
$ wget http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/centos6/HDP-UTILS-1.1.0.20-centos6.tar.gz

RHEL/CentOS/Oracle Linux 7.x
$ wget http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.3.4.0/HDP-2.3.4.0-centos7-rpm.tar.gz
$ wget http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/centos7/HDP-UTILS-1.1.0.20-centos7.tar.gz

SLES 11 SP3
$ wget http://public-repo-1.hortonworks.com/HDP/suse11sp3/2.x/updates/2.3.4.0/HDP-2.3.4.0-suse11sp3-rpm.tar.gz
$ wget http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/suse11sp3/HDP-UTILS-1.1.0.20-suse11sp3.tar.gz

Ubuntu 12.04
$ wget http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.x/updates/2.3.4.0/HDP-2.3.4.0-ubuntu12-deb.tar.gz
$ wget http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/ubuntu12/HDP-UTILS-1.1.0.20-ubuntu12.tar.gz

Ubuntu 14
$ wget http://public-repo-1.hortonworks.com/HDP/ubuntu14/2.x/updates/2.3.4.0/HDP-2.3.4.0-ubuntu14-deb.tar.gz
$ wget http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/ubuntu14/HDP-UTILS-1.1.0.20-ubuntu14.tar.gz

Debian 6
$ wget http://public-repo-1.hortonworks.com/HDP/debian6/2.x/updates/2.3.4.0/HDP-2.3.4.0-debian6-deb.tar.gz
$ wget http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/debian6/HDP-UTILS-1.1.0.20-debian6.tar.gz

Debian 7
$ wget http://public-repo-1.hortonworks.com/HDP/debian7/2.x/updates/2.3.4.0/HDP-2.3.4.0-debian7-deb.tar.gz
$ wget http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/debian7/HDP-UTILS-1.1.0.20-debian7.tar.gz

A release is typically around 3-4 GB in size.

Because the download URLs follow a simple naming pattern, you can also use a helper script to download the release you want. The script takes the OS and the HDP version as arguments and downloads the corresponding repository.

$ ./download_hdp_repo.sh -o centos7 -r 2.3.2
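The naming pattern behind the download URLs can be sketched as two small shell functions. This is not the helper script mentioned above, only a minimal illustration; the function names hdp_repo_url and hdp_utils_url are made up for this example, and it assumes a full four-part HDP 2.x version such as 2.3.4.0 and the HDP-UTILS version 1.1.0.20 from the list above.

```shell
#!/bin/sh
# Hypothetical sketch, not the helper script referenced in the post.
# Builds the tarball URLs from the naming pattern of the published links.

BASE_URL="http://public-repo-1.hortonworks.com"
UTILS_VERSION="1.1.0.20"   # assumption: HDP-UTILS version taken from the list above

# Print the HDP tarball URL for an OS label (e.g. centos7) and a full
# four-part HDP 2.x version (e.g. 2.3.4.0).
hdp_repo_url() {
    os="$1"
    version="$2"
    # RPM-based systems get "-rpm" tarballs, Debian-based ones "-deb".
    case "$os" in
        centos*|suse*)   pkg="rpm" ;;
        ubuntu*|debian*) pkg="deb" ;;
        *) echo "unsupported OS: $os" >&2; return 1 ;;
    esac
    echo "$BASE_URL/HDP/$os/2.x/updates/$version/HDP-$version-$os-$pkg.tar.gz"
}

# Print the HDP-UTILS tarball URL for an OS label.
hdp_utils_url() {
    os="$1"
    echo "$BASE_URL/HDP-UTILS-$UTILS_VERSION/repos/$os/HDP-UTILS-$UTILS_VERSION-$os.tar.gz"
}

# Only prints the URLs; nothing is downloaded here.
hdp_repo_url centos7 2.3.4.0
hdp_utils_url centos7
```

To actually fetch a tarball, the output can be passed to wget, for example wget "$(hdp_repo_url centos7 2.3.4.0)".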

Install Nginx

Installing Nginx is straightforward using the available package. On CentOS 7 the EPEL repository is required.

$ yum install -y epel-release
$ yum install -y nginx

Starting Nginx:

$ systemctl start nginx
$ systemctl status nginx

nginx.service - The nginx HTTP and reverse proxy server
  Loaded: loaded (/usr/lib/systemd/system/nginx.service; disabled)
  Active: active (running) since So 2016-02-28 18:02:05 UTC; 10s ago
  Process: 19499 ExecStart=/usr/sbin/nginx (code=exited, status=0/SUCCESS)
  Process: 19496 ExecStartPre=/usr/sbin/nginx -t (code=exited, status=0/SUCCESS)
  Process: 19494 ExecStartPre=/usr/bin/rm -f /run/nginx.pid (code=exited, status=0/SUCCESS)
 Main PID: 19501 (nginx)
  CGroup: /system.slice/nginx.service
   ├─19501 nginx: master process /usr/sbin/nginx
   ├─19502 nginx: worker process
   ├─19503 nginx: worker process
   ├─19504 nginx: worker process
   ├─19505 nginx: worker process
   └─19506 nginx: worker process

Feb 28 18:02:05 one.hdp nginx[19496]: nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
Feb 28 18:02:05 one.hdp nginx[19496]: nginx: configuration file /etc/nginx/nginx.conf test is successful
Feb 28 18:02:05 one.hdp systemd[1]: Started The nginx HTTP and reverse proxy server.

Configure Nginx as a File Service

Nginx’s default configuration lives in /etc/nginx/nginx.conf. The config declares servers. We will adjust the default so that it functions as the repository service for our cluster.

It is important to watch the number of worker processes you assign as well as the number of connections each worker can hold. Nginx will act as a file service that has to satisfy many concurrent connections in a large cluster, so we want high individual throughput together with support for concurrent connections. The following recommendations apply:

Enabling sendfile

By default, NGINX handles file transmission itself and copies the file into a buffer before sending it. Enabling the sendfile directive eliminates that copy and lets data be copied directly from one file descriptor to another. In addition, to prevent one fast connection from entirely occupying a worker process, you can limit the amount of data transferred in a single sendfile() call with the sendfile_max_chunk directive:

location /repo {
    sendfile           on;
    sendfile_max_chunk 1m;
    ...
}

Enabling tcp_nopush

Use the tcp_nopush option together with sendfile on;. It enables NGINX to send the HTTP response headers in one packet right after the chunk of data has been obtained by sendfile:

location /repo {
    sendfile   on;
    tcp_nopush on;
    ...
}

Enabling tcp_nodelay

The tcp_nodelay option allows overriding Nagle’s algorithm, originally designed to solve problems with small packets in slow networks. The algorithm consolidates a number of small packets into a larger one and sends that packet with a 200 ms delay. Nowadays, when serving large static files, the data can be sent immediately regardless of the packet size. The delay would also hurt interactive applications (ssh, online games, online trading). By default, the tcp_nodelay directive is set to on, which means that Nagle’s algorithm is disabled. The option is used only for keepalive connections:

location /repo  {
    tcp_nodelay       on;
    keepalive_timeout 65;
    ...
}

System Optimizations

Some kernel tuning is needed to support a high number of concurrent connections. It should already be part of the system configuration applied to the HDP nodes, so it is only required if it has not yet been done on the repository node.

$ sudo sysctl -w net.core.somaxconn=4096

Making it permanent:

$ vi /etc/sysctl.conf
net.core.somaxconn = 4096
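To confirm that the setting is in effect, the live value can be compared with the intended one. The following is a minimal sketch; the function name check_somaxconn and its optional second argument (a file to read instead of the kernel's procfs entry) are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical check: compare the current somaxconn value with the wanted one.
# $1 = wanted minimum, $2 = file to read (defaults to the procfs entry on Linux).
check_somaxconn() {
    wanted="$1"
    source_file="${2:-/proc/sys/net/core/somaxconn}"
    # Return 2 if the value cannot be read at all.
    current="$(cat "$source_file")" || return 2
    if [ "$current" -ge "$wanted" ]; then
        echo "ok: somaxconn=$current"
    else
        echo "too low: somaxconn=$current, wanted $wanted"
        return 1
    fi
}
```

After editing /etc/sysctl.conf, running sysctl -p as root reloads the file without a reboot, and check_somaxconn 4096 should then report ok.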

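For this final sample configuration, the contents of the downloaded repository tarballs are placed under the /repo folder. Note that the paths in this sample (/opt/rh/nginx16/...) come from an nginx16 Software Collections install. With the EPEL package installed above, the corresponding files live under /etc/nginx, /var/log/nginx and /usr/share/nginx, so adjust the paths to your installation: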

user  nginx;
worker_processes  8;
 
error_log  /var/log/nginx16/error.log;
pid        /opt/rh/nginx16/root/var/run/nginx/nginx.pid;
 
events {
    worker_connections  1024;
}
 
 
http {
    include       /opt/rh/nginx16/root/etc/nginx/mime.types;
    default_type  application/octet-stream;
 
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';
 
    access_log  /var/log/nginx16/access.log  main;
 
    sendfile        on;
    sendfile_max_chunk 1m;
    tcp_nopush     on;

    keepalive_timeout  65;
 
    gzip  on;
           
    include /opt/rh/nginx16/root/etc/nginx/conf.d/*.conf;
 
    server {
        listen       80;
        server_name  localhost one.hdp;
 
        location / {
            root   /opt/rh/nginx16/root/usr/share/nginx/html;
            index  index.html index.htm;
        }
 
        location /repo {
            # root appends the full request URI to the path:
            # /repo/<file> is read from /repo/www/repo/<file>
            root /repo/www;
            index index.html;
        }
 
        error_page  404              /404.html;
        location = /404.html {
            root   /opt/rh/nginx16/root/usr/share/nginx/html;
        }
 
        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   /opt/rh/nginx16/root/usr/share/nginx/html;
        }
    }
}
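The worker settings in the sample configuration put a rough upper bound on the number of simultaneous connections the repository host can hold: worker_processes times worker_connections. A minimal sketch of that arithmetic follows; the function name max_connections is made up for this example, and the real limit can be lower, for instance because of file descriptor limits.

```shell
#!/bin/sh
# Rough upper bound on simultaneous connections for a static file server:
# worker_processes * worker_connections.
max_connections() {
    echo $(( $1 * $2 ))
}

# Values from the sample configuration: 8 workers with 1024 connections each.
max_connections 8 1024   # prints 8192
```

If many cluster nodes pull packages at the same time, this figure gives a first idea of whether the two directives need to be raised.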
