Ajay Divakaran

nginx upstream timeout and php5-fpm

Posted On Mar-02
Ever since I’d moved my hosting from vanilla EC2 to Docker, I’d been plagued by timeouts between NGINX and php-fpm. At first, the issue appeared to be on the fpm side, as the NGINX error log showed ‘upstream timed out’ messages like these:

2016/03/01 20:58:18 [error] 6318#0: *148 upstream timed out (110: Connection timed out) while connecting to upstream, client: 108.21.53.39, server: www.ajaydivakaran.com, request: "GET / HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.ajaydivakaran.com"
2016/03/01 20:58:41 [error] 6318#0: *232 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 68.180.229.40, server: www.ajaydivakaran.com, request: "GET /robots.txt HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.ajaydivakaran.com"
2016/03/01 20:59:34 [error] 6318#0: *557 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 68.180.229.40, server: www.ajaydivakaran.com, request: "GET /wp-json/ HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.ajaydivakaran.com"

/var/log/nginx/error.log was filled with entries like the ones above. A quick Google search left me feeling that this was a rather common scalability issue when you use tcp (127.0.0.1:9000) to call php5-fpm from nginx.

Fixing configs in nginx and php5-fpm

The suggested solution was to move from TCP ports to a Unix socket, as long as both stayed on the local machine. Dutifully, I changed my nginx sites-enabled/*.conf as well as php5-fpm’s pool conf as follows.

FILE : /etc/nginx/sites-enabled/sitename.conf

location ~ \.php$ {
-     fastcgi_pass 127.0.0.1:9000;
+     fastcgi_pass unix:/var/run/php5-fpm.sock;
}

FILE : /etc/php5/fpm/pool.d/www.conf

- listen = 127.0.0.1:9000
- listen.allowed_clients = 127.0.0.1
+ listen = /var/run/php5-fpm.sock
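
One gotcha when switching to the socket: php5-fpm creates it, so nginx needs permission to connect to it. A sketch of the usual companion settings in the same pool conf (assuming nginx runs as www-data; adjust the user/group to match your setup):

; assumed additions to /etc/php5/fpm/pool.d/www.conf
listen.owner = www-data
listen.group = www-data
listen.mode = 0660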

Restarted nginx and php5-fpm. It worked for a few hours, and then the site went down again. I spent another evening trying to figure this out.

At one point, I realized that I’d gone through the first four pages of Google search results. There was nothing more to learn there, except that some people thought the socket config was the way to go, while others thought the TCP port was more scalable.
And the sites on my VPS were still down.

Logging the heck out of it

When you don’t know where to look, where do you look? Everywhere! So I enabled logging on php5-fpm, and set the logging level in nginx to ‘notice’. Heck, I even turned on the access log in nginx.
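
For reference, these are roughly the knobs I mean (a sketch; the log paths are assumptions, and the slowlog bits are just a nice-to-have):

# /etc/nginx/nginx.conf
error_log  /var/log/nginx/error.log  notice;
access_log /var/log/nginx/access.log;

; /etc/php5/fpm/php-fpm.conf
log_level = notice

; /etc/php5/fpm/pool.d/www.conf
access.log = /var/log/php5-fpm.$pool.access.log
slowlog = /var/log/php5-fpm.$pool.slow.log
request_slowlog_timeout = 5s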

I’d also gone back to the former practice of using 127.0.0.1:9000 to talk to php5-fpm, my rationale being that this would let me decouple fpm and move it to another container later. So I started logging the connection count with ‘netstat -an | grep 9000’.
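
Something like this little shell loop does the trick (a sketch; the log path is made up):

# append a timestamped connection count to a log every 5 seconds
while true; do
    echo "$(date '+%F %T') $(netstat -an | grep 9000 | wc -l)" >> /var/log/fpm-conn-count.log
    sleep 5
done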

So with all this logging, I restarted nginx and fpm, and began the waiting game. It didn’t take long.

Someone was DOS-ing me!

An hour later, I got those upstream timeouts again. My sites came to a crashing halt. The output of ‘netstat -an | grep 9000’ was what I noticed first: the connection count had grown to 3132 in no time.

A quick look at the access log revealed what was going on. I was getting DOS-ed like hell.

185.130.5.104 - - [01/Mar/2016:02:37:26 -0500] "POST /xmlrpc.php HTTP/1.0" 502 537 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
185.106.92.33 - - [01/Mar/2016:02:37:26 -0500] "POST /xmlrpc.php HTTP/1.0" 502 537 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
185.106.92.252 - - [01/Mar/2016:02:37:26 -0500] "POST /xmlrpc.php HTTP/1.0" 499 0 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
185.106.92.252 - - [01/Mar/2016:02:37:26 -0500] "POST /xmlrpc.php HTTP/1.0" 499 0 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
185.106.92.33 - - [01/Mar/2016:02:37:26 -0500] "POST /xmlrpc.php HTTP/1.0" 502 537 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"

It was immediately obvious that the DOS was using up all my connections, and I was hitting my file-descriptor limit rather quickly. After that, php5-fpm was getting zombied out.
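
If you want to confirm that kind of pattern quickly, a one-liner over the access log does it (assuming the default log format, where the client IP is the first field):

# requests per client IP, worst offenders first
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head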

The Solution…

The fix (something I had implemented in nginx several years ago, but left out on this one instance) was an easy one. I added a connection limit in nginx, set to no more than 10 connections per IP, and a request-rate limit on the one URL (xmlrpc.php) that was getting DOS-ed.

So now my /etc/nginx/sites-enabled/sitename.conf got the following edits (the two *_zone definitions must live in http context, so they go at the top of the file, outside the server block):

+ limit_conn_zone $binary_remote_addr zone=ajayzone1:10m;
+ limit_req_zone $binary_remote_addr zone=ajayzone2:10m rate=1r/m;
+
+ location = /xmlrpc.php {
+     limit_req zone=ajayzone2 burst=1 nodelay;
+ }
  location / {
+     limit_conn ajayzone1 10;
      ...
  }
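
A quick sanity check for the rate limit (a sketch): with rate=1r/m and burst=1 nodelay, the first request is served, the second rides the burst, and a third within the same minute should get nginx’s default 503.

# fire three rapid POSTs at the limited URL; the third should print 503
for i in 1 2 3; do
    curl -s -o /dev/null -w "%{http_code}\n" -X POST http://www.ajaydivakaran.com/xmlrpc.php
done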

Restarted nginx and php5-fpm. Things have been running smoothly since. My connection count from ‘netstat -an’ doesn’t even exceed 30 at any given time. What a beautiful thing!

And of course, the logs show the offending requests being silently limited. Some day, when I have some free time, I’ll reconfigure nginx to serve DOS-ing IPs a page with a picture of my middle finger instead of silently dropping them.
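
For the record, that tweak is only a couple of directives away (a sketch; the page path is made up, and limit_req_status needs nginx 1.3.15+):

# inside the server block: show rate-limited clients a custom page
limit_req_status 503;           # the default status for limited requests anyway
error_page 503 /finger.html;
location = /finger.html {
    internal;                   # only reachable via error_page, not directly
    root /var/www/static;       # hypothetical location of the picture page
}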