2016/03/01 20:58:41 [error] 6318#0: *232 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 68.180.229.40, server: www.ajaydivakaran.com, request: "GET /robots.txt HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.ajaydivakaran.com"
2016/03/01 20:59:34 [error] 6318#0: *557 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 68.180.229.40, server: www.ajaydivakaran.com, request: "GET /wp-json/ HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.ajaydivakaran.com"
/var/log/nginx/error.log was filled with entries like the ones above. A quick Google search left me feeling that this was a fairly common scalability issue when nginx talks to php5-fpm over tcp://127.0.0.1:9000.
Fixing configs in nginx and php5-fpm
The suggested solution was to move from TCP ports to Unix sockets, as long as everything stayed on the local machine. Dutifully, I changed my nginx sites-enabled/*.conf as well as the php5-fpm pool config as follows.
FILE : /etc/nginx/sites-enabled/sitename.conf
- fastcgi_pass 127.0.0.1:9000;
+ fastcgi_pass unix:/var/run/php5-fpm.sock;
}
FILE : /etc/php5/fpm/pool.d/www.conf
- listen = 127.0.0.1:9000
- listen.allowed_clients = 127.0.0.1
+ listen = /var/run/php5-fpm.sock
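One gotcha worth noting if you make this change: depending on the distribution's php5-fpm defaults, the socket can be created with owner/group/mode that the nginx worker (usually www-data) can't connect to, which shows up as permission errors rather than timeouts. A hedged sketch of the extra pool directives that usually accompany the socket switch; the www-data owner/group and 0660 mode are assumptions about a stock Debian/Ubuntu setup, not lines from my actual config:
FILE : /etc/php5/fpm/pool.d/www.conf
+ listen.owner = www-data
+ listen.group = www-data
+ listen.mode = 0660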
Restarted nginx and php5-fpm. It worked for a few hours, and then the site went down again. I spent another evening trying to figure this out.
At one point, I realized that I'd gone through the first four pages of Google search results. There was nothing more to learn there, except that some people thought the socket config was the way to go, while others thought the TCP port scaled better.
And the sites on my VPS were still down.
Logging the heck out of it
When you don’t know where to look, where do you look? Everywhere! So I enabled logging on php5-fpm and set the nginx logging level to ‘notice’. Heck, I even turned on the access log in nginx.
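For reference, those logging changes amount to something like the following; the exact paths and the php5-fpm log location are assumptions based on a stock Debian/Ubuntu layout, not lines copied from my configs:
FILE : /etc/nginx/nginx.conf
error_log /var/log/nginx/error.log notice;
access_log /var/log/nginx/access.log;
FILE : /etc/php5/fpm/php-fpm.conf
error_log = /var/log/php5-fpm.log
log_level = notice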
I'd also gone back to the former practice of using 127.0.0.1:9000 to talk to php5-fpm, my rationale being that it would let me decouple fpm and move it to another container later. So I started logging the connections on that port with ‘netstat -an | grep 9000’.
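Something along these lines is enough to keep a running count (a sketch; the log path and the one-minute interval are my own choices for illustration):
# append a timestamped count of port-9000 connections every 60 seconds
while true; do
  echo "$(date '+%F %T') $(netstat -an | grep -c ':9000')" >> /var/log/fpm-conns.log
  sleep 60
done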
So with all this logging, I restarted nginx and fpm, and began the waiting game. It didn’t take long.
Someone was DOS-ing me!
An hour later, I got those upstream timeouts again. My sites came to a crashing halt. The output of ‘netstat -an | grep 9000’ is what I noticed first: the connection count had grown to 3132 in no time.
A quick look at the access log revealed what was going on: I was getting DOS-ed like hell.
185.106.92.33 - - [01/Mar/2016:02:37:26 -0500] "POST /xmlrpc.php HTTP/1.0" 502 537 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
185.106.92.252 - - [01/Mar/2016:02:37:26 -0500] "POST /xmlrpc.php HTTP/1.0" 499 0 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
185.106.92.252 - - [01/Mar/2016:02:37:26 -0500] "POST /xmlrpc.php HTTP/1.0" 499 0 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
185.106.92.33 - - [01/Mar/2016:02:37:26 -0500] "POST /xmlrpc.php HTTP/1.0" 502 537 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
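A one-liner like this is enough to rank the offenders and confirm they are all hammering xmlrpc.php (a sketch; it assumes the access log lives at /var/log/nginx/access.log in the default combined format):
grep 'POST /xmlrpc.php' /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head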
It was immediately obvious that the DOS was using up all my connections, and I was hitting my file-descriptor limit rather quickly. After that, php-fpm was getting zombied out.
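Both are easy to check; a sketch (using pgrep against the php5-fpm process name, which is an assumption about the setup):
# file-descriptor limit for the current shell
ulimit -n
# file descriptors currently held by each php5-fpm process
for pid in $(pgrep php5-fpm); do ls /proc/$pid/fd | wc -l; done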
The Solution…
The fix (something I had implemented in nginx several years ago, but had left out on this one instance) was an easy one. I just added a connection limit in nginx, capped at no more than 10 connections per IP, and a request-rate limit on the one URL (xmlrpc.php) that was getting DOS-ed.
So now my /etc/nginx/sites-enabled/sitename.conf got the following edits:
+ limit_req_zone $binary_remote_addr zone=ajayzone2:10m rate=1r/m;
+ location = /xmlrpc.php {
+     limit_req zone=ajayzone2 burst=1 nodelay;
+ }
  location / {
+     limit_conn ajayzone1 10;
      ...
  }
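One note for anyone copying this: the limit_conn directive above references a zone named ajayzone1 whose limit_conn_zone definition isn't shown in the diff, and nginx needs one at the http level for the config to load. A sketch of the two zone definitions side by side (their placement, e.g. near the top of the included conf or in nginx.conf's http block, is an assumption):
limit_conn_zone $binary_remote_addr zone=ajayzone1:10m;
limit_req_zone  $binary_remote_addr zone=ajayzone2:10m rate=1r/m;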
Restarted nginx and php5-fpm. Things have been running smoothly since. The connection count from ‘netstat -an’ doesn't even exceed 30 at any given time. What a beautiful thing!
And of course, the logs show the offending requests being silently limited. Some day, when I have some free time, I'll reconfigure nginx to serve DOS-ing IPs a page with a picture of my middle finger instead of silently dropping them.
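If you're curious how that would look: limit_req rejects excess requests with a 503 by default, and nginx lets you change that status and point it at a custom page. A hedged sketch (the 429 status and the /finger.html location are made up for illustration):
FILE : /etc/nginx/sites-enabled/sitename.conf
+ limit_req_status 429;
+ error_page 429 /finger.html;
+ location = /finger.html {
+     internal;
+ }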