Tuning Apache Web Servers for Speed

Page 4 — Compile-Time Configuration Issues

mod_status and Rule STATUS=yes

If you include mod_status and you also set Rule STATUS=yes when building Apache, then on every request Apache will perform two calls to gettimeofday(2) (or times(2), depending on your operating system) and, in versions prior to 1.3, will make several extra calls to time(2). This is all done so that the status report will contain timing indications. For highest performance, set Rule STATUS=no.
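
For example, in the 1.3 source tree the rule lives in the build configuration file (src/Configuration); the exact file and default may differ in other versions:

    # Rule STATUS controls whether full status reports (with timings) are
    # compiled in; =no avoids the extra gettimeofday/times calls per request
    Rule STATUS=no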

accept Serialization - multiple sockets

Let's discuss a shortcoming in the Unix socket API. Suppose your Web server uses multiple Listen statements to listen on either multiple ports or multiple addresses. In order to test each socket to see if a connection is ready, Apache uses select(2). select(2) indicates whether a socket has zero or at least one connection waiting on it. Apache's model includes multiple children, and all the idle ones test for new connections at the same time. A naive implementation looks something like this (these examples do not match the code; they're contrived for pedagogical purposes):

     
    
    for (;;) {
        for (;;) {
            fd_set accept_fds;

            FD_ZERO (&accept_fds);
            for (i = first_socket; i <= last_socket; ++i) {
                FD_SET (i, &accept_fds);
            }
            rc = select (last_socket+1, &accept_fds, NULL, NULL, NULL);
            if (rc < 1) continue;
            new_connection = -1;
            for (i = first_socket; i <= last_socket; ++i) {
                if (FD_ISSET (i, &accept_fds)) {
                    new_connection = accept (i, NULL, NULL);
                    if (new_connection != -1) break;
                }
            }
            if (new_connection != -1) break;
        }
        process the new_connection;
    }
    
    

But this naive implementation has a serious starvation problem. Recall that multiple children execute this loop at the same time, and so multiple children will block at select when they are in between requests. All those blocked children will awaken and return from select when a single request appears on any socket (the number of children that awaken varies depending on the operating system and timing issues). They will all then fall down into the loop and try to accept the connection. But only one will succeed (assuming there's still only one connection ready); the rest will be blocked in accept. This effectively locks those children into serving requests from that one socket and no other sockets, and they'll be stuck there until enough new requests appear on that socket to wake them all up. This starvation problem was first documented in PR#467. There are at least two solutions.

One solution is to make the sockets non-blocking. In this case, the accept won't block the children, and they will be allowed to continue without interruption. But this wastes CPU time. Suppose you have 10 idle children in select, and one connection arrives. Then nine of those children will wake up, try to accept the connection, fail, and loop back into select, accomplishing nothing. Meanwhile, none of those children are servicing requests that occurred on other sockets until they get back up to the select again. Overall, this solution does not seem very fruitful unless you have as many idle CPUs (in a multiprocessor box) as you have idle children; not a very likely situation.
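
A minimal sketch of that approach (contrived, like the loops above, and not Apache's actual code) is to flip each listening socket into non-blocking mode with fcntl(2); a child that loses the race to accept then sees EWOULDBLOCK (or EAGAIN) and simply loops back up to select:

    #include <fcntl.h>

    /* sketch only: put one listening socket into non-blocking mode so
       that accept() returns immediately instead of blocking a child */
    static int make_socket_nonblocking (int sock)
    {
        int flags = fcntl (sock, F_GETFL, 0);

        if (flags == -1) {
            return -1;
        }
        return fcntl (sock, F_SETFL, flags | O_NONBLOCK);
    }

With the sockets set up this way, the inner loop only has to treat an accept failure with errno of EWOULDBLOCK or EAGAIN as "another child got it" and carry on.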

Another solution, the one used by Apache, is to serialize entry into the inner loop. The loop then looks like this (the only differences are the accept_mutex_on and accept_mutex_off calls):

    
    for (;;) {
        accept_mutex_on ();
        for (;;) {
            fd_set accept_fds;

            FD_ZERO (&accept_fds);
            for (i = first_socket; i <= last_socket; ++i) {
                FD_SET (i, &accept_fds);
            }
            rc = select (last_socket+1, &accept_fds, NULL, NULL, NULL);
            if (rc < 1) continue;
            new_connection = -1;
            for (i = first_socket; i <= last_socket; ++i) {
                if (FD_ISSET (i, &accept_fds)) {
                    new_connection = accept (i, NULL, NULL);
                    if (new_connection != -1) break;
                }
            }
            if (new_connection != -1) break;
        }
        accept_mutex_off ();
        process the new_connection;
    }
    
    

The functions accept_mutex_on and accept_mutex_off implement a mutual-exclusion semaphore. Only one child can have the mutex at any time. There are several choices for implementing these mutexes. The choice is defined in src/conf.h (pre-version 1.3) or src/main/conf.h (version 1.3 or later). Some architectures do not have any locking choice made, and on these architectures it is unsafe to use multiple Listen directives.

USE_FLOCK_SERIALIZED_ACCEPT
This method uses the flock(2) system call to lock a lock file (located by the LockFile directive).
USE_FCNTL_SERIALIZED_ACCEPT
This method uses the fcntl(2) system call to lock a lock file (located by the LockFile directive).
USE_SYSVSEM_SERIALIZED_ACCEPT (1.3 or later)
This method uses SysV-style semaphores to implement the mutex. Unfortunately, SysV-style semaphores have some bad side-effects. One is that it's possible Apache will die without cleaning up the semaphore (see the ipcs(8) man page). The other is that the semaphore API allows for a denial-of-service attack by any CGIs running under the same uid as the Web server (i.e., all CGIs, unless you use something like suexec or cgiwrapper). For these reasons, this method is not used on any architecture except IRIX (the previous two are prohibitively expensive on most IRIX boxes).
USE_USLOCK_SERIALIZED_ACCEPT (1.3 or later)
This method is only available on IRIX and uses usconfig(2) to create a mutex. While this method avoids the hassles of SysV-style semaphores, it is not the default for IRIX. This is because on single-processor IRIX boxes (5.3 or 6.2), the uslock code is two orders of magnitude slower than the SysV-semaphore code. On multiprocessor IRIX boxes, the uslock code is an order of magnitude faster than the SysV-semaphore code. Kind of a messed up situation. So if you're using a multiprocessor IRIX box, you should rebuild your Web server with -DUSE_USLOCK_SERIALIZED_ACCEPT added to EXTRA_CFLAGS (see the example after this list).
USE_PTHREAD_SERIALIZED_ACCEPT (1.3 or later)
This method uses POSIX mutexes and should work on any architecture implementing the full POSIX threads specification; however, it appears to work only on Solaris (2.5 or later). This is the default for Solaris 2.5 or later.
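
Whichever method your platform supports, forcing it by hand follows the pattern already mentioned for IRIX: add the corresponding define to the compiler flags and rebuild. For example, assuming the usual EXTRA_CFLAGS hook in src/Configuration (adjust for your build setup):

    # src/Configuration -- force a particular accept-serialization method
    EXTRA_CFLAGS= -DUSE_USLOCK_SERIALIZED_ACCEPT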

If your system has another method of serialization that isn't in the above list, then it may be worthwhile adding code for it (and submitting a patch back to Apache).

Another solution that has been considered but never implemented is to partially serialize the loop - that is, let in a certain number of processes. This would only be of interest on multiprocessor boxes where it's possible that multiple children could run simultaneously and the serialization actually doesn't take advantage of the full bandwidth. This is a possible area of future investigation, but priority remains low because highly parallel Web servers are not the norm.

Ideally, you should run servers without multiple Listen statements if you want the highest performance. But read on.

accept Serialization - single socket

The above is fine and dandy for multiple-socket servers, but what about single-socket servers? In theory, they shouldn't experience any of these same problems because all children can just block in accept(2) until a connection arrives, and no starvation results. In practice, this hides almost the same "spinning" behavior discussed above in the non-blocking solution. The way that most TCP stacks are implemented, the kernel actually wakes up all processes blocked in accept when a single connection arrives. One of those processes gets the connection and returns to user-space, the rest spin in the kernel and go back to sleep when they discover there's no connection for them. This spinning is hidden from the user-land code, but it's there nonetheless. This can result in the same load-spiking, wasteful behavior as a non-blocking solution to the multiple sockets case.
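
In other words, the unserialized single-socket loop amounts to nothing more than this (again contrived for illustration):

    for (;;) {
        /* every idle child blocks here; the kernel may wake all of them
           when a single connection arrives, but only one accept succeeds */
        new_connection = accept (listen_socket, NULL, NULL);
        if (new_connection == -1) continue;
        process the new_connection;
    }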

For this reason, we have found that many architectures behave more "nicely" if we serialize even the single-socket case. So this is actually the default in almost all cases. Crude experiments under Linux (2.0.30 on a dual Pentium Pro 166 with 128 MB RAM) have shown that the serialization of the single-socket case causes less than a 3 percent decrease in requests per second over unserialized single-socket. But unserialized single-socket showed an extra 100 ms latency on each request. This latency is probably a wash on long-haul lines and only an issue on LANs. If you want to override the single-socket serialization you can define SAFE_UNSERIALIZED_ACCEPT, and then single-socket servers will not serialize at all.

Lingering Close

As discussed in draft-ietf-http-connection-00.txt section 8, in order for an HTTP server to reliably implement the protocol, it needs to shut down each direction of the communication independently (recall that a TCP connection is bidirectional, each half independent of the other). This fact is often overlooked by other servers, but is correctly implemented in Apache as of 1.2.

When this feature was added to Apache, it caused a flurry of problems on various shortsighted versions of Unix. The TCP specification does not state that the FIN_WAIT_2 state has a timeout, but it doesn't prohibit it. On systems without the timeout, Apache 1.2 induces many sockets to stick forever in the FIN_WAIT_2 state. In many cases this can be avoided by simply upgrading to the latest TCP/IP patches supplied by the vendor; in cases where the vendor has never released patches, such as SunOS4 (even though folks with a source license can patch it themselves), Apache has decided to disable this feature.

There are two ways of accomplishing this. One is the socket option SO_LINGER. But as fate would have it, this has never been implemented properly in most TCP/IP stacks. Even on those stacks with a proper implementation (e.g., Linux 2.0.31), this method proves to be more expensive in CPU time than the next solution.
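
For reference, the SO_LINGER variant is nothing more than a socket option set before the close, roughly like this sketch (Apache avoids it for the reasons above):

    struct linger li;

    li.l_onoff = 1;      /* linger on close */
    li.l_linger = 30;    /* allow up to 30 seconds for unsent data */
    setsockopt (s, SOL_SOCKET, SO_LINGER, (char *) &li, sizeof (li));
    close (s);           /* now blocks until the data is sent or the
                            timeout expires */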

For the most part, Apache implements this in a function called lingering_close (in http_main.c). The function looks roughly like this:

    void lingering_close (int s)
    {
        char junk_buffer[2048];
        fd_set fds;
        struct timeval tv;
        int rc;

        /* shutdown the sending side */
        shutdown (s, 1);

        /* make sure we give up eventually, even if the client never does */
        signal (SIGALRM, lingering_death);
        alarm (30);

        for (;;) {
            /* wait up to 2 seconds for anything more from the client */
            FD_ZERO (&fds);
            FD_SET (s, &fds);
            tv.tv_sec = 2;
            tv.tv_usec = 0;
            rc = select (s + 1, &fds, NULL, NULL, &tv);
            if (rc < 0) break;
            if (rc > 0 && FD_ISSET (s, &fds)) {
                /* just toss away whatever is here */
                if (read (s, junk_buffer, sizeof (junk_buffer)) <= 0)
                    break;
            }
        }

        close (s);
    }
    

This naturally adds some expense at the end of a connection, but it is required for a reliable implementation. As HTTP/1.1 becomes more prevalent, and all connections become persistent, this expense will be amortized over more requests. If you want to play with fire and disable this feature you can define NO_LINGCLOSE, but this is not recommended at all. In particular, as HTTP/1.1 pipelined, persistent connections come into use, lingering_close is an absolute necessity (and pipelined connections are faster, so you want to support them).

Scoreboard File

Apache's parent and children communicate with each other through something called the scoreboard. Ideally this should be implemented in shared memory. For those operating systems that we either have access to or have been given detailed ports for, it typically is implemented using shared memory. The rest default to using an on-disk file. The on-disk file is not only slow, but it is unreliable (and less featured). Peruse the src/main/conf.h file for your architecture and look for either HAVE_MMAP or HAVE_SHMGET. Defining one of these two enables the supplied shared memory code. If your system has another type of shared memory, then edit the file src/main/http_main.c and add the hooks necessary to use it in Apache. (Send Apache back a patch, too, please.)
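
As a rough sketch of the shared-memory variant (not the actual Apache code; it assumes an OS with anonymous mmap support and uses a made-up scoreboard size), the parent maps the region before forking so that every child reads and writes the same memory:

    #include <sys/mman.h>

    /* sketch only: a scoreboard shared between the parent and children */
    void  *scoreboard;
    size_t scoreboard_size = 8192;    /* hypothetical size */

    scoreboard = mmap (NULL, scoreboard_size,
                       PROT_READ | PROT_WRITE,
                       MAP_ANON | MAP_SHARED, -1, 0);
    if (scoreboard == MAP_FAILED) {
        /* no shared memory available: fall back to the on-disk file */
    }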

Historical note: The Linux port of Apache didn't start to use shared memory until version 1.2 of Apache. This oversight resulted in really poor and unreliable behavior by earlier versions of Apache on Linux.

DYNAMIC_MODULE_LIMIT

If you have no intention of using dynamically loaded modules (you probably don't if you're reading this and tuning your server for every last ounce of performance), then you should add -DDYNAMIC_MODULE_LIMIT=0 when building your server. This will save RAM that's allocated only for supporting dynamically loaded modules.
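
As with the other compile-time defines above, this is just an extra compiler flag at build time, for example (again assuming the EXTRA_CFLAGS hook in src/Configuration):

    # src/Configuration -- no dynamically loaded modules will be used
    EXTRA_CFLAGS= -DDYNAMIC_MODULE_LIMIT=0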



