What are some tips for optimizing a really busy loghost running syslog-ng?
In no particular order:
- If you use DNS, at least keep a caching DNS server
running on the local host and make use of it - or
better yet, don't use DNS at all.
You can post-process logs on an analysis host later on
and resolve hostnames at that time if you need to. On
your loghost your main concern is keeping up with the
incoming log stream - the last thing you want to do is
make the recording of events rely on an external
lookup. syslog-ng blocks on DNS lookups (as noted
elsewhere in this FAQ), so you'll slow down/stop ALL
destinations with slow/failed DNS lookups.
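If you do decide to turn DNS off (or at least cache lookups), the relevant
global options look roughly like this (the option names come from the syslog-ng
reference manual; the cache numbers are only illustrative):
    options {
        # don't resolve at all - record the sender's IP address instead
        use_dns(no);
        # or, if you must resolve names, at least cache the results:
        # use_dns(yes);
        # dns_cache(yes);
        # dns_cache_size(1000);
        # dns_cache_expire(3600);
    };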
- Don't log to the console or a tty. Under heavy load they
can't consume messages as fast as syslog-ng writes them,
which slows syslog-ng down too much.
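For reference, destinations like these are the ones to comment out (or never
configure) on a busy loghost - just a sketch, with example names:
    destination d_console { file("/dev/console"); };
    destination d_root    { usertty("root"); };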
- Don't use regular expressions in your filters. Instead of:
    filter f_xntp_filter_regexp {
        # original line: "xntpd[1567]: time error -1159.777379 is way too large (set clock manually)"
        program("xntpd") and
        match("time error .* is way too large .* set clock manually");
    };
  Use this instead:
    filter f_xntp_filter_no_regexp {
        # original line: "xntpd[1567]: time error -1159.777379 is way too large (set clock manually)"
        program("xntpd") and
        match("time error") and match("is way too large") and match("set clock manually");
    };
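Either way, the filter only does something once it is referenced from a log
path. A minimal sketch of how it would be wired up (the source and destination
names here are just examples, not from this FAQ):
    source s_net { udp(ip(0.0.0.0) port(514)); };
    destination d_xntp { file("/var/log/xntpd.log"); };
    log { source(s_net); filter(f_xntp_filter_no_regexp); destination(d_xntp); };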
Under heavy, heavy logging load you'll see CPU usage like this when using regexps:
    [graph: CPU usage during the regexp filter test]
...vs CPU usage like this when not using regexps:
    [graph: CPU usage during the non-regexp filter test]
Note that the results at the bottom of the graphs show that the test with heavy
regexp use caused huge delays, lost almost 25% of its messages (and that test
only sent 5,000 messages!) and hammered the CPU. The test without regexps sent
50,000 messages, hardly used any CPU, dropped no messages, and every message
made it across in under a second (not all 50,000 at once - each individual
message arrived in under a second). Note that the "Pace" of 500/sec is simply
how fast the messages were injected into the syslog system using the syslog()
system call (from Perl, using Unix::Syslog).
NOTE: when not using regexps and matching on different
pieces of the message, you might match messages that you don't
mean to. There is only a small risk of this, and it is much
better than running out of CPU resources on your log server
under most circumstances. It is your call to make.
Please don't ask me for the scripts that generated these graphs; I wrote them for work
and it probably wouldn't be possible to ever release them. I hope to one day write
something similar in my free time and release it...but that may be a pipe dream. :(
- There's a good chance you'll want to set per-destination
buffers. The official reference manual covers the subject.
The idea is to make sure that multiple log destinations that
might block somewhat "normally" (TCP and FIFO come to mind)
don't interfere with each other's buffering. If you have a TCP
connection whose buffer is maxed out because of an extended network problem,
but only a temporary problem feeding logs into a FIFO, you can avoid losing
any data bound for the FIFO (assuming your buffer size is large enough to
handle the backup) if you set up separate buffers.
If our TCP destination connection drops because the regional syslog server is
down for a syslog-ng upgrade or kernel patch, we want events bound for the TCP
destination to be held in the buffer and sent across once the connection is
re-established. If that shared buffer is already full because of FIFO problems
with a local process, we can't buffer a single message for the duration of the
TCP connection outage. Ouch.
The catch with implementing per-destination buffers is that the log_fifo_size
option was only added to the TCP destination in version 1.6.6, so you need to
upgrade to syslog-ng 1.6.6 or later (I suggest the latest stable version).
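A minimal sketch of what per-destination buffers can look like, assuming
syslog-ng 1.6.6 or later (the hostname, port and buffer sizes below are only
illustrative):
    options {
        # global default buffer size, in messages
        log_fifo_size(1000);
    };
    # give the TCP destination its own, much larger buffer so it can ride out
    # an extended outage of the regional server without filling the shared space
    destination d_regional {
        tcp("logs.example.com" port(514) log_fifo_size(100000));
    };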