The future of Thold

I have alluded in several places to future features that will be available in thold. But the information is spread all over the place, so I will take the time now to list the changes, with a brief description of each. Some of these are slated for the next major release, some slightly further down the road. Please take note that “sponsored” feature requests make it into production a lot quicker than everything else. Most of the features added in v0.4 were actually sponsored by a third party.

Escalation – This feature has been requested multiple times, and is fairly simple to implement. It lets thold alert several different people at different times, so that an issue can be escalated up the chain of command. For instance, you can have your technicians alerted to a problem after 10 minutes. If the problem still exists after an hour, alert the senior technician. If the problem still exists after 2 hours, alert the supervisor. And so on. This is slated for the next major release (0.5).
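To make the idea concrete, here is a minimal sketch of an escalation chain in Python. It is an illustration only, not thold's actual code: the `EscalationStep` class, the `recipients_due` function and the contact labels are all hypothetical names chosen to mirror the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # how long the problem must persist before this step fires
    recipient: str      # hypothetical contact label, not a real thold setting


# Hypothetical chain mirroring the example in the text.
CHAIN = [
    EscalationStep(10, "technicians"),
    EscalationStep(60, "senior technician"),
    EscalationStep(120, "supervisor"),
]


def recipients_due(chain, minutes_breached, already_notified=()):
    """Return recipients whose delay has elapsed and who were not alerted yet."""
    return [
        step.recipient
        for step in sorted(chain, key=lambda s: s.after_minutes)
        if step.after_minutes <= minutes_breached
        and step.recipient not in already_notified
    ]


if __name__ == "__main__":
    # After 90 minutes the first two steps are due, the supervisor is not.
    print(recipients_due(CHAIN, 90))
```

Tracking who has already been notified keeps each level from being alerted again on every polling cycle while the problem persists.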

Alternate Alerts – This goes well with the previous request: it allows alerts other than emails. These can be anything from SMS messages, running scripts, SNMP writes, SNMP traps, etc. It will be pluggable, so that other plugins can add their own types of alerts. When combined with the previous request, you can have it set to restart the service after 5 minutes, send an email after 10 minutes, send an SMS text after 30 minutes, etc. This is slated for the next major release (0.5). SNMP write, SMS, etc. will be handled by separate plugins to be released later.
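One way to picture pluggable alert types is a small registry that other code can add handlers to. The Python sketch below assumes that design; `register_alert_type`, `dispatch` and `actions_fired` are hypothetical names, and the handlers only return a description of what they would do rather than actually sending anything.

```python
from typing import Callable, Dict, List, Tuple

AlertHandler = Callable[[str], str]

# Registry of alert types; a plugin would add its own entry here.
_HANDLERS: Dict[str, AlertHandler] = {}


def register_alert_type(name: str, handler: AlertHandler) -> None:
    """Make a new alert type available under the given name."""
    _HANDLERS[name] = handler


def dispatch(name: str, message: str) -> str:
    """Run the handler registered for an alert type."""
    if name not in _HANDLERS:
        raise KeyError(f"no alert type registered as {name!r}")
    return _HANDLERS[name](message)


# Placeholder handlers (assumptions): they describe the action instead of doing it.
register_alert_type("script", lambda msg: f"script: {msg}")
register_alert_type("email", lambda msg: f"email: {msg}")
register_alert_type("sms", lambda msg: f"sms: {msg}")

# Delay in minutes paired with the alert type to fire, as in the example above.
SCHEDULE: List[Tuple[int, str]] = [(5, "script"), (10, "email"), (30, "sms")]


def actions_fired(schedule, minutes_breached: int, message: str) -> List[str]:
    """Fire every alert type whose delay has elapsed, in order of delay."""
    return [
        dispatch(name, message)
        for delay, name in sorted(schedule)
        if delay <= minutes_breached
    ]


if __name__ == "__main__":
    print(actions_fired(SCHEDULE, 12, "web service down"))
```

Keeping the handlers behind a name lookup is what lets a separate plugin contribute, say, an SNMP write type without the core alerting code changing.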

Maintenance Periods – One of the future goals is to trim down the total number of alerts you receive from thold. Being overloaded with alerts tends to annoy people and leads to ignored alerts. A lot of the time these alerts are valid, but they concern a known issue that is currently being worked on by another member of your team, or an automated process that runs at a specific time. To help combat these types of false positives, I will be implementing Maintenance Periods. You will be able to select specific hosts and/or time periods during which alerts will not be sent. If you have a process that restarts your web service every Friday at 3am, you can set a recurring schedule to eliminate alerts from the “Web Service Status” threshold template during that time. If you are planning on working on a server on Tuesday at 7pm and should be done in 2 hours, you can set a one-time period, which will expire so that alerting begins again once the time period is over. This is slated for the next major release (0.5).
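The recurring and one-time cases can both be expressed with the same check, as in this rough Python sketch. It is not thold's implementation: `MaintenanceWindow` and `should_alert` are hypothetical names, the 30-minute length of the Friday restart is an assumption (the text gives no duration), and host or template selection is left out for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class MaintenanceWindow:
    start: datetime                           # first (or only) occurrence
    duration: timedelta                       # how long alerts stay suppressed
    repeat_every: Optional[timedelta] = None  # None means a one-time window

    def covers(self, when: datetime) -> bool:
        """True if `when` falls inside this window or one of its repeats."""
        if when < self.start:
            return False
        elapsed = when - self.start
        if self.repeat_every is not None:
            # Fold the elapsed time back into the current repeat cycle.
            elapsed = elapsed % self.repeat_every
        return elapsed < self.duration


def should_alert(windows, when: datetime) -> bool:
    """Alerts are sent only when no maintenance window covers the time."""
    return not any(w.covers(when) for w in windows)


# Every Friday at 3am, assumed to last 30 minutes (2009-02-27 was a Friday).
weekly_restart = MaintenanceWindow(
    datetime(2009, 2, 27, 3, 0), timedelta(minutes=30), timedelta(weeks=1)
)
# One-time work on Tuesday 2009-03-03 at 7pm, expected to take 2 hours.
server_work = MaintenanceWindow(datetime(2009, 3, 3, 19, 0), timedelta(hours=2))

WINDOWS = [weekly_restart, server_work]

if __name__ == "__main__":
    print(should_alert(WINDOWS, datetime(2009, 3, 6, 3, 15)))  # inside the Friday restart
    print(should_alert(WINDOWS, datetime(2009, 3, 3, 21, 0)))  # one-time window has expired
```

Because the one-time window has no repeat interval, it simply stops matching once its duration has passed, which gives the "expire and begin alerting again" behaviour without any cleanup step.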

Minor Additions – There will be lots of minor additions in the future too. These include things like customizable email messages per threshold (even per escalation), more logging and flexibility in the auto-creation (regex matching), possibly a renaming of the plugin to “Alerts”, and technician on-call rotations. Really too many things to list.

If you have any suggestions, please feel free to list them here. There is no ETA on the next major release, but I can say that the escalation code and alternate alerts are 60% completed already.

February 25, 2009 · Jimmy · 5 Comments
Posted in: Plugins

5 Responses

  1. Lupick - April 6, 2009

    Sounds good!!

    Especially the Maintenance period is what cacti/thold needs in order to be a nagios killer!!

    When do you plan to release thold 0.5??

    thank you

  2. Marcel Grandemange - April 11, 2009

    Looking forward to it, as I specifically need to be able to execute alerts differently…

    In my case I need to execute a SQL query that will send a text message (smsd) to notify of failure/restore and changes.

    What would be the ETA on the next release to include this?

    Regards, And Great Work!

  3. Jimmy - April 12, 2009

    Ya, I need to spend some more time on it, but I have a limited amount of time spread over many, many projects. I actually have the functionality completely done for thresholds. I just have to backport it to the threshold templates, and have it propagate between the 2 like the current thresholds do. Escalation is done also (except for Time Based; I need to work some more on that one).

  4. MattyD - April 13, 2009

    Thold really needs the option to set a high threshold on total throughput over a specific period rather than on total speed. This would be worth its weight in gold to people looking to monitor total usage over specific periods, such as internet service providers, network consultants, etc.

  5. heybigben - April 15, 2009

    The thold enhancements will really help bring cacti up to the next level. Especially the alternative (pluggable) alerts. Having the ability to execute a script brings a lot more flexibility. I hope that component is released very soon. (I looked in cacti svn but did not see it yet.) Thanks, Ben