Keep: the open-source AIOps platform
Keep is an open-source alert management and AIOps platform that serves as a Swiss Army knife for alerting, automation, and noise reduction.
You can start exploring Keep by simply logging in to the platform. Join the Slack community to get help and share your feedback.
What’s an alert?
An alert is an event triggered when something undesirable occurs or is about to occur. Alerts are usually fired by monitoring tools such as Prometheus, Grafana, Datadog, CloudWatch, or your own proprietary tools.
Alerts are usually categorized into three different groups:
- Infrastructure-related alerts - e.g., a virtual machine consumes more than 99% CPU.
- Application-related alerts - e.g., an endpoint starts returning 5XX status codes.
- Business-related alerts - e.g., a drop in the number of sign-ins or purchases.
What problem does Keep solve?
Keep helps with every step of the alert lifecycle:
- Maintenance - Keep integrates with all of your monitoring tools, allowing you to manage all of your alerts within a single pane of glass.
- Noise reduction - By integrating with your monitoring tools, Keep can deduplicate and correlate alerts to reduce noise in your organization. There are two types of deduplication: rule-based (semi-manual) and AI-based (fully automated).
- Automation - Keep Workflows is a GitHub Actions-like experience for automating anything triggered within Keep: alerts, events, incidents, manual runs, or time intervals. Workflows can handle alert enrichment, ticket creation, self-healing, root cause analysis, and more; a minimal sketch follows this list.
- Incident correlation - Keep correlates alerts into incidents, performs triage, and conducts root cause analysis.
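To make the Workflows point concrete, here is a minimal sketch in the same YAML format used by the examples later on this page. The workflow id, the severity filter, and the slack-prod provider name are illustrative placeholders, not values from the Keep docs.

workflow:
  id: notify-on-critical                # illustrative id
  description: Post critical alerts to Slack
  triggers:
    - type: alert                       # run whenever a matching alert arrives
      filters:
        - key: severity
          value: critical
  actions:
    - name: notify-slack
      provider:
        type: slack
        config: "{{ providers.slack-prod }}"   # placeholder provider name
        with:
          message: "Critical alert: {{ alert.name }}"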
How does Keep get my alerts?
There are primarily two ways to get alerts into Keep:
Push
When you connect a Provider, Keep automatically instruments the tool to send alerts to Keep via webhook. For example, when you connect Grafana, Keep automatically creates a new webhook contact point in Grafana and a new notification policy that sends all alerts to Keep.
You can configure which providers you want to push from by checking the Install Webhook checkbox in the provider settings.
Pull
When you connect a Provider, Keep starts pulling alerts from the tool automatically. The pulling interval is defined by the KEEP_PULL_INTERVAL environment variable and defaults to 7 days; pulling can be turned off completely via the KEEP_PULL_DATA_ENABLED environment variable.
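For example, if you run Keep's backend with Docker Compose, this behavior could be tuned through environment variables. This is only a sketch: the service name is hypothetical, and we assume the interval value is expressed in days to match the stated default.

services:
  keep-backend:                        # hypothetical service name
    environment:
      KEEP_PULL_INTERVAL: "7"          # assumed to be in days, matching the default above
      KEEP_PULL_DATA_ENABLED: "false"  # disable pulling entirely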
You can also configure which providers you want to pull from by checking the Pulling Enabled checkbox in the provider settings.
We strongly recommend the push method for alerting: pulled alerts do not support many features, such as workflow automation. Pulling is mainly a quick way to get alerts into Keep and start exploring its value.
Use cases
Central alert management
No more navigating between multiple Prometheus instances and dealing with per-region, per-account CloudWatch settings.
By linking your alert-triggering tools to Keep, you gain a centralized dashboard for managing all your alerts.
With Keep, you can review, throttle, mute, and fine-tune all of your alerts from a single console.
Alert enrichment
You’re no longer constrained by the alerting mechanisms implemented in your tools.
Need alerts triggered exclusively for your enterprise customers? No problem. Want to add extra context that isn’t available in your existing tools? Easy.
Simply connect your observability tools, databases, ticketing systems, or any other tools that can provide additional context, and integrate them with your alerts.
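As a sketch of what this looks like in practice, the workflow below follows the same pattern as the full example later on this page: a step pulls extra context from a database, and an action includes it in the notification. The table, column, and provider names are placeholders.

workflow:
  id: enrich-with-customer-context     # illustrative id
  description: Attach customer context to an alert before notifying
  triggers:
    - type: alert
  steps:
    - name: fetch-customer-context
      provider:
        type: postgres
        config: "{{ providers.postgres-prod }}"   # placeholder provider name
        with:
          # placeholder table and column names
          query: "SELECT customer_tier FROM customers_table WHERE customer_id = {{ alert.customer_id }} LIMIT 1"
  actions:
    - name: notify-slack
      provider:
        type: slack
        config: "{{ providers.slack-prod }}"
        with:
          message: "{{ alert.name }} (tier: {{ steps.fetch-customer-context.results[0] }})"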
Automate the alert response process
There’s a saying that goes, “If you can automate the response to an alert, it shouldn’t be an alert,” right?
While that might be true in an ideal world, we understand that the response to an alert can often be at least partly automated, whether by double-checking the alert or by taking steps to verify that it is not a false positive.
Consider a common scenario: you receive a 502 error on one of your endpoints. That's alert-worthy, isn't it?
But what if you could confirm that it's a genuine error with an additional query? Or even determine that it's a free-trial user whose issue can wait until morning?
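Expressed in the workflow format used later on this page, that double-check might look like the sketch below: re-query the endpoint, and only page if the error reproduces. The alert name, URL, and provider names are placeholders.

workflow:
  id: verify-502-before-paging         # illustrative id
  description: Double-check a 502 alert before creating an incident
  triggers:
    - type: alert
      filters:
        - key: name
          value: endpoint-502          # placeholder alert name
  steps:
    - name: recheck-endpoint
      provider:
        type: http
        with:
          method: GET
          url: "https://YOUR_SERVICE_URL/"   # placeholder URL
          timeout: 2
  actions:
    - name: page-on-confirmed-error
      condition:
        - name: still-failing
          type: assert
          # page only if the re-check still returns 502
          assert: "{{ steps.recheck-endpoint.results.status_code }} == 502"
      provider:
        type: opsgenie
        config: "{{ providers.opsgenie-prod }}"
        with:
          message: "Confirmed 502 on {{ alert.name }}"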
Comparison
It’s often easier to grasp a tool’s features by comparing it to others in the same ecosystem. Here, we’ll explain how Keep interacts with and compares to these tools.
Keep vs IRM (PagerDuty, OpsGenie, etc.)
Incident management tools aim to notify the right person at the right time, simplify reporting, and set up efficient war rooms.
Keep focuses on the alert lifecycle, noise reduction, and AI-driven alert-incident correlation. Essentially, Keep acts as an "intelligent layer before the IRM," managing millions of alerts before they reach your IRM tool. Keep offers high-quality integrations with PagerDuty, OpsGenie, Grafana OnCall, and more.
Keep vs AIOps in Observability (Elastic, Splunk, etc.)
Keep is different because it can correlate alerts across different observability platforms.
Keep vs AIOps platforms (BigPanda, Moogsoft, etc.)
Keep is an alternative to platforms like BigPanda and Moogsoft. Customers who have used both traditional platforms and Keep report a significant improvement in alert correlation. Unlike the manual correlation methods of other platforms, Keep uses state-of-the-art AI models for easier and more effective alert correlation.
Examples
Create an incident only if the customer is on Enterprise tier
In this example we will use:
- Datadog for monitoring
- OpsGenie for incident management
- A Postgres database that stores the customer tier.
This example consists of two steps:
- Connect your tools - Datadog, OpsGenie, and Postgres.
- Create a workflow that is triggered by the alert, runs an SQL query, and decides whether to create an incident. Once the workflow is created, you can upload it via the Workflows page.
alert:
  id: enterprise-tier-alerts
  description: Create an incident only if the customer is enterprise.
  triggers:
    - type: alert
      filters:
        - key: source
          value: datadog
        - key: name
          value: YourAlertName
  steps:
    - name: check-if-customer-is-enterprise
      provider:
        type: postgres
        config: "{{ providers.postgres-prod }}"
        with:
          # Keep will replace {{ alert.customer_id }} with the customer id
          query: "SELECT customer_tier, customer_name FROM customers_table WHERE customer_id = {{ alert.customer_id }} LIMIT 1"
  actions:
    - name: opsgenie-incident
      # trigger only if the customer is enterprise
      condition:
        - name: verify-true
          type: assert
          assert: "{{ steps.check-if-customer-is-enterprise.results[0] }} == 'enterprise'"
      provider:
        type: opsgenie
        config: "{{ providers.opsgenie-prod }}"
        with:
          message: "A new alert on enterprise customer ({{ steps.check-if-customer-is-enterprise.results[1] }})"
Send a Slack message for every CloudWatch alarm
- Connect your CloudWatch account(s) and Slack to Keep.
- Create a simple Workflow that filters for CloudWatch events and sends a Slack message:
workflow:
  id: cloudwatch-slack
  description: Send a Slack message when a CloudWatch alarm is triggered
  triggers:
    - type: alert
      filters:
        - key: source
          value: cloudwatch
  actions:
    - name: trigger-slack
      provider:
        type: slack
        config: "{{ providers.slack-prod }}"
        with:
          message: "Got an alarm from AWS CloudWatch! {{ alert.name }}"
Monitor an HTTP service
Suppose you want to monitor an HTTP service. All you have to do is upload the following workflow:
workflow:
  id: monitor-http-service
  description: Monitor an HTTP service every 10 seconds
  triggers:
    - type: interval
      value: 10
  steps:
    - name: simple-http-request
      provider:
        type: http
        with:
          method: GET
          url: 'https://YOUR_SERVICE_URL/'
          timeout: 2
          verify: true
  actions:
    - name: trigger-slack
      condition:
        - name: assert-condition
          type: assert
          assert: '{{ steps.simple-http-request.results.status_code }} == 200'
      provider:
        type: slack
        config: '{{ providers.slack-prod }}'
        with:
          message: "HTTP Request Status: {{ steps.simple-http-request.results.status_code }}\nHTTP Request Body: {{ steps.simple-http-request.results.body }}"
  on-failure:
    # Just need a provider we can use to send the failure reason
    provider:
      type: slack
      config: '{{ providers.slack-prod }}'