Major Incident Management use case [Guest Post]

Editor’s note: A massive thank you to Patrick Bayle for today’s guest post on using a SOAR playbook to handle major incident management comms.

When should a use case be classified as something other than a use case?

When I pose this question to the customers I engage with, the responses typically draw from a small subset of common challenges around security incident management. Some typical examples:

  • Triaging SIEM alerts
  • Malware investigations
  • Phishing

And so on. It’s sometimes hard not to feel like a terrible game show host seeking audience participation, but I would much rather this than hear crickets!

I am nothing if not prepared for such engagements, and first-hand experience helps when focusing attention on the matters that would bring the MOST value to the SOC. One such example is common, but rarely considered in the realms of incident handling.

In an industry plagued with three-letter acronyms, this one appears to have slipped through the net, for now at least. I have heard it described as “Major Incident Management” (MIM) or “Critical Incident Handling” (CIH). Neither the acronym nor the naming convention matters in truth, yet this is one challenge that can have serious repercussions for a business if it is not automated. Pertaining specifically to MIM/CIH, the SOC’s goal is to:

Ensure a consistent methodology is applied during a major/critical incident, and that regular communications occur during the investigation of said incident.

If you’ve ever had to firefight (metaphorically or otherwise), you will know that communication is often automatically pushed to the bottom of the list, because the priority is to fight the fire (obviously). A SOC analyst has to diagnose and mitigate a threat as quickly as possible, and naturally all attention is on performing that duty. Updating management slows the response and detracts from the task at hand, so quite simply, automation is the only option.

How about this: a playbook that runs on a schedule and, whenever an open incident matches “critical”, sends an email in a predefined format to a distribution list. That would solve the communication problem whilst ensuring that the SOC can do their thing and put the fire out as quickly as possible! Well, look no further, as we have a playbook for that:

The playbook logic is very simple: if there is no critical incident, no email is sent; if there is a critical incident, the playbook generates the report on a schedule defined by the business (in my case the requirement was to update management at least every thirty minutes, but this does vary). Here’s the filter on the search incident task:
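The core of that logic can be sketched in a few lines of Python. This is an illustration only, not the actual playbook tasks: the incident structure and the send_email callback are assumptions made for the example.

```python
# Hedged sketch of the playbook decision: no critical incident, no email;
# otherwise a summary goes to the distribution list on the agreed cadence.
REPORT_INTERVAL_MINUTES = 30   # the example cadence from the post


def run_mim_report(incidents, send_email):
    """Send a management summary only when open critical incidents exist.

    incidents: list of dicts with at least "severity" and "id" keys (assumed shape).
    send_email: callback standing in for the mail integration task.
    Returns True if an email was sent.
    """
    critical = [i for i in incidents if i.get("severity") == "critical"]
    if not critical:
        return False   # nothing critical: stay silent
    ids = ", ".join(str(i["id"]) for i in critical)
    send_email(subject="Major incident update",
               body=f"Open critical incidents: {ids}")
    return True
```

In a real deployment the scheduling itself lives outside this function (an XSOAR job running every 30 minutes), so the function only decides whether to speak.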

And the mail received, with an attachment and a small amount of text with the incident numbers (customisable of course):

The attachment (funnily enough, also easily customised) is designed for management’s perusal. A separate email with much more detail can also be sent to the SOC manager… but this is probably unnecessary as they should be using the default XSOAR dashboard that shows them this 🙂

I have no doubt that every SOC has this need but maybe they just don’t know it yet?

My closing advice: always think broader than the two or three incidents you work on regularly, or the most annoying cases you work on within the SOC. The business should know the value the SOC brings by thwarting attacks in a timely manner, and automated reporting makes it easy to demonstrate that value to the organisation.

Awesome post Patrick, I look forward to your next one 🙂


XKCD on SOAR metrics

Is It Worth the Time?

That is to say that… a saving of 5 minutes, against an action that happens 5 times a day, means (over the chart’s five-year horizon) you’re allowed to spend 4 weeks automating it and still be in the green.

I’ve never spent 4 weeks purely on a use case, that’s insane. Many playbooks I build take about a day (that includes building, testing, case management, reporting, SLA metrics, etc.), with some more complicated playbooks taking a few days.

I don’t recall ever needing 5–6 days, but even if I had… that time is justified on a use case that:

  • saves 30 minutes once a week (Green)
  • saves 5 minutes daily (Blue)
  • saves 1 minute frequently (Red)
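The arithmetic behind the chart’s claim is easy to sanity-check. The snippet below just reproduces the “5 minutes, 5 times a day” cell over the chart’s five-year horizon:

```python
# Sanity check of the xkcd #1205 arithmetic cited above:
# 5 minutes saved, 5 times a day, over five years.
minutes_saved = 5 * 5 * 365 * 5            # min/run * runs/day * days/year * years
weeks_saved = minutes_saved / 60 / 24 / 7  # convert minutes -> hours -> days -> weeks
print(round(weeks_saved, 1))               # about 4.5 weeks of saved time
```

So the chart’s “4 weeks” allowance is actually slightly conservative.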


(Randall you’re awesome, keep writing those books!)

Automated Testing of Defences and Alerting

Yes, defenCe, not defenSe, I’m British darling.

When I was a SOC team leader (before SOAR existed) I tried to build automated processes to confirm technology and process worked as expected. Even though I struggled with scalability, my aim was to test:

  • Was existing technology blocking known bad as designed?
  • Were alerts being raised to my Analysts?
  • Was the team reacting quickly enough?

Technology Configuration Testing

Over time, policies and allow/block lists get abused by inexperienced staff making unsafe/incorrect changes.

(I once saw “allow encrypted PDF” at the top of a proxy config. #Fail)

Imagine a playbook that could:

  • Test Web policies by downloading an encrypted zip
  • Test AV by downloading Eicar
  • Test firewall policy by connecting inbound HTTP 80 to your DMZ
  • Test SSL policy by connecting to an invalid Certificate

We could run this playbook every 60 minutes, and any test that “fails” can create a Critical Severity incident for the team to investigate WHY it was successful.
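As a rough illustration, two of those checks could look like the functions below. In practice these would be SOAR playbook tasks rather than a script; the URLs are the well-known public EICAR test file and the deliberately broken badssl.com test host, and open_critical_incident is a hypothetical helper.

```python
# Hedged sketch: each check returns True when the control blocked as designed,
# and any False result should raise a Critical incident for investigation.
import ssl
import urllib.error
import urllib.request


def download_is_blocked(url, timeout=10):
    """Return True if the download fails, i.e. a control blocked it."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return False   # download succeeded: the control did NOT block it
    except (ssl.SSLError, urllib.error.URLError, OSError):
        return True    # blocked (or unreachable): treat as blocked


def failed_tests(results):
    """Names of tests that did NOT block; each one deserves a Critical incident."""
    return [name for name, blocked in results.items() if not blocked]


# Example hourly run (hypothetical helper names):
# results = {
#     "av_eicar": download_is_blocked(""),
#     "ssl_policy": download_is_blocked(""),
# }
# for name in failed_tests(results):
#     open_critical_incident(name)
```

The inversion is the point of the design: a “successful” download is the failure condition, because it means the control let known bad through.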

Testing Alert-Workflow

Taking a “connect to known C2” validation as an example: the connection should be blocked, but even when it is blocked we can test more:

  • Was the HTTP block logged in your LogStore/DataLake/SIEM?
  • Was this malicious request raised as a new Alert to your analysts?

Can we check this automatically, and check whether the alert creation is happening quickly enough?

This kind of playbook can be left running for weeks, and you only get involved if an alert fails to be created. That’s a lot of peace of mind for a very small amount of effort.

Alternative Usecases

The list of possible tests is endless, but if I still ran a SOC, here is a simple list I would want to create for continuous validation:

  • Brute-force a random account and test whether it becomes locked out
  • Is inbound password spraying detected?
  • Test inbound checks of SSL cert validity and TLS 1.0 handshakes
  • Inbound port scans and insecure protocols
  • Add a new account to a sensitive OU (e.g. Domain Admins) and see if anyone notices
  • Run encoded/obfuscated PowerShell against endpoints
  • Probe internal lateral movement to sensitive networks
  • Transfer a large file, or transmit easily detected PII

What else could you test?


Python Cheatsheet for SOAR

I find that the vast majority of vendor integrations and playbook automations are 90% identical: ingest inbound data (an array, object, etc.), parse through it whilst validating and extracting, then finally push it out.

This means I use the same code aaaaaall the time. So I thought I would make a little cheatsheet for the basics (to prevent me googling the same things over and over). This list will change/grow over time.

Not covered in this cheatsheet:

  • Local file handling (open, read, close, etc)
  • HTTP and SSL request/replies
  • Time/date handling

To keep formatting simple, each snippet’s description appears as a # comment; remember Python needs consistent indentation (real tabs or spaces).

# Common lib imports
import json, re, time, random, base64
from pprint import pprint

# Basic strings and numbers
myString = "hi"
myString += " I added a bit"
myNum = int("42")      # string to int
myStr = str(myNum)     # int to string

# Basic if structure (remember: indents matter)
if varA > 5:
    doSomething()
elif varA > 0:
    doSomethingElse()

if myString == "compare me":   # simple string comparison
if myVar is None:              # None / Null
if not myVar:                  # value exists, but is empty
if (a == b) and (not b == c):
if a < 5 < b:                  # true only if both comparisons hold
if isinstance(myVar, list):    # list, dict, str, int… etc

# Lists (arrays)
myList = []
output1, output2 = myString.split("character")   # split into exactly two parts
myList = myString.split(" ")                     # string to list
myString = "-".join(myList)                      # list to string

# Python dict
myDict = {}                                 # alternative: myDict = dict()
myVar = myDict.get("key", default_value)    # extract a value, with a default
myVar = myDict["key"]                       # similar, but raises KeyError if not found
myDict["key"] = "value"                     # set a new value in the dict
if myKey in myDict:                         # check if a key exists
myList = list(myDict.keys())                # same for .values()

# JSON (parsed into a dict, and handled very similarly)
myJson = json.loads(string)       # where string is of the form '{"key": "value"}'
myString = myJson["key"]
myJson["key"] = newValue
myJsonAsString = json.dumps(myJson)
if myKey in myJson:

# Loops
for value in myList:                  # loop a list
while condition == True:              # simple while
for myInt in range(5):                # myInt will be 0,1,2,3,4
for index in range(len(myList)):      # index is the position in the list
for key, value in myJson.items():     # loop through a JSON/dict
for key in jsonObject:                # alternative to the above
    value = jsonObject[key]
break      # break out of the current loop (1 layer) to the next code
continue   # stop processing this iteration and go to the next one

# Regex
newString = re.sub(r"pattern", "replaceWith", targetString)   # substitution
arrayResults = re.findall(r"pattern", targetString)           # findall
match = re.match(r"Goodbye", "Hello")                         # match and test
if match is None:
objectResults ="[a-z]+", myVar)               # returns a match object

# Base64 (Python 3 works on bytes, hence the .encode())
encoded = base64.b64encode("Hello World".encode())   # encode a string into Base64
readableData = base64.b64decode(encoded)             # convert back to the original bytes

# Try/except (remember: indents)
    riskyCall()
except Exception:
    print("Something went wrong")
    print("The 'try except' is finished")

# Printing
print()      # simple print
pprint()     # print more complicated objects (needs "from pprint import pprint")

# Bits and bobs
myFloat = random.random()            # float between 0.0 and 1.0
randomInt = random.randint(1, 100)   # int between x and y
print(type(x))                       # get the type of a variable

SOAR helping out Unstable Server/Service

Socops.Rocks is hosted on a WordPress site:

  • Pro – WordPress allows for quick easy deployment
  • Con – WordPress gets attacked a lot, and crashes, needing restarts

The problem can be described:

  • 24/7 monitoring
  • When an outage is found, start a process
  • Require approval from team
  • Fix the situation automatically
  • Full Audit log

SOAR to the rescue! We need:

  • Testing criteria –> GET & HTTP Response Code
  • Automated frequency –> “Jobs”
  • Process approval –> Me
  • Remediation –> Reboot box (/restart service/other)

Job 1 – Configure an SSH integration using a secure SSH key (i.e. not password auth)

Job 2 – Configure a Task to connect to Linux and issue a reboot

Job 3 – Build a very simple workflow around the ‘Reboot’ task. If we get a HTTP 200, simply close the ticket, ‘else’ ask the sysadmin whether to issue a reboot.

Job 4 – Create a schedule to run this process every 5 minutes

Job 5 – Enjoy life at a social-distancing BBQ

Job 5.1 – If/when needed, approve the process (here I’m using the mobile app… because I’m at the BBQ)

Job 6 – Make it a Dashboard / Report
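The logic that Jobs 1–4 wire together can be sketched in plain Python. This is a stripped-down illustration only (the real version is an XSOAR job plus playbook); site_is_up, ask_sysadmin_approval and reboot_over_ssh are placeholder names for the test, approval and SSH tasks.

```python
# Hedged sketch of the 5-minute health check: HTTP 200 means close quietly,
# anything else means ask the sysadmin before issuing the reboot.
import urllib.request

SITE_URL = "https://socops.rocks"   # the site being monitored


def site_is_up(url, timeout=10):
    """True only if the site answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False   # timeout, DNS failure, 5xx, etc. all count as down


def health_check(url, ask_sysadmin_approval, reboot_over_ssh):
    """One scheduled run: close on HTTP 200, otherwise seek approval to reboot."""
    if site_is_up(url):
        return "closed"            # nothing to do, close the ticket
    if ask_sysadmin_approval():    # e.g. mobile-app approval at the BBQ
        reboot_over_ssh()          # the SSH integration task
        return "rebooted"
    return "declined"
```

Keeping the approval as an injectable callback is what lets the same workflow be driven by a mobile notification, an email reply, or a dashboard button.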

There are of course many improvements I can make (and probably will, to squeeze a second blog article out of this….)

  • Reboot, wait 180 seconds, and retest HTTP 200
  • Check the HTML content for unpredicted changes
  • Check SSL cert validity
  • Restart web service instead of a reboot
  • Download last 20 log entries (pass through Threat Intel Platform)
  • Etc

Without a video it’s hard to show this in action, but I’m happy to say that it works perfectly.


  • With no manual labour, every 5 minutes, if there’s an issue, I get a mobile notification to ask for my authorisation to reboot
  • I can reboot the server from anywhere in the world without needing my SSH keys with me
  • Full audit log, easy to expand
  • Dashboards


Auto closing tickets based on workload

Last week I had an interesting chat with a security team:

  • Our workload is very unpredictable
  • We want SOAR to intelligently auto-prioritise incidents
  • And when we are ‘busy’ auto close low priority tickets
  • but we still want automated IOC enrichment, full auditing, etc

Coupled with intelligent prioritising, this is a great idea.

Request: “if workload is high, auto close the incident”

  • After a new incident workflow enriches, we calculate the current team workload
  • For every open incident: Priority 1 = 4 points, Priority 2 = 3 points, etc.
  • If the total is >20 points, auto-close the new incident with the note “auto closed due to too much workload”

This is great, but I see an improvement. Workloads change very quickly: you might be busy right now, but in an hour everything gets resolved and you have no tickets left to look at.

My alternative: “create, enrich, wait, auto close”

  • Any low priority incident starts a 3 day timer
  • Incidents are assigned to the team, not an individual
  • If an analyst has capacity, they can self-assign and own the ticket
  • If the incident isn’t touched in 3 days, it is auto closed
  • We create dashboards that look at the incident count per close duration
  • These dashboards show how many incidents / type are closed without being looked at

I’m an ex-analyst; I know that low-quality alerts can contain valuable information. We don’t always have the time, but that ticket still needs enrichment for future analysis in case we need to come back to it.

At least by using SOAR for automation, you ensure that:

  • The incident was logged
  • The details were enriched
  • You were able to reach out to members of the company to validate
  • Auto log all information/decisions for future audit and reviews
  • The playbook had the option to double-check the alert is low priority (and re-prioritise it if not)

…which is significantly more than I was able to control a few years ago 🙁


Intelligent SLA vs Knock Knock jokes

A man walks into a bar, ouch

This is a quick ‘joke’; it takes 2 seconds to say, every time I tell it. I have no concerns giving an SLA for this joke. On the other hand…

Knock Knock…

Whilst I know *I* can tell this joke in under 5 seconds, I’m entirely reliant on the person I’m telling it to, so is it fair to apply an SLA to me?

Compare this to a SOAR playbook: we have control over any local task, but it’s not so simple when we wrap a business process around it:

  • Any interaction that involves human input (especially where that person is not part of our team, and we can’t kick them)
  • A query that potentially takes hours to complete
  • Unstable technology we can’t change
  • Technology belonging to another team

So how do we apply such SLAs to playbooks?

SLA for an entire Incident

Pro – Quick to configure. Great for small simple playbooks.

Con – Very inflexible.

A timer starts with the incident; if the ticket takes longer, we have an SLA breach.

SLA for each individual task

Pro – Finely tuned

Con – Administrative overheads building and maintaining

Start a timer for each specific task, if that task takes too long we can either alert, skip the task, or take a different playbook route and escalate the process to the senior team.

Timed Section

Pro – Flexible, quick to deploy

Con – none?

E.g. Task 1 starts timer, task 5 pauses it, task 7 resumes it, task 10 closes it.
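The start/pause/resume behaviour can be illustrated with a small class. This is purely the concept, not an XSOAR API (XSOAR has built-in Timer fields for exactly this); the class name and methods are mine.

```python
# Illustrative pausable timer for the "Timed Section" idea: time only
# accumulates while running, so waiting on someone else's "knock knock"
# reply doesn't count against our SLA.
import time


class SectionTimer:
    def __init__(self, sla_seconds):
        self.sla = sla_seconds
        self.elapsed = 0.0     # time accumulated across completed run segments
        self._started = None   # monotonic timestamp of the current segment, if running

    def start(self):
        """Start (or resume) the timer at the current task."""
        self._started = time.monotonic()

    def pause(self):
        """Pause while an external party holds the ball."""
        if self._started is not None:
            self.elapsed += time.monotonic() - self._started
            self._started = None

    def breached(self):
        """True if total running time (paused gaps excluded) exceeds the SLA."""
        running = (time.monotonic() - self._started) if self._started else 0.0
        return (self.elapsed + running) > self.sla
```

So for the example above: task 1 calls start(), task 5 pause(), task 7 start() again, and task 10 makes the final breached() check.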

Knock Knock (including SLA)

  • The “Joke SLA” represents the entire incident
    • Terminology: “Incident SLA”
  • The “My Team SLA” stops and starts
    • Terminology: “Timer”
  • The “Punchline SLA” covers a single task
    • Terminology: “Task-specific SLA”


“SOAR? ..but we’re not big enough to have a SOC”

I hear this a lot, but it doesn’t matter. If anything, a smaller team has more of a need for SOAR.

Don’t believe me? Listen to Bruce Potter (CISO, Expel) and Mike Johnson (CISO, Fastly) on the CISO Series blog (fast-forward to 1m38s)

A listener writes in asking “How do you thrive, and how do you survive as a team of 1?”

The panel discuss many general points, including:

  • think about how to amplify yourself
  • democratise others to do work on your behalf
  • only so many hours in a day for you to get things done
  • bring in others to help
  • be the architect for security
  • think about the bigger picture, of how to apply your knowledge
  • find the multipliers

Whilst the panel lean towards distributing responsibilities and finding allies to do work on your behalf (lucky them), I was just hearing:

  • Yes, SOAR
  • Yes, SOAR
  • Yes, SOAR
  • Yes, DBot
  • Yes, SOAR
  • Yes, SOAR

Essentially, automate the hell out of it.  If you have 1 or 2 people, surely automation is the only way to scale.

Example: a user submits a request (priority 3?), which goes to the bottom of the priority queue, takes 2 days to be picked up, and 30 minutes to fix. That’s a long wait for something simple. People see the IT team as blockers, not enablers.

Now imagine SOAR performing all those simple/fast requests with a turnaround of 2 minutes. Going from a 2-day wait to a 2-minute wait is roughly a 144,000% improvement in service, without any additional head count or training.

Use Cases every sized organisation has to deal with:

  • Whitelisting domains
  • Joiners Movers Leavers (JML)
  • Blocking IP
  • Enriching IOC for teams
  • Phishing 

This has a few benefits:

  • Smaller ticket queue
  • Increased perception of service (your end users feel that your service is always-on)
  • More time and less interruptions whilst you are doing more important tasks

That’s just the basics of what SOAR does. Take that further and use all the functionality of SOAR case management here too:

  • The remaining 50% of tickets are already prioritised, so you won’t miss a high priority just because you ‘started at the top of the inbox and worked down’
  • Playbooks can interact with non-SOAR users, meaning your end users always have the option to “click here to escalate this to a security person now” if it can’t wait
  • Daily/Hourly reporting to you on the queue size with nice breakdowns on priority, incident type etc.
  • Even cooler would be SOAR checking the baseline of tickets. Imagine a security incident happens and many users report a similar issue; if we see this rise, we contact you straight away: “warning, phishing requests at 205% of normal”
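That baseline check is a one-liner in spirit. A minimal sketch, assuming a history of daily ticket counts per incident type (the function name and threshold are illustrative, not a SOAR feature):

```python
# Hedged sketch of the "percent of normal" baseline check: compare today's
# count for one incident type against the average of recent days.
ALERT_THRESHOLD_PERCENT = 200   # hypothetical trigger for the warning message


def percent_of_baseline(today_count, history):
    """history: list of recent daily counts; returns today's volume vs average."""
    baseline = sum(history) / len(history)
    return round(100 * today_count / baseline)


def volume_warning(incident_type, today_count, history):
    """Return the warning string from the example above, or None if normal."""
    pct = percent_of_baseline(today_count, history)
    if pct >= ALERT_THRESHOLD_PERCENT:
        return f"warning, {incident_type} requests at {pct}% of normal"
    return None
```

With a typical history of around 20 phishing tickets a day, a 41-ticket day comes out at 205% and fires the warning; a normal day returns None and nothing is sent.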

Honestly, I could go on about many more ways SOAR supports a smaller team in the background, but I hope this makes the point.