How to use Robot Framework to manage a virtual QA lab in the AWS cloud


The AWS cloud makes QA testing easier. We can set up the production VPC, then simulate that VPC network in another AWS VPC for QA purposes. The word "simulate" understates how similar the prod and QA environments are, because by using the production CloudFormation template we can build a QA network exactly like the production network. Even better, once we are done with the QA test, we can delete all the assets created by the CloudFormation template, so we don't have to pay the maintenance cost of a QA lab. Of course, it is good practice to keep the prod and QA CloudFormation templates separate, to use template variables to keep the templates flexible, and to divide a big CloudFormation template into smaller templates that act as modules. Then we don't have to create the whole production network; we can create just the part we need for testing.
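
As a rough illustration of the parameterized, modular templates described above, the sketch below builds the arguments for one CloudFormation `create_stack` call per module. The bucket URL, the module names, and the parameter keys (`Environment`, `InstanceType`) are all hypothetical; only the `StackName`, `TemplateURL`, and `Parameters` argument shapes come from the CloudFormation API.

```python
# Hypothetical layout: one small template per module, stored in S3.
TEMPLATE_BASE_URL = "https://example-bucket.s3.amazonaws.com/templates"  # assumption
MODULES = ("network", "detector", "event-engine", "ticketing")  # assumption


def stack_requests(environment, modules, parameters):
    """Build one create_stack request (a dict of keyword arguments) per module."""
    unknown = sorted(set(modules) - set(MODULES))
    if unknown:
        raise ValueError(f"unknown modules: {unknown}")
    merged = {"Environment": environment, **parameters}
    return [
        {
            "StackName": f"{environment}-{module}",
            "TemplateURL": f"{TEMPLATE_BASE_URL}/{module}.yaml",
            # CloudFormation expects parameters as a list of key/value pairs.
            "Parameters": [
                {"ParameterKey": key, "ParameterValue": str(value)}
                for key, value in sorted(merged.items())
            ],
        }
        for module in modules
    ]


# Create only the part of the network that this test needs.
qa_requests = stack_requests("qa", ["network", "detector"], {"InstanceType": "t3.micro"})
```

Each resulting dict could then be passed to a CloudFormation client, for example `boto3.client("cloudformation").create_stack(**request)`.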

Since setting up an AWS QA network is so easy, we can set up a few cron jobs to run several QA test projects in parallel. A test project is like a movie theater: at different times, a cron job puts different movies on show. Each show is handled by a robot in Robot Framework.

Once started by the cron job, the robot accomplishes the following tasks during its life cycle:
  1. Create the QA test network using the CloudFormation API. The required CloudFormation template can be stored in an S3 bucket.
  2. Once the QA test network is created, the robot starts to run a few test suites.
  3. Assuming Robot Framework is integrated with Grafana, we can then see the test statistics graphically on a dashboard in real time.
  4. After all the test suites have finished, the robot calls the CloudFormation API to delete all the AWS assets created during the test.
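
The create, test, delete part of this life cycle can be sketched in Python as follows. This is only a minimal sketch: `cfn` is assumed to be a CloudFormation client such as `boto3.client("cloudformation")`, `run_suites` is a hypothetical callable that runs the test suites and returns 0 on success (Robot Framework's `robot.run` returns such a code), and the Grafana reporting step is left out.

```python
def run_qa_show(cfn, stack_name, template_url, parameters, run_suites):
    """Create the QA stack, run the suites, and always delete the stack.

    cfn:        a CloudFormation client (assumed to be a boto3 client)
    run_suites: hypothetical callable returning 0 when all tests pass
    """
    cfn.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,  # template stored in an S3 bucket
        Parameters=[
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in parameters.items()
        ],
    )
    try:
        # Block until CloudFormation reports that the stack is complete.
        cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
        return run_suites()
    finally:
        # Tear down every asset the template created, pass or fail.
        cfn.delete_stack(StackName=stack_name)
        cfn.get_waiter("stack_delete_complete").wait(StackName=stack_name)
```

Putting the deletion in a `finally` block means the QA network is removed even when stack creation or a test suite fails, so a broken run does not leave paid resources behind.
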
Now that we know the rules of the game, let's take a look at an example.

AWS QA lab example


Our example test project verifies that the alert system can escalate a network intrusion event into a ticket.

The network includes two parts: the customer's network and the security center's network. In the customer's network, we have an intrusion detector. The detector collects logs from the customer's assets and uses a set of regular expression rules to match the logs, picking out those relevant to network intrusion. (Snort, for example, detects intrusions this way.) Once a regular expression matches something, an event is generated for a closer look. The event should have at least the timestamp, sourceIp, destinationIp, description, and raw log. These events are sent through a VPN tunnel to the event engine located in the security center's network. On that server, the event is inserted into a MySQL database, and the event engine then escalates the relevant events into a ticket for human inspection. The ticket should have at least the severity, POC, and list of relevant events. Once the ticket is generated, it is transferred by ETL to a security analyst's ticketing system and inserted into their MySQL database. There, the ticket should have at least the severity, POC, list of relevant events, and status.
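
To make the rule-matching step concrete, here is a small Python sketch of one regular expression rule turning a syslog line into an event with the fields listed above. The rule, the sample log line, and the `Event` class are illustrations only, not the detector's actual rule set; the destination IP is passed in by the caller because it does not appear in the log line itself.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical rule for an OpenSSH failed-login syslog line.
FAILED_SSH = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+ [\d:]{8}) (?P<host>\S+) sshd\[\d+\]: "
    r"Failed password for (?:invalid user )?(?P<user>\S+) "
    r"from (?P<source_ip>\d{1,3}(?:\.\d{1,3}){3}) port \d+"
)


@dataclass
class Event:
    timestamp: str
    source_ip: str
    destination_ip: str
    description: str
    raw_log: str


def match_log(raw_log: str, destination_ip: str) -> Optional[Event]:
    """Return an Event if the log line matches the rule, else None."""
    m = FAILED_SSH.match(raw_log)
    if m is None:
        return None
    return Event(
        timestamp=m["timestamp"],
        source_ip=m["source_ip"],
        destination_ip=destination_ip,
        description=f"Failed SSH login for user {m['user']}",
        raw_log=raw_log,
    )


line = "Mar  3 10:15:42 victim sshd[2211]: Failed password for root from 10.0.1.5 port 52044 ssh2"
event = match_log(line, destination_ip="10.0.2.7")
```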

So the pass criteria are:

  1. A new event is found in the event engine's MySQL database, with matching sourceIp, destinationIp, and description.
  2. A new ticket is found in the event engine's MySQL database, with the expected severity, POC, and event list.
  3. A new ticket is found in the security analyst's MySQL database, with the expected severity, POC, event list, and status.
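
These three criteria could be checked along the following lines. The sketch assumes the rows have already been fetched from the two MySQL databases as dictionaries, restricted to rows created after the test started; the column names (`sourceIp`, `POC`, `eventList`, and so on) are guesses based on the field names above, not a known schema.

```python
EVENT_FIELDS = ("sourceIp", "destinationIp", "description")
TICKET_FIELDS = ("severity", "POC", "eventList")
ANALYST_TICKET_FIELDS = TICKET_FIELDS + ("status",)


def has_match(rows, expected, fields):
    """True if any row agrees with `expected` on every listed field."""
    return any(all(row.get(f) == expected[f] for f in fields) for row in rows)


def failed_criteria(event_rows, engine_ticket_rows, analyst_ticket_rows, expected):
    """Return the names of the pass criteria that are not met."""
    checks = [
        ("event in event engine", event_rows, expected["event"], EVENT_FIELDS),
        ("ticket in event engine", engine_ticket_rows, expected["ticket"], TICKET_FIELDS),
        ("ticket in analyst system", analyst_ticket_rows, expected["ticket"],
         ANALYST_TICKET_FIELDS),
    ]
    return [name for name, rows, want, fields in checks
            if not has_match(rows, want, fields)]


# Hypothetical expected values for the failed-SSH test.
expected = {
    "event": {"sourceIp": "10.0.1.5", "destinationIp": "10.0.2.7",
              "description": "Failed SSH login"},
    "ticket": {"severity": "high", "POC": "soc@example.com",
               "eventList": "101", "status": "open"},
}
```

The test passes when `failed_criteria(...)` returns an empty list; otherwise the returned names say which criterion was missed.
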
With these in mind, we can have the robot issue a CloudFormation API call with parameters to trigger a CloudFormation build. The orchestration is achieved with an SQS queue, which the template creates first. The template creates an EC2 instance for the intrusion detector from an existing image. This instance needs a secret to call back to the security center, which is supplied as an EC2 instance variable. Once the instance is created, a startup script sends a message to the SQS queue announcing its IP address.

At the same time, the CloudFormation template creates two more EC2 instances, an attacker and a victim. Once the victim instance is created, a startup script polls the SQS queue for the message with the intrusion detector's IP address. Once that message is retrieved, the victim configures its syslog to point to the intrusion detector, then sends an SQS message with its own IP address. Meanwhile, the attacker instance polls the SQS queue for the victim's IP address. Once that message is retrieved, it creates a robot that attacks the victim. Assume the robot issues 3 failed SSH logins to the victim, so that the victim sends a syslog message to the intrusion detector reporting the event. (We could have the attacker do real hacking with hacking tools installed there, but let's assume failed SSH logins are enough for our test.)

The robot that started the CloudFormation template sleeps long enough for the system to fully digest the attack. Once it wakes up, it checks the MySQL databases against the pass criteria. If no event is found, it sleeps longer and checks again. If the event is still not found, the test is marked as a failure.
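
The SQS handshake and the final sleep-and-check step might look roughly like this in Python. Here `sqs` is assumed to be an SQS client such as `boto3.client("sqs")`; the JSON message format with `role` and `ip` keys, the function names, and the polling limits are all made up for this sketch.

```python
import json
import time


def announce_ip(sqs, queue_url, role, ip):
    """Publish this instance's role and IP address to the shared queue."""
    sqs.send_message(
        QueueUrl=queue_url, MessageBody=json.dumps({"role": role, "ip": ip})
    )


def wait_for_ip(sqs, queue_url, role, max_polls=30):
    """Poll the queue until a message from `role` arrives; return its IP."""
    for _ in range(max_polls):
        reply = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        for msg in reply.get("Messages", []):
            body = json.loads(msg["Body"])
            if body.get("role") == role:
                sqs.delete_message(
                    QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
                )
                return body["ip"]
            # Not ours: make it visible again for the instance that needs it.
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=msg["ReceiptHandle"],
                VisibilityTimeout=0,
            )
    raise TimeoutError(f"no message from {role!r} after {max_polls} polls")


def check_with_retry(check, first_wait, second_wait, sleep=time.sleep):
    """Sleep, check, then sleep longer and check once more before failing."""
    for wait in (first_wait, second_wait):
        sleep(wait)
        if check():
            return True
    return False
```

In this sketch the detector would call `announce_ip(sqs, url, "detector", ip)`, the victim would call `wait_for_ip(sqs, url, "detector")` and then announce itself, and the attacker would wait for `"victim"`. Because all three instances share one queue, a consumer can receive a message meant for someone else, which is why non-matching messages are released rather than deleted; separate queues per message type would avoid that.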

Some details have not been mentioned. The attacker and victim EC2 instances run simple Unix OSes, with RPMs (such as Robot Framework) downloaded from S3 buckets. The intrusion detector has an HTTP client that calls back to the server in the security center to send events; the secret that identifies it is sent to the server through the VPN tunnel during the handshake. The intrusion detector instance can also download RPMs from the S3 bucket. The RPMs stored in the S3 bucket are automatically updated 24/7 by another robot, through Nexus repository downloads and AWS S3 API calls.


