Architecture

Overview

Data Pipeline Diagram

Datasource

Service Logging Library (SLL) is an API that lets services generate telemetry events based on the common schema from server-side code. The Part B Scenario of the Correlation Context will be used to associate events with scenarios and partners/customers/constituents.

The events generated by SLL can be consumed by different event processors such as Xpert, Xlens, and Asimov Cooker. Both Xpert and Xlens require their agent to be installed on the machine. They process the events locally and send the aggregated data to their respective data stores. Asimov Cooker is an offline cooker (data is processed outside the machine) that contains a set of data pipelines that take the raw events as input and produce cooked hourly streams. In the Asimov pipeline, the events are sent to Cosmos through the Cosmos Data Loader (CDL). Asimov Cooker removes duplicates, applies the common schema, clusters the events by event name, and produces the hourly streams.

Because Asimov Cooker allows us to handle bulk data (without any service restrictions), our Availability pipeline will be based on the SLL events cooked by Asimov Cooker. We will also look into integrating data from other sources such as Xlens, Application Insights, etc.

Data Processing

Data processing consists of the following components.

Aggregate
Translate
Validate
Publish

Aggregate

Aggregation of the events is done in Cosmos. It involves two steps.

Filter and Extract
Minute and Daily Aggregation

Required fields are extracted from the Asimov Cooker View. Scenario ID and Partner ID are extracted from the Correlation Context. If the Correlation Context does not exist (for SLL version < 5), we will use ScScenario and ScPartner, or Operation Name and Caller Name from Part C, to calculate the Scenario ID and Partner ID, respectively. The extracted fields are stored as a structured stream (RawEvents) under the respective hour folder in Cosmos. The minute-level aggregation is done on this structured stream. The daily aggregation is done on the minute-level aggregated stream. Both aggregated streams are copied to the datastore.
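The two aggregation levels can be illustrated with a small Python sketch. It is not the Cosmos job itself: the event dictionaries, the Success flag, and the LatencySum counter are assumptions made for illustration, and RequestType and Environment are left out of the grouping key for brevity. Carrying a latency sum instead of an average is one way to let the daily rollup, which reads only the minute-level output, still produce a correctly weighted AverageLatency.

```python
from collections import defaultdict
from datetime import datetime

COUNTERS = ("TotalRequests", "SuccessfulRequests", "FailedRequests", "LatencySum")


def _new_bucket():
    # LatencySum is a hypothetical helper counter, not a field named in the schema.
    return {name: 0 for name in COUNTERS}


def aggregate_by_minute(events):
    """Group extracted events by (ScenarioID, PartnerID, minute)."""
    buckets = defaultdict(_new_bucket)
    for event in events:
        minute = event["Time"].replace(second=0, microsecond=0)
        bucket = buckets[(event["ScenarioID"], event["PartnerID"], minute)]
        bucket["TotalRequests"] += 1
        if event["Success"]:
            bucket["SuccessfulRequests"] += 1
        else:
            bucket["FailedRequests"] += 1
        bucket["LatencySum"] += event["Latency"]
    return dict(buckets)


def aggregate_by_day(minute_buckets):
    """Roll minute-level buckets up to (ScenarioID, PartnerID, day)."""
    daily = defaultdict(_new_bucket)
    for (scenario_id, partner_id, minute), bucket in minute_buckets.items():
        day = minute.replace(hour=0, minute=0)
        target = daily[(scenario_id, partner_id, day)]
        for name in COUNTERS:
            target[name] += bucket[name]
    return dict(daily)


def average_latency(bucket):
    """Derive AverageLatency from the counters of a non-empty bucket."""
    return bucket["LatencySum"] / bucket["TotalRequests"]


events = [
    {"ScenarioID": "s1", "PartnerID": "p1", "Time": datetime(2020, 1, 1, 10, 5, 10),
     "Latency": 100, "Success": True},
    {"ScenarioID": "s1", "PartnerID": "p1", "Time": datetime(2020, 1, 1, 10, 5, 50),
     "Latency": 300, "Success": False},
    {"ScenarioID": "s1", "PartnerID": "p1", "Time": datetime(2020, 1, 1, 11, 0, 0),
     "Latency": 200, "Success": True},
]
minute_level = aggregate_by_minute(events)
day_level = aggregate_by_day(minute_level)
```

In this sample, the first two events fall into the same minute bucket, and all three roll up into a single daily bucket.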

Translate

In the datastore, the aggregated data is joined with the Scenario and Customers tables based on the Scenario ID and Partner ID.
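As a rough illustration of this join, the sketch below uses plain Python dictionaries in place of the datastore tables. The ScenarioName and CustomerName columns are hypothetical, and keeping unmatched rows (a left join) is an assumption; the actual join type is not specified here.

```python
def translate(aggregated_rows, scenarios, customers):
    """Attach scenario and customer names to each aggregated row.

    `scenarios` maps ScenarioID -> name and `customers` maps PartnerID -> name.
    Rows without a match are kept with None so a later step can inspect them.
    """
    translated = []
    for row in aggregated_rows:
        enriched = dict(row)
        enriched["ScenarioName"] = scenarios.get(row["ScenarioID"])
        enriched["CustomerName"] = customers.get(row["PartnerID"])
        translated.append(enriched)
    return translated


rows = [
    {"ScenarioID": "s1", "PartnerID": "p1", "TotalRequests": 10},
    {"ScenarioID": "unknown", "PartnerID": "p1", "TotalRequests": 2},
]
result = translate(rows, scenarios={"s1": "Checkout"}, customers={"p1": "Contoso"})
```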

Validate

The data is validated in the datastore (duplicates, invalid scenario IDs, etc.).
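A minimal sketch of the two checks named above, assuming a row is uniquely identified by ScenarioID, PartnerID and Time, and that the set of valid scenario IDs is available as a lookup. Both assumptions, and the rejection reasons, are illustrative only.

```python
def validate(rows, known_scenario_ids):
    """Split rows into valid rows and (row, reason) rejections."""
    seen_keys = set()
    valid, rejected = [], []
    for row in rows:
        key = (row["ScenarioID"], row["PartnerID"], row["Time"])
        if row["ScenarioID"] not in known_scenario_ids:
            rejected.append((row, "invalid scenario id"))
        elif key in seen_keys:
            rejected.append((row, "duplicate"))
        else:
            seen_keys.add(key)
            valid.append(row)
    return valid, rejected


rows = [
    {"ScenarioID": "s1", "PartnerID": "p1", "Time": "2020-01-01T10:05"},
    {"ScenarioID": "s1", "PartnerID": "p1", "Time": "2020-01-01T10:05"},
    {"ScenarioID": "bogus", "PartnerID": "p1", "Time": "2020-01-01T10:05"},
]
valid, rejected = validate(rows, known_scenario_ids={"s1"})
```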

Publish

The validated data will be pivoted to Scenarios and published for the UI to consume.

Schema

Filtered fields

ScenarioID (from cC: ms.b.tel.scenario)
PartnerID (from cC: ms.b.tel.partner)
ScScenario
ScPartner
OperationName
CallerName
Time
RequestType
Environment
RoleInstance
SLLVersion
Latency
RequestStatus
testHeader
quoteIsTest
isTest

Aggregated Stream

See Calculations for how these fields are calculated.

ScenarioID
PartnerID
Time
RequestType
Environment
TotalRequests
SuccessfulRequests
FailedRequests
AverageLatency
STID

Filters

These filters are applied while collecting the SLL data from the Asimov pipeline.

Data_BaseType = Ms.Qos.IncomingServiceRequest or IncomingServiceRequest
CloudEnvironment - Excludes dev, int, ppe, non-prod, nonprod, sandbox, perf, test (exact match, or contained with a "-" prefix). Events with an empty CloudEnvironment are kept.
If cC is empty, ScScenario is used.
If both cC and ScScenario are empty, OperationName is used, from the list provided by the OMS team.

Test traffic is excluded using the testHeader, quoteIsTest and isTest fields. Only events that satisfy the following condition are considered: string.IsNullOrEmpty(testHeader) && (string.IsNullOrEmpty(quoteIsTest) || quoteIsTest.ToLowerInvariant() != "true") && (string.IsNullOrEmpty(isTest) || isTest.ToLowerInvariant() != "true")

WHERE (data_baseType == "Ms.Qos.IncomingServiceRequest" OR data_baseType == "IncomingServiceRequest")
    AND (
        string.IsNullOrEmpty(cloudEnvironment)
        OR (
            cloudEnvironment.ToLowerInvariant() != "dev" AND
            cloudEnvironment.ToLowerInvariant() != "int" AND
            cloudEnvironment.ToLowerInvariant() != "ppe" AND
            cloudEnvironment.ToLowerInvariant() != "non-prod" AND
            cloudEnvironment.ToLowerInvariant() != "nonprod" AND
            cloudEnvironment.ToLowerInvariant() != "sandbox" AND
            cloudEnvironment.ToLowerInvariant() != "perf" AND
            cloudEnvironment.ToLowerInvariant() != "test" AND
            !cloudEnvironment.ToLowerInvariant().Contains("-dev") AND
            !cloudEnvironment.ToLowerInvariant().Contains("-int") AND
            !cloudEnvironment.ToLowerInvariant().Contains("-ppe") AND
            !cloudEnvironment.ToLowerInvariant().Contains("-sandbox") AND
            !cloudEnvironment.ToLowerInvariant().Contains("-perf") AND
            !cloudEnvironment.ToLowerInvariant().Contains("-test")
        )
        OR (
            cloudEnvironment.ToLowerInvariant().Contains("prod") AND
            !cloudEnvironment.ToLowerInvariant().Contains("non-prod") AND
            !cloudEnvironment.ToLowerInvariant().Contains("nonprod")
        )
    )
    AND (
        string.IsNullOrEmpty(testHeader) &&
        (string.IsNullOrEmpty(quoteIsTest) || quoteIsTest.ToLowerInvariant() != "true") &&
        (string.IsNullOrEmpty(isTest) || isTest.ToLowerInvariant() != "true")
    )
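For unit-testing the filter logic outside the pipeline, the WHERE clause above can be mirrored as a small Python predicate. This is a sketch rather than the production filter: the function names are hypothetical, events are assumed to be dictionaries keyed by the column names, and Python's lower() stands in for ToLowerInvariant().

```python
EXACT_EXCLUDED = {"dev", "int", "ppe", "non-prod", "nonprod", "sandbox", "perf", "test"}
CONTAINS_EXCLUDED = ("-dev", "-int", "-ppe", "-sandbox", "-perf", "-test")
BASE_TYPES = {"Ms.Qos.IncomingServiceRequest", "IncomingServiceRequest"}


def is_empty(value):
    """Equivalent of string.IsNullOrEmpty."""
    return value is None or value == ""


def environment_passes(cloud_environment):
    """Mirror the three OR branches of the cloudEnvironment condition."""
    if is_empty(cloud_environment):
        return True
    env = cloud_environment.lower()
    not_excluded = env not in EXACT_EXCLUDED and not any(
        token in env for token in CONTAINS_EXCLUDED
    )
    looks_like_prod = "prod" in env and "non-prod" not in env and "nonprod" not in env
    return not_excluded or looks_like_prod


def is_not_test_traffic(test_header, quote_is_test, is_test):
    """Mirror the testHeader / quoteIsTest / isTest condition."""
    def not_true(value):
        return is_empty(value) or value.lower() != "true"
    return is_empty(test_header) and not_true(quote_is_test) and not_true(is_test)


def keep_event(event):
    """Return True when an event would pass the whole WHERE clause."""
    return (
        event.get("data_baseType") in BASE_TYPES
        and environment_passes(event.get("cloudEnvironment"))
        and is_not_test_traffic(
            event.get("testHeader"), event.get("quoteIsTest"), event.get("isTest")
        )
    )
```

Note that, as the clause is written, a value such as "prod-test" is kept: it fails the exclusion branch but passes the final branch because it contains "prod".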

Calculations

Calculating ScenarioId: The scenario ID is taken from the first available source, in this priority order: Ms.b.tel.ScenarioId -> ScScenario -> OperationName
Determining RequestStatus: if RequestStatus < 5, status = success; otherwise (RequestStatus >= 5 or empty), status = failure
Calculating Availability: Availability = 100 * SuccessfulRequests / TotalRequests
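The three calculations above can be sketched in Python as follows. The function names are hypothetical, RequestStatus is assumed to be an integer or a numeric string, and returning None for zero total requests is an assumption, since that case is not covered here.

```python
def resolve_scenario_id(cc_scenario, sc_scenario, operation_name):
    """Return the first non-empty value in priority order, or None."""
    for candidate in (cc_scenario, sc_scenario, operation_name):
        if candidate:
            return candidate
    return None


def is_success(request_status):
    """Success when RequestStatus < 5; an empty status or a value >= 5 is a failure."""
    if request_status is None or request_status == "":
        return False
    return int(request_status) < 5


def availability(successful_requests, total_requests):
    """Availability as a percentage; None when there were no requests (assumed)."""
    if total_requests == 0:
        return None
    return 100.0 * successful_requests / total_requests


scenario = resolve_scenario_id("", "ScCheckout", "GetCart")
percent = availability(successful_requests=995, total_requests=1000)
```

For example, 995 successful requests out of 1000 give an availability of 99.5, and an event with an empty Correlation Context scenario falls back to ScScenario.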