In Part 1 of this article we explored the basics of collecting user data. In the end, we also came up with a scalable process:
Of course, there are downsides to this approach. If any team can track any number of events, there is no single source of truth regarding which event types are tracked and the circumstances in which they are triggered. This approach also creates confusion when onboarding new Product Managers and Data Analysts, since they need at least a bird's-eye view of the whole system before they can start answering business-related questions. Defining our own event types on a team-by-team basis, without any visibility to the outside, also causes teams to become too self-contained. The analytical knowledge exists in a silo instead of being distributed across teams.
To analyse the data, one has to ask the data warehouse a question in a particular manner. Doing so is called making a query (Fig. 1) - essentially a couple of lines of code. Depending on the complexity of the question, writing a query can take a couple of hours. In our situation, the independent teams will write their queries as they go, without knowing whether a similar query has already been written by someone else. This is not very efficient. So, how do we deal with the consequences?
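To make "a couple of lines of code" concrete, here is a minimal sketch of such a query. It uses Python's built-in sqlite3 as a stand-in for a real data warehouse; the `events` table, its columns, and the event names are illustrative assumptions, not our actual schema.

```python
import sqlite3

# Stand-in for a data warehouse: an in-memory SQLite database.
# The `events` table and its columns are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_type TEXT, user_id TEXT, occurred_at TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("signup_completed", "u1", "2021-03-01"),
        ("signup_completed", "u2", "2021-03-02"),
        ("checkout_started", "u1", "2021-03-02"),
    ],
)

# The business question, phrased as a query:
# how many distinct users triggered each event type?
query = """
    SELECT event_type, COUNT(DISTINCT user_id) AS users
    FROM events
    GROUP BY event_type
    ORDER BY event_type
"""
for event_type, users in conn.execute(query):
    print(event_type, users)
```

The query itself is only a few lines; the hard part, as described above, is knowing whether a colleague has already written an equivalent one.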
{{Divider}}
On the surface it may appear we've moved away from our initial quest - enabling everyone to come up with an insight. In reality, we are only two small steps away from our goal. What we need now is a single source of truth that allows the whole organization to speak the same language, and a shared, searchable repository of all queries that is easily accessible and open to contribution from everyone.
To address the latter, we have already found a great solution: the Databricks service. We keep all our queries there, shared and searchable, with an overview of the table formats and column names. To address the former, we have started an online Data Guide, written by the best technical writer we could find, so that the whole company shares the same vocabulary when writing queries.
Nevertheless, some questions still lingered:
The problem is that the answers to these questions will change faster than we can update our Data Guide. The manual solution does not scale, and, frankly, it would be almost unethical to employ anyone for such a Sisyphean task.
To avoid this problem, we instead implemented a system that generates documentation automatically. One tool creates a list of all the event types tracked in the last two weeks, along with all of the attributes they contain. Another tool matches the definitions with those attributes and outputs the result to a file, one per event type. Ultimately these files end up as pages in our Data Guide (Fig. 2).
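The generation pipeline described above can be sketched in a few lines. Everything here is hypothetical: the event names, attribute names, definitions, and page format are placeholders, and in the real pipeline the recent events come from the warehouse rather than a hardcoded list.

```python
from collections import defaultdict

# Hypothetical raw events observed over the last two weeks; in the real
# pipeline these would be pulled from the data warehouse.
recent_events = [
    {"type": "signup_completed", "plan": "free", "referrer": "ads"},
    {"type": "signup_completed", "plan": "pro"},
    {"type": "checkout_started", "cart_value": 42},
]

# Hand-written definitions that the generator matches against what was
# actually tracked (names and wording are illustrative).
definitions = {
    "signup_completed": "Fired when a user finishes registration.",
    "checkout_started": "Fired when a user opens the checkout page.",
}

# Step 1: collect every attribute seen per event type.
attributes = defaultdict(set)
for event in recent_events:
    attributes[event["type"]].update(k for k in event if k != "type")

# Step 2: render one documentation page per event type.
def render_page(event_type):
    attrs = ", ".join(sorted(attributes[event_type]))
    desc = definitions.get(event_type, "No definition yet.")
    return f"# {event_type}\n\n{desc}\n\nAttributes: {attrs}\n"

pages = {event_type: render_page(event_type) for event_type in attributes}
print(pages["checkout_started"])
```

In practice each rendered page would be written to its own file and published into the Data Guide, so the documentation refreshes itself as tracking changes.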
But why write when you can show? The possibilities are endless, especially if you employ a screenshot tool that snaps a gif of the website whenever it detects an event being fired. That way, all of the event triggers become clear even to those with little technical knowledge, enabling them to contribute to the generation of insights and analysis.
One of our ultimate goals is to enable our company to scale 10x as efficiently as possible. There is no way to reach this goal without perspectives from as many diverse viewpoints as we can gather. With this data analytics setup, we believe we are enabling everyone in the company today to provide us with an insight that might change the face of the industry tomorrow.
{{Divider}}