Best practices when working with datasets
A dataset is the product of the data builder, and it is the interface through which end users analyze data. Therefore, you should take extra care when building datasets.
What makes a good dataset
Understanding the basics of building a dataset is easy: "just join your data models together by setting up relationships and paths!"
But there are three things you should be optimizing for when building your dataset:
- Correctness: Do reports built on top of this dataset return correct data? This is the very lowest bar for design, as Holistics makes it fairly easy to get correct results as long as your join conditions are correct.
- Performance: While there may be multiple ways to get Holistics to write a query, there will typically be one way that has the optimal performance. Designing your dataset with performance in mind can be the difference between 15-second and 15-minute dashboard load times.
- Usability: When a user who is not an expert in your data opens your dataset to explore it, it should be immediately clear how to get the answers they're looking for.
There are no hard-and-fast rules for developing datasets effectively in Holistics, but here are some recommendations:
Best Practices
Organizing Datasets in a logical manner
Here are some typical ways of organizing datasets:
- Master datasets: grouped by entity type (customers, inventory, products). The goal is completeness.
- Department datasets: grouped by team. The goal is relevance. Ask yourself these questions before building one:
  - "If I'm from this department, how can I find my data quickly?"
  - "Is there a dataset folder that my department will access frequently?"
Don't worry about duplicating datasets (there is no harm in exposing the same data model fields to different teams). Focus on accessibility and purpose.
Make it clear when datasets are not meant to be explored (for example, group them into a dedicated folder).
Start small, then expand the dataset
There seem to be two common use cases:
- Building "small" datasets with only a few models that allow users to deep dive into one aspect of the business
- Building “large” datasets that can be used for complex exploration
The best advice we can give is to keep multiple small datasets for "low-requirement" business users, and to create large datasets where more analytics-savvy users can do complex, ad-hoc exploration.
Start out with small datasets that answer specific questions
- Design each dataset to answer a specific set of questions. Anticipate your end users' questions and build your dataset around them, including only the models relevant to those questions (see the sketch after this list).
- It is advisable to join 4–5 models per small dataset; 6–7 is the absolute upper limit. If you find yourself needing more, you probably need to model the underlying data more effectively. This is because:
  - Too many models will confuse end users
  - Too many models are more likely to generate complicated, non-optimal joins
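To make this concrete, here is a minimal sketch of what a small, question-focused dataset could look like in Holistics' AML syntax. All model, field, and data source names (orders, order_items, users, products, demo_db) are hypothetical, and property names may vary by Holistics version, so check the AML reference before reusing this shape:

```
// Minimal sketch of a small dataset (assumed model and source names).
// Scope: revenue over time, orders per customer, top products.
Dataset sales_overview {
  label: 'Sales Overview'
  description: 'Answers: revenue over time, orders per customer, top products.'
  data_source_name: 'demo_db'

  // Only the four models these questions need, and nothing more.
  models: [orders, order_items, users, products]

  // Each relationship reads many-to-one, left to right.
  relationships: [
    relationship(order_items.order_id > orders.id, true),
    relationship(orders.user_id > users.id, true),
    relationship(order_items.product_id > products.id, true)
  ]
}
```

Keeping the model list this short is what keeps both the generated join logic and the end-user experience simple.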
Create large datasets for more complex requirements or ad-hoc exploration
- As the company grows, so should the datasets, in order to answer more of the questions business users might have. Datasets naturally grow as the business asks more questions. Continue to include more models in one dataset as business users become more confident with data exploration.
- One typical approach is to create a large dataset for each business activity (inventory, sales, marketing, risk), joining 10–15 tables (or more) in each dataset. Navigation becomes very simple: sales users know where to find sales data, and marketing users know where to find marketing data.
- However, there is a trade-off between dataset flexibility and performance, and a large dataset can contain many incompatible fields, which makes it frustrating to figure out which fields can be used together. Here are a few ways to keep these large datasets under control:
  - Use self-explanatory names for data models and fields
  - Enrich data models and fields with metadata, as sketched below
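As an illustration, a metadata-enriched model might look like the following AML sketch. The table, field names, and descriptions are assumptions; the point is that `label` and `description` make fields self-explanatory in the exploration UI:

```
// Sketch of metadata enrichment (assumed table and field names).
Model orders {
  type: 'table'
  label: 'Orders'
  description: 'One row per order. Joins to Users via user_id.'
  data_source_name: 'demo_db'  // assumed source name
  table_name: 'public.orders'

  dimension status {
    label: 'Order Status'
    type: 'text'
    description: 'One of: pending, shipped, delivered, cancelled.'
    definition: @sql {{ #SOURCE.status }};;
  }
}
```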
Avoid complex join paths
- Build your dataset as a "star schema": one central table (typically a fact table) that contains the necessary measures, surrounded by dimension tables.
- Avoid "snowflake-schema-like" datasets: in a snowflake schema, additional information about an entity or dimension is split into separate tables, so you need more models in the dataset and more joins to get complete information about a given entity.
- Ideally, set up your models so that the dimensions and measures you need are only one JOIN away. A join chain that is too long will not be performant.
- One good way to reduce the joins generated at dataset run time is to prejoin / denormalize your data as much as possible, as in the sketch below.
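For instance, a prejoined fact model could be sketched as an AML query model like the one below (table and column names assumed). The denormalized fact keeps every dimension table exactly one join away:

```
// Sketch: pre-joining order_items with orders into one denormalized
// fact model, so each dimension table is only one join away.
Model fact_order_items {
  type: 'query'
  label: 'Fact: Order Items'
  data_source_name: 'demo_db'  // assumed source name
  query: @sql
    SELECT
      i.id                      AS order_item_id,
      i.product_id,                 -- one join to the products dimension
      o.user_id,                    -- one join to the users dimension
      o.created_at              AS order_created_at,
      i.quantity,
      i.quantity * i.unit_price AS line_amount
    FROM order_items i
    JOIN orders o ON o.id = i.order_id
  ;;
  // Dimension definitions for these columns are omitted for brevity.
}
```

If the prejoin itself is expensive, consider materializing it upstream in your warehouse instead of computing it at query time.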
Other Notes
- You can create any number of datasets, because Holistics datasets have two benefits:
- No need to worry about report sprawl. All definitions are centralized within Holistics.
- All Holistics datasets are virtual and they don't take up physical storage space.
- Common Challenges:
- Data exploration offers flexibility for business users, but there is the risk of dragging in incompatible dimensions and metrics, or adding non-additive metrics.
- Beware of fan-out issues: joining a model that has multiple matching rows can inflate aggregates such as SUM and COUNT (see the sketch below).
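To illustrate the fan-out risk with a hypothetical example: if each order has several order items, joining the two and summing an order-level amount double-counts it, as this AML query-model sketch (assumed names) shows:

```
// Sketch of a fan-out: orders.total_amount repeats once per matching
// order_items row, so summing it over this join inflates the result.
Model orders_fanout_demo {
  type: 'query'
  label: 'Fan-out Demo (do not sum total_amount here)'
  data_source_name: 'demo_db'  // assumed source name
  query: @sql
    SELECT
      o.id           AS order_id,
      o.total_amount,        -- order-level value, duplicated per item row
      i.product_id
    FROM orders o
    JOIN order_items i ON i.order_id = o.id
    -- SUM(o.total_amount) over this result returns roughly
    -- (order amount x number of items), not real revenue.
  ;;
}
```

One way to avoid this is to define order-level measures on the orders model itself rather than on a joined, fanned-out result.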