
Data Modelling Interview Questions and Answers

By: Sanjay Prajapat

Last Updated: May 27th, 2026

Read Time: 10:00 Minutes

1. Data Modelling Interview Questions for Freshers

Q1. What is data modelling and why is it important?

Q2. What is the difference between primary and foreign keys?

Q3. What is an Entity-Relationship Diagram (ERD)?

Q4. What is normalization in data modelling?

Q5. What is the difference between a relational data model and a NoSQL data model?

Q6. What is a schema in a database?

Q7. What is cardinality in data modelling?

Q8. What is the difference between a fact table and a dimension table?

Q9. What is a surrogate key and how is it different from a natural key?

Q10. What is a data dictionary?

2. Intermediate Data Modelling Interview Questions

Q1. What are the differences between a star schema and a snowflake schema?

Q2. What are the different types of slowly changing dimensions (SCD)?

Q3. What is denormalization and when would you use it?

Q4. What is the difference between OLTP and OLAP systems and how does data modelling differ for each?

Q5. What is referential integrity and how do you enforce it?

Q6. What is a bridge table and when do you use one?

Q7. What is a data vault model and how does it differ from a traditional data warehouse model?

Q8. How do you handle null values in a data model?

Q9. What is a composite key and when would you use it?

Q10. What are the best practices for naming conventions in data modelling?

3. Data Modelling Interview Questions for Experienced Professionals

Q1. How do you design a data model for a cloud-based data warehouse like Snowflake, BigQuery, or Redshift?

Q2. How do you design a data model that can handle historical data changes without losing information?

Q3. What is the role of data lineage in data modelling and how do you implement it?

Q4. How do you approach data modelling for a real-time streaming data system?

Q5. What is schema-on-read versus schema-on-write and when do you use each?

Q6. How do you handle many-to-many relationships in a large-scale data model?

Q7. How do you ensure data model performance at scale?

Q8. What is a galaxy schema and how does it differ from star and snowflake schemas?

Q9. How do you approach dimensional modelling for a multi-tenant SaaS application?

Q10. What is the role of a data modeller in a modern data mesh architecture?

4. Scenario-Based Data Modelling Interview Questions

Q1. You are designing a data model for an e-commerce platform. How would you structure the database?

Q2. Your company is migrating from a legacy on-premise Oracle database to a cloud data warehouse. How do you approach the data model redesign?

Q3. You discover that two teams in your organization are using different definitions for the same metric. How do you resolve this using data modelling?

Q4. You need to design a data model for a healthcare system that must follow HIPAA. What considerations do you include?

Q5. A business analyst tells you a report is running very slowly. The report joins five tables. How do you diagnose and fix the problem using data modelling?

Q6. You are building a data model for a financial institution that needs to track transactions across multiple currencies. How do you handle currency conversion?

Q7. Your team is adopting a lakehouse architecture using Delta Lake. How does your data modelling approach change?

Q8. A new business requirement asks you to add a new attribute to a dimension table that already has millions of records. How do you handle this change?

Q9. You are asked to design a data model to support machine learning feature engineering. What does this look like?

Q10. You are reviewing a junior team member's data model and notice they have used a text column to store dates. How do you explain the problem and guide them to fix it?

5. Wrapping Up

6. FAQs

1. What is the best way to respond to scenario-based data modeling queries?

2. Is the nature of data modeling questions different for experienced applicants?

3. Which data modeling interview tools should I be familiar with?

If you are preparing for a data analyst, data engineer, or database architect role, you are almost certainly going to face data modelling interview questions. These questions show up at every level. Beginners get asked about basic concepts. Senior candidates are tested on their ability to make informed architecture decisions and trade-offs. Scenario-based rounds test how you think when real problems land on your desk.

This guide covers 40 data modelling interview questions and answers across four levels: beginner, intermediate, experienced and scenario-based. Each answer is written clearly so you can understand it, remember it and explain it in your own words during the interview.

Data Modelling Interview Questions for Freshers

If you are applying for a junior data analyst, junior database developer, or entry-level BI role, expect these types of questions. Interviewers want to check whether you understand the foundational concepts of data modelling.

Q1. What is data modelling and why is it important?

Data modelling is the process of organizing and defining data structures before building a database or data system. You create a model that maps out what data exists, how it is organized and how different data entities relate to each other.

It is important because it reduces development errors, improves data quality, makes databases easier to maintain and ensures that the system actually meets business requirements. Without a proper data model, databases often become messy, inconsistent and hard to scale.

Q2. What is the difference between primary and foreign keys?

A primary key uniquely identifies each row in a table, while a foreign key links one table to another, helping maintain relationships and consistency across the database.

Feature	Primary Key	Foreign Key
Purpose	Uniquely identifies a record	Connects two tables
Uniqueness	Must be unique	Can have duplicates
Null Values	Cannot be null	Can be null
Table Scope	Exists in its own table	Refers to another table
Count per Table	Only one primary key	Multiple foreign keys allowed
Integrity Role	Ensures entity integrity	Ensures referential integrity
Example	StudentID in Students table	StudentID in Enrollments table

Q3. What is an Entity-Relationship Diagram (ERD)?

An Entity-Relationship Diagram, or ERD, is a visual diagram that represents the entities in a database and how they relate to each other. Entities are the main objects or concepts, such as Customer, Product, or Order. Each entity has attributes, which are the properties that describe it, like CustomerName or OrderDate. Relationships show how entities are connected — for example, a Customer places many Orders. ERDs are used during the conceptual and logical modelling phases to plan the structure of a database before actually building it.

Q4. What is normalization in data modelling?

Normalization is the process of organizing a database to reduce data redundancy and improve data integrity. You do this by dividing a large table into smaller, related tables and defining relationships between them. Normalization follows a series of rules called normal forms. The most commonly discussed are First Normal Form (1NF), Second Normal Form (2NF) and Third Normal Form (3NF). Each normal form builds on the previous one. The goal is to make sure each piece of data is stored in only one place, which makes updates easier and reduces the chance of inconsistent data.

Q5. What is the difference between a relational data model and a NoSQL data model?

Relational models store structured data in tables with fixed schema, while NoSQL models handle flexible, unstructured data using formats like documents, key-value pairs, or graphs.

Feature	Relational Model	NoSQL Model
Structure	Tables (rows & columns)	Flexible formats (JSON, key-value)
Schema	Fixed schema	Dynamic schema
Scalability	Typically vertical scaling	Typically horizontal scaling
Data Type	Structured data	Structured + unstructured
Relationships	Uses joins	Often avoids joins
Consistency	Strong consistency (ACID)	Often eventual consistency (tunable in some systems)
Examples	MySQL, PostgreSQL	MongoDB, Cassandra

Q6. What is a schema in a database?

A schema is a database blueprint. It defines the structure of the database including the tables, columns, data types, relationships, constraints, views and indexes. Think of it as the architecture plan before you build a house. A schema tells the database how data is organized and what rules apply to it. In SQL databases, you also use schemas to organize database objects into logical groups, which is especially useful in large systems with many tables.

Q7. What is cardinality in data modelling?

Cardinality describes the number of instances of one entity that can or must be associated with each instance of another entity. The four main cardinality types are one-to-one, one-to-many, many-to-one and many-to-many.

For example, in a school database, one Department offers many Courses, but each Course belongs to only one Department. That is a one-to-many relationship. One Student can enroll in many Courses, and one Course can have many Students, which makes the Student-Course relationship many-to-many. Understanding cardinality helps you design tables correctly and decide where foreign keys and junction tables should go.

Q8. What is the difference between a fact table and a dimension table?

Fact tables store measurable data for analysis, while dimension tables store descriptive details that give context to the facts in data warehousing systems.

Feature	Fact Table	Dimension Table
Purpose	Stores numeric data (metrics)	Stores descriptive data
Content	Sales, revenue, counts	Names, dates, categories
Size	Large (many records)	Smaller
Keys	Contains foreign keys	Contains primary keys
Usage	Used for calculations	Used for filtering/grouping
Data Type	Mostly numbers	Mostly text
Example	Sales amount table	Customer details table
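To make the distinction concrete, here is a minimal sketch in Python using the standard-library sqlite3 module with an in-memory database. The table names (fact_sales, dim_product), the columns and the sample rows are invented for illustration; a real warehouse would have more dimensions and far more rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension table: descriptive attributes used for filtering and grouping.
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    category     TEXT NOT NULL
);
-- Fact table: numeric measures plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    sale_id     INTEGER PRIMARY KEY,
    product_key INTEGER NOT NULL REFERENCES dim_product(product_key),
    quantity    INTEGER NOT NULL,
    amount      REAL NOT NULL
);
INSERT INTO dim_product VALUES
    (1, 'Laptop', 'Electronics'),
    (2, 'Phone',  'Electronics'),
    (3, 'Desk',   'Furniture');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 1000.0),
    (2, 2, 2, 1200.0),
    (3, 3, 1,  300.0);
""")

# Measures come from the fact table; the grouping label comes from the dimension.
sales_by_category = conn.execute("""
    SELECT d.category, SUM(f.amount) AS total_amount
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_key = f.product_key
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()

print(sales_by_category)
```

The query shows the usual division of labour: the fact table supplies the numbers to aggregate, and the dimension table supplies the attribute to group by.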

Q9. What is a surrogate key and how is it different from a natural key?

A surrogate key is a system-generated, artificial identifier for a record in a table. It has no business meaning. It is usually an auto-incrementing integer or a UUID. A natural key, on the other hand, is a key derived from actual business data, such as a Social Security Number, email address, or product code. Surrogate keys are preferred in data warehouses because they are stable even when the business changes the natural key. They also make join operations faster because they are typically short integers rather than long strings.
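The stability argument can be illustrated with a small sketch. It uses Python's built-in sqlite3 module; the customer and orders tables, the customer_key column and the email addresses are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key, no business meaning
    email        TEXT NOT NULL UNIQUE                -- natural key, may change over time
);
CREATE TABLE orders (
    order_id     INTEGER PRIMARY KEY,
    customer_key INTEGER NOT NULL REFERENCES customer(customer_key)
);
""")

cur = conn.execute("INSERT INTO customer (email) VALUES (?)", ("ana@example.com",))
customer_key = cur.lastrowid
conn.execute("INSERT INTO orders (order_id, customer_key) VALUES (?, ?)", (100, customer_key))

# The business changes the natural key; no order rows need to be touched.
conn.execute(
    "UPDATE customer SET email = ? WHERE customer_key = ?",
    ("ana.new@example.com", customer_key),
)

order_with_email = conn.execute("""
    SELECT o.order_id, c.email
    FROM orders AS o
    JOIN customer AS c ON c.customer_key = o.customer_key
""").fetchone()

print(order_with_email)
```

Because the orders table references the surrogate key, the email change is a single-row update and the join still resolves.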

Q10. What is a data dictionary?

A data dictionary is a centralized document or repository describing data in a database or data system. It contains information about each data element including its name, data type, size, format, description, allowed values and how it relates to other data elements. A data dictionary is an essential part of data governance. It helps developers, analysts and business users understand what data exists, what it means and how to use it correctly. It is especially useful in large organizations where many teams work with the same data.

Intermediate Data Modelling Interview Questions

These questions target candidates who have one to three years of experience in database design, data warehousing, or business intelligence. Interviewers expect you to go beyond definitions and show practical understanding.

Q1. What are the differences between a star schema and a snowflake schema?

Star schema uses a simple structure with direct links between fact and dimension tables. Snowflake schema normalizes dimensions into multiple related tables for better organization.

Feature	Star Schema	Snowflake Schema
Structure	Simple, single-level	Complex, multi-level
Design	Denormalized	Normalized
Performance	Faster queries	Slower due to joins
Complexity	Easy to understand	More complex
Storage	Uses more space	Saves space
Joins	Fewer joins	More joins
Use Case	Quick reporting	Detailed data modeling

Q2. What are the different types of slowly changing dimensions (SCD)?

Slowly changing dimensions (SCDs) are techniques for handling changes in dimension data over time. The common types are:

Type 1: overwrites old data, no history kept

Type 2: adds new rows to maintain full history

Type 3: stores limited history in new columns

Type 4: uses separate history table

Type 6: hybrid of Types 1, 2, and 3

Q3. What is denormalization and when would you use it?

Denormalization is the intentional process of adding redundancy to a database that was previously normalized. You merge tables or add duplicate data to reduce the number of joins needed at query time.

You use denormalization when query performance is a higher priority than storage efficiency. This is common in data warehouses, read-heavy reporting systems and OLAP databases.

For example, instead of joining a Sales table with a Customer table every time you run a report, you might store the customer name directly in the Sales table. The trade-off is that updates become more complex because you have to update data in multiple places.

Q4. What is the difference between OLTP and OLAP systems and how does data modelling differ for each?

OLTP systems handle daily transactions quickly, while OLAP systems are designed for data analysis and reporting, requiring different data modeling approaches for performance and usability.

Feature	OLTP	OLAP
Purpose	Transaction processing	Data analysis
Data	Current, detailed data	Historical, summarized data
Queries	Simple, fast queries	Complex queries
Users	End users (apps)	Analysts, managers
Design	Normalized schema	Denormalized schema
Speed Focus	Insert/update speed	Query performance
Example	Banking system	Data warehouse

Q5. What is referential integrity and how do you enforce it?

Referential integrity is a set of rules that ensures the relationships between tables in a relational database remain consistent. It means that if a foreign key value exists in one table, the corresponding primary key value must exist in the referenced table. You enforce referential integrity by defining foreign key constraints in your database. Most relational databases like PostgreSQL, MySQL and SQL Server support ON DELETE and ON UPDATE rules. These let you control what happens to child records when a parent record is deleted or updated. Options include CASCADE, SET NULL, SET DEFAULT and RESTRICT.
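As a rough illustration, the sketch below defines a foreign key with ON DELETE CASCADE and shows both behaviours: an orphan insert is rejected, and deleting the parent removes its children. It uses Python's built-in sqlite3 module, where foreign key enforcement has to be switched on per connection; the customer and orders tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this pragma is enabled on the connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
        REFERENCES customer(customer_id) ON DELETE CASCADE
);
INSERT INTO customer VALUES (1, 'Ana');
INSERT INTO orders VALUES (10, 1);
""")

# An order pointing at a customer that does not exist violates the constraint.
try:
    conn.execute("INSERT INTO orders VALUES (11, 999)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Deleting the parent row cascades to its child rows.
conn.execute("DELETE FROM customer WHERE customer_id = 1")
remaining_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

print(orphan_rejected, remaining_orders)
```

Swapping CASCADE for RESTRICT would instead block the delete while child rows exist, which is often the safer default for transactional data.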

Q6. What is a bridge table and when do you use one?

A bridge table, also called a junction table or associative table, is used to resolve a many-to-many relationship between two entities. Because you cannot directly represent a many-to-many relationship in a relational database without data duplication, you create a third table that holds the foreign keys from both tables. For example, if a Student can enroll in many Courses and a Course can have many Students, you create a StudentCourse bridge table with columns for StudentID and CourseID. In data warehousing, bridge tables are also used to handle multi-valued dimensions, such as a customer belonging to multiple market segments.
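The Student and Course example can be sketched as follows, using Python's built-in sqlite3 module. The snake_case names (student_course, student_id, course_id) are an illustrative rendering of the StudentCourse table described above, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE student (student_id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT NOT NULL);
-- Bridge table: one row per (student, course) pair.
CREATE TABLE student_course (
    student_id INTEGER NOT NULL REFERENCES student(student_id),
    course_id  INTEGER NOT NULL REFERENCES course(course_id),
    PRIMARY KEY (student_id, course_id)
);
INSERT INTO student VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO course  VALUES (10, 'SQL'), (20, 'Statistics');
INSERT INTO student_course VALUES (1, 10), (1, 20), (2, 10);
""")

# Navigate the relationship in one direction: courses for a student.
courses_for_ana = [row[0] for row in conn.execute("""
    SELECT c.title
    FROM student_course AS sc
    JOIN course AS c ON c.course_id = sc.course_id
    WHERE sc.student_id = 1
    ORDER BY c.title
""")]

# And in the other direction: students in a course.
students_in_sql = [row[0] for row in conn.execute("""
    SELECT s.name
    FROM student_course AS sc
    JOIN student AS s ON s.student_id = sc.student_id
    WHERE sc.course_id = 10
    ORDER BY s.name
""")]

print(courses_for_ana, students_in_sql)
```

Attributes that describe the relationship itself, such as an enrollment date or grade, would be added as extra columns on the bridge table.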

Q7. What is a data vault model and how does it differ from a traditional data warehouse model?

A Data Vault is a modern data warehouse modelling methodology designed for flexibility, scalability and auditability. It organizes data into three core types of tables: Hubs, Links and Satellites. Hubs store unique business keys. Links capture relationships between business keys. Satellites store all the descriptive and historical data. Unlike a traditional star or snowflake schema, Data Vault is highly normalized and designed to handle changing business rules without breaking the entire model. It is well-suited for large enterprises with complex, evolving data sources. The trade-off is that querying a Data Vault is more complex, so you typically build a presentation layer on top of it.

Q8. How do you handle null values in a data model?

Handling null values requires both design decisions and data quality rules. First, you decide at the schema level whether a column should be nullable or not. If a column should always have a value, you mark it as NOT NULL. For nullable columns, you document what a null value means in that context, because null can mean different things: missing data, not applicable, or unknown. In data warehouses, it is common to replace nulls with default values like 'Unknown', 0, or 'N/A' to simplify aggregation and reporting. You also handle nulls in ETL pipelines using COALESCE or ISNULL functions to substitute default values during data loading.
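A minimal sketch of the COALESCE substitution is shown below, run through Python's built-in sqlite3 module. The staging table stg_customer and its columns are hypothetical, and the choice of 'Unknown' and 0 as defaults is only one possible convention.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stg_customer (
    customer_id    INTEGER PRIMARY KEY,  -- implicitly NOT NULL as the key
    segment        TEXT,                 -- nullable: segment may be unknown
    lifetime_spend REAL                  -- nullable: no purchases recorded yet
);
INSERT INTO stg_customer VALUES (1, 'Retail', 120.5), (2, NULL, NULL);
""")

# COALESCE returns its first non-null argument, so nulls become explicit defaults.
cleaned = conn.execute("""
    SELECT customer_id,
           COALESCE(segment, 'Unknown') AS segment,
           COALESCE(lifetime_spend, 0)  AS lifetime_spend
    FROM stg_customer
    ORDER BY customer_id
""").fetchall()

print(cleaned)
```

Note that replacing a null amount with 0 changes averages, so it is worth documenting in the data dictionary which columns use this rule.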

Q9. What is a composite key and when would you use it?

A composite key is a primary key that consists of two or more columns working together to uniquely identify a row. You use a composite key when no single column is unique on its own but the combination of columns is unique. For example, in an OrderItems table, neither OrderID alone nor ProductID alone uniquely identifies a row, because one order can contain several products and the same product can appear in many orders. But the combination of OrderID and ProductID uniquely identifies each line item, assuming a product appears at most once per order. Composite keys are common in bridge tables and historical tables. The downside is that they can make joins more complex compared to using a single surrogate key.
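Here is a short sketch of a composite primary key in action, using Python's built-in sqlite3 module. The order_item table and its values are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_item (
    order_id   INTEGER NOT NULL,
    product_id INTEGER NOT NULL,
    quantity   INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id)   -- the pair must be unique
);
INSERT INTO order_item VALUES (1, 100, 2);  -- order 1, product 100
INSERT INTO order_item VALUES (1, 200, 1);  -- same order, different product
INSERT INTO order_item VALUES (2, 100, 5);  -- same product, different order
""")

# Repeating an existing (order_id, product_id) pair violates the composite key.
try:
    conn.execute("INSERT INTO order_item VALUES (1, 100, 3)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM order_item").fetchone()[0]
print(duplicate_rejected, row_count)
```

Each column repeats freely on its own; only the combination is constrained.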

Q10. What are the best practices for naming conventions in data modelling?

Good naming conventions make a data model readable and maintainable. Use clear, descriptive names that reflect the business meaning of the data. Be consistent with casing: most teams use snake_case for SQL or PascalCase for object-oriented models. Use singular nouns for table names, e.g., Customer instead of Customers. Prefix foreign keys with the referenced table name, e.g., CustomerID in the Order table. Avoid abbreviations unless they are industry-standard. Document your naming conventions in a data dictionary so every team member follows the same rules. Consistency matters more than any specific style choice.


Data Modelling Interview Questions for Experienced Professionals

These questions are for senior data modellers, data architects and lead data engineers. Interviewers expect you to discuss design decisions, trade-offs, large-scale architecture and governance. They want to see how you think, not just what you know.

Q1. How do you design a data model for a cloud-based data warehouse like Snowflake, BigQuery, or Redshift?

Cloud data warehouses have different performance characteristics than traditional on-premise systems. In cloud systems, storage is cheap and compute is the main cost driver. This changes how you think about normalization. For example, in Snowflake and BigQuery, wide denormalized tables often perform better than many normalized small tables because columnar storage and distributed compute make large reads efficient. You also design with clustering keys and partitioning in mind. In BigQuery, you partition tables by date or another key field to reduce query scan costs. In Snowflake, you use clustering keys for columns that are frequently filtered. You also pay close attention to data types because storing large strings or redundant columns adds to both storage and compute costs.

Pro Tip: Mention the platform you have actually worked with. Interviewers value hands-on experience over theoretical knowledge for this level of question.

Q2. How do you design a data model that can handle historical data changes without losing information?

The standard approach for tracking historical changes is SCD Type 2. You add a new row for each change with a StartDate, EndDate and an IsCurrent flag. This preserves the full history of the record while making it easy to query the current state or the historical state at any point in time. For operational databases, you can use temporal tables, which are a feature in SQL Server and other modern databases. Temporal tables automatically track row history with a system time period. In a Data Vault model, Satellites naturally capture history because you insert new records rather than updating existing ones. The right approach depends on your storage constraints, query patterns and the complexity of changes you expect.
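The SCD Type 2 pattern can be sketched roughly as below, using Python's built-in sqlite3 module. The dim_customer table, the apply_change helper and the sample cities are hypothetical, and dates are stored as ISO-8601 text only because SQLite has no native date type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key per version
    customer_id  TEXT NOT NULL,                      -- business key
    city         TEXT NOT NULL,                      -- tracked attribute
    start_date   TEXT NOT NULL,
    end_date     TEXT,                               -- NULL while the row is current
    is_current   INTEGER NOT NULL
)""")

def apply_change(conn, customer_id, city, change_date):
    """Close the current version (if the city changed) and insert a new one."""
    current = conn.execute(
        "SELECT customer_key, city FROM dim_customer "
        "WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    if current is not None and current[1] == city:
        return  # nothing changed, keep the existing version
    if current is not None:
        conn.execute(
            "UPDATE dim_customer SET end_date = ?, is_current = 0 "
            "WHERE customer_key = ?",
            (change_date, current[0]),
        )
    conn.execute(
        "INSERT INTO dim_customer (customer_id, city, start_date, end_date, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, city, change_date),
    )

apply_change(conn, "C1", "Pune", "2024-01-01")
apply_change(conn, "C1", "Mumbai", "2024-06-01")

history = conn.execute(
    "SELECT city, start_date, end_date, is_current FROM dim_customer "
    "WHERE customer_id = 'C1' ORDER BY start_date"
).fetchall()

# Point-in-time lookup: which version was valid on a given date?
as_of = "2024-03-15"
city_in_march = conn.execute(
    "SELECT city FROM dim_customer "
    "WHERE customer_id = ? AND start_date <= ? AND (end_date IS NULL OR end_date > ?)",
    ("C1", as_of, as_of),
).fetchone()[0]

print(history, city_in_march)
```

The old row is never overwritten, so both the current state (is_current = 1) and any historical state remain queryable.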

Q3. What is the role of data lineage in data modelling and how do you implement it?

Data lineage tracks where data comes from, how it moves through systems and how it transforms along the way. In a data model, lineage documentation tells you which source systems feed which tables, which ETL transformations are applied and how downstream reports or models depend on upstream data. You implement data lineage using several approaches. Metadata tools like Apache Atlas, Alation, or dbt's built-in lineage graphs automatically capture lineage as part of the data pipeline. In custom solutions, you build a lineage table that records the source system, source table, transformation logic, load timestamp and target table for every data load. Data lineage is critical for data governance, impact analysis and debugging data quality issues.

Q4. How do you approach data modelling for a real-time streaming data system?

Real-time streaming data systems require a different modelling approach compared to batch-oriented data warehouses. You typically work with event-driven models where each record represents a discrete event with a timestamp, event type and payload. In Apache Kafka-based systems, you model data as topics and events rather than as tables and rows. When you land streaming data into a storage layer like a data lake or a lakehouse, you use append-only tables partitioned by event time. You avoid complex joins in real time because they add latency. Instead, you denormalize data at ingestion time or use stateful stream processing frameworks like Apache Flink or Spark Structured Streaming to enrich events on the fly. The key design challenge is balancing latency, throughput and consistency.

Q5. What is schema-on-read versus schema-on-write and when do you use each?

Schema-on-write means you define the structure of your data before writing it to storage. Traditional relational databases use this approach. You create tables with defined columns and data types and data must conform to the schema when it is inserted. This enforces data quality at the entry point. Schema-on-read means you store raw data without a predefined schema and apply the structure when you read and query the data. Data lakes typically use this approach. It gives you flexibility to store diverse data formats and apply different schemas depending on the use case. The downside is that data quality issues are discovered later, at query time. Modern lakehouse table formats like Delta Lake and Apache Iceberg blend both approaches by adding schema enforcement and metadata management on top of a data lake.
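The contrast can be sketched in a few lines of Python. This is a toy illustration, not how a data lake engine works internally: the raw_events list stands in for files in a lake, a sqlite3 table with CHECK constraints stands in for a schema-on-write store, and all names are hypothetical.

```python
import json
import sqlite3

# Two raw records as they might arrive from a source; the second is malformed.
raw_events = [
    '{"user_id": 1, "amount": 9.5}',
    '{"user_id": "two", "amount": "n/a"}',
]

# Schema-on-write: the table definition rejects bad records at insert time.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE payment (
    user_id INTEGER NOT NULL CHECK (typeof(user_id) = 'integer'),
    amount  REAL    NOT NULL CHECK (typeof(amount) = 'real')
)""")

rejected_on_write = 0
for line in raw_events:
    record = json.loads(line)
    try:
        conn.execute(
            "INSERT INTO payment VALUES (?, ?)",
            (record["user_id"], record["amount"]),
        )
    except sqlite3.IntegrityError:
        rejected_on_write += 1

# Schema-on-read: everything is stored as-is; structure is applied when querying.
def read_amounts(lines):
    valid, invalid = [], 0
    for line in lines:
        record = json.loads(line)
        try:
            valid.append(float(record["amount"]))
        except (TypeError, ValueError):
            invalid += 1  # the quality problem only surfaces now, at read time
    return valid, invalid

valid_amounts, invalid_on_read = read_amounts(raw_events)
print(rejected_on_write, valid_amounts, invalid_on_read)
```

Both paths find the same bad record; the difference is when it is found and who has to deal with it.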

Q6. How do you handle many-to-many relationships in a large-scale data model?

In a relational model, you resolve many-to-many relationships with a bridge or junction table that contains the primary keys of both entities plus any additional attributes that describe the relationship. In large-scale systems, you also evaluate whether the bridge table itself needs partitioning, indexing, or clustering based on query patterns. In a Data Vault model, a Link table captures many-to-many relationships between Hub tables. In dimensional modelling, if a dimension has a many-to-many relationship with the fact table, such as a customer belonging to multiple segments, you use a factless fact table or a bridge dimension table combined with a weighting factor. The right approach depends on the query patterns and whether you need to aggregate across the relationship.

Q7. How do you ensure data model performance at scale?

Performance at scale requires decisions at multiple layers. At the schema level, you choose between normalized and denormalized structures based on whether the system is write-heavy or read-heavy. At the physical level, you add appropriate indexes on columns that appear in WHERE clauses, JOIN conditions and ORDER BY clauses. In columnar databases, you select clustering keys based on the most common query filter patterns. You also use partitioning to limit the amount of data scanned per query. At the pipeline level, you use incremental loading instead of full reloads wherever possible. You monitor query execution plans regularly and work with database administrators to identify and resolve bottlenecks. Caching frequently accessed aggregations in materialized views also improves read performance significantly.

Q8. What is a galaxy schema and how does it differ from star and snowflake schemas?

A galaxy schema, also called a fact constellation schema, contains multiple fact tables that share one or more dimension tables. It is like having several stars connected through shared dimensions. For example, a company might have a Sales fact table and an Inventory fact table, both sharing a Date dimension table and a Product dimension table. Galaxy schemas are used when a business has multiple related business processes that share common context. They are more complex to design and maintain than a single star schema but are necessary when you need to integrate multiple subject areas in a unified data warehouse. They are common in enterprise data warehouses that serve many departments at once.

Q9. How do you approach dimensional modelling for a multi-tenant SaaS application?

Multi-tenant data modelling requires isolating tenant data for both security and performance. You have three main architectural choices. First, a shared database with a tenant ID column: you add a TenantID column to every table and use row-level security to restrict access. This is the most cost-effective approach but requires strict query filtering discipline. Second, a shared database with separate schemas: each tenant gets their own schema within the same database, giving better logical isolation but complicating maintenance. Third, separate databases per tenant: this gives the strongest isolation and performance predictability but increases operational overhead. For dimensional modelling, you include TenantID in all dimension and fact tables and make it part of the partition key or clustering key to ensure query performance does not degrade as the number of tenants grows.

Q10. What is the role of a data modeller in a modern data mesh architecture?

Data mesh is an architectural approach where data ownership is distributed to domain teams rather than centralized in a single data engineering team. In a data mesh, each domain team is responsible for its own data product, including its data model. As a data modeller in this environment, you serve as a domain expert who designs the data model for your team's domain and publishes it as a well-defined data product with clear contracts — including schemas, data quality SLAs and lineage documentation. You also define the canonical data model for your domain so that other teams can consume your data without needing to understand your internal system details. You work closely with data platform engineers who provide the infrastructure and tooling that makes self-serve data access possible across domains.

Scenario-Based Data Modelling Interview Questions

Scenario-based questions test your practical thinking. Interviewers give you a real-world situation and want to see how you break it down, what trade-offs you consider, and how you arrive at a design decision. There is rarely one correct answer. What matters is your reasoning process.

Q1. You are designing a data model for an e-commerce platform. How would you structure the database?

I would start by identifying the core business entities: Customer, Product, Order, OrderItem, Category, and Payment. Each Customer can place many Orders, so the relationship between Customer and Order is one-to-many. Each Order can contain many Products, and each Product can appear in many Orders, which is a many-to-many relationship. I would resolve this with an OrderItem table that holds the OrderID, ProductID, Quantity, and UnitPrice. Products belong to Categories, which can be hierarchical, so I would use a self-referencing Category table with a ParentCategoryID. For payment processing, I would create a Payment table linked to Order. For the data warehouse layer, I would build a Sales fact table with foreign keys to Date, Customer, Product, and Geography dimensions to support reporting and analytics.
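A minimal sketch of the operational part of this design is shown below, using Python's built-in sqlite3 module. The snake_case table names and sample rows are illustrative, and the Payment table and the warehouse layer are left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE category (
    category_id        INTEGER PRIMARY KEY,
    name               TEXT NOT NULL,
    parent_category_id INTEGER REFERENCES category(category_id)  -- self-reference
);
CREATE TABLE product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    category_id INTEGER NOT NULL REFERENCES category(category_id)
);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
);
-- Resolves the many-to-many relationship between orders and products.
CREATE TABLE order_item (
    order_id   INTEGER NOT NULL REFERENCES orders(order_id),
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    quantity   INTEGER NOT NULL,
    unit_price REAL NOT NULL,            -- price at the time of the order
    PRIMARY KEY (order_id, product_id)
);
INSERT INTO category VALUES (1, 'Electronics', NULL), (2, 'Laptops', 1);
INSERT INTO product  VALUES (100, 'Ultrabook', 2), (101, 'Laptop sleeve', 2);
INSERT INTO customer VALUES (1, 'Ana');
INSERT INTO orders   VALUES (500, 1);
INSERT INTO order_item VALUES (500, 100, 1, 900.0), (500, 101, 2, 25.0);
""")

order_total = conn.execute(
    "SELECT SUM(quantity * unit_price) FROM order_item WHERE order_id = 500"
).fetchone()[0]

# Walk up the category hierarchy from 'Laptops' to its root.
category_path = [row[0] for row in conn.execute("""
    WITH RECURSIVE chain(category_id, name, parent_category_id, depth) AS (
        SELECT category_id, name, parent_category_id, 0
        FROM category WHERE category_id = 2
        UNION ALL
        SELECT c.category_id, c.name, c.parent_category_id, chain.depth + 1
        FROM category AS c
        JOIN chain ON c.category_id = chain.parent_category_id
    )
    SELECT name FROM chain ORDER BY depth DESC
""")]

print(order_total, category_path)
```

Storing unit_price on the order item rather than reading it from the product keeps old orders correct when prices change later.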

Q2. Your company is migrating from a legacy on-premise Oracle database to a cloud data warehouse. How do you approach the data model redesign?

The first step is to audit the existing data model. I would document all tables, relationships, stored procedures, views, and business rules in the current system. Then I would identify which parts of the model are still relevant and which are legacy artifacts that no longer serve a purpose. Next, I would map the source data to target tables in the new cloud environment. I would evaluate whether to lift and shift the existing schema or redesign it using dimensional modelling principles for better analytics performance. Cloud data warehouses are columnar and distributed, so I would denormalize where appropriate and choose partitioning and clustering strategies based on query patterns. I would also plan for data type compatibility differences between Oracle and the target system, and involve business stakeholders to validate that the redesigned model still meets reporting requirements before going live.

Q3. You discover that two teams in your organization are using different definitions for the same metric. How do you resolve this using data modelling?

This is a data governance problem, and the solution starts with bringing both teams together to agree on a single business definition. I would facilitate a meeting where each team explains how they are calculating the metric and why. In most cases, both definitions have some validity but serve different purposes. Once we agree on a canonical definition, I would create a shared dimension or a centralized metrics layer that encodes the agreed definition as a calculated field or view. In modern data stacks, a tool like dbt is well suited to this because you can define the metric once in a metrics definition file and expose it consistently to all downstream consumers. I would also update the data dictionary to document the agreed definition so future team members do not reintroduce the problem.

Q4. You need to design a data model for a healthcare system that must follow HIPAA. What considerations do you include?

Healthcare data modelling requires both technical and regulatory thinking. First, I would classify all data elements by sensitivity level, separating Protected Health Information (PHI) from non-sensitive operational data. PHI fields like patient name, date of birth, SSN, and medical record numbers require encryption at rest and in transit. I would design the schema so that PHI is stored in separate tables with strict access controls, and I would use tokenization or hashing for PHI in any tables used for analytics. Audit logging is a HIPAA requirement, so I would design an audit trail table that captures who accessed or modified which records and when. I would also design for data minimization, meaning I would only collect and store the data that is necessary for each specific purpose. Role-based access control would be enforced at the database level, not just the application level.

Q5. A business analyst tells you a report is running very slowly. The report joins five tables. How do you diagnose and fix the problem using data modelling?

I would start by examining the query's execution plan to understand where the bottleneck is. If the execution plan shows a full table scan on a large table, I would check whether the correct indexes exist on the join columns and filter columns. If indexes are in place and the query is still slow, I would look at whether the join order is efficient and whether any implicit type conversions are happening on join columns, which can prevent the optimizer from using an index. From a data modelling perspective, if this report runs frequently and the underlying data does not change in real time, I would consider creating a materialized view or an aggregated summary table that pre-joins the five tables and stores the result. I would also check whether denormalizing one or two of the tables could reduce the join depth.
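The first diagnostic step can be sketched with SQLite's EXPLAIN QUERY PLAN, run here through Python's built-in sqlite3 module. The orders table, the index name idx_orders_customer and the generated rows are hypothetical, and the exact plan wording differs between SQLite versions and between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    amount      REAL NOT NULL
)""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, i % 100, 10.0) for i in range(1, 1001)],
)

def plan_for(conn, sql):
    """Return the human-readable plan steps for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the description

query = "SELECT * FROM orders WHERE customer_id = 42"

plan_before = plan_for(conn, query)   # no index on customer_id: full scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan_for(conn, query)    # the filter can now use the index

print(plan_before)
print(plan_after)
```

Comparing the plan before and after a change is the quickest way to confirm that an index or a model change actually altered how the query runs.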

Q6. You are building a data model for a financial institution that needs to track transactions across multiple currencies. How do you handle currency conversion?

I would store every transaction amount in both its original currency and a standardized base currency, which is usually the organization's reporting currency (often USD). I would create a Currency dimension table that contains the currency code and currency name. I would also create an ExchangeRate fact table that stores the exchange rate for each currency pair for each date. When a transaction comes in, the ETL process looks up the exchange rate for the transaction date and currency pair and calculates the converted amount at the time of the transaction. Storing the rate that was applied alongside the transaction is critical because you do not want to retroactively recalculate historical transaction amounts using today's exchange rates, which would distort financial reports.
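A rough sketch of the lookup-and-store step is shown below. The exchange_rates dictionary stands in for the ExchangeRate fact table, the rate values are made-up numbers for illustration, and the convert function and its output fields are hypothetical names.

```python
from datetime import date
from decimal import Decimal, ROUND_HALF_UP

# Stand-in for the ExchangeRate fact table: (from, to, rate_date) -> rate.
# The rates are illustrative values, not real market data.
exchange_rates = {
    ("EUR", "USD", date(2024, 1, 15)): Decimal("1.10"),
    ("EUR", "USD", date(2024, 6, 15)): Decimal("1.07"),
}

def convert(amount: Decimal, currency: str, base: str, txn_date: date) -> dict:
    """Build a transaction record holding original and converted amounts."""
    if currency == base:
        rate = Decimal("1")
    else:
        rate = exchange_rates[(currency, base, txn_date)]
    converted = (amount * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return {
        "txn_date": txn_date,
        "original_amount": amount,
        "original_currency": currency,
        "rate_applied": rate,        # stored so history is never recalculated
        "base_amount": converted,
        "base_currency": base,
    }

january = convert(Decimal("200.00"), "EUR", "USD", date(2024, 1, 15))
june = convert(Decimal("200.00"), "EUR", "USD", date(2024, 6, 15))
```

Decimal is used instead of float so that monetary values are not affected by binary rounding, and the same EUR amount converts differently in January and June because each record keeps the rate of its own transaction date.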

Q7. Your team is adopting a lakehouse architecture using Delta Lake. How does your data modelling approach change?

A lakehouse architecture blends the flexibility of a data lake with the structure and governance of a data warehouse. With Delta Lake, I would use a medallion architecture, which organizes data into Bronze, Silver, and Gold layers. The Bronze layer stores raw, unprocessed data exactly as it arrived from source systems. The Silver layer applies cleaning, deduplication, and light transformation. The Gold layer contains business-level aggregations and dimensional models ready for consumption by analysts and reporting tools. Delta Lake supports ACID transactions, schema enforcement, and table constraints, so I can apply data quality rules at the Silver layer. For the Gold layer, I would apply traditional dimensional modelling techniques using star schemas. I would handle late-arriving data with MERGE-based upserts, and I would use Delta Lake's Time Travel feature for short-term auditing and for reproducing a table as it looked at an earlier version. Time Travel is limited by the table's retention settings and by VACUUM, so I would still build SCD Type 2 dimensions wherever the business needs long-term history.
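The flow between the three layers can be sketched in plain Python. This does not use Delta Lake or Spark at all; it only illustrates what each layer is responsible for, and the sample records and the to_silver and to_gold functions are hypothetical.

```python
# Bronze: raw records exactly as received (strings, duplicates, bad values).
bronze = [
    {"order_id": "1", "amount": " 20.50", "country": "us"},
    {"order_id": "1", "amount": " 20.50", "country": "us"},  # duplicate delivery
    {"order_id": "2", "amount": "abc",    "country": "DE"},  # unparseable amount
    {"order_id": "3", "amount": "9.50",   "country": "de"},
]

def to_silver(rows):
    """Clean, type, and deduplicate; skip rows that fail validation."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue                      # a real pipeline would quarantine these
        order_id = int(row["order_id"])
        if order_id in seen:
            continue
        seen.add(order_id)
        silver.append(
            {"order_id": order_id, "amount": amount, "country": row["country"].upper()}
        )
    return silver

def to_gold(rows):
    """Business-level aggregate: total order amount per country."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Bronze is never modified, so the Silver and Gold layers can always be rebuilt from it if a cleaning rule changes.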

Q8. A new business requirement asks you to add a new attribute to a dimension table that already has millions of records. How do you handle this change?

Adding a new column to an existing dimension table is a standard schema evolution task, but it requires careful planning in production. First, I would check whether the new attribute is available for historical records or only for records going forward. If historical data is not available, I would add the column as nullable or assign a default value like 'Unknown' for historical records. I would write and test the ALTER TABLE statement in a development environment first. In cloud platforms like Snowflake or BigQuery, adding a nullable column is typically a metadata operation and does not scan the table, so it is fast even for large tables. I would also update the ETL pipeline to populate the new column for incoming records and backfill it for historical records if the source data is available. Finally, I would update the data dictionary to document the new attribute.
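The sequence described above (add the column with a default, backfill where source data exists, then load new records) might look like this in a small SQLite example. The dim_customer table and the loyalty_tier attribute are hypothetical, and the exact ALTER TABLE behaviour and cost vary between platforms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name TEXT NOT NULL
);
INSERT INTO dim_customer VALUES (1, 'Asha'), (2, 'Ben');

-- Step 1: add the attribute; existing rows read back the default.
ALTER TABLE dim_customer ADD COLUMN loyalty_tier TEXT DEFAULT 'Unknown';
""")

# Step 2: backfill only the historical rows for which source data exists.
conn.execute("UPDATE dim_customer SET loyalty_tier = 'Gold' WHERE customer_key = 1")

# Step 3: the updated pipeline populates the column for incoming records.
conn.execute(
    "INSERT INTO dim_customer (customer_key, customer_name, loyalty_tier) "
    "VALUES (3, 'Chen', 'Silver')"
)

rows = conn.execute(
    "SELECT customer_key, loyalty_tier FROM dim_customer ORDER BY customer_key"
).fetchall()
print(rows)
```

Customer 2 has no source data for the new attribute, so it keeps the 'Unknown' default instead of a NULL that every report would have to handle.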

Q9. You are asked to design a data model to support machine learning feature engineering. What does this look like?

A data model for machine learning feature engineering is usually implemented as a feature store. The core tables in a feature store include a Feature Registry that catalogs all available features with their names, descriptions, data types, and computation logic. A Feature Table or Entity Table stores the precomputed feature values indexed by entity ID, such as CustomerID or ProductID, and a timestamp. The timestamp is critical because training data must use point-in-time correct features, meaning only values that were known at the moment of each training example, to avoid data leakage. You design the feature store to support both online serving, where you need low-latency feature retrieval for real-time predictions, and offline training, where you need large batches of historical feature data. Tools like Feast or Tecton provide feature store implementations, but the underlying data model follows these principles regardless of the tool.
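Point-in-time correctness is easier to see with a small example. The sketch below is a simplified stand-in for a feature table, not how Feast or Tecton implement it; the feature_history structure, the purchase-count feature, and the feature_as_of function are assumptions made for illustration.

```python
from bisect import bisect_right
from datetime import datetime

# Feature table stand-in: entity ID -> (valid_from, value) pairs sorted by
# time. The hypothetical feature is a customer's running purchase count.
feature_history = {
    "C1": [
        (datetime(2024, 1, 1), 2),
        (datetime(2024, 2, 1), 5),
        (datetime(2024, 3, 1), 9),
    ],
}

def feature_as_of(entity_id: str, as_of: datetime):
    """Return the latest feature value known at or before as_of, else None."""
    history = feature_history.get(entity_id, [])
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)   # count of entries with ts <= as_of
    return history[idx - 1][1] if idx else None

# A training label observed on 15 Feb must only see the 1 Feb value,
# never the 1 Mar value that was computed later.
value_for_label = feature_as_of("C1", datetime(2024, 2, 15))
value_before_history = feature_as_of("C1", datetime(2023, 12, 31))
```

Using the March value for a February label would leak future information into training, which is exactly what the timestamp column exists to prevent.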

Q10. You are reviewing a junior team member's data model and notice they have used a text column to store dates. How do you explain the problem and guide them to fix it?

I would approach this as a coaching conversation rather than a criticism. I would ask them to walk me through their reasoning for using a text column. Then I would explain that storing dates as text creates several problems. First, the database cannot validate that the values are actual valid dates, so invalid values like '2024-02-31' can be stored without error. Second, date calculations and range queries do not work correctly or efficiently on text columns. For example, sorting a text column containing dates gives alphabetical ordering rather than chronological ordering unless every value follows a strict year-first format such as ISO 8601 (YYYY-MM-DD). Third, you cannot reliably use built-in date functions like DATEDIFF or DATE_TRUNC on a text column without casting it first. I would show them how to use a proper DATE or TIMESTAMP column and walk through how to convert existing text values to proper dates during the fix, including how to find and handle values that fail to convert. Turning it into a learning moment is more valuable than just telling them to change it.
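Both problems are easy to demonstrate to a junior colleague. The example below uses made-up values in a day-first format to show the sorting issue, and a hypothetical parse_or_none helper to show the kind of validation a real DATE column performs automatically.

```python
from datetime import date, datetime

# Dates stored as text in DD/MM/YYYY form (illustrative values).
text_dates = ["01/12/2024", "15/03/2023", "02/01/2025"]

# Text sort compares character by character, so the day decides the order.
sorted_as_text = sorted(text_dates)

# Parsing to real dates first gives chronological order.
sorted_as_dates = sorted(
    text_dates, key=lambda s: datetime.strptime(s, "%d/%m/%Y").date()
)

def parse_or_none(value: str):
    """Return a date for a valid YYYY-MM-DD string, else None."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        return None

print(sorted_as_text)                 # 2023 value ends up last
print(sorted_as_dates)                # 2023 value comes first
print(parse_or_none("2024-02-31"))    # rejected: February has no 31st
print(parse_or_none("2024-02-29"))    # accepted: 2024 is a leap year
```

Running the same parser over the existing text column is also a practical way to list the rows that need manual correction before the column type is changed.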

Wrapping Up

Data modelling is a skill at the intersection of technical knowledge and business understanding. You need to know the mechanics of database design, normalization, and dimensional modelling, but you also need to understand why certain design decisions matter and what impact they have on performance, maintainability, and the business.

FAQs

1. What is the best way to answer scenario-based data modelling questions?

Identify the entities, define their relationships, state your assumptions, and discuss the trade-offs. Interviewers value your thought process more than the exactness of your answer.

2. Are data modelling questions different for experienced applicants?

Yes. Experienced applicants are typically asked about architecture decisions, scalability, performance optimization, and solving real-world problems.

3. Which tools should I be familiar with for a data modelling interview?

A working knowledge of databases and warehouses such as MySQL, PostgreSQL, Snowflake, and BigQuery, along with ER diagramming tools such as Lucidchart or Draw.io, can help you stand out from candidates with similar experience.

About the Author

Sanjay Prajapat

Sanjay Prajapat is a Data Engineer and technology writer with expertise in Python, SQL, data visualization, and machine learning. He simplifies complex concepts into engaging content, helping beginners and professionals learn effectively while exploring emerging fields like AI, ML, and cybersecurity in today’s evolving tech landscape.
