Data science is the driving force behind many of today's most exciting technologies. It draws on computer science, statistics, machine learning, deep learning, data analysis, visualization, and more, and it can transform raw data into actionable and meaningful insights.
You may have also noticed how popular data scientists have become around the world. They are valuable assets for both emerging companies and Fortune 500 companies. Becoming one of them could be your chance to earn a salary package of INR 8 LPA to INR 29 LPA. As a working professional, I have gathered the top Data Science interview questions in this article to help you crack your next interview. Let's begin!
Explore some of our best Data Science Certification courses.
Data Science Interview Questions For Freshers
A fresher role is where most individuals start their journey. Freshers therefore usually face easy, fundamental questions during interviews. As an interviewer, I have asked the following data science interview questions.
1. What is data science?
Data science is an interdisciplinary field that uses different techniques, algorithms, tools and technologies to transform raw data into meaningful insights. It also uses statistical and computational analysis to uncover hidden patterns in big data. The complete process involves the following steps:
- Data Gathering: The first step involves collecting business-relevant data and requirements.
- Data Organization: The raw data should be cleansed and well organized so that the most important insights can be extracted. This is done through actions like data cleaning, warehousing, staging, etc.
- Data Processing: Once the data is ready to use, it is processed for mining and analysis. This step involves removing all the impurities and anomalies from the data.
- Data Analysis: After cleaning, the data is passed through various techniques like text mining, regression, predictive analysis, pattern recognition, and more. The techniques are selected based on the business requirements.
- Data Visualization: The final step involves conveying the insights through interactive visuals. Data visualization techniques include charts, graphs, images, dashboards, templates, reports and more.
Tip: You can also explain it with an example, if you have worked on any project.
2. Are data science and data analytics the same?
Data science and data analytics are closely related but are not the same. Data analytics, which focuses on analyzing various kinds of data, is just one part of data science. Data science covers a wider variety of activities, such as data cleansing, organization, visualization and more.
Tip: Keep your answer short, as the many similarities between the two can make a long explanation confusing.
3. Do you know the techniques used for sampling? Why should we use them?
Raw datasets are often too large to process in full, and they may include records that are irrelevant to the question at hand. Sampling lets us select a smaller, representative subset of the data, so that we can draw conclusions about the whole dataset faster and at lower cost. Here are the most common techniques one can use for sampling:
- Probability Sampling Techniques: These include cluster sampling, stratified sampling, simple random sampling and more.
- Non-Probability Sampling Techniques: These include quota sampling, snowball sampling, convenience sampling and more.
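As a rough illustration of two of the probability techniques named above, the sketch below draws a simple random sample and a proportionally stratified sample using only Python's standard library. The population, the `segment` field, and the helper name `stratified_sample` are assumptions made up for this example.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from every stratum defined by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[rec[key]].append(rec)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical population: 90 "basic" customers and 10 "premium" customers
population = (
    [{"id": i, "segment": "basic"} for i in range(90)]
    + [{"id": i, "segment": "premium"} for i in range(90, 100)]
)

# Simple random sampling: every record has the same chance of selection
simple = random.Random(0).sample(population, 10)

# Stratified sampling: 10% from each segment, so the rare group is always represented
stratified = stratified_sample(population, key="segment", fraction=0.1)
premium_in_stratified = sum(r["segment"] == "premium" for r in stratified)
print(len(simple), len(stratified), premium_in_stratified)
```

A simple random sample of this size can miss the small "premium" group entirely, while the stratified sample always includes it. That is one common reason to prefer stratification when a subgroup is rare.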
Tip: You can also explain where and why to use the techniques.
4. What do you understand about overfitting and underfitting?
Overfitting and underfitting are two common problems with machine learning models, and both indicate how well a model has learned. Overfitting happens when a model fits the training data too closely and learns everything from it, including its noise. This results in poor generalization to new and unseen data. Underfitting happens when a model is too simple to capture the essential relationships in the data. This results in poor performance on both the training data and new data.
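To make the two failure modes concrete, here is a minimal sketch that fits polynomials of increasing degree to noisy data and compares training error with error on held-out points. The synthetic sine data, the chosen degrees, and the `mse` helper are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a noisy sine wave, split into training and held-out points
x = np.sort(rng.uniform(0, 1, 40))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 40)
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]


def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree={degree}: train MSE={results[degree][0]:.3f}, "
          f"test MSE={results[degree][1]:.3f}")
```

The straight line (degree 1) is too simple for a sine wave, so its error stays high on both splits, which is underfitting. Training error can only go down as the degree grows, so a large gap between a low training error and a higher held-out error at degree 9 would be the usual sign of overfitting.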
Tip: You should also know how to avoid or solve these issues, as interviewers may ask about them.
5. What is resampling and why is it used?
Resampling involves repeatedly drawing samples from a dataset to build new datasets. This technique is used to evaluate model performance, handle imbalanced datasets, reduce overfitting and perform feature selection. This approach allows for a more robust assessment of a model's performance and can help in various data analysis and machine learning operations.
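One widely used resampling method is the bootstrap. The sketch below, which assumes a small made-up list of validation scores and a hypothetical helper `bootstrap_means`, resamples the data with replacement to estimate how much the mean could vary.

```python
import random
import statistics


def bootstrap_means(data, n_resamples=1000, seed=0):
    """Draw bootstrap resamples (with replacement) and return each resample's mean."""
    rng = random.Random(seed)
    n = len(data)
    return [
        statistics.fmean(rng.choices(data, k=n))  # sample n values with replacement
        for _ in range(n_resamples)
    ]


# Hypothetical observations, e.g. model scores from ten validation runs
scores = [0.71, 0.74, 0.69, 0.80, 0.77, 0.73, 0.68, 0.79, 0.75, 0.72]

means = sorted(bootstrap_means(scores))
lower = means[int(0.025 * len(means))]  # approximate 2.5th percentile
upper = means[int(0.975 * len(means))]  # approximate 97.5th percentile
print(f"sample mean={statistics.fmean(scores):.3f}, "
      f"95% interval=({lower:.3f}, {upper:.3f})")
```

Cross-validation follows the same idea of reusing one dataset several times, but it partitions the data into folds instead of sampling with replacement.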
Tip: You can also add one or two examples where it can be used.
6. What is the kernel trick?
The kernel trick lets algorithms operate in a higher-dimensional space without explicitly calculating the coordinates in that space. This is achieved with a "kernel function", which computes the inner product in the higher-dimensional space directly from the original data points. It enables linear algorithms to solve non-linear problems by implicitly transforming the data into a more suitable space for classification or other tasks.
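A small numeric check can make this less abstract. The sketch below assumes a degree-2 polynomial kernel on 2-D points, chosen only as an example, and shows that the kernel value equals the inner product of an explicit 3-D feature mapping. The function names are hypothetical.

```python
import math


def poly_kernel(x, y):
    """Degree-2 polynomial kernel computed directly in the original 2-D space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def feature_map(x):
    """Explicit mapping into the 3-D space that this kernel corresponds to."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


x, y = (1.0, 2.0), (3.0, 0.5)
implicit = poly_kernel(x, y)                    # no 3-D coordinates are computed
explicit = dot(feature_map(x), feature_map(y))  # same value via the explicit map
print(implicit, explicit)
```

Both values agree up to floating-point rounding, but the kernel version never builds the higher-dimensional vectors. With kernels such as the RBF kernel, the implicit space is infinite-dimensional, so the explicit route is not even available.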
Related Article- Data Science Career - Exploring The Right Why And How
Data Science Interview Questions for Intermediate
Intermediate-level candidates are in high demand in every industry, as they are seen as the market's emerging talent. You should therefore prepare well for these interviews. The most commonly asked data science interview questions for intermediate candidates are as follows:
1. Explain the ROC curve?
ROC stands for Receiver Operating Characteristic. The ROC curve plots the true positive rate against the false positive rate across many different probability thresholds, which helps you choose the right trade-off between the two rates. The closer the curve lies to the top-left corner (equivalently, the larger the area under the curve, or AUC), the better the model.
Tip: Only sketch an example curve if you can explain it.
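For readers who want to see how the curve is built, the following sketch computes ROC points by sweeping the threshold over a tiny set of made-up labels and scores, then estimates the area under the curve with the trapezoidal rule. The helper names `roc_points` and `auc` are assumptions. In practice a library routine would normally be used.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time, from strictest to loosest
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Hypothetical labels (1 = positive) and predicted probabilities
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
print(points)
print(f"AUC = {auc(points):.3f}")  # 0.889, i.e. 8/9
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, so this toy model ranks most positives above most negatives.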
2. What do you understand about precision?
Precision is a metric used to evaluate the quality of a model's positive predictions. It specifically measures the proportion of correctly predicted positive instances out of all instances that the model classified as positive. It basically answers the question: "Out of all the items the model identified as belonging to a certain class, how many actually belong to that class?"
Tip: Give a short and clear explanation and avoid creating confusion. I have seen many candidates make this mistake.
3. What is Recall? How does it differ from Precision?
Recall, also called Sensitivity or True Positive Rate, measures the proportion of actual positive instances that the model identifies correctly. It answers the question: "Out of all the actual positive items that should have been identified, how many did the model correctly find?" It is calculated as:
$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
The key difference from Precision is their focus: Precision focuses on the accuracy of the positive predictions made by the model, while Recall focuses on how many of the actual positives in the dataset the model was able to capture.
Tip: Explain the difference accurately.
4. What is the F1-Score and when is it useful?
The F1-score is the harmonic mean of Precision and Recall. It provides a metric that can balance precision and recall at the same time. It is especially useful when you need to evaluate a model on datasets with an uneven class distribution (imbalanced datasets), where simply looking at accuracy might be misleading.
$$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
A high F1-score shows that the model has high recall and high precision, meaning it correctly identifies positive cases without making too many false positive errors, and it captures most of the actual positive cases.
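The three metrics can be computed from a handful of counts. The sketch below uses made-up labels for an imbalanced binary problem and a hypothetical helper `precision_recall_f1`. It treats label 1 as the positive class.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical imbalanced example: 2 positives among 10 items
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)  # 0.5 0.5 0.5 0.8
```

Accuracy looks healthy at 0.8 because most items are negative, while precision, recall, and F1 all sit at 0.5. This is why accuracy can be misleading on imbalanced datasets.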
Tip: You can also explain it with a hands-on example based on your experience.
5. Explain the difference between supervised and unsupervised learning.
These are two fundamental paradigms in machine learning based on how the model learns from data:
- Supervised Learning: This approach involves training a model on a labeled dataset, meaning the input data has corresponding "correct" output labels. The model learns a mapping from inputs to outputs, and its goal is to predict the label for new, unseen data. Classification and regression are typical examples. Think of it like learning with a teacher providing correct answers.
- Unsupervised Learning: Unsupervised learning works on unlabeled data. The model's goal is to find hidden patterns, structures, or relationships within the data on its own, without any prior knowledge of outputs. This is often used for tasks like clustering (grouping similar data points) and dimensionality reduction. Think of it as learning by discovering structure without a teacher.
Related Article- Tips To Improve Data Science Strategy in the Organization
Tip: Explain in detail as these are the two most important paradigms.
6. What is the bias-variance trade-off?
The bias-variance tradeoff describes the relationship between two sources of error, which are bias and variance. It highlights that a model cannot simultaneously minimize both bias (underfitting) and variance (overfitting). The main goal is to find the optimal balance between these two. This leads to a model that generalizes well to unseen data.
Data Science Interview Questions for Experienced Professionals
Experienced data scientists are highly sought after for their ability to lead projects, solve complex problems and understand the business impact of their work. The most common data science interview questions for experienced professionals are as follows:
1. How do you handle missing values in a dataset?
Managing missing values is an important step of data preprocessing. My approach depends on the nature and extent of missingness. Common strategies include:
- Deletion: Removing rows (listwise deletion) or columns entirely if a large proportion of data is missing, or if the missingness is random and negligible.
- Imputation: Filling in missing values. Techniques range from simple (mean, median, mode imputation) to more sophisticated ones (e.g., K-Nearest Neighbors (KNN) imputation, regression imputation, or using predictive models like MICE). For time-series data, forward-fill or backward-fill can be effective.
- Sophisticated Methods: Using algorithms that can inherently handle missing values (like XGBoost or LightGBM), or creating separate binary indicator variables to flag missingness if the fact that data is missing holds information itself. The choice is driven by data characteristics, domain knowledge, and the potential impact on model performance.
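Some of the strategies above can be sketched in a few lines of plain Python. The example below assumes missing values are encoded as `None`, and the helper names are hypothetical. In a real project the equivalent pandas or scikit-learn routines would usually be preferred.

```python
import statistics


def impute_median_with_flag(values):
    """Replace None with the median of observed values and return a missingness flag."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    imputed = [median if v is None else v for v in values]
    was_missing = [int(v is None) for v in values]  # binary indicator feature
    return imputed, was_missing


def forward_fill(values):
    """Carry the last observed value forward (common for time series)."""
    filled, last = [], None
    for v in values:
        last = v if v is not None else last
        filled.append(last)
    return filled


ages = [24, 30, None, 28, 22, None, 35]  # hypothetical column with gaps
imputed, was_missing = impute_median_with_flag(ages)
print(imputed)      # [24, 30, 28, 28, 22, 28, 35]
print(was_missing)  # [0, 0, 1, 0, 0, 1, 0]
print(forward_fill([10, None, None, 12, None]))  # [10, 10, 10, 12, 12]
```

Note that forward-fill leaves a leading gap unfilled, because there is no earlier value to carry forward.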
Tip: Add an example where you actually performed this operation.
2. Explain ensemble learning and name some popular techniques.
Ensemble learning is a powerful machine learning paradigm where multiple individual models are trained to solve the same problem. The predictions from these base learners are then combined to achieve better predictive performance. The core idea is that combining diverse perspectives leads to a more robust and accurate overall prediction.
Popular ensemble techniques include:
- Bagging: Trains multiple base learners independently on different random subsets (bootstrapped samples) of the training data. Predictions are then averaged or voted. Random Forest is a prime example.
- Boosting: Trains base learners sequentially, where each new learner attempts to correct the errors made by the previous ones. It focuses on misclassified data points. Examples include AdaBoost, Gradient Boosting Machines (GBM), XGBoost, and LightGBM.
- Stacking: Trains a meta-learner (a "stacker" model) to combine the predictions of several diverse base learners. The predictions of base learners are the input features for the meta-learner.
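The combining step shared by many of these techniques can be shown with hard (majority) voting. The predictions below are invented for illustration, and the sketch covers only the voting step, not the training of the base learners.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine class predictions from several models by hard (majority) voting."""
    combined = []
    for votes in zip(*predictions_per_model):  # one tuple of votes per sample
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels and predictions from three imperfect base learners
y_true = [1, 0, 1, 1, 0, 1]
model_a = [1, 0, 0, 1, 0, 1]  # wrong on the third sample
model_b = [1, 1, 1, 1, 0, 1]  # wrong on the second sample
model_c = [1, 0, 1, 0, 0, 1]  # wrong on the fourth sample

ensemble = majority_vote([model_a, model_b, model_c])
print(accuracy(y_true, model_a), accuracy(y_true, ensemble))
```

Each base learner makes one mistake, but because the mistakes fall on different samples, the vote corrects all of them. This only works when the learners' errors are not strongly correlated, which is why diversity among base learners matters.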
Tip: You can also give your preference to show your experience.
3. Describe your experience with A/B testing.
A/B testing is a randomized controlled experiment used to compare two versions of a product, marketing campaign or feature to determine which performs better. My experience involves:
- Defining Hypotheses: Clearly stating the null and alternative hypotheses, and identifying the key metric(s) to optimize (e.g., conversion rate, click-through rate).
- Experimental Design: Determining sample size, control and variant groups, duration, and ensuring proper randomization to minimize bias.
- Data Collection & Analysis: Setting up tracking, collecting data from both groups, and then performing statistical analysis (e.g., t-tests, chi-squared tests) to determine if the observed difference is statistically significant.
- Interpretation & Recommendation: Translating statistical results into actionable business insights, recommending whether to launch the variant or stick with the control, and documenting findings. I've used A/B tests to optimize user interfaces, email subject lines, and marketing offers.
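The analysis step can be sketched with a two-proportion z-test, one common choice for comparing conversion rates. The conversion counts below are invented, and the function name `two_proportion_z_test` is an assumption. A statistics library would normally be used in practice.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null hypothesis
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value


# Hypothetical experiment: control converts 200/1000, variant converts 250/1000
z, p_value = two_proportion_z_test(200, 1000, 250, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

With these numbers the p-value comes out below 0.05, so the null hypothesis of equal conversion rates would be rejected at the 5% level. Whether the lift is large enough to justify a launch remains a business decision.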
Tip: This question is all about checking your experience and expertise, so answer wisely.
4. How do you approach model deployment and MLOps?
My approach to model deployment and MLOps (Machine Learning Operations) focuses on building a repeatable, scalable, and reliable pipeline for getting models into production and maintaining them. Key steps include:
- Model Packaging: Containerizing the trained model and its dependencies (e.g., using Docker) to ensure consistency across environments.
- API Development: Creating a RESTful API (e.g., with Flask or FastAPI) to expose the model for inference, allowing other applications to interact with it.
- Deployment Environment: Utilizing cloud platforms (AWS Sagemaker, GCP AI Platform, Azure ML) or Kubernetes for robust scaling, monitoring, and management.
- Monitoring: Implementing dashboards and alerts for model performance (accuracy, latency, drift detection), infrastructure health, and data quality.
- Retraining Strategy: Defining a clear strategy for model retraining - whether on a schedule, based on performance degradation (drift), or new data availability. This often involves automated data pipelines and version control for models and data.
- Version Control & CI/CD: Applying DevOps principles to ML, including versioning models and data, and using CI/CD pipelines for automated testing, building, and deployment of ML workflows.
Tip: This one checks your proficiency and problem solving skills, so answer accordingly.
5. What are the ethical considerations in data science?
Ethical considerations are paramount in data science, especially as models impact real-world decisions. Key areas are:
- Bias and Fairness: Ensuring models do not perpetuate or amplify existing societal biases (e.g., in hiring or loan applications) due to biased training data or algorithmic design. This involves bias detection, debiasing techniques, and fairness metrics.
- Privacy and Data Security: Protecting sensitive user data, adhering to regulations like GDPR or CCPA, anonymization/pseudonymization, and secure data storage and access.
- Transparency and Explainability (XAI): Making model decisions understandable, especially for critical applications. This involves using interpretable models or explainability techniques like SHAP or LIME.
- Accountability: Establishing clear responsibility for model outcomes, especially in cases of error or harmful impact.
- Misinformation and Manipulation: Being aware of how data science can be used to spread false information or manipulate behavior, and adhering to responsible use guidelines. I believe in a "privacy by design" and "ethics by design" approach.
Tip: This one is for checking your knowledge level.
6. What do you understand about Support Vectors in SVM (Support Vector Machine)?
Support Vectors are the closest data points to the decision boundary (hyperplane) and directly influence its position and orientation. These vectors are important as they define the margin, the distance between the hyperplane and the nearest data points of each class, which SVM aims to maximize. This is why the support vectors are the "support" for the hyperplane and their removal would change the decision boundary.
Related Article- Why Data Science Jobs Are In High Demand [Updated 2026]
Data Science Technical Interview Questions
These questions delve into your foundational knowledge of algorithms, statistics, and core data science concepts.
1. Explain the bias-variance trade-off.
The bias-variance trade-off is a central concept in machine learning that helps explain model performance.
- Bias: It refers to the simplifying assumptions made by a model to learn the target function more easily. High bias implies the model is too simple (underfitting) and systematically misses the true relationships in the data.
- Variance: It refers to the sensitivity of a model to small fluctuations in the training data. High variance implies the model is too complex (overfitting): it learns the noise in the training data and therefore performs badly on unseen data.
The trade-off is that reducing bias typically increases variance, and reducing variance typically increases bias. The goal is to find a balance where the model generalizes well to new data, minimizing the total error (which is roughly reducible error due to bias + reducible error due to variance + irreducible error).
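For squared-error loss, the total error mentioned above has a standard exact form. Assuming the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and writing $\hat{f}$ for the trained model (this notation is introduced here for illustration), the expected error at a point $x$ decomposes as:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible error}}$$

The expectation is taken over different training sets and the noise. Only the first two terms depend on the model, which is why they are called reducible.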
Tip: Answer according to your experience level. Experienced candidates are expected to give a detailed explanation.
2. What is regularization and why is it used? Name common types.
Regularization is a technique used in machine learning, particularly in regression and classification. It helps prevent overfitting and improves the generalization capability of a model.
It adds a penalty term to the loss function during model training, which discourages the model from assigning excessively large weights to features. This effectively shrinks the coefficients, making the model simpler and less prone to capturing noise in the training data.
Common types of regularization include:
- L1 Regularization: It is also known as Lasso Regression. It adds a penalty proportional to the absolute value of the coefficients. It often leads to sparse models by driving some coefficients exactly to zero, effectively performing feature selection.
- L2 Regularization: It is also known as Ridge Regression. It adds a penalty proportional to the square of the magnitude of the coefficients. It shrinks coefficients towards zero but doesn't force them to be exactly zero.
- Elastic Net Regularization: A combination of L1 and L2 regularization, offering the benefits of both.
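The different behaviour of L1 and L2 can be seen even in a one-feature model without an intercept, where both have closed-form solutions. The sketch below assumes the objectives written in the docstrings. The penalty strength `lam` and its scaling are assumptions, and libraries often scale the penalty differently.

```python
def fit_ols(x, y):
    """Least-squares slope for a one-feature model y ~ w * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def fit_ridge(x, y, lam):
    """Minimizes sum((y - w*x)^2) + lam * w^2; the slope shrinks but stays non-zero."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)


def fit_lasso(x, y, lam):
    """Minimizes sum((y - w*x)^2) + lam * |w| via soft-thresholding."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    shrunk = max(abs(sxy) - lam / 2, 0.0)  # soft-threshold the x-y product sum
    return (shrunk if sxy >= 0 else -shrunk) / sxx


# Hypothetical weakly informative feature
x = [1.0, 2.0, 3.0]
y = [0.1, 0.2, 0.3]

w_ols = fit_ols(x, y)
w_ridge = fit_ridge(x, y, lam=4.0)
w_lasso = fit_lasso(x, y, lam=4.0)
print(w_ols, w_ridge, w_lasso)
```

With the same penalty strength, ridge shrinks the slope from about 0.1 to a smaller non-zero value, while lasso sets it to exactly 0.0. This is the mechanism behind lasso's feature selection.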
Tip: Regularization is a fundamental technical concept, so every candidate should be able to explain it correctly.
3. How does a Decision Tree work?
A decision tree is a non-parametric supervised learning algorithm that can be used for both regression and classification tasks. It works by recursively splitting the dataset into smaller and smaller subsets based on feature values, creating a tree-like structure of decisions.
- Splitting: At each node, the tree identifies the best feature and a split point (or category) that best divides the data, typically by maximizing information gain or minimizing impurity (e.g., Gini impurity or entropy).
- Nodes: Internal nodes are the tests on an attribute. Branches are the outcome of the test, and leaf nodes (terminal nodes) represent the class label or a continuous value.
- Recursion: The process continues until a stopping criterion is met (e.g., a node contains only one class, a minimum number of samples per leaf is reached, or maximum depth is achieved). The model learns a set of if-then-else rules from the training data.
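The splitting step can be sketched for a single numeric feature. The example below uses made-up data and hypothetical helper names. It computes Gini impurity and searches for the threshold with the lowest weighted impurity. A full tree would repeat this search recursively over every feature.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Find the threshold on one numeric feature that minimizes weighted Gini impurity."""
    best_threshold, best_impurity = None, float("inf")
    candidates = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best_impurity:
            best_threshold, best_impurity = threshold, weighted
    return best_threshold, best_impurity


# Hypothetical feature (hours studied) and class label (pass/fail)
hours = [1, 2, 3, 6, 7, 8]
result = ["fail", "fail", "fail", "pass", "pass", "pass"]

print(gini(result))               # 0.5 for a 50/50 node
print(best_split(hours, result))  # (4.5, 0.0): a perfectly pure split
```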
Tip: It is a very good topic of discussion, prepare it well and be ready for additional questions.
4. What is Principal Component Analysis (PCA) and when would you use it?
PCA is among the unsupervised dimensionality reduction techniques. Its main goal is to transform a high-dimensional dataset into a lower-dimensional one. It does this by creating a new set of orthogonal (uncorrelated) variables called Principal Components (PCs), which are linear combinations of the original features. The first PC captures the most variance, the second captures the most remaining variance, and so on.
You would use PCA when:
- Dimensionality Reduction: To reduce the number of features in a dataset, especially when dealing with the "curse of dimensionality."
- Noise Reduction: To filter out noise by retaining only the components that capture significant variance.
- Visualization: To visualize high-dimensional data by projecting it onto 2 or 3 principal components.
- Multicollinearity: To address multicollinearity issues in regression models by transforming correlated features into uncorrelated principal components.
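As a minimal sketch of the idea, PCA can be implemented with an eigen-decomposition of the covariance matrix. The synthetic data and the helper name `pca` are assumptions. Production code would typically rely on a library implementation.

```python
import numpy as np


def pca(X, n_components):
    """PCA via eigen-decomposition of the covariance matrix."""
    X_centered = X - X.mean(axis=0)                  # center each feature
    cov = np.cov(X_centered, rowvar=False)           # feature-by-feature covariance
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh handles symmetric matrices
    order = np.argsort(eigenvalues)[::-1]            # sort by variance, descending
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    components = eigenvectors[:, :n_components]
    explained_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return X_centered @ components, explained_ratio


# Hypothetical data: three features, two of them strongly correlated
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    2 * base + rng.normal(scale=0.1, size=(200, 1)),
    rng.normal(size=(200, 1)),
])

projected, explained_ratio = pca(X, n_components=2)
print(projected.shape, explained_ratio)
```

Because two of the three features carry nearly the same information, two components should retain most of the variance. The projected columns are also uncorrelated, which is what makes PCA useful against multicollinearity.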
Tip: There are many use cases for PCA. Pick any one of them and showcase your hands-on experience.
5. Differentiate between covariance and correlation.
Both covariance and correlation measure the relationship between two variables, but they differ significantly:
- Covariance: Measures the directional relationship between two variables. A positive covariance indicates that the variables move in the same direction (as one increases, the other tends to increase). A negative covariance indicates that they move in opposite directions, and a covariance close to zero suggests no linear relationship.
- Correlation: Measures both the direction and strength of the linear relationship between two variables. It is a normalized version of covariance, scaled to be between -1 and +1.
  - +1 indicates a perfect positive linear relationship.
  - -1 indicates a perfect negative linear relationship.
  - 0 indicates no linear relationship.
Correlation is dimensionless, making it easier to compare the relationships between different pairs of variables regardless of their units.
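The effect of units is easy to demonstrate. In the sketch below (made-up height and weight values, hypothetical helper names), converting height from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import math


def covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical data: height in metres and weight in kilograms
height_m = [1.60, 1.70, 1.75, 1.80, 1.90]
weight_kg = [55.0, 65.0, 70.0, 72.0, 85.0]
height_cm = [h * 100 for h in height_m]  # same data, different unit

cov_m, cov_cm = covariance(height_m, weight_kg), covariance(height_cm, weight_kg)
corr_m, corr_cm = correlation(height_m, weight_kg), correlation(height_cm, weight_kg)
print(cov_m, cov_cm)    # the second value is about 100 times the first
print(corr_m, corr_cm)  # equal up to floating-point rounding
```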
Tip: These are advanced topics, meaning beginners can give answers with a definition. However, the experienced ones can face some additional questions.
6. What do you understand about Eigenvectors and Eigenvalues?
Eigenvectors and eigenvalues are two of the most important concepts from linear algebra. They help analyze and simplify complicated data. An eigenvector of a matrix is a non-zero vector that the matrix only stretches or shrinks, without rotating it off its own line.
The corresponding eigenvalue is the scaling factor. They are fundamental to understanding data structure, dimensionality reduction and feature extraction in various data science and machine learning algorithms.
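The defining property, that multiplying by the matrix equals scaling by the eigenvalue, can be checked numerically. The 2x2 matrix below is an arbitrary example chosen for illustration.

```python
import numpy as np

# Hypothetical symmetric matrix; its eigenvalues are 1 and 3
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

checks = []
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]  # eigenvectors are stored as columns
    lam = eigenvalues[i]
    # Defining property: A @ v equals lam * v (same line, scaled by lam)
    checks.append(bool(np.allclose(A @ v, lam * v)))
    print(lam, v, checks[-1])
```

In PCA, the matrix is the covariance matrix of the data: its eigenvectors give the principal directions and its eigenvalues give the variance along each of them.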
7. How are univariate, bivariate and multivariate analysis different?
These three types of analysis are differentiated as follows:
| Feature | Univariate Analysis | Bivariate Analysis | Multivariate Analysis |
| --- | --- | --- | --- |
| Definition | Analysis of a single variable | Analysis of two variables | Analysis of more than two variables simultaneously |
| Purpose | To understand the distribution, central tendency, and spread of one variable | To explore the relationship between the two variables | To explore interactions and relationships among multiple variables |
| Number of Variables | One | Two | Three or more |
| Techniques Used | Frequency distribution, histograms, bar charts, mean, median, mode, standard deviation | Scatter plots, correlation, cross-tabulation, regression | Multiple regression, factor analysis, MANOVA, PCA |
| Visualization Tools | Bar chart, histogram, pie chart | Scatter plot, line chart | Heatmap, 3D plots, correlation matrix |
| Example Use Case | Analyzing the average age of customers | Relationship between hours of study and the exam score | Predicting house prices based on size, location, and age |
| Output Type | Descriptive statistics | Relationship metrics (e.g., correlation coefficient) | Predictive or explanatory models |
| Complexity | Simple | Moderate | High |
Related Article- Top Data Science Tools and Technologies for 2026
Data Science With Python Interview Questions
These questions test your practical coding skills, often focusing on Python, data manipulation with libraries like Pandas, and sometimes implementing core algorithm logic.
1. Write Python code to find the most frequent element in a list.
```python
from collections import Counter


def find_most_frequent(data_list):
    """
    Finds the most frequent element(s) in a list.
    If multiple elements have the same highest frequency, all are returned.
    """
    if not data_list:
        return []  # Return empty list if input is empty

    # Use Counter to get frequencies of all elements
    counts = Counter(data_list)

    # Find the maximum frequency (counts is non-empty at this point)
    max_count = max(counts.values())

    # Collect all elements that have the maximum frequency
    most_frequent_elements = [
        item for item, count in counts.items() if count == max_count
    ]
    return most_frequent_elements


# Test cases
print(f"Most frequent in [1, 3, 3, 2, 1, 1, 4]: {find_most_frequent([1, 3, 3, 2, 1, 1, 4])}")
print(f"Most frequent in ['apple', 'banana', 'apple', 'orange', 'banana']: {find_most_frequent(['apple', 'banana', 'apple', 'orange', 'banana'])}")
print(f"Most frequent in [5, 5, 5, 2, 2, 2]: {find_most_frequent([5, 5, 5, 2, 2, 2])}")
print(f"Most frequent in []: {find_most_frequent([])}")
```
2. How would you clean a dataset in Pandas: handle missing values, duplicates, and convert data types?
```python
import pandas as pd
import numpy as np


def clean_dataframe(df):
    """
    Cleans a Pandas DataFrame by handling missing values, duplicates,
    and converting common data types.
    """
    print("--- Original DataFrame Info ---")
    df.info()
    print("\nOriginal Head:\n", df.head())
    print("\nMissing values before cleaning:\n", df.isnull().sum())
    print("\nDuplicates before cleaning:", df.duplicated().sum())

    # 1. Handle Missing Values:
    # Drop columns with too many missing values (e.g., >70%) first, so that
    # mostly-empty columns are not imputed and then kept by mistake.
    original_cols = df.shape[1]
    min_non_null = int(np.ceil(len(df) * 0.3))  # Keep columns with at least 30% non-nulls
    df.dropna(axis=1, thresh=min_non_null, inplace=True)
    if df.shape[1] < original_cols:
        print(f"Dropped {original_cols - df.shape[1]} columns due to excessive missing values.")

    # Identify numerical and categorical columns
    numerical_cols = df.select_dtypes(include=np.number).columns
    categorical_cols = df.select_dtypes(include='object').columns  # Assuming object for strings

    # For numerical columns, fill with median (robust to outliers).
    # Assign the result back: calling fillna(inplace=True) on a column
    # selection is unreliable in recent pandas versions.
    for col in numerical_cols:
        if df[col].isnull().any():
            median_val = df[col].median()
            df[col] = df[col].fillna(median_val)
            print(f"Filled missing values in '{col}' with median: {median_val}")

    # For categorical columns, fill with mode (most frequent)
    for col in categorical_cols:
        if df[col].isnull().any():
            mode_val = df[col].mode()[0]  # mode() returns a Series, take first element
            df[col] = df[col].fillna(mode_val)
            print(f"Filled missing values in '{col}' with mode: '{mode_val}'")

    # 2. Handle Duplicates:
    # Drop full row duplicates
    initial_rows = df.shape[0]
    df.drop_duplicates(inplace=True)
    if df.shape[0] < initial_rows:
        print(f"Removed {initial_rows - df.shape[0]} duplicate rows.")

    # 3. Convert Data Types:
    # Convert 'object' columns that should be categorical to 'category' dtype
    for col in categorical_cols:
        df[col] = df[col].astype('category')
        print(f"Converted '{col}' to 'category' dtype.")

    # Example: convert a specific column to datetime if applicable
    if 'date_column' in df.columns:  # Placeholder for a date column
        try:
            df['date_column'] = pd.to_datetime(df['date_column'], errors='coerce')
            print("Converted 'date_column' to datetime dtype.")
        except Exception:
            print("Could not convert 'date_column' to datetime.")

    print("\n--- Cleaned DataFrame Info ---")
    df.info()
    print("\nCleaned Head:\n", df.head())
    print("\nMissing values after cleaning:\n", df.isnull().sum())
    print("Duplicates after cleaning:", df.duplicated().sum())
    return df


# Create a sample DataFrame with missing values, duplicates, and mixed types
data = {
    'ID': [1, 2, 3, 4, 5, 1, 6, 7],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Alice', 'Frank', 'Grace'],
    'Age': [24, 30, np.nan, 28, 22, 24, 35, 29],
    'City': ['NY', 'LA', 'NY', 'SF', 'LA', 'NY', 'SF', np.nan],
    'Score': [85.5, 90.0, 78.0, np.nan, 92.0, 85.5, 70.0, 88.0],
    'Gender': ['F', 'M', 'M', 'M', 'F', 'F', 'M', 'F'],
    'Unknown_Col': [np.nan] * 8,  # A column with all NaNs
}
sample_df = pd.DataFrame(data)
cleaned_df = clean_dataframe(sample_df.copy())
```
3. Implement a simple linear regression from scratch (conceptually, or simplified code).
Conceptually, simple linear regression aims to model the relationship between a single independent variable (X) and a dependent variable (Y) by fitting a straight line to the data. The equation of this line is Y=b_0+b_1X, where b_0 is the y-intercept and b_1 is the slope.
The goal is to find the values of b_0 and b_1 that minimize the sum of squared errors (the difference between the actual Y values and the predicted Y values). This is typically done using the Ordinary Least Squares (OLS) method, where the formulas for b_0 and b_1 are derived:
$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$
$$b_0 = \bar{Y} - b_1 \bar{X}$$
where $\bar{X}$ and $\bar{Y}$ are the means of X and Y, respectively.
```python
import numpy as np


class SimpleLinearRegression:
    def __init__(self):
        self.b0 = 0  # Intercept
        self.b1 = 0  # Slope

    def fit(self, X, y):
        """
        Calculates the coefficients (b0 and b1) for simple linear regression
        using the Ordinary Least Squares method.
        """
        # Convert to arrays so that plain Python lists are accepted as well
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)

        if len(X) != len(y):
            raise ValueError("Input arrays X and y must have the same length.")
        if len(X) < 2:
            raise ValueError("At least two data points are required for linear regression.")

        mean_X = np.mean(X)
        mean_y = np.mean(y)

        # Calculate numerator and denominator for b1
        numerator = np.sum((X - mean_X) * (y - mean_y))
        denominator = np.sum((X - mean_X) ** 2)
        if denominator == 0:
            raise ValueError("Cannot perform regression: X values are all the same.")

        self.b1 = numerator / denominator
        self.b0 = mean_y - self.b1 * mean_X

    def predict(self, X_new):
        """
        Predicts Y values for new X values using the learned coefficients.
        """
        return self.b0 + self.b1 * np.asarray(X_new, dtype=float)


# Example Usage:
# Training data
X_train = np.array([1, 2, 3, 4, 5])
y_train = np.array([2, 4, 5, 4, 5])

# Create and train the model
model = SimpleLinearRegression()
model.fit(X_train, y_train)
print(f"Calculated Intercept (b0): {model.b0}")
print(f"Calculated Slope (b1): {model.b1}")

# Make predictions
X_test = np.array([6, 7])
predictions = model.predict(X_test)
print(f"Predictions for X={X_test}: {predictions}")
# Expected output from online calculator for these values: b0 = 2.2, b1 = 0.6
# (up to floating-point rounding)
```
4. Given two lists, write Python code to find common elements.
```python
def find_common_elements(list1, list2):
    """
    Finds elements that are common to both list1 and list2.
    Returns a list of unique common elements.
    """
    # Convert lists to sets for efficient intersection finding
    set1 = set(list1)
    set2 = set(list2)

    # Find the intersection of the two sets
    common_elements_set = set1.intersection(set2)

    # Convert the resulting set back to a list (optional, if list format is required)
    return list(common_elements_set)


# Test cases
list_a = [1, 2, 3, 4, 5]
list_b = [4, 5, 6, 7, 8]
print(f"Common elements between {list_a} and {list_b}: {find_common_elements(list_a, list_b)}")

list_c = ['apple', 'banana', 'cherry']
list_d = ['banana', 'date', 'cherry', 'grape']
print(f"Common elements between {list_c} and {list_d}: {find_common_elements(list_c, list_d)}")

list_e = [10, 20, 30]
list_f = [40, 50, 60]
print(f"Common elements between {list_e} and {list_f}: {find_common_elements(list_e, list_f)}")
```
5. Write a Python function to calculate entropy for a given set of probabilities.
Entropy is a measure of the impurity (uncertainty) in a dataset. In the context of information theory and decision trees, it quantifies the average amount of information needed to identify the outcome of a random variable. A higher entropy means more uncertainty.
The formula for entropy (H) for a discrete random variable with n possible outcomes, each with probability P_i, is:
$$H(X) = -\sum_{i=1}^{n} P_i \log_2(P_i)$$
```python
import numpy as np


def calculate_entropy(probabilities):
    """
    Calculates the entropy for a given set of probabilities.

    Args:
        probabilities (list or np.array): A list or array of probabilities.
            These should sum to 1.

    Returns:
        float: The calculated entropy. Returns 0 if probabilities is empty
            or if all probabilities are 0 (no information).
    """
    # Convert first: `if not probabilities` raises an error on multi-element NumPy arrays
    probabilities = np.asarray(probabilities, dtype=float)
    if probabilities.size == 0:
        return 0.0

    # Remove any zero probabilities to avoid log(0), which is undefined.
    # For entropy, P*log(P) where P=0 is considered 0, so we filter them out.
    non_zero_probs = probabilities[probabilities > 0]
    if len(non_zero_probs) == 0:  # Handle cases where all probabilities are 0
        return 0.0

    # Calculate entropy using the formula: -sum(p * log2(p))
    entropy = -np.sum(non_zero_probs * np.log2(non_zero_probs))
    # Adding 0.0 turns the -0.0 produced for a certain event into 0.0
    return float(entropy) + 0.0


# Test cases
# Example 1: Fair coin (50% heads, 50% tails) - Max entropy for 2 outcomes
probs1 = [0.5, 0.5]
print(f"Entropy for {probs1}: {calculate_entropy(probs1)}")  # Expected: 1.0

# Example 2: Certain event (100% heads) - Min entropy
probs2 = [1.0, 0.0]
print(f"Entropy for {probs2}: {calculate_entropy(probs2)}")  # Expected: 0.0

# Example 3: Biased coin
probs3 = [0.75, 0.25]
print(f"Entropy for {probs3}: {calculate_entropy(probs3)}")  # Expected: approx 0.81

# Example 4: Multiple outcomes
probs4 = [0.25, 0.25, 0.25, 0.25]
print(f"Entropy for {probs4}: {calculate_entropy(probs4)}")  # Expected: 2.0

# Example 5: Empty or all zeros
probs5 = []
print(f"Entropy for {probs5}: {calculate_entropy(probs5)}")
probs6 = [0.0, 0.0, 0.0]
print(f"Entropy for {probs6}: {calculate_entropy(probs6)}")
```
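In the decision tree setting mentioned above, the probabilities are usually not given directly but estimated from the class labels at a node. The sketch below shows one way to do that with the standard library only. The function name `entropy_from_labels` and the sample labels are illustrative assumptions, not part of the original question.

```python
from collections import Counter
from math import log2


def entropy_from_labels(labels):
    """Estimate entropy (in bits) from the class frequencies in `labels`."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # Every count is at least 1, so log2 never receives 0
    return sum(-(c / total) * log2(c / total) for c in counts.values())


balanced = ["yes", "yes", "no", "no"]
skewed = ["yes", "yes", "yes", "no"]
pure = ["yes", "yes", "yes"]

print(entropy_from_labels(balanced))          # 1.0
print(round(entropy_from_labels(skewed), 3))  # 0.811
print(entropy_from_labels(pure))              # 0.0
```

A pure node has zero entropy, which is why decision tree algorithms prefer splits that move child nodes toward a single class.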
Top 10 Data Science MCQ Questions
1. Why use exploratory data analysis (EDA)?
A. To clean the dataset
B. To predict future values
C. To summarize the dataset and identify patterns
D. To deploy a model
2. Which of the following algorithms is suitable for classification problems?
A. K-Means
B. Linear Regression
C. Logistic Regression
D. PCA
3. What does a high variance in a model typically indicate?
A. The model is underfitting
B. The model is stable
C. The model generalizes well
D. The model is overfitting
4. Which metric is best to evaluate a classification model on imbalanced datasets?
A. Accuracy
B. RMSE
C. F1-Score
D. Mean Squared Error
5. What is feature engineering?
A. Selecting the model
B. Creating new input variables from existing data
C. Removing outliers
D. Splitting the dataset
6. In which case would you use PCA?
A. When labels are missing
B. For time series forecasting
C. To reduce dimensionality
D. For clustering
7. What type of plot is best for visualizing the distribution of a single continuous variable?
A. Scatter plot
B. Histogram
C. Bar chart
D. Line graph
8. What is a confusion matrix used for?
A. Visualizing correlation
B. Tracking model training time
C. Evaluating classification performance
D. Performing clustering
9. Which of the following is a supervised learning technique?
A. K-Means Clustering
B. DBSCAN
C. Random Forest
D. PCA
10. What does cross-validation help with?
A. Reducing the number of features
B. Ensuring the model runs faster
C. Assessing the model's ability to generalize
D. Collecting new data
Wrapping Up Data Science Interview Questions
This article has walked you through a comprehensive set of data science interview questions, from fundamental concepts to technical and coding topics. Additionally, the tips at the end of each question will help you impress the interviewers.
Data science is a dynamic field, where continuous learning is key. By understanding these core concepts and practicing your problem-solving skills, you are well-equipped to impress potential employers and start a rewarding career path. Keep exploring, keep learning, and good luck with your data scientist interview preparation!
FAQs
Q1. How to prepare for data science interviews?
Preparing for a data science interview requires technical skills, readiness for behavioral questions, and a deep understanding of the role and the company. Practicing with common data science interview questions can be a great help in this preparation.
Q2. How many types of data science are there?
Data science work is commonly grouped into 4 types of analytics:
- Descriptive (what happened)
- Diagnostic (why it happened)
- Predictive (what will happen)
- Prescriptive (what should be done).
Q3. What is the salary of a data scientist?
The salary of a data scientist ranges from INR 8 LPA to INR 29 LPA, with an average of INR 12 LPA.
Q4. What skills are required for a Data Scientist?
Key skills include Python or R programming, statistics, machine learning, data visualization and problem-solving.
Q5. What jobs are available in Data Science?
Common jobs include:
- Machine Learning Engineer