Data Science

What is Data Science?

Data science is a multidisciplinary field that employs advanced analytical techniques, statistical methods, and machine learning algorithms to collect, analyze, and interpret large and diverse datasets. It combines elements of statistics, computer science, and domain expertise to uncover hidden patterns and insights, enabling organizations to make data-driven decisions.

The Basic Idea

Data science is becoming increasingly useful in our technology-driven world. Our widespread use of the internet and social media has led to increased access to billions and billions of pieces of information. But what do we do with all of this information? 

By using complex analytical techniques to collect and analyze data, and leveraging artificial intelligence and programming, data science uncovers patterns to help organizations plan, strategize, and make data-driven decisions.

Broadly, data science takes raw data and summarizes it into a cohesive language for decision-makers, ranging from CEOs of large corporations to government officials. For example, Starbucks uses data from a location-analytics company, which reveals traffic patterns and where target demographic groups are located, to determine where to open new stores. Governments use data science to identify vulnerable populations, analyzing factors such as poverty, education, and unemployment to inform where resources and targeted interventions should be directed.1

Data science is often confused with data analytics, but what sets it apart is its scope: data science is about translating complex data into actionable insights; it uses past data to predict future trends. A data analyst uses data to answer questions like: “What was our monthly sales revenue for the last year?” In contrast, a data scientist uses data to pose the question: “Based on historical data, what will our sales be next quarter?”
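The contrast between the two questions can be made concrete with a minimal sketch. The monthly revenue figures below are hypothetical, and the straight-line trend is only one simple way to forecast, chosen here for illustration; a real forecast would typically also account for seasonality and uncertainty.

```python
# Hypothetical monthly sales revenue (in thousands) for the last 12 months.
monthly_revenue = [100, 104, 108, 112, 116, 120, 124, 128, 132, 136, 140, 144]


def total_revenue(revenue):
    """Analyst-style question: what was our revenue over the period?"""
    return sum(revenue)


def fit_linear_trend(revenue):
    """Fit revenue = intercept + slope * month_index by ordinary least squares."""
    n = len(revenue)
    months = range(n)
    mean_x = sum(months) / n
    mean_y = sum(revenue) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, revenue))
    sxx = sum((x - mean_x) ** 2 for x in months)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def forecast_next_quarter(revenue):
    """Scientist-style question: what will sales be over the next three months?"""
    intercept, slope = fit_linear_trend(revenue)
    n = len(revenue)
    # Extend the fitted line three months past the end of the observed data.
    return sum(intercept + slope * month for month in range(n, n + 3))


print("Revenue last year:", total_revenue(monthly_revenue))
print("Forecast for next quarter:", round(forecast_next_quarter(monthly_revenue), 1))
```

The first function only describes what already happened; the second extrapolates from it, which is the step that turns past data into a prediction.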

The Step-by-Step

A data scientist would normally take the following steps to get from a bunch of numbers to meaningful insights.

1. Data ingestion: The first step involves collecting the data to be analyzed. This is done through numerous methods, both automated and manual. Data can come from structured sources like databases and APIs, as well as unstructured sources like social media, IoT devices, and web scraping. Using ETL (Extract, Transform, Load) tools, data scientists can automate the collection and initial transformation of data.

2. Data management and processing: Data comes in many different formats. Because of this, standardizing the data allows for easier analysis. Data standardization and storage structure are up to the organization’s preferences. Data cleaning is crucial at this stage, involving tasks like removing duplicates, handling missing values, and correcting invalid entries. Tools like SQL, Python (pandas), and data warehousing solutions are commonly used to manage and process data efficiently.

3. Data analysis: Here is where the magic happens. Data scientists dive into exploratory data analysis (EDA) to identify patterns, distributions, and potential biases. This involves statistical analysis and the development of machine learning models. Techniques such as regression analysis, clustering, and classification are applied to extract meaningful insights. Tools like Python (scikit-learn), R, and TensorFlow are essential for conducting these analyses.

4. Communication: The insights from the exploratory analysis are translated into an understandable report. Data visualizations are often provided to support the recommendations given and effectively communicate the findings to stakeholders. The report serves as a guide for businesses’ planning and action.
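The four steps above can be sketched end to end in a few lines. This is a minimal illustration rather than a production pipeline: the CSV data, column names, and cleaning rules are all hypothetical, and it uses only Python's standard library to stay dependency-free, where a real project would more likely reach for pandas or SQL as mentioned in step 2.

```python
import csv
import io
from statistics import mean

# Hypothetical raw export; in practice this might come from a database or an API.
RAW_CSV = """customer_id,age,monthly_spend
1,34,120.50
2,,80.00
2,,80.00
3,29,-15.00
4,42,95.25
"""


def ingest(text):
    """Step 1: read the raw rows into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Step 2: drop duplicates, discard invalid entries, fill missing values."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(row.values())
        if key in seen:  # exact duplicate of an earlier row
            continue
        seen.add(key)
        spend = float(row["monthly_spend"])
        if spend < 0:  # assumption: negative spend is an invalid entry
            continue
        age = int(row["age"]) if row["age"] else None
        cleaned.append(
            {"customer_id": int(row["customer_id"]), "age": age, "monthly_spend": spend}
        )
    # Fill missing ages with the mean of the known ages (one of several strategies).
    known_ages = [r["age"] for r in cleaned if r["age"] is not None]
    fill_value = round(mean(known_ages))
    for r in cleaned:
        if r["age"] is None:
            r["age"] = fill_value
    return cleaned


def summarize(rows):
    """Steps 3 and 4: compute a simple summary that could feed a report."""
    return {
        "customers": len(rows),
        "average_age": mean(r["age"] for r in rows),
        "average_spend": round(mean(r["monthly_spend"] for r in rows), 2),
    }


records = clean(ingest(RAW_CSV))
print(summarize(records))
```

Each cleaning choice here (dropping negative values, filling gaps with the mean) is a judgment call that changes the result, which is why this stage is usually documented carefully.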

“Data isn’t units of information. Data is a story about human behavior — about real people’s wants, needs, goals, and fears. Our real job with data is to better understand these very human stories, so we can better serve these people.”


– Daniel Burstein, host of the “How I Made It In Marketing” podcast

Key Terms

Business Intelligence (BI): The umbrella term that describes the process of collecting, processing, analyzing, and visualizing business data to support better decision-making within an organization. Although there are some overlaps with data science, BI has a heavier focus on past data and uses more descriptive tools to analyze data.

Descriptive Analysis: A type of data science analysis that summarizes and describes the main features of a dataset. The goal is to make sense of the data’s basic characteristics, identify trends, patterns, and anomalies, and present this information clearly and concisely.

Diagnostic Analysis: This type of analysis dives into the data to understand the underlying reasons why something occurred. The goal is to identify the causal relationships in the data, leading to insights that can inform decision-making and planning.

Predictive Analysis: The process of using historical data, statistical algorithms, and machine learning techniques to forecast future outcomes. This helps businesses anticipate events that could happen, understand potential risks, and identify opportunities.

Prescriptive Analysis: Like predictive analysis, prescriptive analysis uses past data to forecast future events. It goes a step further by also recommending the optimal response to the forecasted outcome.

Data Mining: Data mining is the process of closely examining and analyzing big data to identify patterns and gain insights.2

Machine Learning: A subcategory of artificial intelligence (AI) that refers to the capability of a machine to use data and algorithms to mimic the way that humans learn.
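To make the idea of finding patterns in data more tangible, here is a minimal sketch of k-means clustering, one common technique for grouping similar records. It is chosen purely for illustration: the spending values and starting centroids are hypothetical, and the function name is our own rather than part of any library.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group numeric values around the given starting centroids (1-D k-means)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved, so the grouping is stable
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical monthly spend per customer.
spend = [12, 15, 14, 90, 95, 88, 13, 92]
centroids, clusters = k_means_1d(spend, [0, 100])
print("Cluster centers:", centroids)
print("Clusters:", clusters)
```

No one told the algorithm that there are "low spenders" and "high spenders"; the two groups emerge from the data itself, which is the sense in which a machine learns from data.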

History

Data science can be traced back to the 1960s, a period when the fields of computer science and statistics began to intersect. Experts in the field started to realize the potential role computers had in large data analysis. John Tukey, in his 1962 paper, The Future of Data Analysis, emphasized the need for techniques that could efficiently extract meaningful insights from increasingly complex datasets, laying foundational ideas for what would later evolve into the interdisciplinary field of data science.3

The advent of the Internet in 1983 changed the game for data science: now, analysts had access to unprecedented amounts of data on humans and our behavior. The digital age led to big data, which sparked innovation in creating new tools and techniques for processing and analysis. 

Fast-forward to the beginning of the 21st century, when William S. Cleveland presented an action plan to push the boundaries of data science outside of theory and into practice.4 His idea was to integrate data mining techniques with computer science to ultimately transform statistics into a tool for innovation. This radical change in the field also warranted the adoption of data science as a new discipline.

As knowledge of and interest in data science grew, the title of “data scientist,” a buzzword pioneered by DJ Patil and Jeff Hammerbacher, became its own job. This marked the solidification of the field, creating a dedicated career path for data science enthusiasts. Indeed, Harvard Business Review naming data scientist the “sexiest job of the 21st century” was a sure sign that data science is here to stay.5

People

C.F. Jeff Wu: A statistician and professor known for his pioneering work in statistics. In 1985, he used the term “data science” to describe the field of statistics during a lecture at the Chinese Academy of Sciences in Beijing. He ended up popularizing the term and called for the renaming of statistics to data science.

Peter Naur: A Danish computer scientist whose trailblazing work made him a prominent figure in the field. In his book, the Concise Survey of Computer Methods, he alternated between the terms data science and computer science. This sparked the realization among academics of the intersection between computer science and statistics.

William S. Cleveland: An American computer scientist and professor, Cleveland is credited with establishing data science in the modern era as its own separate discipline in his 2001 paper, Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics.

Dhanurjay “DJ” Patil: An American mathematician and computer scientist who, alongside Jeff Hammerbacher, solidified the data scientist career path. Patil served as the White House’s Chief Data Scientist where he established nearly 40 Chief Data Officer roles within the Federal government. In his role, Patil was focused on how to responsibly unleash the power of data to collectively benefit all Americans.6

Dr. Hal Varian: One of the most famous economists and data scientists, Varian currently works as Google’s Chief Economist. He was one of the first key players to understand the immense potential and importance of data science in all fields, and he recognized the digital skills gap that would need to be addressed to truly unlock that potential.

Consequences

When computer scientists and statisticians first conceptualized data science, they saw its potential to revolutionize data analysis and decision-making. Today, data science has profoundly impacted various sectors, driving innovation, efficiency, and strategic decision-making. 

Our ability to accurately sort through larger volumes of data at a much quicker pace is key to developing insights into the problems that we face. These problems range from a business determining its target audience and what kind of marketing will be effective, to governments evolving fraud-detection algorithms to tackle tax evasion.

Data science is now part of our everyday lives. Organizations leverage data science to analyze market trends and customer behaviors, optimizing their marketing strategies and enhancing customer experiences. For example, apps like Spotify and Netflix use data science to analyze your preferences and viewing history and then develop personalized recommendations. Similarly, this is why you have a unique Instagram Explore page curated to your interests. By employing big data and sophisticated algorithms, these platforms enhance your experience with tailored content suggestions.

The impact of data science extends far beyond playlists and social media. In healthcare, data science enables the creation of personalized treatment plans by analyzing vast amounts of genetic, clinical, and lifestyle data. It also improves diagnostic accuracy and accelerates medical research by effectively analyzing large datasets.

Controversies

One of the most significant debates surrounding data science concerns its ethics. Many argue that data science is unethical, primarily because of privacy concerns. To function effectively, data science relies on the collection, storage, and processing of vast amounts of personal data, potentially compromising individual privacy. While we may be prompted to “accept cookies” when visiting a website, most people do not consider the implications of machines retaining their personal information and potentially selling it to other businesses without their explicit permission.

Additionally, while data science may seem unbiased because the heavy lifting is completed by a computer, bias still exists. AI and machine learning systems are trained on data that comes from humans, whether that be tracking people’s comments and likes on social media platforms or weeding through complex academic papers. This kind of human data naturally holds bias, and machine learning isn’t immune. Data scientists are also still involved in creating the programs and interpreting the data, which means some bias persists and can perpetuate social inequalities.

For example, returning to the use of data science in healthcare, while algorithms are data-driven and based on patient charts, the dataset will only be representative of the segment of the population captured in the patient database. Since it is not representative of the full population, treatment recommendations resulting from data science may not be suitable for all demographics. Racial inequality already exists in medicine, with racial and ethnic minorities receiving lower-quality care, meaning that treatment recommendations produced from such data would continue to have racial bias embedded.7

However, data science has tons of positive effects on our society as well. At its core, it helps key decision-makers and influencers understand human behavior and provides strategies for nudging people toward decisions that positively impact society. For example, data science can provide insight into which people would be the best candidates for charitable giving, helping nonprofit organizations decide where to direct their efforts and leading to more donations.

Case Study

Data Science and Its Role in the Pandemic8

It is hard to believe that it has been 4 years since the first COVID-19 lockdown was announced. The repeating cycles of lockdowns and COVID-19 case spikes made the future seem bleak. Because its transmission rate was higher than that of SARS or the common flu, the virus quickly spread to every continent in the world apart from Antarctica.8 There was massive pressure to limit its global impact. Data science proved its power to the world by contributing to the efforts to slow the spread and confront the biggest pandemic of the century.

How? Through predictive statistical modeling using big data, data scientists were able to estimate the pattern of spread and hotspots. This allowed experts to determine who was most at risk and where. Being able to predict risk helped governments, policymakers, and public health practitioners to plan and protect vulnerable populations such as those in the service industry or senior citizens.

Technology and Fraud9

Privacy concerns often dominate discussions surrounding data science, but the field also plays a pivotal role in protecting individuals and their finances. As fraud becomes more sophisticated, traditional rule-based systems are proving inadequate. Data science utilizes vast amounts of consumer transactional data to identify anomalies and flag suspicious transactions. For instance, if someone makes 10 transactions in one day from the same city and an 11th transaction suddenly appears from a different city, the bank might block the transaction and notify the account holder for verification. The account holder's response creates a feedback loop, which is crucial because it helps refine and enhance the fraud detection algorithm over time.

Furthermore, data science enables financial institutions to move beyond simplistic rule-based systems to develop more sophisticated models. By analyzing historical transaction patterns, spending behaviors, and other variables, data scientists can create algorithms that detect subtle signs of fraudulent activity that might otherwise go unnoticed. These models not only help prevent financial losses for consumers but also safeguard the integrity of the financial system.7
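The difference between a fixed rule and a data-driven check can be sketched as follows. This is an illustrative toy, not how any particular bank works: the city names, amounts, and the three-standard-deviation threshold are all assumptions, and real systems combine many more signals.

```python
from statistics import mean, stdev


def rule_based_flag(transaction, home_city):
    """Simple rule: flag any transaction made outside the customer's usual city."""
    return transaction["city"] != home_city


def anomaly_score(amount, past_amounts):
    """How many standard deviations the amount sits from the customer's past average."""
    spread = stdev(past_amounts)
    if spread == 0:
        # All past amounts are identical: any different amount is maximally unusual.
        return 0.0 if amount == mean(past_amounts) else float("inf")
    return abs(amount - mean(past_amounts)) / spread


def statistical_flag(transaction, past_amounts, threshold=3.0):
    """Flag a transaction whose amount is far outside the customer's usual range."""
    return anomaly_score(transaction["amount"], past_amounts) > threshold


# Hypothetical history: ten ordinary purchases made in the customer's home city.
history = [42.0, 38.5, 40.0, 45.0, 39.5, 41.0, 43.5, 37.0, 44.0, 40.5]

eleventh = {"city": "Montreal", "amount": 41.0}    # ordinary amount, different city
large_local = {"city": "Toronto", "amount": 900.0}  # home city, unusual amount

for name, tx in [("eleventh", eleventh), ("large_local", large_local)]:
    print(name, rule_based_flag(tx, "Toronto"), statistical_flag(tx, history))
```

The city rule catches the eleventh transaction but misses the large local one, while the statistical check does the opposite. In practice the two are combined, and each verified or disputed flag becomes new data for refining the thresholds.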

Related TDL Content

Let the Data Do the Talking

All research requires funding to proceed. The problem is deciding who should receive it. In the past, the National Research Council (NRC) made these decisions by meeting with stakeholders. But this is a time-consuming and bias-ridden method.

This article explores how TDL assisted the NRC by optimizing its research funding allocation process. The newly developed method emphasizes data. By integrating multiple evaluation metrics into a single composite score, the decision-making process is not only simplified but also made more evidence-based.

Data-Driven Decision-Making

A lot of the time, the way we make decisions is based on a model. These models contain valuable data that help drive our decisions. When choosing what furniture our house is missing, we envision a mental model of the house and then come to a decision. This TDL article takes a look at how we can make better decisions. It turns out that the key lies in data science and quantitative techniques.

References

  1. Stobierski, T. (2019, August 26). The Advantages of Data Driven Decision-Making. Harvard Business School Online. https://online.hbs.edu/blog/post/data-driven-decision-making
  2. Press, G. (2022, April 14). A Very Short History Of Data Science. Forbes. https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/
  3. Hand, D. (2023, September 30). Hand writing: John Tukey, the first data scientist? Institute of Mathematical Statistics. https://imstat.org/2023/09/30/hand-writing-john-tukey-the-first-data-scientist/
  4. Cleveland, W. S. (2001). Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics. International Statistical Review / Revue Internationale de Statistique, 69(1), 21–26. https://doi.org/10.2307/1403527
  5. Davenport, T. H., & Patil, D. J. (2012, October). Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
  6. DJ Patil. (n.d.). Stanford - Program in Data Science. Retrieved June 17, 2024, from https://datasciencemajor.stanford.edu/dj-patil-0
  7. Backman, I. (2023, December 21). Eliminating Racial Bias in Health Care AI: Expert Panel Offers Guidelines. Yale School of Medicine. https://medicine.yale.edu/news-article/eliminating-racial-bias-in-health-care-ai-expert-panel-offers-guidelines
  8. Chow, R. (2021, October 6). The Role of Data Science During the COVID-19 Pandemic. History of Data Science. https://www.historyofdatascience.com/the-role-of-data-science-during-the-covid-19-pandemic/
  9. Mascia, K. (2022, May 10). How data science is ushering in a new era of modern medicine. Johnson & Johnson. https://www.jnj.com/innovation/how-data-science-ushers-in-new-era-of-modern-medicine

About the Authors

Samantha Lau

Samantha graduated from the University of Toronto, majoring in psychology and criminology. During her undergraduate degree, she studied how mindfulness meditation impacted human memory which sparked her interest in cognition. Samantha is curious about the way behavioural science impacts design, particularly in the UX field. As she works to make behavioural science more accessible with The Decision Lab, she is preparing to start her Master of Behavioural and Decision Sciences degree at the University of Pennsylvania. In her free time, you can catch her at a concert or in a dance studio.

Emilie Rose Jones

Emilie currently works in Marketing & Communications for a non-profit organization based in Toronto, Ontario. She completed her Masters of English Literature at UBC in 2021, where she focused on Indigenous and Canadian Literature. Emilie has a passion for writing and behavioural psychology and is always looking for opportunities to make knowledge more accessible. 
