Data Science Use Case: Keywords

Keywords for data science

Terms are linked to their Wikipedia articles.

  • data science: using scientific methods, algorithms, and systems to extract knowledge and insights from data
  • decision science: data science applied to business problems, combined with behavioral science and design thinking to understand end users
  • business intelligence (BI): analyzing and reporting historical data, like sales statistics and operational metrics, to guide strategic decision-making
  • data analysis: inspecting, cleansing, transforming, and modeling data, with the goal of discovering useful information
  • data mining: discovering patterns in data with methods and tools like machine learning, statistics, and database systems
  • exploratory data analysis (EDA): summarizing a dataset’s main characteristics and informing the development of more complex models or logical next steps
  • data engineering: building infrastructure with which data are gathered, cleaned, stored, and prepped for data science
  • DataOps: automated, process-oriented methodologies to improve quality and reduce cycle time in data analytics, akin to DevOps applied to data
  • artificial intelligence (AI): computer systems that can perform tasks that normally require human intelligence, using human reasoning as a model
  • AIOps: DataOps at the intersection of AI and big data, often using machine learning to feed continuous insights into continuous improvement, and often including collaborative automation, performance monitoring, and event correlation
  • machine learning (ML): a subset of AI in which a system learns by identifying patterns in input data and then applies those patterns to new problems or requests, allowing data scientists to teach a computer to carry out tasks rather than programming it step by step
  • supervised learning: a subset of ML in which a data scientist guides the algorithm toward the desired conclusion, such as a system learning to identify problems by being trained on a dataset of correctly labeled and characterized problems (a minimal sketch follows this list)
  • deep learning: advanced machine learning systems with multiple input/output layers, as opposed to shallow systems having one round of data input/output
  • MLOps: akin to DevOps or DataOps, collaboration and communication between data scientists and operations professionals to manage the production ML lifecycle, with increased automation and improved quality per business and regulatory requirements
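
To ground the ML and supervised-learning entries above: a minimal sketch, assuming scikit-learn is available. The features, labels, and "problem detection" scenario are hypothetical toy data, not a real dataset.

```python
# Supervised learning in miniature: train on labeled examples, predict on new ones.
# Assumes scikit-learn is installed; all data below is made up for illustration.
from sklearn.tree import DecisionTreeClassifier

# Each row is [error_count, response_time_ms]; each label marks whether that
# observation was a "problem" (1) or not (0) -- the teaching signal.
X_train = [[0, 120], [1, 150], [8, 900], [12, 1400], [0, 100], [9, 1100]]
y_train = [0, 0, 1, 1, 0, 1]

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)  # the algorithm learns the labeled patterns

# Apply the learned patterns to new, unlabeled observations.
print(model.predict([[2, 200], [10, 1200]]))  # e.g. [0 1]
```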

  • ETL (extract, transform, load): data integration from multiple sources, normalized or transformed into a common or standardized format, often to build a data warehouse
  • data visualization (dataviz): visual representation of data and other information, to help recognize patterns, trends, and correlations and to understand the significance of data at a glance
  • data model: defines how datasets are connected to each other and how they are processed and stored
  • data warehouse: repository where all the data collected by an organization is stored and used as a guide for business decisions
  • R: programming language for statistical computing, used by statisticians and data miners for data analysis and developing statistical software
  • Python: programming language popular for manipulating and storing data, as well as for general-purpose programming
  • SQL (Structured Query Language): declarative programming language used to perform tasks such as updating or retrieving data (sketched after this list)
  • big data: data sets too large or complex to be dealt with by traditional data-processing software
  • classification: an example of supervised learning in which an algorithm assigns new data to a pre-existing category based on characteristics for which the category is already known; for example, classification can be used to determine whether a customer is likely to spend over $20 online, based on similarity to other customers who have previously spent that amount
  • cluster analysis: like classification, but the algorithm receives unlabeled input data and finds similarities in the data itself by grouping alike data points together, i.e. classification without supervised learning (sketched after this list)
  • cross validation: method to validate the stability or accuracy of a machine-learning model, often by splitting a training set in two, training an algorithm on one subset, and then applying it to the second (a sketch combining this with linear regression follows this list)
  • linear regression: modeling the relationship between two variables by fitting a linear equation to the observed data, enabling prediction of an unknown variable based on its related, known variable
  • causal inference: process that tests whether there is a relationship between cause and effect, often requiring subject matter expertise in addition to good data and algorithms
  • hypothesis testing: use of statistics to assess a hypothesis by computing how likely the observed data would be if that hypothesis were true; widely used in science (sketched after this list)
  • statistical power: the probability of correctly rejecting the null hypothesis when it is in fact false, i.e. higher statistical power means a lower chance of wrongly concluding that a variable has no effect
  • standard error: a measure of the statistical accuracy of an estimate; a larger sample size generally decreases the standard error
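
To illustrate the SQL entry: the sketch below runs a few declarative statements against a throwaway in-memory SQLite database via Python's standard-library sqlite3 module. The sales table and its columns are hypothetical examples.

```python
# Declarative data retrieval and updates with SQL, run through sqlite3.
# The 'sales' table is a made-up example.
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, discarded on close
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 80.0), ("north", 45.5)])

# Retrieval: state *what* you want; the database decides *how* to fetch it.
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)

# An update, equally declarative.
conn.execute("UPDATE sales SET amount = amount * 1.1 WHERE region = 'north'")
conn.close()
```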
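
The cluster-analysis entry differs from classification in that no labels are supplied. A minimal sketch, assuming scikit-learn; the two-feature points are hypothetical.

```python
# Cluster analysis: grouping unlabeled points by similarity alone.
# Assumes scikit-learn; the toy coordinates are made up.
from sklearn.cluster import KMeans

points = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # one natural group
          [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]   # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1] -- groups found without any labels
```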
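
The linear-regression and cross-validation entries combine naturally: fit a line to observed data, then check the fit on data held out from training. A minimal sketch, assuming NumPy and scikit-learn; the data is simulated around a known line.

```python
# Linear regression with a simple hold-out validation split.
# Assumes NumPy and scikit-learn; the data is simulated, not real.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))            # known variable
y = 3.0 * x[:, 0] + 2.0 + rng.normal(0, 1, 100)  # related variable, ~3x + 2

# Cross validation in its simplest form: train on one subset, test on the other.
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
model = LinearRegression().fit(x_train, y_train)

print(model.coef_, model.intercept_)  # recovered slope and intercept, near 3 and 2
print(model.score(x_test, y_test))    # R^2 on held-out data
```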
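
Finally, the hypothesis-testing and standard-error entries can be shown together. A minimal sketch, assuming NumPy and SciPy; both samples are simulated, so the true effect is known in advance.

```python
# A two-sample t-test plus standard-error calculations on simulated data.
# Assumes NumPy and SciPy; no real measurements are involved.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=100, scale=15, size=50)
treated = rng.normal(loc=110, scale=15, size=50)  # a real difference exists

# How surprising would this difference be if the null (no effect) were true?
t_stat, p_value = stats.ttest_ind(control, treated)
print(f"p-value: {p_value:.4f}")  # small -> reject the null hypothesis

# Standard error of the mean shrinks as the sample size grows.
print(f"SE at n=50:  {stats.sem(treated):.2f}")
print(f"SE at n=500: {stats.sem(rng.normal(110, 15, 500)):.2f}")  # smaller
```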