Book Introduction

Introduction to Data Mining (English Edition) | PDF | EPUB | MOBI | Kindle e-book editions, Baidu Cloud download

Introduction to Data Mining (English Edition)
  • Authors: Pang-Ning Tan (USA), Michael Steinbach (USA), Vipin Kumar (USA)
  • Publisher: China Machine Press, Beijing
  • ISBN: 9787111316701
  • Publication year: 2010
  • Listed page count: 771 pages
  • File size: 50 MB
  • File page count: 791 pages
  • Subject heading: Data collection – English

PDF Download


Click here for the online PDF e-book download of this book [recommended: cloud extraction, quick and convenient]. Direct PDF download, usable on both mobile and PC.
Torrent download [fast over BT]. Tip: please use the BT client FDM to download. [Client download page] [Direct link download (convenient but slow)] [Read this book online] [Get the extraction code online]

Download Instructions

Introduction to Data Mining (English Edition), PDF e-book download

The downloaded file is a RAR archive. Use extraction software to unpack it and obtain the book in PDF format.

We recommend downloading with Free Download Manager (FDM), a free, ad-free, cross-platform BT client. All resources on this site are packaged as BT torrents, so a dedicated BT client is required, such as BitComet, qBittorrent, or uTorrent. Because this book is not currently a popular resource, Xunlei (Thunder) is not recommended; once the resource is better seeded, it can be downloaded with Xunlei as well.

(The file page count should be greater than the listed page count, except for multi-volume e-books.)

Note: all archives on this site require an extraction code. Click to download the archive extraction tool.

Table of Contents

1 Introduction  1
1.1 What Is Data Mining?  2
1.2 Motivating Challenges  4
1.3 The Origins of Data Mining  6
1.4 Data Mining Tasks  7
1.5 Scope and Organization of the Book  11
1.6 Bibliographic Notes  13
1.7 Exercises  16

2 Data  19
2.1 Types of Data  22
2.1.1 Attributes and Measurement  23
2.1.2 Types of Data Sets  29
2.2 Data Quality  36
2.2.1 Measurement and Data Collection Issues  37
2.2.2 Issues Related to Applications  43
2.3 Data Preprocessing  44
2.3.1 Aggregation  45
2.3.2 Sampling  47
2.3.3 Dimensionality Reduction  50
2.3.4 Feature Subset Selection  52
2.3.5 Feature Creation  55
2.3.6 Discretization and Binarization  57
2.3.7 Variable Transformation  63
2.4 Measures of Similarity and Dissimilarity  65
2.4.1 Basics  66
2.4.2 Similarity and Dissimilarity between Simple Attributes  67
2.4.3 Dissimilarities between Data Objects  69
2.4.4 Similarities between Data Objects  72
2.4.5 Examples of Proximity Measures  73
2.4.6 Issues in Proximity Calculation  80
2.4.7 Selecting the Right Proximity Measure  83
2.5 Bibliographic Notes  84
2.6 Exercises  88

3 Exploring Data  97
3.1 The Iris Data Set  98
3.2 Summary Statistics  98
3.2.1 Frequencies and the Mode  99
3.2.2 Percentiles  100
3.2.3 Measures of Location: Mean and Median  101
3.2.4 Measures of Spread: Range and Variance  102
3.2.5 Multivariate Summary Statistics  104
3.2.6 Other Ways to Summarize the Data  105
3.3 Visualization  105
3.3.1 Motivations for Visualization  105
3.3.2 General Concepts  106
3.3.3 Techniques  110
3.3.4 Visualizing Higher-Dimensional Data  124
3.3.5 Do's and Don'ts  130
3.4 OLAP and Multidimensional Data Analysis  131
3.4.1 Representing Iris Data as a Multidimensional Array  131
3.4.2 Multidimensional Data: The General Case  133
3.4.3 Analyzing Multidimensional Data  135
3.4.4 Final Comments on Multidimensional Data Analysis  139
3.5 Bibliographic Notes  139
3.6 Exercises  141

4 Classification: Basic Concepts, Decision Trees, and Model Evaluation  145
4.1 Preliminaries  146
4.2 General Approach to Solving a Classification Problem  148
4.3 Decision Tree Induction  150
4.3.1 How a Decision Tree Works  150
4.3.2 How to Build a Decision Tree  151
4.3.3 Methods for Expressing Attribute Test Conditions  155
4.3.4 Measures for Selecting the Best Split  158
4.3.5 Algorithm for Decision Tree Induction  164
4.3.6 An Example: Web Robot Detection  166
4.3.7 Characteristics of Decision Tree Induction  168
4.4 Model Overfitting  172
4.4.1 Overfitting Due to Presence of Noise  175
4.4.2 Overfitting Due to Lack of Representative Samples  177
4.4.3 Overfitting and the Multiple Comparison Procedure  178
4.4.4 Estimation of Generalization Errors  179
4.4.5 Handling Overfitting in Decision Tree Induction  184
4.5 Evaluating the Performance of a Classifier  186
4.5.1 Holdout Method  186
4.5.2 Random Subsampling  187
4.5.3 Cross-Validation  187
4.5.4 Bootstrap  188
4.6 Methods for Comparing Classifiers  188
4.6.1 Estimating a Confidence Interval for Accuracy  189
4.6.2 Comparing the Performance of Two Models  191
4.6.3 Comparing the Performance of Two Classifiers  192
4.7 Bibliographic Notes  193
4.8 Exercises  198

5 Classification: Alternative Techniques  207
5.1 Rule-Based Classifier  207
5.1.1 How a Rule-Based Classifier Works  209
5.1.2 Rule-Ordering Schemes  211
5.1.3 How to Build a Rule-Based Classifier  212
5.1.4 Direct Methods for Rule Extraction  213
5.1.5 Indirect Methods for Rule Extraction  221
5.1.6 Characteristics of Rule-Based Classifiers  223
5.2 Nearest-Neighbor Classifiers  223
5.2.1 Algorithm  225
5.2.2 Characteristics of Nearest-Neighbor Classifiers  226
5.3 Bayesian Classifiers  227
5.3.1 Bayes Theorem  228
5.3.2 Using the Bayes Theorem for Classification  229
5.3.3 Naïve Bayes Classifier  231
5.3.4 Bayes Error Rate  238
5.3.5 Bayesian Belief Networks  240
5.4 Artificial Neural Network (ANN)  246
5.4.1 Perceptron  247
5.4.2 Multilayer Artificial Neural Network  251
5.4.3 Characteristics of ANN  255
5.5 Support Vector Machine (SVM)  256
5.5.1 Maximum Margin Hyperplanes  256
5.5.2 Linear SVM: Separable Case  259
5.5.3 Linear SVM: Nonseparable Case  266
5.5.4 Nonlinear SVM  270
5.5.5 Characteristics of SVM  276
5.6 Ensemble Methods  276
5.6.1 Rationale for Ensemble Method  277
5.6.2 Methods for Constructing an Ensemble Classifier  278
5.6.3 Bias-Variance Decomposition  281
5.6.4 Bagging  283
5.6.5 Boosting  285
5.6.6 Random Forests  290
5.6.7 Empirical Comparison among Ensemble Methods  294
5.7 Class Imbalance Problem  294
5.7.1 Alternative Metrics  295
5.7.2 The Receiver Operating Characteristic Curve  298
5.7.3 Cost-Sensitive Learning  302
5.7.4 Sampling-Based Approaches  305
5.8 Multiclass Problem  306
5.9 Bibliographic Notes  309
5.10 Exercises  315

6 Association Analysis: Basic Concepts and Algorithms  327
6.1 Problem Definition  328
6.2 Frequent Itemset Generation  332
6.2.1 The Apriori Principle  333
6.2.2 Frequent Itemset Generation in the Apriori Algorithm  335
6.2.3 Candidate Generation and Pruning  338
6.2.4 Support Counting  342
6.2.5 Computational Complexity  345
6.3 Rule Generation  349
6.3.1 Confidence-Based Pruning  350
6.3.2 Rule Generation in Apriori Algorithm  350
6.3.3 An Example: Congressional Voting Records  352
6.4 Compact Representation of Frequent Itemsets  353
6.4.1 Maximal Frequent Itemsets  354
6.4.2 Closed Frequent Itemsets  355
6.5 Alternative Methods for Generating Frequent Itemsets  359
6.6 FP-Growth Algorithm  363
6.6.1 FP-Tree Representation  363
6.6.2 Frequent Itemset Generation in FP-Growth Algorithm  366
6.7 Evaluation of Association Patterns  370
6.7.1 Objective Measures of Interestingness  371
6.7.2 Measures beyond Pairs of Binary Variables  382
6.7.3 Simpson's Paradox  384
6.8 Effect of Skewed Support Distribution  386
6.9 Bibliographic Notes  390
6.10 Exercises  404

7 Association Analysis: Advanced Concepts  415
7.1 Handling Categorical Attributes  415
7.2 Handling Continuous Attributes  418
7.2.1 Discretization-Based Methods  418
7.2.2 Statistics-Based Methods  422
7.2.3 Non-discretization Methods  424
7.3 Handling a Concept Hierarchy  426
7.4 Sequential Patterns  429
7.4.1 Problem Formulation  429
7.4.2 Sequential Pattern Discovery  431
7.4.3 Timing Constraints  436
7.4.4 Alternative Counting Schemes  439
7.5 Subgraph Patterns  442
7.5.1 Graphs and Subgraphs  443
7.5.2 Frequent Subgraph Mining  444
7.5.3 Apriori-like Method  447
7.5.4 Candidate Generation  448
7.5.5 Candidate Pruning  453
7.5.6 Support Counting  457
7.6 Infrequent Patterns  457
7.6.1 Negative Patterns  458
7.6.2 Negatively Correlated Patterns  458
7.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns  460
7.6.4 Techniques for Mining Interesting Infrequent Patterns  461
7.6.5 Techniques Based on Mining Negative Patterns  463
7.6.6 Techniques Based on Support Expectation  465
7.7 Bibliographic Notes  469
7.8 Exercises  473

8 Cluster Analysis: Basic Concepts and Algorithms  487
8.1 Overview  490
8.1.1 What Is Cluster Analysis?  490
8.1.2 Different Types of Clusterings  491
8.1.3 Different Types of Clusters  493
8.2 K-means  496
8.2.1 The Basic K-means Algorithm  497
8.2.2 K-means: Additional Issues  506
8.2.3 Bisecting K-means  508
8.2.4 K-means and Different Types of Clusters  510
8.2.5 Strengths and Weaknesses  510
8.2.6 K-means as an Optimization Problem  513
8.3 Agglomerative Hierarchical Clustering  515
8.3.1 Basic Agglomerative Hierarchical Clustering Algorithm  516
8.3.2 Specific Techniques  518
8.3.3 The Lance-Williams Formula for Cluster Proximity  524
8.3.4 Key Issues in Hierarchical Clustering  524
8.3.5 Strengths and Weaknesses  526
8.4 DBSCAN  526
8.4.1 Traditional Density: Center-Based Approach  527
8.4.2 The DBSCAN Algorithm  528
8.4.3 Strengths and Weaknesses  530
8.5 Cluster Evaluation  532
8.5.1 Overview  533
8.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation  536
8.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix  542
8.5.4 Unsupervised Evaluation of Hierarchical Clustering  544
8.5.5 Determining the Correct Number of Clusters  546
8.5.6 Clustering Tendency  547
8.5.7 Supervised Measures of Cluster Validity  548
8.5.8 Assessing the Significance of Cluster Validity Measures  553
8.6 Bibliographic Notes  555
8.7 Exercises  559

9 Cluster Analysis: Additional Issues and Algorithms  569
9.1 Characteristics of Data, Clusters, and Clustering Algorithms  570
9.1.1 Example: Comparing K-means and DBSCAN  570
9.1.2 Data Characteristics  571
9.1.3 Cluster Characteristics  573
9.1.4 General Characteristics of Clustering Algorithms  575
9.2 Prototype-Based Clustering  577
9.2.1 Fuzzy Clustering  577
9.2.2 Clustering Using Mixture Models  583
9.2.3 Self-Organizing Maps (SOM)  594
9.3 Density-Based Clustering  600
9.3.1 Grid-Based Clustering  601
9.3.2 Subspace Clustering  604
9.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based Clustering  608
9.4 Graph-Based Clustering  612
9.4.1 Sparsification  613
9.4.2 Minimum Spanning Tree (MST) Clustering  614
9.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS  616
9.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling  616
9.4.5 Shared Nearest Neighbor Similarity  622
9.4.6 The Jarvis-Patrick Clustering Algorithm  625
9.4.7 SNN Density  627
9.4.8 SNN Density-Based Clustering  629
9.5 Scalable Clustering Algorithms  630
9.5.1 Scalability: General Issues and Approaches  630
9.5.2 BIRCH  633
9.5.3 CURE  635
9.6 Which Clustering Algorithm?  639
9.7 Bibliographic Notes  643
9.8 Exercises  647

10 Anomaly Detection  651
10.1 Preliminaries  653
10.1.1 Causes of Anomalies  653
10.1.2 Approaches to Anomaly Detection  654
10.1.3 The Use of Class Labels  655
10.1.4 Issues  656
10.2 Statistical Approaches  658
10.2.1 Detecting Outliers in a Univariate Normal Distribution  659
10.2.2 Outliers in a Multivariate Normal Distribution  661
10.2.3 A Mixture Model Approach for Anomaly Detection  662
10.2.4 Strengths and Weaknesses  665
10.3 Proximity-Based Outlier Detection  666
10.3.1 Strengths and Weaknesses  666
10.4 Density-Based Outlier Detection  668
10.4.1 Detection of Outliers Using Relative Density  669
10.4.2 Strengths and Weaknesses  670
10.5 Clustering-Based Techniques  671
10.5.1 Assessing the Extent to Which an Object Belongs to a Cluster  672
10.5.2 Impact of Outliers on the Initial Clustering  674
10.5.3 The Number of Clusters to Use  674
10.5.4 Strengths and Weaknesses  674
10.6 Bibliographic Notes  675
10.7 Exercises  680

Appendix A Linear Algebra  685
A.1 Vectors  685
A.1.1 Definition  685
A.1.2 Vector Addition and Multiplication by a Scalar  685
A.1.3 Vector Spaces  687
A.1.4 The Dot Product, Orthogonality, and Orthogonal Projections  688
A.1.5 Vectors and Data Analysis  690
A.2 Matrices  691
A.2.1 Matrices: Definitions  691
A.2.2 Matrices: Addition and Multiplication by a Scalar  692
A.2.3 Matrices: Multiplication  693
A.2.4 Linear Transformations and Inverse Matrices  695
A.2.5 Eigenvalue and Singular Value Decomposition  697
A.2.6 Matrices and Data Analysis  699
A.3 Bibliographic Notes  700

Appendix B Dimensionality Reduction  701
B.1 PCA and SVD  701
B.1.1 Principal Components Analysis (PCA)  701
B.1.2 SVD  706
B.2 Other Dimensionality Reduction Techniques  708
B.2.1 Factor Analysis  708
B.2.2 Locally Linear Embedding (LLE)  710
B.2.3 Multidimensional Scaling, FastMap, and ISOMAP  712
B.2.4 Common Issues  715
B.3 Bibliographic Notes  716

Appendix C Probability and Statistics  719
C.1 Probability  719
C.1.1 Expected Values  722
C.2 Statistics  723
C.2.1 Point Estimation  724
C.2.2 Central Limit Theorem  724
C.2.3 Interval Estimation  725
C.3 Hypothesis Testing  726

Appendix D Regression  729
D.1 Preliminaries  729
D.2 Simple Linear Regression  730
D.2.1 Least Square Method  731
D.2.2 Analyzing Regression Errors  733
D.2.3 Analyzing Goodness of Fit  735
D.3 Multivariate Linear Regression  736
D.4 Alternative Least-Square Regression Methods  737

Appendix E Optimization  739
E.1 Unconstrained Optimization  739
E.1.1 Numerical Methods  742
E.2 Constrained Optimization  746
E.2.1 Equality Constraints  746
E.2.2 Inequality Constraints  747

Author Index  750
Subject Index  758
