Ensurepass

QUESTION 21

Given the following sample of numbers from a distribution:

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89

What are the five numbers that summarize this distribution (the five-number summary of sample percentiles)?

A. 1, 3, 8, 34, 89
B. 1, 4, 13, 34, 89
C. 1, 1.5, 5, 24.5, 89
D. 1, 2.5, 8, 27.5, 89
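As a rough illustration of how a five-number summary can be computed, the sketch below uses Python's standard `statistics.quantiles` with `method="inclusive"` (linear interpolation between order statistics). The function name `five_number_summary` is hypothetical, and the choice of quartile convention is an assumption: other conventions can yield different quartiles for the same sample.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive linear interpolation."""
    data = sorted(values)
    # n=4 asks for the three cut points that split the data into quartiles.
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return (data[0], q1, q2, q3, data[-1])

sample = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
summary = five_number_summary(sample)
print(summary)
```

Swapping in `method="exclusive"` is a quick way to see how sensitive the quartiles of a small sample are to the convention used.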

QUESTION 22

Which two machine learning algorithms should you consider as likely to benefit from discretizing continuous features?

A. Support vector machine
B. Naïve Bayes
C. Decision trees
D. Logistic regression
E. Singular value decomposition

Reference: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2656082/
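For readers unfamiliar with the term, discretizing a continuous feature means replacing each value with the index of the bin it falls into. The sketch below shows equal-width binning, one of several possible schemes; the function name `equal_width_bins` is hypothetical and the choice of four bins is arbitrary.

```python
def equal_width_bins(values, n_bins):
    """Map each value to a bin index in 0 .. n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (hi - lo) / n_bins or 1.0
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

feature = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
binned = equal_width_bins(feature, n_bins=4)
print(binned)
```

With skewed data like this, equal-width bins put most observations in the first bin; equal-frequency bins are a common alternative.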

QUESTION 23

Given the following sample of numbers from a distribution:

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89

How do high-level languages like Apache Hive and Apache Pig efficiently calculate approximate percentiles for a distribution?

A. They sort all of the input samples and then look up the samples for each percentile
B. They maintain an index of input data as it is loaded into HDFS and load it into memory
C. They use pivots to assign each observation to the reducer that calculates each percentile
D. They assign sample observations to buckets and then aggregate the buckets to compute the approximations
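To make the bucket-and-aggregate idea in the options concrete, here is a simplified sketch: each worker counts observations per fixed-width bucket, the partial counts are merged by addition, and the percentile is read off the cumulative counts. This is an illustration under stated assumptions, not the actual Hive or Pig implementation; the function names and the bucket width of 10 are hypothetical.

```python
from collections import Counter

def bucket_counts(values, width):
    """Map step: count observations per fixed-width bucket."""
    return Counter(int(v // width) for v in values)

def approx_percentile(counts, width, p):
    """Estimate the p-th percentile (0 < p <= 100) from merged bucket counts."""
    total = sum(counts.values())
    target = p / 100 * total
    seen = 0
    for bucket in sorted(counts):
        n = counts[bucket]
        if seen + n >= target:
            # Interpolate linearly inside the bucket that holds the target rank.
            fraction = (target - seen) / n
            return (bucket + fraction) * width
        seen += n
    raise ValueError("no data")

sample = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
# Two partial histograms, as if built on separate workers, merged by addition.
merged = bucket_counts(sample[:6], width=10) + bucket_counts(sample[6:], width=10)
median_estimate = approx_percentile(merged, width=10, p=50)
print(median_estimate)
```

The estimate is only as precise as the bucket width, which is the trade-off for never having to sort or hold the full data set in one place.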

QUESTION 24

You are about to sample a 100-dimensional unit cube. To adequately sample any single given dimension, you need only capture 10 points. How many points do you need in order to sample the complete 100-dimensional unit cube adequately?

A. 10010
B. 1010
C. Log2(100)
D. 100
E. 1000
F. 1010
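The reasoning behind this question is the growth of a regular grid: with k sample points along each of d axes, the full grid contains k to the power d points. The sketch below checks that rule by brute force for small d and then evaluates it for larger d; the function name `grid_points` is hypothetical.

```python
from itertools import product

def grid_points(points_per_dim, dims):
    """Number of points in a regular grid with points_per_dim points on each axis."""
    return points_per_dim ** dims

# Brute-force check in 3 dimensions: enumerate every grid coordinate.
explicit = sum(1 for _ in product(range(10), repeat=3))
assert explicit == grid_points(10, 3)

# Python integers are arbitrary precision, so large exponents are exact.
for d in (1, 2, 3, 10, 100):
    print(d, grid_points(10, d))
```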

QUESTION 25

Under what two conditions does stochastic gradient descent outperform 2nd-order optimization techniques such as iteratively reweighted least squares?

A. When the volume of input data is so large and diverse that a 2nd-order optimization technique can be fit to a sample of the data
B. When the model’s estimates must be updated in real-time in order to account for new observations.
C. When the input data can easily fit into memory on a single machine, but we want to calculate confidence intervals for all of the parameters in the model.
D. When we are required to find the parameters that return the optimal value of the objective function.
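For context, stochastic gradient descent updates the parameters from one observation at a time, so it never needs the whole data set in memory. The sketch below shows a single-observation update for logistic regression; the function name `sgd_step`, the learning rate, and the toy data stream are all illustrative assumptions.

```python
import math

def sgd_step(weights, x, y, lr=0.1):
    """One stochastic gradient step for logistic regression on a single (x, y)."""
    z = sum(w * xi for w, xi in zip(weights, x))
    pred = 1.0 / (1.0 + math.exp(-z))
    # Gradient of the log loss for one observation is (pred - y) * x.
    return [w - lr * (pred - y) * xi for w, xi in zip(weights, x)]

weights = [0.0, 0.0]  # [bias, slope]; x[0] is a constant 1.0 for the bias term
# Toy stream: negative feature values are labeled 0, positive values 1.
stream = [([1.0, -2.0], 0), ([1.0, 2.0], 1)] * 200
for x, y in stream:
    weights = sgd_step(weights, x, y)  # the model is usable after every step
print(weights)
```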

QUESTION 26

Which two techniques should you use to avoid overfitting a classification model to a data set?

A. Include a small number of “noise” features that are not thought to be correlated with the dependent variable.
B. Replicate features that are thought to be significant predictors of the dependent variable multiple times for each observation.
C. Separate your input data into a training set that is used for fitting and a test set that is used for evaluating the model’s performance
D. Include a regularization term in the model’s objective function to control how precisely the model fits the data
E. Preprocess the data to exclude atypical observations from the model input
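Two of the terms used in the options, a held-out test set and a regularization term, can be sketched in a few lines. This is a minimal illustration only: the function names, the 80/20 split, and the L2 form of the penalty are assumptions, not something the question prescribes.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]

def l2_penalized_loss(data_loss, weights, lam=0.01):
    """Objective = fit to the data + lam * sum of squared weights."""
    return data_loss + lam * sum(w * w for w in weights)

train, test = train_test_split(list(range(10)))
print(len(train), len(test))
print(l2_penalized_loss(1.0, [1.0, 2.0], lam=0.1))
```

A larger `lam` penalizes large weights more heavily, which limits how closely the model can fit the training set.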

QUESTION 27

You have a large file of N records (one per line), and want to randomly sample 10% of them. You have two functions that are perfect random number generators (though they are a bit slow):

random_uniform() generates a uniformly distributed number in the interval [0, 1].

random_permutation(M) generates a random permutation of the numbers 0 through M-1.

Below are three different functions that implement the sampling.

Method A

for line in file:
    if random_uniform() < 0.1:
        print(line)

Method B

i = 0
for line in file:
    if i % 10 == 0:
        print(line)
    i += 1

Method C

idxs = random_permutation(N)[:N // 10]
i = 0
for line in file:
    if i in idxs:
        print(line)
    i += 1

Which method will have the best runtime performance?

A. Method A
B. Method B
C. Method C
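To experiment with the three methods, they can be written as runnable functions over an in-memory list of lines. In this sketch `random.random` and `random.sample` are stand-ins for the question's `random_uniform` and `random_permutation` (an assumption; note that `random.random` draws from [0, 1) rather than [0, 1]), and the helper names `method_a`, `method_b`, and `method_c` are hypothetical.

```python
import random

def random_uniform():
    # Stand-in for the question's generator.
    return random.random()

def random_permutation(m):
    # Stand-in: a random ordering of the numbers 0 .. m-1.
    return random.sample(range(m), m)

def method_a(lines):
    # One random draw per line.
    return [line for line in lines if random_uniform() < 0.1]

def method_b(lines):
    # No random draws: keeps every tenth line.
    return [line for i, line in enumerate(lines) if i % 10 == 0]

def method_c(lines):
    n = len(lines)
    idxs = random_permutation(n)[: n // 10]  # a list, so `in` scans it linearly
    return [line for i, line in enumerate(lines) if i in idxs]

lines = [f"record {i}" for i in range(1000)]
sample_a, sample_b, sample_c = method_a(lines), method_b(lines), method_c(lines)
print(len(sample_a), len(sample_b), len(sample_c))
```

Counting the work per line makes the comparison visible: Method A calls the generator once per line, Method B never calls it, and Method C builds a full permutation and then does a list membership test for every line.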

QUESTION 28

Refer to the exhibit.

Which point in the figure is the median?

A. A
B. B
C. C

QUESTION 29

Which recommender system technique is domain specific?

A. Content-based collaborative filtering
B. Item-based collaborative filtering
C. User-based collaborative filtering
D. Naïve Bayes classifier

Reference: http://www.cs.cmu.edu/~srosenth/papers/Rosenthal_RecSys09.pdf

QUESTION 30

Which best describes the primary function of Flume?

A. Flume is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with an infrastructure consisting of sources and sinks for importing and evaluating large data sets
B. Flume acts as a Hadoop filesystem for log files
C. Flume imports data from SQL/relational databases into your Hadoop cluster
D. Flume provides a query language for Hadoop similar to SQL
E. Flume is a distributed server for collecting and moving large amounts of data into HDFS as it’s produced from streaming data flows

Free VCE & PDF File for Cloudera DS-200 Real Exam
