Ensurepass

Exam A

QUESTION 1

What is the result of the following command (the database username is foo and password is bar)?

$ sqoop list-tables --connect jdbc:mysql://localhost/databasename --table --username foo --password bar

 A. sqoop lists only those tables in the specified MySQL database that have not already been imported into HDFS B. sqoop returns an error C. sqoop lists the available tables from the database D. sqoop imports all the tables from SQL to HDFS

QUESTION 2

You are building a k-nearest neighbor (k-NN) classifier on a labeled set of points in a high-dimensional space. You determine that the classifier has a large error on the training data. What is the most likely problem?

 A. High-dimensional spaces effectively make local neighborhoods global B. k-NN computation does not converge in high dimensions C. k was too small D. The VC-dimension of a k-NN classifier is too high
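
The statement in option A about neighborhoods in high-dimensional spaces can be illustrated numerically. The following is a minimal sketch, not part of the exam: the function name distance_contrast, the uniform sampling in the unit cube, and the point counts are all assumptions chosen for illustration. It measures the ratio of the nearest to the farthest distance from a query point; a ratio close to 1 means the "nearest" neighbor is barely nearer than any other point.

```python
import math
import random


def distance_contrast(dim, n_points=200, seed=0):
    """Return min/max Euclidean distance from one random query point
    to n_points random points, all drawn uniformly from [0, 1)^dim.
    (Hypothetical helper for illustration only.)"""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return min(dists) / max(dists)


if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        # The ratio tends to grow toward 1 as the dimension increases.
        print(dim, round(distance_contrast(dim), 3))
```

In low dimensions the ratio is typically small (the nearest point is much closer than the farthest); in high dimensions it moves toward 1, which is one way to read "local neighborhoods become global".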

QUESTION 3

 A. It does not require you to make strong assumptions about the data because it is non-parametric B. It significantly reduces the size of the parameter space, thus reducing the risk of overfitting C. It allows you to reduce bias with no tradeoff in variance D. It guarantees convergence of the estimator

QUESTION 4

You have a large file of N records (one per line), and want to randomly sample 10% of them. You have two functions that are perfect random number generators (though they are a bit slow):

random_uniform() generates a uniformly distributed number in the interval [0, 1].

random_permutation(M) generates a random permutation of the numbers 0 through M-1.

Below are three different functions that implement the sampling.

Method A

for line in file:
    if random_uniform() < 0.1:
        print(line)

Method B

i = 0
for line in file:
    if i % 10 == 0:
        print(line)
    i += 1

Method C

idxs = random_permutation(N)[:N // 10]
i = 0
for line in file:
    if i in idxs:
        print(line)
    i += 1

Which method is least likely to give you exactly 10% of your data?

 A. Method A B. Method B C. Method C
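
To compare how many records each method returns, the three methods can be sketched as runnable Python. This is an illustration under stated assumptions: random_uniform and random_permutation are stand-ins built on Python's random module (the exam does not say how they are implemented), and the functions take a list of lines instead of a file handle so they can be called repeatedly.

```python
import random


def random_uniform():
    # Stand-in for the exam's generator (assumption): uniform float in [0, 1).
    return random.random()


def random_permutation(m):
    # Stand-in for the exam's generator (assumption): shuffled 0..m-1.
    return random.sample(range(m), m)


def method_a(lines):
    # Independent coin flip per record: sample size is itself random.
    return [line for line in lines if random_uniform() < 0.1]


def method_b(lines):
    # Every tenth record: deterministic sample size.
    return [line for i, line in enumerate(lines) if i % 10 == 0]


def method_c(lines):
    # First N // 10 indices of a random permutation: fixed sample size.
    n = len(lines)
    idxs = set(random_permutation(n)[: n // 10])
    return [line for i, line in enumerate(lines) if i in idxs]


if __name__ == "__main__":
    records = [f"record {i}" for i in range(1000)]
    for name, method in (("A", method_a), ("B", method_b), ("C", method_c)):
        sizes = [len(method(records)) for _ in range(5)]
        print(name, sizes)
```

Running the script a few times shows which sample sizes stay fixed and which fluctuate from run to run. Converting idxs to a set is a design choice of this sketch; it keeps the membership test fast for large N.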

QUESTION 5

There are 20 patients with acute lymphoblastic leukemia (ALL) and 32 patients with acute myeloid leukemia (AML), both variants of a blood cancer.

The makeup of the groups is as follows:

Each individual has an expression value for each of 10000 different genes. The expression value for each gene is a continuous value between -1 and 1.

With which type of plot can you encode the most data visually?

 A. A heat map sorting the individuals by group B. A histogram of the expression values C. A scatter plot of two largest principal components

QUESTION 6

A function f is convex if the line segment between any two points a and b on its graph is greater than or equal to the value of f(x) for every x with a ≤ x ≤ b.

Which two functions are convex?

 A. x^(1/2) B. e^x C. 2x - 1 D. 1 - x^2
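
The chord definition above can be checked numerically. The sketch below is a rough grid test, not a proof: a function that passes may still fail between grid points. The helper name is_convex_on, the test intervals, and the reading of the four options as sqrt(x), e^x, 2x - 1 and 1 - x^2 are assumptions (the superscripts were lost in the source).

```python
import math


def is_convex_on(f, lo, hi, n=21, steps=11, tol=1e-9):
    """Grid check of the chord condition on [lo, hi]:
    f(t*a + (1-t)*b) <= t*f(a) + (1-t)*f(b) for sampled a, b, t.
    (Hypothetical helper; numerical evidence only.)"""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    for a in xs:
        for b in xs:
            for k in range(steps):
                t = k / (steps - 1)
                x = t * a + (1 - t) * b
                chord = t * f(a) + (1 - t) * f(b)
                if f(x) > chord + tol:
                    return False  # the graph rises above the chord
    return True


# Assumed readings of options A-D, each with an assumed test interval.
candidates = {
    "x^(1/2)": (math.sqrt, 0.0, 4.0),
    "e^x": (math.exp, -3.0, 3.0),
    "2x - 1": (lambda x: 2 * x - 1, -3.0, 3.0),
    "1 - x^2": (lambda x: 1 - x * x, -3.0, 3.0),
}

if __name__ == "__main__":
    for name, (f, lo, hi) in candidates.items():
        print(name, is_convex_on(f, lo, hi))
```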

QUESTION 7

Consider the following sample from a distribution that contains a continuous X and label Y that is either A or B:

Which is the best cut point for X if you want to discretize these values into two buckets in a way that minimizes the sum of chi-square values?

 A. X8 B. X6 C. X5 D. X4 E. X2

QUESTION 8

Why should you stop an iterative machine learning algorithm as soon as the performance of the model on a test set stops improving?

 A. To avoid the need for cross-validating the model B. To prevent overfitting C. To increase the VC (Vapnik-Chervonenkis) dimension for the model D. To keep the number of terms in the model as small as possible E. To maintain the highest VC (Vapnik-Chervonenkis) dimension for the model
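
The stopping rule described in the question can be sketched as a small helper. This is an illustration only: the function name early_stopping_index, the patience parameter, and the loss values are hypothetical and do not come from the exam. The helper scans a sequence of held-out losses (one per training iteration) and reports the iteration at which training would have been stopped.

```python
import math


def early_stopping_index(losses, patience=1):
    """Return the index of the best held-out loss seen before stopping.
    Stops once the loss has failed to improve `patience` times in a row.
    (Hypothetical helper for illustration only.)"""
    best_loss = math.inf
    best_index = -1
    bad_rounds = 0
    for i, loss in enumerate(losses):
        if loss < best_loss:
            best_loss, best_index, bad_rounds = loss, i, 0
        else:
            bad_rounds += 1
            if bad_rounds >= patience:
                break  # held-out performance stopped improving
    return best_index


if __name__ == "__main__":
    # Made-up held-out losses: improving, then getting worse.
    held_out = [0.90, 0.70, 0.60, 0.65, 0.70, 0.72]
    print(early_stopping_index(held_out))
```

With patience=1 the loop stops at the first iteration whose held-out loss does not improve; a larger patience tolerates short plateaus before stopping.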

QUESTION 9

You have a large m x n data matrix M. You decide you want to perform dimension reduction/clustering on your data and have decided to use the singular value decomposition (SVD; also called principal components analysis, PCA).

Refer to the passage above.

What represents the SVD of the matrix M, given the following information:

U is m x m unitary

V is n x n unitary

S is m x n diagonal

Q is n x n invertible

D is n x n diagonal

L is m x m lower triangular

U is m x m upper triangular

 A. M = U S V B. M = U P C. M = Q D Q^-1 D. M = L U

QUESTION 10

You are building a system to perform outlier detection for a large online retailer. You need to build a system to detect if the total dollar value of sales is outside the norm for each U.S. state, as determined from the physical location of the buyer for each purchase.

The retailer's data sources are scattered across multiple systems and databases and are unorganized, with little coordination or shared data or keys between the various data sources.

Below are the sources of data available to you. Determine which three will give you the smallest set of data sources but still allow you to implement the outlier detector by state.

 A. Database of employees that includes only the employee ID, start date, and department B. Database of users that contains only their user ID, name, and a list of every item the user has viewed C. Transaction log that contains only basket ID, basket amount, time of sale completion, and a session ID D. Database of user sessions that includes only session ID, corresponding user ID, and the corresponding IP address E. External database mapping IP addresses to geographic locations F. Database of items that includes only the item name, item ID, and warehouse location G. Database of shipments that includes only the basket ID, shipment address, shipment date, and shipment method
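
Since the sources share few keys, it helps to trace which keys can be chained together. The sketch below shows one possible chain, assuming the transaction log, the session database, and the IP-to-location mapping are used: basket amount -> session ID -> IP address -> state. All rows, names, and the function sales_by_state are hypothetical toy data for illustration; the outlier test itself (comparing each state's total against its own history) is left out.

```python
from collections import defaultdict

# Hypothetical toy rows: (basket ID, basket amount, session ID).
transactions = [
    ("b1", 120.0, "s1"),
    ("b2", 35.5, "s2"),
    ("b3", 80.0, "s1"),
]

# Hypothetical session table: session ID -> (user ID, IP address).
sessions = {
    "s1": ("u1", "203.0.113.5"),
    "s2": ("u2", "198.51.100.7"),
}

# Hypothetical external mapping: IP address -> U.S. state.
ip_to_state = {
    "203.0.113.5": "CA",
    "198.51.100.7": "NY",
}


def sales_by_state(transactions, sessions, ip_to_state):
    """Join the three sources on session ID and IP address and
    sum basket amounts per state."""
    totals = defaultdict(float)
    for _basket_id, amount, session_id in transactions:
        _user_id, ip = sessions[session_id]
        totals[ip_to_state[ip]] += amount
    return dict(totals)


if __name__ == "__main__":
    print(sales_by_state(transactions, sessions, ip_to_state))
```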