Ensurepass

QUESTION 31

What is the default delimiter for Hive tables?

 A. ^A (Control-A)
 B. , (comma)
 C. \t (tab)
 D. : (colon)

Reference: http://blog.spryinc.com/2013/10/four-useful-tricks-for-working-with-hive.html (change the delimiter when exporting a Hive table)
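As a small illustration of what a Control-A delimited row looks like in practice, the sketch below splits one line on the character \x01 (octal \001), which is the field separator Hive uses for text tables when no ROW FORMAT clause is given. The sample row and the helper name split_hive_row are made up for illustration only.

```python
# Control-A is the character with code 1 (written \x01 in Python, \001 in octal).
CTRL_A = "\x01"

def split_hive_row(line):
    """Split one line of a text-format data file into fields on Control-A."""
    return line.rstrip("\n").split(CTRL_A)

# Hypothetical row: commas inside a field are harmless because they are not the delimiter.
row = "42" + CTRL_A + "Michael" + CTRL_A + "Pabst, Blue Ribbon\n"
fields = split_hive_row(row)
print(fields)
```

Because the separator is a non-printing character, free-text fields that contain commas or tabs do not need quoting.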

QUESTION 32

Consider the following sample from a distribution that contains a continuous X and a label Y that is either A or B. Which is the best choice of cut points for X if you want to discretize these values into three buckets in a way that minimizes the sum of chi-square values?

 A. X5 and X8
 B. X4 and X6
 C. X3 and X8
 D. X3 and X6
 E. X2 and X9

QUESTION 33

A company has 20 software engineers working on a project. Over the past week, the team has fixed 100 bugs. Although the average number of bugs fixed per engineer is five, none of the engineers fixed exactly five bugs last week.

You want to understand how productive each engineer is at fixing bugs. What is the best way to visualize the distribution of bug fixes per engineer?

 A. A bar chart of engineers vs. number of bugs fixed
 B. A scatter plot of engineers vs. number of bugs fixed
 C. A normal distribution of the mean and standard deviation of bug fixes per engineer
 D. A histogram that groups engineers together based on the number of bugs they fixed
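To make the idea of grouping engineers by bug count concrete, here is a minimal sketch that builds a text histogram. The per-engineer counts in bugs_fixed are invented for illustration; they are chosen only so that they match the scenario (20 engineers, 100 bugs in total, nobody with exactly five).

```python
from collections import Counter

# Hypothetical data: one entry per engineer, summing to 100, with no value equal to 5.
bugs_fixed = [1, 2, 2, 3, 3, 4, 4, 4, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9, 2, 3]

# Count how many engineers fall into each "number of bugs fixed" bucket.
histogram = Counter(bugs_fixed)

# Print one bar per bucket, including empty buckets such as 5.
for bugs in range(min(bugs_fixed), max(bugs_fixed) + 1):
    print(f"{bugs:2d} bugs | {'#' * histogram[bugs]}")
```

With data like this the mean of five sits in an empty bucket, which is something a summary by mean and standard deviation alone would not show.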

QUESTION 34

Assuming the trends shown in this chart continue, what would we expect the value of the revenue to be in Q1 of 2013?

 A. \$125,000
 B. \$170,000
 C. \$220,000
 D. \$250,000

QUESTION 35

You have a data file that contains two trillion records, one record per line (comma separated). Each record lists two friends and a unique message sent between them. Their names will not have commas.

Michael, John, Pabst, Blue Ribbon

Tiffany, James, BMX Racing

John, Michael, Natural Lemon Flavor

Analyze the pseudocode examples below and determine which set of mappers and reducers will solve for the mean number of messages each user sends to all of their friends.

For example, Michael may have three friends to whom he sends 6, 10, and 200 messages, respectively, so Michael’s mean would be (6+10+200)/3. The solution may require a pipeline of two MapReduce jobs.

 A.
    def mapper1(line):
        key1, key2, message = line.split(',')
        emit((key1, key2), 1)

    def reducer1(key, values):
        emit(key, sum(values))

    def mapper2(key, value):
        key1, key2 = key  // unpack both friends' names into separate keys
        emit(key1, value)

    def reducer2(key, values):
        emit(key, mean(values))

 B.
    def mapper1(line):
        key1, key2, message = line.split(',')
        emit((key1, key2), 1)
        emit((key1, key2), 1)

    def reducer1(key, values):
        emit(key, sum(values))

    def mapper2(key, value):
        key1, key2 = key  // unpack both friends' names into separate keys
        emit(key1, value)

    def reducer2(key, values):
        emit(key, mean(values))

 C.
    def mapper1(line):
        key1, key2, message = line.split(',')
        emit((key1, key2), 1)
        emit((key1, key2), 1)

    def reducer1(key, values):
        emit(key, sum(values))

 D.
    def mapper1(line):
        key1, key2, message = line.split(',')
        sort(key1, key2)  // a given pair will always be sorted the same
        emit((key1, key2), 1)

    def reducer1(key, values):
        emit(key, sum(values))

    def mapper2(key, value):
        key1, key2 = key  // unpack both friends' names into separate keys
        emit(key1, value)
        emit(key2, value)

    def reducer2(key, values):
        emit(key, mean(values))
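To see how a two-job pipeline of this shape behaves, the sketch below simulates MapReduce in memory with plain Python. It is an illustration under stated assumptions, not a Hadoop program: run_job is a hypothetical stand-in for the map, shuffle, and reduce phases, the first name on each line is assumed to be the sender, and the last two sample lines are invented so that one user has more than one message to the same friend.

```python
from collections import defaultdict
from statistics import mean

def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for one MapReduce job (map, shuffle, reduce)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output

def mapper1(line):
    # Assumption: fields are sender, receiver, message; the message may contain commas.
    sender, receiver, _message = [part.strip() for part in line.split(",", 2)]
    yield (sender, receiver), 1

def reducer1(pair, ones):
    yield pair, sum(ones)                  # messages sent from sender to receiver

def mapper2(pair_and_count):
    (sender, _receiver), count = pair_and_count
    yield sender, count

def reducer2(sender, counts):
    yield sender, mean(counts)             # mean messages per friend for this sender

lines = [
    "Michael, John, Pabst, Blue Ribbon",
    "Tiffany, James, BMX Racing",
    "John, Michael, Natural Lemon Flavor",
    "Michael, John, Hypothetical second message",
    "Michael, Tiffany, Hypothetical third message",
]

job1_output = run_job(lines, mapper1, reducer1)
result = dict(run_job(job1_output, mapper2, reducer2))
print(result)
```

In this toy run Michael sends two messages to John and one to Tiffany, so his mean is (2+1)/2. Whether a pair should be treated as directional is exactly what distinguishes the answer options, so the sender assumption above matters.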

QUESTION 36

There are 20 patients with acute lymphoblastic leukemia (ALL) and 32 patients with acute myeloid leukemia (AML), both variants of a blood cancer.

The makeup of the groups is as follows: Each individual has an expression value for each of 10,000 different genes. The expression value for each gene is a continuous value between -1 and 1.

With which type of plot can you encode the most amount of the data visually?

You choose to perform agglomerative hierarchical clustering on the 10,000 features. How much RAM do you need to hold the distance matrix, assuming each distance value is a 64-bit double?

 A. ~ 800 MB
 B. ~ 400 MB
 C. ~ 160 KB
 D. ~ 4 MB
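The memory estimate is plain arithmetic, sketched below for two common storage layouts. The variable names are illustrative; the only inputs taken from the question are the 10,000 features and the 8 bytes per 64-bit double. Which layout the question intends is an assumption the reader has to make.

```python
n_features = 10_000
bytes_per_value = 8  # one 64-bit double

# Full square matrix: every pair stored twice, plus the zero diagonal.
full_matrix_bytes = n_features * n_features * bytes_per_value

# Condensed form: distances are symmetric, so only the n*(n-1)/2 pairs above the diagonal.
condensed_bytes = n_features * (n_features - 1) // 2 * bytes_per_value

print(f"full matrix: {full_matrix_bytes / 1e6:.2f} MB")
print(f"condensed:   {condensed_bytes / 1e6:.2f} MB")
```

The full layout comes to 800 MB and the condensed layout to just under 400 MB.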

QUESTION 37

What are two defining features of RMSE (root-mean square error or root-mean-square deviation)?

 A. It is sensitive to outliers
 B. It is the mean value of recommendations of the K-equal partitions in the input data
 C. It is the square of the median value of the error where error is the difference between predicted rating and actual ratings
 D. It is appropriate for numeric data
 E. It considers the order of recommendations
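For reference, RMSE is the square root of the mean of the squared differences between predicted and actual values. The sketch below computes it for two made-up sets of predictions, next to the mean absolute error for comparison; the function names and the ratings are illustrative assumptions.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length numeric sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mae(predicted, actual):
    """Mean absolute error, shown only for comparison."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

actual = [3.0, 4.0, 5.0, 2.0]
evenly_off = [3.5, 3.5, 4.5, 2.5]    # every prediction is off by 0.5
one_outlier = [3.0, 4.0, 5.0, 6.0]   # three exact predictions, one off by 4

print(rmse(evenly_off, actual), mae(evenly_off, actual))
print(rmse(one_outlier, actual), mae(one_outlier, actual))
```

Squaring the errors before averaging means the single large miss in one_outlier raises RMSE more than it raises the mean absolute error.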

QUESTION 38

You have user profile records in an OLTP database that you want to join with web server logs which you have already ingested into HDFS. What is the best way to acquire the user profile for use in HDFS?

 A. Ingest with Hadoop streaming
 B. Ingest with Apache Flume
 C. Ingest using Hive’s LOAD DATA command
 D. Ingest using Sqoop
 E. Ingest using Pig’s LOAD command

QUESTION 39

In what way can Hadoop be used to improve the performance of Lloyd’s algorithm for k-means clustering on large data sets?

 A. Parallelizing the centroid computations to improve numerical stability
 B. Distributing the updates of the cluster centroids
 C. Reducing the number of iterations required for the centroids to converge
 D. Mapping the input data into a non-Euclidean metric space
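One iteration of Lloyd's algorithm has two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The sketch below shows how that iteration could be split in a MapReduce style, with each partition producing partial sums and counts that are then merged. It runs in memory on made-up points; map_partition and reduce_partials are hypothetical names, not Hadoop APIs.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def map_partition(points, centroids):
    """Map side: per-cluster partial (sum vector, count) for one data partition."""
    partial = {}
    for point in points:
        idx = nearest(point, centroids)
        total, count = partial.get(idx, ([0.0] * len(point), 0))
        partial[idx] = ([t + p for t, p in zip(total, point)], count + 1)
    return partial

def reduce_partials(partials, centroids):
    """Reduce side: merge partial sums and recompute each centroid as a mean."""
    merged = {}
    for partial in partials:
        for idx, (total, count) in partial.items():
            m_total, m_count = merged.get(idx, ([0.0] * len(total), 0))
            merged[idx] = ([a + b for a, b in zip(m_total, total)], m_count + count)
    new_centroids = list(centroids)  # clusters with no points keep their old centroid
    for idx, (total, count) in merged.items():
        new_centroids[idx] = tuple(t / count for t in total)
    return new_centroids

# Hypothetical data split across two partitions, with two starting centroids.
partitions = [
    [(1.0, 1.0), (9.0, 9.0)],
    [(2.0, 2.0), (10.0, 10.0)],
]
centroids = [(0.0, 0.0), (8.0, 8.0)]

partials = [map_partition(part, centroids) for part in partitions]
centroids = reduce_partials(partials, centroids)
print(centroids)
```

Each partition only needs the current centroids and its own points, so the assignment and partial-sum work can run in parallel; the number of iterations needed to converge is unchanged by this split.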

QUESTION 40

Refer to the exhibit. Which point in the figure is the mean?

 A. A
 B. B
 C. C