Which project gives you a distributed, scalable data store that allows random, realtime read/write access to hundreds of terabytes of data?


Answer: A

Explanation: Use Apache HBase when you need random, realtime read/write access to your Big Data.

Note: This project’s goal is the hosting of very large tables — billions of rows × millions of columns — atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.



Linear and modular scalability.

Strictly consistent reads and writes.

Automatic and configurable sharding of tables.

Automatic failover support between RegionServers.

Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.

Easy to use Java API for client access.

Block cache and Bloom Filters for real-time queries.

Query predicate push down via server-side Filters.

Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options.

Extensible JRuby-based (JIRB) shell.

Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.






Reference: http://hbase.apache.org/ (When Would I Use Apache HBase?, first sentence)
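The random, realtime read/write model described above can be exercised interactively from the HBase shell; for example (the table and column-family names here are invented for illustration):

```
hbase> create 'mytable', 'cf'
hbase> put 'mytable', 'row1', 'cf:greeting', 'hello'
hbase> get 'mytable', 'row1'
hbase> scan 'mytable'
```

create, put, get and scan are the standard shell commands shown in the HBase quickstart; each put/get addresses an individual cell by row key and column, which is exactly the random-access pattern HBase is built for.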




Which two of the following are valid statements? (Choose two)



A. HDFS is optimized for storing a large number of files smaller than the HDFS block size.


B. HDFS has the characteristic of supporting a “write once, read many” data access model.


C. HDFS is a distributed file system that replaces ext3 or ext4 on Linux nodes in a Hadoop cluster.


D. HDFS is a distributed file system that runs on top of native OS filesystems and is well suited to storage of very large data sets.


Answer: BD

Explanation: B: HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files.


* Hadoop Distributed File System: A distributed file system that provides high-throughput access to application data.

* HDFS is designed to support very large files.


Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers




You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text keys, IntWritable values. Which interface should your class implement?



A. Combiner <Text, IntWritable, Text, IntWritable>


B. Mapper <Text, IntWritable, Text, IntWritable>


C. Reducer <Text, Text, IntWritable, IntWritable>


D. Reducer <Text, IntWritable, Text, IntWritable>


E. Combiner <Text, Text, IntWritable, IntWritable>


Answer: D
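Hadoop has no separate Combiner interface: a combiner is registered with setCombinerClass() and must itself be a Reducer whose input key/value types match its output key/value types (here Text and IntWritable). A minimal standalone sketch of what such a combiner does for one key, with Hadoop's Text/IntWritable replaced by plain String/Integer so it runs without a cluster (class and method names are invented for the example):

```java
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // Stand-in for a word-count-style combiner: for one key, sum the
    // partial counts emitted by the mappers, emitting the same
    // (key, value) types it received -- (Text, IntWritable) in Hadoop,
    // modeled here as (String, Integer).
    static Map.Entry<String, Integer> combine(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return Map.entry(key, sum);
    }

    public static void main(String[] args) {
        Map.Entry<String, Integer> out = combine("hadoop", List.of(1, 1, 3));
        System.out.println(out.getKey() + "=" + out.getValue()); // hadoop=5
    }
}
```

Because the input and output types are identical, the framework is free to run the combiner zero, one, or many times on the map side without changing the job's result.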




What is the standard configuration of slave nodes in a Hadoop cluster?



A. Each slave node runs a JobTracker and a DataNode daemon.


B. Each slave node runs a TaskTracker and a DataNode daemon.


C. Each slave node either runs a TaskTracker or a DataNode daemon, but not both.


D. Each slave node runs a DataNode daemon, but only a fraction of the slave nodes run TaskTrackers.


E. Each slave node runs a TaskTracker, but only a fraction of the slave nodes run DataNode daemons.


Answer: B

Explanation: A single TaskTracker instance runs on each slave node, as a separate JVM process.

A single DataNode daemon also runs on each slave node, again as a separate JVM process.

One or more task instances run on each slave node, each in its own JVM process. The number of task instances can be controlled by configuration; typically a high-end machine is configured to run more task instances.


Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What is configuration of a typical slave node on Hadoop cluster? How many JVMs run on a slave node?




What is the disadvantage of using multiple reducers with the default HashPartitioner and distributing your workload across your cluster?



A. You will not be able to compress the intermediate data.


B. You will no longer be able to take advantage of a Combiner.


C. By using multiple reducers with the default HashPartitioner, output files may not be in globally sorted order.


D. There are no concerns with this approach. It is always advisable to use multiple reducers.


Answer: C

Explanation: Multiple reducers and total ordering


If your sort job runs with multiple reducers (either because mapreduce.job.reduces in mapred-site.xml has been set to a number larger than 1, or because you’ve used the -r option to specify the number of reducers on the command line), then by default Hadoop will use the HashPartitioner to distribute records across the reducers. Use of the HashPartitioner means that you can’t concatenate your output files to create a single sorted output file. To do this you’ll need total ordering.


Reference: Sorting text files with MapReduce
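The routing described above can be sketched in plain Java; the arithmetic below mirrors what HashPartitioner.getPartition() does (mask off the hash's sign bit, then take it modulo the number of reducers), which is why sorted keys end up scattered across reducers rather than in globally sorted order:

```java
import java.util.List;

public class HashPartitionDemo {
    // Same arithmetic as Hadoop's default HashPartitioner: clear the
    // sign bit so the result is non-negative, then mod by the number
    // of reduce tasks.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Keys arriving in sorted order do NOT land on reducers in
        // sorted order, so the reducers' output files cannot simply be
        // concatenated into one globally sorted file.
        for (String key : List.of("apple", "banana", "cherry", "date")) {
            System.out.println(key + " -> reducer " + getPartition(key, 3));
        }
    }
}
```

Total ordering instead requires a range-based partitioner (e.g. Hadoop's TotalOrderPartitioner), which assigns contiguous key ranges to consecutive reducers.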




What happens in a MapReduce job when you set the number of reducers to one?



A. A single reducer gathers and processes all the output from all the mappers. The output is written in as many separate files as there are mappers.


B. A single reducer gathers and processes all the output from all the mappers. The output is written to a single file in HDFS.


C. Setting the number of reducers to one creates a processing bottleneck, and since the number of reducers as specified by the programmer is used as a reference value only, the MapReduce runtime provides a default setting for the number of reducers.


D. Setting the number of reducers to one is invalid, and an exception is thrown.


Answer: B

Explanation: With a single reducer, all of the map output is fetched by that one reducer, and the job's results are written to a single output file in HDFS.

* By contrast, it is legal to set the number of reduce-tasks to zero if no reduction is desired. In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.

* Often, you may want to process input data using a map function only. To do this, simply set mapreduce.job.reduces to zero. The MapReduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.
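A map-only job as described above can be requested either in the driver code or at submission time; the jar, class, and path names below are placeholders for the example:

```
# In the driver code:
#   job.setNumReduceTasks(0);
# Or at submission time, via the generic -D option:
hadoop jar myjob.jar MyDriver -D mapreduce.job.reduces=0 in/ out/
```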




You want to run Hadoop jobs on your development workstation for testing before you submit them to your production cluster. Which mode of operation in Hadoop allows you to most closely simulate a production cluster while using a single machine?



A. Run all the nodes in your production cluster as virtual machines on your development workstation.


B. Run the hadoop command with the -jt local and the -fs file:/// options.


C. Run the DataNode, TaskTracker, NameNode and JobTracker daemons on a single machine.


D. Run simldooop, the Apache open-source software for simulating Hadoop clusters.


Answer: C

Explanation: Pseudo-distributed mode runs each of the Hadoop daemons (NameNode, JobTracker, DataNode and TaskTracker) in its own JVM on a single machine. Because every daemon is present and the daemons communicate with one another just as they would on separate hosts, this mode most closely simulates a production cluster while using only one machine.

Note on hosting on local VMs: as well as large-scale cloud infrastructures, there is another deployment pattern: local VMs on desktop systems or other development machines. This is a good tactic if your physical machines run Windows and you need to bring up a Linux system running Hadoop, and/or you want to simulate the complexity of a small Hadoop cluster. Have enough RAM for the VM to not swap. Don't try to run more than one VM per physical host, as it will only make things slower. Use file: URLs to access persistent input and output data, and consider making the default filesystem a file: URL so that all storage is really on the physical host; it's often faster and preserves data better.




Custom programmer-defined counters in MapReduce are:



A. Lightweight devices for bookkeeping within MapReduce programs.


B. Lightweight devices for ensuring the correctness of a MapReduce program. Mappers increment counters, and reducers decrement counters. If at the end of the program the counters read zero, then you are sure that the job completed correctly.


C. Lightweight devices for synchronization within MapReduce programs. You can use counters to coordinate execution between a mapper and a reducer.


Answer: A

Explanation: Counters are a useful channel for gathering statistics about the job: for quality control, or for application-level statistics. They are also useful for problem diagnosis. Hadoop maintains some built-in counters for every job, which report various metrics for your job.


Hadoop MapReduce also allows the user to define a set of user-defined counters that can be incremented (or decremented by specifying a negative value as the parameter), by the driver, mapper or the reducer.


Reference: Introduction to Iterative MapReduce and Counters,


http://hadooptutorial.wikispaces.com/Iterative+MapReduce+and+Counters (counters, second paragraph)
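In job code this incrementing is typically done with context.getCounter(...).increment(n) in the new API, or reporter.incrCounter(...) in the old one. A standalone sketch of the bookkeeping idea, with a plain map standing in for Hadoop's counter machinery (the enum and method names are invented for the example):

```java
import java.util.HashMap;
import java.util.Map;

public class CounterSketch {
    // User-defined counters in Hadoop jobs are commonly modeled as an enum.
    enum RecordCounters { GOOD_RECORDS, MALFORMED_RECORDS }

    private final Map<RecordCounters, Long> counters = new HashMap<>();

    // Mirrors Counter.increment(long): a negative amount decrements.
    void increment(RecordCounters c, long amount) {
        counters.merge(c, amount, Long::sum);
    }

    long value(RecordCounters c) {
        return counters.getOrDefault(c, 0L);
    }

    public static void main(String[] args) {
        CounterSketch sketch = new CounterSketch();
        // A mapper might do this once per parsed input record:
        sketch.increment(RecordCounters.GOOD_RECORDS, 1);
        sketch.increment(RecordCounters.GOOD_RECORDS, 1);
        sketch.increment(RecordCounters.MALFORMED_RECORDS, 1);
        System.out.println("good=" + sketch.value(RecordCounters.GOOD_RECORDS)
                + " malformed=" + sketch.value(RecordCounters.MALFORMED_RECORDS));
        // prints: good=2 malformed=1
    }
}
```

In a real job the framework aggregates these per-task values across all map and reduce tasks and reports the totals with the job's final status.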




You need to perform statistical analysis in your MapReduce job and would like to call methods in the Apache Commons Math library, which is distributed as a 1.3 megabyte Java archive (JAR) file. Which is the best way to make this library available to your MapReduce job at runtime?



A. Have your system administrator copy the JAR to all nodes in the cluster and set its location in the HADOOP_CLASSPATH environment variable before you submit your job.


B. Have your system administrator place the JAR file on a Web server accessible to all cluster nodes and then set the HTTP_JAR_URL environment variable to its location.


C. When submitting the job on the command line, specify the -libjars option followed by the JAR file path.


D. Package your code and the Apache Commons Math library into a zip file named JobJar.zip.






Answer: C

Explanation: The usage of the jar command is like this,


Usage: hadoop jar <jar> [mainClass] args…


If you want the commons-math3.jar to be available for all the tasks you can do any one of these:

1. Copy the jar file in $HADOOP_HOME/lib dir or

2. Use the generic option -libjars.
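For option 2, the driver should run through ToolRunner so the generic options are parsed; a typical invocation (the jar, class, and path names are placeholders, with commons-math3.jar as in the text above) might look like:

```
hadoop jar myjob.jar MyDriver -libjars commons-math3.jar input/ output/
```

The -libjars JARs are shipped via the distributed cache and added to the classpath of every task, so no per-node installation is needed.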




In a MapReduce job with 500 map tasks, how many map task attempts will there be?



A. It depends on the number of reducers in the job.


B. Between 500 and 1000.


C. At most 500.


D. At least 500.


E. Exactly 500.


Answer: D

Explanation: From Cloudera Training Course:

A task attempt is a particular instance of an attempt to execute a task. There will be at least as many task attempts as there are tasks. If a task attempt fails, another will be started by the JobTracker. Speculative execution can also result in more task attempts than completed tasks.

Free VCE & PDF File for Cloudera CCD-470 Real Exam
