About HappiestMinds
Next-generation IT consultancy company launched in August 2011. Head office in Bangalore, India, with offices in the USA, UK, Canada, Australia and Singapore. Core focus on disruptive technologies such as Big Data/Analytics, Cloud, Mobile and Social. Raised USD 45M in Series A funding from prominent VCs Intel Capital and Canaan Partners, and from the founders. 45+ clients globally, 800+ employees.
About Myself
Dibyendu is a Big Data Architect at HappiestMinds, where he architects and develops solutions on a Hadoop-based analytics and search platform. In the past few years he has worked on complex data analytics projects that use Hadoop, HBase, and real-time analytics. Before HappiestMinds he worked at EMC, Fair Isaac, Cisco and IBM, among others.
This Presentation
This presentation explores the design and the challenges HappiestMinds faced while implementing a storage and search infrastructure for a library procurement system, in which records for books, documents and artifacts are stored in Apache HBase. After book records are bulk-inserted into HBase, the Elasticsearch index is built offline using MapReduce. In certain use cases, however, records need to be re-indexed in Elasticsearch using RegionObserver coprocessors.
Data Pre-Processing: data ingestion to Hadoop.
Data Loading: MapReduce bulk data upload to the HBase table.
Data Indexing: MapReduce incremental data indexing to Elasticsearch. Only part of each document is indexed.
User Search: the user searches the data and the search engine displays the results. A full data access request fetches the record from HBase.
User Update: the user updates an HBase record. The update propagates to the search cluster.
[Figure: architecture diagram of the data flow. Recoverable labels: User Search, Region Server, Observer Coprocessors; step markers 1, 3, 4, 5a, 5b.]
Two types of coprocessor:
Observers, which are like triggers in conventional databases.
Endpoints, dynamic RPC endpoints that resemble stored procedures.
Observer coprocessors provide callback functions (hooks) for every explicit API method:
MasterObserver: hooks into the HMaster API.
RegionObserver: hooks into region-related operations.
WALObserver: hooks into write-ahead log operations.
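The trigger-like behaviour of an observer can be illustrated with a minimal sketch. This is a toy model in plain Python, not the HBase API: the classes `Region`, `RegionObserver` and `IndexingObserver` and the methods `pre_put` and `post_put` are hypothetical names chosen to mirror the idea of hooks that run before and after an operation.

```python
from typing import Dict, List


class RegionObserver:
    """Hypothetical base class: subclasses override only the hooks they need."""

    def pre_put(self, row: str, values: Dict[str, str]) -> None:
        pass

    def post_put(self, row: str, values: Dict[str, str]) -> None:
        pass


class Region:
    """Toy stand-in for a region that calls its observers around each put."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, str]] = {}
        self.observers: List[RegionObserver] = []

    def put(self, row: str, values: Dict[str, str]) -> None:
        for obs in self.observers:
            obs.pre_put(row, values)       # hook before the write
        self.rows.setdefault(row, {}).update(values)
        for obs in self.observers:
            obs.post_put(row, values)      # hook after the write


class IndexingObserver(RegionObserver):
    """Records each completed put, as a re-indexing hook might do."""

    def __init__(self) -> None:
        self.indexed: List[str] = []

    def post_put(self, row: str, values: Dict[str, str]) -> None:
        self.indexed.append(row)


region = Region()
observer = IndexingObserver()
region.observers.append(observer)
region.put("book-1", {"title": "Example Title"})
print(observer.indexed)
```

The caller only issues the put; the observer runs as a side effect of the write, which is what makes it comparable to a database trigger.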
For each index you can specify:
Number of shards: each index has a fixed number of shards.
Number of replicas: each shard can have zero or more replicas, and this number can be changed dynamically.
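As a rough illustration, the two settings map to the `number_of_shards` and `number_of_replicas` index settings in Elasticsearch. The sketch below only builds the JSON request bodies and does not contact a cluster. The helper names and the values 5 and 1 are assumptions for illustration, not the configuration used in this project.

```python
import json


def create_index_body(shards: int, replicas: int) -> dict:
    """Body for index creation; the shard count is fixed from this point on."""
    if shards < 1:
        raise ValueError("an index needs at least one shard")
    if replicas < 0:
        raise ValueError("replica count cannot be negative")
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        }
    }


def update_replicas_body(replicas: int) -> dict:
    """Body for a later settings update; only the replica count is changed."""
    if replicas < 0:
        raise ValueError("replica count cannot be negative")
    return {"index": {"number_of_replicas": replicas}}


# Illustrative values only.
print(json.dumps(create_index_body(shards=5, replicas=1), indent=2))
print(json.dumps(update_replicas_body(replicas=2)))
```

The first body would be sent when the index is created, the second to the index settings endpoint afterwards.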
Solution
Use the Elasticsearch Node Client. A client node does not hold any index data but has knowledge of the complete cluster. Use HBASE-6505 to share the Node Client across regions in a RegionServer.
HBASE-6505
RegionCoprocessorEnvironment provides a getSharedData() method, which returns a ConcurrentMap. The RegionCoprocessorHost holds this map as a weak reference (in a special map with strongly referenced keys and weakly referenced values), while the RegionEnvironment holds it strongly. That way, if the coprocessor is blacklisted, the coprocessor's environment is removed and any shared data becomes immediately available for garbage collection. The shared data is per RegionServer. As long as at least one region observer or endpoint is active, the shared data is not garbage collected and can be used to share state between the remaining coprocessors of the same class.
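The weak/strong reference arrangement can be sketched in plain Python with `weakref.WeakValueDictionary`. This is a toy model of the idea only, not HBase code: `CoprocessorHost`, `Environment`, `SharedData`, the class name "IndexObserver" and the `es_client` key are hypothetical, and the example relies on CPython releasing unreferenced objects promptly.

```python
import gc
import weakref
from typing import Tuple


class SharedData(dict):
    """A dict subclass, because plain dict instances cannot be weakly referenced."""


class Environment:
    """Toy environment: holds the shared map strongly while it is alive."""

    def __init__(self, shared: SharedData) -> None:
        self._shared = shared

    def get_shared_data(self) -> SharedData:
        return self._shared


class CoprocessorHost:
    """Toy host: coprocessor class name (strong key) -> shared map (weak value)."""

    def __init__(self) -> None:
        self._shared = weakref.WeakValueDictionary()

    def create_environment(self, coprocessor_class: str) -> Environment:
        shared = self._shared.get(coprocessor_class)
        if shared is None:
            shared = SharedData()
            self._shared[coprocessor_class] = shared
        return Environment(shared)

    def has_shared_data(self, coprocessor_class: str) -> bool:
        return coprocessor_class in self._shared


def demo() -> Tuple[bool, bool, bool]:
    host = CoprocessorHost()
    env_a = host.create_environment("IndexObserver")  # region A
    env_b = host.create_environment("IndexObserver")  # region B, same class

    # The first region to need it creates the client; the other one reuses it.
    env_a.get_shared_data().setdefault("es_client", object())
    same_client = (
        env_a.get_shared_data()["es_client"] is env_b.get_shared_data()["es_client"]
    )

    del env_a
    gc.collect()
    still_shared = host.has_shared_data("IndexObserver")  # env_b keeps it alive

    del env_b
    gc.collect()
    released = not host.has_shared_data("IndexObserver")  # no environment left
    return same_client, still_shared, released


print(demo())
```

In this model the host never keeps the shared map alive on its own, so removing the last environment is enough to release the shared client.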
[Figure: version-conflict scenarios between HBase and ES. Recoverable labels: clients C1 and C2; versions V1, V2 (CP) and V2 (M/R); outcomes "Update success" and "Conflict".]
A search-and-update should succeed only when the Elasticsearch version and the HBase version are the same at the time of the update.
Solution
1. The data load from source to HBase inserts each document with a Put call.
2. A postPut coprocessor performs incrementColumnValue on a version column.
3. The same version number is propagated to Elasticsearch during MapReduce-based bulk indexing. Elasticsearch supports externally supplied version numbers.
4. Steps 1-3 repeat for any new data upload.
5. During search and update, the client performs a checkAndPut() call.
5i. The client performs a search and gets the version number from Elasticsearch.
5ii. The client constructs a Put with new version number = old version + 1.
5iii. The client performs checkAndPut, checking for the old version number before doing the Put.
5iv. The postCheckAndPut coprocessor is invoked to propagate the successful Put to the search cluster.
5v. After this step, the version number in the HBase column and the Elasticsearch version are equal.
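The steps above can be sketched as a small in-memory simulation. It is a toy model under stated assumptions, not HBase or Elasticsearch code: `ToyTable`, `ToyIndex` and `update_with_seen_version` are hypothetical names, the version bump that the postPut coprocessor performs is folded into `put`, and the index accepts a write only when the supplied version is higher than the stored one, which is how external versioning is modelled here.

```python
from typing import Dict, Tuple


class VersionConflict(Exception):
    """Raised by the toy index when a supplied version is not newer."""


class ToyTable:
    """In-memory stand-in for a table with a per-row version column."""

    def __init__(self) -> None:
        self.rows: Dict[str, Tuple[int, str]] = {}  # row -> (version, document)

    def put(self, row: str, doc: str) -> int:
        """Steps 1-2: plain put, then the version column is incremented."""
        version = self.rows.get(row, (0, ""))[0] + 1
        self.rows[row] = (version, doc)
        return version

    def check_and_put(self, row: str, expected_version: int, doc: str) -> bool:
        """Step 5iii: write only if the stored version equals the expected one."""
        current = self.rows.get(row)
        if current is None or current[0] != expected_version:
            return False
        self.rows[row] = (expected_version + 1, doc)
        return True


class ToyIndex:
    """In-memory stand-in for an index with externally supplied versions."""

    def __init__(self) -> None:
        self.docs: Dict[str, Tuple[int, str]] = {}

    def index(self, doc_id: str, version: int, doc: str) -> None:
        current = self.docs.get(doc_id)
        if current is not None and version <= current[0]:
            raise VersionConflict(f"{doc_id}: {version} <= {current[0]}")
        self.docs[doc_id] = (version, doc)

    def version_of(self, doc_id: str) -> int:
        return self.docs[doc_id][0]


def update_with_seen_version(table: ToyTable, index: ToyIndex, row: str,
                             seen_version: int, new_doc: str) -> bool:
    """Steps 5ii-5iv for a client that saw `seen_version` in its search."""
    if not table.check_and_put(row, seen_version, new_doc):
        return False                               # another writer got there first
    index.index(row, seen_version + 1, new_doc)    # role of postCheckAndPut
    return True


table, index = ToyTable(), ToyIndex()
loaded_version = table.put("book-1", "original record")     # steps 1-2
index.index("book-1", loaded_version, "original record")    # step 3

seen_by_c1 = index.version_of("book-1")                      # step 5i, client C1
seen_by_c2 = index.version_of("book-1")                      # step 5i, client C2
c1_ok = update_with_seen_version(table, index, "book-1", seen_by_c1, "edit by C1")
c2_ok = update_with_seen_version(table, index, "book-1", seen_by_c2, "edit by C2")
print(c1_ok, c2_ok, table.rows["book-1"][0], index.version_of("book-1"))
```

Both clients read the same version, so only the first update passes the check; the second client would have to search again and retry with the new version.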
Thanks Dibyendu.B@happiestminds.com