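The previous lesson covered how a Receiver continuously receives data and reports the metadata of that data to ReceiverTracker. This lesson looks at what ReceiverTracker does and how it is implemented.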
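I. Main responsibilities of ReceiverTracker:
Start Receivers on Executors.
Stop Receivers.
Update the rate at which Receivers ingest data (that is, rate limiting).
Continuously monitor the running state of the Receivers and restart any Receiver that stops running. This is the Receiver fault-tolerance mechanism.
Accept Receiver registrations.
Manage the metadata of the data received by Receivers, with the help of ReceivedBlockTracker.
Report error messages sent by Receivers.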
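ReceiverTracker manages a message endpoint, ReceiverTrackerEndpoint, which is used to exchange messages with Receivers or with ReceiverTracker itself.
The start method of ReceiverTracker instantiates ReceiverTrackerEndpoint and launches the Receivers on the Executors: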
/** Start the endpoint and receiver execution thread. */
def start(): Unit = synchronized {
  if (isTrackerStarted) {
    throw new SparkException("ReceiverTracker already started")
  }

  if (!receiverInputStreams.isEmpty) {
    endpoint = ssc.env.rpcEnv.setupEndpoint(
      "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
    if (!skipReceiverLaunch) launchReceivers()
    logInfo("ReceiverTracker started")
    trackerState = Started
  }
}
To launch the Receivers, ReceiverTracker sends a local message to ReceiverTrackerEndpoint, which wraps each Receiver in an RDD and submits it to the cluster as a job.
endpoint.send(StartAllReceivers(receivers))
Here, endpoint is a reference to ReceiverTrackerEndpoint.
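The local-message pattern can be illustrated without Spark. The following is a minimal sketch, not Spark's RPC implementation: ToyEndpoint, ToyStartAllReceivers, ToyStopAllReceivers and the receiver names are hypothetical stand-ins, and "launching" a receiver is reduced to recording its name.

```scala
import scala.collection.mutable

// Toy messages; hypothetical stand-ins for the tracker's local messages.
sealed trait ToyTrackerMessage
final case class ToyStartAllReceivers(receivers: Seq[String]) extends ToyTrackerMessage
case object ToyStopAllReceivers extends ToyTrackerMessage

// Toy endpoint; this is NOT Spark's RpcEndpoint API.
class ToyEndpoint {
  // Names of receivers that were "launched". In Spark each launch is a job on the cluster.
  val launched: mutable.ArrayBuffer[String] = mutable.ArrayBuffer.empty[String]

  // One-way message handler: pattern-match on the message type.
  def receive: PartialFunction[ToyTrackerMessage, Unit] = {
    case ToyStartAllReceivers(receivers) => receivers.foreach(r => launched += r)
    case ToyStopAllReceivers             => launched.clear()
  }

  // Fire-and-forget send: the caller does not expect a reply.
  def send(msg: ToyTrackerMessage): Unit = receive(msg)
}
```

In this toy, calling send(ToyStartAllReceivers(Seq("r0", "r1"))) on a ToyEndpoint leaves both names in launched. The point is only that the tracker drives its own endpoint through messages rather than direct method calls.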
Once a Receiver has started, it registers with ReceiverTracker. Only after the registration succeeds is the Receiver considered fully started.
override protected def onReceiverStart(): Boolean = {
  val msg = RegisterReceiver(
    streamId, receiver.getClass.getSimpleName, host, executorId, endpoint)
  trackerEndpoint.askWithRetry[Boolean](msg)
}
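Registration is a request-reply exchange: the receiver asks and waits for a Boolean answer. A minimal sketch of that handshake follows. ToyRegisterReceiver and ToyRegistry are hypothetical names, and the acceptance rule (reject stream ids the tracker does not know) is an assumption made for illustration; the article does not show how the tracker decides.

```scala
import scala.collection.mutable

// Toy registration message; the fields mirror the ones passed in onReceiverStart.
final case class ToyRegisterReceiver(
    streamId: Int, typ: String, host: String, executorId: String)

// Toy tracker side of the handshake; not Spark's ReceiverTracker.
class ToyRegistry(expectedStreamIds: Set[Int]) {
  private val registered = mutable.HashMap.empty[Int, ToyRegisterReceiver]

  // Request-reply: the Boolean result tells the receiver whether it was accepted.
  // Assumption: unknown stream ids are simply rejected in this toy.
  def ask(msg: ToyRegisterReceiver): Boolean = synchronized {
    if (!expectedStreamIds.contains(msg.streamId)) {
      false
    } else {
      registered.put(msg.streamId, msg)
      true
    }
  }

  def isRegistered(streamId: Int): Boolean = synchronized {
    registered.contains(streamId)
  }
}
```

A receiver that gets false back would treat its start as failed, which matches the statement that a Receiver only counts as started after a successful registration.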
When the Receiver side has received data and certain conditions are met, the data is written to the BlockManager and its metadata is reported to ReceiverTracker:
/** Store block and report it to driver */
def pushAndReportBlock(
    receivedBlock: ReceivedBlock,
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  val blockId = blockIdOption.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
  logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
  val numRecords = blockStoreResult.numRecords
  val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
  trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
  logDebug(s"Reported block $blockId")
}
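When ReceiverTracker receives the metadata and write-ahead-log batching is enabled, it hands the write to a thread pool; otherwise it adds the block and replies directly: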
case AddBlock(receivedBlockInfo) =>
  if (WriteAheadLogUtils.isBatchingEnabled(ssc.conf, isDriver = true)) {
    walBatchingThreadPool.execute(new Runnable {
      override def run(): Unit = Utils.tryLogNonFatalError {
        if (active) {
          context.reply(addBlock(receivedBlockInfo))
        } else {
          throw new IllegalStateException("ReceiverTracker RpcEndpoint shut down.")
        }
      }
    })
  } else {
    context.reply(addBlock(receivedBlockInfo))
  }
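The two reply paths (inline, or from a worker thread) can be sketched with plain JVM concurrency primitives. This is an illustrative model only: ToyAddBlockHandler, its batchingEnabled flag, and the trivial addBlock rule are hypothetical, and a Promise stands in for the RPC reply.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._

// Toy handler; not Spark's ReceiverTrackerEndpoint.
class ToyAddBlockHandler(batchingEnabled: Boolean) {
  private val pool = Executors.newSingleThreadExecutor()
  @volatile private var active = true

  // Stand-in for the real bookkeeping: accept any non-empty block id.
  private def addBlock(blockId: String): Boolean = blockId.nonEmpty

  // Completes the reply either on the calling thread or on the pool thread.
  def handle(blockId: String): Future[Boolean] = {
    val reply = Promise[Boolean]()
    if (batchingEnabled) {
      pool.execute(new Runnable {
        override def run(): Unit =
          if (active) reply.success(addBlock(blockId))
          else reply.failure(new IllegalStateException("handler shut down"))
      })
    } else {
      reply.success(addBlock(blockId))
    }
    reply.future
  }

  // Convenience for callers that want to block until the reply arrives.
  def handleAndWait(blockId: String, timeout: FiniteDuration = 5.seconds): Boolean =
    Await.result(handle(blockId), timeout)

  def stop(): Unit = {
    active = false
    pool.shutdown()
  }
}
```

Moving the slow path onto a pool keeps the message loop free to accept further messages while a write is in progress, which is presumably why the real code does the same when batching is on.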
The metadata is managed by ReceivedBlockTracker.
The block information ends up in streamIdToUnallocatedBlockQueues, where each stream has its own queue of block information.
private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo]
private val streamIdToUnallocatedBlockQueues = new mutable.HashMap[Int, ReceivedBlockQueue]
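To make the "one queue per stream" structure concrete, here is a small self-contained sketch. ToyBlockInfo and ToyUnallocatedBlocks are hypothetical simplifications; the real block information carries more fields, and any write-ahead-log step is left out.

```scala
import scala.collection.mutable

// Simplified stand-in for the per-block metadata.
final case class ToyBlockInfo(streamId: Int, numRecords: Long)

// Toy model of the unallocated-block bookkeeping: one FIFO queue per stream id.
class ToyUnallocatedBlocks {
  private type BlockQueue = mutable.Queue[ToyBlockInfo]
  private val queues = new mutable.HashMap[Int, BlockQueue]

  // Create the queue lazily the first time a stream id is seen.
  private def queueFor(streamId: Int): BlockQueue =
    queues.getOrElseUpdate(streamId, new BlockQueue)

  // Append the block to the queue of its own stream.
  def addBlock(info: ToyBlockInfo): Boolean = synchronized {
    queueFor(info.streamId) += info
    true
  }

  // Snapshot of the blocks still waiting to be assigned to a batch.
  def pending(streamId: Int): List[ToyBlockInfo] = synchronized {
    queueFor(streamId).toList
  }
}
```

Blocks from different streams never share a queue, so arrival order is preserved per stream.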
Each time Streaming triggers a job, the blocks in the queues are allocated to a batch and recorded in the timeToAllocatedBlocks data structure.
private val timeToAllocatedBlocks = new mutable.HashMap[Time, AllocatedBlocks]
....
def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
  if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) {
    val streamIdToBlocks = streamIds.map { streamId =>
        (streamId, getReceivedBlockQueue(streamId).dequeueAll(x => true))
    }.toMap
    val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
    if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) {
      timeToAllocatedBlocks.put(batchTime, allocatedBlocks)
      lastAllocatedBatchTime = batchTime
    } else {
      logInfo(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
    }
  } else {
    // This situation occurs when:
    // 1. WAL is ended with BatchAllocationEvent, but without BatchCleanupEvent,
    // possibly processed batch job or half-processed batch job need to be processed again,
    // so the batchTime will be equal to lastAllocatedBatchTime.
    // 2. Slow checkpointing makes recovered batch time older than WAL recovered
    // lastAllocatedBatchTime.
    // This situation will only occurs in recovery time.
    logInfo(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
  }
}
As the code shows, a single batch can contain data from multiple streams.
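The allocation step can be modelled in a few lines. The sketch below is a simplification under stated assumptions: ToyBatchAllocator is a hypothetical class, batch times are plain Long milliseconds, blocks are represented by string ids, and the write-ahead-log write is omitted entirely.

```scala
import scala.collection.mutable

// Toy model of draining the per-stream queues into a batch; no WAL involved.
class ToyBatchAllocator(streamIds: Seq[Int]) {
  private val unallocated = new mutable.HashMap[Int, mutable.Queue[String]]
  private val allocated = new mutable.HashMap[Long, Map[Int, List[String]]]
  private var lastAllocatedBatchTime: Option[Long] = None

  private def queueFor(streamId: Int): mutable.Queue[String] =
    unallocated.getOrElseUpdate(streamId, mutable.Queue.empty[String])

  def addBlock(streamId: Int, blockId: String): Unit = synchronized {
    queueFor(streamId) += blockId
  }

  // Drain every stream's queue into one batch, but only for a batch time
  // strictly newer than the last allocated one.
  def allocateBlocksToBatch(batchTime: Long): Unit = synchronized {
    if (lastAllocatedBatchTime.forall(batchTime > _)) {
      val streamIdToBlocks = streamIds.map { id =>
        id -> queueFor(id).dequeueAll(_ => true).toList
      }.toMap
      allocated.put(batchTime, streamIdToBlocks)
      lastAllocatedBatchTime = Some(batchTime)
    }
  }

  def blocksOfBatch(batchTime: Long): Map[Int, List[String]] = synchronized {
    allocated.getOrElse(batchTime, Map.empty[Int, List[String]])
  }
}
```

Because every stream id contributes an entry (possibly empty) to the map, one batch naturally spans all streams, and a repeated call with the same batch time leaves the earlier allocation untouched.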
Each time a Streaming job finishes running:
private def handleJobCompletion(job: Job, completedTime: Long) {
  val jobSet = jobSets.get(job.time)
  jobSet.handleJobCompletion(job)
  job.setEndTime(completedTime)
  listenerBus.post(StreamingListenerOutputOperationCompleted(job.toOutputOperationInfo))
  logInfo("Finished job " + job.id + " from job set of time " + jobSet.time)
  if (jobSet.hasCompleted) {
    jobSets.remove(jobSet.time)
    jobGenerator.onBatchCompletion(jobSet.time)
    logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format(
      jobSet.totalDelay / 1000.0, jobSet.time.toString,
      jobSet.processingDelay / 1000.0
    ))
    listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo))
  }
  ...
JobScheduler calls the handleJobCompletion method, which eventually triggers:
jobScheduler.receiverTracker.cleanupOldBlocksAndBatches(time - maxRememberDuration)
Here, maxRememberDuration is the longest time for which the RDDs generated by the DStreams at each batch time are retained.
def cleanupOldBatches(cleanupThreshTime: Time, waitForCompletion: Boolean): Unit = synchronized {
  require(cleanupThreshTime.milliseconds < clock.getTimeMillis())
  val timesToCleanup = timeToAllocatedBlocks.keys.filter { _ < cleanupThreshTime }.toSeq
  logInfo("Deleting batches " + timesToCleanup)
  if (writeToLog(BatchCleanupEvent(timesToCleanup))) {
    timeToAllocatedBlocks --= timesToCleanup
    writeAheadLogOption.foreach(_.clean(cleanupThreshTime.milliseconds, waitForCompletion))
  } else {
    logWarning("Failed to acknowledge batch clean up in the Write Ahead Log.")
  }
}
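The threshold arithmetic and the strict "older than" filter are easy to get wrong, so a small model may help. ToyBatchStore and ToyCleanup are hypothetical names, times are plain Long milliseconds, and the write-ahead-log acknowledgement is left out of this sketch.

```scala
import scala.collection.mutable

// Toy model of the batch-time-to-blocks map and its cleanup.
class ToyBatchStore {
  private val timeToBlocks = new mutable.HashMap[Long, Seq[String]]

  def put(batchTime: Long, blocks: Seq[String]): Unit = synchronized {
    timeToBlocks.put(batchTime, blocks)
  }

  // Remove every batch strictly older than the threshold; return the removed times.
  def cleanupOldBatches(cleanupThreshTime: Long): Seq[Long] = synchronized {
    val timesToCleanup = timeToBlocks.keys.filter(_ < cleanupThreshTime).toSeq.sorted
    timeToBlocks --= timesToCleanup
    timesToCleanup
  }

  def batchTimes: Seq[Long] = synchronized {
    timeToBlocks.keys.toSeq.sorted
  }
}

object ToyCleanup {
  // Threshold = completed batch time minus the maximum remember duration.
  def threshold(batchTime: Long, maxRememberMs: Long): Long = batchTime - maxRememberMs
}
```

For example, with batches at 1000, 2000 and 3000 ms, a completed batch time of 4000 ms and a remember duration of 2000 ms, the threshold is 2000 ms, so only the 1000 ms batch is dropped; the 2000 ms batch sits exactly on the threshold and is kept.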
Finally, at the end of handleJobCompletion, the call
listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo))
invokes the following:
case batchCompleted: StreamingListenerBatchCompleted =>
  listener.onBatchCompleted(batchCompleted)
...
(following the call chain further)
...
/**
 * A RateController that sends the new rate to receivers, via the receiver tracker.
 */
private[streaming] class ReceiverRateController(id: Int, estimator: RateEstimator)
    extends RateController(id, estimator) {
  override def publish(rate: Long): Unit =
    ssc.scheduler.receiverTracker.sendRateUpdate(id, rate)
}
/** Update a receiver's maximum ingestion rate */
def sendRateUpdate(streamUID: Int, newRate: Long): Unit = synchronized {
  if (isTrackerStarted) {
    endpoint.send(UpdateReceiverRateLimit(streamUID, newRate))
  }
}
case UpdateReceiverRateLimit(streamUID, newRate) =>
  for (info <- receiverTrackingInfos.get(streamUID); eP <- info.endpoint) {
    eP.send(UpdateRateLimit(newRate))
  }
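The rate-adjustment message is sent to the Receiver. On receiving it, the Receiver ultimately adjusts the timing of data writes through BlockGenerator, thereby controlling the rate of the data stream.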
case UpdateRateLimit(eps) =>
  logInfo(s"Received a new rate limit: $eps.")
  registeredBlockGenerators.foreach { bg =>
    bg.updateRate(eps)
  }
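What "updating the rate" might mean on the receiving side can be sketched as follows. This is a toy under explicit assumptions, not BlockGenerator's code: ToyRateLimiter is hypothetical, the cap by a configured maximum and the rule that non-positive rates are ignored are assumptions, and the per-record spacing is just one simple way to turn a rate into write timing.

```scala
// Toy rate limiter; maxRateLimit <= 0 means "no configured cap" in this sketch.
class ToyRateLimiter(maxRateLimit: Long) {
  @volatile private var currentRate: Long =
    if (maxRateLimit > 0) maxRateLimit else Long.MaxValue

  // Apply a new rate in records per second.
  // Assumptions: non-positive rates are ignored, and the rate never exceeds the cap.
  def updateRate(newRate: Long): Unit =
    if (newRate > 0) {
      currentRate = if (maxRateLimit > 0) math.min(newRate, maxRateLimit) else newRate
    }

  def getCurrentLimit: Long = currentRate

  // Minimum spacing between two records, in nanoseconds, at the current rate.
  def nanosPerRecord: Long = 1000000000L / currentRate
}
```

Lowering the rate lengthens the spacing between writes, which is how a rate message translates into slower ingestion in this model.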
Notes:
1. DT Big Data Dream Factory WeChat official account: DT_Spark
2. IMF 8 p.m. big data hands-on YY live channel: 68917580
3. Sina Weibo: http://www.weibo.com/ilovepains
Article title: Lesson 11: Spark Streaming source code walkthrough, an in-depth study of the architecture and implementation of ReceiverTracker in the Driver