

ECE 795 -- Advanced Big Data Analytics

Final Project: Comprehensive Design of Big Data Analyses

Assigned: March 3, 2020 (Spring 2020)
Project Demonstration: April 14 and 16, 2020
Project and Report: April 16, 2020


In this project, you will need to leverage the knowledge and tools discussed in this course to
design a comprehensive workflow of big data analysis. Please select one task from the following
(first come, first served); each task allows at most five people to work on it (Task 1 allows six
people, each aiming for a different format conversion path). Please make sure to provide
sufficient comments in your code to receive full credit. For the sake of space, the references,
hints, and some requirements are not included here. Please find the complete description of each
task on GitHub.
Task 1: Large Scale Web Record Format Conversion
1. Download the provided CSV data from the link and store it in HDFS.
2. Pick one of the data format conversion paths in the following:
a. CSV to XML to JSON
b. CSV to XML to YAML
c. CSV to JSON to XML
d. CSV to JSON to YAML
e. CSV to YAML to XML
f. CSV to YAML to JSON
3. Implement a PySpark application to pre-process the raw data if necessary and convert
the original CSV data to the first data format you chose in Step 2. Afterwards, convert the
data again to the second data format chosen in Step 2 (a minimal conversion sketch is given
after this task's list).
4. Repeat Step 3 after increasing the number of workers to 3 and 4 in the cluster. Compare
the computing times before and after the changes and plot the figure "Computing time vs.
#workers".
5. Note: there will be two sets of CSV files as inputs. One is a large number of small
CSV files and the other is a single large CSV file. Please make sure your
PySpark application can handle both cases. A performance analysis comparing the two
input sets should be provided.
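
Below is a minimal sketch of Step 3 for conversion path (d), CSV to JSON to YAML; the other
paths follow the same pattern with different readers and writers. The HDFS paths are
placeholders, and the YAML step assumes PyYAML is installed on every worker, since Spark has
no built-in YAML writer.

    # Minimal sketch for path (d): CSV -> JSON -> YAML. Paths are placeholders.
    import time
    import yaml  # PyYAML; must be available on every worker

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-format-conversion").getOrCreate()
    start = time.time()

    # The same call reads either one large CSV file or a directory of many
    # small CSV files, so both input sets go through the same code path.
    df = spark.read.csv("hdfs:///project/input/", header=True, inferSchema=True)

    # First conversion: CSV -> JSON (one JSON object per line).
    df.write.mode("overwrite").json("hdfs:///project/stage1_json/")

    # Second conversion: JSON -> YAML, serializing each row with PyYAML.
    json_df = spark.read.json("hdfs:///project/stage1_json/")
    yaml_lines = json_df.rdd.map(lambda row: yaml.safe_dump(row.asDict()))
    yaml_lines.saveAsTextFile("hdfs:///project/stage2_yaml/")

    # Elapsed time feeds the "Computing time vs. #workers" plot in Step 4.
    print("Conversion took %.1f s" % (time.time() - start))
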
Task 2: Stack Overflow Data Analysis in PySpark
1. Use Google Cloud BigQuery API to load the provided data into HDFS.
2. Use PySpark to read the data from your clusters.
3. Analyze the data and answer the following questions (a PySpark sketch addressing these
is given after this task's list):
a. How many questions are posted from Sept. 1st, 2019 to Dec. 31st, 2019?
b. What is the percentage of questions that have been answered over the above
period?
c. On average, how long did it take for questions to be answered on the website over
the above period?
4. Using the questions in Step 3 as examples, perform further data analyses on the given
dataset and try to find different types of useful information. Please implement all the
analyses in PySpark and justify the conclusions of your analyses with the results
of the code. Your report could be designed to cover tasks such as the following.
a. Find one way to improve the answer rate for a question.
b. Generate an analysis of the user changes on Stack Overflow over the last twelve years.
c. Generate a review of topical trends during the previous twelve years.
5. The complexity and novelty of the analyses will have an impact on the scoring. External
data may be used along with the Stack Overflow data.
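
A minimal sketch of the Step 3 analyses is shown below. It assumes the data follows the schema
of the public BigQuery Stack Overflow dataset: a posts_questions table with id, creation_date,
and answer_count columns, and a posts_answers table with parent_id and creation_date. The HDFS
paths and the Parquet format are assumptions to be adapted to your cluster.

    # Sketch of the Step 3 questions; schema and paths are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("stackoverflow-analysis").getOrCreate()

    questions = spark.read.parquet("hdfs:///project/posts_questions/")
    answers = spark.read.parquet("hdfs:///project/posts_answers/")

    # (a) Questions posted from Sept. 1, 2019 to Dec. 31, 2019.
    q_fall = questions.filter(
        (F.col("creation_date") >= "2019-09-01") &
        (F.col("creation_date") < "2020-01-01"))
    total = q_fall.count()
    print("Questions posted:", total)

    # (b) Percentage of those questions with at least one answer.
    answered = q_fall.filter(F.col("answer_count") > 0).count()
    print("Answered: %.1f%%" % (100.0 * answered / total))

    # (c) Average delay (in hours) between a question and its first answer.
    first_ans = answers.groupBy("parent_id").agg(
        F.min("creation_date").alias("first_answer_date"))
    delay = q_fall.join(first_ans, q_fall.id == first_ans.parent_id).select(
        (F.unix_timestamp("first_answer_date") -
         F.unix_timestamp("creation_date")).alias("seconds"))
    delay.agg((F.avg("seconds") / 3600).alias("avg_hours")).show()
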

Task 3: Publication Analysis for Chosen Universities from Google Scholar
1. Pick a list of universities and search for them on the Google Scholar profile pages.
2. Implement a web crawler to identify the top 300 professors (ranked by total citations) from
the homepage of each university, find the complete paper list on the homepage of each
identified professor, and store all the related web pages in HDFS.
a. https://scholar.google.com/citations?view_op=view_org&org=16589592566991147599&hl=en&oi=io
(homepage of the University of Miami)
b. https://scholar.google.com/citations?hl=en&user=7fQX_pYAAAAJ (homepage of
Prof. A. Parasuraman, who has the most citations at the University of Miami)
c. The total number of papers collected should be no fewer than 1,000,000 (more than 10
universities).
d. cstart and pagesize are URL parameters used to scan through the paper list (see the
crawler sketch after this task's list).
3. Find the fastest way to determine the best co-author of each professor and justify why your
method is the fastest.
4. Use PySpark to partition the collected papers in various ways and analyze the
collected data. Please justify the conclusions of your analyses with the results of the code.
Your report could be designed to cover tasks such as the following (the complexity and
novelty of the analyses will have an impact on the scoring).
a. Generate an analysis of the best department in each university.
b. Generate a review of popular research keywords in each university in previous
years.
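
The sketch below pages through a single professor's paper list using the cstart and pagesize
parameters from Step 2d, with the requests library. The end-of-list marker is an assumption to
be verified against real pages, and a production crawler would need retries and error handling,
since Google Scholar rate-limits aggressive clients.

    # Sketch of paging through one paper list via cstart/pagesize.
    import time
    import requests

    BASE = "https://scholar.google.com/citations"

    def fetch_paper_pages(user_id, pagesize=100, delay=2.0):
        """Yield the raw HTML of each page of a scholar's paper list."""
        cstart = 0
        while True:
            resp = requests.get(
                BASE,
                params={"hl": "en", "user": user_id,
                        "cstart": cstart, "pagesize": pagesize},
                headers={"User-Agent": "Mozilla/5.0"},
                timeout=30)
            resp.raise_for_status()
            # "gsc_a_e" marks an empty results row; this end-of-list test
            # is an assumption and should be verified on real pages.
            if "gsc_a_e" in resp.text:
                break
            yield resp.text
            cstart += pagesize
            time.sleep(delay)  # be polite to avoid getting blocked

    # Each yielded page would then be written to HDFS, e.g. via the
    # WebHDFS REST API or `hdfs dfs -put`.
    for page in fetch_paper_pages("7fQX_pYAAAAJ"):
        pass  # store `page` in HDFS here
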
Task 4: Word Count on Streaming Tweets
1. Incorporate Cloud Dataflow and configure it correctly in Google Cloud Platform.
2. Use Cloud Dataflow to import tweets from the Twitter API with keywords of your
selection.
3. Use PySpark to count the words in all newly arriving tweets over a configurable interval
(as small as possible) and save the results (a streaming sketch is given after this list).
4. Please test the word count system and report the smallest interval it supports (e.g., 1
min). Please explain what the bottleneck is that prevents achieving a smaller interval.
5. Write a PySpark application to count the number of tweets whose word count falls within
a given range.
6. Plot the distribution of tweet word count for a given time interval.
7. Compare the performance of computing the word count distribution from the raw data
and from the saved results. Try it multiple times with different numbers of tweets.
Plot the figure "computing time vs. the number of tweets".
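
A minimal Structured Streaming sketch for Step 3 is shown below. It assumes the Dataflow
pipeline from Step 2 lands tweets as text files in an HDFS directory, which is one possible
hand-off; the paths, the memory sink, and the one-minute trigger are placeholders to adjust
while probing the smallest supported interval.

    # Sketch of Step 3: word counts over newly arriving tweets, recomputed
    # on a configurable trigger interval. Paths and sink are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("tweet-wordcount").getOrCreate()

    # Every new file landed in this directory becomes part of the stream.
    tweets = spark.readStream.text("hdfs:///project/tweets_in/")

    words = tweets.select(
        F.explode(F.split(F.col("value"), r"\s+")).alias("word")
    ).filter(F.col("word") != "")

    counts = words.groupBy("word").count()

    # "complete" mode re-emits the full count table on every trigger; the
    # trigger interval is the knob to shrink when testing the smallest
    # interval the system supports.
    query = (counts.writeStream
             .outputMode("complete")
             .format("memory")        # in-memory table for quick inspection
             .queryName("word_counts")
             .trigger(processingTime="1 minute")
             .start())

    query.awaitTermination()
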

Please turn in a written report of your project (no more than 6 pages, in the same template as
the first project) including:
• Instructions on how to compile and run your program
• Documented program listings
• The design of your implementation
• Detailed discussion of your implementation and analyses
• Necessary diagram(s), flowchart(s), pseudo code(s), etc. for your implementation
• A conclusion, summarizing your understanding and analyses
• A list of references, if any.

The final report (submitted to Blackboard) and code are due on April 16, 2020. The project
demonstration (no more than 8 minutes) is on April 14, 2020 (Tasks 1 and 2), and April 16, 2020
(Tasks 2, 3, and 4).
