This dataset is drawn from DBLP and the journal Nature. It has 5 parts: Mobicom, Sigcomm, Infocom, Mobihoc and Nature. Each part contains information on authors and papers from 2008 to 2017.
Name paper article
Eg: Yi Shi Theoretical Results on Base Station Movement Problem for Sensor Network.2008
If you have any questions or ideas about the dataset, you are welcome to send an email to email@example.com.
This is the dataset of AceMentor. The dataset contains one file:
1. MentorYearCoPub1015: over 22 million predicted mentor relationship links. Each link contains the fields StudentID, MentorID, Probability, StartYear, FinishYear and NumOfCooperatedPublication, in that order.
The predictions were validated on a random sample: with a minimum Probability of 0.85, around 80% of the relationships could be confirmed from scholars’ homepages.
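As a minimal sketch of how the Probability threshold might be applied, the snippet below reads links in the documented field order and keeps those at or above 0.85. The tab delimiter, the `MentorLink` type, the helper names and the sample rows are assumptions made for illustration; the actual file layout should be checked before use.

```python
import csv
import io
from typing import Iterator, NamedTuple


class MentorLink(NamedTuple):
    student_id: str
    mentor_id: str
    probability: float
    start_year: int
    finish_year: int
    num_copubs: int


def read_links(handle, delimiter="\t") -> Iterator[MentorLink]:
    # Assumption: one link per line, six fields in the documented order,
    # separated by tabs. Change `delimiter` if the real file differs.
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) != 6:
            continue  # skip blank or malformed lines
        student, mentor, prob, start, finish, copubs = row
        yield MentorLink(student, mentor, float(prob),
                         int(start), int(finish), int(copubs))


def confident_links(links, min_probability=0.85):
    # Keep only links whose predicted probability reaches the threshold.
    return [link for link in links if link.probability >= min_probability]


# Made-up rows, for illustration only.
sample = io.StringIO(
    "s1\tm1\t0.91\t2010\t2014\t7\n"
    "s2\tm2\t0.40\t2011\t2012\t1\n"
)
kept = confident_links(read_links(sample))
```

Because `read_links` is a generator, the full file of over 22 million links can be streamed without loading it into memory at once.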
Our scheme predicts mentor relationships from the cooperation between researchers in publications. We first collected facts dispersed across the Internet and cleaned the data by filtering, so that the remaining relationships all involve cooperation in publications. We then extracted 10 features, including the number of years and publications of the cooperation, the cooperation percentage, etc., which indicate the disparity and intimacy of the cooperation. We adopted multi-layer perceptrons (MLPs) to model and predict hidden mentor relationships. The accuracy of the model can reach 90%. Finally, we used the trained model to predict mentor relationship probabilities for 4 million renowned researchers based on their publication cooperation.
If you have any questions or ideas about the dataset, you are welcome to send an email to firstname.lastname@example.org.
We generate a large set of scholarly topic information in the field of Computer Science with reliable ground-truth communities based on Microsoft Academic Graph (MAG). The MAG dataset contains over 100 million scientific papers with titles, references, publication time, and sets of “Field of Study” (FoS).
We extract the topic information with 17 features listed below and convert every feature into a time series. Then we measure the features under different metrics and predict the future trends of all the topics in this dataset.
Dataset.csv: Each line represents one year of a topic. The meaning of each value corresponds to the header in the first line of the CSV.
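Since each line holds one year of one topic, a per-topic time series can be rebuilt by grouping rows on the topic column. The sketch below assumes the header contains a topic column and a year column; the names `topic`, `year` and `paper_count` and the sample values are hypothetical, and the real names should be taken from the first line of Dataset.csv.

```python
import csv
import io
from collections import defaultdict


def load_topic_series(handle, topic_col, year_col):
    """Group rows by topic and sort each topic's rows by year."""
    series = defaultdict(list)
    for row in csv.DictReader(handle):  # uses the first line as the header
        series[row[topic_col]].append(row)
    for rows in series.values():
        rows.sort(key=lambda r: int(r[year_col]))
    return dict(series)


# Hypothetical header and values; the real file defines its own columns.
sample = io.StringIO(
    "topic,year,paper_count\n"
    "graph mining,2011,12\n"
    "graph mining,2010,9\n"
    "routing,2010,4\n"
)
series = load_topic_series(sample, topic_col="topic", year_col="year")
```

All values stay as strings here, so numeric features need an explicit conversion before any trend analysis.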
If you find this dataset useful, we would be grateful if you could cite the following paper: Jinghao Zhao, Hao Wu, Fengyu Deng, Wentian Bao, Wencheng Tang, Luoyi Fu, Xinbing Wang, "Maximum Value Matters: Finding Hot Topics in Scholarly Fields"
There are four co-authorship networks from different areas of Microsoft Academic Graph (MAG). From these, we can construct a group of networks with equivalent sets of nodes and set up the node correspondence as ground truth based on the unique author identifiers in MAG. Communities are assigned based on the authors' institution information. The four networks can be combined into six pairs; in each pair, one network is set as the published network and the other as the auxiliary network.
|Network|Type|Nodes|Edges|Description|
|---|---|---|---|---|
|network 1|Undirected|41035|76822|co-authorship network from Microsoft Academic Graph (MAG)|
|network 2|Directed|55147|112448|co-authorship network from Microsoft Academic Graph (MAG)|
|network 3|Directed|51257|103644|co-authorship network from Microsoft Academic Graph (MAG)|
|network 4|Directed|43973|91225|co-authorship network from Microsoft Academic Graph (MAG)|
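The six pairs are simply the unordered combinations of the four networks. The sketch below enumerates them and shows one possible way to derive the ground-truth node correspondence from shared MAG author identifiers. Treating the first network of each pair as the published one, the `ground_truth` helper and the sample identifiers are assumptions for illustration only.

```python
from itertools import combinations

networks = ["network 1", "network 2", "network 3", "network 4"]

# C(4, 2) = 6 unordered pairs. Assumption: the first element of each pair
# plays the published role and the second the auxiliary role.
pairs = list(combinations(networks, 2))


def ground_truth(published_ids, auxiliary_ids):
    """Map each author id found in both networks to itself."""
    return {author: author for author in set(published_ids) & set(auxiliary_ids)}


# Made-up author identifiers, for illustration only.
mapping = ground_truth(["a1", "a2", "a3"], ["a2", "a3", "a4"])
```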
If you find this dataset useful, we would be grateful if you could cite the following paper: X. Fu, L. Fu, Z. Hu, Z. Xu and X. Wang, “De-anonymization of Social Networks with Communities: When Quantification Meets Algorithms”, arXiv preprint arXiv:1703.09028, 2017.
This dataset was crawled from Renren (http://www.renren-inc.com/en/), one of the leading real-name social networking services in China. Renren requires users to provide basic information such as name and education history, based on which it recommends people the user may know and thus builds up a schoolmate-based network among users. The dataset was crawled in 2016 and includes 10,031 users. Each user in the dataset has two items:
— Popularity: The number of views a user receives.
— Friends: The number of friends of a user.
In the file, each line represents a user: the first item is the user’s popularity and the second item is the user’s number of friends.
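A minimal sketch of reading this two-column layout follows. The separator is not stated here, so the hypothetical `read_users` helper accepts either whitespace or commas; the sample values are made up.

```python
import io


def read_users(handle):
    """Return a list of (popularity, friends) tuples, one per user line."""
    users = []
    for line in handle:
        # Assumption: the two items are separated by whitespace or a comma.
        parts = line.replace(",", " ").split()
        if len(parts) < 2:
            continue  # skip blank or incomplete lines
        users.append((int(parts[0]), int(parts[1])))
    return users


# Made-up values, for illustration only.
sample = io.StringIO("1520 310\n87,45\n\n")
users = read_users(sample)
```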
This dataset is collected from the MovieLens dataset available at https://grouplens.org/datasets/movielens/. In the original dataset, the edge weights between users and items, namely the users' ratings of items, are decimal ratings in (0,5]. In our modified dataset, we map the decimal ratings to integer ratings in the range [1,10].
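The exact mapping is not spelled out here. One plausible choice, assuming the original ratings come in half-star steps (0.5, 1.0, ..., 5.0), is to double each rating; the sketch below does that and rounds up so that any value in (0,5] still lands in [1,10]. The function name is hypothetical and the mapping is an assumption, not necessarily the one used to build the dataset.

```python
import math


def to_integer_rating(rating: float) -> int:
    """Map a decimal rating in (0, 5] to an integer in [1, 10] by doubling."""
    if not 0 < rating <= 5:
        raise ValueError(f"rating out of range (0, 5]: {rating}")
    # Half-star ratings double to exact integers; ceil only matters for
    # values that are not multiples of 0.5.
    return math.ceil(rating * 2)


converted = [to_integer_rating(r) for r in (0.5, 3.5, 5.0)]
```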
This dataset is collected from the Audioscrobbler dataset available at http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html. The original dataset provides users' play counts for each music artist they have listened to. In our modified dataset, we map play counts to bounded edge weights between users and items, i.e. users' ratings as integers in [1,5].
This dataset is collected from the BookCrossing dataset available at http://www2.informatik.uni-freiburg.de/~cziegler/BX/. The original dataset provides users' implicit and explicit ratings of books. In our modified dataset, we use integers in [1,10] to represent the explicit user ratings and exclude ratings of 0, which denote implicit ratings.
We construct two real-world weighted temporal text networks:
This dataset is extracted from data provided by Acemap, which has merged several data sources such as DBLP, Microsoft Academic Graph, ACM Digital Library and IEEE Xplore. We select 80271 authors who have published at least one paper in top journals or conferences in computer science. The posts of each author are the titles of his or her papers, and each interaction between two users represents one citation between them. Note that posts and interactions dated earlier than Jan 1, 2000 are removed.
Weibo is one of the most popular micro-blog platforms in China. We select 25088 active, interacting users and crawl the micro-blogs they published in the most recent 5 years. In this dataset, each post represents one micro-blog and each interaction denotes one retweet.
1. docs.txt: each line is one post, formatted as user(tab)time(tab)words. Words are space-separated.
2. links.txt: each line is one directed link, formatted as user1(tab)user2(tab)time_pairs. Time_pairs are tab-separated, and each time_pair is recorded as out_time(space)in_time.
Because the text in Weibo is in Chinese, which is not convenient to handle in C++, we map each word to a numerical id, and the mapping is saved in 'word_dict.json' in standard JSON format.
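The two line formats described above can be parsed with plain string splitting. The following is a minimal sketch; the function names and sample lines are hypothetical, time values are kept as strings because their encoding is not specified here, and the direction of the mapping inside word_dict.json (word to id or id to word) should be checked before relying on it.

```python
import json


def parse_doc_line(line):
    """docs.txt: user(tab)time(tab)words, with space-separated words."""
    fields = line.rstrip("\n").split("\t", 2)
    user, time = fields[0], fields[1]
    words = fields[2].split() if len(fields) > 2 else []
    return user, time, words


def parse_link_line(line):
    """links.txt: user1(tab)user2(tab)time_pairs, each pair 'out_time in_time'."""
    user1, user2, *pairs = line.rstrip("\n").split("\t")
    time_pairs = [tuple(pair.split()) for pair in pairs if pair]
    return user1, user2, time_pairs


def load_word_dict(path="word_dict.json"):
    # Assumption: the file holds a single JSON object.
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Made-up lines, for illustration only.
doc = parse_doc_line("42\t1262304000\t17 3 256\n")
link = parse_link_line("42\t7\t100 105\t200 230\n")
```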
The exact size of each dataset is shown as follows:
We generate a large set of 32 temporal text networks with reliable ground-truth communities based on Microsoft Academic Graph (MAG). The MAG dataset contains over 100 million scientific papers with titles, references, publication time, and sets of “Field of Study” (FoS). In total, there are over 50,000 FoS labels, organized in a four-level hierarchy, starting from top L0 levels such as “Mathematics”, “Physics”, “Computer Science” to middle L1 levels such as “Statistics”, “Quantum mechanics”, “Data mining”, and ending with bottom L3 levels such as “Complex manifold”, “Oseen equations”, and “K-optimal pattern discovery”.
We construct a temporal text network by sampling an academic citation network for each L1 level FoS under the Computer Science (CS) field. We further define each FoS label as a ground-truth community, since all members (i.e., papers) of the same community are in the same subarea of science and share the same property. In addition, we treat the publication time and title of each paper as its temporal and textual attributes, respectively.
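Treating each FoS label as a community amounts to grouping papers by label. The sketch below assumes a hypothetical in-memory mapping from paper id to its set of FoS labels; whether a paper may belong to several communities in the released networks is not stated here, so the overlap in the sample is only illustrative.

```python
from collections import defaultdict


def communities_by_fos(paper_fos):
    """Invert a paper -> FoS labels mapping into FoS label -> set of papers."""
    communities = defaultdict(set)
    for paper_id, labels in paper_fos.items():
        for label in labels:
            communities[label].add(paper_id)
    return dict(communities)


# Made-up papers and labels, for illustration only.
papers = {
    "p1": {"Data mining"},
    "p2": {"Data mining", "Statistics"},
    "p3": {"Statistics"},
}
communities = communities_by_fos(papers)
```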
And the exact size of each dataset is shown as follows: