Building a custom model

Depending on your research, you might want to build a custom model for your literature surveys. This tutorial walks through the steps needed to build one from scratch.

Note: You do not need to do this if you’re just using the pre-trained model. This is only for the use-case where you’d like to build a model of your own!

[1]:
import chaotic_neural as cn

print('Running chaotic_neural version:', cn.__version__)
print('Running gensim version:', cn.gensim.__version__)
print('Running numpy version:', cn.np.__version__)
Running chaotic_neural version: 0.0.3
Running gensim version: 4.0.1
Running numpy version: 1.20.3

1. Running a simple ArXiv query and printing the results

To search different fields, use the following query prefixes from the ArXiv API.


prefix    explanation
ti        Title
au        Author
abs       Abstract
co        Comment
jr        Journal Reference
cat       Subject Category
rn        Report Number
id        Id (use id_list instead)
all       All of the above
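
A prefix is combined with a search term as prefix:term, as in the au:iyer_kartheik query used in the next cell. As a rough sketch of what such a query looks like at the HTTP level, the snippet below builds a request URL for the ArXiv API endpoint using only the standard library. The helper name build_query_url is hypothetical and not part of chaotic_neural, and this is not necessarily how run_simple_query() builds its request internally.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the ArXiv API user manual.
ARXIV_API = "http://export.arxiv.org/api/query"

def build_query_url(prefix, term, start=0, max_results=3):
    """Hypothetical helper: build an ArXiv API URL for a `prefix:term` search."""
    params = {
        "search_query": f"{prefix}:{term}",
        "start": start,
        "max_results": max_results,
    }
    # safe=":" keeps the prefix separator readable instead of encoding it as %3A
    return f"{ARXIV_API}?{urlencode(params, safe=':')}"

print(build_query_url("au", "iyer_kartheik"))
print(build_query_url("cat", "astro-ph.GA", start=30, max_results=30))
```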

If you get a ConnectionResetError: [Errno 104] Connection reset by peer while running the query, you have probably made too many queries in too short a time. Wait a while before trying again, and use a larger delay_sec when running the make_feeds() function.
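
One way to handle a dropped connection is to retry with an increasing pause. The sketch below wraps any callable. The name call_with_backoff and its parameters are hypothetical and not part of chaotic_neural, and the sleep argument exists only so the example can run without actually waiting.

```python
import time

def call_with_backoff(func, retries=4, base_delay=3.0, sleep=time.sleep):
    """Call func(); on ConnectionResetError wait and retry, doubling the wait each time."""
    delay = base_delay
    for attempt in range(retries):
        try:
            return func()
        except ConnectionResetError:
            if attempt == retries - 1:
                raise  # out of retries: let the caller see the error
            sleep(delay)
            delay *= 2

# Demonstration with a fake query that fails twice before succeeding.
calls = {"n": 0}

def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionResetError(104, "Connection reset by peer")
    return "feed"

waits = []  # records the requested pauses instead of sleeping
result = call_with_backoff(flaky_query, sleep=waits.append)
print(result, waits)
```

In practice func could be something like lambda: cn.run_simple_query(search_query='au:iyer_kartheik', max_results=3), with the default sleep left in place.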

[2]:
n_papers = 3

feed = cn.run_simple_query(search_query='au:iyer_kartheik', max_results = n_papers)

cn.print_feed_entries(feed, num_entries = n_papers)
e-print metadata
arxiv-id: 2104.06514v1
Published: 2021-04-13T20:58:32Z
Title:  Star Formation Histories from SEDs and CMDs Agree: Evidence for
  Synchronized Star Formation in Local Volume Dwarf Galaxies over the Past 3
  Gyr
Authors:  Charlotte Olsen, Eric Gawiser, Kartheik Iyer, Kristen B. W. McQuinn, Benjamin D. Johnson, Grace Telford, Anna C. Wright, Adam Broussard, Peter Kurczynski
abs page link: http://arxiv.org/abs/2104.06514v1
pdf link: http://arxiv.org/pdf/2104.06514v1
Journal reference: No journal ref found
Comments: Accepted for publication in ApJ, 25 pages, 18 figures, 3 tables


 --------------


e-print metadata
arxiv-id: 2010.01132v1
Published: 2020-10-02T18:00:00Z
Title:  IQ Collaboratory II: The Quiescent Fraction of Isolated, Low Mass
  Galaxies Across Simulations and Observations
Authors:  Claire M Dickey, Tjitske K Starkenburg, Marla Geha, ChangHoon Hahn, Daniel Anglés-Alcázar, Ena Choi, Romeel Davé, Shy Genel, Kartheik G Iyer, Ariyeh H Maller, Nir Mandelker, Rachel S Somerville, L Y Aaron Yung
abs page link: http://arxiv.org/abs/2010.01132v1
pdf link: http://arxiv.org/pdf/2010.01132v1
Journal reference: No journal ref found
Comments: 19 pages, 8 figures. Figure 4 presents the main result. Code used in
  this work may be accessed at github.com/IQcollaboratory/orchard. Submitted to
  ApJ


 --------------


e-print metadata
arxiv-id: 2007.07916v1
Published: 2020-07-15T18:00:49Z
Title:  The Diversity and Variability of Star Formation Histories in Models of
  Galaxy Evolution
Authors:  Kartheik G. Iyer, Sandro Tacchella, Shy Genel, Christopher C. Hayward, Lars Hernquist, Alyson M. Brooks, Neven Caplar, Romeel Davé, Benedikt Diemer, John C. Forbes, Eric Gawiser, Rachel S. Somerville, Tjitske K. Starkenburg
abs page link: http://arxiv.org/abs/2007.07916v1
pdf link: http://arxiv.org/pdf/2007.07916v1
Journal reference: No journal ref found
Comments: 31 pages, 17 figures (+ appendix). Resubmitted to MNRAS after
  responding to referee's comments. Comments are welcome!


 --------------


2. Next, we want to generalize this to a large set of feeds corresponding to a particular topic

This can be done using the make_feeds() and read_corpus() functions. Here we’ll create a corpus consisting of the 30,000 most recent papers in the ‘astrophysics of galaxies’ (astro-ph.GA) category on the ArXiv (the number is set by the max_setsize argument). If you’d like to try this with a different category, please check arxiv.org for the full list. If there aren’t as many papers as specified by max_setsize, it’ll get as many as it can. For better processing, the queries are broken down into chunks (sized by the chunksize argument).
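
To make the chunking concrete, here is a small sketch of how max_setsize and chunksize could translate into a list of (start, max_results) requests. This is an assumption about the general idea, not the actual make_feeds() implementation, and chunk_offsets is a hypothetical name. It is at least consistent with the 1000 iterations shown by the progress bar for chunksize = 30 and max_setsize = 30000 in cell [3].

```python
def chunk_offsets(max_setsize, chunksize):
    """Return (start, max_results) pairs covering max_setsize results in chunks."""
    return [
        (start, min(chunksize, max_setsize - start))
        for start in range(0, max_setsize, chunksize)
    ]

plan = chunk_offsets(max_setsize=30000, chunksize=30)
print(len(plan))            # number of separate requests
print(plan[:2], plan[-1])   # first two chunks and the last one

# When the total is not a multiple of the chunk size, the last chunk is shorter.
print(chunk_offsets(max_setsize=100, chunksize=30))
```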

A note from the ArXiv API user manual on scraping large amounts of data:

In cases where the API needs to be called multiple times in a row, we encourage you to play nice and incorporate a 3 second delay in your code. The detailed examples below illustrate how to do this in a variety of languages. Because of speed limitations in our implementation of the API, the maximum number of results returned from a single call (max_results) is limited to 30000 in slices of at most 2000 at a time, using the max_results and start query parameters. For example to retrieve matches 6001-8000: http://export.arxiv.org/api/query?search_query=all:electron&start=6000&max_results=8000

Large result sets put considerable load on the server and also take a long time to render. We recommend to refine queries which return more than 1,000 results, or at least request smaller slices. For bulk metadata harvesting or set information, etc., the OAI-PMH interface is more suitable. A request with max_results >30,000 will result in an HTTP 400 error code with appropriate explanation. A request for 30000 results will typically take a little over 2 minutes to return a response of over 15MB. Requests for fewer results are much faster and correspondingly smaller.
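
As a back-of-the-envelope check on what the recommended delay costs, the sketch below estimates the time spent purely waiting between requests for a given chunk size. It assumes one delay per request and ignores the server's response time. The function name is hypothetical.

```python
import math

def estimated_wait_minutes(n_results, chunksize, delay_sec):
    """Minutes spent in the inter-request delay alone (one delay per request)."""
    n_requests = math.ceil(n_results / chunksize)
    return n_requests * delay_sec / 60

for chunksize in (30, 2000):
    minutes = estimated_wait_minutes(30000, chunksize, delay_sec=3)
    print(f"chunksize={chunksize}: {minutes:.2f} min of delay")
```

With a 3 second delay, 30,000 results in chunks of 30 means 1000 requests and about 50 minutes of waiting, while chunks of 2000 need only 15 requests and under a minute.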


Note: bioRxiv has a similar (although not identical) API, but I haven’t had the chance to implement it within chaotic_neural yet. If you are interested in helping set this up, please get in touch with me or open an issue on GitHub!

[3]:
bigquery = 'cat:astro-ph.GA'
gal_feeds = cn.make_feeds(arxiv_query = bigquery, chunksize = 30, max_setsize = 30000, delay_sec = 0.1)
100%|██████████| 1000/1000 [15:53<00:00,  1.05it/s]
[4]:
with open("gal_feeds.pkl", "wb") as fp:   #Pickling
    cn.pickle.dump(gal_feeds, fp)
[5]:
with open("gal_feeds.pkl", "rb") as fp:
    gal_feeds = cn.pickle.load(fp)
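
Since downloading the feeds takes a while, it can be convenient to wrap the save and load steps into a single "load if cached, otherwise build" helper. The sketch below uses only the standard library. load_or_build is a hypothetical name, and the fake feeds are stand-ins so the example runs without network access. Note that pickle files should only be loaded if they come from a source you trust.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Hypothetical helper: load a pickled object from path, or build and save it."""
    if os.path.exists(path):
        with open(path, "rb") as fp:
            return pickle.load(fp)
    obj = build()
    with open(path, "wb") as fp:
        pickle.dump(obj, fp)
    return obj

# Demonstration with a stand-in for the feeds, written to a temporary directory.
build_calls = []

def fake_build():
    build_calls.append(1)
    return [{"entries": ["paper A", "paper B"]}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "gal_feeds.pkl")
    first = load_or_build(path, fake_build)    # builds and writes the file
    second = load_or_build(path, fake_build)   # reads the file, no rebuild

print(first == second, len(build_calls))
```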
[6]:
# Let's print the status of the feeds for recordkeeping purposes

from datetime import date
bigquery = 'cat:astro-ph.GA'
today = date.today()
d2 = today.strftime("%B %d, %Y")
print("Updated feed for query: \"%s\" with %i most recent feeds as of:" %(bigquery, len(gal_feeds)*len(gal_feeds[0].entries)), d2)
Updated feed for query: "cat:astro-ph.GA" with 30000 most recent feeds as of: May 23, 2021
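
The count above multiplies the number of feeds by the size of the first feed, which is exact only when every chunk is full. If the category has fewer papers than max_setsize, the last chunk may be shorter, and summing over the feeds is safer. A minimal sketch, using stand-in objects with an entries attribute in place of real feeds (count_entries is a hypothetical name):

```python
from types import SimpleNamespace

def count_entries(feeds):
    """Total number of entries across all feeds, without assuming equal chunk sizes."""
    return sum(len(feed.entries) for feed in feeds)

# Stand-in feeds: two full chunks of 30 and a final, shorter chunk of 7.
fake_feeds = [
    SimpleNamespace(entries=list(range(30))),
    SimpleNamespace(entries=list(range(30))),
    SimpleNamespace(entries=list(range(7))),
]

print(count_entries(fake_feeds))                      # exact total
print(len(fake_feeds) * len(fake_feeds[0].entries))   # overcounts the short chunk
```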

3. Training the model

Having collected our feeds from the ArXiv (up to date as of today), we can now train our doc2vec model on the abstracts corresponding to each paper, as done in the cell below.

[7]:
d2 = today.strftime("%d%B%Y")
train_start_time = cn.time.time()

# I'm running the training here with 100 epochs and 12 workers (corresponding to my machine)
# but you might want to change this to whatever works best for you.
model, train_corpus, test_corpus = cn.build_and_train_model(gal_feeds,
                                                            fname_tag = 'astro-ph-GA-'+d2,
                                                            cn_dir='../../chaotic_neural/',
                                                            vector_size = 50, min_count = 2,
                                                            epochs = 100, workers = 12)

train_end_time = cn.time.time()
print('Training done. Time taken: %.2f mins.' %((train_end_time-train_start_time)/60))
100%|██████████| 1000/1000 [00:04<00:00, 229.62it/s]
100%|██████████| 1000/1000 [00:04<00:00, 237.41it/s]
100%|██████████| 1000/1000 [00:00<00:00, 6218.52it/s]
Training done. Time taken: 5.11 mins.
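
The min_count argument has the same name as gensim's Doc2Vec parameter, which ignores words whose total frequency in the corpus is below the given value. Assuming it is passed through with that meaning (this is an assumption, since the internals of build_and_train_model() are not shown here), the toy sketch below illustrates the effect of min_count = 2 on two short texts. The tokenizer is a deliberate simplification and both function names are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-letters; a simplification of real preprocessing."""
    return re.findall(r"[a-z]+", text.lower())

def apply_min_count(docs, min_count=2):
    """Drop tokens whose total count across all docs is below min_count."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in docs]

abstracts = [
    "Star formation histories of dwarf galaxies",
    "Quenching of star formation in massive galaxies",
]
docs = [tokenize(a) for a in abstracts]
print(apply_min_count(docs, min_count=2))
```

Only words that appear at least twice across the whole corpus survive, so rare words (and typos) do not get their own entry in the vocabulary.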

4. Loading the trained model and checking that it works!

[8]:
d2 = today.strftime("%d%B%Y")
modeldata = cn.load_trained_doc2vec_model(fname_tag = 'astro-ph-GA-'+d2, cn_dir='../../chaotic_neural/')
[9]:
similar_papers = cn.list_similar_papers(modeldata, '2007.07916', input_type='arxiv_id', show_summary=True)
ArXiv id:  2007.07916
Title: The Diversity and Variability of Star Formation Histories in Models of
  Galaxy Evolution
-----------------------------
Most similar/relevant papers:
-----------------------------
0 The Diversity and Variability of Star Formation Histories in Models of
  Galaxy Evolution  (Corrcoef: 0.99 )
Summary:------
Quenching can induce $\sim0.4-1$ dex of additional power on timescales $>1$ Gyr. The dark matter accretion histories of galaxies have remarkably self-similar PSDs and are coherent with the in-situ star formation on timescales $>3$ Gyr. There is considerable diversity among the different models in their (i) power due to SFR variability at a given timescale, (ii) amount of correlation with adjacent timescales (PSD slope), (iii) evolution of median PSDs with stellar mass, and (iv) presence and locations of breaks in the PSDs. The PSD framework is a useful space to study the SFHs of galaxies since model predictions vary widely.

1 A Method to Measure the Unbiased Decorrelation Timescale of the AGN
  Variable Signal from Structure Functions  (Corrcoef: 0.66 )
Summary:------
We show that the signal decorrelation timescale can be measured directly from the SF as the timescale matching the amplitude 0.795 of the flat SF part (at long timescales), and only then the measurement is independent of the ACF PE power.

2 Surrogate modelling the Baryonic Universe II: on forward modelling the
  colours of individual and populations of galaxies  (Corrcoef: 0.65 )
Summary:------
We additionally provide a model-independent fitting function capturing how the level of unresolved star formation variability translates into imprecision in predictions for galaxy colours; our fitting function can be used to determine the minimal SFH model that reproduces colours with some target precision.

3 Impact of an AGN featureless continuum on estimation of stellar
  population properties  (Corrcoef: 0.64 )
Summary:------
At the empirical AGN detection threshold $x_{\mathrm{AGN}}\simeq 0.26$ that we previously inferred in a pilot study on this subject, our results show that the neglect of a PL component in spectral fitting can lead to an overestimation by $\sim$2 dex in stellar mass and by up to $\sim$1 and $\sim$4 dex in the light- and mass-weighted mean stellar age, respectively, whereas the light- and mass-weighted mean stellar metallicity are underestimated by up to $\sim$0.3 and $\sim$0.6 dex, respectively.

4 The gas fractions of dark matter haloes hosting simulated $\sim L^\star$
  galaxies are governed by the feedback history of their black holes  (Corrcoef: 0.63 )
Summary:------
We examine the origin of scatter in the relationship between the gas fraction and mass of dark matter haloes hosting present-day $\sim L^\star$ central galaxies in the EAGLE simulations.

5 Optical variability of AGN in the PTF/iPTF survey  (Corrcoef: 0.61 )
Summary:------
We utilize both the structure function (SF) and power spectrum density (PSD) formalisms to search for links between the optical variability and the physical parameters of the accreting supermassive black holes that power the quasars.
This effect is also seen in the SF analysis of the (i)PTF data, and in a PSD analysis of quasars in the SDSS Stripe 82.

6 Reionization with galaxies and active galactic nuclei  (Corrcoef: 0.60 )
Summary:------
We explore a wide range of combinations for the escape fraction of ionizing photons (redshift-dependent, constant and scaling with stellar mass) from both star formation ($\langle f_{\rm esc}^{\rm sf} \rangle$) and AGN ($f_{\rm esc}^{\rm bh}$) to find: (i) the ionizing budget is dominated by stellar radiation from low stellar mass ($M_*<10^9 {\rm M_\odot}$ ) galaxies at $z>6$ with the AGN contribution (driven by $M_{bh}>10^6 {\rm M_\odot}$ black holes in $M_* > 10^9 {\rm M_\odot}$ galaxies) dominating at lower redshifts; (ii) AGN only contribute $10-25\%$ to the cumulative ionizing emissivity by $z=4$ for the models that match the observed reionization constraints; (iii) if the stellar mass dependence of $\langle f_{\rm esc}^{\rm sf} \rangle$ is shallower than $f_{\rm esc}^{\rm bh}$, at $z<7$ a transition stellar mass exists above which AGN dominate the escaping ionizing photon production rate; (iv) the transition stellar mass decreases with decreasing redshift.

7 Building Blocks of the Milky Way's Accreted Spheroid  (Corrcoef: 0.60 )
Summary:------
Combining the Munich-Groningen semi-analytical model of galaxy formation with the high resolution Aquarius simulations of dark matter haloes, we study the assembly history of the stellar spheroids of six Milky Way-mass galaxies, focussing on building block properties such as mass, age and metallicity.

8 The origin of the $α$-enhancement of massive galaxies  (Corrcoef: 0.59 )
Summary:------
In the absence of feedback from active galactic nuclei (AGN), however, $[\alpha/\mathrm{Fe}]_{\ast}$ in $M_{\ast} > 10^{10.5}$ M$_{\odot}$ galaxies is roughly constant with stellar mass and decreases with mean stellar age, extending the trends found for lower-mass galaxies in both simulations with and without AGN.

9 JINGLE -- IV. Dust, HI gas and metal scaling laws in the local Universe  (Corrcoef: 0.58 )
Summary:------
We find that these scaling laws for galaxies with $-1.0\lesssim \log M_{\text{HI}}$/$M_{\star}\lesssim0$ can be reproduced using closed-box models with high fractions (37-89$\%$) of supernova dust surviving a reverse shock, relatively low grain growth efficiencies ($\epsilon$=30-40), and long dus lifetimes (1-2\,Gyr).
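
The tutorial does not say how the Corrcoef score is computed. Assuming it is something like a Pearson correlation coefficient between document vectors (the label suggests this, but it is an assumption), the ranking step can be sketched with toy vectors as follows. The function names and the vectors are made up for illustration and are not taken from chaotic_neural.

```python
import math

def corrcoef(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    du = [x - mu for x in u]
    dv = [y - mv for y in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return num / den

def rank_similar(query, vectors, top_n=3):
    """Return the top_n (score, name) pairs, most correlated first."""
    scores = [(corrcoef(query, vec), name) for name, vec in vectors.items()]
    return sorted(scores, reverse=True)[:top_n]

# Toy "document vectors"; real ones would come from the trained model.
toy_vectors = {
    "paper_a": [1.0, 2.0, 3.0, 4.0],
    "paper_b": [2.0, 4.1, 5.9, 8.0],
    "paper_c": [4.0, 3.0, 2.0, 1.0],
}

ranked = rank_similar(toy_vectors["paper_a"], toy_vectors)
for score, name in ranked:
    print(f"{name}  (Corrcoef: {score:.2f})")
```

As in the output above, the query paper ranks first against itself, followed by the papers whose vectors vary most similarly.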
