Handling of mexico-city survey data for scenario generation #26
@rakow What do you think about merging this branch into master? To handle the Mexican dataset I had to change some of the general scripts (preparation.py, __init__.py, ...). It would take some more work to make the general data-handling scripts flexible enough to handle a wider range of datasets that, unlike MID and SrV, do not assume the application of German law.
Thank you, I really like the idea of making the scripts more generally applicable. I will take a look at what you did in the coming weeks.
Whenever you find the time, feel free to contact me about this; I already have some ideas about which parts to generalize.
# Augment data using p_weight
if augment > 1:
    df = augment_persons(df, augment)
# in the cdmx case we do not need to do p_weight * augment = 5 (see method augment_persons)
prepare_persons should probably be split into multiple functions, so that you can use only the parts you want in your scenario.
Would you do this by defining sub-functions inside prepare_persons? I can try that if that's the way you want to go.
I will try to do it, as it requires changing the API and design a little bit.
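One possible shape for such a split, as a rough sketch only: each preparation step becomes its own function, and the scenario passes the list of steps it wants. The step names, the `steps` parameter, and the use of plain dicts instead of a DataFrame are all hypothetical and not taken from the repository; the augment step here simply duplicates rows and is a stand-in for the real `augment_persons`.

```python
from typing import Callable, Dict, List, Sequence

Person = Dict[str, object]
Step = Callable[[List[Person]], List[Person]]


def drop_absent(persons: List[Person]) -> List[Person]:
    """Hypothetical step: keep only persons present on the reporting day."""
    return [p for p in persons if p["present_on_day"]]


def make_augment_step(factor: int) -> Step:
    """Hypothetical step factory: copy every person `factor` times."""

    def augment(persons: List[Person]) -> List[Person]:
        return [dict(p) for p in persons for _ in range(factor)]

    return augment


def prepare_persons(persons: List[Person], steps: Sequence[Step]) -> List[Person]:
    """Apply the chosen steps in order; each scenario supplies its own list."""
    for step in steps:
        persons = step(persons)
    return persons


sample = [
    {"p_id": 1, "present_on_day": True},
    {"p_id": 2, "present_on_day": False},
]

# One scenario filters and augments, another only filters.
with_augment = prepare_persons(sample, [drop_absent, make_augment_step(2)])
without_augment = prepare_persons(sample, [drop_absent])
```

With this layout a dataset-specific scenario would leave out or replace individual steps instead of patching the shared function.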
matsim/scenariogen/data/__init__.py
Outdated
present_on_day: bool
reporting_day: int
n_trips: int
home_district: str = ""
Shouldn't this belong to the household?
Household already has location and geometry. Is an additional attribute needed?
You are right, but the simple routing in the next step (activity sampling) needs this information, because the survey data does not provide leg lengths. I added it to the persons because I do not want to read the whole households.csv in the next step just for one attribute (the persons and activities datasets are already huge files).
I see the problem, but I generally don't like duplicating information. CSV reading should be superfast, is it really a concern?
Yes, persons.csv and activities.csv alone already add up to 4 GB. I therefore cannot run it on my own hardware and have to use the math cluster, which is annoying for debugging and testing. Keep in mind that this is an area with about 20 million inhabitants, far more than we usually handle (e.g. Berlin-Brandenburg).
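For reference, a minimal sketch of the alternative discussed in this thread: streaming households.csv once and keeping only a household-to-district lookup in memory, so the attribute does not have to be duplicated on the persons. The column names `hh_id` and `location` are assumptions (the real file layout may differ), and the in-memory string stands in for the actual file.

```python
import csv
import io
from typing import Dict, TextIO


def load_home_districts(
    handle: TextIO, key: str = "hh_id", value: str = "location"
) -> Dict[str, str]:
    """Stream a households CSV row by row and keep only two columns."""
    reader = csv.DictReader(handle)
    return {row[key]: row[value] for row in reader}


# In-memory stand-in for households.csv; the column names are assumptions.
households_csv = io.StringIO(
    "hh_id,n_persons,location\n"
    "h1,3,015\n"
    "h2,1,002\n"
)
districts = load_home_districts(households_csv)

# Attach the district to each person through the household id.
persons = [{"p_id": "p1", "hh_id": "h1"}, {"p_id": "p2", "hh_id": "h2"}]
for person in persons:
    person["home_district"] = districts[person["hh_id"]]
```

Whether this is cheap enough at this scale would still need to be measured on the real files, since the large inputs here are persons.csv and activities.csv.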
This PR adds a new survey data format, "eodmx", and adapts the code that uses the data formats so that it can handle the new one. The EOD2017 survey (Encuesta Origen Destino) was conducted for the metropolitan area of Mexico City (ZMVM) by INEGI (Instituto Nacional de Estadística y Geografía), Mexico's national institute of statistics and geography.